Bytes IT Community

Seeking examples of screen scraping....

Jim
I want to extract data from several websites that I visit daily. I'd like
to condense the info into a single web page that I can visit (instead of the
multiple websites I have to visit now to get the same info). There are no
open APIs or webservices for these websites that I am aware of.

I am using VS 2005 and VB.Net. If you could point out some sample code (or
controls to accomplish the same thing), I'd really appreciate it. (C# - and
even VS 2003 are OK)

Thanks!
Jan 13 '06 #1
29 Replies


KJ
A google search on the terms ".net screen scrape html" brings up a
great many options.

Jan 13 '06 #2

Jim,
If you intend to get serious about this you are probably going to want to
learn to use a library. Take a look at Simon Mourier's HtmlAgilityPack.
Peter

--
Co-founder, Eggheadcafe.com developer portal:
http://www.eggheadcafe.com
UnBlog:
http://petesbloggerama.blogspot.com
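For anyone reading this later, the HtmlAgilityPack approach boils down to something like the following C# sketch (the URL and XPath here are placeholders, not taken from any post in this thread):

```csharp
// Minimal HtmlAgilityPack sketch: load a page and print the text
// of every anchor tag. URL and XPath are placeholders.
using System;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/");  // hypothetical URL

        // SelectNodes returns null when nothing matches, so guard for it.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
                Console.WriteLine(link.InnerText.Trim());
        }
    }
}
```

SelectNodes accepts ordinary XPath even against malformed real-world HTML, which is the library's main draw.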



Jan 13 '06 #3

Jim

"KJ" <n_**********@mail.com> wrote in message
news:11*********************@g47g2000cwa.googlegroups.com...
A google search on the terms ".net screen scrape html" brings up a
great many options.
Gee!! Thanks! I hadn't thought of that.

(Now, for the rest of you with working frontal lobes, I'd still like to see
what you have. Personal recommendations are always better than random
searches.)

Jan 13 '06 #4

Jim,

See this sample on our website.

http://www.vb-tips.com/default.aspx?...f-56dbb63fdf1c

I hope this helps,

Cor
Jan 13 '06 #5

KJ
You know Jim, I actually thought what I wrote was helpful. And I also
think your sarcasm is out of line.

Jan 13 '06 #6

Jim

"KJ" <n_**********@mail.com> wrote in message
news:11**********************@g47g2000cwa.googlegroups.com...
You know Jim, I actually thought what I wrote was helpful. And I also
think your sarcasm is out of line.


And I think your lazy answer is out of line and sarcastic.

I really get tired of seeing people respond to posts by simply saying
"google it".

If you think the poster is so dense that they don't know how to use search
engines, you should probably skip replying at all as it would do little
good.

Posting a reply like "google it" is a waste of bandwidth and time to those
that view these newsgroups.

Helpful and pertinent posts are welcomed and appreciated. "Google it" is
neither helpful nor pertinent.

How many newsgroup users do you think have not heard of or used Google?

BTW.....your precious Google results only give answers (one of which is
repeated at least 4 times in the first 20 examples - with 2 other repeat
answers accounting for 5 more of the first 20 results) that are very
elementary. The reason for posting the request here is to get more in-depth
answers from the knowledgeable people that frequent the newsgroups.

If I have need of a simplistic, irrelevant result I will most assuredly
"Google it".

Jim
Jan 13 '06 #7

Steven Smith has a useful article on using HttpWebRequest to collect the
contents of an HTML site here:
http://authors.aspalliance.com/steve.../netscrape.asp
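The core of that HttpWebRequest technique is only a few lines of C# (the URL below is a placeholder, not from the article):

```csharp
// Fetch a page with HttpWebRequest and read its HTML into a string.
using System;
using System.IO;
using System.Net;

class PageFetcher
{
    static void Main()
    {
        // Placeholder URL; substitute the page you want to scrape.
        HttpWebRequest request =
            (HttpWebRequest)WebRequest.Create("http://example.com/");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);  // the raw markup is now yours to parse
        }
    }
}
```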
--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--

Jan 15 '06 #8

You know, you are just getting help that's worth what you paid for it... If
you disagree with the reply, follow your own advice and skip it, no need to
make frontal lobe comments.
If you think the poster is so dense that they don't know how to use search
engines, you should probably skip replying at all as it would do little
good.

Jan 15 '06 #9

Hello Jim,

Those of us who choose to help others on the newsgroup do it not because we
are paid but out of a desire to help fellow coders and maybe because other
coders help us. It's a chain.

Your attitude leaves a lot to be desired. Your question is nonspecific,
about a very broad topic, and you have not presented a particular
programming problem. You want an answer that will give you the complete
overview of the solution without making any effort from your side to write
code or evolve a strategy to solve the problem.

Even a very basic search could tell you that you can retrieve the data of a
webpage using the HttpWebRequest object, and from then on it's a question of
logic.

I don't think you should be so rude on the newsgroup to people who care to
answer, or maybe after a while nobody will care to answer.

Regards
Cyril Gupta
Jan 15 '06 #10

Jim

"Cyril Gupta" <no****@mail.com> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
Hello Jim,

Those of us who choose to help others on the newsgroup do it not because
we are paid but out of a desire to help fellow coders and maybe because
other coders help us. It's a chain.
I've probably answered more cries for help in ngs than you've ever read. I
am familiar with the concept.

Your attitude leaves a lot to be desired. Your question is un-specific,
about a very broad topic, and you have not presented a particular
programming problem.
Really? What exactly would you call "I want to extract data from several
websites that I visit daily. I'd like
to condense the info into a single web page that I can visit (instead of the
multiple websites I have to visit now to get the same info)." ?

Do you think that the exact websites or page info would alter the answer
given? If so, you don't understand the question.
You want an answer that will give you the complete overview of the solution
without making any effort from your side to write code or evolve a strategy
to solve the problem.
Did Sylvia Browne tell you this, or are you a budding psychic yourself?

Either way, you missed with that assumption completely. I was actively
working on the solution before I made the post and continued to do so
afterwards.

But, let's assume (since you evidently like to do that) that your
assumption was right. Programmers, like myself, give away code snippets to
others to save them time and effort and as a tool that they can learn from.
We even have entire sites dedicated to the task.

Ever hear of Planet Source Code or The Code Project or SourceForge? Perhaps
you should log on to those sites and tell the users how lazy they all are.
(PLEASE let me know if you do......I wouldn't miss it for the world!)

What if Microsoft put out the .Net 2.0 framework with your "you try and
figure it out" attitude? You'd just have to figure out how the entire .Net
2.0 framework works. And you'd probably be just as productive as your post
to this thread.
Even a very basic search could tell you that you can retrieve the data of
a webpage using the HttpWebRequest object, and from then on it's a
question of logic.
Well, duh. I acknowledged that Google gives simplistic examples (like the
one you suggest) that get the whole page. What I wanted to know (and if
you'd read the OP, you'd know this) was the most efficient way to extract
data from the page.

I don't think you should be so rude on the newsgroup to people who care to
answer, or maybe after a while nobody will care to answer.


And I don't think that you should appoint yourself the NG-Police. So?
Neither of us cares what the other thinks so why are you wasting even more
bandwidth with your tripe?

If my scolding posters for posting irrelevant, "Google it" posts, or tripe
like you have posted, means that people with no answer (like yourself)
ignore my posts, GREAT! I'm sure others will appreciate your NOT posting
irrelevant material to my threads almost as much as I will.

Have a nice life! And I hope that people post more relevant responses to
your requests than you have to mine.

Jim
Jan 15 '06 #11

Jim
See my reply to Cyril Gupta...

"Gabriel Magana" <no***@nospam.com> wrote in message
news:uN**************@TK2MSFTNGP10.phx.gbl...
You know, you are just getting help that's worth what you paid for it...
If you disagree with the reply, follow your own advice and skip it, no
need to make frontal lobe comments.
If you think the poster is so dense that they don't know how to use
search engines, you should probably skip replying at all as it would do
little good.


Jan 15 '06 #12

Jim
This is an excellent starting point. Thank you for posting it.

What I am wondering is if there is a way to load the results into an object
that allows one to extract data as if it were a recordset. Have you seen
anything like that?

Jim

"Nick Malik [Microsoft]" <ni*******@hotmail.nospam.com> wrote in message
news:c7********************@comcast.com...
Steven Smith has a useful article on using HTTPWebRequest to collect the
contents of an HTML site here
http://authors.aspalliance.com/steve.../netscrape.asp


Jan 15 '06 #13

Jim
Excellent! It not only gets the page, but extracts the text from the page.

But, I am wondering if there is a way to load a "webpage object" and query it
like a recordset. Seen anything like that?

Jim

"Cor Ligthert [MVP]" <no************@planet.nl> wrote in message
news:eh**************@TK2MSFTNGP12.phx.gbl...
Jim,

See this sample on our website.

http://www.vb-tips.com/default.aspx?...f-56dbb63fdf1c

I hope this helps,

Cor

Jan 15 '06 #14

Jim
Not ignoring you....testing it. Thanks for the link!
"Peter Bromberg [C# MVP]" <pb*******@yahoo.nospammin.com> wrote in message
news:C7**********************************@microsoft.com...
Jim,
If you intend to get serious about this you are probably going to want to
learn to use a library. Take a look at Simon Mourier's HtmlAgilityPack.
Peter


Jan 15 '06 #15

Dude, for some reason you think your reputation precedes you. It does
not...
Jan 15 '06 #16

Look at
SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm).

SW Explorer Automation (SWEA) creates an object model (automation
interface) for any Web application running in Internet Explorer. The
automation interface consists of pages (scenes), and each page consists
of controls. The following controls are supported:
HtmlContent, HtmlAnchor, HtmlImage, HtmlInputButton, HtmlInputCheckBox,
HtmlInputRadioButton, HtmlInputText, HtmlSelect, HtmlTextArea. The
object model is defined visually in the SWEA designer, which also lets
you record scripts (C# and VB) based on the defined application object
model.

It is very easy to create a scraping solution for any Web site using
SWEA.



Jan 15 '06 #17

On Sun, 15 Jan 2006 06:40:35 -0500, "Jim" <re***@groups.please> wrote:
Excellent! It not only gets the page, but extracts the text from the page.

But, I am wondering if there is a way to load a "webpage object" and query it
like a recordset. Seen anything like that?
Something like the HTMLDocumentClass type, perhaps?
If so, mshtml.dll is the place to look.
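A rough C# sketch of that mshtml route (it requires a COM reference to the Microsoft HTML Object Library; the markup and element id below are made up for illustration):

```csharp
// Parse an HTML string via mshtml and query it by element id.
using System;
using mshtml;

class MshtmlDemo
{
    static void Main()
    {
        // HTMLDocument is the COM coclass behind HTMLDocumentClass.
        IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
        doc.write("<html><body><div id='price'>42</div></body></html>");
        doc.close();

        // IHTMLDocument3 exposes getElementById.
        IHTMLDocument3 doc3 = (IHTMLDocument3)doc;
        IHTMLElement el = doc3.getElementById("price");
        if (el != null)
            Console.WriteLine(el.innerText);
    }
}
```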

regards
A.G.
Jan 15 '06 #18



Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>

If the file read is in XHTML format, you can use the classes contained in
the 'System.Xml' namespace for reading information from the file.
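For the XHTML case, a short C# sketch using the 'System.Xml' classes (the markup and XPath are illustrative only; real XHTML with a namespace declaration would also need an XmlNamespaceManager):

```csharp
// Well-formed XHTML can be queried like any other XML document.
using System;
using System.Xml;

class XhtmlDemo
{
    static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<html><body><table><tr><td>home</td><td>3</td></tr></table></body></html>");

        // Plain XPath pulls out the table cells.
        foreach (XmlNode cell in doc.SelectNodes("//td"))
            Console.WriteLine(cell.InnerText);
    }
}
```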

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Jan 15 '06 #19

I'm afraid that what you are asking for is very difficult. The reason I
think this is the following: ever heard about the semantic web?

In other words: getting all text from a webpage is a piece of cake;
getting a particular part of a webpage is much more difficult, as there
is no point to refer to.

I've read in another post in this question that you want to use a kind
of query. Well, here is the problem; you want a query like: get results
from soccer_game. Well, the problem is to define soccer_game...

The only thing you can do is try to find a fixed point (like the 5th
<p> element, or a <div> element with its id attribute set to "soccer_game").

So, the way I think you should solve your problem is (a) getting the page
as an HTML (XML) document and (b) defining a point (tag) to get the data from.

greetz and good luck,
Rudderius

Jim wrote:
I want to extract data from several websites that I visit daily. I'd like
to condense the info into a single web page that I can visit (instead of the
multiple websites I have to visit now to get the same info). There are no
open APIs or webservices for these websites that I am aware of.

I am using VS 2005 and VB.Net. If you could point out some sample code (or
controls to accomplush the same thing), I'd really appreciate it. (C# - and
even VS 2003 are OK)

Thanks!

Jan 15 '06 #20

Jim
If I were to code the solution myself, I would agree.

Starting from scratch... it seems that the best way to get the data will be
to design a UI that (a) shows the web page from which you wish to gather the
data and (b) allows you to select a portion of the web page by simply
drawing a box around the intended elements.

Then, you would need to identify the element in the HTML by name, position,
element type or some other text that is most likely to occur in the element
as a type of tag. A combination of these identifiers would be most helpful,
but most data formatted for the web contains some type of header (title) in
the text that can be used for the identifier.

There was a software package that did something like this called EyeOnWeb
(http://www.eyeonweb.com/screen.html). The website has a 2004 date, so I
am not sure whether the product is still maintained. There is no mention of
a developer's product there, but I suspect it would be a welcomed addition to
a web developer's Visual Studio Toolbox.

Jim

Jan 16 '06 #21

Look at SWExplorerAutomation (SWEA)
(http://home.comcast.net/~furmana/SWIEAutomation.htm). SWEA has a
TableDataExtractor and an XPathDataExtractor, which let you visually
define the data to be extracted. The Table Data Extractor pulls
tabular data from Web pages; if a Web page contains repeating
information patterns, the data can be transformed into an
ADO.NET DataTable object. The XPathDataExtractor lets you visually define
XPath expressions for the data extraction.

Jan 16 '06 #22

Something like the HTMLDocumentClass type perhaps?
If so mshtml.dll is the place to look.


The sample I gave to Jim is about the HTMLDocumentClass and mshtml. Don't
you think it would be better next time to look first at the given answer
before you reply?

Cor
Jan 16 '06 #23

Jim
The website says "Requires Microsoft .Net framework runtime 1.1." and I am
using 2.0 for this project.

But, it looks cool.

<al*******@hotmail.com> wrote in message
news:11**********************@g43g2000cwa.googlegroups.com...
Look at SWExplorerAutomation (SWEA)
(http://home.comcast.net/~furmana/SWIEAutomation.htm). SWEA has
TableDataExtactor and XPathDataExtractor which allows
visually define a data to be extracted. The Table Data Extractor
extracts tabular data from the Web pages. If a Web page contains
repeating information patterns than the data can be transformed into
ADO.NET DataTable object. XPathDataExtractor allows visually define
XPath expressions for the data extraction.

Jan 16 '06 #24

SWExplorerAutomation will work with the .Net framework 2.0 runtime; you may
just run into installation problems. To install the current version:

1. Unzip the downloaded exe file. Use MSI to install.

2. Update swdesigner.exe.config:

<startup>
<supportedRuntime version="v2.0.50727"/>
<supportedRuntime version="v1.1.4322"/>
</startup>

I will post a new release which will install on machines with only .Net
framework 2.0 this week.

Jan 16 '06 #25

Jim
Sweet! I'll poke it in the eye this afternoon.

Thanks!

Jan 16 '06 #26

On Mon, 16 Jan 2006 08:31:58 +0100, "Cor Ligthert [MVP]"
<no************@planet.nl> wrote:
Something like the HTMLDocumentClass type perhaps?
If so mshtml.dll is the place to look.


The sample I gave to Jim is about the HTMLDocumentClass and mshtml. Don't
you think it would be better next time to look first at the given answer
before you reply?


A bit cantankerous, eh? I was responding to the quoted follow-up.
Apparently the given answer was not sufficient, hence Jim's subsequent
question.

regards
A.G.
Jan 16 '06 #27

"Jim" <re***@groups.please> wrote in message
news:Ms*******************@bignews3.bellsouth.net...
This is an excellent starting point. Thank you for posting it.

What I am wondering is if there is a way to load the results into an
object that allows one to extract data as if it were a recordset. Have
you seen anything like that?


Hi Jim,

I have seen numerous controls in the third-party space where you can load an
HTML page and then move through it as an object hierarchy.

The problem with HTML is that it is a text markup language. It is not
really useful for describing data as an object. Therefore tools that read
HTML (including the app you are writing) have to cope with this lack of
structure by using patterns to find the relevant sections of text.

It sounds like the sites you are visiting are updated daily. This nearly
always means that they are program-generated (ASP, PHP, etc). Using regular
expressions, and examples from a couple of days of pulling the page down,
you should be able to isolate the strings that never change from the data
that does. That information can help you to produce a regular expression
that will isolate the data you want.
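A small C# illustration of that pattern-matching idea (the markup and the pattern are invented for the example, not taken from any real site):

```csharp
// Anchor a regex on markup that never changes and capture the part
// that does - the essence of scraping program-generated pages.
using System;
using System.Text.RegularExpressions;

class RegexScrape
{
    static void Main()
    {
        // Pretend this fragment came back from HttpWebRequest.
        string html = "<span class=\"temp\">71&deg;F</span>";

        // The fixed tags form the anchor; the group captures the data.
        Match m = Regex.Match(html, "<span class=\"temp\">(.*?)</span>");
        if (m.Success)
            Console.WriteLine(m.Groups[1].Value);
    }
}
```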

I wrote a little app like this a couple of years ago that would pull the
Dilbert of the day down to my hard drive and set it up to be in my
screensaver. (See what happens when programmers get bored?)

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--
Jan 16 '06 #28

Jim

"Nick Malik [Microsoft]" <ni*******@hotmail.nospam.com> wrote in message
news:t4********************@comcast.com...
I wrote a little app like this a couple of years ago that would pull the
dilbert of the day down to my hard drive and set it up to be in my
screensaver. (See what happens when programmers get bored?)


Excellent use of resources!

I have Dilbert as a page in my news (real news not newsgroups) group of
pages that I open first thing every morning.

Jim
Jan 17 '06 #29

Come on, the guy posted a reasonable question for help and some jerk said
GOOGLE IT.

Was that person trying to be helpful? NO.
It was a petty, passive aggressive flame.

People like that are a waste of time and dilute the quality of the newsgroups.
Feb 20 '06 #30

This discussion thread is closed

Replies have been disabled for this discussion.