Web page screen scraping?

I've been asked to extract data from web pages. Given that they are
rendered in HTML and not any sort of XML I'm wondering how to go about
"scraping" such a web page of data.

Can anyone give me any starting place?

Thanks,
Ron
Feb 12 '06 #1
4 Replies


Hi Ronald,
Basically, you use a class such as HttpWebRequest (in System.Net) to pull
back the web page's HTML, and when the results come back you have to parse
them yourself. Since every website is different, you will have to write a
custom scraper for each site you want to scrape. Scraping means locating the
pieces of information you want to extract within the HTML. If you get lucky
and the web page conforms to XHTML standards, you can use the standard
System.Xml objects to parse it and find the information you want, which
should be pretty simple. If the web page is not XHTML compliant, you will
have to do some string manipulation, using regular expressions or just plain
old coding, to find the correct location in the HTML string.
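
To make the two paths concrete, here is a rough sketch of the idea. It is written in Python rather than .NET purely for illustration; the `fetch_html` and `extract_prices` helpers and the `<td class="price">` markup are made up for this example and are not from any real site. It tries an XML parser first and only falls back to a regular expression when the markup is not well formed.

```python
import re
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its markup as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_prices(markup):
    """Return the text of every <td class="price"> cell (hypothetical markup)."""
    try:
        # Well-formed (XHTML-style) markup can go through an XML parser.
        root = ET.fromstring(markup)
        return [td.text for td in root.iter("td") if td.get("class") == "price"]
    except ET.ParseError:
        # Otherwise fall back to a pattern match on the raw string.
        return re.findall(r'<td class="price">([^<]*)', markup)


# Inline samples stand in for fetch_html(...) so the sketch runs offline.
well_formed = '<table><tr><td class="price">9.99</td></tr></table>'
sloppy = '<table><tr><td class="price">4.50<br></tr></table>'

print(extract_prices(well_formed))
print(extract_prices(sloppy))
```

One caveat: real XHTML pages usually declare a namespace, in which case the element names seen by the XML parser carry that namespace and the tag test above would need adjusting. The regular expression is also tied to the exact attribute layout, which is why each site tends to need its own scraper.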

Hope that helps
Mark Dawson
http://www.markdawson.org

Feb 12 '06 #2

Ronald,
I'd recommend that you take a look at Simon Mourier's HtmlAgilityPack.

http://blogs.msdn.com/smourier/archi...6/04/8265.aspx

Peter

--
Co-founder, Eggheadcafe.com developer portal:
http://www.eggheadcafe.com
UnBlog:
http://petesbloggerama.blogspot.com

Feb 13 '06 #3

I used an open source HTML parser a while back, but I can't find it now.
I did find this one, though I can't say I have any experience with it:

http://www.codeproject.com/dotnet/apmilhtml.asp
scott
Feb 14 '06 #4

Take a look at SWExplorerAutomation (SWEA):
http://home.comcast.net/~furmana/SWI...ion.htm
SWEA creates an object model (automation interface) for any Web application
running in Internet Explorer. It uses XPath expressions to extract data
from the Web pages, and the expressions can be defined visually using the
SWEA designer.
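
For anyone who has not used XPath for extraction, the general idea looks roughly like this. This is not SWEA's API, just a small Python sketch using the limited XPath subset in the standard library's ElementTree; the page markup, the `results` id and the `item` class are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for this illustration.
page = """
<html><body>
  <div id="results">
    <a class="item" href="/a">Alpha</a>
    <a class="item" href="/b">Beta</a>
    <a class="nav" href="/next">Next</a>
  </div>
</body></html>
"""

root = ET.fromstring(page)

# Select only the "item" links that sit directly inside the results div.
items = root.findall(".//div[@id='results']/a[@class='item']")
links = [(a.text, a.get("href")) for a in items]
print(links)
```

The path expression carries all of the site-specific knowledge, so when the page layout changes only that one string should need updating.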

Feb 15 '06 #5
