Web page screen scraping?

I've been asked to extract data from web pages. Given that they are
rendered in HTML and not any sort of XML I'm wondering how to go about
"scraping" such a web page of data.

Can anyone give me any starting place?

Thanks,
Ron
Feb 12 '06 #1
Hi Ronald,
what you basically have to do is use an object like the HttpWebRequest
class (or WebClient) in System.Net, which will pull back the web page's
HTML for you to process; when the response comes back you will have to
parse it. Since every website is different, you will have to write a
custom scraper for each site you want to scrape. Scraping involves
locating the pieces of information in the HTML that you want to extract.
If you get lucky and the web page is well-formed XHTML, you can use the
standard System.Xml objects to parse it and find the information you
want, which should be pretty simple. If the web page is not well-formed
XHTML, you will have to perform some string manipulation, using regular
expressions or just plain old coding, to find the correct location in
the HTML string.
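
A rough sketch of that fetch-then-parse flow, written in Python only to
keep it short (in .NET the same steps would map onto the System.Net and
System.Xml classes). The function names, the "price" class and the sample
markup are made up for illustration; a real scraper has to be fitted to
the page it targets.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET


def fetch_html(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_cells(html, css_class):
    """Return the text of every <td class="css_class"> cell in the page."""
    try:
        # Only succeeds when the markup is well-formed XML (e.g. clean XHTML).
        root = ET.fromstring(html)
    except ET.ParseError:
        # Fallback for ordinary "tag soup": a regex tied to this page's layout.
        pattern = r'<td[^>]*class="%s"[^>]*>(.*?)</td>' % re.escape(css_class)
        return [m.strip() for m in re.findall(pattern, html, re.I | re.S)]
    cells = []
    for element in root.iter():
        # Drop any namespace prefix such as {http://www.w3.org/1999/xhtml}td.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "td" and element.get("class") == css_class:
            cells.append("".join(element.itertext()).strip())
    return cells


# Hypothetical sample pages: the first is well-formed, the second is not
# (the <br> is never closed), so it goes through the regex fallback.
WELL_FORMED = '<html><body><table><tr><td class="price">10.50</td></tr></table></body></html>'
TAG_SOUP = '<html><body><br><table><tr><td class="price"> 7.25 </td></tr></table></body></html>'

if __name__ == "__main__":
    print(extract_cells(WELL_FORMED, "price"))
    print(extract_cells(TAG_SOUP, "price"))
```

The regex branch is the fragile part: it breaks as soon as the site changes
its markup, which is why a tolerant HTML parser is usually the better choice
when one is available.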

Hope that helps
Mark Dawson
http://www.markdawson.org
Feb 12 '06 #2
Ronald,
I'd recommend that you take a look at Simon Mourier's HtmlAgilityPack.

http://blogs.msdn.com/smourier/archi...6/04/8265.aspx

Peter

--
Co-founder, Eggheadcafe.com developer portal:
http://www.eggheadcafe.com
UnBlog:
http://petesbloggerama.blogspot.com


Feb 13 '06 #3
I used an open source HTML parser a while back, but can't find it now.
I did find this one, though I can't say I have any experience with it.

http://www.codeproject.com/dotnet/apmilhtml.asp
scott
Feb 14 '06 #4
Take a look at SWExplorerAutomation (SWEA):
http://home.comcast.net/~furmana/SWI...ion.htm
SWEA creates an object model (automation interface) for any Web application
running in Internet Explorer. It uses XPath expressions to extract data
from the Web pages, and the expressions can be defined visually using the
SWEA designer.
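
To give a feel for XPath-style extraction in general (this is not SWEA's
API, just the limited XPath subset in Python's standard library), here is
a small sketch. The "quotes" table, its class names and the PAGE sample
are invented for the example, and the markup has to be well-formed for
this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page with one table of symbol/price rows.
PAGE = """<html><body>
<table id="quotes">
  <tr><td class="symbol">ABC</td><td class="price">10.50</td></tr>
  <tr><td class="symbol">XYZ</td><td class="price">7.25</td></tr>
</table>
</body></html>"""


def extract_quotes(xhtml):
    """Map each symbol to its price using XPath-like path expressions."""
    root = ET.fromstring(xhtml)
    quotes = {}
    # Select every row of the table whose id attribute is "quotes".
    for row in root.findall(".//table[@id='quotes']/tr"):
        symbol = row.findtext("td[@class='symbol']")
        price = row.findtext("td[@class='price']")
        quotes[symbol] = float(price)
    return quotes


if __name__ == "__main__":
    print(extract_quotes(PAGE))
```

The appeal of path expressions is that the extraction rule lives in one
string per field, so when the page layout changes only those strings need
to be updated.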

Feb 15 '06 #5
