Web page screen scraping?

I've been asked to extract data from web pages. Given that they are
rendered in HTML and not any sort of XML I'm wondering how to go about
"scraping" such a web page of data.

Can anyone give me any starting place?

Thanks,
Ron
Feb 12 '06 #1
Hi Ronald,
what you basically have to do is use an object like the HttpWebRequest class
(in System.Net), which will pull back the web page's HTML for you to process;
when the results come back you will have to parse them. Since every website is
different, you will have to write a custom scraper for each site you want to
scrape. Scraping involves locating the pieces of information in the HTML
that you want to extract. If you get lucky and the web page conforms to XHTML
standards, then you can use the standard System.Xml objects to parse it and find
the information you want, which should be pretty simple. If the web page is
not XHTML compliant, then you will have to perform some string manipulation,
using regular expressions or just plain old coding, to find the correct
location in the HTML string.
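
The fetch-then-parse approach described above can be sketched roughly as follows. This is only an illustration, written in Python with the standard library rather than the .NET classes named in the post; the function names fetch_html and extract_title, the User-Agent string, and the choice of the page title as the thing to extract are all assumptions made for the example.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET


def fetch_html(url, timeout=10.0):
    """Download a page and return its body as text (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}  # assumed value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Return the page title, trying an XML parse first, then a regex."""
    try:
        # Only succeeds when the page is well-formed (the XHTML case).
        root = ET.fromstring(html)
    except ET.ParseError:
        root = None

    if root is not None:
        for element in root.iter():
            # XHTML tags carry a namespace, e.g. "{http://www.w3.org/1999/xhtml}title".
            if element.tag.rsplit("}", 1)[-1].lower() == "title":
                return "".join(element.itertext()).strip()

    # Fallback for pages that are not well-formed: plain string matching.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

A call such as extract_title(fetch_html(url)) would combine the two steps. The regex branch is fragile by nature, which is why each site tends to need its own scraper.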

Hope that helps
Mark Dawson
http://www.markdawson.org
Feb 12 '06 #2
Ronald,
I'd recommend that you take a look at Simon Mourier's HtmlAgilityPack.

http://blogs.msdn.com/smourier/archi...6/04/8265.aspx

Peter

--
Co-founder, Eggheadcafe.com developer portal:
http://www.eggheadcafe.com
UnBlog:
http://petesbloggerama.blogspot.com
Feb 13 '06 #3
I used an open source HTML parser a while back, but can't find it now.
I did find this one, though I can't say I have any experience with it.

http://www.codeproject.com/dotnet/apmilhtml.asp
scott
Feb 14 '06 #4
Take a look at SWExplorerAutomation (SWEA)
(http://home.comcast.net/~furmana/SWI...ion.htm). SWEA
creates an object model (automation interface) for any Web application
running in Internet Explorer. It uses XPath expressions to extract data
from the Web pages, and the expressions can be visually defined using
the SWEA designer.
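
To give a feel for what extracting data with XPath expressions looks like, here is a small sketch. It does not use SWEA or show its API; it assumes a well-formed XHTML page and uses the limited XPath subset in Python's standard xml.etree.ElementTree. The function name extract_rows and the table id used in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by XHTML documents; "x" is an arbitrary prefix chosen here.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_rows(xhtml, table_id):
    """Return the text of each data row in the table with the given id."""
    root = ET.fromstring(xhtml)
    rows = []
    # Path expression: any table with a matching id, then every row under it.
    path = f".//x:table[@id='{table_id}']//x:tr"
    for tr in root.findall(path, XHTML_NS):
        cells = ["".join(td.itertext()).strip() for td in tr.findall("x:td", XHTML_NS)]
        if cells:  # rows containing only header cells (th) are skipped
            rows.append(cells)
    return rows
```

For a page containing a table with id "prices", extract_rows(page, "prices") returns one list of cell strings per data row. Pages that are not well-formed will raise a parse error here and need a tolerant HTML parser instead.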

Feb 15 '06 #5
