Web page screen scraping?

Ronald S. Cook

I've been asked to extract data from web pages. Given that they are
rendered in HTML and not any sort of XML I'm wondering how to go about
"scraping" such a web page of data.

Can anyone give me any starting place?

Thanks,
Ron

Feb 12 '06 #1

Mark R. Dawson

Hi Ronald,
What you basically have to do is use an object like the HttpWebRequest class
(in System.Net) to pull back the web page's HTML, and then parse the result
yourself. Since every website is different, you will have to write a custom
scraper for each site you want to scrape. Scraping means locating the pieces
of information you want to extract in the HTML. If you get lucky and the web
page is well-formed XHTML, you can use the standard System.Xml objects to
parse it and find the information you want, which should be pretty simple. If
the web page is not XHTML compliant, you will have to do some string
manipulation, using regular expressions or just plain old coding, to find the
correct location in the HTML string.
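
A minimal sketch of the two routes described above, written in Python rather than .NET only to keep it short: `urllib.request` and `xml.etree.ElementTree` stand in for the .NET web-request and System.Xml classes. The sample markup, the `price` class name, and all helper names are made up for illustration.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Elements of a real XHTML page live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def fetch_html(url):
    """Download a page and return its body as text (not called below)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def prices_via_xml(page):
    """Route 1: the page is well-formed XHTML, so an XML parser can read it."""
    root = ET.fromstring(page)  # raises ET.ParseError on tag soup
    return [
        (span.text or "").strip()
        for span in root.iter(XHTML_NS + "span")
        if span.get("class") == "price"
    ]


def prices_via_regex(page):
    """Route 2: fall back to pattern matching on the raw string."""
    pattern = r'<span\s+class="price"\s*>\s*([^<]*?)\s*</span>'
    return re.findall(pattern, page, flags=re.IGNORECASE)


def scrape_prices(page):
    """Try the XML route first and fall back to regex if parsing fails."""
    try:
        return prices_via_xml(page)
    except ET.ParseError:
        return prices_via_regex(page)


# Hypothetical sample pages (invented data).
GOOD_PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>Gold: <span class="price">551.20</span></p>'
    '<p>Silver: <span class="price">9.45</span></p>'
    '</body></html>'
)
# The unclosed <p> and <br> make this invalid XML, so the regex route is used.
BAD_PAGE = '<html><body><p>Gold: <span class="price">551.20</span><br></body></html>'

print(scrape_prices(GOOD_PAGE))
print(scrape_prices(BAD_PAGE))
```

The regex route is tied to the exact markup of one site, which is why each site tends to need its own scraper.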

Hope that helps
Mark Dawson
http://www.markdawson.org

Feb 12 '06 #2

Peter Bromberg [C# MVP]

Ronald,
I'd recommend that you take a look at Simon Mourier's HtmlAgilityPack.

http://blogs.msdn.com/smourier/archi...6/04/8265.aspx
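
HtmlAgilityPack is a .NET library that builds a navigable document tree even from malformed HTML. Its API is not reproduced here. As a rough illustration of the same idea (a parser that tolerates tag soup instead of requiring valid XML), here is a sketch using Python's standard `html.parser` module. The `LinkCollector` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs without requiring well-formed markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Tag soup: unquoted attribute, unclosed <li> and <b>, uppercase tags.
SOUP = "<UL><LI><A HREF=gold.html>Gold <b>prices</A><LI><a href='/silver'>Silver</a></UL>"

collector = LinkCollector()
collector.feed(SOUP)
collector.close()
print(collector.links)
```

An XML parser would reject this input outright; an event-based HTML parser simply reports the tags it sees.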

Peter

--
Co-founder, Eggheadcafe.com developer portal:
http://www.eggheadcafe.com
UnBlog:
http://petesbloggerama.blogspot.com

Feb 13 '06 #3

Scott C

I used an open source HTML parser a while back, but can't find it now.
I did find this one, though I can't say I have any experience with it.

http://www.codeproject.com/dotnet/apmilhtml.asp
scott

Feb 14 '06 #4

alex_f_il

Take a look at SWExplorerAutomation (SWEA):
http://home.comcast.net/~furmana/SWI...ion.htm
SWEA creates an object model (automation interface) for any Web application
running in Internet Explorer. It uses XPath expressions to extract data
from the Web pages, and the expressions can be defined visually in the
SWEA designer.
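
SWEA's own API and designer are not shown here. To illustrate only the XPath part of the idea, the following sketch applies XPath expressions to an already well-formed fragment using Python's `xml.etree.ElementTree`, which supports a limited XPath subset. The table markup, class names, and values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a page (invented data).
FRAGMENT = """
<table id="quotes">
  <tr class="header"><th>Metal</th><th>Price</th></tr>
  <tr class="quote"><td>Gold</td><td>551.20</td></tr>
  <tr class="quote"><td>Silver</td><td>9.45</td></tr>
</table>
"""

table = ET.fromstring(FRAGMENT)

# Select every data row by attribute, then read its cells by position.
quotes = {
    row.findtext("td[1]"): float(row.findtext("td[2]"))
    for row in table.findall(".//tr[@class='quote']")
}
print(quotes)
```

The header row is skipped because its `class` attribute does not match the predicate.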

Feb 15 '06 #5
