
Screen scraping in ASP.NET

I need to scrape specific information from other websites, specifically the
prices of precious metals from several different vendors. While I will
credit the vendors as the data source, I do not want to use the format of
their pages, and I want the information consolidated on a single page of my
own design.

I did something like this for a client a couple of years ago in ASP, but it
was complex, and I no longer have access to the code. A colleague advised me
that ASP.NET could accomplish this task much more easily, but I have little
experience with it.

Can anyone point me in the right direction?

Thanks,
J. Giblin
Nov 18 '05 #1


Retrieving the HTML is done using the WebRequest class, but you probably
already knew that.
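
As a rough sketch of that first step, the page can be fetched with System.Net.WebRequest and read into a string. The helper name FetchHtml and the example.com URL are placeholders of mine, and the Using blocks assume VB 2005 or later.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module FetchSketch
    ' Hypothetical helper: downloads a page and returns its HTML as a string.
    Function FetchHtml(ByVal url As String) As String
        Dim request As WebRequest = WebRequest.Create(url)
        Using response As WebResponse = request.GetResponse()
            ' StreamReader defaults to UTF-8; pass an Encoding if the site uses another one.
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        ' Placeholder URL; substitute the vendor page you want to scrape.
        Dim html As String = FetchHtml("http://www.example.com/")
        Console.WriteLine("Downloaded " & html.Length & " characters")
    End Sub
End Module
```

In a real page you would probably also want to catch WebException, since a vendor site being down should not break your own page.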

For getting the data from the HTML, I would recommend using regular
expressions with named capturing groups. It is a very reliable and flexible
way of implementing screen scraping.
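
To illustrate what named capturing groups look like in practice, here is a minimal sketch. The HTML fragment, the CSS class names, and the prices are invented for the example; a real vendor page will need its own pattern, and that pattern will have to be revisited whenever the vendor changes its markup.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    ' Hypothetical pattern for a made-up vendor table row.
    ' (?<metal>...) and (?<price>...) are named capturing groups.
    Private Const RowPattern As String = _
        "<td class=""metal"">(?<metal>[^<]+)</td>\s*" & _
        "<td class=""price"">(?<price>[0-9.,]+)</td>"

    Function FindPrices(ByVal html As String) As MatchCollection
        Dim priceRegex As New Regex(RowPattern, RegexOptions.IgnoreCase)
        Return priceRegex.Matches(html)
    End Function

    Sub Main()
        ' Invented sample markup standing in for a downloaded page.
        Dim html As String = _
            "<tr><td class=""metal"">Gold</td> <td class=""price"">486.20</td></tr>" & _
            "<tr><td class=""metal"">Silver</td> <td class=""price"">7.85</td></tr>"

        For Each m As Match In FindPrices(html)
            ' Groups are read back by name instead of by position.
            Console.WriteLine(m.Groups("metal").Value & ": " & m.Groups("price").Value)
        Next
    End Sub
End Module
```

Reading groups by name keeps the extraction code unchanged if you later add or reorder groups in the pattern.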

Try googling for screen scraping together with regex or regular expressions;
there are several articles on this.

/Jens

Nov 18 '05 #2

Jens,

I was aware of the WebRequest class as well as the DownloadData method for
pulling in the HTML, but did not have any direction on searching the HTML,
or parsing out the individual expressions once I have metatext in memory.

REGEX was the key!!!! I did attempt to search Google for "screen scraping"
and was referred to several products like ASPTear, which perform the same
function as the DownloadData method, just bundled in a class.
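
For reference, the DownloadData route looks roughly like this. It is only a sketch: the URL is a placeholder, and decoding the bytes as UTF-8 is an assumption that must match the encoding the site actually uses.

```vbnet
Imports System
Imports System.Net
Imports System.Text

Module DownloadDataSketch
    Sub Main()
        Dim client As New WebClient()
        ' DownloadData returns the raw bytes of the response body.
        Dim raw As Byte() = client.DownloadData("http://www.example.com/")
        ' Assumption: the page is UTF-8 encoded; use another Encoding if it is not.
        Dim html As String = Encoding.UTF8.GetString(raw)
        Console.WriteLine("Downloaded " & html.Length & " characters")
    End Sub
End Module
```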

If anyone has any code examples, I would really appreciate a different
implementation of this class.

Thanks,
Jim

Nov 18 '05 #3


Hi Jim,

Here is an example, which gets the latest news from a Danish travel news
site, generates an RSS feed from it and writes it to the ASP.NET Response
output stream.

' Requires at the top of the file:
'   Imports System.Text.RegularExpressions
'   Imports System.Xml

Dim sPage As String
Dim oWriter As XmlTextWriter

' Note: GetPage is not a method of System.Net.WebRequest. WebRequest here
' must be a custom helper class that returns the page's HTML as a string.
sPage = WebRequest.GetPage("http://www.standby.dk/")

' Each news item is an arrow image followed by a link; the named groups
' "url" and "title" capture the link target and the link text.
Dim sPattern As String
sPattern = "<img src=fileadmin/tmpl/standby_pil_red.jpeg border=0>"
sPattern &= "<A HREF=""(?<url>[^""]+)"">"
sPattern &= "(?<title>[^<]+)</a>"

' ExplicitCapture: only the named groups are captured.
Dim oRegex As New Regex(sPattern, RegexOptions.ExplicitCapture)

Dim oMatches As MatchCollection
oMatches = oRegex.Matches(sPage)

' Write the RSS feed straight to the ASP.NET response stream.
oWriter = New XmlTextWriter(Response.OutputStream, System.Text.Encoding.UTF8)

oWriter.WriteStartElement("rss")
oWriter.WriteAttributeString("version", "2.0")
oWriter.WriteStartElement("channel")
oWriter.WriteElementString("title", "STAND BY")
oWriter.WriteElementString("link", "http://www.standby.dk")
oWriter.WriteElementString("description", "The Scandinavian Travel Trade Journal")
oWriter.WriteElementString("language", "da")

' One <item> per regex match.
For Each oMatch As Match In oMatches
    oWriter.WriteStartElement("item")
    oWriter.WriteElementString("title", oMatch.Groups("title").Value)
    oWriter.WriteElementString("link", "http://www.standby.dk/" & oMatch.Groups("url").Value)
    oWriter.WriteEndElement() ' item
Next

oWriter.WriteEndElement() ' channel
oWriter.WriteEndElement() ' rss
oWriter.Flush()
oWriter.Close()
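
The item links above are built by prefixing the site address to each captured url. If the scraped pages mix relative and root-relative links, System.Uri can do the resolving instead. This is a small sketch; the helper name ToAbsolute and the two sample hrefs are made up for illustration.

```vbnet
Imports System

Module UriSketch
    ' Hypothetical helper: resolves an href found in a page against the page's base address.
    Function ToAbsolute(ByVal baseUrl As String, ByVal href As String) As String
        Dim baseUri As New Uri(baseUrl)
        Return New Uri(baseUri, href).AbsoluteUri
    End Function

    Sub Main()
        ' Invented hrefs: one relative, one root-relative.
        Console.WriteLine(ToAbsolute("http://www.standby.dk/", "index.php?id=42"))
        Console.WriteLine(ToAbsolute("http://www.standby.dk/", "/nyheder/artikel.html"))
    End Sub
End Module
```

Both calls produce an address under http://www.standby.dk/, without a doubled or missing slash.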

/Jens

--
Jens Christian Mikkelsen
http://www.jcmikkelsen.dk
Nov 18 '05 #4
