Bytes | Software Development & Data Engineering Community

Screen Scraper

A screen scraper is a program that extracts only the text from a web site.
I pinched this one from the web:

Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles MyBase.Load
        Me.TextBox1.Multiline = True
        Me.TextBox1.ScrollBars = ScrollBars.Both
        'above only for showing the sample
        Dim Doc As mshtml.IHTMLDocument2
        Doc = New mshtml.HTMLDocumentClass
        Dim wbReq As Net.HttpWebRequest = _
            DirectCast(Net.WebRequest.Create( _
            "http://start.csail.mit.edu/startfarm.cgi?query=USA"), _
            Net.HttpWebRequest)
        Dim wbResp As Net.HttpWebResponse = _
            DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
        Dim wbHCol As Net.WebHeaderCollection = wbResp.Headers
        Dim myStream As IO.Stream = wbResp.GetResponseStream()
        Dim myreader As New IO.StreamReader(myStream)
        Doc.write(myreader.ReadToEnd())
        Doc.close()
        wbResp.Close()

        'the part below is not completely done for all tags.
        'it can (will) be necessary to tailor that to your needs.

        Dim sb As New System.Text.StringBuilder
        For i As Integer = 0 To Doc.all.length - 1
            Dim hElm As mshtml.IHTMLElement = _
                DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
            Select Case hElm.tagName.ToLower
                Case "body" '"html" ' "head" ' "form"
                Case Else
                    If hElm.innerText <> "" Then
                        sb.Append(hElm.innerText & vbCrLf)
                    End If
            End Select
        Next
        TextBox1.Text = sb.ToString
    End Sub
End Class

The trouble is that the output contains the same text repeated across
multiple lines.
I explored this in a separate thread, where I tried to fix it by
writing the output to a text file and looking for duplicates. However, it
would be far easier to fix the scraper itself.
I am unfamiliar with mshtml coding, but essentially it is looking for
tags ("body", "html", "head", etc.). Any suggestions as to why it
duplicates would be great.

K.
Aug 11 '08 #1
Kronecker,

The HttpWebRequest only gives you back the raw HTML of the document at that
URL; that is not the page as you see it in a browser.

If I understand what you want to do, you need to use the DOM (Document Object
Model) as represented by MSHTML, and learn what MSHTML is (in fact, it exposes
all the elements from DHTML).
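
On the duplicated lines in the question: the DOM is a tree, and the innerText of an element includes the text of all its descendants. Looping over Doc.all and appending innerText for every element therefore repeats the same text once for each ancestor. Below is a minimal sketch of two ways around that. It assumes the document was filled with Doc.write / Doc.close as in the original post and that the project references Microsoft.mshtml; the module and function names are made up for illustration.

```vbnet
Imports System.Text

Module ScraperSketch
    ' Option 1: read the body text once. body.innerText already contains
    ' the text of every descendant element, so nothing is repeated.
    Function GetPageText(ByVal doc As mshtml.IHTMLDocument2) As String
        If doc.body Is Nothing Then Return String.Empty
        Return If(doc.body.innerText, String.Empty)
    End Function

    ' Option 2: keep the loop over Doc.all, but only take text from
    ' elements that have no child elements ("leaf" elements).
    Function GetLeafText(ByVal doc As mshtml.IHTMLDocument2) As String
        Dim sb As New StringBuilder()
        For i As Integer = 0 To doc.all.length - 1
            Dim el As mshtml.IHTMLElement = _
                DirectCast(doc.all.item(i), mshtml.IHTMLElement)

            ' Skip elements whose content is not meant to be displayed.
            Dim tag As String = el.tagName.ToLower()
            If tag = "script" OrElse tag = "style" Then Continue For

            ' children is typed as Object in the interop assembly.
            Dim kids As mshtml.IHTMLElementCollection = _
                TryCast(el.children, mshtml.IHTMLElementCollection)
            Dim isLeaf As Boolean = (kids Is Nothing) OrElse (kids.length = 0)

            If isLeaf AndAlso Not String.IsNullOrEmpty(el.innerText) Then
                sb.AppendLine(el.innerText)
            End If
        Next
        Return sb.ToString()
    End Function
End Module
```

In the original Form1_Load, the whole Select Case loop could be replaced by TextBox1.Text = GetPageText(Doc). Option 2 keeps per-element control, but it drops text that sits directly inside an element that also has child elements (for example the words around a bold tag inside a paragraph), so it needs tailoring to the page.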

Once you know that, you can use the Document property of the WebBrowser
control to get that HTML. Be aware that one page can be made up of several
frames and so-called IFrames, so you have to evaluate all the documents
(every frame contains a document). For that reason the AxWebBrowser has a
DocumentComplete event and a DownloadComplete event (the WebBrowser control
handles this in another way).
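
For the WinForms WebBrowser control, the frame handling described above might look roughly like this. It is only a sketch: it assumes a form whose designer file declares a WebBrowser named WebBrowser1 and a multiline TextBox1, and AppendFrameText is a helper name invented here. Frames served from another domain normally refuse access, so the sketch skips them.

```vbnet
Imports System
Imports System.Text
Imports System.Windows.Forms

Public Class Form1
    Private Sub Form1_Load(ByVal sender As Object, _
            ByVal e As EventArgs) Handles MyBase.Load
        WebBrowser1.Navigate("http://start.csail.mit.edu/startfarm.cgi?query=USA")
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(ByVal sender As Object, _
            ByVal e As WebBrowserDocumentCompletedEventArgs) _
            Handles WebBrowser1.DocumentCompleted
        ' The event is raised for each frame; wait until the control
        ' reports that the whole page has finished loading.
        If WebBrowser1.ReadyState <> WebBrowserReadyState.Complete Then Return

        Dim sb As New StringBuilder()
        AppendFrameText(WebBrowser1.Document, sb)
        TextBox1.Text = sb.ToString()
    End Sub

    ' Appends the body text of a document, then does the same for the
    ' document inside each of its frames and IFrames.
    Private Sub AppendFrameText(ByVal doc As HtmlDocument, ByVal sb As StringBuilder)
        If doc Is Nothing Then Return

        If doc.Body IsNot Nothing AndAlso doc.Body.InnerText IsNot Nothing Then
            sb.AppendLine(doc.Body.InnerText)
        End If

        If doc.Window Is Nothing OrElse doc.Window.Frames Is Nothing Then Return
        For Each frame As HtmlWindow In doc.Window.Frames
            Try
                AppendFrameText(frame.Document, sb)
            Catch ex As UnauthorizedAccessException
                ' Cross-domain frame: its document cannot be read this way.
            End Try
        Next
    End Sub
End Class
```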

If you look at the bottom of the IE window, you can see that downloading is
still going on, because images and other things such as Flash are downloaded
separately.

Working with MSHTML is not easy, because its classes often have to be cast,
sometimes several levels deep, since the members of the cast class themselves
return objects that need casting.
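
To give an idea of the casting involved, here is a small sketch that lists the links in a document. It assumes a reference to Microsoft.mshtml and a document that has already been loaded; CastSketch and ListLinks are names invented for the example. The same underlying objects are cast to three different interfaces to reach the members needed.

```vbnet
Imports System.Collections.Generic

Module CastSketch
    ' Returns "link text -> href" for every <a> element in the document.
    Function ListLinks(ByVal doc As mshtml.IHTMLDocument2) As List(Of String)
        Dim links As New List(Of String)()

        ' getElementsByTagName is declared on IHTMLDocument3, so the same
        ' document object is cast to that interface first.
        Dim doc3 As mshtml.IHTMLDocument3 = DirectCast(doc, mshtml.IHTMLDocument3)
        Dim anchors As mshtml.IHTMLElementCollection = doc3.getElementsByTagName("a")

        For i As Integer = 0 To anchors.length - 1
            ' item() returns Object, so each entry needs a cast as well.
            Dim el As mshtml.IHTMLElement = _
                DirectCast(anchors.item(i), mshtml.IHTMLElement)

            ' href is a member of IHTMLAnchorElement, a second interface
            ' implemented by the same element object.
            Dim anchor As mshtml.IHTMLAnchorElement = _
                TryCast(el, mshtml.IHTMLAnchorElement)
            If anchor IsNot Nothing Then
                links.Add(el.innerText & " -> " & anchor.href)
            End If
        Next
        Return links
    End Function
End Module
```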

The last thing is that many web page authors are not as careful as they
should be, so many pages, including those of very professional companies,
contain errors. Often they are created with the attitude: "If it works on
my screen, then it is correct".

Cor

Aug 12 '08 #2
