HTML Parsing in VB.Net

Does anyone have any good examples of parsing web pages in VB.Net? My
application needs to get information from certain HTML tables and I haven't
been able to find a good way to approach the problem. I have researched
regular expressions but have found them to be rather complicated for what I
am attempting to accomplish. I was hoping that there would be some type of
utility that would allow me to parse through the web page in a tree-like
structure. I found one utility called HTMLAgilityPack but it seems to be a
little difficult to work with. If anyone has any suggestions or examples I
would really appreciate it.

Thanks
Curtis
Nov 21 '05 #1
Hi Curtis

If you want to access a web page as a tree structure then you are probably
looking at something like Microsoft's mshtml. It makes a page accessible as
a document object model (DOM), just like you might use automation to read a
Word document as a DOM. You can either get the page by using the WebBrowser
control (which wraps mshtml) in your application, or using mshtml directly.

Do you have your web page in memory, or are you also wanting to retrieve it
from the internet?

HTH

Charles
Nov 21 '05 #2
Hi Charles,

What now?

I have been telling everybody to use the AgilityPack, precisely because I
understood that you, who know MSHTML well, had found the AgilityPack easier
to use.

How should I interpret this message from you now?

For me it is easier to recommend MSHTML, because then I can say something
more about it if there are questions.

:-)

Cor
Nov 21 '05 #3
Hi Cor

I'm not sure I completely follow you ... I downloaded the AgilityPack some
time ago, but never really used it as I already had a lot invested in mshtml
and the WebBrowser control. Perhaps I should leave it to you or others to
recommend the AgilityPack?

Charles
Nov 21 '05 #4
Charles,

Then I probably misunderstood you a while ago.

Never mind; next time I will advise using MSHTML again.

Cor
Nov 21 '05 #5
Thanks for the reply guys.

I have the HTML document stored in memory, so I would need to load it from
there. I didn't see a way to load an HTML document from a string in the
MSHTML classes. I am already retrieving the web page through the
HttpWebRequest and HttpWebResponse classes. The purpose of my application is
to scrape a particular web page for the contents of its tables, which means
the program's end user doesn't need to see the actual HTML documents coming
back to my application.

I have tried to use HTMLAgilityPack but found it a little bit difficult to
use. Does anyone have any good examples using the HTMLAgilityPack?

Thanks,
Curtis
Nov 21 '05 #6
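For readers who, like Curtis, are looking for an HTMLAgilityPack example, here
is a minimal sketch of reading table cells from an HTML string that is
already in memory. It assumes the HtmlAgilityPack assembly is referenced by
the project; the module name TableScraper and the function name GetCellTexts
are made up for illustration and do not come from the thread.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module TableScraper
    ' Hypothetical helper: returns the trimmed text of every <td> cell
    ' found in the supplied HTML string.
    Public Function GetCellTexts(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)
        Dim doc As New HtmlDocument()

        ' LoadHtml parses the string in memory; no network access is involved.
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when no node matches.
        Dim cells As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//td")
        If cells Is Nothing Then
            Return results
        End If

        For Each cell As HtmlNode In cells
            ' DeEntitize turns entities such as &amp; back into plain characters.
            results.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
        Next
        Return results
    End Function
End Module
```

The string returned by the HttpWebResponse stream could be passed straight to
GetCellTexts. The XPath expression "//td" selects every cell in the page;
something narrower, such as "//table[1]//td", may be a better fit when only
one table is of interest.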
Charles,

Thank you very much for your suggestions. I ended up using MSHTML after I
found an example that loads the HTML from memory. I used the following code;
any suggestions are welcome.

Public Function GetTableText(ByVal sHTML As String) As String
    Dim myDoc As mshtml.IHTMLDocument2 = New mshtml.HTMLDocument
    Dim mElement As mshtml.IHTMLElement2
    Dim mElement2 As mshtml.IHTMLElement
    Dim mECol As mshtml.IHTMLElementCollection
    Dim sb As New System.Text.StringBuilder
    Dim I As Integer

    'initialize the document object within the HTMLDocument class...
    myDoc.close()
    myDoc.open("about:blank")

    'write the HTML to the document using the MSHTML "write" method,
    'then close the stream so the parser can finish building the DOM...
    Dim clsHTML() As Object = {sHTML}
    myDoc.write(clsHTML)
    myDoc.close()
    clsHTML = Nothing

    'getElementsByTagName is exposed on IHTMLElement2, so cast the body to it...
    mElement = DirectCast(myDoc.body, mshtml.IHTMLElement2)
    mECol = mElement.getElementsByTagName("TD")
    For I = 0 To mECol.length - 1
        mElement2 = DirectCast(mECol.item(I), mshtml.IHTMLElement)
        'lstResults is a ListBox on the calling form
        lstResults.Items.Add(mElement2.tagName & " : " & mElement2.innerText)
        sb.AppendLine(mElement2.innerText)
    Next

    'return the collected cell text so the declared String result is not Nothing
    Return sb.ToString()
End Function

Thanks,
Curtis
Nov 21 '05 #7
Hi Curtis

That looks like a perfectly legitimate way to do it. The only thing that I
might perhaps do, after open("about:blank"), is check that the document is
ready, or wait for it to become ready. HTML documents are loaded
asynchronously, so open can return before the operation is actually
complete. Since the purpose of opening about:blank is to initialise the
document correctly, you could start to write the HTML string before
initialisation is complete, and upset things.

HTH

Charles
Nov 21 '05 #8
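Charles's suggestion to wait for the document to become ready could be
sketched roughly as a polling loop with a timeout. The helper below is only
an illustration: the names DocumentReadiness and WaitUntilComplete, and the
getState and idle callbacks, are hypothetical and not part of mshtml. It
assumes a current VB compiler (lambdas and implicit line continuation) and
that the caller supplies a way to read the ready state and a way to yield
between checks.

```vbnet
Imports System
Imports System.Diagnostics

Public Module DocumentReadiness
    ' Polls getState until it reports "complete" or the timeout elapses.
    ' getState and idle are hypothetical hooks supplied by the caller;
    ' nothing in this module depends on mshtml itself.
    Public Function WaitUntilComplete(ByVal getState As Func(Of String),
                                      ByVal idle As Action,
                                      ByVal timeout As TimeSpan) As Boolean
        Dim clock As Stopwatch = Stopwatch.StartNew()
        Do
            If String.Equals(getState(), "complete", StringComparison.OrdinalIgnoreCase) Then
                Return True
            End If
            ' Give pending work a chance to run before checking again,
            ' for example by pumping window messages or sleeping briefly.
            idle()
        Loop While clock.Elapsed < timeout

        ' The state never became "complete" within the allowed time.
        Return False
    End Function
End Module
```

In Curtis's function this might be called right after myDoc.open, for
example as WaitUntilComplete(Function() myDoc.readyState, AddressOf
Application.DoEvents, TimeSpan.FromSeconds(5)) in a Windows Forms
application. Whether, and how soon, an in-memory document reports "complete"
at that point is something to verify rather than assume; the timeout is
there so the loop cannot spin forever if it never does.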
Charles and Curtis,

Tomorrow I will probably have an almost complete sample ready using MSHTML.

Now it is Ajax time. Charles probably knows what that means.

Cor
Nov 21 '05 #9
Curtis,

Here is the sample I was talking about yesterday.

http://www.windowsformsdatagridhelp....f-56dbb63fdf1c

I hope it helps,

Cor
Nov 21 '05 #10
Cor,

Thanks for the example

Curtis
Nov 21 '05 #11
