
Parse HTML DOM document in console application

How do I load an HTML page (via URL) and parse the DOM in a Console
Application?

I've successfully done all this in a Windows Application by using the
WebBrowser control, calling the Navigate method on the specified URL, and
then, within the DocumentComplete event, parsing the HTML page using
mshtml.HTMLDocument.

I'm writing it as a console app because I don't need to display the HTML,
just search for a specific tag and retrieve a href value from it.

Thanks for any help on this.

Nov 21 '05 #1
9 Replies


"John Williams" <jo******************@NOhotmailSPAM.com> wrote in message
news:%2***************@TK2MSFTNGP11.phx.gbl...
> How do I load an HTML page (via URL) and parse the DOM in a Console
> Application?


I found the following thread (note that the * at the end is part of the URL):

http://groups.google.com/groups?hl=e...languages.vb.*

but I was unable to make Charles Law's solution work on my machine (I have
defined the IPersistStreamInit interface). In my code readyState is
always "loading", so it loops indefinitely at:

    Do Until objDocument.readyState = "complete"
        Application.DoEvents()
    Loop
Nov 21 '05 #2

You may be interested in this article, albeit with examples in C#

http://www.vsj.co.uk/articles/display.asp?id=389

Nov 21 '05 #3

"Charles Law" <bl***@nowhere.com> wrote in message
news:uF**************@TK2MSFTNGP10.phx.gbl...
> Hi John
>
> I have made a simple console app that demonstrates the loading of HTML
> from a url, based on the thread you found below. It works on my m/c, but
> gives an unrelated error about being unable to set focus. Just ignore the
> error and it will continue normally.
>
> Let me know if you have problems getting the zip file and I will mail it
> instead.
>
> HTH

Charles, thanks for your reply and the sample code. Your code works fine
when run in the VS IDE. However, when run from a command window it sits in
the loop:

    Do Until objDocument.readyState = "complete"
        Application.DoEvents()
    Loop

because readyState is "loading", then "uninitialized", never "complete". If
I comment out Application.DoEvents(), readyState stays "loading". I don't
understand this!

Thanks.
Nov 21 '05 #4

"Rowland Shaw" <Ro*********@discussions.microsoft.com> wrote in message
news:5B**********************************@microsoft.com...
> You may be interested in this article, albeit with examples in C#
>
> http://www.vsj.co.uk/articles/display.asp?id=389

Thanks Rowland, it looks promising, particularly the use of HttpWebRequest
and HttpWebResponse to get the web page in the first place. I'll have a
play around with the VB version.

Thanks again for responding.
Nov 21 '05 #5

Hi John

Unfortunately I don't get the same problem. I opened a command window and
ran the executable. I have ZoneAlarm running, so it warned me that the
application was trying to access the internet. I allowed it to continue and
then I got an error about setting focus (as I mentioned). I clicked on No
and the command window filled with the HTML.

I am running XP Pro with SP2, and .NET Framework 1.1 SP1. I also have IE6
installed. What are you running with?

Charles

Nov 21 '05 #6

Just start with a Windows application, delete the code that the wizard
generates, and put in the code that you would normally get from the
console wizard. As far as I know, the .NET overhead is there whether you
create windows or not, so I don't think you will save anything by not
using a window.

Nov 21 '05 #7

Hi Charles,

After more investigation, my Debug version works fine from a command window.
It's my Release version which sits in the loop, which probably means
something isn't being initialised. I then found this:

http://www.google.com/groups?hl=zh-c...TNGP10.phx.gbl

which says:
<quote>
I then checked the ReadyState property in a loop, and it was
returning 1 ("loading") all the time.

I tracked the problem down to my CoInitialize() call. The plain old
CoInitialize(NULL) didn't work but when I replaced it with the following,
everything started working fine:

CoInitializeEx(NULL,COINIT_MULTITHREADED);
</quote>

Do you know how to call CoInitializeEx (or get the same effect) in a VB.NET
program, if in fact that is what I need?

Thanks.

Nov 21 '05 #8

Hi John

Yes, I see what you mean. I have modified the application slightly so it now
works in a release build outside the IDE. I have removed the DoEvents call
because that requires the Windows Forms assembly. HTML documents are loaded
asynchronously (on another thread), so all we really need is to set the
apartment to multithreaded and then go to sleep in the loop while we are
waiting for the document to load.
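
A minimal sketch of the approach described above, assuming a plain console project. The <MTAThread()> attribute on Sub Main asks the runtime to place the main thread in the multithreaded COM apartment, so an explicit CoInitializeEx call should not be needed, and the wait loop sleeps rather than pumping messages. ReadyStateGetter, WaitForComplete, FakeReadyState and the timeout values are hypothetical names chosen for illustration, not part of the sample discussed in this thread. In a real program the delegate would return objDocument.readyState.

```vbnet
Imports System
Imports System.Threading

Module ReadyStateDemo

    ' Hypothetical delegate standing in for "read objDocument.readyState".
    Delegate Function ReadyStateGetter() As String

    ' Polls getState until it returns "complete" or timeoutMs elapses.
    ' Returns True if "complete" was seen before the deadline.
    Function WaitForComplete(ByVal getState As ReadyStateGetter, _
                             ByVal timeoutMs As Integer, _
                             ByVal pollMs As Integer) As Boolean
        Dim deadline As DateTime = DateTime.UtcNow.AddMilliseconds(timeoutMs)
        Do Until getState() = "complete"
            If DateTime.UtcNow >= deadline Then Return False
            Thread.Sleep(pollMs) ' sleep instead of calling Application.DoEvents()
        Loop
        Return True
    End Function

    ' Fake document for this demo only: reports "complete" on the third poll.
    Private polls As Integer = 0

    Private Function FakeReadyState() As String
        polls += 1
        If polls >= 3 Then Return "complete"
        Return "loading"
    End Function

    ' MTAThread requests the multithreaded apartment for the main thread.
    <MTAThread()> _
    Sub Main()
        Dim loaded As Boolean = WaitForComplete(AddressOf FakeReadyState, 5000, 50)
        Console.WriteLine("Loaded: " & loaded.ToString())
    End Sub

End Module
```

The deadline check means a page that never reaches "complete" makes the loop give up instead of spinning forever, which was the symptom reported earlier in the thread.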

HTH

Charles

Nov 21 '05 #9

Thank you, Charles, that works perfectly now :)

I've come up with another version which uses HttpWebRequest/HttpWebResponse,
which has the advantage of providing a timeout property, though a timeout
would be easy to implement in your version. I'm not sure of the pros and
cons of either method but it was an interesting exercise!
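
For readers who want to try the HttpWebRequest route, here is a rough sketch under stated assumptions; it is not the code referred to above, which the thread does not show. FetchHtml, FirstHref, the sample HTML string and the regular expression are all hypothetical. The regular expression is only a shortcut for pulling out a quoted href, not a DOM parser, and the reader assumes the page is UTF-8 text.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Module HrefDemo

    ' Downloads the page at url, giving up after timeoutMs milliseconds.
    ' GetResponse throws a WebException if the timeout expires.
    Function FetchHtml(ByVal url As String, ByVal timeoutMs As Integer) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Timeout = timeoutMs
        Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
        Try
            ' StreamReader defaults to UTF-8 (an assumption about the page).
            Dim reader As New StreamReader(response.GetResponseStream())
            Return reader.ReadToEnd()
        Finally
            response.Close()
        End Try
    End Function

    ' Returns the href of the first <a> tag that has a quoted href,
    ' or Nothing if there is no such tag.
    Function FirstHref(ByVal html As String) As String
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']"
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    Sub Main()
        ' Offline check on a literal string; no network access is needed here.
        Dim sample As String = "<p><a class='x' href=""http://example.com/"">link</a></p>"
        Console.WriteLine(FirstHref(sample))
    End Sub

End Module
```

To use it against a live page, something like FirstHref(FetchHtml(someUrl, 10000)) would combine the two steps, where someUrl and the ten-second timeout are placeholders.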

Thanks again for replying and helping out.

Nov 21 '05 #10
