HTML parsing

Hi all,
I have to get the HTML content of a given URL, inspect its links and images,
change a few things in there, and save the result. I have already done it
with:
    // Needs: using System.IO; (WebClient lives in System.Net)
    System.Net.WebClient source = new System.Net.WebClient();
    StreamReader mr = null;
    try
    {
        mr = new StreamReader(source.OpenRead(sUrl));
        sWebPage = mr.ReadToEnd();
    }
    catch
    {
        // Download failed: count the page as done and give up on it.
        oParent.PagesDone++;
        return;
    }
    finally
    {
        if (mr != null)
            mr.Close();
    }

After that I make a lot of IndexOf and Replace calls to change the things I
want, and it is sooooo ugly :( and slow.

So I decided to see if I can use MSHTML's IHTMLDocument2 interface and just
find and change the values I need, but ... so far I can't figure out how to
put the string into the IHTMLDocument2 object and how to get it back out
afterwards in order to save it.

Any clues will be highly appreciated.

Thanks

Sunny

Nov 15 '05 #1
Well, I'm no mshtml expert, but you might be able to get away with just
using IHTMLDocument4.createDocumentFromUrl to load the document directly
from the URL into an IHTMLDocument class (any given HTML document class
should support IHTMLDocument through IHTMLDocument5). I'm not sure if it'll
work, but you can try.

If there is an mshtml-specific newsgroup, you should be able to ask there, as
the managed mshtml is just a PIA wrapper around the COM objects. I'm sure
the talent pool in relation to mshtml would be far greater there.

Nov 15 '05 #2
Thanks Daniel,
but I'm a little bit confused, because all the examples I have found so far
use the IHTMLDocument2 interface, so I'm worried about compatibility, i.e. on
which systems is 5 available, and on which only 2?
Also, you mention a PIA. Are there official PIAs? Maybe I'm not searching
right, but I cannot find any. I just added a reference to mshtml.dll, and
VS .NET generated an interop assembly, but because MS provides official PIAs
for some products (I know for sure there are for Office XP), I wondered
whether there are any for mshtml.
I'll try to search again, and also post this question in other groups, but
one little :) push would be very helpful :)

Thanks again
Sunny

P.S. If there is any other way to solve my problem - I'm open to any
suggestion :)

Nov 15 '05 #3
Well, I can't remember if there is an mshtml interop DLL for VS.NET 2002, but
I am very sure it is there for VS.NET 2003. In your Add Reference dialog,
scroll down and look for Microsoft.mshtml.dll; it should be there.
Likewise for Borland C# Builder, there is a Borland.mshtml.dll included in
the install folder (just in case you or anyone who reads this is curious).

Anyway, as for the IHTMLDocument interfaces, use of:
IHTMLDocument and IHTMLDocument2 requires IE 4.
IHTMLDocument3 requires IE 5.
IHTMLDocument4 requires IE 5.5.
IHTMLDocument5 requires IE 6.
I consider it a pretty safe bet that IE 5.5+ is installed, and 6 has been
out a while and Windows Update pushes it.
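
For anyone worried about which of these interfaces a given machine actually
offers, here is a minimal sketch of probing for them at runtime rather than
trusting the version list. It assumes the project references the
Microsoft.mshtml.dll interop assembly; the class name DocumentInterfaces and
the method Describe are made up for illustration. It leans on the C# "as"
operator, which for a COM object results in a QueryInterface call and gives
null when the interface is missing.

```csharp
using System;
using mshtml; // Microsoft.mshtml.dll interop assembly (assumed to be referenced)

static class DocumentInterfaces
{
    // Hypothetical helper: names the newest IHTMLDocumentN interface that the
    // given object supports. "as" yields null instead of throwing when the
    // interface is not available, so unsupported versions simply fall through.
    public static string Describe(object document)
    {
        if ((document as IHTMLDocument5) != null) return "IHTMLDocument5 (IE 6+)";
        if ((document as IHTMLDocument4) != null) return "IHTMLDocument4 (IE 5.5+)";
        if ((document as IHTMLDocument3) != null) return "IHTMLDocument3 (IE 5+)";
        if ((document as IHTMLDocument2) != null) return "IHTMLDocument2 (IE 4+)";
        return "not an MSHTML document";
    }
}
```

Calling DocumentInterfaces.Describe(new HTMLDocumentClass()) should then name
the newest interface the installed IE supports, while any object that is not
an MSHTML document falls through to the last line.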

IHTMLDocument2 provides most of the functionality you need, hence it gets
most of the exposure in examples. Because of how COM works, each individual
interface is separate. As such, you will need both an IHTMLDocument2 and an
IHTMLDocument4 typed reference to your document object. It is a pain, but it
is part of the price of COM interop.
So, something like

    IHTMLDocument2 myDocument = <get your document>;
    IHTMLDocument4 myDocument4 = (IHTMLDocument4)myDocument;

should do the job of giving you both typed references.

Nov 15 '05 #4
Hi Daniel,
I have solved the problem. I'll post the solution in case someone needs it
(it was a little bit hard for me to find).

So, I am reading the URL with the code already posted at the beginning.
Now sWebPage holds the text content of the page. This text is placed in
an HTMLDocumentClass object as follows:

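    // Needs a reference to Microsoft.mshtml.dll and: using mshtml;
    HTMLDocumentClass myDoc = null;
    try
    {
        object[] oPageText = {sWebPage};
        myDoc = new HTMLDocumentClass();
        IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
        oMyDoc.write(oPageText);
        // Close the write stream so MSHTML can finish parsing.
        oMyDoc.close();
    }
    catch
    {
        //handle
    }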

Now you can do whatever you want to the document. When you are done, the HTML
text is in myDoc.documentElement.outerHTML.
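
As a rough sketch of what "whatever you want" can look like for the links
and images from the original question: the code below loads sWebPage into an
MSHTML document, rewrites every link and image URL, and saves the resulting
HTML. It assumes a reference to Microsoft.mshtml.dll and .NET 2.0 or later
(for File.WriteAllText). The names PageRewriter, RewriteUrl, RewriteAndSave
and outputPath are hypothetical, and the http-to-https rule is only a
stand-in for the real change you need.

```csharp
using System;
using System.IO;
using mshtml; // Microsoft.mshtml.dll interop assembly (assumed to be referenced)

static class PageRewriter
{
    // Hypothetical rule for illustration only: point http:// URLs at https://.
    public static string RewriteUrl(string url)
    {
        if (url != null && url.StartsWith("http://", StringComparison.OrdinalIgnoreCase))
            return "https://" + url.Substring("http://".Length);
        return url;
    }

    // Parses sWebPage with MSHTML, rewrites link and image URLs, saves the result.
    public static void RewriteAndSave(string sWebPage, string outputPath)
    {
        HTMLDocumentClass myDoc = new HTMLDocumentClass();
        IHTMLDocument2 doc = (IHTMLDocument2)myDoc;
        doc.write(new object[] { sWebPage });
        doc.close(); // finish the write stream before touching the DOM

        // doc.links holds the elements that carry an href (anchors and areas).
        foreach (IHTMLElement element in doc.links)
        {
            IHTMLAnchorElement anchor = element as IHTMLAnchorElement;
            if (anchor != null)
                anchor.href = RewriteUrl(anchor.href);
        }

        // doc.images holds the img elements.
        foreach (IHTMLElement element in doc.images)
        {
            IHTMLImgElement image = element as IHTMLImgElement;
            if (image != null)
                image.src = RewriteUrl(image.src);
        }

        File.WriteAllText(outputPath, myDoc.documentElement.outerHTML);
    }
}
```

One thing to check before relying on this: MSHTML may hand back href and src
values already resolved to absolute URLs, so look at what you actually get
for relative links in your pages.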

This link was very useful in solving that problem (Thanks Alex):

http://www.csharphelp.com/archives/archive146.html

Sunny

Nov 15 '05 #5
Ahh, much cleaner solution in your particular case. Hopefully the next time
someone asks this question (and someone will), I may actually remember the
solution you found.

Nov 15 '05 #6
