html parser

P: n/a
I want to write a C# program to parse an HTML file. How can I do this with system.mshtml? Or is there another way to do it? Please help me.
Nov 15 '05 #1
7 Replies


P: n/a
Treat the HTML file as simple XML.
You can do this by loading the HTML file into an XmlDocument object:
XmlDocument doc = new XmlDocument();
doc.Load("yourfile");

Now you can walk the doc object using the methods of the XmlDocument class (SelectNodes etc.).
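
A minimal sketch of this approach, assuming the input is already well-formed (XHTML-style) markup with no DOCTYPE and no default namespace. The class name XhtmlLinks and the GetLinks helper are made up for illustration:

```csharp
using System;
using System.Xml;

public static class XhtmlLinks
{
    // Returns the href value of every <a> element that has one.
    // Assumes the markup is well-formed XML; LoadXml throws XmlException otherwise.
    public static string[] GetLinks(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath: all <a> elements anywhere in the document that carry an href attribute.
        XmlNodeList anchors = doc.SelectNodes("//a[@href]");

        string[] links = new string[anchors.Count];
        for (int i = 0; i < anchors.Count; i++)
        {
            links[i] = anchors[i].Attributes["href"].Value;
        }
        return links;
    }
}
```

To read from a file instead of a string, doc.Load("yourfile") can replace the LoadXml call. XPath is case-sensitive, so "//a" will not match <A> elements.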

cheers
Aditya

Nov 15 '05 #2

P: n/a
You can't just load HTML into the XmlDocument. There is no guarantee that
the HTML is well formed. For example, can the following be loaded as XML?
No, because the <img> element is never closed.

<html><body><img src="myimage.jpg"></body></html>
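
The failure on the <img> snippet above can be confirmed with a quick check. This is only a sketch: the WellFormedCheck class and its TryLoad helper are hypothetical names, and the exact error text depends on the .NET version:

```csharp
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Returns null when the markup loads as XML,
    // otherwise the message of the XmlException raised by the parser.
    public static string TryLoad(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return null;
        }
        catch (XmlException ex)
        {
            // The parser stops at the first problem it encounters.
            return ex.Message;
        }
    }
}
```

Passing the snippet with the unclosed <img> returns an error message, while the same snippet with <img src="myimage.jpg" /> returns null.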
Nov 15 '05 #3

P: n/a
However, you can use the result of the failed load to reformat your HTML and
then load it into the DOM again. This kind of massaging can be extremely
useful. Applying a few default transformations to the HTML first lets most
HTML load, for instance terminating elements that have no end tags, such as
<img> and <br>. For the harder cases, the exceptions help:

<li>Some stuff
<li>Some more stuff

The above can be a bit harder to write default transforms for, so the
exception generated by the parser might just help you create a transform for
the particular HTML you are trying to load.
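
A rough sketch of one such default transform, assuming a regular expression is good enough for the input at hand (it will misfire on attribute values that contain '>'). It only self-closes <img> and <br>; the VoidElementFixer name and both helpers are made up for illustration, and the <li> case is deliberately not handled:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class VoidElementFixer
{
    // Matches an <img ...> or <br ...> tag, whether or not it is already self-closed.
    private static readonly Regex VoidTag = new Regex(
        @"<(img|br)\b([^<>]*?)\s*/?>", RegexOptions.IgnoreCase);

    // Rewrites every matched tag in self-closing form, e.g. <br> becomes <br />.
    public static string CloseVoidElements(string html)
    {
        return VoidTag.Replace(html, "<$1$2 />");
    }

    // Applies the transform, then loads the result.
    // Still throws XmlException if other well-formedness problems remain.
    public static XmlDocument Load(string html)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(CloseVoidElements(html));
        return doc;
    }
}
```

With this in place, <html><body><img src="myimage.jpg"></body></html> becomes <html><body><img src="myimage.jpg" /></body></html> and loads without an exception.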
--
Justin Rogers
DigiTec Web Consultants, LLC.
Blog: http://weblogs.asp.net/justin_rogers

Nov 15 '05 #4

P: n/a
Have a look at these links, they may help you get started.

This is an SGML to XML converter written by Chris Lovett (MSFT), and source
code is included. Once your HTML has been converted to (hopefully well-formed)
XML, you can parse it with the XML classes of .NET. That's how I would probably
do it.
http://www.gotdotnet.com/Community/U...4-c3bd760564bc

I don't know anything about this link, but I ran across it while trying to
find the above link again. It may help you. It should show you how to use the
MSHTML control.
http://www.itwriting.com/htmleditor/index.php

Hope that helps,
Mike Mayer - Visual C# MVP
Nov 15 '05 #5

P: n/a
True, but the overhead involved with exception handling might not be worth
the effort. When loading a document, only the first error is identified.
A standard web page may contain hundreds of tags that are not well formed.
Processing an exception for each one of them would be cumbersome and
expensive.

It might be better to invest in some library that makes the HTML content
well-formed such as Tidy (http://sourceforge.net/projects/ntidy/).
Nov 15 '05 #6

P: n/a
Embed a hidden IE browser control in your application.

It has a Navigate method that takes a URL, and an event that fires
when the page has finished loading.

When loading has finished, use the Document object to get the DOM. The
browser will have parsed it all for you.

That is 50 lines of code at most, and it can parse any HTML the browser
can display.

Peter
Nov 15 '05 #7

P: n/a
// Requires a COM reference to "Microsoft Internet Controls" (SHDocVw).
SHDocVw.InternetExplorer IE = new SHDocVw.InternetExplorer();
IE.Visible = false;
object Dummy = System.Type.Missing;
IE.Navigate("http://www.google.com", ref Dummy, ref Dummy, ref Dummy, ref Dummy);
// This loops until the page is loaded
// (so a separate thread might be a good idea).
while (IE.Busy || IE.ReadyState != SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE)
{
    System.Threading.Thread.Sleep(100);
}
// It's loaded now: fire your event with IE.Document as data and do whatever
// you want with it (including parsing).

Yves

Nov 15 '05 #8
