html parser

I want to write a program in C# to parse an HTML file. How can I do this with system.mshtml? Or is there another way to do it? Please help me.
Nov 15 '05 #1
Treat the HTML file as simple XML.
You can do this by loading the HTML file into an XML document object:
XmlDocument doc = new XmlDocument();
doc.Load("yourfile");

Now you can walk the doc object using the XmlDocument class's methods (SelectNodes and so on).
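
As a rough illustration of the SelectNodes approach, here is a minimal sketch. It assumes the markup is already well-formed XML; the sample string, the class name, and the XPath expression are made up for the example.

```csharp
using System;
using System.Xml;

class SelectNodesSketch
{
    static void Main()
    {
        // Assumption: well-formed markup, so the XML parser accepts it as-is.
        string xhtml =
            "<html><body>" +
            "<a href=\"one.html\">One</a>" +
            "<a href=\"two.html\">Two</a>" +
            "</body></html>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath: every <a> element that carries an href attribute.
        foreach (XmlNode link in doc.SelectNodes("//a[@href]"))
        {
            Console.WriteLine(link.Attributes["href"].Value + " -> " + link.InnerText);
        }
    }
}
```

For a file on disk, doc.Load with a path would take the place of LoadXml.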

cheers
Aditya

Nov 15 '05 #2
You can't just load HTML into the XmlDocument. There is no guarantee that
the HTML is well formed. For example, can the following be loaded as XML?
No.

<html><body><img src="myimage.jpg"></body></html>
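
To see the failure concretely, a small sketch along these lines feeds that exact markup to XmlDocument.LoadXml and catches the resulting XmlException. The class name is illustrative. The parser objects because the img element is never closed.

```csharp
using System;
using System.Xml;

class WellFormedCheck
{
    static void Main()
    {
        // Valid HTML, but not well-formed XML: <img> has no end tag.
        string html = "<html><body><img src=\"myimage.jpg\"></body></html>";

        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(html);
            Console.WriteLine("Loaded as XML.");
        }
        catch (XmlException ex)
        {
            // The exception carries the position of the first problem found.
            Console.WriteLine("Not well-formed XML (line " + ex.LineNumber +
                              ", position " + ex.LinePosition + "): " + ex.Message);
        }
    }
}
```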
"Aditya Ghuwalewala" <ad*****@mail.microsoft.com> wrote in message
news:A4**********************************@microsof t.com...
Consider the HTML file as a simple XML.
You can do it by loading the HTML file in an XML document object--
XMLDocument doc = new XMLDocument();
doc.Load("yourfile");

Now you can parse through the doc object using XMLDocument class library (like SelectNodes etc.)
cheers
Aditya

----- majid wrote: -----

I want write a program with c# to pars a html file how ccan i do this

with system.mshtml? or there is other way to do it p;ease help me?
Nov 15 '05 #3
However, you can use the result of a failed load to fix up your HTML and try
loading it again. This kind of massaging can be extremely useful. Applying a few
default transformations to the HTML first lets most HTML load, for instance
terminating elements that have no end tags, like <img> and <br>. For the
harder cases, the exceptions help:

<li>Some stuff
<li>Some more stuff

Default transforms are harder to write for markup like the above, so the
exception thrown by the parser may help you create a transform for the
particular HTML you are trying to load.
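
One possible shape for such a default transform, sketched with a regular expression. This is only an illustration, not a general HTML fixer: the helper name CloseVoidElements and the tag list are assumptions, and the pattern breaks on attribute values that contain a '>' character. It does nothing for the unclosed <li> case.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

class VoidElementTransform
{
    // Hypothetical helper: rewrites <img ...>, <br> and <hr> as self-closing tags.
    static string CloseVoidElements(string html)
    {
        return Regex.Replace(
            html,
            @"<(img|br|hr)\b([^>]*?)\s*/?>", // tag name, attributes, optional "/"
            "<$1$2 />",
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string html = "<html><body><img src=\"myimage.jpg\"><br></body></html>";
        string fixedHtml = CloseVoidElements(html);
        Console.WriteLine(fixedHtml);

        // With the void elements terminated, the XML parser accepts the markup.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(fixedHtml);
        Console.WriteLine(doc.SelectNodes("//img").Count);
    }
}
```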
--
Justin Rogers
DigiTec Web Consultants, LLC.
Blog: http://weblogs.asp.net/justin_rogers

"Peter Rilling" <pe***@nospam.rilling.net> wrote in message
news:uk**************@TK2MSFTNGP09.phx.gbl...
You can't just load HTML into the XmlDocument. There is no guarantee that
the HTML is well formed. For example, can the following be loaded as XML?
No.

<html><body><img src="myimage.jpg"></body></html>
"Aditya Ghuwalewala" <ad*****@mail.microsoft.com> wrote in message
news:A4**********************************@microsof t.com...
Consider the HTML file as a simple XML.
You can do it by loading the HTML file in an XML document object--
XMLDocument doc = new XMLDocument();
doc.Load("yourfile");

Now you can parse through the doc object using XMLDocument class library

(like SelectNodes etc.)

cheers
Aditya

----- majid wrote: -----

I want write a program with c# to pars a html file how ccan i do this

with system.mshtml? or there is other way to do it p;ease help me?

Nov 15 '05 #4
Have a look at these links, they may help you get started.

This is an SGML to XML converter written by Chris Lovett (MSFT), and source
code is included. Once your (hopefully well-formed) HTML is in XML, you
can parse it with the XML classes of .NET. That's how I would probably
do it.
http://www.gotdotnet.com/Community/U...4-c3bd760564bc

I don't know anything about this link, but I ran across it while trying to
find the above link again. It may help you. It should show you how to use the
MSHTML control.
http://www.itwriting.com/htmleditor/index.php

Hope that helps,
Mike Mayer - Visual C# MVP

"majid" <ma************@hotmail.com> wrote in message
news:6F**********************************@microsof t.com...
I want write a program with c# to pars a html file how ccan i do this with

system.mshtml? or there is other way to do it p;ease help me?
Nov 15 '05 #5
True, but the overhead involved with exception handling might not be worth
the effort. When loading a document, only the first error is identified.
A standard web page may contain hundreds of tags that are not well formed.
Processing an exception for each of those instances would be
cumbersome and expensive.

It might be better to invest in some library that makes the HTML content
well-formed such as Tidy (http://sourceforge.net/projects/ntidy/).

"Justin Rogers" <Ju****@games4dotnet.com> wrote in message
news:ey**************@TK2MSFTNGP10.phx.gbl...
However, you can use the results of the load to reformat your HTML and reapply it to the DOM.
This type of massaging can be extremely useful. However, some default
transformations on the HTML
first enable most HTML to be loaded. For instance, terminating elements that don't have end tags like
<img> and <br>. For the harder cases the exceptions help:

<li>Some stuff
<li>Some more stuff

The above can be a bit harder to write default transforms for, so the exception generated by the parser might
just help you create a transform for the particular HTML you are trying to load.

--
Justin Rogers
DigiTec Web Consultants, LLC.
Blog: http://weblogs.asp.net/justin_rogers

"Peter Rilling" <pe***@nospam.rilling.net> wrote in message
news:uk**************@TK2MSFTNGP09.phx.gbl...
You can't just load HTML into the XmlDocument. There is no guarantee that the HTML is well formed. For example, can the following be loaded as XML? No.

<html><body><img src="myimage.jpg"></body></html>
"Aditya Ghuwalewala" <ad*****@mail.microsoft.com> wrote in message
news:A4**********************************@microsof t.com...
Consider the HTML file as a simple XML.
You can do it by loading the HTML file in an XML document object--
XMLDocument doc = new XMLDocument();
doc.Load("yourfile");

Now you can parse through the doc object using XMLDocument class
library (like SelectNodes etc.)

cheers
Aditya

----- majid wrote: -----

I want write a program with c# to pars a html file how ccan i do
this with system.mshtml? or there is other way to do it p;ease help me?


Nov 15 '05 #6
Embed a hidden IE browser control in your application.

It has a Navigate method that takes a URL, and an event that fires
when the page has finished loading...

Once loading has finished, use the document object to get the DOM. The
browser will have parsed it all for you...

50 lines of code at most, and it can parse any HTML the browser can
display...
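
A minimal sketch of this idea, using the Windows Forms WebBrowser control (which wraps the IE engine) rather than raw COM interop. Assumptions: .NET 2.0 or later with a reference to System.Windows.Forms, a placeholder URL, and that listing link targets is the goal. The class name is made up.

```csharp
using System;
using System.Windows.Forms;

class HiddenBrowserSketch
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        // Never added to a form, so nothing is shown on screen.
        WebBrowser browser = new WebBrowser();
        browser.ScriptErrorsSuppressed = true; // no script error dialogs

        browser.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // The event can also fire for frames; wait for the whole page.
            if (browser.ReadyState != WebBrowserReadyState.Complete)
                return;

            // The browser has already parsed the page into a DOM.
            foreach (HtmlElement link in browser.Document.GetElementsByTagName("a"))
            {
                Console.WriteLine(link.GetAttribute("href"));
            }

            Application.ExitThread(); // end the message loop started below
        };

        browser.Navigate("http://www.example.com/"); // placeholder URL
        Application.Run(); // pump messages so the control can load the page
    }
}
```

The message loop matters here: without it the control never gets to finish loading and the event does not fire.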

Peter
Nov 15 '05 #7
SHDocVw.InternetExplorer IE = new SHDocVw.InternetExplorer();
IE.Visible = false;
object Dummy = System.Type.Missing;
IE.Navigate("http://www.google.com", ref Dummy, ref Dummy, ref Dummy, ref Dummy);
// This loops until the page is loaded
// (so a separate thread might be a good idea).
while (IE.Busy || IE.ReadyState != SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE)
{
    System.Threading.Thread.Sleep(100); // wait briefly instead of spinning
}
// It's loaded now: fire your event with IE.Document as data and do whatever
// you want with it (including parsing).

Yves

"pchapman" <Ch*******@msn-dot-com.no-spam.invalid> schreef in bericht
news:40**********@Usenet.com...
Embed a IE hidden browser control in your application.

It has a navigate method that takes an url and an event that fires
when the page has finished loading...

On finished loading, use the document object to get the dom. The
browser will have parsed it all for you...

50 lines of code...at most...parse any html the browser can
display...

Peter
Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------
http://www.usenet.com

Nov 15 '05 #8
