Convert HTML to XML or Parse HTML

Hello,

Does anybody know of a .NET or COM-based library to
parse HTML, or to convert HTML to XML so that I can use XPath to
query it?

Thanks
Qin Zhou
Nov 18 '05 #1
It looks like you can use the COM wrapper around Tidy to get there...

http://perso.wanadoo.fr/ablavier/TidyCOM/

http://www.15seconds.com/Issue/010601.htm
Nov 18 '05 #2
Hi Q.Z,
Thank you for using Microsoft Newsgroup Service. Based on your description,
you are looking for COM or .NET components that can convert an
HTML document into an XML (XHTML) style document. Is my understanding correct?

If so, I think Ken Cox has provided some good sites on this topic; they
show two COM components. You may want to try them to see whether they
help.

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)

Nov 18 '05 #3
"Ken Cox [Microsoft MVP]" wrote:
It looks like you can use the COM wrapper around Tidy to get there...

http://perso.wanadoo.fr/ablavier/TidyCOM/

http://www.15seconds.com/Issue/010601.htm


Uh -- why COM when there's Chris Lovett's SgmlReader at
www.gotdotnet.com?

Also note that TidyLib can be easily used through P/Invoke.

Cheers,
--
Joerg Jooss
jo*********@gmx.net
Nov 18 '05 #4
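
For readers who want to see the general idea behind tools such as SgmlReader and Tidy (turn loose HTML into a well-formed tree, then query it with path expressions), here is a minimal sketch in Python using only the standard library. It is not the API of either tool, and it is far less forgiving than they are: it only closes void elements such as <br> and any tags left open, and it does not apply HTML's implicit-close rules. The names HtmlToXml and html_to_xml are made up for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Hypothetical helper: builds a well-formed element tree from loose HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # mirrors the builder's stack of open elements

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. <input disabled>) arrive as None.
        attrib = {name: (value or "") for name, value in attrs}
        self.builder.start(tag, attrib)
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close anything the page left open
        parser.builder.end(parser.open_tags.pop())
    return parser.builder.close()


page = "<html><body><p>Hello<br>world<p>See <a href=x.html>this</a></body></html>"
root = html_to_xml(page)
print(ET.tostring(root, encoding="unicode"))               # well-formed XML text
print([a.get("href") for a in root.findall(".//a")])       # prints ['x.html']
```

Note that ElementTree's findall understands only a subset of XPath. Real-world pages will also need the error recovery that a dedicated library provides.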
Q.Z
Ken and Steven,

Thanks a lot! Looks like it will do the trick.

Qin Zhou
Nov 18 '05 #5
I have tried the SgmlReader but am having difficulty with some sites, such as www.msn.com.

If I could find a way to parse HTML using C/C++/C# I would be happy. All I really
need is a way to get an array of <tag> and <data>. Finer granularity is not necessary. Just
the raw information. I do need the entire page though, from the opening <html> to the closing </html>.

I would prefer an HTML to XML conversion, but as time is limited, any solution would be
appreciated.

Thanks,
Dave

On Fri, 09 Jan 2004 03:23:29 GMT, v-******@online.microsoft.com (Steven Cheng[MSFT]) wrote:
Hi Q.Z,
Thank you for using Microsoft Newsgroup Service. Based on your description,
you are looking for COM or .NET components that can convert an
HTML document into an XML (XHTML) style document. Is my understanding correct?

If so, I think Ken Cox has provided some good sites on this topic; they
show two COM components. You may want to try them to see whether they
help.

Nov 18 '05 #6
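
The flat "array of <tag> and <data>" that Dave asks for above can be sketched with an event-style parser. The example below is only an illustration in Python's standard library, not a C/C++/C# solution and not any library named in this thread. TagDataCollector and tokenize are hypothetical names, and dropping whitespace-only text is an assumption made here to keep the output short.

```python
from html.parser import HTMLParser


class TagDataCollector(HTMLParser):
    """Hypothetical helper: records tags and text in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []  # list of ("tag", name) and ("data", text) pairs

    def handle_starttag(self, tag, attrs):
        self.items.append(("tag", tag))

    def handle_endtag(self, tag):
        # Closing tags are recorded with a leading slash.
        self.items.append(("tag", "/" + tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # assumption: whitespace-only runs are not interesting
            self.items.append(("data", text))


def tokenize(html):
    collector = TagDataCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered
    return collector.items


for kind, value in tokenize("<html><body><h1>Title</h1><p>Some <b>bold</b> text</body></html>"):
    print(kind, value)
```

Because nothing here tries to build a tree, unclosed or mis-nested tags do not cause errors; they simply appear in the list in the order they occur in the page.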
If you load your page into a WebBrowser control, you can parse it using the DOM.
This is slow, but it works.

Nov 18 '05 #7
Take a look
http://blogs.msdn.com/smourier/archi...6/04/8265.aspx

George.
Nov 18 '05 #8
