Convert HTML to XML

How do I convert an HTML page into XML? My initial thought is to transform the page with XSLT, but I'm not sure how to do this. Please provide source code examples if you have them.

Thanks,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #1
"MLibby" <ml****@nospam.nospam> wrote in message news:90**********************************@microsoft.com...
How do I convert an HTML page into XML?


There are several notable differences between HTML and XML. Among them:
HTML is not required to be well-formed (for example, it need not have
balanced begin and end tags), and HTML tag names are case-insensitive.
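
To make the difference concrete, here is a minimal sketch using Python's standard library (Python is used only for brevity; the thread itself is about .NET, and the two sample strings are made up). It feeds typical HTML "tag soup" and an XHTML-style equivalent to an XML parser and reports which one is accepted.

```python
import xml.etree.ElementTree as ET

# Typical "tag soup": <p> and <br> are never closed, and tag case is mixed.
html_snippet = "<HTML><body><p>First paragraph<br><P>Second paragraph</body></html>"

# The same content written as well-formed XML (XHTML style).
xhtml_snippet = (
    "<html><body><p>First paragraph<br/></p>"
    "<p>Second paragraph</p></body></html>"
)


def is_well_formed(text: str) -> bool:
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html_snippet))   # rejected by the XML parser
print(is_well_formed(xhtml_snippet))  # accepted by the XML parser
```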

Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).
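
As a rough illustration of what such a conversion involves, the sketch below re-emits HTML as well-formed XML using Python's `html.parser` module. It is not how any particular tool works: the class name `SoupToXml` and the sample input are invented for this example, and it only guarantees balanced, lowercased, quoted markup. It knows nothing about HTML's content model (so the second `<p>` ends up nested inside the first), it drops comments and DOCTYPE declarations, and it does not handle duplicate attributes or multiple top-level elements.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have an end tag.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class SoupToXml(HTMLParser):
    """Re-emit HTML as well-formed XML (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the XML output
        self.open_tags = []  # stack of elements still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names.
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value if value is not None else name))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append("<{}{}/>".format(tag, attr_text))
        else:
            self.out.append("<{}{}>".format(tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag (or end tag of a void element): drop it
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append("</{}>".format(current))
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:  # close whatever is still open at end of input
            self.out.append("</{}>".format(self.open_tags.pop()))
        return "".join(self.out)


converter = SoupToXml()
converter.feed(
    "<HTML><BODY><P CLASS=intro>Fish &amp; chips<BR>"
    '<IMG SRC=a.png ALT="x"><p>Second</BODY></HTML>'
)
xml_text = converter.result()
root = ET.fromstring(xml_text)  # raises ParseError if the output is not well-formed
print(xml_text)
```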

For complete details on the differences between XHTML and HTML, see
Section 4 of the XHTML 1.0 Specification at the following URL,

http://www.w3.org/TR/xhtml1/#diffs
Derek Harmon
Nov 12 '05 #2
Hi Derek,

I read the article you referenced, and XHTML looks like what I'm looking for, but I still need some help putting all the pieces together...

Once my document conforms to XHTML, how do I read it into an XML document? For example, XmlTextReader xmlReader = new XmlTextReader(htmlfile);

If this is right, does XmlTextReader know how to read the file based on the DOCTYPE declaration that precedes the root element?

The article you posted recommends three DOCTYPEs. Which is the most common DOCTYPE for simple HTML pages?

Thank you for your help,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #3
>Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).


And for that, you could use something like HTML Tidy, which offers
this kind of conversion right out of the box!

http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/
http://users.rcn.com/creitzel/tidy.html#dotnet

Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch
Nov 12 '05 #4
MLibby wrote:
I just downloaded HTML Tidy and, as you say, it does a great job converting HTML to XHTML. However, I forgot to mention one minor point ;) :) I'm trying to convert aspx and ascx files to XML! Sorry about that. I tried to run aspx and ascx files through HTML Tidy and received the following:

line 4 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!
line 6 column 1 - Warning: discarding unexpected <summitsw:ppcframe>
line 4 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML 3.2
4 warnings, 2 errors were found!


HTML Tidy tries to turn the given document into well-formed (X)HTML. If you
want arbitrary XML, you had better try SgmlReader.

--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #5
Hi Oleg,

I just tried SgmlReader, and though it too works well on HTML, it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...

<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
result in an invalid XML document.

Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.

Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #6
MLibby wrote:
I just tried SgmlReader, and though it too works well on HTML, it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...

<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
result in an invalid XML document.

Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.


I don't think you can parse aspx with any XML or HTML tool. ASP.NET
syntax isn't XML, and moreover it allows constructs (such as <% ... %>
blocks) that are forbidden in both XML and HTML.
You can try to extend SgmlReader, or preprocess the aspx with a regexp
before parsing.
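
One possible shape for such a regexp preprocessing step is sketched below in Python (chosen for brevity; the same regular expression should work in .NET). It wraps every `<% ... %>` block in a CDATA section so an XML parser sees it as plain text. The function name, the `aspx-source` root element, and the sample control are all made up for illustration. A single root element is added because CDATA is only legal inside an element, and the SgmlReader error quoted above hints that a block at the very top of the file is part of the problem.

```python
import re
import xml.etree.ElementTree as ET

# Matches any ASP.NET block: <% ... %>, <%= ... %>, <%@ ... %>, <%# ... %>.
ASP_BLOCK = re.compile(r"<%.*?%>", re.DOTALL)


def wrap_asp_blocks(source: str, root: str = "aspx-source") -> str:
    """Wrap every <% ... %> block in CDATA and add a single root element.

    The root element name is a placeholder invented for this sketch.
    """

    def to_cdata(match):
        # "]]>" would end the CDATA section early, so split it across two sections.
        body = match.group(0).replace("]]>", "]]]]><![CDATA[>")
        return "<![CDATA[" + body + "]]>"

    return "<{0}>{1}</{0}>".format(root, ASP_BLOCK.sub(to_cdata, source))


# Made-up ascx content with a directive and two code blocks.
ascx = '<%@ Control Language="C#" %>\n<div><% if (show) { %><b>Hello</b><% } %></div>'
xml_text = wrap_asp_blocks(ascx)
root = ET.fromstring(xml_text)
print(root.find("div").text)  # the first code block, preserved as text
```

This only covers blocks that sit between tags. A block inside a tag or attribute value (for example `href="<%= url %>"`) cannot be wrapped in CDATA, a `%>` inside a string literal would end the match early, and prefixed server controls such as `<asp:Label>` still need a namespace declaration before a namespace-aware parser will accept them.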
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #7
What are your thoughts on using a regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% ... %> block into a CDATA section?

Mike

--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #8
* MLibby wrote in microsoft.public.dotnet.xml:
I just downloaded HTML Tidy and, as you say, it does a great job
converting HTML to XHTML. However, I forgot to mention one minor
point ;) :) I'm trying to convert aspx and ascx files to XML!
Sorry about that. I tried to run aspx and ascx files through
HTML Tidy and received the following:

line 4 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!


You need to declare custom elements using the new-...-tags configuration
options (new-blocklevel-tags, new-inline-tags, new-empty-tags,
new-pre-tags). For example:

% tidy -asxml --new-blocklevel-tags a:b
<div>...<a:b>...</a:b>...</div>
^Z
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 9 - Warning: <a:b> is not approved by W3C
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st June 2004), see www.w3.org" />
<title></title>
</head>
<body>
<div>...
<a:b>...</a:b>
...</div>
</body>
</html>
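
If you want to drive this from a script, a minimal Python sketch follows. It only builds the command line shown above and runs it on request. The helper names and the file name `menu.ascx` are hypothetical, and treating the two custom tags from the earlier error output as block-level is an assumption. It also assumes a `tidy` executable is on the PATH.

```python
import subprocess


def tidy_command(custom_block_tags, input_path):
    """Build a tidy command line that outputs XHTML and accepts custom tags."""
    command = ["tidy", "-asxml", "-quiet"]
    if custom_block_tags:
        # The option takes a comma- or space-separated list of tag names.
        command += ["--new-blocklevel-tags", ",".join(custom_block_tags)]
    command.append(input_path)
    return command


def run_tidy(command):
    """Run tidy and return (exit_code, xhtml_text, messages).

    Tidy exits with 1 when it only issued warnings, so a non-zero exit
    code is not treated as a failure here.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Tag names taken from the error output quoted above; the file name is made up.
command = tidy_command(["cyberakt:aspnetmenu", "summitsw:ppcframe"], "menu.ascx")
print(" ".join(command))
```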
Nov 12 '05 #9
MLibby wrote:
What are your thoughts on using a regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% ... %> block into a CDATA section?


Good idea. Either wrap them in CDATA or translate them into something
XMLish, e.g. <%= foo %> into <asp:expression value="foo"/>
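
The translation variant could look roughly like the Python sketch below. It rewrites each `<%= ... %>` expression into an element whose `value` attribute carries the expression text, escaped so that characters such as `<`, `&` and quotes stay legal in XML. The function name, the sample fragment, and the namespace URI `urn:example:asp` are invented for this example; the `asp` prefix has to be bound to some namespace before a namespace-aware parser will accept it.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Matches <%= expression %> and captures the expression without padding.
EXPRESSION = re.compile(r"<%=\s*(.*?)\s*%>", re.DOTALL)


def translate_expressions(source: str) -> str:
    """Replace each <%= expr %> with <asp:expression value="expr"/>."""
    return EXPRESSION.sub(
        # quoteattr adds the surrounding quotes and escapes &, <, > and quotes.
        lambda m: "<asp:expression value={}/>".format(quoteattr(m.group(1))),
        source,
    )


# Made-up fragment; the namespace URI is a placeholder, not an official one.
fragment = '<div xmlns:asp="urn:example:asp">Hello, <%= user.Name %>!</div>'
translated = translate_expressions(fragment)
root = ET.fromstring(translated)
print(translated)
```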

P.S. JSP has had an alternative XML syntax for years now. ASP.NET still doesn't.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #10
