Convert HTML to XML

How do I convert an HTML page into XML? My initial thought is to convert the page to XSLT, but I'm not sure how to do this. Please provide source code examples if you have any.

Thanks,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #1
"MLibby" <ml****@nospam.nospam> wrote in message news:90**********************************@microsoft.com...
How do I convert an HTML page into XML?


There are many notable differences between HTML and XML. Among them,
HTML is not required to be well-formed (having balanced begin and end tags),
and HTML element and attribute names are case-insensitive, whereas XML names are case-sensitive.

Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).

For complete details on the differences between XHTML and HTML, see
Section 4 of the XHTML 1.0 Specification at the following URL,

http://www.w3.org/TR/xhtml1/#diffs
Derek Harmon
Nov 12 '05 #2
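To make the well-formedness and case-sensitivity points concrete, here is a minimal sketch in Python (chosen only for brevity; the thread itself is about .NET). It relies on the standard library's xml.etree.ElementTree. The helper name is_well_formed and the sample markup strings are made up for illustration and do not come from the thread.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical HTML: the <p> and <br> elements are never closed.
html_snippet = "<body><p>Hello<br>world</body>"

# Acceptable in HTML, but XML treats <P> and </p> as different names.
mixed_case_snippet = "<P>Hello</p>"

# The same content written the XHTML way: lowercase names, every element closed.
xhtml_snippet = "<body><p>Hello<br />world</p></body>"

for snippet in (html_snippet, mixed_case_snippet, xhtml_snippet):
    print(is_well_formed(snippet), snippet)
```

Only the XHTML-style snippet should get past the parser; the other two are the kind of input a conversion step has to repair first.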
Hi Derek,

I read the article you referenced, and XHTML looks like what I'm looking for, but I still need some help putting all the pieces together...

Once my document conforms to the XHTML standard, how do I read it into an XML document? For example, would this work? XmlTextReader xmlReader = new XmlTextReader(htmlfile);

If so, does XmlTextReader know how to read the file based on the DOCTYPE declaration that precedes the root element?

The article you posted recommends three DOCTYPEs. Which DOCTYPE is the most common for simple HTML pages?

Thank you for your help,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #3
>Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).


And for that, you could use something like HTML Tidy, which offers
this kind of conversion right out of the box!

http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/
http://users.rcn.com/creitzel/tidy.html#dotnet

Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch
Nov 12 '05 #4
MLibby wrote:
I just downloaded HTML Tidy and, as you say, it does a great job converting HTML to XHTML. However, I forgot to mention one minor point ;) I'm trying to convert aspx and ascx files to XML! Sorry about that. I tried to run aspx and ascx files through HTML Tidy and received the following:

line 4 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!
line 6 column 1 - Warning: discarding unexpected <summitsw:ppcframe>
line 4 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML 3.2
4 warnings, 2 errors were found!


HTML Tidy tries to turn the given document into well-formed (X)HTML. If you
want arbitrary XML, you had better try SgmlReader.

--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #5
Hi Oleg,

I just tried SgmlReader, and although it also works well on HTML, it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...

<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
result in an invalid XML document.

Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.

Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #6
I don't think you can parse aspx with any XML or HTML tool. ASP.NET
syntax isn't XML; moreover, it allows constructs (such as <%) that
are forbidden in both XML and HTML.
You could try to extend SgmlReader, or preprocess the aspx source with
regular expressions before parsing.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #7
What are your thoughts on using a regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% ... %> block into a CDATA section?

Mike

--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #8
* MLibby wrote in microsoft.public.dotnet.xml:
I just downloaded HTML Tidy and, as you say, it does a great job
converting HTML to XHTML. However, I forgot to mention one minor
point ;) I'm trying to convert aspx and ascx files to XML!
Sorry about that. I tried to run aspx and ascx files through
HTML Tidy and received the following:

line 4 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!


You need to declare custom elements using Tidy's new-...-tags configuration
options (new-blocklevel-tags, new-inline-tags, new-empty-tags, new-pre-tags).
For example:

% tidy -asxml --new-blocklevel-tags a:b
<div>...<a:b>...</a:b>...</div>
^Z
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 9 - Warning: <a:b> is not approved by W3C
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st June 2004), see www.w3.org" />
<title></title>
</head>
<body>
<div>...
<a:b>...</a:b>
...</div>
</body>
</html>
Nov 12 '05 #9
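Since every custom element has to be declared before Tidy will keep it, the option value could be generated from the source instead of being typed by hand. The following is a rough Python sketch of that idea, not something proposed in the thread. The function names custom_tags and tidy_command are hypothetical, and treating every prefixed element as block-level is an assumption made for simplicity; the -asxml and --new-blocklevel-tags switches are the ones used in the session above.

```python
import re

# Opening tags with a namespace-style prefix, e.g. <cyberakt:aspnetmenu ...>.
# Closing tags (</a:b>) and ASP blocks (<% ... %>) do not match.
PREFIXED_TAG = re.compile(r"<([A-Za-z][\w.-]*:[A-Za-z][\w.-]*)")


def custom_tags(source):
    """Collect distinct prefixed element names in order of first appearance."""
    seen = []
    for name in PREFIXED_TAG.findall(source):
        if name not in seen:
            seen.append(name)
    return seen


def tidy_command(source):
    """Build a Tidy argument list declaring every prefixed tag as block-level.

    Assumption: block-level is good enough for all custom tags in the file.
    """
    return ["tidy", "-asxml", "--new-blocklevel-tags", ",".join(custom_tags(source))]


sample = "<div><a:b>...</a:b><summitsw:ppcframe></summitsw:ppcframe><a:b/></div>"
print(tidy_command(sample))
```

The resulting list could be handed to subprocess.run with the file contents on standard input. Inline or empty custom elements would need the other new-...-tags options instead.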
MLibby wrote:
What are your thoughts on using a regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% ... %> block into a CDATA section?


Good idea. Either wrap them in CDATA sections or translate them into something
XMLish, e.g. <%=foo%> into <asp:expression value="foo"/>

PS. JSP has had an alternative XML syntax for years now. ASP.NET still doesn't.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #10
