Convert HTML to XML

How do I convert an HTML page into XML? My initial thought is to convert the page to XSLT, but I'm not sure how to do this. Please provide source code examples if you have any.

Thanks,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #1
"MLibby" <ml****@nospam.nospam> wrote in message news:90**********************************@microsoft.com...
> How do I convert an HTML page into XML?


There are several notable differences between HTML and XML. Two of them:
HTML is not required to be well-formed (having balanced begin and end tags),
and HTML tag names are case-insensitive, whereas XML names are case-sensitive.

Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).
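
To make those two differences concrete, here is a minimal sketch. It is written in Python only because its standard-library XML parser is easy to run; the thread itself targets .NET. The helper name `is_well_formed` and both sample strings are illustrative assumptions, not taken from the XHTML specification.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Legal HTML: mixed-case tag names and a <BR> with no end tag.
html_style = "<P>First line<BR>second line</p>"

# The same content written the XHTML way: lowercase names, every element closed.
xhtml_style = "<p>First line<br />second line</p>"

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```

The XML parser rejects the first string (the `<BR>` is never closed and `<P>` does not match `</p>`) and accepts the second.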

For complete details on the differences between XHTML and HTML, see
Section 4 of the XHTML 1.0 Specification at the following URL,

http://www.w3.org/TR/xhtml1/#diffs
Derek Harmon
Nov 12 '05 #2
Hi Derek,

I read the article you referenced and XHTML looks like what I'm looking for, but I still need some help putting all the pieces together...

Once my document conforms to the XHTML standard, how do I read it into an XML document? For example, XmlTextReader xmlReader = new XmlTextReader(htmlfile);

If so, does XmlTextReader know how to read the file based on the DOCTYPE declaration that precedes the root element?

The specification you referenced lists three DOCTYPEs. Which is the most common one for simple HTML pages?

Thank you for your help,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #3
> Your conversion must correct for these shortcomings (and several others)
> in the HTML before an XML processor will accept it. For guidance, look at
> XHTML, which is HTML as an XML vocabulary (XHTML has extra features
> which need not concern you, but understanding the differences between
> HTML and XHTML will probably help you do your conversion).


And for that, you could use something like HTML Tidy, which offers
this kind of conversion right out of the box!

http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/
http://users.rcn.com/creitzel/tidy.html#dotnet
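
For readers who want to see roughly what such a cleanup involves, the sketch below uses Python's standard `html.parser` and `xml.etree.ElementTree` modules. It is not HTML Tidy and does not reproduce Tidy's rules. The names `SoupToXml`, `html_to_xml`, and `VOID_TAGS` are hypothetical, and the code assumes the input has a single root element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have an end tag in HTML (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SoupToXml(HTMLParser):
    """Feed tag soup in, build a well-formed ElementTree out."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased the tag and attribute names.
        # A bare attribute such as "checked" gets its own name as its value.
        attributes = {name: (value if value is not None else name)
                      for name, value in attrs}
        self.builder.start(tag, attributes)
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)


def html_to_xml(html: str) -> str:
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever the author left open
        parser.builder.end(parser.open_tags.pop())
    return ET.tostring(parser.builder.close(), encoding="unicode")


print(html_to_xml("<HTML><BODY><P>one<BR>two<P>three</BODY></HTML>"))
```

The result is well-formed, but the second `<p>` ends up nested inside the first, which is not how HTML treats an omitted `</p>`. Handling cases like that correctly is a good reason to prefer a dedicated tool over a hand-rolled converter.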

Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch
Nov 12 '05 #4
MLibby wrote:
> I just downloaded HTMLTidy and as you say it does a great job converting html to xhtml. However, I forgot to mention one minor point;):) I'm trying to convert aspx and ascx files to xml!!! Sorry about that. I tried to run aspx and ascx files through HTMLTidy and received the following:
>
> line 4 column 1 - Warning: missing <!DOCTYPE> declaration
> line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
> line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!
> line 6 column 1 - Warning: discarding unexpected <summitsw:ppcframe>
> line 4 column 1 - Warning: inserting missing 'title' element
> Info: Document content looks like HTML 3.2
> 4 warnings, 2 errors were found!


HTML Tidy tries to turn the given document into well-formed (X)HTML. If you
want arbitrary XML, you'd better try SgmlReader.

--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #5
Hi Oleg,

I just tried SgmlReader, and though it too works well on HTML, it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...

<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
result in an invalid XML document.

Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.

Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #6
MLibby wrote:
> I just tried SgmlReader, and though it too works well on HTML, it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...
>
> <?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
> result in an invalid XML document.
>
> Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.


I don't think you can parse aspx with any XML or HTML tool. ASP.NET
syntax isn't XML; moreover, it allows constructs (such as <%) that
are forbidden in both XML and HTML.
You could try extending SgmlReader, or preprocessing the aspx with regular
expressions before parsing.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #7
What are your thoughts on using a regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% block into a CDATA section?

Mike

--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #8
* MLibby wrote in microsoft.public.dotnet.xml:
> I just downloaded HTMLTidy and as you say it does a great job
> converting html to xhtml. However, I forgot to mention one minor
> point;):) I'm trying to convert aspx and ascx files to xml!!!
> Sorry about that. I tried to run aspx and ascx files through
> HTMLTidy and received the following:
>
> line 4 column 1 - Warning: missing <!DOCTYPE> declaration
> line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
> line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!


You need to declare custom elements using the new-...-tags configuration
options. For example

% tidy -asxml --new-blocklevel-tags a:b
<div>...<a:b>...</a:b>...</div>
^Z
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 9 - Warning: <a:b> is not approved by W3C
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st June 2004), see www.w3.org" />
<title></title>
</head>
<body>
<div>...
<a:b>...</a:b>
...</div>
</body>
</html>
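
One caveat about output of this shape, sketched here in Python as an illustration rather than anything from the Tidy documentation: the `a` prefix is never bound to a namespace, so a namespace-aware XML parser rejects the element until a declaration is added. The namespace URI `urn:example:controls` and the helper name `parse_or_error` are made-up placeholders.

```python
import xml.etree.ElementTree as ET

# Shape of the element in the output above: prefix "a" is never declared.
undeclared = "<div>...<a:b>...</a:b>...</div>"

# Same fragment with the prefix bound to a placeholder namespace URI.
declared = '<div xmlns:a="urn:example:controls">...<a:b>...</a:b>...</div>'


def parse_or_error(markup: str):
    """Return the parsed root element, or the parser's error message."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as error:
        return str(error)


print(parse_or_error(undeclared))   # an error message, not an element
root = parse_or_error(declared)
print(root[0].tag)                  # the expanded name of the a:b element
```

If the downstream XML reader is namespace-aware, adding an `xmlns:` declaration for each custom prefix to the root element after Tidy has run is one way to get past this.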
Nov 12 '05 #9
MLibby wrote:
> What are your thoughts on using a regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% block into a CDATA section?


Good idea. Either wrap them in CDATA sections or translate them into something
XMLish, e.g. <%= foo %> into <asp:expression value="foo"/>
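
Both options can be combined in one preprocessing pass. The following is a rough sketch in Python (standard library only), not a complete ASP.NET parser. Only `asp:expression` comes from the suggestion above; the wrapper element `asp:page`, the `asp:code` element, the namespace URI, and the function name `aspx_to_xml` are hypothetical. Everything is wrapped in a root element because a CDATA section cannot appear before the root, which is what the "Token CData in state Prolog" error earlier in the thread complains about.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

ASP_NS = "urn:example:aspnet"  # hypothetical namespace URI

# Matches <% ... %> blocks; group 1 is "=" for <%= ... %> expressions.
ASP_BLOCK = re.compile(r"<%(=?)(.*?)%>", re.DOTALL)


def _replace(match):
    is_expression, body = match.group(1), match.group(2)
    if is_expression:
        # <%= foo %>  ->  <asp:expression value="foo"/>
        return "<asp:expression value=%s/>" % quoteattr(body.strip())
    # Any other block (directive, code, data binding) goes into CDATA.
    # A literal "]]>" in the code must be split across two sections.
    safe = body.replace("]]>", "]]]]><![CDATA[>")
    return "<asp:code><![CDATA[%s]]></asp:code>" % safe


def aspx_to_xml(source: str) -> str:
    body = ASP_BLOCK.sub(_replace, source)
    # The wrapper supplies a single root and declares the asp prefix.
    return '<asp:page xmlns:asp="%s">%s</asp:page>' % (ASP_NS, body)


sample = ('<%@ Control Language="C#" %>\n'
          '<div><%= Name %></div>\n'
          '<% if (x < 1) { %><p>hi</p><% } %>')
converted = aspx_to_xml(sample)
ET.fromstring(converted)  # raises ParseError if the result is not well-formed
print(converted)
```

This only helps when the surrounding markup is already well-formed. It does not handle `<%= ... %>` inside attribute values, and other control prefixes (such as the `cyberakt:` and `summitsw:` ones seen earlier) would still need their own namespace declarations.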

PS. JSP has had an alternative XML syntax for years now. ASP.NET still doesn't.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #10
