Convert HTML to XML

How do I convert an HTML page into XML? My initial thought is to convert the page to xslt but I'm not sure how to do this. Please provide any source code examples if you have them.

Thanks,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #1
"MLibby" <ml****@nospam.nospam> wrote in message news:90**********************************@microsoft.com...
> How do I convert an HTML page into XML?


There are many notable differences between HTML and XML. Among them:
HTML is not required to be well-formed (for instance, begin and end tags
need not balance), and HTML tag names are case-insensitive, whereas XML
is case-sensitive.

Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).

For complete details on the differences between XHTML and HTML, see
Section 4 of the XHTML 1.0 Specification at the following URL,

http://www.w3.org/TR/xhtml1/#diffs
Derek Harmon
Nov 12 '05 #2
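
To make the well-formedness point in #2 concrete, here is a minimal C# sketch. It assumes only System.Xml; the class and method names are made up for illustration and are not from the thread. It feeds a few fragments to an XML parser and reports which ones it accepts.

```csharp
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Returns true when the text parses as XML; otherwise the parser's
    // message comes back in 'error'.
    public static bool IsWellFormed(string markup, out string error)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        string[] samples =
        {
            "<p>one<br>two</p>",    // fine as HTML, but <br> is never closed
            "<P>mixed case</p>",    // one element to HTML, mismatched tags to XML
            "<p>one<br />two</p>"   // XHTML style, well-formed
        };
        foreach (string sample in samples)
        {
            string error;
            bool ok = IsWellFormed(sample, out error);
            Console.WriteLine((ok ? "ok:   " : "fail: ") + sample);
        }
    }
}
```

Only the third fragment gets through, which is roughly what any HTML-to-XML conversion has to arrange for the whole page.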
Hi Derek,

I read the article you referenced and xhtml looks like what I'm looking for but I still need some help putting all the pieces together...

Once my document conforms to XHTML standards how do I read it into an xml document? For example, XmlTextReader xmlReader = new XmlTextReader(htmlfile);

If this is true then does XmlTextReader know how to read the file based on the DOCTYPE declaration in the document prior to the root element?

The article you posted recommended 3 DOCTYPEs. What is the most common DOCTYPE for simple HTML pages?

Thank you for your help,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #3
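
On the XmlTextReader question in #3: as far as I know, an XML reader does not need the DOCTYPE to parse a well-formed XHTML file; the DOCTYPE matters for validation and for named entities such as &nbsp;. Below is a minimal sketch, not a definitive answer. It assumes the later XmlReader.Create API (XmlReaderSettings.DtdProcessing, .NET 4 and up) rather than the XmlTextReader constructor from the post, and the class name is made up. The sample uses the XHTML 1.0 Transitional DOCTYPE, one of the three the spec defines.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XhtmlLoadSketch
{
    // Loads XHTML text into a DOM. The DOCTYPE is skipped instead of being
    // downloaded, so named entities like &nbsp; would NOT resolve here.
    public static XmlDocument Load(string xhtml)
    {
        var settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Ignore;

        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            return doc;
        }
    }

    public static void Main()
    {
        string xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<head><title>Demo</title></head><body><p>Hello</p></body></html>";

        XmlDocument doc = Load(xhtml);
        Console.WriteLine(doc.DocumentElement.LocalName);
    }
}
```

If the pages do use named entities, ignoring the DTD is not enough and the reader needs some way to resolve the XHTML DTD; that part is left out of this sketch.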
> Your conversion must correct for these shortcomings (and several others)
> in the HTML before an XML processor will accept it. For guidance, look at
> XHTML, which is HTML as an XML vocabulary (XHTML has extra features
> which need not concern you, but understanding the differences between
> HTML and XHTML will probably help you do your conversion).


And for that, you could use something like HTML Tidy, which offers
this kind of conversion right out of the box!

http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/
http://users.rcn.com/creitzel/tidy.html#dotnet

Marc
================================================================
Marc Scheuner                        May The Source Be With You!
Bern, Switzerland                         m.scheuner(at)inova.ch
Nov 12 '05 #4
MLibby wrote:
> I just downloaded HTMLTidy and as you say it does a great job converting html to xhtml. However, I forgot to mention one minor point;):) I'm trying to convert aspx and ascx files to xml!!! Sorry about that. I tried to run aspx and ascx files through HTMLTidy and received the following:
>
> line 4 column 1 - Warning: missing <!DOCTYPE> declaration
> line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
> line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!
> line 6 column 1 - Warning: discarding unexpected <summitsw:ppcframe>
> line 4 column 1 - Warning: inserting missing 'title' element
> Info: Document content looks like HTML 3.2
> 4 warnings, 2 errors were found!


HTML Tidy tries to turn the given document into well-formed (X)HTML. If you
want arbitrary XML, you'd better try SgmlReader.

--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #5
Hi Oleg,

I just tried SgmlReader and though it too works well on html it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...

<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
result in an invalid XML document.

Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.

Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #6
MLibby wrote:
> I just tried SgmlReader and though it too works well on html it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...
>
> <?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
> result in an invalid XML document.
>
> Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.


I don't think you can parse aspx with any XML or HTML tool. ASP.NET
syntax isn't XML and moreover it allows constructs (such as <%), which
are forbidden in both XML and HTML.
You can try to extend SgmlReader or to preprocess aspx by regexp before
parsing.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #7
What are your thoughts on using regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% ... %> block into a CDATA section?

Mike

--
mcp, mcse, mcsd, mcad.net, mcsd.net
Nov 12 '05 #8
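
The CDATA idea from #8 could look roughly like the sketch below. It is a sketch under assumptions: the class name and the <aspx-root> wrapper element are invented for illustration. The wrapper is there because a CDATA section is only legal inside an element, and a control file typically starts with a <%@ ... %> directive before any markup. The "Token CData in state Prolog" error quoted earlier looks consistent with exactly that situation, though that is an inference.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class AspxCdataSketch
{
    // Any <% ... %> block (directives, expressions, code).
    // Singleline lets '.' cross line breaks.
    static readonly Regex ServerBlock = new Regex(@"<%.*?%>", RegexOptions.Singleline);

    public static string WrapServerBlocks(string aspx)
    {
        // "]]>" may not appear inside CDATA, so it is split across two sections.
        string body = ServerBlock.Replace(aspx, m =>
            "<![CDATA[" + m.Value.Replace("]]>", "]]]]><![CDATA[>") + "]]>");

        // Hypothetical wrapper element so leading directives end up inside a root.
        return "<aspx-root>" + body + "</aspx-root>";
    }

    public static void Main()
    {
        string ascx = "<%@ Control Language=\"C#\" %>\n<div><%= Title %></div>";
        string xml = WrapServerBlocks(ascx);

        var doc = new XmlDocument();
        doc.LoadXml(xml);   // throws XmlException if the result is still not well-formed
        Console.WriteLine(doc.DocumentElement.OuterXml);
    }
}
```

This only deals with the <% %> blocks. Blocks inside attribute values, prefixed tags such as asp:* without a namespace declaration, and ordinary unclosed HTML tags would still stop an XML parser, so a Tidy or SgmlReader pass is probably still needed.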
* MLibby wrote in microsoft.public.dotnet.xml:
> I just downloaded HTMLTidy and as you say it does a great job
> converting html to xhtml. However, I forgot to mention one minor
> point;):) I'm trying to convert aspx and ascx files to xml!!!
> Sorry about that. I tried to run aspx and ascx files through
> HTMLTidy and received the following:
>
> line 4 column 1 - Warning: missing <!DOCTYPE> declaration
> line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
> line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!


You need to declare custom elements using the new-...-tags configuration
options. For example

% tidy -asxml --new-blocklevel-tags a:b
<div>...<a:b>...</a:b>...</div>
^Z
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 9 - Warning: <a:b> is not approved by W3C
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st June 2004), see www.w3.org" />
<title></title>
</head>
<body>
<div>...
<a:b>...</a:b>
...</div>
</body>
</html>
Nov 12 '05 #9
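
Applied to the tags from the original error log, the command line in #9 could be driven from C# along these lines. This is only a sketch: it assumes a tidy executable is on the PATH, the class name and the file name menu.ascx are hypothetical, and treating both custom tags as block-level is a guess (Tidy has separate new-inline-tags and new-empty-tags options for the other cases).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TidySketch
{
    // Builds the same switches as the command in #9, with the custom tags
    // joined by commas so the list needs no extra quoting.
    public static string BuildArguments(IEnumerable<string> customTags, string inputPath)
    {
        return "-asxml --new-blocklevel-tags " + string.Join(",", customTags)
             + " \"" + inputPath + "\"";
    }

    // Runs tidy and returns whatever it writes to standard output.
    // Warnings and errors go to standard error, which is not captured here.
    public static string Run(IEnumerable<string> customTags, string inputPath)
    {
        var info = new ProcessStartInfo("tidy", BuildArguments(customTags, inputPath));
        info.UseShellExecute = false;
        info.RedirectStandardOutput = true;

        using (Process process = Process.Start(info))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }

    public static void Main()
    {
        var tags = new[] { "cyberakt:aspnetmenu", "summitsw:ppcframe" };
        // Only prints the command line; call Run(...) once tidy is installed.
        Console.WriteLine("tidy " + BuildArguments(tags, "menu.ascx"));
    }
}
```

Declaring the tags only stops Tidy from discarding them; the <% %> blocks are a separate problem, covered in the other replies.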
MLibby wrote:
> What are your thoughts on using regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of each <% ... %> block into a CDATA section?


Good idea. Either wrap them into CDATA or translate into something
XMLish, e.g. <%=foo%> into <asp:expression value="foo"/>

PS. JSP has had an alternative XML syntax for years now. ASP.NET still hasn't.
--
Oleg Tkachenko [XML MVP]
http://blog.tkachenko.com
Nov 12 '05 #10
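
The "translate into something XMLish" option from #10 might be sketched like this. The asp:expression element is the one suggested in the post; the class name, the helper, and the namespace URI urn:example:aspnet are placeholders invented for the example, not anything official. Only <%= ... %> expressions are handled; directives and code blocks would need their own rules.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class AspxExpressionSketch
{
    // <%= expr %> only; group 1 is the expression without surrounding whitespace.
    static readonly Regex Expression =
        new Regex(@"<%=\s*(.*?)\s*%>", RegexOptions.Singleline);

    // The expression ends up in an attribute value, so markup characters are escaped.
    static string EscapeAttribute(string value)
    {
        return value.Replace("&", "&amp;").Replace("<", "&lt;")
                    .Replace(">", "&gt;").Replace("\"", "&quot;");
    }

    public static string Translate(string aspx)
    {
        return Expression.Replace(aspx, m =>
            "<asp:expression value=\"" + EscapeAttribute(m.Groups[1].Value) + "\"/>");
    }

    public static void Main()
    {
        // A namespace-aware parser needs the asp prefix declared somewhere;
        // the URI here is a placeholder.
        string page = "<div xmlns:asp=\"urn:example:aspnet\">Hello, <%= user.Name %>!</div>";
        string xml = Translate(page);

        new XmlDocument().LoadXml(xml);   // throws if the result is not well-formed
        Console.WriteLine(xml);
    }
}
```

For the sample page this prints `<div xmlns:asp="urn:example:aspnet">Hello, <asp:expression value="user.Name"/>!</div>`. Compared with CDATA wrapping, the expressions become real elements that XPath or XSLT can select.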
