How do I convert an HTML page into XML? My initial thought is to convert the page to xslt but I'm not sure how to do this. Please provide any source code examples if you have them.
Thanks,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
"MLibby" <ml****@nospam.nospam> wrote in message news:90**********************************@microsoft.com...
> How do I convert an HTML page into XML?
There are several notable differences between HTML and XML. Among them,
HTML is not required to be well-formed (having balanced begin and end tags),
and HTML is case-insensitive.
Your conversion must correct for these shortcomings (and several others)
in the HTML before an XML processor will accept it. For guidance, look at
XHTML, which is HTML as an XML vocabulary (XHTML has extra features
which need not concern you, but understanding the differences between
HTML and XHTML will probably help you do your conversion).
For complete details on the differences between XHTML and HTML, see
Section 4 of the XHTML 1.0 Specification at the following URL, http://www.w3.org/TR/xhtml1/#diffs
Derek Harmon
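To make the well-formedness point concrete, here is a minimal sketch in Python (used only because it is compact; it is not the .NET API discussed in this thread). The helper name is_well_formed and both sample fragments are made up for illustration. It shows an XML parser rejecting typical HTML and accepting the XHTML spelling of the same markup.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical HTML: mixed-case tag names and a <BR> that is never closed.
html_fragment = "<P>First line<BR>Second line</p>"

# The same content written as XHTML: lowercase names, every element closed.
xhtml_fragment = "<p>First line<br />Second line</p>"

print(is_well_formed(html_fragment))   # False
print(is_well_formed(xhtml_fragment))  # True
```

The first fragment fails for both reasons Derek names: XML is case-sensitive, so </p> does not close <P>, and <BR> has no end tag.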
Hi Derek,
I read the article you referenced, and XHTML looks like what I need, but I still need some help putting all the pieces together...
Once my document conforms to XHTML, how do I read it into an XML document? For example, XmlTextReader xmlReader = new XmlTextReader(htmlfile);
If so, does XmlTextReader know how to read the file based on the DOCTYPE declaration that precedes the root element?
The article you posted recommends three DOCTYPEs. Which is the most common DOCTYPE for simple HTML pages?
Thank you for your help,
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
>Your conversion must correct for these shortcomings (and several others) in the HTML before an XML processor will accept it. For guidance, look at XHTML, which is HTML as an XML vocabulary (XHTML has extra features which need not concern you, but understanding the differences between HTML and XHTML will probably help you do your conversion).
And for that, you could use something like HTML Tidy, which offers
this kind of conversion right out of the box!
http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/
http://users.rcn.com/creitzel/tidy.html#dotnet
Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch
MLibby wrote:
> I just downloaded HTML Tidy and, as you say, it does a great job converting HTML to XHTML. However, I forgot to mention one minor point ;) I'm trying to convert aspx and ascx files to XML! Sorry about that. I tried to run aspx and ascx files through HTML Tidy and received the following:
>
> line 4 column 1 - Warning: missing <!DOCTYPE> declaration
> line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
> line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!
> line 6 column 1 - Warning: discarding unexpected <summitsw:ppcframe>
> line 4 column 1 - Warning: inserting missing 'title' element
> Info: Document content looks like HTML 3.2
> 4 warnings, 2 errors were found!
HTML Tidy tries to turn the given document into well-formed (X)HTML. If you
want arbitrary XML, you had better try SgmlReader.
--
Oleg Tkachenko [XML MVP] http://blog.tkachenko.com
Hi Oleg,
I just tried SgmlReader, and though it too works well on HTML, it doesn't work on ascx or aspx source. Here are the results from a simple ascx file...
<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would
result in an invalid XML document.
Have you tried SgmlReader with ascx or aspx source? Am I missing something? Please let me know.
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
MLibby wrote: I just tried SgmlReader and though it too works good on html it doesn't work on ascx or aspx source. Here's results from a simple ascx file...
<?xml version="1.0" encoding="IBM437"?>Error: Token CData in state Prolog would result in an invalid XML document.
Have you tried SgmlReader with ascx or aspx source? Am I missing something. Please let me know.
I don't think you can parse aspx with any XML or HTML tool. ASP.NET
syntax isn't XML, and moreover it allows constructs (such as <%) that
are forbidden in both XML and HTML.
You can try to extend SgmlReader, or preprocess the aspx with a regexp
before parsing.
--
Oleg Tkachenko [XML MVP] http://blog.tkachenko.com
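As a first step toward the regexp preprocessing suggested above, here is a rough sketch that only locates the <% ... %> blocks and sorts them by their leading marker. It is written in Python for brevity rather than with .NET's Regex class; the names SERVER_BLOCK, KINDS and classify are illustrative, and the pattern assumes that no "%>" appears inside a string literal within a block.

```python
import re

# Non-greedy match of one <% ... %> block; DOTALL lets a block span lines.
SERVER_BLOCK = re.compile(r"<%(.*?)%>", re.DOTALL)

# Leading marker -> kind of ASP.NET block.
KINDS = {"@": "directive", "=": "expression", "#": "data-binding"}


def classify(source):
    """Return a (kind, text) pair for every <% ... %> block in source."""
    found = []
    for match in SERVER_BLOCK.finditer(source):
        body = match.group(1)
        if body.startswith("--"):
            # Server-side comment: <%-- ... --%>
            found.append(("comment", body.strip("-").strip()))
        else:
            marker = body[:1]
            kind = KINDS.get(marker, "code")
            text = body[1:] if marker in KINDS else body
            found.append((kind, text.strip()))
    return found


sample = '<%@ Control Language="C#" %>\n<div><%= Title %></div>\n<% if (x) { %>'
for kind, text in classify(sample):
    print(kind, "|", text)
```

Knowing which kinds of blocks a file contains helps decide how each one should be rewritten before the file goes to an XML parser.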
What are your thoughts on using regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of <% into a CDATA section?
Mike
--
mcp, mcse, mcsd, mcad.net, mcsd.net
* MLibby wrote in microsoft.public.dotnet.xml:
> I just downloaded HTMLTidy and as you say it does a great job converting html to xhtml. However, I forgot to mention one minor point ;) I'm trying to convert aspx and ascx files to xml!!! Sorry about that. I tried to run aspx and ascx files through HTMLTidy and received the following:
>
> line 4 column 1 - Warning: missing <!DOCTYPE> declaration
> line 4 column 1 - Warning: discarding unexpected <cyberakt:aspnetmenu>
> line 6 column 1 - Error: <summitsw:ppcframe> is not recognized!
You need to declare custom elements using the new-...-tags configuration
options. For example:
% tidy -asxml --new-blocklevel-tags a:b
<div>...<a:b>...</a:b>...</div>
^Z
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 9 - Warning: <a:b> is not approved by W3C
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 1st June 2004), see www.w3.org" />
<title></title>
</head>
<body>
<div>...
<a:b>...</a:b>
...</div>
</body>
</html>
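If a page uses many custom tags, the list for that option can be collected automatically. The following is only a sketch, in Python, built on assumptions: the helper names custom_tags and tidy_command are invented here, every prefixed tag is declared as block-level (which may be wrong for inline controls), and names are lowercased because that is how Tidy prints them in the log above. Only the -asxml and --new-blocklevel-tags options shown in the example are used.

```python
import re

# An opening tag whose name has the form prefix:name, e.g. <a:b ...>
PREFIXED_TAG = re.compile(r"<([A-Za-z][\w.-]*:[A-Za-z][\w.-]*)")


def custom_tags(source):
    """Return the distinct prefix:name tags in source, in order of first use."""
    seen = []
    for name in PREFIXED_TAG.findall(source):
        lowered = name.lower()  # assumption: Tidy treats names case-insensitively
        if lowered not in seen:
            seen.append(lowered)
    return seen


def tidy_command(source, path):
    """Build (but do not run) a tidy command line that declares the custom tags."""
    command = ["tidy", "-asxml"]
    tags = custom_tags(source)
    if tags:
        command += ["--new-blocklevel-tags", ",".join(tags)]
    command.append(path)
    return command


sample = (
    '<cyberakt:aspnetmenu id="m" />\n'
    "<summitsw:ppcframe>...</summitsw:ppcframe>\n"
    "<CyberAkt:AspNetMenu />"
)
print(tidy_command(sample, "page.ascx"))
```

The resulting list can be handed to subprocess.run once Tidy is installed; the sketch stops short of running it so that it works without Tidy on the path.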
MLibby wrote:
> What are your thoughts on using regexp to preprocess the aspx or ascx file? Would it be a matter of placing the contents of <% into a CDATA section?
Good idea. Either wrap them in CDATA or translate them into something
XMLish, e.g. <%=foo%> into <asp:expression value="foo"/>
PS. JSP has had an alternative XML syntax for years now. ASP.NET still hasn't.
--
Oleg Tkachenko [XML MVP] http://blog.tkachenko.com
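Both suggestions can be combined in one preprocessing pass. The sketch below is in Python rather than .NET and rests on several assumptions: only asp:expression comes from the post above, while the asp:code and asp:page element names and the urn:example:aspx namespace are placeholders invented for this example. It also assumes that no <% ... %> block sits inside an attribute value and that the remaining markup is already well-formed, with any other tag prefixes declared.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

SERVER_BLOCK = re.compile(r"<%(.*?)%>", re.DOTALL)


def to_xmlish(match):
    """Rewrite one <% ... %> block as an element an XML parser accepts."""
    body = match.group(1)
    if body.startswith("="):
        # <%= foo %>  ->  <asp:expression value="foo"/>
        return "<asp:expression value=%s/>" % quoteattr(body[1:].strip())
    # Everything else goes into CDATA. A CDATA section cannot contain "]]>",
    # so any occurrence is split across two adjacent sections.
    safe = body.replace("]]>", "]]]]><![CDATA[>")
    return "<asp:code><![CDATA[%s]]></asp:code>" % safe


def preprocess(source):
    """Replace every server block and wrap the result in a single root element."""
    converted = SERVER_BLOCK.sub(to_xmlish, source)
    # The wrapper keeps a leading directive out of the prolog and declares
    # the (hypothetical) asp prefix.
    return '<asp:page xmlns:asp="urn:example:aspx">%s</asp:page>' % converted


sample = '<%@ Control Language="C#" %>\n<div><%= Title %></div>'
root = ET.fromstring(preprocess(sample))
print(root.find("div/{urn:example:aspx}expression").get("value"))  # Title
```

The wrapper element matters because a file that starts with a directive would otherwise put character data before the root element, which an XML parser rejects.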