Processing XHTML 1.1 documents with xml.sax

Has anybody had any luck processing XHTML 1.1 documents with xml.sax?
Whenever I try, Python fetches the W3C DTD referenced by the DOCTYPE at
the top of the document, then fails with an error in the external DTD.
All I need to do is rip through a bunch of XHTML documents and extract
some data. Does anybody know a quick way to do this without SAX making
outgoing network connections and fussing with DTDs?

BTW, here is how to reproduce the error, if anybody cares.
Below is a document, 'hello.html', produced by the W3C's Amaya:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1" />
<title>Hello World</title>
<meta name="generator" content="amaya 8.5, see
http://www.w3.org/Amaya/" />
</head>

<body>
<p>hello world!</p>
</body>
</html>

and the script:

import xml.sax
import xml.sax.handler

# Parse with a do-nothing handler; this alone is enough to trigger the error.
xml.sax.parse("hello.html", xml.sax.handler.ContentHandler())

This raises the following error:

SAXParseException:
http://www.w3.org/TR/xhtml-modulariz...rk-1.mod:89:0:
error in processing external entity reference
Jul 18 '05 #1

Ouch. I took a brief look at this, and expat has a problem here. I
should note that there are few hairier stress tests of DTD
conformance than XHTMLMOD (the basis of XHTML 1.1).

Using the most recent expat, 1.95.8, something weird happens:

[uogbuji@borgia xmlwf]$ xmlwf -p ~/foo.xhtml
/home/uogbuji/http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd: No such file or directory
/home/uogbuji/foo.xhtml:3:52: error in processing external entity reference

xmlwf seems confused about the fact that http:// starts a URL: it
treats the system identifier as a local file path. I tried as much
fiddling as I had time for, but I think there's little recourse but
for you to submit a bug report to the expat project:

http://sourceforge.net/tracker/?grou...27&atid=110127

In the meantime, change your DOCTYPE to reference the XHTML 1.0 DTD
(which *does* work with expat) rather than 1.1.
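
A minimal sketch, not from the original thread, of another way to sidestep the DTD fetch: ask the SAX parser not to load external entities before parsing. It assumes a Python whose expat-based xml.sax driver accepts `feature_external_ges` and `feature_external_pes` being switched off (recent Python 3 releases already default to not loading external general entities). The `TextGrabber` handler and `extract_paragraphs` helper are hypothetical names used only for illustration, and the document is a trimmed copy of hello.html.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)

# Trimmed copy of hello.html from the question, inlined to keep this self-contained.
HELLO = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello World</title></head>
<body>
<p>hello world!</p>
</body>
</html>
"""


class TextGrabber(ContentHandler):
    """Hypothetical handler: collects the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # list of text chunks while inside a <p>

    def startElement(self, name, attrs):
        if name == "p":
            self._buf = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "p" and self._buf is not None:
            self.paragraphs.append("".join(self._buf))
            self._buf = None


def extract_paragraphs(data):
    parser = xml.sax.make_parser()
    # Assumption: the driver honors these features. With both off, the
    # external DTD named in the DOCTYPE is not resolved or downloaded.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = TextGrabber()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.paragraphs


if __name__ == "__main__":
    print(extract_paragraphs(HELLO))
```

One trade-off of this approach: since the DTD is never read, named entities that are defined only there (such as `&nbsp;`) should not be expected to expand, so it suits documents like the one above that use none.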

Good luck.
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Jul 18 '05 #2
