Problem with xml.dom parser and xmlns attribute

Hi,

I have a problem parsing HTML text with xml.dom (PyXML). The following
code runs fine:

--------------------------------------------
from xml.dom.ext.reader import HtmlLib
from xml.dom.ext import PrettyPrint

r = HtmlLib.Reader()
doc = r.fromString(
'''
<html>
<head>
</head>
<body>
<p>hallo welt
</body>
</html>
''')
PrettyPrint(doc)
--------------------------------------------

But if I replace <html> with <html xmlns="http://www.w3.org/1999/xhtml">,
I get this error:

Traceback (most recent call last):
  File "xhtml.py", line 5, in ?
    doc = r.fromString(
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
    return self.fromStream(stream, ownerDoc, charset)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
    self.parser.parse(stream)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
    unicode(value, self._charset))
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Element.py", line 177, in setAttributeNS
    attr = self.ownerDocument.createAttributeNS(namespaceURI, qualifiedName)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Document.py", line 139, in createAttributeNS
    raise NamespaceErr()
xml.dom.NamespaceErr: Invalid or illegal namespace operation
Exit code: 1


A lot of HTML documents on the Internet have this xmlns=... attribute.
Are they wrong, or is this a PyXML bug?

Kind regards,

Peter Maas

--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Hubert-Wienen-Str. 24
Tel +49-241-93878-0 Fax +49-241-93878-20 eMail pe********@mplusr.de
-------------------------------------------------------------------

Jul 18 '05 #1

"Peter Maas" <pe********@mplusr.de> wrote in message news:c6**********@swifty.westend.com...
> but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">
> [...]
> A lot of HTML documents on the Internet have this xmlns=.... Are
> they wrong or is this a PyXML bug?


If they are genuine XHTML documents, they should be well-formed XML,
so you should be able to use an XML rather than an SGML parser.

from xml.dom.ext.reader import Sax2
r = Sax2.Reader()
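
The same idea can be sketched without PyXML. This is a minimal example, assuming a current Python 3 and the standard library's xml.dom.minidom in place of the Sax2 reader; the sample document is hypothetical, and the <p> element is closed because an XML parser requires well-formed input.

```python
# Sketch only: uses the standard library's xml.dom.minidom, not PyXML's Sax2 reader.
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document; note that <p> must be closed for XML parsing.
source = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>test</title></head>
<body>
<p>hallo welt</p>
</body>
</html>"""

doc = minidom.parseString(source)
root = doc.documentElement

# The default namespace declaration is exposed as the element's namespaceURI.
print(root.tagName, root.namespaceURI)

# Namespace-aware lookup of the paragraph element.
paragraph = doc.getElementsByTagNameNS(XHTML_NS, "p")[0]
print(paragraph.firstChild.data)
```

With a namespace-aware XML parser, the xmlns declaration is treated as a namespace binding rather than as an ordinary attribute.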

Jul 18 '05 #2
Richard Brodie wrote:
> If they are genuine XHTML documents, they should be well-formed XML,
> so you should be able to use an XML rather than an SGML parser.
>
> from xml.dom.ext.reader import Sax2
> r = Sax2.Reader()

Thanks, Richard. But on the Internet, most of the time I don't know
what kind of document I'm dealing with when I start parsing. I guess
I should use HTMLParser (?).
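
A minimal sketch of the HTMLParser route, assuming Python 3 (where the class lives in html.parser; in Python 2.3 it was the HTMLParser module). The class name TagCollector and its attributes are hypothetical. The parser is event-based and builds no DOM, so xmlns arrives as an ordinary attribute and the unclosed <p> is tolerated.

```python
# Sketch only: Python 3 standard library; class and attribute names are hypothetical.
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags with their attributes, and collect non-blank text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; xmlns is just another attribute here.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


collector = TagCollector()
collector.feed("""<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
<p>hallo welt
</body>
</html>""")
collector.close()

print(collector.tags[0])
print(collector.text)
```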

Kind regards,

Peter Maas

Jul 18 '05 #3

"Peter Maas" <pe********@mplusr.de> wrote in message news:c6**********@swifty.westend.com...
> Thanks, Richard. But on the Internet, most of the time I don't know
> what kind of document I'm dealing with when I start parsing. I guess
> I should use HTMLParser (?).


If you're dealing with a wide range of web pages, chances are they
will have all manner of rubbish in them. I would probably feed the
stuff through Tidy (or uTidyLib) first, to convert to cleanish XHTML,
then use an XML parser.
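
Before deciding whether a page needs cleaning up, it can help to test whether it already parses as XML. The following is a minimal sketch of that check only, assuming Python 3 and the standard library; it does not call Tidy or uTidyLib, and the function name parse_if_well_formed and both sample strings are hypothetical.

```python
# Sketch only: checks well-formedness with the standard library; does not call Tidy.
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_if_well_formed(markup):
    """Return a minidom Document, or None if the markup is not well-formed XML."""
    try:
        return minidom.parseString(markup)
    except ExpatError:
        return None


clean = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hallo welt</p></body></html>'
messy = "<html><body><p>hallo welt</body></html>"  # <p> is never closed

print(parse_if_well_formed(clean) is not None)
print(parse_if_well_formed(messy) is not None)
```

A document for which this returns None is a candidate for a cleanup pass before being handed to an XML parser.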
Jul 18 '05 #4
Peter Maas <pe********@mplusr.de> wrote in message news:<c6**********@swifty.westend.com>...
> A lot of HTML documents on the Internet have this xmlns=.... Are
> they wrong or is this a PyXML bug?


This looks like a 4DOM bug. What are you hoping to do once you've
parsed these documents? If we know, we can either suggest an
alternative tool or perhaps a workaround.

--Uche
Jul 18 '05 #5
