XML Parsing

Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.
Jul 18 '05 #1
9 Replies


Tyler Eaves <ty***@ml1.net> wrote:
Are there any other XML modules that offer the same interface minidom
does, but are faster?


It's not totally the same interface as minidom, but cDomlette offers a
fast set of XML operations through an incomplete DOM interface. See
http://www.4suite.org/ .

With simple XML and a bit of care avoiding problem areas (see eg.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.
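
As a rough illustration of that kind of portable code, here is a minimal sketch that sticks to core DOM calls (childNodes, nodeType, getElementsByTagName). It is run against minidom only; the sample document and the helper names element_text and titles are made up for the example, and reading "a bit of care" as "use only core node-walking calls" is my assumption, not something taken from the compliance page.

```python
from xml.dom import Node, minidom


def element_text(node):
    """Concatenate descendant text using only core DOM node-walking calls."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(element_text(child))
    return "".join(parts)


def titles(document):
    """Return the text of every <title> element in document order."""
    return [element_text(el) for el in document.getElementsByTagName("title")]


# Hypothetical sample document, parsed with minidom for this sketch.
doc = minidom.parseString(
    "<library>"
    "<book><title>Dive Into <em>Python</em></title></book>"
    "<book><title>XML Basics</title></book>"
    "</library>"
)
print(titles(doc))
```

Only the parseString call is specific to minidom here; the two helpers take whatever document object they are handed.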

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #2

Tyler Eaves wrote:
Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.


PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API... it
produces tuple-based output that's easy enough to dig through in Python.
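
To give an idea of what digging through tuple output looks like, here is a small sketch. It does not import pyRXP; the tree is built by hand, and the 4-tuple layout (tag, attributes, children, spare) is an assumption based on my recollection of the pyRXP documentation, so check it against the real parser output. The helper names find_all and text_of are hypothetical.

```python
# Hand-built stand-in for parser output. The layout
# (tag, attributes-or-None, children-or-None, spare) is assumed.
tree = ("config", None, [
    "\n  ",
    ("option", {"name": "debug"}, ["true"], None),
    "\n  ",
    ("option", {"name": "level"}, ["3"], None),
    "\n",
], None)


def find_all(node, tag):
    """Yield every tuple node named `tag`, depth first."""
    name, _attrs, children, _spare = node
    if name == tag:
        yield node
    for child in children or []:
        if isinstance(child, tuple):
            yield from find_all(child, tag)


def text_of(node):
    """Join the direct string children of a tuple node."""
    return "".join(c for c in (node[2] or []) if isinstance(c, str))


options = {n[1]["name"]: text_of(n) for n in find_all(tree, "option")}
print(options)
```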

I'm probably going to be working on a pyRXP -> DOM translator (I've got an
existing DOM app that uses XPath; I don't want to rewrite it to use tuples),
but no idea if/when it'll be in a working state.
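
A translator along those lines could start out something like the sketch below, which builds a minidom document from a tuple tree. This is not the translator mentioned above, just an illustration: it assumes each node is a (tag, attributes, children, spare) tuple with string children for text, uses a hand-built sample instead of real pyRXP output, and the function name tuple_to_dom is hypothetical.

```python
from xml.dom import minidom


def tuple_to_dom(tree):
    """Build a minidom Document from an assumed (tag, attrs, children, spare) tree."""
    doc = minidom.Document()

    def build(node):
        tag, attrs, children, _spare = node
        element = doc.createElement(tag)
        for key, value in (attrs or {}).items():
            element.setAttribute(key, value)
        for child in children or []:
            if isinstance(child, tuple):
                element.appendChild(build(child))
            else:
                # Plain strings become text nodes.
                element.appendChild(doc.createTextNode(child))
        return element

    doc.appendChild(build(tree))
    return doc


# Hypothetical sample tree standing in for parser output.
sample = ("user", {"id": "42"}, [("name", None, ["Tyler"], None)], None)
dom = tuple_to_dom(sample)
print(dom.documentElement.toxml())
```

Existing DOM code can then work on the result as usual, at the cost of building the tree twice.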

--
Chris Herborth ch****@cryptocard.com
Documentation Overlord, CRYPTOCard Corp. http://www.cryptocard.com/
Never send a monster to do the work of an evil scientist.
Jul 18 '05 #3

Andrew Clover wrote:
With simple XML and a bit of care avoiding problem areas (see eg.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.


Ain't standards great? ;-)

--
Chris Herborth ch****@cryptocard.com
Documentation Overlord, CRYPTOCard Corp. http://www.cryptocard.com/
Never send a monster to do the work of an evil scientist.
Jul 18 '05 #4

"Chris Herborth" <ch****@cryptocard.com> wrote in message
news:5z*******************@news20.bellglobal.com...

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.


Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API...


And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.
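
Before porting anything, it may be worth measuring whether parsing is actually the slow part. The sketch below times minidom on a synthetic document with the standard timeit module; the document shape, its size, and the run count are arbitrary assumptions, so substitute a representative file of your own.

```python
import timeit
from xml.dom import minidom

# Synthetic test document: 1000 small elements (an arbitrary choice).
xml_text = (
    "<items>"
    + "".join(f'<item id="{i}">value {i}</item>' for i in range(1000))
    + "</items>"
)


def parse_once():
    """Parse the whole document into a minidom tree."""
    return minidom.parseString(xml_text)


runs = 5
seconds = timeit.timeit(parse_once, number=runs)
per_parse = seconds / runs
print(f"minidom: {per_parse:.4f} s per parse of {len(xml_text)} characters")
```

Timing another parser in the same harness, on the same input, gives a like-for-like comparison.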

James
Jul 18 '05 #5

"James Kew" <ja*******@btinternet.com> wrote in message news:<c0*************@ID-71831.news.uni-berlin.de>...

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).


I made a PyXML-style wrapper for libxml2, although it works above the
existing wrapper and therefore isn't very fast. However, if you just
want to access various parts of your documents before getting libxslt
to do the real work, you might find it convenient. Here it is:

http://www.boddie.org.uk/python/down...dom-0.1.tar.gz

I also made a wrapper around qtxml/KHTML which gives the same
PyXML-style conveniences:

http://www.boddie.org.uk/python/down...dom-0.1.tar.gz

Obviously, if you don't mind writing to a specific API, then neither
of these packages is the way to go. However, XML processing is quite
often a tradeoff between compliance, convenience and performance, as
the recent PyRXP debate demonstrates. ;-)

Paul
Jul 18 '05 #6

Chris Herborth <ch****@cryptocard.com> wrote:
Ain't standards great? ;-)


Heh. Quite so, although to be fair cDomlette and FtMiniDom don't actually
claim to be full DOM implementations.

It was frustration with this rather uneven state of affairs that led me
to roll my own. Speaking of which, I'm happy to announce that pxdom
1.0 [final] has been released:

http://www.doxdesk.com/software/py/pxdom.html

This implements the February 2004 Proposed Recommendations for DOM Level
3 Core/XML and Load/Save completely (except for the optional external
entity support, which will be coming in 1.1 [beta], and optional DTD
validation, which is unlikely to happen any time soon I'm afraid.)

Hurrah!

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #7

Tyler Eaves wrote:
Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.


Take a look at Fredrik Lundh's element tree:
http://effbot.org/zone/element-index.htm

It's fast and very pythonic... I use it all the time.
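
For a taste of the API, here is a minimal sketch. The package linked above was a separate download at the time; the same API has shipped in the standard library as xml.etree.ElementTree since Python 2.5, which is what this example imports. Note that it is not the minidom interface, so existing DOM code would need porting. The inventory document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
xml_text = """<inventory>
  <item sku="A1"><name>Widget</name><qty>4</qty></item>
  <item sku="B2"><name>Gadget</name><qty>0</qty></item>
</inventory>"""

root = ET.fromstring(xml_text)

# Elements behave like lists of children with dict-like attribute access.
in_stock = [
    (item.get("sku"), item.findtext("name"))
    for item in root.findall("item")
    if int(item.findtext("qty")) > 0
]
print(in_stock)
```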

Regards,
Nicodemus.
Jul 18 '05 #8

Chris Herborth <ch****@cryptocard.com> wrote in message news:<Gz*******************@news20.bellglobal.com> ...
Andrew Clover wrote:
With simple XML and a bit of care avoiding problem areas (see eg.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.


Ain't standards great? ;-)


We have never claimed cDomlette to be a full DOM implementation. The main page
for cDomlette info is:

http://uche.ogbuji.net/tech/akara/no...1-01/domlettes

It starts with:

"Domlette is 4Suite's lightweight DOM implementation. It is optimized
for XPath operations, speed, and relatively low memory overhead, at
least when compared to 4DOM and minidom. It is not fully DOM
compliant, but it does provide an interface very close to DOM Level 2.
In Domlette, where DOM and XPath disagree, XPath wins."

That last point is the salient one. We wrote cDomlette for a reason:
4XSLT was *way* too slow operating on standard DOM and we needed a
super-fast alternative specialized for XPath processing. The emphasis
was on XPath data model rather than DOM. Both, BTW, are W3C standards
and yet they conflict in a few key ways. Go figure.
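
The post does not list the conflicts. One difference I am confident of concerns text nodes: DOM allows two text nodes to sit next to each other as siblings, while the XPath data model groups contiguous character data into a single text node. The sketch below shows the DOM side with minidom (not cDomlette); normalize() merges the adjacent nodes into the shape XPath expects.

```python
from xml.dom import minidom

doc = minidom.parseString("<p/>")
p = doc.documentElement

# DOM happily keeps two adjacent text nodes as separate siblings.
p.appendChild(doc.createTextNode("Hello, "))
p.appendChild(doc.createTextNode("world"))
before = len(p.childNodes)

# normalize() merges adjacent text nodes into one.
p.normalize()
after = len(p.childNodes)

print(before, after, p.firstChild.data)
```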

Anyway, cDomlette is a useful and very fast general API for XML
processing. You can use it if you don't need full DOM support.

--Uche
http://uche.ogbuji.net
Jul 18 '05 #9

"James Kew" <ja*******@btinternet.com> wrote in message news:<c0*************@ID-71831.news.uni-berlin.de>...
"Chris Herborth" <ch****@cryptocard.com> wrote in message
news:5z*******************@news20.bellglobal.com...

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.


Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API...


And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.


Yes, and this is a very serious problem. Anyone entering into XML
processing with the belief that they'll never need anything but
Unicode characters below U+0100 (code point 256) is fooling himself. Heck, even XML
exports from MS Office will generate high Unicode characters for
"smart" quotes, em and en dashes, ellipses and a lot of other common
punctuation. All of these will blow up with PyRXP.

You can use PyRXPU, which is compliant but indications are that it
isn't as fast.
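
To make the point concrete, the sketch below lists the code points of the punctuation mentioned above and checks which ones fit in 8 bits. It also parses a UTF-8 document containing them with minidom from the standard library, purely as a reference point; it says nothing about how PyRXP or PyRXPU behave. The sample text and the helper name fits_in_8_bits are made up for the example.

```python
from xml.dom import minidom

samples = {
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
}


def fits_in_8_bits(ch):
    """True if the character's code point is below 256."""
    return ord(ch) < 256


outside = [name for name, ch in samples.items() if not fits_in_8_bits(ch)]
for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}")

# A UTF-8 encoded document with the same punctuation, parsed with minidom.
raw = "<q>\u201cquoted\u201d \u2013 and more\u2026</q>".encode("utf-8")
text = minidom.parseString(raw).documentElement.firstChild.data
print(len(raw), len(text))
```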

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.


This was my biggest problem with libxml2/Python as documented here:

http://www.xml.com/pub/a/2003/05/14/py-xml.html

If documentation for Python users is improved, it will be hard to beat
that package.

But your criteria lead me to suggest that you give cDomlette a try. It
is also implemented in C for performance. It's about as DOM-compliant
as libxml2's DOM API (which is to say not fully so), but we do try to
document it from the Python POV. See:

http://uche.ogbuji.net/tech/akara/no...1-01/domlettes

--Uche
http://uche.ogbuji.net
Jul 18 '05 #10
