XML Parsing

Tyler Eaves

Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

Jul 18 '05 #1

Andrew Clover

Tyler Eaves <ty***@ml1.net> wrote:

Are there any other XML modules that offer the same interface minidom
does, but are faster?

It's not totally the same interface as minidom, but cDomlette offers a
fast set of XML operations through an incomplete DOM interface. See
http://www.4suite.org/ .

With simple XML and a bit of care avoiding problem areas (see e.g.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.
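
A minimal sketch of that portable style, assuming the standard library's minidom as the implementation under test. The function name item_names and the sample document are made up for illustration; the point is that the function touches only core DOM calls (getElementsByTagName, childNodes, nodeType, data), so it should in principle accept a document object from another DOM implementation as well.

```python
from xml.dom import minidom


def item_names(document):
    """Collect the text of every <name> element using only core DOM calls."""
    names = []
    for element in document.getElementsByTagName("name"):
        # Join the text children by hand rather than relying on
        # convenience attributes that some implementations lack.
        text = "".join(
            child.data
            for child in element.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        names.append(text)
    return names


# Hypothetical sample document, for illustration only.
SAMPLE = (
    "<items>"
    "<item><name>alpha</name></item>"
    "<item><name>beta</name></item>"
    "</items>"
)

doc = minidom.parseString(SAMPLE)
print(item_names(doc))
```

Only the parseString line is minidom-specific; swapping the parser should leave item_names untouched, provided the other implementation supports the same core calls.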

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/

Jul 18 '05 #2

Chris Herborth

Tyler Eaves wrote:

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API... it
produces tuple-based output that's easy enough to dig through in Python.
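
A rough sketch of what digging through such output can look like. It assumes the tuple layout is (tag name, attribute dict or None, list of children or None, spare), with text content stored as plain strings in the child list; check the pyRXP documentation before relying on that. The tree below is built by hand so the example runs without pyRXP installed, and find_all and text_of are hypothetical helper names.

```python
# Hand-built tree in the assumed (tag, attributes, children, spare) layout.
TREE = (
    "items",
    None,
    [
        ("item", {"id": "1"}, ["alpha"], None),
        ("item", {"id": "2"}, ["beta"], None),
    ],
    None,
)


def find_all(node, tag):
    """Yield every element tuple named `tag`, depth first."""
    name, _attrs, children, _spare = node
    if name == tag:
        yield node
    for child in children or []:
        # Strings in the child list are text content, not elements.
        if isinstance(child, tuple):
            yield from find_all(child, tag)


def text_of(node):
    """Join the direct text children of an element tuple."""
    return "".join(c for c in (node[2] or []) if isinstance(c, str))


ids = [(n[1]["id"], text_of(n)) for n in find_all(TREE, "item")]
print(ids)
```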

I'm probably going to be working on a pyRXP -> DOM translator (I've got an
existing DOM app that uses XPath; I don't want to rewrite it to use tuples),
but no idea if/when it'll be in a working state.

--
Chris Herborth ch****@cryptocard.com
Documentation Overlord, CRYPTOCard Corp. http://www.cryptocard.com/
Never send a monster to do the work of an evil scientist.

Jul 18 '05 #3

Chris Herborth

Andrew Clover wrote:

With simple XML and a bit of care avoiding problem areas (see e.g.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.

Ain't standards great? ;-)

--
Chris Herborth ch****@cryptocard.com
Documentation Overlord, CRYPTOCard Corp. http://www.cryptocard.com/
Never send a monster to do the work of an evil scientist.

Jul 18 '05 #4

James Kew

"Chris Herborth" <ch****@cryptocard.com> wrote in message
news:5z*******************@news20.bellglobal.com...

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API...

And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.

James

Jul 18 '05 #5

Paul Boddie

"James Kew" <ja*******@btinternet.com> wrote in message news:<c0*************@ID-71831.news.uni-berlin.de>...

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I made a PyXML-style wrapper for libxml2, although it works above the
existing wrapper and therefore isn't very fast. However, if you just
want to access various parts of your documents before getting libxslt
to do the real work, you might find it convenient. Here it is:

http://www.boddie.org.uk/python/down...dom-0.1.tar.gz

I also made a wrapper around qtxml/KHTML which gives the same
PyXML-style conveniences:

http://www.boddie.org.uk/python/down...dom-0.1.tar.gz

Obviously, if you don't mind writing to a specific API, then neither
of these packages is the way to go. However, XML processing is quite
often a tradeoff between compliance, convenience and performance, as
the recent PyRXP debate demonstrates. ;-)

Paul

Jul 18 '05 #6

Andrew Clover

Chris Herborth <ch****@cryptocard.com> wrote:

Ain't standards great? ;-)

Heh. Quite so, although to be fair cDomlette and FtMiniDom don't actually
claim to be full DOM implementations.

It was frustration with this rather uneven state of affairs that led me
to roll my own. Speaking of which, I'm happy to announce that pxdom
1.0 [final] has been released:

http://www.doxdesk.com/software/py/pxdom.html

This implements the February 2004 Proposed Recommendations for DOM Level
3 Core/XML and Load/Save completely (except for the optional external
entity support, which will be coming in 1.1 [beta], and optional DTD
validation, which is unlikely to happen any time soon I'm afraid.)

Hurrah!

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/

Jul 18 '05 #7

Nicodemus

Tyler Eaves wrote:

Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

Take a look at Fredrik Lundh's ElementTree:
http://effbot.org/zone/element-index.htm

It's fast and very pythonic... I use it all the time.
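
For a flavour of the API, here is a small sketch. ElementTree has since been included in the Python standard library as xml.etree.ElementTree, which is what the import below assumes; the sample document and the names in it are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only.
SAMPLE = (
    "<items>"
    "<item id='1'><name>alpha</name></item>"
    "<item id='2'><name>beta</name></item>"
    "</items>"
)

root = ET.fromstring(SAMPLE)

# Elements behave like lists of children with dict-like attributes,
# so most lookups are one-liners.
pairs = [(item.get("id"), item.findtext("name")) for item in root.findall("item")]
print(pairs)
```

Note that this is not the minidom interface, so existing DOM code would need rewriting rather than a drop-in swap.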

Regards,
Nicodemus.

Jul 18 '05 #8

Uche Ogbuji

Chris Herborth <ch****@cryptocard.com> wrote in message news:<Gz*******************@news20.bellglobal.com> ...

Andrew Clover wrote:

With simple XML and a bit of care avoiding problem areas (see e.g.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.

Ain't standards great? ;-)

We never claim cDomlette to be a DOM implementation. The main page
for cDomlette info is:

http://uche.ogbuji.net/tech/akara/no...1-01/domlettes

It starts with:

"Domlette is 4Suite's lightweight DOM implementation. It is optimized
for XPath operations, speed, and relatively low memory overhead, at
least when compared to 4DOM and minidom. It is not fully DOM
compliant, but it does provide an interface very close to DOM Level 2.
In Domlette, where DOM and XPath disagree, XPath wins."

That last point is the salient one. We wrote cDomlette for a reason:
4XSLT was *way* too slow operating on standard DOM and we needed a
super-fast alternative specialized for XPath processing. The emphasis
was on the XPath data model rather than DOM. Both, BTW, are W3C standards
and yet they conflict in a few key ways. Go figure.

Anyway, cDomlette is a useful and very fast general API for XML
processing. You can use it if you don't need full DOM support.

--Uche
http://uche.ogbuji.net

Jul 18 '05 #9

Uche Ogbuji

"James Kew" <ja*******@btinternet.com> wrote in message news:<c0*************@ID-71831.news.uni-berlin.de>...

"Chris Herborth" <ch****@cryptocard.com> wrote in message
news:5z*******************@news20.bellglobal.com...

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API...

And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

Yes, and this is a very serious problem. Anyone entering into XML
processing with the belief that they'll never need anything but
Unicode characters below U+0100 is fooling himself. Heck, even XML
exports from MS Office will generate high Unicode characters for
"smart" quotes, em and en dashes, ellipses and a lot of other common
punctuation. All of these will blow up with PyRXP.
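
To make the point concrete, the sketch below lists the code points of a few such punctuation marks and confirms that each lies above the 8-bit range. It then parses a UTF-8 document containing them with the standard library's minidom, used here only as an example of a Unicode-aware parser. It does not exercise PyRXP itself, and the sample text is invented.

```python
from xml.dom import minidom

# Punctuation commonly produced by word processors.
PUNCTUATION = {
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "ellipsis": "\u2026",
}

code_points = {name: ord(ch) for name, ch in PUNCTUATION.items()}

# Every one of these sits well above the 8-bit range (0x00 to 0xFF).
assert all(cp > 0xFF for cp in code_points.values())

# A UTF-8 encoded document containing the same characters.
xml_bytes = "<p>\u201cquoted\u201d \u2014 and so on\u2026</p>".encode("utf-8")
doc = minidom.parseString(xml_bytes)
text = doc.documentElement.firstChild.data

for name, cp in code_points.items():
    print(f"{name}: U+{cp:04X}")
```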

You can use PyRXPU, which is compliant, but indications are that it
isn't as fast.

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.

This was my biggest problem with libxml2/Python as documented here:

http://www.xml.com/pub/a/2003/05/14/py-xml.html

If documentation for Python users is improved, it will be hard to beat
that package.

But your criteria lead me to suggest that you give cDomlette a try. It
is also implemented in C for performance. It's as DOM-compliant
as libxml2's DOM API (which is to say not fully so), but we do try to
document it from the Python POV. See:

http://uche.ogbuji.net/tech/akara/no...1-01/domlettes

--Uche
http://uche.ogbuji.net

Jul 18 '05 #10
