Bytes | Software Development & Data Engineering Community
python xml DOM? pulldom? SAX?

jog
Hi,
I want to get text out of some nodes of a huge XML file (1.5 GB). The
structure of the XML file is something like this:
<parent>
<page>
<title>bla</title>
<id></id>
<revision>
<id></id>
<text>blablabla</text>
</revision>
</page>
<page>
</page>
....
</parent>
I want to combine the text of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so one index for each page).
What is the most efficient API for that: SAX (I don't think so), DOM,
or pulldom?
Or should I just use XPath somehow?
I don't want to do anything else with this XML file afterwards.
I hope someone will understand me.....
Thank you very much
Jog

Aug 29 '05 #1
Hi,

I'd advocate using SAX, as DOM-related methods imply loading the
complete XML content into memory, whereas SAX grabs things on the fly.
The SAX method should therefore be faster and consume less memory...
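A minimal sketch of that event-driven approach, written for today's Python (the handler class and variable names here are illustrative, not from this thread):

```python
import xml.sax

class TitleTextHandler(xml.sax.ContentHandler):
    """Collect (title, revision text) pairs without building a tree."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.stack = []      # path of currently open elements
        self.buf = None      # accumulates character data when inside a node of interest
        self.pages = []      # one (title, text) tuple per page, for illustration

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name == "page":
            self.title = self.text = ""
        if name in ("title", "text"):
            self.buf = []

    def characters(self, data):
        if self.buf is not None:
            self.buf.append(data)

    def endElement(self, name):
        if name == "title":
            self.title = "".join(self.buf)
            self.buf = None
        elif name == "text" and "revision" in self.stack:
            self.text = "".join(self.buf)
            self.buf = None
        elif name == "page":
            self.pages.append((self.title, self.text))
        self.stack.pop()

handler = TitleTextHandler()
xml.sax.parseString(
    b"<parent><page><title>bla</title><revision>"
    b"<text>blablabla</text></revision></page></parent>",
    handler)
print(handler.pages)  # [('bla', 'blablabla')]
```

Only the handler state lives in memory at any time, so the same code works unchanged on a multi-gigabyte file fed through `xml.sax.parse()`.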

By the way, if your goal is to just "combine the text out of page:title
and page:revision:text for every single page element", maybe you should
also consider an XSLT filter.

Regards,
Thierry

Aug 29 '05 #2
On 29 Aug 2005 08:17:04 -0700
"jog" <jo@johannageiss.de> wrote:
I want to get text out of some nodes of a huge xml file (1.5 GB). The
architecture of the xml file is something like this
[structure snipped]
I want to combine the text out of page:title and page:revision:text
for every single page element. One by one I want to index these
combined texts (so for each page one index)
What is the most efficient API for that?: SAX (I don't think so) DOM
or pulldom?
Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an
event-driven interface is ideal. DOM parses the *entire* XML tree into
memory at once, before you can do anything - highly inefficient for a
large data set like this. I've never used pulldom; it might have
potential, but from my (limited and flawed) understanding of it, I
think it may also wind up loading most of the file into memory by the
time you're done.

SAX will not build any memory structures other than the ones you
explicitly create (SAX is commonly used to build DOM trees). With SAX,
you can just watch for any tags of interest (and perhaps some
surrounding tags to provide context) and extract the desired data, all
very efficiently.

It took me a bit to get the hang of SAX, but once I did, I haven't
looked back. Event-driven parsing is a brilliant solution to this
problem domain.
Or should I just use XPath somehow.


XPath usually requires a DOM tree on which it can operate. The Python
XPath implementation (in PyXML) requires DOM objects. I see this as
being a highly inefficient solution.

Another potential solution, if the data file has extraneous
information: run the source file through an XSLT transform that strips
it down to only the data you need, and then apply SAX to parse it.

- Michael
Aug 29 '05 #3
"jog" wrote:
I want to get text out of some nodes of a huge xml file (1.5 GB). The
architecture of the xml file is something like this [snip]
I want to combine the text out of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so for each page one index)


here's one way to do it:

try:
    import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET

for event, elem in ET.iterparse(file):
    if elem.tag == "page":
        title = elem.findtext("title")
        revision = elem.findtext("revision/text")
        print title, revision
        elem.clear() # won't need this any more

references:

http://effbot.org/zone/element-index.htm
http://effbot.org/zone/celementtree.htm (for best performance)
http://effbot.org/zone/element-iterparse.htm

</F>

Aug 29 '05 #4
[jog]
I want to get text out of some nodes of a huge xml file (1.5 GB). The
architecture of the xml file is something like this
[snip]
I want to combine the text out of page:title and page:revision:text
for every single page element. One by one I want to index these
combined texts (so for each page one index)
What is the most efficient API for that?:
SAX (I don't think so)
SAX is perfect for the job. See code below.
DOM
If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory
to load it into a DOM.
or pulldom?
Not sure how pulldom does its pull "optimizations", but I think it
still builds an in-memory object structure for your document, which will
still take buckets of memory for such a big document. I could be wrong
though.
Or should I just use Xpath somehow.


Using XPath normally requires building a (D)OM, which will consume
*lots* of memory for your document, regardless of how efficient the OM is.

Best to use SAX and XPath-style expressions.

You can get a limited subset of xpath using a SAX handler and a stack.
Your problem is particularly well suited to that kind of solution. Code
that does a basic job of this for your specific problem is given below.

Note that there are a number of caveats with this code

1. Character data handlers may get called multiple times for a single xml
text() node. This is permitted in the SAX spec, and is basically a
consequence of using buffered IO to read the contents of the xml file,
e.g. the start of a text node is at the end of the last buffer read, and
the rest of the text node is at the beginning of the next buffer.

2. This code assumes that your "revision/text" nodes do not contain
mixed content, i.e. a mixture of elements and text, e.g.
"<revision><text>This is a piece of <b>revision</b>
text</text></revision>". The code below will fail to extract all
character data in that case.
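A common way around the first caveat is to accumulate the pieces delivered to characters() and only join them when the element closes. A small illustrative sketch (the class and variable names are my own, not from the thread):

```python
import xml.sax

class BufferingHandler(xml.sax.ContentHandler):
    """Accumulate possibly-split character data until the element ends."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.parts = []
        self.texts = []   # completed text() nodes, for demonstration

    def startElement(self, name, attrs):
        self.parts = []   # reset the buffer for this element

    def characters(self, data):
        # The parser is free to call this several times per text node,
        # so never treat a single call as the whole node.
        self.parts.append(data)

    def endElement(self, name):
        self.texts.append("".join(self.parts))  # one complete text node

h = BufferingHandler()
xml.sax.parseString(b"<text>one long run of character data</text>", h)
print(h.texts)  # ['one long run of character data']
```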

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax

class Page:

    def append(self, field_name, new_value):
        old_value = ""
        if hasattr(self, field_name):
            old_value = getattr(self, field_name)
        setattr(self, field_name, "%s%s" % (old_value, new_value))

class page_matcher(xml.sax.handler.ContentHandler):

    def __init__(self, page_handler=None):
        xml.sax.handler.ContentHandler.__init__(self)
        self.page_handler = page_handler
        self.stack = []

    def check_stack(self):
        stack_expr = "/" + "/".join(self.stack)
        if '/parent/page' == stack_expr:
            self.page = Page()
        elif '/parent/page/title/text()' == stack_expr:
            self.page.append('title', self.chardata)
        elif '/parent/page/revision/id/text()' == stack_expr:
            self.page.append('revision_id', self.chardata)
        elif '/parent/page/revision/text/text()' == stack_expr:
            self.page.append('revision_text', self.chardata)
        else:
            pass

    def startElement(self, elemname, attrs):
        self.stack.append(elemname)
        self.check_stack()

    def endElement(self, elemname):
        if elemname == 'page' and self.page_handler:
            self.page_handler(self.page)
            self.page = None
        self.stack.pop()

    def characters(self, data):
        self.chardata = data
        self.stack.append('text()')
        self.check_stack()
        self.stack.pop()

testdoc = """
<parent>
<page>
<title>Page number 1</title>
<id>p1</id>
<revision>
<id>r1</id>
<text>revision one</text>
</revision>
</page>
<page>
<title>Page number 2</title>
<id>p2</id>
<revision>
<id>r2</id>
<text>revision two</text>
</revision>
</page>
</parent>
"""

def page_handler(new_page):
    print "New page"
    print "title\t\t%s" % new_page.title
    print "revision_id\t%s" % new_page.revision_id
    print "revision_text\t%s" % new_page.revision_text
    print

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    parser.setContentHandler(page_matcher(page_handler))
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

HTH,

--
alan kennedy
------------------------------------------------------
email alan: http://xhaus.com/contact/alan
Aug 29 '05 #5
Alan Kennedy wrote:
SAX is perfect for the job. See code below.


depends on your definition of perfect...

using a 20 MB version of jog's sample, and having replaced
the print statements with local variable assignments, I get the
following timings:

5 lines of cElementTree code: 7.2 seconds
60+ lines of xml.sax code: 63 seconds

(Python 2.4.1, Windows XP, Pentium 3 GHz)

</F>

Aug 29 '05 #6
[Alan Kennedy]
SAX is perfect for the job. See code below.

[Fredrik Lundh] depends on your definition of perfect...
Obviously, perfect is in the eye of the beholder ;-)

[Fredrik Lundh] using a 20 MB version of jog's sample, and having replaced
the print statements with local variable assignments, I get the
following timings:

5 lines of cElementTree code: 7.2 seconds
60+ lines of xml.sax code: 63 seconds

(Python 2.4.1, Windows XP, Pentium 3 GHz)


Impressive!

At first, I thought your code sample was building a tree for the entire
document, so I checked the API docs. It appeared to me that an event
processing model *couldn't* obtain the text for the node when notified
of the node: the text node is still in the future.

That's when I understood the nature of iterparse, which must generate an
event *after* the node is complete, with its subdocument reified. That's
also when I understood the meaning of the "elem.clear()" call at the
end. Only the required section of the tree is modelled in memory at any
given time. Nice.
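For reference, the same pattern lives on in today's standard library as xml.etree.ElementTree.iterparse. A sketch of the clear-as-you-go idiom (the root.clear() refinement, which also drops already-finished siblings from the root, is my own addition rather than something from Fredrik's snippet):

```python
import io
import xml.etree.ElementTree as ET

doc = b"""<parent>
<page><title>bla</title><revision><text>blablabla</text></revision></page>
<page><title>bla2</title><revision><text>more</text></revision></page>
</parent>"""

results = []
# Ask for start events too, so we can grab the root element up front.
context = ET.iterparse(io.BytesIO(doc), events=("start", "end"))
event, root = next(context)            # first event: start of <parent>
for event, elem in context:
    if event == "end" and elem.tag == "page":
        # The <page> subtree is fully built at this point.
        results.append((elem.findtext("title"),
                        elem.findtext("revision/text")))
        root.clear()                   # drop finished pages from memory

print(results)  # [('bla', 'blablabla'), ('bla2', 'more')]
```

With the clearing in place, memory stays bounded by the size of one <page> subtree instead of growing with the whole file.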

There are some minor inefficiencies in my pure python sax code, e.g.
building the stack expression for every evaluation, but I left them in
for didactic reasons. But even if every possible speed optimisation was
added to my python code, I doubt it would be able to match your code.

I'm guessing that a lot of the reason why cElementTree performs best is
because the model-building is primarily implemented in C: both of our
solutions run python code for every node in the tree, i.e. are O(N), but
yours also avoids the overhead of having function-calls/stack-frames for
every single node event, by processing all events inside a single function.

If the SAX algorithm were implemented in C (or Java, for that matter), I
wonder if it might give comparable performance to the cElementTree code,
primarily because the data structures it is building are simpler
compared to the tree-subsections being reified and discarded by
cElementTree. But that's not of relevance, because we're looking for
python solutions. (Aside: I can't wait to run my solution on a
fully-optimising PyPy :-)

That's another nice thing I didn't know (c)ElementTree could do.

enlightened-ly'yrs,

--
alan kennedy
------------------------------------------------------
email alan: http://xhaus.com/contact/alan
Aug 29 '05 #7
jog <jo@johannageiss.de> wrote:
Hi,
I want to get text out of some nodes of a huge xml file (1.5 GB). The
architecture of the xml file is something like this
<parent>
<page>
<title>bla</title>
<id></id>
<revision>
<id></id>
<text>blablabla</text>
</revision>
</page>
<page>
</page>
....
</parent>
I want to combine the text out of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so for each page one index)
What is the most efficient API for that?: SAX (I don't think so) DOM
or pulldom?
Or should I just use XPath somehow.
I don't want to do anything else with this xml file afterwards.
I hope someone will understand me.....
Thank you very much
Jog


I would use the Expat interface from Python, Awk, or even Bash shell. I'm
most familiar with the shell interface to Expat, which would go something
like:

start() # Usage: start tag att=value ...
{
    case $1 in
    page) unset title text ;;
    esac
}
data() # Usage: data text
{
    case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
    title.page.*) title=$1 ;;
    text.revision.page) text=$1 ;;
    esac
}
end() # Usage: end tag
{
    case $1 in
    page) echo "title=$title text=$text" ;;
    esac
}
expat -s start -d data -e end < file.xml
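For comparison, the same three-handler shape is available directly in Python through the standard xml.parsers.expat binding. A rough sketch (the path-suffix matching and all names are my own illustrative choices, not from William's script):

```python
import xml.parsers.expat

stack, fields, pages = [], {}, []

def start(name, attrs):
    stack.append(name)
    if name == "page":
        fields.clear()          # reset per-page state

def data(text):
    # Match on the current element path, analogous to the shell tag stack.
    path = "/".join(stack)
    if path.endswith("page/title"):
        fields["title"] = fields.get("title", "") + text
    elif path.endswith("revision/text"):
        fields["text"] = fields.get("text", "") + text

def end(name):
    stack.pop()
    if name == "page":
        pages.append((fields.get("title", ""), fields.get("text", "")))

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start
p.CharacterDataHandler = data
p.EndElementHandler = end
p.Parse(b"<parent><page><title>bla</title><revision>"
        b"<text>blablabla</text></revision></page></parent>", True)
print(pages)  # [('bla', 'blablabla')]
```

Note that CharacterDataHandler, like SAX's characters(), may fire several times per text node, which is why the sketch concatenates rather than assigns.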

--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Aug 29 '05 #8
jog
Thanks a lot for all your replies, that was really great and helpful.
Now I have some problems with the indexing: it takes too much memory and
takes too long. I have to look into it.....

Sep 6 '05 #9
