Bytes IT Community

iterate over a series of nodes in an XML file

Hi, I have an XML file which contains entries of the form:

<idlist>
<myID>1</myID>
<myID>2</myID>
.....
<myID>10000</myID>
</idlist>

Currently, I have written a SAX-based handler that reads in all the
<myID></myID> entries and returns a list of their contents. However,
this is not scalable, and for my purposes it would be better if I could
iterate over the list of <myID> nodes. Something like:

for myid in getMyIDList(document):
    print(myid)

I realize that I can do this with generators, but I can't see how I can
incorporate generators into my handler class (which is a subclass of
xml.sax.ContentHandler).

Any pointers would be appreciated

Thanks,
Rajarshi

Jul 5 '06 #1
4 Replies


Use ElementTree, or one of the other packages that implement its very
Pythonic interface: lxml or cElementTree.

Otherwise, you don't have much chance of using SAX to create a generator,
other than reading the whole document into memory (which rather defeats the
purpose of SAX in the first place) or running the parser in a separate
thread that feeds an iterable through a queue.
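
For what it's worth, the thread-and-queue route could look roughly like the sketch below. It is only an illustration under assumptions, not code from this thread: the names iter_sax_text, _QueueHandler and _DONE are made up here, and it relies on Python 3's queue and threading modules. The SAX parser runs in a worker thread and pushes each myID text onto a bounded queue, while a generator on the consumer side pulls from it.

```python
import queue
import threading
import xml.sax

_DONE = object()  # sentinel marking the end of the document (hypothetical name)


class _QueueHandler(xml.sax.ContentHandler):
    """Pushes the text of every <tag> element onto a queue."""

    def __init__(self, tag, out):
        super().__init__()
        self.tag = tag
        self.out = out
        self.parts = None  # None while outside a matching element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.parts = []

    def characters(self, content):
        if self.parts is not None:
            self.parts.append(content)

    def endElement(self, name):
        if name == self.tag:
            self.out.put("".join(self.parts))
            self.parts = None


def iter_sax_text(source, tag="myID", maxsize=100):
    """Yield the text of each <tag> element while SAX parses in a worker thread."""
    out = queue.Queue(maxsize)  # bounded, so the parser cannot run far ahead

    def worker():
        try:
            xml.sax.parse(source, _QueueHandler(tag, out))
            out.put(_DONE)
        except Exception as exc:  # hand parse errors over to the consumer
            out.put(exc)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = out.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

It would be used as "for myid in iter_sax_text(filename): ...". One caveat with this design: if the consumer stops iterating early, the worker may stay blocked on a full queue. It is a daemon thread, so it will not keep the process alive, but real code would want a way to cancel it.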

Alternatively, there are parsers that implement a pull style of parsing
instead of the push style that SAX uses. But before you start with those,
try ElementTree.
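
As one illustration of the pull style, Python's standard library ships xml.dom.pulldom. A minimal sketch, assuming Python 3 and a hypothetical helper name iter_pull_text (neither comes from this thread): the caller asks for events one at a time, so wrapping the loop in a generator is straightforward.

```python
from xml.dom import pulldom


def iter_pull_text(source, tag="myID"):
    """Yield the text of each <tag> element, pulling parser events on demand."""
    events = pulldom.parse(source)  # accepts a filename or a file-like object
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)  # read ahead to the matching end tag
            yield "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
```

Only the elements passed to expandNode get their children attached, so the loop body decides what is worth building.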

Diez
Jul 5 '06 #2

You can try lxml 1.1.

http://cheeseshop.python.org/pypi/lxml/1.1alpha

Some documentation is here:
http://codespeak.net/svn/lxml/trunk/doc/api.txt

I haven't tested it, but you should be able to do this:

from lxml.etree import iterparse

last = None
for event, myid in iterparse(document_url, tag="myID"):
    print(myid.text)
    if last is not None:
        last.getparent().remove(last)
    last = myid

Internally, iterparse builds up a tree, so the last three lines are there to
remove the myid elements from the tree that were already handled. This saves a
lot of memory for large documents.
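
To see the effect of that cleanup without installing lxml, here is a small sketch that uses the standard library's xml.etree.ElementTree.iterparse instead. That substitution is an assumption: it is not the lxml call shown above and it has no tag= argument. The SAMPLE document and the children_left helper are invented for illustration. The helper reports how many children the root still holds after parsing, with and without removing the handled elements.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: 1000 <myID> entries inside one <idlist> root.
SAMPLE = (
    b"<idlist>"
    + b"".join(b"<myID>%d</myID>" % i for i in range(1, 1001))
    + b"</idlist>"
)


def children_left(cleanup, tag="myID"):
    """Parse SAMPLE and return how many child elements the root retains."""
    context = ET.iterparse(io.BytesIO(SAMPLE), events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if cleanup and event == "end" and elem.tag == tag:
            root.remove(elem)  # detach the element once it has been handled
    return len(root)


print("with cleanup:", children_left(True))
print("without cleanup:", children_left(False))
```

Without the cleanup every handled element stays attached to the root, which is where the memory goes on large documents.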

Stefan
Jul 5 '06 #3


Thanks to everybody for the pointers. ElementTree is what I ended up
using, and my code looks like this (based on the ElementTree tutorial code):

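import xml.etree.ElementTree as ET

def extractIds(filename):
    with open(filename, 'rb') as f:
        context = iter(ET.iterparse(f, events=('start', 'end')))
        # The first event is the start of the root element.
        event, root = next(context)

        for event, elem in context:
            if event == 'end' and elem.tag == 'Id':
                yield elem.text
                # Discard elements that have already been handled.
                root.clear()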

As a result I can do:

for id in extractIds(someFileName):
    pass  # do something with id

Jul 5 '06 #4

I see you've had success with ElementTree, but in case you are still
thinking about SAX, here is an approach that might interest you. The
idea is basically to turn your program inside-out by writing a
standalone function that processes one myID node. This function has
nothing to do with SAX or with parsing the XML tree. It becomes a
callback that you pass to your SAX handler, which calls it on each node.

import xml.sax

def myID_callback(data):
    """Process the text of one myID node - boil it, mash it, stick it
    in a list..."""
    print(data)

class MyHandler(xml.sax.ContentHandler):
    def __init__(self, myID_callback):
        xml.sax.ContentHandler.__init__(self)
        # A buffer to collect text data that may or may not be needed later.
        self.current_text_data = []
        self.myID_callback = myID_callback

    def characters(self, data):
        """Accumulate characters. startElement("myID") resets it."""
        self.current_text_data.append(data)

    def startElement(self, name, attributes):
        if name == 'myID':
            self.current_text_data = []

    def endElement(self, name):
        if name == 'myID':
            data = "".join(self.current_text_data)
            self.myID_callback(data)

filename = 'idlist.xml'
xml.sax.parse(filename, MyHandler(myID_callback))
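
Because the callback is just a callable, the same handler can feed a list, a counter, or anything else. A minimal sketch, assuming Python 3 and an in-memory SAMPLE document made up for the example; the handler is restated in shortened form so the block runs on its own, and a bound list.append serves as the callback.

```python
import xml.sax


class MyHandler(xml.sax.ContentHandler):
    """Calls `callback` with the text of each myID element."""

    def __init__(self, callback):
        super().__init__()
        self.callback = callback
        self.parts = []

    def startElement(self, name, attrs):
        if name == "myID":
            self.parts = []  # start a fresh buffer for this element

    def characters(self, content):
        self.parts.append(content)

    def endElement(self, name):
        if name == "myID":
            self.callback("".join(self.parts))


# Hypothetical sample data standing in for idlist.xml.
SAMPLE = b"<idlist><myID>1</myID><myID>2</myID><myID>3</myID></idlist>"

ids = []
xml.sax.parseString(SAMPLE, MyHandler(ids.append))  # list.append as the callback
print(ids)
```

The handler never needs to know what happens to the data, which keeps the parsing code and the processing code separate.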

Jul 5 '06 #5
