
xml.parsers.expat loading xml into a dict and whitespace

Hey everyone, this may be a stupid question, but I noticed the
following and as I'm pretty new to using xml and python, I was
wondering if I could get an explanation.

Let's say I write a simple xml parser, for an xml file that just loads
the content of each tag into a dict (the xml file doesn't have
multiple hierarchies in it, it's flat other than the parent node)

so we have
<parent>
<option1>foo</option1>
<option2>bar</option2>
. . .
</parent>

(I'm using xml.parsers.expat.)
The start tag handler sets a flag that says the parser is inside the
parent, and records the name of the tag currently being processed.
The character data handler then sets a dictionary value like so:

dictName[curTag] = data

After I'm done processing the file, I print out the dict, and the first value is
<a few bits of whitespace> : <a whole bunch of whitespace>

There are comments in the xml file - is this what is causing this?
There are also blank lines. . .but I don't see how a blank line would
be interpreted as a tag. Comments though, I could see that happening.

Actually, I just did a test on an xml file that had no comments or
whitespace and got the same behaviour.

If I feed it the following xml file:

<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"

wtf.

For reference, here's the handler functions:

def handleCharacterData(self, data):
    if self.inOptions and self.curTag != "options":
        self.options[self.curTag] = data

def handleStartElement(self, name, attributes):
    if name == "options":
        self.inOptions = True
    if self.inOptions:
        self.curTag = name

def handleEndElement(self, name):
    if name == "options":
        self.inOptions = False
    self.curTag = ""

Sorry if the whitespace in the code got mangled (fingers crossed...)
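
A note on the expat behaviour, since the question itself went unanswered in the replies: the output above is consistent with expat passing the newlines between elements to the character data handler. At that point curTag has already been reset to "", so the whitespace ends up stored under the empty key. What follows is only a minimal sketch of one way around it, written for Python 3 and not taken from the original code; the names load_options, state and SAMPLE are made up for illustration. It buffers text only while a child element is open and stores it when that element closes, which also covers the case where expat delivers one text node in several pieces.

```python
import xml.parsers.expat

# Hypothetical stand-in for the flat XML file from the post.
SAMPLE = "<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>\n"


def load_options(xml_text, root_tag="options"):
    """Return {tag: text} for the children of a flat XML document."""
    options = {}
    state = {"tag": None, "chunks": []}

    def start_element(name, attributes):
        # Start collecting text for every element except the root.
        if name != root_tag:
            state["tag"] = name
            state["chunks"] = []

    def character_data(data):
        # expat also reports whitespace between elements, and may split
        # one text node across several calls, so buffer only while a
        # child element is open.
        if state["tag"] is not None:
            state["chunks"].append(data)

    def end_element(name):
        # Store the collected text once the child element closes.
        if name == state["tag"]:
            options[name] = "".join(state["chunks"])
            state["tag"] = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = character_data
    parser.EndElementHandler = end_element
    parser.Parse(xml_text, True)
    return options


print(load_options(SAMPLE))
```

Because nothing is stored while no child element is open, the inter-element newlines never reach the dict, and the empty key disappears.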
May 23 '07 #1
6 Replies


kaens wrote:
> Let's say I write a simple xml parser, for an xml file that just loads
> the content of each tag into a dict (the xml file doesn't have
> multiple hierarchies in it, it's flat other than the parent node)
> [snip]
> <options>
> <one>hey</one>
> <two>bee</two>
> <three>eff</three>
> </options>
>
> it prints out:
> " :
>
> three : eff
> two : bee
> one : hey"

I don't have a good answer for your expat code, but if you're not
married to that, I strongly suggest you look into ElementTree[1]::

    >>> xml = '''\
    ... <options>
    ... <one>hey</one>
    ... <two>bee</two>
    ... <three>eff</three>
    ... </options>
    ... '''
    >>> import xml.etree.cElementTree as etree
    >>> tree = etree.fromstring(xml)
    >>> d = {}
    >>> for child in tree:
    ...     d[child.tag] = child.text
    ...
    >>> d
    {'three': 'eff', 'two': 'bee', 'one': 'hey'}

[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions

STeVe
May 23 '07 #2

> [1] ElementTree is in the 2.5 standard library, but if you're stuck with
> an earlier python, just Google for it -- there are standalone versions

I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
May 23 '07 #3

Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.
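
For anyone reading this on a current interpreter: the loop above targets Python 2.5. In Python 3, print is a function, and getiterator() was removed in Python 3.9. A rough Python 3 equivalent, offered only as a sketch, might look like the following. The names load_options and SAMPLE are invented here, and the in-memory string merely stands in for options.xml so the example runs on its own.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for options.xml.
SAMPLE = """\
<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>
"""


def load_options(source):
    """Map each direct child's tag to its text for a flat XML document."""
    root = ET.parse(source).getroot()
    # Iterating over an Element yields only its direct children, so the
    # root never appears in the loop and needs no special-casing.
    return {child.tag: child.text for child in root}


options = load_options(io.StringIO(SAMPLE))
for key, value in options.items():
    print(key, ":", value)
```

ET.parse() accepts either a filename or a file object, so load_options("options.xml") should work the same way for a file on disk.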

On 5/23/07, kaens <ap***************@gmail.com> wrote:
> > [1] ElementTree is in the 2.5 standard library, but if you're stuck with
> > an earlier python, just Google for it -- there are standalone versions
>
> I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
May 23 '07 #4

kaens wrote:
> Now the code looks like this:
> [snip ElementTree code]
>
> freaking easy. Compare with making a generic xml parser class, and
> inheriting from it for doing different things with different xml
> files. This does exactly the right thing. I'm sure it's not perfect
> for all cases, and I'm sure there will be times when I want something
> closer to expat, but this is PERFECT for what I need to do right now.
>
> That settles it, I'm addicted to python now. I swear I had a little
> bit of a nerdgasm. This is orders of magnitude smaller than what I had
> before, way easier to read and way easier to maintain.
>
> Thanks again for the point in the right direction, Steve.

You're welcome. In return, you've helped me to augment my vocabulary
with an important new word "nerdgasm". ;-)

STeVe
May 23 '07 #5

kaens wrote:
> Now the code looks like this:
>
> import xml.etree.ElementTree as etree
>
> optionsXML = etree.parse("options.xml")
> options = {}
>
> for child in optionsXML.getiterator():
>     if child.tag != optionsXML.getroot().tag:
>         options[child.tag] = child.text
>
> for key, value in options.items():
>     print key, ":", value

Three things to add:

Importing cElementTree instead of ElementTree should speed this up pretty
heavily, but:

Consider using iterparse():

http://effbot.org/zone/element-iterparse.htm

*untested*:

from xml.etree import cElementTree as etree

iterevents = etree.iterparse("options.xml")
options = {}

for event, child in iterevents:
    if child.tag != "parent":
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

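The iterparse() loop above hard-codes the root tag as "parent", while the test file earlier in the thread uses <options>. One way to avoid naming the root at all is to ask for "start" events too and remember the first element seen. This is only a sketch for Python 3; the names load_options_iter and SAMPLE are made up for illustration, and the in-memory bytes stand in for options.xml.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for options.xml.
SAMPLE = "<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>\n"


def load_options_iter(source):
    """Collect {tag: text} for every element except the root."""
    options = {}
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # The first "start" event always belongs to the root element.
            if root is None:
                root = elem
        elif elem is not root:
            # On "end" the element's text has been read completely.
            options[elem.tag] = elem.text
    return options


print(load_options_iter(io.BytesIO(SAMPLE.encode("utf-8"))))
```

iterparse() takes a filename as well as a file object, so passing "options.xml" directly should behave the same way.
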
Note that this also works with lxml.etree. But using lxml.objectify is maybe
actually what you want:

http://codespeak.net/lxml/dev/objectify.html

*untested*:

from lxml import etree, objectify

# setup
parser = etree.XMLParser(remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
# called setElementClassLookup() in older lxml releases
parser.set_element_class_lookup(lookup)

# parse; etree.parse() returns a tree, so fetch the root element
parent = etree.parse("options.xml", parser).getroot()

# get to work
option1 = parent.option1
...

# or, if you prefer dictionaries:
options = vars(parent)
for key, value in options.items():
    print key, ":", value

Have fun,

Stefan
May 23 '07 #6

