xml.parsers.expat loading xml into a dict and whitespace

Hey everyone, this may be a stupid question, but I noticed the
following and as I'm pretty new to using xml and python, I was
wondering if I could get an explanation.

Let's say I write a simple XML parser that just loads the content of
each tag in an XML file into a dict (the file doesn't have multiple
levels of hierarchy in it; it's flat other than the parent node)

so we have
<parent>
<option1>foo</option1>
<option2>bar</option2>
. . .
</parent>

(I'm using xml.parsers.expat)
the parser sets a flag that says it's in the parent, and sets the
value of the current tag it's processing in the start tag handler.
The character data handler sets a dictionary value like so:

dictName[curTag] = data

After I'm done processing the file, I print out the dict, and the first entry is
<a few bits of whitespace> : <a whole bunch of whitespace>

There are comments in the xml file - is this what is causing this?
There are also blank lines. . .but I don't see how a blank line would
be interpreted as a tag. Comments though, I could see that happening.

Actually, I just did a test on an xml file that had no comments or
whitespace and got the same behaviour.

If I feed it the following xml file:

<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"

wtf.

For reference, here's the handler functions:

def handleCharacterData(self, data):
    if self.inOptions and self.curTag != "options":
        self.options[self.curTag] = data

def handleStartElement(self, name, attributes):
    if name == "options":
        self.inOptions = True
    if self.inOptions:
        self.curTag = name

def handleEndElement(self, name):
    if name == "options":
        self.inOptions = False
    self.curTag = ""

Sorry if the whitespace in the code got mangled (fingers crossed...)
May 23 '07 #1
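
For anyone arriving with the same expat symptom, a likely explanation: expat also reports the newlines between elements as character data. Assuming `self.curTag = ""` runs on every end tag (the printed output suggests it does), that whitespace gets stored under the empty key. Expat may also deliver one element's text in several pieces. Below is a minimal sketch, not the poster's code, that buffers text per element and ignores text outside any child element. The class name `OptionsParser` and its attribute names are made up for illustration, and it assumes a flat file whose root is `options`.

```python
import xml.parsers.expat


class OptionsParser(object):
    """Hypothetical helper: collects the text of each child of <options>."""

    def __init__(self):
        self.options = {}
        self.cur_tag = None   # child element currently open, if any
        self.chunks = []      # text pieces seen so far for that element

    def start(self, name, attributes):
        if name != "options":
            self.cur_tag = name
            self.chunks = []

    def char_data(self, data):
        # Text that arrives while no child element is open is the
        # whitespace between tags; skip it rather than store it under
        # an empty key.
        if self.cur_tag is not None:
            self.chunks.append(data)

    def end(self, name):
        if name == self.cur_tag:
            # Join the pieces, since expat may split one text run.
            self.options[name] = "".join(self.chunks)
            self.cur_tag = None

    def parse(self, text):
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.start
        parser.CharacterDataHandler = self.char_data
        parser.EndElementHandler = self.end
        parser.Parse(text, True)
        return self.options


sample = "<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>"
options = OptionsParser().parse(sample)
```

With the sample above, `options` should hold only the three real tags and no empty-string key.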
kaens wrote:
Let's say I write a simple xml parser, for an xml file that just loads
the content of each tag into a dict (the xml file doesn't have
multiple hierarchies in it, it's flat other than the parent node)
[snip]
<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"
I don't have a good answer for your expat code, but if you're not
married to that, I strongly suggest you look into ElementTree[1]::
>>> xml = '''\
... <options>
... <one>hey</one>
... <two>bee</two>
... <three>eff</three>
... </options>
... '''
>>> import xml.etree.cElementTree as etree
>>> tree = etree.fromstring(xml)
>>> d = {}
>>> for child in tree:
...     d[child.tag] = child.text
...
>>> d
{'three': 'eff', 'two': 'bee', 'one': 'hey'}
[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions

STeVe
May 23 '07 #2
[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions
I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
May 23 '07 #3
Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.

On 5/23/07, kaens <ap***************@gmail.com> wrote:
[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions

I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
May 23 '07 #4
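
A note for readers on current Python 3: the code in this thread is Python 2.5 era. As far as I know, `getiterator()` and the `cElementTree` module were dropped in Python 3.9, and `print` is now a function. A rough Python 3 sketch of the same idea follows. It parses an inline string (a stand-in for options.xml, assumed to match the sample file from the first post) so that it runs on its own.

```python
import xml.etree.ElementTree as etree

# Inline stand-in for options.xml so the sketch is self-contained.
xml_text = """\
<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>
"""

root = etree.fromstring(xml_text)

# Iterating over an element yields only its direct children, so the
# root never shows up and no tag comparison is needed for a flat file.
options = {child.tag: child.text for child in root}

for key, value in options.items():
    print(key, ":", value)
```

For a file on disk, `etree.parse("options.xml").getroot()` would take the place of `etree.fromstring(xml_text)`.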
kaens wrote:
Now the code looks like this:
[snip ElementTree code]
freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.
You're welcome. In return, you've helped me to augment my vocabulary
with an important new word "nerdgasm". ;-)

STeVe
May 23 '07 #5
kaens wrote:
Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value
Three things to add:

Importing cElementTree instead of ElementTree should speed this up pretty
heavily, but:

Consider using iterparse():

http://effbot.org/zone/element-iterparse.htm

*untested*:

from xml.etree import cElementTree as etree

iterevents = etree.iterparse("options.xml")
options = {}

for event, child in iterevents:
    if child.tag != "parent":
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value
Note that this also works with lxml.etree. But lxml.objectify may
actually be what you want:

http://codespeak.net/lxml/dev/objectify.html

*untested*:

from lxml import etree, objectify

# setup
parser = etree.XMLParser(remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
# older lxml releases spell this parser.setElementClassLookup(lookup)
parser.set_element_class_lookup(lookup)

# parse; etree.parse() returns a tree, so fetch the root element
parent = etree.parse("options.xml", parser).getroot()

# get to work
option1 = parent.option1
...

# or, if you prefer dictionaries:
options = vars(parent)
for key, value in options.items():
    print key, ":", value
Have fun,

Stefan
May 23 '07 #6
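
To make the iterparse() suggestion concrete for Python 3 readers, here is a small sketch. It is an illustration under assumptions, not Stefan's code: it reads from an in-memory `io.BytesIO` stand-in for options.xml, uses the `<parent>` sample from the first post, and picks up the root tag from the first "start" event instead of hard-coding it.

```python
import io
import xml.etree.ElementTree as etree

# In-memory stand-in for options.xml (assumed flat, as in the first post).
source = io.BytesIO(
    b"<parent><option1>foo</option1><option2>bar</option2></parent>"
)

options = {}
root_tag = None

for event, elem in etree.iterparse(source, events=("start", "end")):
    if event == "start":
        # The very first start event belongs to the root element.
        if root_tag is None:
            root_tag = elem.tag
    elif elem.tag != root_tag:
        # On an "end" event the element's text is complete.
        options[elem.tag] = elem.text
        elem.clear()  # drop content that is no longer needed

for key, value in options.items():
    print(key, ":", value)
```

For a file this small, incremental parsing is unlikely to matter much; it tends to pay off on large documents, where each element can be processed and cleared as soon as it has been read.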
