xml.parsers.expat loading xml into a dict and whitespace

Hey everyone, this may be a stupid question, but I noticed the
following and as I'm pretty new to using xml and python, I was
wondering if I could get an explanation.

Let's say I write a simple xml parser, for an xml file that just loads
the content of each tag into a dict (the xml file doesn't have
multiple hierarchies in it, it's flat other than the parent node)

so we have
<parent>
<option1>foo</option1>
<option2>bar</option2>
. . .
</parent>

(I'm using xml.parsers.expat)
the parser sets a flag that says it's in the parent, and sets the
value of the current tag it's processing in the start tag handler.
The character data handler sets a dictionary value like so:

dictName[curTag] = data

After I'm done processing the file, I print out the dict, and the first entry is
<a few bits of whitespace> : <a whole bunch of whitespace>

There are comments in the xml file - is this what is causing this?
There are also blank lines. . .but I don't see how a blank line would
be interpreted as a tag. Comments though, I could see that happening.

Actually, I just did a test on an xml file that had no comments or
whitespace and got the same behaviour.

If I feed it the following xml file:

<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"

wtf.

For reference, here's the handler functions:

def handleCharacterData(self, data):
    if self.inOptions and self.curTag != "options":
        self.options[self.curTag] = data

def handleStartElement(self, name, attributes):
    if name == "options":
        self.inOptions = True
    if self.inOptions:
        self.curTag = name

def handleEndElement(self, name):
    if name == "options":
        self.inOptions = False
    self.curTag = ""

Sorry if the whitespace in the code got mangled (fingers crossed...)
May 23 '07 #1
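
A likely explanation for the stray entry, for anyone who wants to stay with expat: the character data handler fires for every run of text, including the newlines between elements, and the end-element handler above resets curTag to "" after each tag. The newline that follows </one> therefore gets stored under the empty key. expat may also deliver one run of text in several calls, so assigning in the character handler can truncate values. Below is a minimal Python 3 sketch of one way around both issues, not the poster's code; load_options, root_name and the state dict are names made up for illustration.

```python
import xml.parsers.expat


def load_options(xml_text, root_name="options"):
    """Collect the text of each child of the root element into a dict."""
    options = {}
    # Hypothetical bookkeeping: the currently open child tag and its text pieces.
    state = {"tag": None, "chunks": []}

    def start(name, attrs):
        if name != root_name:
            state["tag"] = name
            state["chunks"] = []

    def chars(data):
        # Called for every run of text, including whitespace between
        # elements; keep it only while a child element is open.
        if state["tag"] is not None:
            state["chunks"].append(data)

    def end(name):
        # Store the value once the element is complete, then forget the tag
        # so that the following newline is ignored.
        if name == state["tag"]:
            options[name] = "".join(state["chunks"])
            state["tag"] = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return options


sample = "<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>\n"
print(load_options(sample))  # {'one': 'hey', 'two': 'bee', 'three': 'eff'}
```

The value is written in the end handler rather than the character handler, so whitespace seen outside a child element never reaches the dict.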
kaens wrote:
> Let's say I write a simple xml parser, for an xml file that just loads
> the content of each tag into a dict (the xml file doesn't have
> multiple hierarchies in it, it's flat other than the parent node)
> [snip]
> <options>
> <one>hey</one>
> <two>bee</two>
> <three>eff</three>
> </options>
>
> it prints out:
> " :
>
> three : eff
> two : bee
> one : hey"

I don't have a good answer for your expat code, but if you're not
married to that, I strongly suggest you look into ElementTree[1]::

>>> xml = '''\
... <options>
... <one>hey</one>
... <two>bee</two>
... <three>eff</three>
... </options>
... '''
>>> import xml.etree.cElementTree as etree
>>> tree = etree.fromstring(xml)
>>> d = {}
>>> for child in tree:
...     d[child.tag] = child.text
...
>>> d
{'three': 'eff', 'two': 'bee', 'one': 'hey'}

[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions

STeVe
May 23 '07 #2
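
For anyone reading this on a current Python 3: the cElementTree module was removed in Python 3.9, and plain xml.etree.ElementTree picks up the C accelerator on its own when it is available. A minimal sketch of the same idea under that assumption, with xml_text and options as illustrative names:

```python
import xml.etree.ElementTree as ET

xml_text = """\
<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>
"""

# fromstring() returns the root element; iterating it yields its children.
root = ET.fromstring(xml_text)
options = {child.tag: child.text for child in root}

print(options)  # {'one': 'hey', 'two': 'bee', 'three': 'eff'}
```

The whitespace between the child elements ends up in each child's tail attribute rather than in text, which is why it never shows up in the dict.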
> [1] ElementTree is in the 2.5 standard library, but if you're stuck with
> an earlier python, just Google for it -- there are standalone versions

I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
May 23 '07 #3
Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.

On 5/23/07, kaens <ap***************@gmail.com> wrote:
> > [1] ElementTree is in the 2.5 standard library, but if you're stuck with
> > an earlier python, just Google for it -- there are standalone versions
>
> I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
May 23 '07 #4
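
A note for later Python versions: getiterator() was removed in Python 3.9 in favour of iter(), and for a flat file like this one the root-tag comparison can be avoided by looping over the root element directly. The following is a hedged Python 3 sketch; the in-memory fake_file stands in for options.xml and is an assumption made so the example runs without a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for options.xml (hypothetical contents matching the test file above).
fake_file = io.BytesIO(
    b"<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>\n"
)

tree = ET.parse(fake_file)
root = tree.getroot()

# Iterating an element yields only its direct children, so the root
# never shows up and no tag comparison is needed.
options = {}
for child in root:
    options[child.tag] = child.text

# root.iter() is the replacement for getiterator(); it walks the
# whole tree in document order, root included.
all_tags = [elem.tag for elem in root.iter()]

for key, value in options.items():
    print(key, ":", value)
```

If the file ever grows nested elements, the direct loop sees only the first level, while iter() visits every descendant.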
kaens wrote:
> Now the code looks like this:
> [snip ElementTree code]
>
> freaking easy. Compare with making a generic xml parser class, and
> inheriting from it for doing different things with different xml
> files. This does exactly the right thing. I'm sure it's not perfect
> for all cases, and I'm sure there will be times when I want something
> closer to expat, but this is PERFECT for what I need to do right now.
>
> That settles it, I'm addicted to python now. I swear I had a little
> bit of a nerdgasm. This is orders of magnitude smaller than what I had
> before, way easier to read and way easier to maintain.
>
> Thanks again for the point in the right direction, Steve.

You're welcome. In return, you've helped me to augment my vocabulary
with an important new word "nerdgasm". ;-)

STeVe
May 23 '07 #5
kaens wrote:
> Now the code looks like this:
>
> import xml.etree.ElementTree as etree
>
> optionsXML = etree.parse("options.xml")
> options = {}
>
> for child in optionsXML.getiterator():
>     if child.tag != optionsXML.getroot().tag:
>         options[child.tag] = child.text
>
> for key, value in options.items():
>     print key, ":", value

Three things to add:

Importing cElementTree instead of ElementTree should speed this up pretty
heavily, but:

Consider using iterparse():

http://effbot.org/zone/element-iterparse.htm

*untested*:

from xml.etree import cElementTree as etree

iterevents = etree.iterparse("options.xml")
options = {}

for event, child in iterevents:
    if child.tag != "parent":
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

Note that this also works with lxml.etree. But using lxml.objectify is maybe
actually what you want:

http://codespeak.net/lxml/dev/objectify.html

*untested*:

from lxml import etree, objectify

# setup
parser = etree.XMLParser(remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
parser.setElementClassLookup(lookup)

# parse (etree.parse returns a tree object, so fetch its root element)
parent = etree.parse("options.xml", parser).getroot()

# get to work
option1 = parent.option1
...

# or, if you prefer dictionaries:
options = vars(parent)
for key, value in options.items():
    print key, ":", value

Have fun,

Stefan
May 23 '07 #6
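
The iterparse() suggestion above can be tried on Python 3 with the standard library alone. This is a rough sketch rather than a tested port of Stefan's snippet: it assumes the root element is named "options" as in the test file earlier in the thread, and it reads from an in-memory xml_bytes value instead of options.xml.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of options.xml.
xml_bytes = b"<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>\n"

options = {}
# With the "end" event, each element is handed over once its closing
# tag has been read, so its text is complete at that point.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag != "options":
        options[elem.tag] = elem.text

for key, value in options.items():
    print(key, ":", value)
```

For a small options file the difference from parse() is negligible; iterparse() mainly pays off when the document is too large to hold comfortably in memory.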
