
xpathEval fails for large files

Hi,

I tried to extract some data with xpathEval. The path matches more than
100,000 elements.

doc = libxml2.parseFile("test.xml")
ctxt = doc.xpathNewContext()
result = ctxt.xpathEval('//src_ref/@editions')
doc.freeDoc()
ctxt.xpathFreeContext()

This gets stuck on the following line and results in high CPU usage:

result = ctxt.xpathEval('//src_ref/@editions')

Any suggestions on how to resolve this?

Is there any better alternative to handle large documents?

Kanch
Jul 22 '08 #1

Kanchana wrote:
> I tried to extract some data with xpathEval. The path matches more than
> 100,000 elements.
>
> doc = libxml2.parseFile("test.xml")
> ctxt = doc.xpathNewContext()
> result = ctxt.xpathEval('//src_ref/@editions')
> doc.freeDoc()
> ctxt.xpathFreeContext()
>
> This gets stuck on the following line and results in high CPU usage:
>
> result = ctxt.xpathEval('//src_ref/@editions')
>
> Any suggestions on how to resolve this?

what happens if you just search for "//src_ref"? what happens if you
use libxml's command line tools to do the same search?
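
as a rough sketch of the latter (note: the --xpath option only exists in
newer xmllint builds; older ones offer an interactive --shell mode
instead):

    xmllint --xpath '//src_ref/@editions' test.xml
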
> Is there any better alternative to handle large documents?
the raw libxml2 API is pretty hopeless; there's a much nicer binding
called lxml:

http://codespeak.net/lxml/
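
for comparison, an equivalent query in lxml would look roughly like this
(an untested sketch, assuming the same test.xml):

    # minimal sketch; xpath() on the parsed tree returns attribute
    # values as a list of strings
    from lxml import etree

    doc = etree.parse("test.xml")
    editions = doc.xpath('//src_ref/@editions')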

but that won't help if the problem is with libxml2 itself (in which
case you should probably check with an appropriate libxml2 forum).

there's also cElementTree (bundled with Python 2.5), but that has only
limited xpath support in the current version.

both lxml and other implementations of the ET API support incremental
tree parsing:

http://effbot.org/zone/element-iterparse.htm

which handles huge documents quite nicely, but requires you to write the
search logic in Python:

import xml.etree.cElementTree as ET

for event, elem in ET.iterparse("test.xml"):
    if elem.tag == "src_ref" and elem.get("editions"):
        ... process element ...
    elem.clear()  # release the element to keep memory use flat

</F>

Jul 22 '08 #2

On 22 Jul, 11:00, Kanchana <kanchana.senevirat...@gmail.com> wrote:
> I tried to extract some data with xpathEval. The path matches more than
> 100,000 elements.
>
> doc = libxml2.parseFile("test.xml")
> ctxt = doc.xpathNewContext()
> result = ctxt.xpathEval('//src_ref/@editions')
> doc.freeDoc()
> ctxt.xpathFreeContext()

Another note on libraries: if you want a pure Python library which
works on top of libxml2 and the bundled Python bindings, consider
libxml2dom [1].
> This gets stuck on the following line and results in high CPU usage:
>
> result = ctxt.xpathEval('//src_ref/@editions')
>
> Any suggestions on how to resolve this?

How big is your document, and how much memory is the process using
after you have parsed it? Sometimes you won't be able to handle very
large documents effectively by loading them completely into memory: if
the document needs more main memory than your system has available,
operations on it become quite inefficient.

> Is there any better alternative to handle large documents?

Fredrik pointed out a few. There's also xml.dom.pulldom and xml.sax in
the standard library - the latter attractive mostly if you have
previous experience with it - providing stream-based processing of
documents if you don't mind writing more code.
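
As an illustration, a rough, untested xml.sax sketch for collecting the
editions attributes might look like this (the handler name is made up):

    import xml.sax

    class SrcRefHandler(xml.sax.ContentHandler):
        # collect the editions attribute of every src_ref element
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.editions = []

        def startElement(self, name, attrs):
            value = attrs.get("editions")
            if name == "src_ref" and value is not None:
                self.editions.append(value)

    handler = SrcRefHandler()
    xml.sax.parse("test.xml", handler)
    # handler.editions now holds one value per matching element

The document is never held in memory as a whole, so this stays cheap
even for very large files.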

Paul

[1] http://www.python.org/pypi/libxml2dom
Jul 22 '08 #3

Kanchana wrote:
> Hi,
>
> I tried to extract some data with xpathEval. The path matches more than
> 100,000 elements.
>
> doc = libxml2.parseFile("test.xml")
> ctxt = doc.xpathNewContext()
> result = ctxt.xpathEval('//src_ref/@editions')
> doc.freeDoc()
> ctxt.xpathFreeContext()
>
> This gets stuck on the following line and results in high CPU usage:
>
> result = ctxt.xpathEval('//src_ref/@editions')
>
> Any suggestions on how to resolve this?
>
> Is there any better alternative to handle large documents?

One option might be an XML database. I'm familiar with Sedna (
http://modis.ispras.ru/sedna/ ).

In practice, you store the document in the database, and let the
database do the extracting for you. Sedna does XQuery, which is a very
nice way to get just what you want out of your document or collection of
documents.

Good:
    It's free (Apache 2.0 license).
    It's cross-platform (Windows x86, Linux x86, FreeBSD, Mac OS X).
    It has Python bindings (zif.sedna at the cheese shop, among others).
    It's pretty fast, particularly if you set up indexes.
    Document and document collection size are limited only by disk space.

Not so good:
    Sedna runs as a server; expect it to use on the order of 100MB of RAM
    per database. A database can contain many, many documents, so you
    probably only want one database anyway.

Disclosure: I'm the author of the zif.sedna package, and I'm
interpreting the fact that I have not received much feedback as "It
works pretty well" :)
- Jim Washington
Jul 22 '08 #4

Fredrik Lundh wrote:
> Kanchana wrote:
> > I tried to extract some data with xpathEval. The path matches more than
> > 100,000 elements.
> >
> > doc = libxml2.parseFile("test.xml")
> > ctxt = doc.xpathNewContext()
> > result = ctxt.xpathEval('//src_ref/@editions')
> > doc.freeDoc()
> > ctxt.xpathFreeContext()
> >
> > This gets stuck on the following line and results in high CPU usage:
> >
> > result = ctxt.xpathEval('//src_ref/@editions')
> >
> > Any suggestions on how to resolve this?
>
> what happens if you just search for "//src_ref"? what happens if you
> use libxml's command line tools to do the same search?
>
> > Is there any better alternative to handle large documents?
>
> the raw libxml2 API is pretty hopeless; there's a much nicer binding
> called lxml:
>
> http://codespeak.net/lxml/
>
> but that won't help if the problem is with libxml2 itself

It may still help a bit as lxml's setup of libxml2 is pretty memory friendly
and hand-tuned in a lot of places. But it's definitely worth trying with both
cElementTree and lxml to see what works better for you. Depending on your
data, this may be fastest in lxml 2.1:

import lxml.etree

doc = lxml.etree.parse("test.xml")
for el in doc.iter("src_ref"):
    attrval = el.get("editions")
    if attrval is not None:
        pass  # do something with attrval

Stefan
Jul 22 '08 #5

On Jul 23, 2:03 am, Stefan Behnel <stefan...@behnel.de> wrote:
> [...]
> It may still help a bit as lxml's setup of libxml2 is pretty memory friendly
> and hand-tuned in a lot of places. But it's definitely worth trying with both
> cElementTree and lxml to see what works better for you. Depending on your
> data, this may be fastest in lxml 2.1:
>
>     import lxml.etree
>
>     doc = lxml.etree.parse("test.xml")
>     for el in doc.iter("src_ref"):
>         attrval = el.get("editions")
>         if attrval is not None:
>             pass  # do something with attrval
>
> Stefan

The original file was 18MB and contained 288,328 matching attributes for
the particular path.

I wonder whether the for loop will be a problem when iterating 288,328
times.
Jul 23 '08 #6

Kanch wrote:
> The original file was 18MB and contained 288,328 matching attributes
> for the particular path.

You didn't say how many elements there are in total, but I wouldn't expect
that to be a problem, unless you have very little free memory (say, way below
256MB). I just tried with lxml 2.1 and a 40MB XML file with 300,000 elements
and it lets the whole Python interpreter take up some 140MB of memory in
total. Looping over all elements by calling "list(root.iter())" takes a bit
more than one second on my laptop. That suggests that /any/ solution involving
lxml (or cElementTree) will do just fine for you.
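
For reference, a rough sketch of how to reproduce that measurement
yourself, assuming lxml and your own test.xml:

    import time
    from lxml import etree

    root = etree.parse("test.xml").getroot()
    start = time.time()
    elems = list(root.iter())  # materialise every element in the tree
    print("%d elements in %.2f seconds" % (len(elems), time.time() - start))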

> I wonder whether the for loop will be a problem when iterating 288,328
> times.

You are heavily underestimating the power of Python here.

Stefan
Jul 23 '08 #7

On Jul 23, 11:05 am, Stefan Behnel <stefan...@behnel.de> wrote:
> [...]
Hi,

thanks for the help. lxml will suit my work. I have not been working
with Python for that long. :)

Kanch
Jul 23 '08 #8
