
10GB XML Blows out Memory, Suggestions?

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?

Jun 6 '06 #1
ax****@gmail.com:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is?


PullDOM.
http://www-128.ibm.com/developerwork...tipulldom.html
http://www.prescod.net/python/pulldom.html
http://docs.python.org/lib/module-xml.dom.pulldom.html (not much)
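
(A minimal sketch of the pulldom idiom, with a made-up file name and a
placeholder "record" tag: only the nodes you explicitly expand are built
as DOM trees, the rest just streams past.)

from xml.dom import pulldom

events = pulldom.parse("data.xml")          # hypothetical file name

for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "record":
        events.expandNode(node)             # build a mini-DOM for this element only
        values = node.getElementsByTagName("value")
        # ... work with the expanded node like any other DOM element ...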

--
René Pijlman
Jun 6 '06 #2
ax****@gmail.com wrote:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?
More memory;)
Maybe you should have a look at pulldom, a combination of sax and dom: it
reads your document in a sax-like manner and expands only selected
sub-trees.
Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?


Assuming a good design, of course not. Especially if you only need some
selected parts of the document, SAX should be your choice.
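
(To give a feel for what the conversion involves, a minimal SAX handler
sketch; the file name and tag names below are placeholders, not the OP's
actual schema.)

import xml.sax

class ValueHandler(xml.sax.ContentHandler):
    """Collects the text of every <value> element while streaming."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_value = False
        self.buffer = []
        self.values = []
    def startElement(self, name, attrs):
        if name == "value":
            self.in_value = True
            self.buffer = []
    def characters(self, content):
        if self.in_value:
            self.buffer.append(content)
    def endElement(self, name):
        if name == "value":
            self.values.append("".join(self.buffer))
            self.in_value = False

handler = ValueHandler()
xml.sax.parse("data.xml", handler)   # streams the file; memory stays flat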

Mathias
Jun 6 '06 #3
ax****@gmail.com schrieb:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?


Yes.

You could use ElementTree's iterparse - that should be the easiest solution.

http://effbot.org/zone/element-iterparse.htm

Diez
Jun 6 '06 #4
ax****@gmail.com wrote:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.


With a 10GB file, your best bet might be to just use Expat and C!!
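
(For reference, the expat binding that ships with Python, xml.parsers.expat,
exposes the same streaming callbacks without dropping to C; a rough sketch,
with a made-up file and tag name:)

import xml.parsers.expat

count = 0

def start_element(name, attrs):
    # called once per start tag; attrs is a dict of attribute values
    global count
    if name == "record":
        count += 1

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
with open("data.xml", "rb") as f:
    parser.ParseFile(f)                  # reads the file in fixed-size buffers
print(count)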

Regards
Sreeram


Jun 6 '06 #5

ax****@gmail.com wrote:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?


What you clearly need is a better suited file format, but I suspect
you're not in a position to change it, are you?

Cheers,
Nicola Musatti

Jun 6 '06 #6
K.S.Sreeram schrieb:
ax****@gmail.com wrote:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.


With a 10GB file, your best bet might be to just use Expat and C!!


No, what exactly makes C grok a 10Gb file where Python will fail to do so?

What the OP needs is a different approach to XML documents that won't
parse the whole file into one giant tree - but I'm pretty sure that
(c)ElementTree will do the job as well as expat. And I don't recall the
OP musing about performance woes, btw.

Diez
Jun 6 '06 #7
<ax****@gmail.com> wrote in message
news:11********************@u72g2000cwu.googlegroups.com...
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?


You clearly need something instead of XML.

This sounds like a case where a prototype, which worked for the developer's
simple test data set, blows up in the face of real user/production data.
XML adds lots of overhead for nested structures, when in fact, the actual
meat of the data can be relatively small. Note also that this XML overhead
is directly related to the verbosity of the XML designer's choice of tag
names, and whether the designer was predisposed to using XML elements over
attributes. Imagine a record structure for a 3D coordinate point (described
here in no particular coding language):

struct ThreeDimPoint:
xValue : integer,
yValue : integer,
zValue : integer

Directly translated to XML gives:

<ThreeDimPoint>
<xValue>4</xValue>
<yValue>5</yValue>
<zValue>6</zValue>
</ThreeDimPoint>

This expands 3 integers to a whopping 101 characters. Throw in namespaces
for good measure, and you inflate the data even more.

Many Java folks treat XML attributes as anathema, but look how this cuts
down the data inflation:

<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>

This is only 50 characters, or *only* 4 times the size of the contained data
(assuming 4-byte integers).

Try zipping your 10Gb file, and see what kind of compression you get - I'll
bet it's close to 30:1. If so, convert the data to a real data storage
medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).
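
(A minimal sketch of the SQLite idea, reusing the ThreeDimPoint example
above; the database file and table names are made up:)

import sqlite3

conn = sqlite3.connect("points.db")      # a single file you can ship around
conn.execute("CREATE TABLE IF NOT EXISTS point (x INTEGER, y INTEGER, z INTEGER)")

# in real code the rows would come from whatever parses the source data
points = [(4, 5, 6), (7, 8, 9)]
conn.executemany("INSERT INTO point VALUES (?, ?, ?)", points)
conn.commit()

for row in conn.execute("SELECT x, y, z FROM point"):
    print(row)
conn.close()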

-- Paul
Jun 6 '06 #8

Paul> You clearly need something instead of XML.

Amen, brother...

+1 QOTW.

Skip
Jun 6 '06 #9

ax****@gmail.com wrote:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?


If your XML files grow this large you might rethink the representation
model. Maybe give eXist a try?

http://exist.sourceforge.net/

Regards,
Kay

Jun 6 '06 #10
On Tue, 2006-06-06 at 13:56 +0000, Paul McGuire wrote:
(just can't open it up like a text file)


Who'll open a 10 GiB file anyway?

--
Felipe.

Jun 6 '06 #11
Diez B. Roggisch wrote:
What the OP needs is a different approach to XML-documents that won't
parse the whole file into one giant tree - but I'm pretty sure that
(c)ElementTree will do the job as well as expat. And I don't recall the
OP musing about performances woes, btw.

There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python. So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.

Diez B. Roggisch wrote:
No, what exactly makes C grok a 10Gb file where Python will fail to do so?


In most typical cases where there's any kind of significant Python code,
it's possible to achieve a *minimum* of a 10x speedup by using C. In most
cases, the speedup is not worth it and we just trade it for the
increased flexibility/power of the Python language. But in this situation
using a bit of tight C code could make the difference between the
process taking just 15 minutes or taking a few hours!

Of course I'm not asking him to write the entire application in C. It
makes sense to just write the performance-critical sections in C, and
wrap them in Python, and write the rest of the application in Python.



Jun 6 '06 #12
K.S.Sreeram wrote:
There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python. So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.


both ElementTree and cElementTree support "sax-style" event generation
(through XMLTreeBuilder/XMLParser) and incremental parsing (through
iterparse). the cElementTree versions of these are even faster than
pyexpat.

the iterparse interface is described here:

http://effbot.org/zone/element-iterparse.htm
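
(A rough sketch of the "sax-style" mode as it is spelled in today's
xml.etree.ElementTree, via a parser target object; the tag name and buffer
size are placeholders:)

from xml.etree import ElementTree as ET

class Counter:
    # a "target" receives parse events instead of a tree being built
    def __init__(self):
        self.items = 0
    def start(self, tag, attrib):
        if tag == "item":
            self.items += 1
    def end(self, tag):
        pass
    def data(self, text):
        pass
    def close(self):
        return self.items

parser = ET.XMLParser(target=Counter())
with open("data.xml", "rb") as f:
    for chunk in iter(lambda: f.read(64 * 1024), b""):
        parser.feed(chunk)               # incremental feeding keeps memory flat
print(parser.close())                    # returns whatever target.close() returns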

</F>

Jun 6 '06 #13
Fredrik Lundh wrote:
both ElementTree and cElementTree support "sax-style" event generation
(through XMLTreeBuilder/XMLParser) and incremental parsing (through
iterparse). the cElementTree versions of these are even faster than
pyexpat.

the iterparse interface is described here:

http://effbot.org/zone/element-iterparse.htm

That's cool! Thanks for the info!

For a multi-gigabyte file, I would still recommend C/C++, because otherwise
the processing code which sits on top of the XML library would be Python,
and that could turn out to be a significant overhead in such extreme cases.

Of course, the exact strategy to follow would depend on the specifics of
the case, and all this speculation may not really apply! :)

Regards
Sreeram

Jun 6 '06 #14
10 gigs? Wow, even using SAX I would imagine that you would be pushing
the limits of reasonable performance. Any way you can depart from the
XML requirement? That's not really what XML was intended for in terms
of passing along information IMHO...

ax****@gmail.com wrote:
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?


Jun 6 '06 #15
The file is an XML dump from Goldmine. I have built a document parser
that allows for the population of data from Goldmine into SugarCRM. The
client's data set is 10GB.

Felipe Almeida Lessa wrote:
On Tue, 2006-06-06 at 13:56 +0000, Paul McGuire wrote:
(just can't open it up like a text file)


Who'll open a 10 GiB file anyway?

--
Felipe.


Jun 6 '06 #16
gregarican wrote:
10 gigs? Wow, even using SAX I would imagine that you would be pushing
the limits of reasonable performance.
depends on how you define "reasonable", of course. modern computers are
quite fast:
dir data.xml
2006-06-06 21:35 1 002 000 015 data.xml
1 File(s) 1 002 000 015 bytes
more test.py

from xml.etree import cElementTree as ET
import time

t0 = time.time()

for event, elem in ET.iterparse("data.xml"):
if elem.tag == "item":
elem.clear()

print time.time() - t0

gives me timings between 27.1 and 49.1 seconds over 5 runs.

(Intel Dual Core T2300, slow laptop disks, 1000000 XML "item" elements
averaging 1000 byte each, bundled cElementTree, peak memory usage 33 MB.
your milage may vary.)

</F>

Jun 6 '06 #17
Paul,

This is interesting. Unfortunately, I have no control over the XML
output. The file is from Goldmine. However, you have given me an
idea...

Is it possible to read an XML document in compressed format?
Paul McGuire wrote:
<ax****@gmail.com> wrote in message
news:11********************@u72g2000cwu.googlegroups.com...
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?


You clearly need something instead of XML.

[...]

Try zipping your 10Gb file, and see what kind of compression you get - I'll
bet it's close to 30:1. If so, convert the data to a real data storage
medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).

-- Paul


Jun 6 '06 #18
That's a good-sized Goldmine database. In past lives I have supported
that app and recall that you could match the Goldmine front end against
an SQL backend. If you can get to the underlying data utilizing SQL you
can selectively port over sections of the database and might be able to
attack things more methodically than parsing through a mongo XML file.
Instead you could bulk insert portions of the Goldmine data into
SugarCRM. Know what I mean?

ax****@gmail.com wrote:
The file is an XML dump from Goldmine. I have built a document parser
that allows for the population of data from Goldmine into SugarCRM. The
client's data set is 10GB.


Jun 6 '06 #19
"K.S.Sreeram" <sr*****@tachyontech.net> writes:
[...]
There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python.
Yes.
So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.
No, that's not true. I guess you didn't read the other posts:

http://effbot.org/zone/element-iterparse.htm

Diez B. Roggisch wrote:
No, what exactly makes C grok a 10Gb file where Python will fail to do so?


In most typical cases where there's any kind of significant Python code,
it's possible to achieve a *minimum* of a 10x speedup by using C. In most

[...]

I don't know where you got that from. And in this particular case, of
course, cElementTree *is* written in C, there's presumably plenty of
"significant python code" around since, one assumes, *all* of the OP's
code is written in Python (does that count as "any kind" of Python
code?), and yet rewriting something in C here may not make much
difference.
John
Jun 6 '06 #20

K.S.Sreeram wrote:
[...]
In most typical cases where there's any kind of significant Python code,
it's possible to achieve a *minimum* of a 10x speedup by using C. In most
cases, the speedup is not worth it and we just trade it for the
increased flexibility/power of the Python language. But in this situation
using a bit of tight C code could make the difference between the
process taking just 15 minutes or taking a few hours!

Of course I'm not asking him to write the entire application in C. It
makes sense to just write the performance-critical sections in C, and
wrap them in Python, and write the rest of the application in Python.

you got no idea what you are talking about, anyone knows that something
like this is IO bound.
CPU is the least of his worries. And for IO bound applications Python
is just as fast as any other language.

Jun 7 '06 #21

ax****@gmail.com wrote:
Paul,

This is interesting. Unfortunately, I have no control over the XML
output. The file is from Goldmine. However, you have given me an
idea...

Is it possible to read an XML document in compressed format?


compressing the footprint on disk won't matter; you still have 10GB of
data that you need to process, and it can only be processed
uncompressed.

I would just export the data in smaller batches; there should not be
any reason you can't export subsets and process them that way.

Jun 7 '06 #22
fuzzylollipop wrote:
you got no idea what you are talking about, anyone knows that something
like this is IO bound.


which of course explains why some XML parsers for Python are a 100 times
faster than other XML parsers for Python...

</F>

Jun 7 '06 #23
fuzzylollipop wrote:
Is it possible to read an XML document in compressed format?


compressing the footprint on disk won't matter, you still have 10GB of
data that you need to process and it can only be processed uncompressed.


didn't you just claim that this was an I/O bound problem ?

</F>

Jun 7 '06 #24
ax****@gmail.com wrote:
Paul,

This is interesting. Unfortunately, I have no control over the XML
output. The file is from Goldmine. However, you have given me an
idea...

Is it possible to read an XML document in compressed format?


sure. you can e.g. use gzip.open to create a file object that
decompresses on the way in.

file = gzip.open("data.xml.gz")

for event, elem in ET.iterparse(file):
if elem.tag == "item":
elem.clear()

I tried compressing my 1 GB example, but all 1000-byte records in that
file are identical, so I got a 500x compression, which is a bit higher
than you can reasonably expect ;-) however, with that example, I get a
stable parsing time of 26 seconds, so it looks as if gzip can produce
data about as fast as a preloaded disk cache...

</F>

Jun 7 '06 #25
Am I missing something? I don't read where the poster mentioned the
operation as being CPU intensive. He does mention that the entirety of
a 10 GB file cannot be loaded into memory. If you discount physical
swapfile paging and base this assumption on a "normal" PC that might
have maybe 1 or 2 GB of RAM, is his assumption that out of line?

And I don't doubt that Python is as efficient as possible for I/O
operations. But since it is an interpreted scripting language, how could
it be "just as fast as any language" as you claim? C would have to be
faster. Machine language would have to be faster. And even other
interpreted languages *could* be faster, given certain conditions. A
generalization like that claim kind of invalidates the remainder of your
assertion.

fuzzylollipop wrote:
[...]
you got no idea what you are talking about, anyone knows that something
like this is IO bound.
CPU is the least of his worries. And for IO bound applications Python
is just as fast as any other language.


Jun 7 '06 #26

Fredrik Lundh wrote:
fuzzylollipop wrote:
you got no idea what you are talking about, anyone knows that something
like this is IO bound.


which of course explains why some XML parsers for Python are a 100 times
faster than other XML parsers for Python...


depends on the CODE and the SIZE of the file, in this case

when processing a 10GB file, unless that file is heavily encrypted or
compressed, the process will be IO bound PERIOD!

And in the case of XML unless the PARSER is extremely inefficient, and
I assume that would be an edge case, the parser is NOT the bottleneck
in this case.

The relative performance of Python XML parsers is irrelevant in
relationship to this being an IO bound process; even the slowest parser
could only process the data as fast as it can be read off the disk.

Anyone saying that using C instead of Python will be faster when 99% of
the time in this case is just waiting on the disk to feed a buffer, has
no idea what they are talking about.

I work with TeraBytes of files, and all our Python code is just as fast
as equivalent C code for IO bound processes.

Jun 7 '06 #27
Thanks guys for all your posts...

So I am a bit confused....Fuzzy, the code I saw looks like it
decompresses as a stream (i.e. per byte). Is this the case or are you
just compressing for file storage but the actual data set has to be
exploded in memory?

fuzzylollipop wrote:
[...]
Anyone saying that using C instead of Python will be faster when 99% of
the time in this case is just waiting on the disk to feed a buffer, has
no idea what they are talking about.

I work with TeraBytes of files, and all our Python code is just as fast
as equivalent C code for IO bound processes.


Jun 7 '06 #28
fuzzylollipop wrote:

Fredrik Lundh wrote:
fuzzylollipop wrote:
> you got no idea what you are talking about, anyone knows that something
> like this is IO bound.
which of course explains why some XML parsers for Python are a 100 times
faster than other XML parsers for Python...


depends on the CODE and the SIZE of the file, in this case

when processing a 10GB file, unless that file is heavily encrypted or
compressed, the process will be IO bound PERIOD!


Why so? IO-bounds will be hit when the processing of the fetched data is
faster than the fetching itself. So if I decide to read 10GB one 4Kb block
per second, I'm possibly a very patient fella, but no IO-bounds are hit. So
no PERIOD here - at least not without talking about _what_ actually happens.
Anyone saying that using C instead of Python will be faster when 99% of
the time in this case is just waiting on the disk to feed a buffer, has
no idea what they are talking about.
Which is true - but the chances of C performing whatever I want to in that
1% of the time are a few times better than of doing so in Python.

Mind you: I don't argue that the statements of Mr. Sreeram are true, either.
This discussion can only be held with respect to the actual use case (which
is certainly more than just parsing XML, but also processing it).
I work with TeraBytes of files, and all our Python code is just as fast
as equivalent C code for IO bound processes.


Care to share what kind of processing you perform on these files?

Regards,

Diez

Jun 7 '06 #29
Point for Fredrik. If someone doesn't recognize the inherent
performance differences between different XML parsers they haven't
experienced the pain (and eventual victory) of trying to optimize their
techniques for working with the albatross that XML can be :-)

Fredrik Lundh wrote:
[...]

Jun 7 '06 #30
fuzzylollipop wrote:
depends on the CODE and the SIZE of the file, in this case
when processing a 10GB file, unless that file is heavily encrypted or
compressed, the process will be IO bound PERIOD!
so the fact that

for token, node in pulldom.parse(file):
    pass

is 50-200% slower than

for event, elem in ET.iterparse(file):
    if elem.tag == "item":
        elem.clear()

when reading a gigabyte-sized XML file, is due to an unexpected slowdown
in the I/O subsystem after importing xml.dom?
I work with TeraBytes of files, and all our Python code is just as fast
as equivalent C code for IO bound processes.


so how large are the things that you're actually *processing* in your
Python code? megabyte blobs or 100-1000 byte records? or even smaller
things?

</F>

Jun 7 '06 #31
gregarican wrote:
Am I missing something? I don't read where the poster mentioned the
operation as being CPU intensive. He does mention that the entirety of
a 10 GB file cannot be loaded into memory. If you discount physical
swapfile paging and base this assumption on a "normal" PC that might
have maybe 1 or 2 GB of RAM is his assumption that out of line?


Indeed. The complaint is fairly obvious from the title of the thread.
Now, if the complaint was specifically about the size of the minidom
representation in memory, perhaps a more efficient representation could
be chosen by using another library. Even so, the size of the file being
processed is still likely to be pretty big, considering various
observations and making vague estimates:

http://effbot.org/zone/celementtree.htm

For many people, an XML file of, say, 600MB would still be quite a load
on their "home/small business edition" computer if you had to load the
whole file in and then work on it, even just as a text file. Of course,
approaches where you can avoid keeping a representation of the whole
thing around would be beneficial, and as mentioned previously in a
thread on large XML files, there's always the argument that some kind
of database system should be employed to make querying more efficient
if you can't perform some kind of sequential processing.

Paul

Jun 7 '06 #32
Paul McGuire schrieb:
meat of the data can be relatively small. Note also that this XML overhead
is directly related to the verbosity of the XML designer's choice of tag
names, and whether the designer was predisposed to using XML elements over
attributes. Imagine a record structure for a 3D coordinate point (described
here in no particular coding language):

struct ThreeDimPoint:
xValue : integer,
yValue : integer,
zValue : integer

Directly translated to XML gives:

<ThreeDimPoint>
<xValue>4</xValue>
<yValue>5</yValue>
<zValue>6</zValue>
</ThreeDimPoint>
This is essentially true, but should not cause the OP's problem.
After parsing, the overhead of XML is gone, and long tag names
are nothing but pointers to a string which happens to be long
(unless *all* tags in the XML are differently named, which would
cause a huge DTD/XSD as well).
This expands 3 integers to a whopping 101 characters. Throw in namespaces
for good measure, and you inflate the data even more.
In the DOM, it contracts to 3 integers and a few pointers -
essentially the same as needed in a reasonably written
data structure.
Try zipping your 10Gb file, and see what kind of compression you get - I'll
bet it's close to 30:1. If so, convert the data to a real data storage
In this case, his DOM (or whatever equivalent data structure, i.e.
what he *must* process) would be 300 MB + pointers.
I'd even go as far as to say that the best thing that can happen to
him is a huge overhead - this would mean he has little data
in a rather spongy file (which collapses on parsing).
medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).


A table helps only if the data is tabular (i.e. a single relation),
i.e. probably never (otherwise the sending side would have shipped
something like CSV).

Ralf
Jun 7 '06 #33
Ralf Muschall wrote:
In the DOM, it contracts to 3 integers and a few pointers -
essentially the same as needed in a reasonably written
data structure.


what kind of magic DWIM DOM is this?

</F>

Jun 7 '06 #34
>>medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).

A table helps only if the data is tabular (i.e. a single relation),
i.e. probably never (otherwise the sending side would have shipped
something like CSV).


Perhaps the previous poster meant "database file", which for some
systems describes the "container" of the whole database. If the XML has
redundancies represented in "linked" data, data normalization can cut
down on the needed space.

my 0.02 EUR

thomas
Jun 8 '06 #35

ax****@gmail.com wrote:
Thanks guys for all your posts...

So I am a bit confused....Fuzzy, the code I saw looks like it
decompresses as a stream (i.e. per byte). Is this the case or are you
just compressing for file storage but the actual data set has to be
exploded in memory?


it wasn't my code.

if you zip the 10GB and read from the zip into a DOM style tree, you
haven't gained anything, except adding additional CPU requirements to
do the decompression. You still have to load the entire thing into
memory.

There are differences in XML parsers; IN EVERY LANGUAGE a poorly
written parser is a poorly written parser. Using the wrong IDIOM is
more of a problem than anything else. DOM parsers are good when you
need to read and process every element and attribute and the data is
"small". Granted, "small" is relative, but nobody will consider 10GB
"small".

SAX style or a pull-parser has to be used when the data is "large" or
when you don't really need to process every element and attribute.

This problem looks like it is just a data export / import problem. In
that case you will either have to use a SAX-style parser and parse the
10GB file, or, as I suggested in another reply, export the data in
smaller chunks and process them separately, which in almost EVERY case
is a better solution for batch processing.

You should always break processing up into as many discrete steps as
possible. It makes for easier debugging, and you can start over in the
middle much more easily.

Even if you just write a simple SAX-style parser to break the file
up into smaller pieces before actually processing it, you will be ahead of
the game.
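
(A rough sketch of such a splitter, assuming a flat stream of "record"
elements under a single root; the tag names, chunk size and file names are
all placeholders:)

import xml.sax
from xml.sax.saxutils import XMLGenerator

class Splitter(xml.sax.ContentHandler):
    """Copies every <record> element into rotating chunk files."""
    def __init__(self, per_file=10000):
        xml.sax.ContentHandler.__init__(self)
        self.per_file = per_file
        self.records = 0
        self.chunks = 0
        self.depth = 0          # >0 while inside a <record>
        self.fh = None
        self.gen = None
    def _open_chunk(self):
        self.fh = open("chunk-%04d.xml" % self.chunks, "w")
        self.gen = XMLGenerator(self.fh, "utf-8")
        self.gen.startDocument()
        self.gen.startElement("dump", {})
        self.chunks += 1
    def _close_chunk(self):
        if self.gen is not None:
            self.gen.endElement("dump")
            self.gen.endDocument()
            self.fh.close()
            self.gen = None
    def startElement(self, name, attrs):
        if self.depth == 0 and name == "record":
            if self.records % self.per_file == 0:
                self._close_chunk()
                self._open_chunk()
            self.records += 1
            self.depth = 1
        elif self.depth:
            self.depth += 1
        if self.depth:
            self.gen.startElement(name, attrs)
    def characters(self, content):
        if self.depth:
            self.gen.characters(content)
    def endElement(self, name):
        if self.depth:
            self.gen.endElement(name)
            self.depth -= 1
    def endDocument(self):
        self._close_chunk()

xml.sax.parse("data.xml", Splitter())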

We have systems that process streaming data coming from sockets in XML
format, that run in Java with very little memory footprint and very
little CPU usage. At 50 megabit a sec, that is about 4TB a day. C
wouldn't read from a socket any faster than the NBIO, actually it would
be harder to get the same performance in C because we would have to
duplicate all the SEDA style NBIO.

Jun 8 '06 #36
fuzzylollipop wrote:
SAX style or a pull-parser has to be used when the data is "large" or
when you don't really need to process every element and attribute.

This problem looks like it is just a data export / import problem. In
that case you will either have to use a sax style parser and parse the
10GB file. Or as I suggested in another reply, export the data in
smaller chunks


or use a parser that can do the chunking for you, on the way in...

in Python, incremental parsers like cET's iterparse and the one in Amara
give you *better* performance than SAX (including "raw" pyexpat) in
many cases, and offer a much simpler programming model.

</F>

Jun 8 '06 #37

Fredrik Lundh wrote:
[...]


That's good to know, I haven't worked with cET yet. Haven't had time to
get it installed :-(

Jun 8 '06 #38
K.S.Sreeram wrote:
Fredrik Lundh wrote:
both ElementTree and cElementTree support "sax-style" event generation
(through XMLTreeBuilder/XMLParser) and incremental parsing (through
iterparse). the cElementTree versions of these are even faster than
pyexpat.

the iterparse interface is described here:

http://effbot.org/zone/element-iterparse.htm

That's cool! Thanks for the info!

For a multi-gigabyte file, I would still recommend C/C++, because otherwise
the processing code which sits on top of the XML library would be Python,
and that could turn out to be a significant overhead in such extreme cases.

Of course, the exact strategy to follow would depend on the specifics of
the case, and all this speculation may not really apply! :)


Honestly, I think that legitimate use-cases for multi-gigabyte XML are
very rare. Many people abuse XML as some sort of DBMS replacement.
This abuse is part of the reason why so many developers are hostile to
XML. XML is best for documents, and documents can get to the
multi-gigabyte range, but rarely do. Usually, when they do, there is a
logical way to decompose them, process them, and re-compose them,
whereas with XML used as a DBMS replacement, relations and datatyping
complicate such natural divide-and-conquer techniques.

I always say that if you're dealing with gigabyte XML, it's well worth
considering whether you're not using a hammer to screw in a bolt.

If monster XML is inevitable, then I extend Fredrik's earlier mention
of Amara to say that Pushdom allows you to pre-declare the chunks of
XML you're interested in, and then it processes the XML in streaming
mode, only instantiating the chunks of interest one at a time. This
allows for handling of huge files with a very simple programming idiom.

http://uche.ogbuji.net/tech/4suite/amara/

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

Jun 11 '06 #39
> > I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?


If the data is composed of a large number of records,
like a database dump of some sort,
then you could probably have a look at a StAX-style processor
for Python, like pulldom.

In this way you could process each single record one at a time,
without loading the entire document.

Regards,
Antonio

Jun 22 '06 #40
Hello guys, though I wasn't involved in the initial discussion of this, I want to shed some light on how I was able to port Goldmine data into SugarCRM.

First things first, you will never be able to work with that XML file that Goldmine exports, it's just too damned big.

I couldn't even get a "progressive parser" written in Java to read it. Every language I have used to try to parse the XML Goldmine dump ended in an IOException.OutOfMemory

So, I requested that our client send me the ACTUAL dBASE files that Goldmine uses. I then wrote a command-line utility to take a dBASE file, re-create its table structure in MySQL and import the data.

Based on an array, in which each dbf file is defined, I then call the command-line utility to parse each dbf file and re-create its structure and data in MySQL.

So, using recursive looping through the array of dbf's, I could extract all the data and store it in a MySQL database with the dBASE filename as the table.
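
(For readers following along, a rough sketch of that dbf-to-MySQL step; it
leans on the dbfread and MySQLdb packages purely as stand-ins, the file
names and credentials are made up, and every column is created as TEXT to
sidestep type mapping, which the real utility would handle properly.)

from dbfread import DBF
import MySQLdb

dbf_files = ["contact1.dbf", "conthist.dbf"]     # hypothetical Goldmine tables

conn = MySQLdb.connect(host="localhost", user="gm", passwd="secret", db="goldmine")
cur = conn.cursor()

for path in dbf_files:
    table = DBF(path)
    name = path.rsplit(".", 1)[0]                # table named after the dbf file
    cols = ", ".join("`%s` TEXT" % f for f in table.field_names)
    cur.execute("CREATE TABLE IF NOT EXISTS `%s` (%s)" % (name, cols))
    insert = "INSERT INTO `%s` VALUES (%s)" % (
        name, ", ".join(["%s"] * len(table.field_names)))
    for record in table:                          # one dict-like row per dbf record
        cur.execute(insert, [record[f] for f in table.field_names])
    conn.commit()

conn.close()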

Next was the hard part, pulling data from the new MySQL GoldMine database into Sugar.

However, since I now had my Goldmine data in MySQL, I could use SQL to sift through the data and put it where it needed to go for SugarCRM.

Here is a little mockup of how my application works.

Step1: upload all DBF's to a folder that the application will read.
Step2: Python script will look in the folder for all necessary files ending in dbf.
Step3: foreach dbf as Table and run the command-line utility on each dbf.
Step4: verify data in MySQL Goldmine Database
Step5: Generate SugarCRM GUID for each RECID from Goldmine
Step6: Parse Goldmine SQL Tables, 1 by 1, and using the new SugarCRM Keys generated for each Record ID, I then can parse the Goldmine Data and insert into the SugarCRM Database.

This solution was developed for my company and our Goldmine to SugarCRM Conversion utility will convert EVERYTHING from Goldmine into a SugarCRM substitute.

We started with a 4GB XML File, I told client to send us the DBF's instead.
Our conversion utility takes 10 minutes to run from all the DBF's from Goldmine and provides data for SugarCRM with nothing missing (even "crap" data is there, we can't fix data, only migrate it).

So that's the approach I took. The command-line utility is written for Windows and therefore must be run on a Windows machine; however, I'm working on a cross-platform command-line utility to go along with it so it can be run on any web host.
Jun 30 '06 #41
