10GB XML Blows out Memory, Suggestions?

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?

Jun 6 '06 #1
ax****@gmail.com:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?
>
> Any suggestions on what that something else is?


PullDOM.
http://www-128.ibm.com/developerwork...tipulldom.html
http://www.prescod.net/python/pulldom.html
http://docs.python.org/lib/module-xml.dom.pulldom.html (not much)

--
René Pijlman
Jun 6 '06 #2
ax****@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?

More memory ;)
Maybe you should have a look at pulldom, a combination of SAX and DOM: it
reads your document in a SAX-like manner and expands only selected
sub-trees.
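
A minimal sketch of the pulldom approach described above, using only the standard library. The `<points>`/`<point>` layout, the `iter_points` and `text_of` helpers and the sample data are assumptions for illustration; the thread never shows the real file's structure.

```python
import io
from xml.dom import pulldom

# Hypothetical sample document; the real 10gb file's layout is unknown.
SAMPLE = b"""<points>
  <point id="1"><x>4</x><y>5</y><z>6</z></point>
  <point id="2"><x>7</x><y>8</y><z>9</z></point>
</points>"""


def text_of(element):
    """Join the text nodes directly under an element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def iter_points(stream):
    """Yield (id, x, y, z) for each <point>, one subtree at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "point":
            # Build a small DOM for this subtree only, not the whole file.
            events.expandNode(node)
            x, y, z = (
                int(text_of(node.getElementsByTagName(tag)[0]))
                for tag in ("x", "y", "z")
            )
            yield node.getAttribute("id"), x, y, z


points = list(iter_points(io.BytesIO(SAMPLE)))
```

For a real file you would pass a file opened in binary mode, or a filename, instead of the in-memory sample.
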
> Any suggestions on what that something else is? Is it hard to convert
> the code from DOM to SAX?


Assuming a good design, of course not. Especially if you only need some
selected parts of the document, SAX should be your choice.
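
To give an idea of what the conversion involves, here is a rough SAX sketch that picks selected parts out of the stream. With SAX you keep track of state yourself instead of walking a tree. The `PointHandler` class, the element names and the sample data are hypothetical, since the thread does not show the real document.

```python
import io
import xml.sax

# Hypothetical sample document; the real file's layout is unknown.
SAMPLE = b"""<points>
  <point id="1"><x>4</x><y>5</y><z>6</z></point>
  <point id="2"><x>7</x><y>8</y><z>9</z></point>
</points>"""


class PointHandler(xml.sax.ContentHandler):
    """Collect one <point> at a time and hand it to a callback."""

    def __init__(self, on_point):
        super().__init__()
        self.on_point = on_point
        self.current = None   # dict for the <point> being read
        self.field = None     # name of the coordinate element being read
        self.chunks = []      # text pieces; characters() may fire repeatedly

    def startElement(self, name, attrs):
        if name == "point":
            self.current = {"id": attrs.get("id")}
        elif self.current is not None and name in ("x", "y", "z"):
            self.field = name
            self.chunks = []

    def characters(self, content):
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if self.field is not None and name == self.field:
            self.current[name] = int("".join(self.chunks))
            self.field = None
        elif name == "point":
            self.on_point(self.current)
            self.current = None


collected = []
xml.sax.parse(io.BytesIO(SAMPLE), PointHandler(collected.append))
```

Only one record is held by the handler at a time; whether results are accumulated (as in `collected` here) or processed and dropped is up to the callback.
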

Mathias
Jun 6 '06 #3
ax****@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?
>
> Any suggestions on what that something else is? Is it hard to convert
> the code from DOM to SAX?


Yes.

You could use ElementTree's iterparse - that should be the easiest solution.

http://effbot.org/zone/element-iterparse.htm
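
A short sketch of the iterparse pattern, written against `xml.etree.ElementTree` as shipped in the standard library since Python 2.5 (at the time of this thread ElementTree was a separate download). The element names, the `iter_points` helper and the sample data are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file's layout is unknown.
SAMPLE = b"""<points>
  <point id="1"><x>4</x><y>5</y><z>6</z></point>
  <point id="2"><x>7</x><y>8</y><z>9</z></point>
</points>"""


def iter_points(stream):
    """Yield (id, x, y, z) per <point> while discarding finished elements."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "point":
            record = (
                elem.get("id"),
                int(elem.findtext("x")),
                int(elem.findtext("y")),
                int(elem.findtext("z")),
            )
            yield record
            # Drop already-processed children so the tree does not grow.
            root.clear()


points = list(iter_points(io.BytesIO(SAMPLE)))
```

The `root.clear()` call is the important part: without it iterparse still builds the complete tree in the background, and memory use would end up similar to a full parse.
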

Diez
Jun 6 '06 #4
ax****@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.

With a 10gb file, your best bet might be to just use Expat and C!!

Regards
Sreeram

Jun 6 '06 #5

ax****@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?


What you clearly need is a better suited file format, but I suspect
you're not in a position to change it, are you?

Cheers,
Nicola Musatti

Jun 6 '06 #6
K.S.Sreeram wrote:
> ax****@gmail.com wrote:
>> I wrote a program that takes an XML file into memory using Minidom. I
>> found out that the XML document is 10gb.
>
> With a 10gb file, your best bet might be to just use Expat and C!!


Now what exactly makes C grok a 10Gb file where Python will fail to do so?

What the OP needs is a different approach to XML documents that won't
parse the whole file into one giant tree - but I'm pretty sure that
(c)ElementTree will do the job as well as expat. And I don't recall the
OP musing about performance woes, btw.
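
For what it's worth, Expat can also be driven directly from Python through the standard `xml.parsers.expat` module, so using it does not require writing C. A minimal sketch that streams a document and only counts start tags; the `count_tags` helper and the sample data are made up for illustration.

```python
import io
from collections import Counter
from xml.parsers import expat

# Hypothetical sample document; the real file's layout is unknown.
SAMPLE = b"""<points>
  <point id="1"><x>4</x><y>5</y><z>6</z></point>
  <point id="2"><x>7</x><y>8</y><z>9</z></point>
</points>"""


def count_tags(stream):
    """Stream a binary file object through Expat and count start tags."""
    counts = Counter()
    parser = expat.ParserCreate()

    def on_start(name, attrs):
        counts[name] += 1

    parser.StartElementHandler = on_start
    parser.ParseFile(stream)  # reads the stream in pieces, builds no tree
    return counts


tag_counts = count_tags(io.BytesIO(SAMPLE))
```

Nothing is retained here except the counters, which is the property that matters for a very large file.
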

Diez
Jun 6 '06 #7
<ax****@gmail.com> wrote in message
news:11********************@u72g2000cwu.googlegroups.com...
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?


You clearly need something instead of XML.

This sounds like a case where a prototype, which worked for the developer's
simple test data set, blows up in the face of real user/production data.
XML adds lots of overhead for nested structures, when in fact, the actual
meat of the data can be relatively small. Note also that this XML overhead
is directly related to the verbosity of the XML designer's choice of tag
names, and whether the designer was predisposed to using XML elements over
attributes. Imagine a record structure for a 3D coordinate point (described
here in no particular coding language):

struct ThreeDimPoint:
    xValue : integer,
    yValue : integer,
    zValue : integer

Directly translated to XML gives:

<ThreeDimPoint>
    <xValue>4</xValue>
    <yValue>5</yValue>
    <zValue>6</zValue>
</ThreeDimPoint>

This expands 3 integers to a whopping 101 characters. Throw in namespaces
for good measure, and you inflate the data even more.

Many Java folks treat XML attributes as anathema, but look how this cuts
down the data inflation:

<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>

This is only 50 characters, or *only* 4 times the size of the contained data
(assuming 4-byte integers).
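
The arithmetic can be checked with a few lines of Python. The whitespace layout is an assumption: four-space indentation, LF line breaks and no trailing newline is one layout that reproduces the 101-character figure. The attribute form measures 49 characters as written, so the 50 quoted above presumably counts a line break as well.

```python
import struct

# Element form, assuming 4-space indentation and LF line endings.
ELEMENT_FORM = (
    "<ThreeDimPoint>\n"
    "    <xValue>4</xValue>\n"
    "    <yValue>5</yValue>\n"
    "    <zValue>6</zValue>\n"
    "</ThreeDimPoint>"
)

# Attribute form on a single line, without a trailing newline.
ATTRIBUTE_FORM = '<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>'

# The same point as three packed 4-byte little-endian integers.
RAW = struct.pack("<3i", 4, 5, 6)

element_size = len(ELEMENT_FORM)      # 101
attribute_size = len(ATTRIBUTE_FORM)  # 49
raw_size = len(RAW)                   # 12

print(f"elements:   {element_size} chars, {element_size / raw_size:.1f}x raw")
print(f"attributes: {attribute_size} chars, {attribute_size / raw_size:.1f}x raw")
```
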

Try zipping your 10Gb file, and see what kind of compression you get - I'll
bet it's close to 30:1. If so, convert the data to a real data storage
medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).
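
Both suggestions can be tried without loading the file into memory. The sketch below is only an illustration: the `<points>` root element, the `points` table, and the `compression_ratio` and `load_points` helpers are assumptions, and the sample is generated on the spot, so the ratio it reports says nothing about the real 10Gb file.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
import zlib

# Generated sample in the element style shown above; <points> is hypothetical.
SAMPLE = b"<points>" + b"".join(
    b"<ThreeDimPoint><xValue>%d</xValue><yValue>%d</yValue>"
    b"<zValue>%d</zValue></ThreeDimPoint>" % (i, i + 1, i + 2)
    for i in range(1000)
) + b"</points>"


def compression_ratio(stream, chunk_size=1 << 20):
    """Compress the stream chunk by chunk and return raw size / packed size."""
    compressor = zlib.compressobj()
    raw = packed = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        raw += len(chunk)
        packed += len(compressor.compress(chunk))
    packed += len(compressor.flush())
    return raw / packed


def load_points(stream, conn):
    """Stream the points into a SQLite table; return the number of rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (x INTEGER, y INTEGER, z INTEGER)"
    )
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == "ThreeDimPoint":
            conn.execute(
                "INSERT INTO points VALUES (?, ?, ?)",
                tuple(int(elem.findtext(t)) for t in ("xValue", "yValue", "zValue")),
            )
            count += 1
            root.clear()  # discard processed elements
    conn.commit()
    return count


ratio = compression_ratio(io.BytesIO(SAMPLE))
conn = sqlite3.connect(":memory:")  # use a file path for a real database
loaded = load_points(io.BytesIO(SAMPLE), conn)
```

For real data you would open the XML file in binary mode and connect to a database file rather than `:memory:`.
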

-- Paul
Jun 6 '06 #8

Paul> You clearly need something instead of XML.

Amen, brother...

+1 QOTW.

Skip
Jun 6 '06 #9

ax****@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?
>
> Any suggestions on what that something else is? Is it hard to convert
> the code from DOM to SAX?


If your XML files grow that large, you might rethink the representation
model. Maybe give eXist a try?

http://exist.sourceforge.net/

Regards,
Kay

Jun 6 '06 #10
