
Problem: "java.lang.OutOfMemoryError: Java heap space" while reading xml using SAX

blazedaces
100+
P: 284
Ok, so you know my problem: Java is running out of memory while reading with SAX, the event-based XML parser that is meant, more so than DOM, for extremely large files. I'll try to explain what I've been doing and why I have to do it. Hopefully someone has a suggestion...

Alright, so I'm using a GPS-simulation program that outputs GPS data like longitude, latitude, altitude, etc. (hundreds of terms, these are just the well-known ones). In the newer version the data is output as a single XML file. In prior versions the output for a 2-hour scenario was about 30-80 megabytes. Now the one XML file (they wanted to be stupidly efficient) is over 200 megabytes, and since scenarios/simulations can run for over 24 hours, you can see the problem I had when told "we need a new program to read in the new version data, since we won't be able to get old version stuff anymore". Note: no one at the company even knew what XML really was, and my knowledge extended to... "XML sounds like HTML, right?"

We have a program for graphing in Matlab, so in the end the data has to be in Matlab. So first I looked at what Matlab has for XML reading: DOM. DOM reads ALL the data in an XML file and then creates a "tree" within which you can look at the data. So to make things simple, DOM separates the navigation and the data collecting: first it sets up its "tree" and then it lets you navigate to wherever you want. This has obvious drawbacks, but the main one I cared about was that it's bad at reading extremely large files when you don't need every bit of the data.

So, the alternative: SAX. SAX merges navigation and data collection by using an "event-based" parser. The events we care most about are probably "startElement", "endElement", and "characters", which fire when the parser reaches an opening tag, a closing tag, and the text in between.
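To make the three events concrete, here is a minimal sketch of a handler that just logs them in the order the parser fires them. The class name EventLogger and the tiny `<fix>` document are made up for illustration; only the org.xml.sax and javax.xml.parsers calls come from the JDK. Note that the parser is allowed to deliver one run of text in several "characters" calls, so a real handler should not assume one call per element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every SAX event it receives, one per line.
class EventLogger extends DefaultHandler {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.append("start:").append(qName).append('\n');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) of ch belongs to this event.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            log.append("text:").append(text).append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("end:").append(qName).append('\n');
    }

    // Parses an XML string and returns the recorded event log.
    static String trace(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        EventLogger handler = new EventLogger();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.log.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(trace("<fix><lat>0.5</lat><lon>1.2</lon></fix>"));
    }
}
```

Nothing is kept except the log itself, which is the point: the parser never builds a tree, it only calls back into the handler.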

Unfortunately, Matlab doesn't have a built-in XML reader that uses SAX, though nothing really does out of the box: to use SAX you have to write a program that tells the SAX reader what to do when each event occurs. SAX was originally written for Java, so I can write the program in Java, and Matlab has its own JVM, so I can use Java classes in Matlab (not easy).

Ok, so it felt like this when I finally gathered this knowledge: "so all I have to do is write a Java program that uses this SAX thing, then use Matlab to run the program, store the data points in memory and transfer them to Matlab, one after the other, then simply do what I did with the previous version data, put it in our 'graphing program'."

After 2 weeks of extensive research into SAX, Java, and Matlab, plus extensive programming, here I am. The parser program in Java was done, I knew how to use it in Matlab, and it was reading the XML files (the test ones I made for 5-second scenarios were 2 MB or so) perfectly! Then I finally tested it on the big file, but to no avail: too little memory (which brings us to the present).

I just want to say that before I did this I didn't know what a jar file is, what Java's class path is, what try/catch and exceptions are, or what implements/interfaces and extends mean. And most of all I knew absolutely nothing about SAX, which is hard enough to work with once you understand it fully.

I'm not saying this to be proud, I just wanted to point out that I still think of myself as a novice, if that, when it comes to all this, which is why I'm asking for help here...

Anyways, now that you guys know the background (thought it might help), do you have any suggestions to help avoid the problem of needing too much memory? The only solution that comes to mind is something like reading part of the file, storing just the data I need out of it in a separate file on the hard drive, closing that file, clearing all the data I have in memory at the time, and repeating till I get to the end. Any other suggestions, guys?

Thanks, all your help is much appreciated,
-blazed
May 24 '07 #1
5 Replies


Expert 10K+
P: 11,448
I suspect that you're doing things this way now:

1) start your SAX parser
2) your Handler(s) build up an entire datastructure
3) you spit out everything to Matlab;
4) done.

This is an "off-line" approach which is essentially what a DOM does: collect
the whole shebang from some xml source and build up the entire tree. Can't
you use a "streaming" or "on-line" approach:

1) start your SAX parser
2) your Handler(s) build up a bit of data
3) spit it out to Matlab and forget about it
4) if more to parse repeat steps 2 and 3
5) done.

This way you can forget about the small parts of data you've collected and,
the most important part, that data will be released so you don't need to keep
humongous amounts of data in core.
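A rough sketch of the "on-line" steps, under assumptions: the element names "fix", "lat" and "lon" are invented here (the real simulator output will use different names), and the Consumer stands in for whatever hands a finished record on to Matlab or to a file. The handler keeps only the record currently being read, so memory use stays flat no matter how long the file is.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical streaming handler: emits one {lat, lon} pair per <fix> element
// and retains nothing after the pair has been handed to the sink.
class StreamingFixHandler extends DefaultHandler {
    private final Consumer<double[]> sink;
    private final StringBuilder text = new StringBuilder();
    private double lat;
    private double lon;

    StreamingFixHandler(Consumer<double[]> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // discard whitespace seen between elements
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several pieces, so accumulate until endElement.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "lat": lat = Double.parseDouble(text.toString().trim()); break;
            case "lon": lon = Double.parseDouble(text.toString().trim()); break;
            case "fix": sink.accept(new double[] {lat, lon}); break; // step 3: spit it out
            default: break;
        }
        text.setLength(0);
    }

    // Parses an XML string, pushing each finished record to the sink.
    static void parse(String xml, Consumer<double[]> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), new StreamingFixHandler(sink));
    }
}
```

For a real file, the same handler can be passed to SAXParser.parse together with a java.io.File instead of an in-memory string.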

kind regards,

Jos (<--- sincerely *hates* xml)
May 24 '07 #2

blazedaces
100+
P: 284
Great idea, and I will definitely do it that way when I implement the Matlab side, but if Java can't store all the data in vectors of vectors, how will Matlab store it all in its variables?

Like when you say "3) spit it out to Matlab and forget about it" isn't that just allocating a different spot in memory?

I think, though, that I can build off your idea of the "on-line" solution: as the data goes to Matlab, instead of just forgetting about it, process it as I need (actually, almost no processing is necessary except maybe converting radians to degrees), store it in a saved file on the hard drive, close the file, and clear the previous memory (forget about it, as you say).
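One possible shape for the "store it on the hard drive" part, as a sketch only: a small writer that appends one comma-separated row per record, so that nothing has to stay in memory after it is written. The class name CsvSink and the choice of CSV are assumptions on my part, picked because a plain numeric text file is easy to load on the Matlab side.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical sink: appends each record as one comma-separated line.
class CsvSink implements AutoCloseable {
    private final BufferedWriter out;

    CsvSink(Writer target) {
        this.out = new BufferedWriter(target);
    }

    // Writes one row; the values are not retained after this call returns.
    void write(double... row) throws IOException {
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(Double.toString(row[i]));
        }
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        out.close(); // flushes buffered rows to the underlying writer
    }
}
```

With a java.io.FileWriter as the target this writes straight to disk, and the radians-to-degrees step could be done with Math.toDegrees on each value just before calling write.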

You might have meant that and I'm misunderstanding, if so, I apologize.

The other idea I'm coming up with has to do with the data I'm collecting. I could specify what data to look for, do one "run", collect that (using the on-line approach) and store it where I need it, then do another run for the next set of data I decide upon. For example: read the header section and save it in a text file of header information; read the signal information and save it for the graphing program; read the position info (lat, long, alt, ECEF x, y, z, etc.) and save that; and so on.

Thanks again,
-blazed (<--- also hates xml)

P.S. I'm also thinking of other things to improve the memory efficiency of my program. When I wrote this, to be lazy with the programming, I stored everything as a string. Maybe I should be specific and store the things I know are integers or doubles as integers or doubles, which will probably save a lot of space in the long run.
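A sketch of what storing numbers as numbers could look like: each value is parsed once and kept in a growable primitive double[] (8 bytes per value) instead of as a String object inside a Vector. The class name DoubleColumn and the starting capacity are arbitrary choices for illustration.

```java
import java.util.Arrays;

// Hypothetical column of values backed by a primitive array.
class DoubleColumn {
    private double[] values = new double[1024];
    private int size = 0;

    // Parses the element text once and keeps only the 8-byte double.
    void add(String rawText) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // grow by doubling
        }
        values[size++] = Double.parseDouble(rawText.trim());
    }

    int size() {
        return size;
    }

    // Returns a copy trimmed to the number of values actually stored.
    double[] toArray() {
        return Arrays.copyOf(values, size);
    }
}
```

As far as I know, a double[] returned from a Java method shows up in Matlab as an ordinary numeric array, so the result of toArray() should be usable on the Matlab side without further conversion.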

Also, I don't know how much it matters, but what about changing the counters in all those for loops (int i = 0; i < length; i++) to unsigned int? Does that save space, or does it just let the number go to a positive integer twice as high?
May 24 '07 #3

Expert 10K+
P: 11,448
There's another option: the java JVM has a flag that tells it how large the heap
is allowed to grow:
    java -Xmx512m YourApplication
or try larger numbers if you have the memory. Those xml related things are always
a bad memory hog.
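To see whether the flag actually took effect, a small check like the following can help. The class name HeapReport is made up; the Runtime calls are standard. Keep in mind that -Xmx on the command line applies to a JVM you launch yourself, while the JVM embedded in Matlab takes its heap setting from Matlab's own configuration.

```java
// Hypothetical helper: reports the heap limit the running JVM was given.
class HeapReport {
    // Upper bound the heap may grow to, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
        System.out.println("max heap MB:  " + maxHeapMegabytes());
        System.out.println("used heap MB: " + usedMb);
    }
}
```

Calling HeapReport.maxHeapMegabytes() from inside the same JVM that runs the parser shows the limit that JVM is really working with.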

kind regards,

Jos

ps. There are no 'unsigned' ints/longs in Java and they don't take up less space
either ;-)
May 24 '07 #4

blazedaces
100+
P: 284
I'll try using that, it'll come in handy I'm sure.

I think I know what I'm doing now, thanks for everything,
-blazed
May 24 '07 #5

Expert 10K+
P: 11,448
You're welcome; when you have to parse a 200MB xml file and want to store
all that data in Vectors (as you wrote before) you'll end up with quite a bit more
than just the size of that file. If the "online" approach I sketched is applicable,
go for it. best of luck and

kind regards,

Jos
May 24 '07 #6
