Problem: "java.lang.OutOfMemoryError: Java heap space" while reading xml using SAX

blazedaces
Ok, so you know my problem: Java is running out of memory while reading with SAX, the event-based XML parser that is meant, more so than DOM, for extremely large files. I'll try to explain what I've been doing and why I have to do it. Hopefully someone has a suggestion...

Alright, so I'm using a GPS-simulation program that outputs GPS data such as longitude, latitude, altitude, etc. (hundreds of terms; these are just the well-known ones). In the newer version the data is output in an XML sheet. In prior versions the output for a 2-hour scenario was about 30-80 megabytes. Now the one XML sheet (they wanted to be stupidly efficient) is over 200 megabytes. Given that scenarios/simulations can run for over 24 hours, you can see the problem I had when told "we need a new program to read in the new version data when we won't be able to get old version stuff anymore". Note: no one at the company even knew what XML really was, and my knowledge extended to... "XML sounds like HTML, right?"

We have a program for graphing in Matlab, so in the end the data has to be in Matlab. So first I looked at what Matlab has for XML reading: DOM. DOM reads ALL the data in an XML file and then creates a "tree" within which you can look at the data. To put it simply, DOM separates navigation from data collection: first it sets up its "tree", and then it lets you navigate to wherever you want. This has obvious drawbacks, but the main one I cared about was that it's bad at reading extremely large files when you don't need every bit of the data.

So, the alternative: SAX. SAX merges navigation and data-collection by using an "event-based" parser. The events we care most about are probably "startElement", "endElement", and "characters". You can imagine what these three do.
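For illustration, here is a minimal sketch of those three callbacks using the SAX classes that ship with the JDK. The class name `SaxEventDemo` and the element names `fix` and `lat` are made up for the example; the real simulator output will use different tags.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventDemo {

    /** Records each SAX event as a line of text so the call order is visible. */
    static class LoggingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser is allowed to deliver one text node in several chunks.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text " + text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end " + qName);
        }
    }

    /** Parses the given XML string and returns the events in the order they fired. */
    static List<String> parse(String xml) throws Exception {
        LoggingHandler handler = new LoggingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical element names, used only for this demo.
        for (String event : parse("<fix><lat>0.5</lat></fix>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is kept between events unless the handler chooses to keep it, which is why memory use depends entirely on what the handler stores.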

Unfortunately, Matlab doesn't have a ready-made XML reader built on SAX, though nothing really does: to use SAX you have to write a program that tells the SAX reader what to do when an event occurs. SAX was originally written in Java, so I can write the program in Java, and Matlab has its own JVM, so I can use Java classes in Matlab (not easy).

Ok, so it felt like this when I finally gathered this knowledge: "so all I have to do is write a Java program that uses this SAX thing, then use Matlab to run the program, store the data points in memory and transfer them to Matlab, one after the other, then simply do what I did with the previous version data, put it in our 'graphing program'."

After 2 weeks of extensive research into SAX, Java, and Matlab, and a lot of programming, here I am. The parser program in Java was done, I knew how to use it in Matlab, and it was reading the XML files perfectly (the test ones I made for 5-second scenarios were about 2 MB)! Then I finally tested it on the big file, but to no avail: too little memory (which brings us to the present).

I just want to say that before I did this I didn't know what a jar file is, what Java's class path is, what try/catch and exceptions are, or what implements/interfaces and extends mean. And most of all I knew absolutely nothing about SAX, which is hard enough to work with once you understand it fully.

I'm not saying this to be proud; I just wanted to point out that I still think of myself as a novice, if that, when it comes to all this, which is why I'm asking for help here...

Anyways, now that you guys know the background (thought it might help), do you have any suggestions for avoiding the problem of needing too much memory? The only solution that comes to mind is something like reading part of the file, storing just the data I need from it in a separate file on the hard drive, closing that file, clearing all my data in memory, and repeating until I get to the end. Any other suggestions, guys?

Thanks, all your help is much appreciated,
-blazed
May 24 '07 #1
JosAH
I suspect that you're doing things this way now:

1) start your SAX parser
2) your Handler(s) build up an entire data structure
3) you spit out everything to Matlab;
4) done.

This is an "off-line" approach which is essentially what a DOM does: collect
the whole shebang from some xml source and build up the entire tree. Can't
you use a "streaming" or "on-line" approach:

1) start your SAX parser
2) your Handler(s) build up a bit of data
3) spit it out to Matlab and forget about it
4) if more to parse repeat steps 2 and 3
5) done.

This way you can forget about the small parts of data you've collected and,
the most important part, that data will be released so you don't need to keep
humongous amounts of data in core.
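The "on-line" steps above could be sketched roughly as follows. This is only an illustration under assumptions: the names `StreamingRecords`, `RecordSink`, and the `fix`/`lat`/`lon` tags are hypothetical, and it assumes each record is a flat element whose children hold plain text. The handler keeps exactly one record at a time, hands it to the sink, and then drops its reference so the garbage collector can reclaim it.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StreamingRecords {

    /** Receives one finished record at a time (hypothetical interface). */
    interface RecordSink {
        void accept(Map<String, String> record);
    }

    static class RecordHandler extends DefaultHandler {
        private final String recordTag;
        private final RecordSink sink;
        private final StringBuilder text = new StringBuilder();
        private Map<String, String> current; // null while outside a record

        RecordHandler(String recordTag, RecordSink sink) {
            this.recordTag = recordTag;
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals(recordTag)) {
                current = new LinkedHashMap<>();
            }
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                text.append(ch, start, length); // text may arrive in chunks
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return;
            }
            if (qName.equals(recordTag)) {
                sink.accept(current); // step 3: spit it out...
                current = null;       // ...and forget about it
            } else {
                current.put(qName, text.toString().trim());
            }
            text.setLength(0);
        }
    }

    static void parse(Reader xml, String recordTag, RecordSink sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(xml), new RecordHandler(recordTag, sink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><fix><lat>0.1</lat><lon>0.2</lon></fix>"
                   + "<fix><lat>0.3</lat><lon>0.4</lon></fix></log>";
        parse(new StringReader(xml), "fix", record -> System.out.println(record));
    }
}
```

With this shape, peak memory is governed by the size of one record rather than the size of the file, provided the sink does not hold on to what it receives.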

kind regards,

Jos (<--- sincerely *hates* xml)
May 24 '07 #2
blazedaces
Great idea, and I will definitely do it that way when I implement the Matlab side. But if Java can't store all the data in vectors of vectors, how will Matlab store it all in its variables?

Like when you say "3) spit it out to Matlab and forget about it" isn't that just allocating a different spot in memory?

I think, though, that I can build on your idea of the "on-line" solution: as the data goes to Matlab, instead of just forgetting about it, I process it as needed (actually, almost no processing is necessary except maybe converting radians to degrees), store it in a file saved on the hard drive, close the file, and clear the previous memory (forget about it, as you say).
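One way this could look on the Java side is a small writer that converts each value and appends it to a text file straight away, so nothing accumulates in memory. This is only a sketch: the class name `DegreesCsvWriter`, the two-column layout, and the header names are assumptions made up for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Appends one latitude/longitude pair per line, converted from radians to degrees. */
class DegreesCsvWriter implements AutoCloseable {
    private final BufferedWriter out;

    DegreesCsvWriter(Path file) throws IOException {
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        out.write("lat_deg,lon_deg"); // hypothetical column names
        out.newLine();
    }

    /** Writes a single row; the values are not retained after this call. */
    void writeFix(double latRad, double lonRad) throws IOException {
        out.write(Math.toDegrees(latRad) + "," + Math.toDegrees(lonRad));
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        out.close(); // flushes any buffered rows to disk
    }
}
```

A SAX handler would call `writeFix` once per record from `endElement`, and the writer would be closed when the parse finishes. The resulting file is plain comma-separated text that can be loaded afterwards for graphing.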

You might have meant that and I'm misunderstanding; if so, I apologize.

The other idea I'm coming up with has to do with the data I'm collecting. I could specify what data to look for, do one "run", collect that (using the on-line approach) and store it where I need it, then do another run for the next set of data I decide upon. For example: read the header section and save it in a text file called header information; read the signal information and save it for the graph program; read the position info (lat, long, alt, ecef-x,y,z, etc.) and save it; and so on.
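A selective pass like this can be sketched as a handler that is told which element names to capture and ignores everything else. The names below (`SelectivePass`, `collect`, and the `lat`/`lon`/`alt` tags) are hypothetical, and the sketch assumes the wanted elements are leaves that contain only text. It returns a list purely to keep the demo short; a real run over a 200 MB file would write each value out as it is found instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SelectivePass {

    static class SelectiveHandler extends DefaultHandler {
        final List<String> found = new ArrayList<>();
        private final Set<String> wanted;
        private final StringBuilder text = new StringBuilder();
        private boolean capturing;

        SelectiveHandler(Set<String> wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            capturing = wanted.contains(qName); // only buffer text we care about
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (capturing) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (capturing && wanted.contains(qName)) {
                found.add(qName + "=" + text.toString().trim());
            }
            capturing = false;
        }
    }

    /** Runs one pass over the XML and returns "name=value" for each wanted leaf element. */
    static List<String> collect(String xml, Set<String> wanted) throws Exception {
        SelectiveHandler handler = new SelectiveHandler(wanted);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<fix><lat>1.5</lat><alt>10</alt><lon>2.5</lon></fix>";
        Set<String> position = new HashSet<>(Arrays.asList("lat", "lon"));
        System.out.println(collect(xml, position));
    }
}
```

The trade-off is that each pass re-reads the whole file, so several passes cost more time in exchange for a smaller working set per pass.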

Thanks again,
-blazed (<--- also hates xml)

P.S. I'm also thinking of other ways to improve the memory efficiency of my program. When I wrote this, to be lazy with the programming, I stored everything as a string. Maybe I should be specific and store values I know to be integers or doubles as such, which would probably save a lot of space in the long run.
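As a rough sketch of that idea, numeric text can be parsed as soon as it is read and kept in a primitive `double[]`, which costs 8 bytes per value, instead of in a `String` object per value, which carries object overhead on top of its characters. The class name `DoubleColumn` is made up for the example.

```java
import java.util.Arrays;

/** A growable column of doubles backed by a primitive array. */
class DoubleColumn {
    private double[] values = new double[16];
    private int size;

    /** Parses the raw element text and stores it as a primitive double. */
    void add(String rawText) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // grow by doubling
        }
        values[size++] = Double.parseDouble(rawText.trim());
    }

    int size() {
        return size;
    }

    /** Returns a copy trimmed to the number of values actually stored. */
    double[] toArray() {
        return Arrays.copyOf(values, size);
    }
}
```

`Double.parseDouble` throws `NumberFormatException` on text that is not a number, so a handler using this would need to decide how to treat malformed values.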

Also, I don't know how much it matters, but what about changing all those for loops (int i = 0; i < length; i++) to use unsigned int? Does that save space, or does it just let the counter reach twice as large a positive value?
May 24 '07 #3
JosAH
There's another option: the java JVM has a flag that tells it how large the heap
is allowed to grow:
    java -Xmx512m YourApplication
or try larger numbers if you have the memory. Those xml related things are always
a bad memory hog.
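To check whether a larger `-Xmx` value actually took effect, the running JVM can be asked for its limits through `java.lang.Runtime`. A small sketch follows; the class name `HeapReport` is just for illustration.

```java
class HeapReport {

    /** Upper bound the heap may grow to, in MiB (this is what -Xmx controls). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Heap currently in use, in MiB: allocated heap minus the free part of it. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

Printing these numbers before and after a parse gives a rough idea of how much the handler is really holding on to.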

kind regards,

Jos

ps. There are no 'unsigned' ints/longs in Java and they don't take up less space
either ;-)
May 24 '07 #4
blazedaces
I'll try using that, it'll come in handy I'm sure.

I think I know what I'm doing now, thanks for everything,
-blazed
May 24 '07 #5
JosAH
You're welcome; when you have to parse a 200MB xml file and want to store
all that data in Vectors (as you wrote before) you'll end up with quite a bit more
than just the size of that file. If the "on-line" approach I sketched is applicable,
go for it. Best of luck and

kind regards,

Jos
May 24 '07 #6
