Bytes | Software Development & Data Engineering Community
XSLT transformation of a large XML file using Java results in OutOfMemory

Hi

I'm attempting additions/changes to a Java program that (among other
things) uses XSLT to transform a large (96 MB) XML file. It runs fine on
small XML files but generates OutOfMemory exceptions with large XML
files. I tried the simple punt of -Xmx512m, but that didn't work. In the
future, the input XML file may become considerably bigger than 96 MB, so
even if it had worked, it would probably only have put off the
inevitable to some later date.

I'm using Java SE 1.4.2_11 and the XSL/XML libraries that come with it.
The transformation is XML to XML. The code I inherited looks a lot like
most of the example code you can find on the net for doing an XSLT
transformation. The relevant part is:

TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer(xsltSource);
transformer.transform(new StreamSource(new StringReader(x)), xsltDest);

where xsltSource is the XSLT in the form of a string, generated by code
immediately above the snippet shown, and x is the input XML to be
transformed.

Things I tried:

1. I modified the above code to use a file instead of a String for the
XML to be transformed, and a file for the XSLT that specifies the
transformation. It works fine with small XML input files but not with
large ones. I assume this code uses a DOM parser, and there is simply
not enough room in memory to hold the input XML file.

2. Based on some old (years-old) newsgroup posts I found, I tried a SAX
equivalent of the above code, assuming that SAX takes in, parses and
transforms the input XML file either piecemeal (maybe element by
element?) or that SAX uses the complete virtual memory of the computer.
But this code also runs successfully on small input XML files and gives
OutOfMemory errors on large ones. Here is a snip of the SAX code
(adapted from a chapter of Burke's "XSLT and Java" on the O'Reilly
website):

FileInputStream brXSLT = new FileInputStream(
        "C:/Documents and Settings/Lenny/Desktop/OCCxsl.xsl");

// Set up the transformer
TransformerFactory transFact = TransformerFactory.newInstance();
SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
Source xsltSource = new StreamSource(brXSLT);
TransformerHandler transHand = saxTransFact.newTransformerHandler(xsltSource);

// Set up the input source
InputSource inxml = new InputSource(inXML);

// Set the destination for the XSLT transformation
transHand.setResult(new StreamResult(outXML));

// Attach the XSLT processor to the XMLReader
String parserClass = "org.apache.crimson.parser.XMLReaderImpl";
XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

// Parse the input file to the output file
reader.setContentHandler(transHand);
reader.parse(inxml);

I'm considering writing a custom parser of the input XML file that
identifies each element of the input and treats it as if it were a
complete document, i.e. sending the content handler

ch.startDocument();
ch.startElement(..);  // pass through the original element
ch.characters(..);    // "
ch.endElement(..);    // "
ch.endDocument();

for each element in the input XML file.

But being a newbie to XSLT, I don't know if this is worth pursuing, or
even if it would work; I'm hoping there are simpler, more
straightforward ways of accomplishing the same thing at a higher level.
It does seem pretty clumsy, even if it would work.
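To make the idea concrete, here is a rough, untested sketch of such a
filter (the depth handling assumes the records are the direct children
of the root element; the class names and the demo input are my own
inventions):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Presents each direct child of the root element as its own tiny document.
public class PerElementFilter extends XMLFilterImpl {
    private int depth = 0;

    public void startDocument() { }              // swallow the real document events
    public void endDocument() { }

    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        depth++;
        if (depth == 2) super.startDocument();   // a record begins: open a mini-document
        if (depth >= 2) super.startElement(uri, local, qName, atts);
    }

    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth >= 2) super.endElement(uri, local, qName);
        if (depth == 2) super.endDocument();     // record done: close the mini-document
        depth--;
    }

    public void characters(char[] ch, int start, int len) throws SAXException {
        if (depth >= 2) super.characters(ch, start, len);
    }

    // Demo: count how many mini-documents the filter emits.
    public static int countMiniDocs(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        PerElementFilter filter = new PerElementFilter();
        filter.setParent(parser);
        final int[] docs = new int[1];
        filter.setContentHandler(new DefaultHandler() {
            public void startDocument() { docs[0]++; }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return docs[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMiniDocs("<root><a/><b/><c/></root>"));  // prints 3
    }
}
```

Whether a downstream TransformerHandler would actually accept repeated
startDocument/endDocument events is exactly the part I'm unsure about.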

I found a reply on the web to someone who had a similar problem, to the
effect that a "SAX pipeline" should be used. But there was no further
elaboration, and so far I haven't figured out what a SAX pipeline is or
how it would help.

Any advice, references to examples, or actual examples would be
greatly appreciated.

Non-procedural programming is taking quite a bit of effort to
understand!

Thanks in advance for your help.

Lenny Wintfeld

ps - I've had this up on comp.lang.java.programmer for most of the day
with no replies. It bridges both specialties, which is why I'm trying
here.

May 17 '06 #1
In general, XSLT can't operate as a streaming processor, since its use
of XPaths assumes the entire document is available in memory (or at
least can be re-read) at once. Some processors use more compact models
than others and thus may be able to handle larger documents in the same
memory; this is part of why Xalan created its own model, known as DTM,
rather than using an off-the-shelf DOM implementation.

If you're willing to limit the kinds of stylesheets you write to ones
which _only_ process the document in forward order, you can of course
set up a minimal data model which just contains one (or a few) nodes;
Xalan's SQL extension works that way, actually.

Yes, automatically recognizing which stylesheets (or portions thereof)
are streamable would be a Good Thing, but it's still something of a Holy
Grail for XSLT implementers. If you look in the archives of the Xalan
mailing list, you'll see much past discussion of this, and of possible
approaches to dealing with it. Look in particular for the keywords
"streaming", "pruning", and "filtering". Folks are continuing to
research this, but it is not an easy problem.

But until someone does get a handle on this problem... Sometimes, if you
have to process large documents, the only good answer is to drop down
from XSLT to a lower level and code the processing yourself as a direct
SAX application. That lets you take advantage of whatever
streaming/pruning/filtering opportunities exist, as well as letting you
code a special-purpose (and thus more compact) model for any data you do
have to retain. High-level languages are a good thing, but some problems
are still best addressed by low-level bit-twiddling.
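As a minimal illustration of what such a hand-coded SAX application
looks like (the element names here are invented; a real one would match
your vocabulary, and would write to a file rather than a buffer):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Streams the document; memory use is bounded by one record's worth of
// text, not by the size of the whole file.
public class StreamingExtractor extends DefaultHandler {
    private final StringBuffer current = new StringBuffer();
    private final StringBuffer output = new StringBuffer();
    private boolean inName = false;

    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("name".equals(qName)) { inName = true; current.setLength(0); }
    }
    public void characters(char[] ch, int start, int len) {
        if (inName) current.append(ch, start, len);
    }
    public void endElement(String uri, String local, String qName) {
        if ("name".equals(qName)) {
            inName = false;
            output.append("<item>").append(current).append("</item>");
        }
    }

    public static String run(String xml) throws Exception {
        SAXParser p = SAXParserFactory.newInstance().newSAXParser();
        StreamingExtractor h = new StreamingExtractor();
        p.parse(new InputSource(new StringReader(xml)), h);
        return "<out>" + h.output + "</out>";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<db><rec><name>A</name></rec><rec><name>B</name></rec></db>"));
        // prints <out><item>A</item><item>B</item></out>
    }
}
```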
May 17 '06 #2
Joe Kesselman wrote:
In general, XSLT can't operate as a streaming processor, since its use
of XPaths assumes the entire document is available in memory (or at
least can be re-read) at once. Some processors use more compact models
than others and thus may be able to handle larger documents in the same
memory; this is part of why Xalan created its own model, known as DTM,
rather than using an off-the-shelf DOM implementation.


Perhaps it's appropriate to mention Omnimark, which uses a technique
sometimes known as "write-behind" (borrowed from the hardware field).
Instead of having an addressing scheme (XPath) for accessing objects
out of document sequence, it provides for the placement of references
to named anchors at the places where you know (or have computed) you
will need to access such objects, and then creating the anchors
themselves when you encounter them in document order. When the last
event in document order has triggered, the "write-behind" reconciliation
takes place, and all the values of the anchors are slotted into the
places reserved for them by the references.

(At least, this is how it used to work: I haven't used it for years.)

///Peter
--
XML FAQ: http://xml.silmaril.ie/
May 17 '06 #3
Thanks very much for your reply and advice. It's a shame that the XSLT
transform engines can't (at least as an option) use virtual memory as
their target environment for XML data file transformations. It looks
like I may have a long row to hoe in doing the equivalent of the
transform using procedural code! The sad part is, the transformations
that are done to these XML files using XSLT seem custom-made for XSLT!

Just a couple of quick follow-ups: 1. Note that the transformation being
done is XML to XML. Except for a sort, which could be broken out of the
XSLT stylesheet and done procedurally after the transformation is
complete, all other transformations in the stylesheet are local to small
elements in the XML being transformed, and there are no dependencies
between them. With those restrictions, is there a way to mechanize a
sequential (element-by-element) transformation? If so, could you point
me to some examples? 2. I'm tantalized by the reference I noted in my
original post suggesting that a "SAX pipeline" be used to process very
large XML files. To me that sounds like a sequential processor of XML
with XSLT. Do you know where I could get additional info on a "SAX
pipeline", or might this have been some wishful thinking on the part of
its author?

Once again, thanks for your feedback.

Lenny Wintfeld

May 19 '06 #4
le****@comcast.net wrote:
Just a couple of quick follow-ups: 1. Note that the transformation that
is being done is XML to XML. Except for a sort, which could be broken
out of the XSLT stylesheet and done procedurally after the
transformation is complete, all other transformations in the stylesheet
are local to small elements in the XML being transformed and there are
no dependencies between these. With those restrictions, is there a way
to mechanize a sequential (element-by-element) transformation? If so
could you point me to some examples?

It sounds like your focus is on large files (> 100 MB) and you may be
willing to give up XSL and Java in order to solve the problem. The
following tool is not so specialized in producing XML files, but it can
handle 1 GB of data within 1 or 2 minutes:

http://home.vrweb.de/~juergen.kahrs/...of-an-XML-file

2. I'm tantalized by the reference that I noted in my original post to a
suggestion that a "SAX Pipeline" be used to process very large XML
files. To me that sounds like a sequential processor of XML with XSLT.
Do you know where I could get additional info on a "SAX Pipeline", or
might this have been some wishful thinking on the part of its author?

Maybe this one helps:

Pipestreaming microformats
http://www-128.ibm.com/developerwork...matters44.html
May 19 '06 #5
le****@comcast.net wrote:
Thanks very much for your reply and advice. It's a shame that the XSL
transform engines can't (at least as an option) use virtual memory as
their target environment for xml data file transformations.
Generally, XSLT transformers *will* use virtual memory if the language
they're running in and the operating system they're running on support
it -- they just don't try to do the memory management themselves; they
trust the system to do it for them. And in fact Java does use virtual
memory... but the JVM you're using won't let you set that limit high
enough for this particular document.
It looks
like I may have a long row to hoe in doing the equivalent of the
transform using procedural code! The sad part is, the transfomations
that are done to these XML files using XSLT seem to be custom made for
XSLT!
I know how you feel. All I can say is that I know folks who are working
on finding ways to address this, so In The Future Things Should Be
Better. The concepts are relatively straightforward; the hard part is
translating them into rules the machine can apply.
transformation is complete, all other transformations in the stylesheet
are local to small elements in the xml being transformed and there are
no dependencies between these. With those restrictions, is there a way
to mechanize a sequential (element-by-element) transformation?
I agree that this is exactly the kind of problem that ought to be
streamable... There's no portable way to leverage that, but a specific
XSLT processor may have a way to handle it. To take the example I know
best: Xalan's internal data representation does happen to have the
ability to "prune off" the most recently added nodes, so an explicit
call to an extension function could, theoretically, discard the element
once you're done processing it. In fact, one of Xalan's more obscure and
underdocumented extensions does discard trees, though only in specific
situations; we added that to handle the
foreach-over-a-list-of-document()s situation... but I don't think
there's a generalized version which would address your case. (We'd
started investigating one, actually, then Other Priorities Intervened.)
could you point me to some examples? 2. I'm tantlized by the reference
that I noted in my original post to a suggestion that a "SAX Pipeline"
be used to process very large XML files. To me that sounds like a
sequential processor of XML with XSLT.


I think that was probably intended to be a reference to hand-coded SAX
processing.

But actually, you *could* do a compromise: hand-code a SAX processor
which essentially breaks the large document up into a series of smaller
ones and runs XSLT transforms on each one via its API (e.g. TrAX, if
you're working in Java), then reassembles the output of those
transformations into a single document again.
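A rough sketch of that compromise, with the record-splitting step
assumed to have happened already and the identity transform standing in
for your real stylesheet (both are placeholders, not your actual code):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Each record is transformed on its own, so only one record's tree is in
// memory at a time; the per-record outputs are stitched back together.
public class ChunkedTransform {
    public static String transformAll(String[] records) throws Exception {
        // No-arg newTransformer() gives the identity transform; a real
        // application would pass its stylesheet Source here instead.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringBuffer out = new StringBuffer("<result>");
        for (int i = 0; i < records.length; i++) {
            StringWriter w = new StringWriter();
            t.transform(new StreamSource(new StringReader(records[i])),
                        new StreamResult(w));
            out.append(w.toString());
        }
        return out.append("</result>").toString();
    }

    public static void main(String[] args) throws Exception {
        String[] recs = { "<rec id=\"1\"/>", "<rec id=\"2\"/>" };
        System.out.println(transformAll(recs));
    }
}
```

In real use the records would be carved out of the big file by a SAX
handler and fed through one at a time, never all held in memory at once.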

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
May 20 '06 #6
Jürgen, I looked at your reference to xmlgawk in some detail, and it
seems pretty encouraging; not only for the problem I stated, but for web
tie-ins on XML data. I will look at your document in more detail and at
the references (especially XMLBooster, xmllib and Expat). But in the
meantime, could you let me know directly, or provide me with some info
on the following: how would I tie xmlgawk in to my primary
application(s) in Java? Would I do the equivalent of an exec(..) of the
awk processor and then look for an exit code, or is there a library that
ties it in more directly (similar to the XSLT library for Java)?
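Concretely, the sort of exec(..) tie-in I have in mind looks like this
(the gawk command name and script path are hypothetical, and I haven't
tried it):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Launches an external command and waits for its exit code, draining
// stdout so the child process can't block on a full pipe.
public class ExternalFilter {
    public static int runCommand(String[] cmd) throws Exception {
        Process p = Runtime.getRuntime().exec(cmd);
        BufferedReader r =
            new BufferedReader(new InputStreamReader(p.getInputStream()));
        while (r.readLine() != null) { /* forward or discard tool output */ }
        return p.waitFor();   // 0 means the script ran cleanly
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical invocation; assumes gawk is on the PATH.
        int status = runCommand(new String[] { "gawk", "-f", "transform.awk", "in.xml" });
        System.out.println("exit code: " + status);
    }
}
```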

I'm looking forward to seeing if xmlgawk would be a reasonable half step
between purely procedural code and XSLT; either permanently, or until
XSLT can handle the kinds of XML files I'm called on to process.

Thanks for the reference!

Lenny W.

May 22 '06 #7

This thread has been closed and replies have been disabled. Please start a new discussion.
