
xslt processing and memory needs

Hi,

I want to read records from a database and export them in an arbitrary
format.
My idea was to feed a class with a String array fetched from the
database and let
this class fire SAX events as input to the XSLT processor.

The basic class hierarchy is:

org.xml.sax.XMLReader [Interface]
          ^
          |
  AbstractXMLReader
          ^
          |
StringArrayXMLReader --------------------> TransformerHandler

The TransformerHandler is associated with a StreamResult
(ByteArrayOutputStream)
and the XSL-Stylesheet.
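
The setup described above can be sketched roughly as follows. This is a minimal sketch, not the code from the post: the class name SaxToXsltSketch and the method transform are hypothetical, and it drives the TransformerHandler directly through the standard JAXP SAXTransformerFactory API instead of going through the AbstractXMLReader hierarchy.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: fires dataset/row/value SAX events into an XSLT handler.
class SaxToXsltSketch {

    static String transform(String xsl, String[][] rows) throws Exception {
        // Assumes the default TransformerFactory supports the SAX feature,
        // which is what makes this cast legal.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler =
                factory.newTransformerHandler(new StreamSource(new StringReader(xsl)));

        // The handler is tied to a StreamResult backed by a ByteArrayOutputStream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        handler.setResult(new StreamResult(out));

        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "dataset", "dataset", noAttrs);
        for (String[] row : rows) {
            handler.startElement("", "row", "row", noAttrs);
            for (String value : row) {
                handler.startElement("", "value", "value", noAttrs);
                handler.characters(value.toCharArray(), 0, value.length());
                handler.endElement("", "value", "value");
            }
            handler.endElement("", "row", "row");
        }
        handler.endElement("", "dataset", "dataset");
        // As described in the post, the output is only complete after this call.
        handler.endDocument();
        return out.toString("UTF-8");
    }
}
```

Calling SaxToXsltSketch.transform(xsl, rows) with the stylesheet shown further down should yield one HTML table with a tr per row.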

In a first try I gave StringArrayXMLReader an init and a close method,
which call startDocument()/endDocument() and emit some header information.
The problem is that the XSLT processing takes place on endDocument(),
so all rows from the database are buffered until endDocument() is fired.
Because the result set from the database could be really large, this is
not an option for me.
To recap, the fired SAX events correspond to:
<?xml version="1.0" encoding="UTF-8"?>
<dataset>
<row><value>1-String</value><value>2-String</value><value>3-String</value></row>
<row><value>R1-String</value><value>R2-String</value><value>R3-String</value></row>
</dataset>

XSL was:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>
<xsl:template match="/">
<table border="1">
<xsl:apply-templates select="dataset/row"/>
</table>
</xsl:template>

<xsl:template match="dataset/row">
<tr>
<xsl:apply-templates select="value"/>
</tr>
</xsl:template>

<xsl:template match="value">
<td>
<xsl:value-of select='.'/>
</td>
</xsl:template>
</xsl:stylesheet>

My next idea was to treat each row of the database result set as
one input document for
the processor and to provide the header information manually.
I also modified my XSL stylesheet and moved the startDocument()/endDocument()
calls into the parse method.
The processor input is now:

<?xml version="1.0" encoding="UTF-8"?>
<row><value>1-String</value><value>2-String</value><value>3-String</value></row>

and the XSL now looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>
<xsl:template match="/">
<tr>
<xsl:apply-templates select="row/value"/>
</tr>
</xsl:template>

<xsl:template match="value">
<td>
<xsl:value-of select='.'/>
</td>
</xsl:template>
</xsl:stylesheet>

With this change I get a NullPointerException in
org.apache.xalan.transformer.TransformerImpl.run
on the second call of my StringArrayXMLReader.parse method.
The first call looks OK: the fired SAX events are converted properly by
the XSLT processor.
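
One thing that may be worth checking here: a TransformerHandler is generally meant to consume a single document, so sending a second startDocument()/endDocument() cycle through the same instance may be what fails. This has not been verified against the code in the post. A minimal sketch of the per-row variant, assuming the stylesheet is compiled once into a Templates object and a fresh handler is created for every row (the class PerRowTransformSketch and the method transformRows are hypothetical names):

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: one small input document per database row.
class PerRowTransformSketch {

    static String transformRows(String rowXsl, String[][] rows) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        // Compile the stylesheet once; Templates objects are reusable.
        Templates templates =
                factory.newTemplates(new StreamSource(new StringReader(rowXsl)));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Header and footer are written by hand, as in the post.
        out.write("<table border=\"1\">\n".getBytes(StandardCharsets.UTF_8));

        AttributesImpl noAttrs = new AttributesImpl();
        for (String[] row : rows) {
            // A new handler per row, so no handler sees two documents.
            TransformerHandler handler = factory.newTransformerHandler(templates);
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            handler.startElement("", "row", "row", noAttrs);
            for (String value : row) {
                handler.startElement("", "value", "value", noAttrs);
                handler.characters(value.toCharArray(), 0, value.length());
                handler.endElement("", "value", "value");
            }
            handler.endElement("", "row", "row");
            handler.endDocument(); // only this one row was held by the processor
        }

        out.write("\n</table>\n".getBytes(StandardCharsets.UTF_8));
        return out.toString("UTF-8");
    }
}
```

In a real export the ByteArrayOutputStream would be replaced by the actual output stream, so that neither the input rows nor the rendered rows accumulate in memory.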

Besides this specific problem I have a more general question:
Is it possible to transform XML with an XSLT processor without having
the whole document in memory?
Because my result set is very large, I cannot read it into memory.
I tried the SAX event mechanism because I thought it would give me the
least overhead.
However, treating a row as a complete document is not tidy either,
because I then lose the power of XSLT
for my header information.
Maybe I am misunderstanding something completely.
How would an XML/XSLT expert solve the problem?

Apr 18 '06 #1
3 Replies


th***************@osp-dd.de wrote:
> Is it possible to transform XML with an XSLT processor without having
> the whole document in memory?

In general, no, because the XSLT language has complete random access to
the source document.

If your stylesheet reliably reads only forward through the document,
some processors can take advantage of that if configured properly. For
example, Xalan's SQL extension, for its own convenience, doesn't attempt
to present the entire document at once and requires that you write the
stylesheet with that restriction in mind.

Automating recognition of that sort of opportunity and using it for
optimization is one of the holy grails of XSLT processing, and is an
ongoing area of research. If you search the archives of the Xalan
mailing list for the keywords "streaming", "pruning", and "filtering"
you'll find discussion of the challenges involved in making this work
properly. I know folks who are still trying to make it work.

Meanwhile, some XSLT processors (e.g. Xalan) switched from using DOMs
internally to using non-object-based data models for memory efficiency
reasons; that may help you.

In fact, that brings up another possibility. IBM has recently added a
native XML model to DB2, letting it function as a true XML database as
well as a relational database, and I believe they have implemented
XQuery for that. They may have implemented XSLT as well; if not, XQuery
is functionally interchangeable with XSLT 2.0 (XQuery and XSLT2 are
actually two sides of the same design effort) so you could probably
rewrite your stylesheet as an XQuery. This would let you leverage DB2's
intelligence in memory management. I believe trial/beta copies are
available from IBM's website.

> I tried the SAX event mechanism because I thought it would give me the
> least overhead.


That's another solution, of course: Write your own SAX-based processing
layer, storing only the information you Really Need in memory. If you
want to use that together with XSLT, what I'd suggest is that you create
a SAX-based filter to discard portions of your document that the
stylesheet doesn't need to see, then run XSLT on the result.
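
A SAX-based filter of that kind could look roughly like the sketch below. It is only an illustration under assumptions: the class name DropElementFilter is hypothetical, and so is the idea that the unwanted portions can be identified by a single element name. It builds on the standard org.xml.sax.helpers.XMLFilterImpl class and would sit between the event source and the TransformerHandler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: swallows every element with the given name,
// including its whole subtree, and forwards everything else unchanged.
class DropElementFilter extends XMLFilterImpl {
    private final String dropName;
    private int skipDepth = 0; // > 0 while inside a dropped subtree

    DropElementFilter(String dropName) {
        this.dropName = dropName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || qName.equals(dropName)) {
            skipDepth++; // track nesting so the matching end tag is found
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        if (skipDepth == 0) {
            super.ignorableWhitespace(ch, start, length);
        }
    }
}
```

The filter would be wired up with setContentHandler(transformerHandler), so the stylesheet only ever sees the reduced document. Note that this reduces what the processor has to hold, but does not by itself make the transformation streaming.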

If you really need to render everything, you can try writing a SAX-based
solution from the ground up, which may allow you to avoid storing
anything in memory for the long term -- IF your rendering permits you to
do so. Or you may find it makes sense to actually re-read the source
document several times rather than keeping too much in memory. As I say,
we do hope XSLT will eventually be optimized to the point where it can
make these decisions for you... but for now, there are still cases where
the right thing to do is drop down to a lower-level language, just as a
Java programmer will occasionally find they have to drop down to the
bytecode level or JNI native code to get the performance they need.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Apr 18 '06 #2

Joe Kesselman wrote:
> th***************@osp-dd.de wrote:
>> Is it possible to transform XML with an XSLT processor without having
>> the whole document in memory?
> In general, no, because the XSLT language has complete random access to
> the source document.
>
> If your stylesheet reliably reads only forward through the document,
> some processors can take advantage of that if configured properly. For
> example, Xalan's SQL extension, for its own convenience, doesn't attempt
> to present the entire document at once and requires that you write the
> stylesheet with that restriction in mind.


Actually my stylesheets are very simple.
They all work on a very simple XML input, which is:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
<row><value>1-String</value><value>2-String</value></row>
<row><value>R1-String</value><value>R2-String</value></row>
</dataset>

The problem is that there can be very many row entries.

> Automating recognition of that sort of opportunity and using it for
> optimization is one of the holy grails of XSLT processing, and is an
> ongoing area of research. If you search the archives of the Xalan
> mailing list for the keywords "streaming", "pruning", and "filtering"
> you'll find discussion of the challenges involved in making this work
> properly. I know folks who are still trying to make it work.
>
> Meanwhile, some XSLT processors (e.g. Xalan) switched from using DOMs
> internally to using non-object-based data models for memory efficiency
> reasons; that may help you.

I do not create a DOM document as input, only SAX events.
I think the problem is not the XML part but the XSLT processing part.
The problem is that the TransformerHandler only starts its work when the
endDocument() call happens. When I read many <row> data records, it
must buffer all the data read from the database in memory.

> In fact, that brings up another possibility. IBM has recently added a
> native XML model to DB2, letting it function as a true XML database as
> well as a relational database, and I believe they have implemented
> XQuery for that. They may have implemented XSLT as well; if not, XQuery
> is functionally interchangeable with XSLT 2.0 (XQuery and XSLT2 are
> actually two sides of the same design effort) so you could probably
> rewrite your stylesheet as an XQuery. This would let you leverage DB2's
> intelligence in memory management. I believe trial/beta copies are
> available from IBM's website.

We use an Oracle database, but I cannot use its internal XML
capabilities because
I'm tied to our company's database access framework. I can only build the
output on top of
the database result sets.

>> I tried the SAX event mechanism because I thought it would give me the
>> least overhead.
> That's another solution, of course: Write your own SAX-based processing
> layer, storing only the information you Really Need in memory. If you
> want to use that together with XSLT, what I'd suggest is that you create
> a SAX-based filter to discard portions of your document that the
> stylesheet doesn't need to see, then run XSLT on the result.

I think the "then run XSLT on the result" step is the problem.
When I start the fetch from the database I call an init() method, which
calls startDocument(); then I fetch row by row and call parse().
In parse() the <row><value>... events are fired.
When all records from the database have been read, I call a close() method,
which itself
calls endDocument(). Only now does the XSLT processing take place.

> If you really need to render everything, you can try writing a SAX-based
> solution from the ground up, which may allow you to avoid storing
> anything in memory for the long term -- IF your rendering permits you to
> do so. Or you may find it makes sense to actually re-read the source
> document several times rather than keeping too much in memory.

Because I fetch the data from a database, I cannot read it multiple
times.
But this is not the point. I only fetch the data I really need, put it
in a String array,
and produce the startElement, endElement and characters calls on my
ContentHandler.
The code is at:
http://randspringer.de/sax_row.tar
http://randspringer.de/sax.tar

> As I say,
> we do hope XSLT will eventually be optimized to the point where it can
> make these decisions for you... but for now, there are still cases where
> the right thing to do is drop down to a lower-level language, just as a
> Java programmer will occasionally find they have to drop down to the
> bytecode level or JNI native code to get the performance they need.
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry


Apr 19 '06 #3

As I said: XSLT processors are still learning when they can
stream/prune/filter the incoming data.

For now, and for your particular case, you'll be better off hand-coding
a SAX-based solution.
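
For the dataset/row/value input shown earlier in the thread, a hand-coded SAX solution might look something like the sketch below. It is only an illustration, not tested against the poster's framework: the class name RowTableHandler is hypothetical, and it assumes the output is the same simple HTML table the stylesheet produced. Each event is written out immediately, so nothing beyond the current value is held in memory.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: renders dataset/row/value events as an HTML table,
// writing each piece to the Writer as soon as the event arrives.
class RowTableHandler extends DefaultHandler {
    private final Writer out;
    private boolean inValue = false;

    RowTableHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        switch (qName) {
            case "dataset": write("<table border=\"1\">\n"); break;
            case "row":     write("<tr>"); break;
            case "value":   write("<td>"); inValue = true; break;
            default:        break; // unknown elements are ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        switch (qName) {
            case "dataset": write("</table>\n"); break;
            case "row":     write("</tr>\n"); break;
            case "value":   write("</td>"); inValue = false; break;
            default:        break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inValue) {
            return; // text outside <value> is not rendered
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '&') sb.append("&amp;");
            else if (c == '<') sb.append("&lt;");
            else if (c == '>') sb.append("&gt;");
            else sb.append(c);
        }
        write(sb.toString());
    }

    private void write(String s) throws SAXException {
        try {
            out.write(s);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }
}
```

Such a handler could be registered as the ContentHandler of the existing StringArrayXMLReader in place of the TransformerHandler. The trade-off is that the output format now lives in Java code rather than in a stylesheet.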
Apr 19 '06 #4
