XSLT processing and memory needs

thomas.porschberg

Hi,

I want to read records from a database and export them in an arbitrary
format.
My idea was to feed a class with a String array fetched from the
database and let this class fire SAX events as input to the XSLT
processor.

The basic class hierarchy is:
org.xml.sax.XMLReader [Interface]
          ^
          |
  AbstractXMLReader
          ^
          |
StringArrayXMLReader --------------------> TransformerHandler

The TransformerHandler is associated with a StreamResult
(ByteArrayOutputStream) and the XSL stylesheet.
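
As a minimal sketch of the setup described above (not the poster's actual code): the class name DatasetPipeline and the methods fireRow and transform are hypothetical, and the sketch assumes the default JAXP TransformerFactory supports the SAX feature and that no value in a row is null. It fires the whole result set as one document, which is exactly the variant that buffers every row until endDocument().

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class DatasetPipeline {
    // Shared empty attribute list; it is never modified.
    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    /** Fires <row><value>..</value>..</row> for one database row. */
    static void fireRow(ContentHandler h, String[] row) throws SAXException {
        h.startElement("", "row", "row", NO_ATTRS);
        for (String v : row) {
            h.startElement("", "value", "value", NO_ATTRS);
            h.characters(v.toCharArray(), 0, v.length());
            h.endElement("", "value", "value");
        }
        h.endElement("", "row", "row");
    }

    /** Sends all rows through the stylesheet as ONE <dataset> document. */
    static String transform(String xsl, Iterable<String[]> rows) throws Exception {
        // Assumption: the default factory is a SAXTransformerFactory.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler =
                factory.newTransformerHandler(new StreamSource(new StringReader(xsl)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "dataset", "dataset", NO_ATTRS);
        for (String[] row : rows) {
            fireRow(handler, row);
        }
        handler.endElement("", "dataset", "dataset");
        handler.endDocument(); // output is complete only after this call
        return out.toString("UTF-8");
    }
}
```

In a real exporter the rows would come from the database cursor rather than an in-memory collection; the Iterable parameter only stands in for that.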

In a first attempt I gave StringArrayXMLReader an init() and a close()
method, which fire startDocument()/endDocument() and some header
information.
The problem is that the XSLT processing only takes place on
endDocument(), so all rows from the database are buffered until
endDocument() is fired.
Because the result set from the database can be really large, this is
not an option for me.
To recap, the fired SAX events were equivalent to:
<?xml version="1.0" encoding="UTF-8"?>
<dataset>
  <row><value>1-String</value><value>2-String</value><value>3-String</value></row>
  <row><value>R1-String</value><value>R2-String</value><value>R3-String</value></row>
</dataset>

XSL was:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <table border="1">
      <xsl:apply-templates select="dataset/row"/>
    </table>
  </xsl:template>

  <xsl:template match="dataset/row">
    <tr>
      <xsl:apply-templates select="value"/>
    </tr>
  </xsl:template>

  <xsl:template match="value">
    <td>
      <xsl:value-of select='.'/>
    </td>
  </xsl:template>
</xsl:stylesheet>

My next idea was to treat each row of the database result set as one
input document for the processor and to provide the header information
manually.
I also modified my XSL stylesheet and moved the
startDocument()/endDocument() calls into the parse() method.
The processor input is now:

<?xml version="1.0" encoding="UTF-8"?>
<row><value>1-String</value><value>2-String</value><value>3-String</value></row>

and the XSL looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <tr>
      <xsl:apply-templates select="row/value"/>
    </tr>
  </xsl:template>

  <xsl:template match="value">
    <td>
      <xsl:value-of select='.'/>
    </td>
  </xsl:template>
</xsl:stylesheet>

With this change I now get a NullPointerException in
org.apache.xalan.transformer.TransformerImpl.run
on the second call of my StringArrayXMLReader.parse() method.
The first call looks OK: the fired SAX events are transformed properly
by the XSLT processor.
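
The thread does not establish the cause of the NullPointerException. One plausible explanation, offered here only as an assumption, is that the same TransformerHandler instance receives a second startDocument(): a handler is generally meant for a single transformation, whereas a compiled Templates object can be reused. The sketch below compiles the per-row stylesheet once and creates a fresh handler for every row, all writing to the same Writer. The class name PerRowTransformer and its method writeRow are hypothetical, and the row values are assumed to be non-null.

```java
import java.io.StringReader;
import java.io.Writer;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class PerRowTransformer {
    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    // Assumption: the default factory is a SAXTransformerFactory.
    private final SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
    private final Templates templates; // compiled once, reusable
    private final Writer out;          // shared by all rows

    public PerRowTransformer(String rowXsl, Writer out)
            throws TransformerConfigurationException {
        this.templates = factory.newTemplates(new StreamSource(new StringReader(rowXsl)));
        this.out = out;
    }

    /** Transforms one row as a complete, tiny document. */
    public void writeRow(String[] row)
            throws SAXException, TransformerConfigurationException {
        // A new handler per document; only the Templates object is shared.
        TransformerHandler handler = factory.newTransformerHandler(templates);
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "row", "row", NO_ATTRS);
        for (String v : row) {
            handler.startElement("", "value", "value", NO_ATTRS);
            handler.characters(v.toCharArray(), 0, v.length());
            handler.endElement("", "value", "value");
        }
        handler.endElement("", "row", "row");
        handler.endDocument(); // this row's output is written here
    }
}
```

With this arrangement only one row is held by the processor at a time; the table header and footer would still have to be written to the Writer by hand around the calls to writeRow.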

Besides this specific problem I have a more general question:
Is it possible to transform XML with an XSLT processor without having
the whole document in memory?
Because my result set is very large, I cannot read it into memory.
I tried the SAX event mechanism because I thought it would give me the
least overhead.
However, treating a row as a complete document does not look tidy
either, because I lose the power of XSLT for my header information.
Maybe I am misunderstanding something completely.
How would an XML/XSLT expert solve the problem?

Apr 18 '06 #1

Joe Kesselman

th***************@osp-dd.de wrote:

> Is it possible to transform XML with a xslt processor without having
> the whole document in memory ?

In general, no, because the XSLT language has complete random access to
the source document.

If your stylesheet reliably reads only forward through the document,
some processors can take advantage of that if configured properly. For
example, Xalan's SQL extension, for its own convenience, doesn't attempt
to present the entire document at once and requires that you write the
stylesheet with that restriction in mind.

Automating recognition of that sort of opportunity and using it for
optimization is one of the holy grails of XSLT processing, and is an
ongoing area of research. If you search the archives of the Xalan
mailing list for the keywords "streaming", "pruning", and "filtering"
you'll find discussion of the challenges involved in making this work
properly. I know folks who are still trying to make it work.

Meanwhile, some XSLT processors (e.g. Xalan) switched from using DOMs
internally to using non-object-based data models for memory efficiency
reasons; that may help you.

In fact, that brings up another possibility. IBM's recently added a
native XML model to DB2, letting it function as a true XML database as
well as a relational database, and I believe they have implemented
XQuery for that. They may have implemented XSLT as well; if not, XQuery
is functionally interchangeable with XSLT 2.0 (XQuery and XSLT2 are
actually two sides of the same design effort) so you could probably
rewrite your stylesheet as an XQuery. This would let you leverage DB2's
intelligence in memory management. I believe trial/beta copies are
available from IBM's website.

> I tried the SAX-fire event mechanism because I thought it gives me the
> fewest overhead.

That's another solution, of course: Write your own SAX-based processing
layer, storing only the information you Really Need in memory. If you
want to use that together with XSLT, what I'd suggest is that you create
a SAX-based filter to discard portions of your document that the
stylesheet doesn't need to see, then run XSLT on the result.
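
A minimal sketch of such a pruning filter, built on the standard org.xml.sax.helpers.XMLFilterImpl. The class name DropElementFilter and the helper copyWithout are hypothetical, and the helper runs an identity transformation purely for illustration; a real stylesheet would take its place. The filter matches on the qualified element name and assumes the document does not rely on namespaces.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class DropElementFilter extends XMLFilterImpl {
    private final String dropName;
    private int depth = 0; // > 0 while inside a dropped element

    public DropElementFilter(XMLReader parent, String dropName) {
        super(parent);
        this.dropName = dropName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || qName.equals(dropName)) {
            depth++; // swallow this element and everything nested in it
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Illustration only: copies xml to a string, minus the dropped elements. */
    static String copyWithout(String xml, String dropName) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DropElementFilter filter = new DropElementFilter(parser, dropName);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                new StreamResult(out));
        return out.toString();
    }
}
```

The processor then only ever sees the pruned event stream, so whatever it buffers is limited to the parts the stylesheet actually needs.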

If you really need to render everything, you can try writing a SAX-based
solution from the ground up, which may allow you to avoid storing
anything in memory for the long term -- IF your rendering permits you to
do so. Or you may find it makes sense to actually re-read the source
document several times rather than keeping too much in memory. As I say,
we do hope XSLT will eventually be optimized to the point where it can
make these decisions for you... but for now, there are still cases where
the right thing to do is drop down to a lower-level language, just as a
Java programmer will occasionally find they have to drop down to the
bytecode level or JNI native code to get the performance they need.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Apr 18 '06 #2

thomas.porschberg

Joe Kesselman wrote:

> th***************@osp-dd.de wrote:
>> Is it possible to transform XML with a xslt processor without having
>> the whole document in memory ?
> In general, no, because the XSLT language has complete random access to
> the source document.
>
> If your stylesheet reliably reads only forward through the document,
> some processors can take advantage of that if configured properly. For
> example, Xalan's SQL extension, for its own convenience, doesn't attempt
> to present the entire document at once and requires that you write the
> stylesheet with that restriction in mind.

Actually, my stylesheets are very simple.
They also all work on a very simple XML input, which is:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
  <row><value>1-String</value><value>2-String</value></row>
  <row><value>R1-String</value><value>R2-String</value></row>
</dataset>

The problem is that there can be a great many row entries.

> Automating recognition of that sort of opportunity and using it for
> optimization is one of the holy grails of XSLT processing, and is an
> ongoing area of research. If you search the archives of the Xalan
> mailing list for the keywords "streaming", "pruning", and "filtering"
> you'll find discussion of the challenges involved in making this work
> properly. I know folks who are still trying to make it work.
>
> Meanwhile, some XSLT processors (eg Xalan) switched from using DOMs
> internally to using non-object-based data models for memory efficiency
> reasons; that may help you.

I do not create a DOM document as input, but SAX events.
I think the XML part is not the problem, but rather the XSLT processing
part.
The problem is that the TransformerHandler only starts its work once
endDocument() has been called. When I read many <row> records, it
must buffer all the data read from the database in memory.

> In fact, that brings up another possibility. IBM's recently added a
> native XML model to DB2, letting it function as a true XML database as
> well as a relational database, and I believe they have implemented
> XQuery for that. They may have implemented XSLT as well; if not, XQuery
> is functionally interchangable with XSLT 2.0 (XQuery and XSLT2 are
> actually two sides of the same design effort) so you could probably
> rewrite your stylesheet as an XQuery. This would let you leverage DB2's
> intelligence in memory management. I believe trial/beta copies are
> available from IBM's website.

We use an Oracle database, but I cannot use its internal XML
capabilities because I am tied to our company's database access
framework. I can only build the output on top of the database result
sets.

>> I tried the SAX-fire event mechanism because I thought it gives me the
>> fewest overhead.
> That's another solution, of course: Write your own SAX-based processing
> layer, storing only the information you Really Need in memory. If you
> want to use that together with XSLT, what I'd suggest is that you create
> a SAX-based filter to discard portions of your document that the
> stylesheet doesn't need to see, then run XSLT on the result.

I think the "then run XSLT on the result" part is the problem.
When I start the fetch from the database I call an init() method, which
calls startDocument(); then I fetch row by row and call parse() for each
row.
In parse() the <row><value>... events are fired.
When all records from the database have been read, I call a close()
method, which itself calls endDocument(). Only now does the XSLT
processing take place.

> If you really need to render everything, you can try writing a SAX-based
> solution from the ground up, which may allow you to avoid storing
> anything in memory for the long term -- IF your rendering permits you to
> do so. Or you may find it makes sense to actually re-read the source
> document several times rather than keeping too much in memory.

Because I fetch the data from the database, I cannot read it multiple
times.
But this is not the point. I only fetch the data I really need, put it
in a String array, and produce the startElement(), endElement(), and
characters() calls on my ContentHandler.
The code is at:
http://randspringer.de/sax_row.tar
http://randspringer.de/sax.tar

> As I say,
> we do hope XSLT will eventually be optimized to the point where it can
> make these decisions for you... but for now, there are still cases where
> the right thing to do is drop down to a lower-level language, just as a
> Java programmer will occasionally find they have to drop down to the
> bytecode level or JNI native code to get the performance they need.
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Apr 19 '06 #3

Joseph Kesselman

As I said: XSLT processors are still learning when they can
stream/prune/filter the incoming data.

For now, and for your particular case, you'll be better off hand-coding
a SAX-based solution.
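
A minimal sketch of what such a hand-coded solution could look like for the dataset/row/value input discussed in this thread, assuming the output is the same HTML table the stylesheet produced. The class name HtmlTableHandler is hypothetical. It is a plain SAX ContentHandler that writes each piece of output as soon as the corresponding event arrives, so nothing beyond the current text chunk is held in memory.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlTableHandler extends DefaultHandler {
    private final Writer out;
    private boolean inValue = false; // true between <value> and </value>

    public HtmlTableHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        switch (qName) {
            case "dataset": write("<table border=\"1\">\n"); break;
            case "row":     write("<tr>"); break;
            case "value":   write("<td>"); inValue = true; break;
            default:        break; // unknown elements are ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        switch (qName) {
            case "value":   inValue = false; write("</td>"); break;
            case "row":     write("</tr>\n"); break;
            case "dataset": write("</table>\n"); break;
            default:        break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inValue) {
            return; // text outside <value> is not rendered
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '<') sb.append("&lt;");
            else if (c == '>') sb.append("&gt;");
            else if (c == '&') sb.append("&amp;");
            else sb.append(c);
        }
        write(sb.toString());
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    private void write(String s) throws SAXException {
        try {
            out.write(s);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }
}
```

An XMLReader that fires the row events would simply get an instance of this handler via setContentHandler() in place of the TransformerHandler. The price is that the output format is now fixed in Java code rather than in a stylesheet.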

Apr 19 '06 #4
