trax, and transforms from a DOMSource

Consider this Java fragment, part of an application which takes crufty HTML
documents in MS Word and OO.o's excuses for HTML and produces a
standardised clean presentation in both HTML and PDF:

/**
* a map of my substitutions, loaded from the file in my resources which
* contains my substitution specifications
*/
protected Map substitutions = null;

/** the DOM printer I'm going to use */
protected Printer caxton = new Printer( );

/**
* a SedBuffer to knock out the worst cruft from MS Word and OO.o
* generated HTML
*/
protected SedBuffer sed = new SedBuffer( );

/** a tidy parser to load messy HTML as a document */
protected Tidy sweeper = new Tidy( );

/**
* A transformer to be preloaded with the XSL file in my resources to use
* for converting the heathen into pdf
*/
protected Transformer converter = null;

/**
* A transformer to be preloaded with the XSL file in my resources to use
* for splitting the heathen into web-servable units
*/
protected Transformer splitter = null;
/**
* convert the heathen
*
* @param heathen the foreign file to convert
*
* @return the base name of the conversion
*/
public String convert( File heathen )
    throws IOException, TransformerException, SubstitutionException
{
    String result = toBaseName( heathen.getName( ) );
    File htmlFile = new File( repository, result + ".html" );
    File pdfFile = new File( repository, result + ".pdf" );
    File sweptFile = File.createTempFile( result, ".swept" );

    File subrep = new File( repository, result );

    if ( !subrep.mkdir( ) )
    {
        throw new IOException(
            "could not create sub-directory within repository" );
    }

    File tmp = File.createTempFile( result, ".conv" );

    /* sed is just an instance of my implementation of SED in Java. What
     * it's doing here is getting rid of the really awful cruft in HTML
     * generated by MS Word or OO.o, the sort of cruft that's so bad even
     * Tidy wouldn't cope with it. */
    sed.substitute( new FileInputStream( heathen ),
        new FileOutputStream( tmp ), substitutions );

    /* sweeper is an instance of Andy Quick and Dave Raggett's JTidy -
     * it knocks the remaining cruft out of foreign HTML, and produces
     * a DOM object */
    Document swept = sweeper.parseDOM( new FileInputStream( tmp ), null );

    try
    {
        /* caxton is my own recursive descent DOM pretty-printer - it dates
         * back to 1999, before the days of TRAX. It's reliable, if not
         * perfect.
         * http://www.weft.co.uk/library/jacqua...l/Printer.html
         */
        caxton.print( swept, new FileOutputStream( sweptFile ) );

        converter.transform( new StreamSource( sweptFile ),
            new StreamResult( htmlFile ) );

        splitter.transform( new StreamSource( sweptFile ),
            new StreamResult( new File( subrep, "index.html" ) ) );

        StringBuffer commandString = new StringBuffer( "prince -s " );

        commandString.append( resourceDir ).append( File.separatorChar );
        commandString.append( "paperback.css" ).append( ' ' );
        commandString.append( htmlFile.getCanonicalPath( ) ).append( ' ' );
        commandString.append( pdfFile.getCanonicalPath( ) );

        /* pass the result off to Prince for final formatting to PDF */
        Runtime.getRuntime( ).exec( commandString.toString( ) );
        System.err.println( "Finished" );
    }
    catch ( Exception e )
    {
        // TODO Auto-generated catch block
        e.printStackTrace( );
    }

    return result;
}

The above works, which is great, but writing the Document object created by
JTidy out to disk and parsing it back in again isn't wonderfully efficient.
It would be much more efficient just to pass the Document object on to the
transformers, like this:

    Document swept = sweeper.parseDOM( new FileInputStream( tmp ), null );

    try
    {
        converter.transform( new DOMSource( swept ),
            new StreamResult( htmlFile ) );

        splitter.transform( new DOMSource( swept ),
            new StreamResult( new File( subrep, "index.html" ) ) );

        ...

However, this doesn't work: 'converter' generates output which isn't as
expected, and 'splitter' generates output as if no transform had been
applied.

So, what am I doing wrong here? I thought that TRAX (I'm using Xalan2
2.7.0) might be marking the Document object as processed in the sweep by
the 'converter' Transformer, so that by the time it gets to the 'splitter'
Transformer it's already polluted. But that isn't the case: if I reverse
the order of the transformations I get exactly the same output. Anyone?

--
si***@jasmine.org.uk (Simon Brooke) http://www.jasmine.org.uk/~simon/

[ This .sig subject to change without notice ]
Mar 15 '07 #1
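
On the DOMSource question above, one thing that may be worth ruling out is the DOM implementation itself rather than TRAX. This is an assumption; nothing in the thread confirms it. The sketch below uses only the JAXP classes bundled with the JDK, not JTidy or Xalan 2.7.0. It builds a namespace-aware Document, feeds the same Document to a DOMSource twice, and checks whether the root element reports a DOM Level 2 local name. The class name DomSourceCheck, the inline stylesheet and the helper looksNamespaceAware are all hypothetical names made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomSourceCheck {
    /* Hypothetical stylesheet: copies the text of every <p> into a <para>. */
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<out><xsl:apply-templates select='//p'/></out>"
        + "</xsl:template>"
        + "<xsl:template match='p'>"
        + "<para><xsl:value-of select='.'/></para>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /* Parse a string into a namespace-aware DOM using the JDK's parser. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    /* Run the stylesheet over the Document via a DOMSource. */
    public static String transform(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /* Rough diagnostic: DOM Level 1 style nodes report a null local name. */
    public static boolean looksNamespaceAware(Document doc) {
        return doc.getDocumentElement().getLocalName() != null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>one</p><p>two</p></body></html>");
        System.out.println("namespace-aware: " + looksNamespaceAware(doc));
        // Transform the same Document twice; the results should match.
        System.out.println(transform(doc));
        System.out.println(transform(doc));
    }
}
```

If a Document built this way transforms correctly from a DOMSource in your environment, and the same check on the Document returned by sweeper.parseDOM( ) comes back false, the difference may lie in the DOM that JTidy produces rather than in the transformers. Getting identical output from both passes here would also fit the observation that reversing the order of the transformations changes nothing.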
* Simon Brooke wrote in comp.text.xml:
> /* sed is just an instance of my implementation of SED in Java. What
>  * it's doing here is getting rid of the really awful cruft in HTML
>  * generated by MS Word or OO.o, the sort of cruft that's so bad even
>  * Tidy wouldn't cope with it. */

You might want to try http://home.ccil.org/~cowan/XML/tagsoup/ instead.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Mar 15 '07 #2
in message <au********************************@hive.bjoern.hoehrmann.de>,
Bjoern Hoehrmann ('b*****@hoehrmann.de') wrote:

> * Simon Brooke wrote in comp.text.xml:
>> /* sed is just an instance of my implementation of SED in Java. What
>>  * it's doing here is getting rid of the really awful cruft in HTML
>>  * generated by MS Word or OO.o, the sort of cruft that's so bad even
>>  * Tidy wouldn't cope with it. */
>
> You might want to try http://home.ccil.org/~cowan/XML/tagsoup/ instead.

Thanks, looks interesting.

--
si***@jasmine.org.uk (Simon Brooke) http://www.jasmine.org.uk/~simon/

;; Perl ... is the Brittney Spears of programming - easily accessible
;; but, in the final analysis, empty of any significant thought
;; Frank Adrian on Slashdot, 21st July 2003
Mar 16 '07 #3