Copy and indenting XML files

Hi group,

I want to indent existing XML files so they are more readable (at least
to me). At this moment I'm looking at the XML files OpenOffice.org's
Writer application produces in its zipped "SXW" format (and they're one
line, probably to save space, which I find hard to read). At first I
thought I was going to do it with sed/awk or something like that, but
then I remembered the xsl:output element with the indent attribute of
XSL and this seems more natural to me. What I'm using now is this XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="*">
    <xsl:copy-of select=".">
      <xsl:apply-templates/>
    </xsl:copy-of>
  </xsl:template>

</xsl:stylesheet>

This works like a charm, but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

I've done some Googling and found out that it's not possible using XSL
since the document type declaration is not part of the tree model of the
XML file.

http://www.biglist.com/lists/xsl-lis.../msg00585.html

I'm using xsltproc as XSL processor and I know you can pass arguments to
it, so I'm looking for a way to extract the PUBLIC and/or SYSTEM
identifier of an XML file with other tools and pass it as an argument to
xsltproc, so it can generate a DOCTYPE declaration with the
doctype-public and/or doctype-system attributes of xsl:output, but I'm
not really sure how to tackle this.

Has somebody already done something like this? Does someone have some
pointers for me?

--
Regards
Harrie
Feb 26 '06 #1
Harrie wrote:
This works like a charm, but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

I've done some Googling and found out that it's not possible using XSL
since the document type declaration is not part of the tree model of the
XML file.


That's correct. You can explicitly specify the Public and System
Identifiers to be used in XSLT's output (see the doctype-public and
doctype-system options on the xsl:output directive), but as far as I
know there's no standard way to retrieve those values from the source
document in XSLT or XPath 1.0. (2.0 may change that.)

I believe both the DOM and SAX APIs expose these fields, though, so if
you really want to, it shouldn't be too hard to write a front-end tool
to obtain them and then pass those to your XSLT processor as parameters.
Or you could just write an indenting tool that uses those APIs to parse
the document in, explicitly modify it to add the indentation, and
serialize it back out.
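
A minimal sketch of such a front-end tool, assuming Python's standard xml.dom.minidom is acceptable as the DOM implementation. The function name read_doctype_ids and the sample document are made up for illustration; the identifiers it returns could then be handed to whatever drives the XSLT processor.

```python
from xml.dom import minidom


def read_doctype_ids(xml_text):
    """Return (root name, public id, system id), or None if there is no DOCTYPE."""
    doc = minidom.parseString(xml_text)
    doctype = doc.doctype  # DocumentType node, or None when absent
    if doctype is None:
        return None
    return doctype.name, doctype.publicId, doctype.systemId


# Hypothetical one-line document, only for illustration.
SAMPLE = ('<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc 1.0//EN" "doc.dtd">'
          '<doc><title>Hi</title></doc>')

print(read_doctype_ids(SAMPLE))
```

Because the whole document is parsed into a tree first, this is probably only comfortable for files that fit in memory.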

WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)
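
To make the warning concrete, here is a small sketch (the fragment and the helper name text_of are invented for the example) that pretty-prints a document with Python's minidom and compares the character data before and after:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def text_of(xml_text):
    """Concatenate all character data found in the document."""
    return "".join(ET.fromstring(xml_text).itertext())


# Hypothetical fragment with inline markup and no white space between elements.
original = "<p><b>Hello</b><i>world</i></p>"
pretty = minidom.parseString(original).toprettyxml(indent="  ")

print(repr(text_of(original)))  # character data of the compact form
print(repr(text_of(pretty)))    # same words, now separated by line breaks and spaces
```

A program that treats white space between the two inline elements as significant would now see two words where the original had one run of text.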

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Feb 27 '06 #2


Harrie wrote:

but
then I remembered the xsl:output element with the indent attribute of
XSL and this seems more natural to me. What I'm using now is this XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="*">
    <xsl:copy-of select=".">
      <xsl:apply-templates/>
    </xsl:copy-of>
  </xsl:template>

</xsl:stylesheet>

This works like a charm,


Really? xsl:copy-of will copy the element, its attributes and its child
nodes, then you additionally use xsl:apply-templates to process the
child nodes again, so you should get a lot of duplicated content that way.

but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).


You can't copy those but you can output them with the xsl:output
instruction e.g.
<xsl:output omit-xml-declaration="no" />

<xsl:output encoding="utf-8" />

<xsl:output
doctype-public="public id here"
doctype-system="system id here" />

Of course it will be a problem if you want to use one stylesheet to
indent lots of different XML documents with different doctype
declarations but if you know the doctype all those documents need then
you can make sure that is output with the above instruction.
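
One possible workaround for the "many different doctypes" case, sketched under the assumption that the identifiers are already known for each document: as far as I know the attributes of xsl:output are fixed in the stylesheet in XSLT 1.0, so a small script can write out a stylesheet per doctype. The name make_indent_stylesheet and the sample identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

STYLESHEET_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"{doctype_attrs}/>
  <!-- identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""


def make_indent_stylesheet(public_id=None, system_id=None):
    """Return stylesheet text with the given identifiers on xsl:output."""
    attrs = ""
    if public_id:
        # With the xml output method, doctype-public only has an effect
        # when doctype-system is given as well.
        attrs += " doctype-public=" + quoteattr(public_id)
    if system_id:
        attrs += " doctype-system=" + quoteattr(system_id)
    return STYLESHEET_TEMPLATE.format(doctype_attrs=attrs)


# Hypothetical identifiers, only for illustration.
sheet = make_indent_stylesheet("-//EXAMPLE//DTD Doc 1.0//EN", "doc.dtd")
ET.fromstring(sheet.encode("utf-8"))  # raises if the result is not well-formed
print(sheet)
```

The generated text can be saved to a file and run with the XSLT processor in the usual way.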

In addition to that some XSLT processors have extensions, like Saxon 6
for instance
<http://saxon.sourceforge.net/saxon6.5.5/extensions.html#saxon:doctype>
to allow you to output doctype declarations.

--

Martin Honnen
http://JavaScript.FAQTs.com/
Feb 27 '06 #3
Joe Kesselman said the following on 2/27/2006 01:13 +0200:
Harrie wrote:
This works like a charm, but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

I've done some Googling and found out that it's not possible using XSL
since the document type declaration is not part of the tree model of the
XML file.


That's correct. You can explicitly specify the Public and System
Identifiers to be used in XSLT's output (see the doctype-public and
doctype-system options on the xsl:output directive), but as far as I
know there's no standard way to retrieve those values from the source
document in XSLT or XPath 1.0. (2.0 may change that.)

I believe both the DOM and SAX APIs expose these fields, though, so if
you really want to, it shouldn't be too hard to write a front-end tool
to obtain them and then pass those to your XSLT processor as parameters.


This is what I had in mind and described at the end of my original
posting, but I don't have any experience with APIs. I have installed
XMLgawk, which uses Expat, and hope that it might help me, but I have a
hard time digesting the language (I do have some experience with (g)awk
itself, but I find this quite different).

Or you could just write an indenting tool that uses those APIs to parse
the document in, explicitly modify it to add the indentation, and
serialize it back out.


Before I thought about using XSLT for indenting, I was thinking about a
POSIX shell script that uses some awk and sed. I suppose this would give
me more flexibility, since I have no control over the amount of
indentation with 'xsl:output indent="yes"', and when I write something
myself I probably can.
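
For comparison, a hand-rolled indenter with a configurable indent width could be as short as the following sketch, which assumes Python's xml.dom.minidom rather than awk and sed. The name indent_xml and the sample document are invented. As far as I can tell, minidom also writes the DOCTYPE back out when it serializes the document.

```python
from xml.dom import minidom


def indent_xml(xml_text, width=4):
    """Re-serialize the document with `width` spaces per nesting level."""
    doc = minidom.parseString(xml_text)
    return doc.toprettyxml(indent=" " * width)


# Hypothetical one-line document, only for illustration.
SAMPLE = ('<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc 1.0//EN" "doc.dtd">'
          '<doc><section><title>Hi</title></section></doc>')

print(indent_xml(SAMPLE, width=4))
```

This works best on one-line input; a file that already contains line breaks between elements keeps them as text, so the output gains extra blank lines.
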

WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)


Yes, the same is true for HTML where white space can be significant. I
hadn't thought about it in this particular case, 'cause all I want to do
is reformat it so I can read it more easily and try to understand it
(it's just for educational purposes). I won't use the indented results
for anything else, but thanks for reminding me.

--
Regards
Harrie
Feb 28 '06 #4
Martin Honnen said the following on 2/27/2006 14:01 +0200:
Harrie wrote:
[stripped xsl file]
This works like a charm,


Really? xsl:copy-of will copy the element, its attributes and its child
nodes, then you additionally use xsl:apply-templates to process the
child nodes again, so you should get a lot of duplicated content that way.


I've read up on xsl:copy-of and see you're quite right. The
xsl:apply-templates is a left over from my start with xsl:copy, which
didn't work for me 'cause it doesn't copy the attributes and child nodes.

Strangely enough I don't have duplicated content, and without the
xsl:apply-templates rule I get exactly the same result (I compared it
with "diff").

But thanks for pointing this out to me.

but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).


You can't copy those but you can output them with the xsl:output
instruction e.g.
<xsl:output omit-xml-declaration="no" />

<xsl:output encoding="utf-8" />

<xsl:output
doctype-public="public id here"
doctype-system="system id here" />


Yes, this is what I had in mind and described at the end of my original
posting.

Of course it will be a problem if you want to use one stylesheet to
indent lots of different XML documents with different doctype
declarations but if you know the doctype all those documents need then
you can make sure that is output with the above instruction.


At this moment I'm looking at the XML files OpenOffice.org's Writer
application is producing, so in this case I can hard code the DOCTYPE,
but I want a general solution.

In addition to that some XSLT processors have extensions, like Saxon 6
for instance
<http://saxon.sourceforge.net/saxon6.5.5/extensions.html#saxon:doctype>
to allow you to output doctype declarations.


Thanks, but outputting it is not my problem (at least, not yet). I can do
that with XSL already, as we both described earlier, but I need to find
a way to extract it from the source document first. Since
OpenOffice.org's Writer files are (when unzipped) XML files with only one
long line, I find it hard to extract just the DOCTYPE (if it had been
spread over multiple lines, I would have used grep with awk).

I've read that XMLgawk can read files by element, so I'm hoping that can
help me, but as I said in my reply to Joe, I have a hard time mastering
its syntax.
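
Since XMLgawk is built on Expat, the same idea can be sketched directly against Expat's event interface, here through Python's xml.parsers.expat. Line breaks do not matter to the parser, so a one-line file is no obstacle. The function name read_doctype is invented, and the sample DOCTYPE is only meant to look like what a Writer file might contain.

```python
import xml.parsers.expat


class _RootReached(Exception):
    """Raised to stop parsing once the root element starts."""


def read_doctype(xml_text):
    """Return the DOCTYPE name and identifiers as a dict, or None if absent."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(name=name, public=public_id, system=system_id)

    def on_start_element(name, attrs):
        # The DOCTYPE can only appear before the root element, so stop here.
        raise _RootReached

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    try:
        parser.Parse(xml_text, True)
    except _RootReached:
        pass
    return found or None


# Illustrative one-line input; the identifiers are assumptions, not checked
# against a real SXW file.
SAMPLE = ('<?xml version="1.0" encoding="UTF-8"?>'
          '<!DOCTYPE office:document-content PUBLIC '
          '"-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "office.dtd">'
          '<office:document-content><office:body/></office:document-content>')

print(read_doctype(SAMPLE))
```

Stopping at the first element means the rest of a large file is never examined.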

Hmmm, if the DOCTYPE is not a part of the document tree, is it an element?

--
Regards
Harrie
Feb 28 '06 #5
Joe Kesselman said the following on 2/27/2006 01:13 +0200:
WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)


Just out of curiosity:

I just read section 16.1 of the XSLT Recommendation [1] and there is a
warning (NOTE) about using indent with mixed content. I understand that
white space is significant there, but mixed content is not a good way of
writing XML anyway.

Above that, there is a paragraph about indent. I don't understand the
last long sentence of that paragraph (starting with: "The xml output method
should use an algorithm .." till the end).

Can somebody give an example of what is meant there?
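
One way to read that sentence, offered as an interpretation rather than an authoritative answer: indenting may only add white space that a later white-space-stripping pass would remove again, so the indented and the non-indented output must be identical once white-space-only text is dropped. A sketch of that check in Python, with invented helper names and sample data, and with minidom standing in for the XSLT serializer:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def strip_whitespace_only(elem):
    """Recursively drop text and tail strings that contain only white space."""
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    for child in elem:
        strip_whitespace_only(child)
        if child.tail is not None and not child.tail.strip():
            child.tail = None
    return elem


def stripped_form(xml_text):
    """Serialize the document after removing white-space-only text."""
    root = strip_whitespace_only(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")


# Hypothetical document without mixed content.
compact = "<doc><title>A  title</title><p>Some text</p></doc>"
indented = minidom.parseString(compact).toprettyxml(indent="  ")

# Different as strings, but the same once white-space-only text is stripped.
print(compact == indented)
print(stripped_form(compact) == stripped_form(indented))
```

Text that contains anything besides white space, such as the double space inside the title, is left alone by the stripping step.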

[1] http://www.w3.org/TR/xslt#section-XML-Output-Method

--
Regards
Harrie
Feb 28 '06 #6
Harrie wrote:
...
I want to indent existing XML files so they are more readable (at least
to me)...


http://xmlindent.sourceforge.net/

Hope it helps.

PS: I'm a member of the XMLgawk team. I'll try to contribute an
xmlindent utility. But it may take some time. There are project forums
open to registered users:

http://sourceforge.net/forum/?group_id=133165
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado

Mar 1 '06 #7
Harrie wrote:
I want to indent existing XML files so they are more readable (at least
to me). At this moment I'm looking at the XML files OpenOffice.org's
Writer application produces in its zipped "SXW" format (and they're one
line, probably to save space, which I find hard to read).


Try disabling the option "Size optimization for XML format" in
OpenOffice.org.

Quotation from OpenOffice.org Help: Size optimization for XML format (no pretty printing)
When saving the document, OpenOffice.org writes the XML data without indents and extra line breaks. This allows documents to be saved and opened more quickly, and the file size is smaller.


In Version 2.0:
--> Menu: 'Tools'/'Options...'
--> Group: 'Load/Save'/'General'
--> Option: 'Save'/'Size optimization for XML format'

Besides that, there's "XML indent":
http://xmlindent.sourceforge.net/

Regards,

Mirco
Mar 21 '06 #8
