  #1  
Old February 26th, 2006, 11:55 PM
Harrie
Guest
 
Posts: n/a
Default Copy and indenting XML files

Hi group,

I want to indent existing XML files so they are more readable (at least
to me). At the moment I'm looking at the XML files that OpenOffice.org's
Writer application produces in its zipped "SXW" format (they are a single
line, probably to save space, which I find hard to read). At first I
thought I would do it with sed/awk or something like that, but
then I remembered the xsl:output element of XSL with its indent attribute,
which seems more natural to me. What I'm using now is this XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<xsl:template match="*">
<xsl:copy-of select=".">
<xsl:apply-templates/>
</xsl:copy-of>
</xsl:template>

</xsl:stylesheet>

This works like a charm, but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

I've done some Googling and found out that this is not possible using XSL,
since the document type declaration is not part of the tree model of the
XML file.

http://www.biglist.com/lists/xsl-lis.../msg00585.html

I'm using xsltproc as XSL processor and I know you can pass arguments to
it, so I'm looking for a way to extract the PUBLIC and/or SYSTEM
identifier of an XML file with other tools and pass it as an argument to
xsltproc, so it can generate a DTD with the doctype-public and/or
doctype-system attributes of xsl:output, but I'm not really sure how to
tackle this.

Has somebody already done something like this? Does someone have some
pointers for me?

--
Regards
Harrie
  #2  
Old February 27th, 2006, 12:25 AM
Joe Kesselman
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files

Harrie wrote:
> This works like a charm, but I cannot copy the DOCTYPE declaration (and
> XML declaration, but that's of less importance to me at this moment).
>
> I've done some Googling and found out that this is not possible using XSL,
> since the document type declaration is not part of the tree model of the
> XML file.

That's correct. You can explicitly specify the Public and System
Identifiers to be used in XSLT's output (see the doctype-public and
doctype-system options on the xsl:output directive), but as far as I
know there's no standard way to retrieve those values from the source
document in XSLT or XPath 1.0. (2.0 may change that.)

I believe both the DOM and SAX APIs expose these fields, though, so if
you really want to, it shouldn't be too hard to write a front-end tool
to obtain them and then pass those to your XSLT processor as parameters.
Or you could just write an indenting tool that uses those APIs to parse
the document in, explicitly modify it to add the indentation, and
serialize it back out.
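
A minimal sketch of such a front-end tool, assuming Python and its standard-library Expat binding (the thread itself names no language). The function name `read_doctype` and the sample document are hypothetical; `StartDoctypeDeclHandler` is the Expat callback that reports the DOCTYPE name and identifiers.

```python
import xml.parsers.expat


def read_doctype(xml_bytes):
    """Return (root_name, public_id, system_id), or None if there is no DOCTYPE."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Expat passes the system identifier before the public identifier.
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return found[0] if found else None


# Hypothetical single-line document, similar in shape to an unzipped SXW part.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc 1.0//EN" "doc.dtd">'
          b'<doc><p>text</p></doc>')
print(read_doctype(sample))
```

Because the parser works on the declaration itself rather than on lines, it does not matter that the whole file is a single line.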

WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)
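
The warning can be illustrated with a small sketch, assuming Python's standard-library `minidom` as the indenter (not xsltproc, whose output may differ). The helper `text_of` is a hypothetical name; it concatenates the character data a consuming program would see.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def text_of(xml_text):
    """Concatenate all character data of the document in document order."""
    return "".join(ET.fromstring(xml_text).itertext())


# Mixed content: text and child elements side by side.
original = "<p>Hello <b>big</b> world</p>"
indented = minidom.parseString(original).toprettyxml(indent="  ")

print(repr(text_of(original)))
print(repr(text_of(indented)))  # extra newlines and spaces are now part of the text
```

The words survive, but the text content is no longer identical, which is exactly what matters to a program that treats whitespace as significant.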

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
  #3  
Old February 27th, 2006, 01:15 PM
Martin Honnen
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files



Harrie wrote:

> but
> then I remembered the xsl:output element with the indent attribute of
> XSL and this seems more natural to me. What I'm using now is this XSL file:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="1.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>
> <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
>
> <xsl:template match="*">
> <xsl:copy-of select=".">
> <xsl:apply-templates/>
> </xsl:copy-of>
> </xsl:template>
>
> </xsl:stylesheet>
>
> This works like a charm,

Really? xsl:copy-of will copy the element, its attributes and its child
nodes, then you additionally use xsl:apply-templates to process the
child nodes again, so you should get a lot of duplicated content that way.

> but I cannot copy the DOCTYPE declaration (and
> XML declaration, but that's of less importance to me at this moment).

You can't copy those but you can output them with the xsl:output
instruction e.g.
<xsl:output omit-xml-declaration="no" />

<xsl:output encoding="utf-8" />

<xsl:output
doctype-public="public id here"
doctype-system="system id here" />

Of course it will be a problem if you want to use one stylesheet to
indent lots of different XML documents with different doctype
declarations but if you know the doctype all those documents need then
you can make sure that is output with the above instruction.
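
For documents with differing doctypes, one possible workaround is to generate the stylesheet per document. In XSLT 1.0 the attributes of xsl:output are not attribute value templates, so a stylesheet parameter cannot fill them in directly. The following is a minimal sketch in Python (an assumption, the thread names no language); `TEMPLATE` and `make_stylesheet` are hypothetical names, and the stylesheet uses the usual identity template rather than the one posted above.

```python
from xml.sax.saxutils import quoteattr

TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"{doctype}/>
  <!-- identity transform: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""


def make_stylesheet(public_id=None, system_id=None):
    """Return stylesheet text with the given identifiers on xsl:output."""
    attrs = ""
    if public_id:
        # Per XSLT 1.0, doctype-public is ignored unless doctype-system is set.
        attrs += " doctype-public=" + quoteattr(public_id)
    if system_id:
        attrs += " doctype-system=" + quoteattr(system_id)
    return TEMPLATE.format(doctype=attrs)


print(make_stylesheet("-//EXAMPLE//DTD Doc 1.0//EN", "doc.dtd"))
```

The generated text could be written to a file, say `indent.xsl`, and then run as `xsltproc indent.xsl content.xml`.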

In addition to that some XSLT processors have extensions, like Saxon 6
for instance
<http://saxon.sourceforge.net/saxon6.5.5/extensions.html#saxon:doctype>
to allow you to output doctype declarations.

--

Martin Honnen
http://JavaScript.FAQTs.com/
  #4  
Old February 28th, 2006, 09:35 AM
Harrie
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files

Joe Kesselman said the following on 2/27/2006 01:13 +0200:
> Harrie wrote:
>> This works like a charm, but I cannot copy the DOCTYPE declaration (and
>> XML declaration, but that's of less importance to me at this moment).
>>
>> I've done some Googling and found out that this is not possible using XSL,
>> since the document type declaration is not part of the tree model of the
>> XML file.
>
> That's correct. You can explicitly specify the Public and System
> Identifiers to be used in XSLT's output (see the doctype-public and
> doctype-system options on the xsl:output directive), but as far as I
> know there's no standard way to retrieve those values from the source
> document in XSLT or XPath 1.0. (2.0 may change that.)
>
> I believe both the DOM and SAX APIs expose these fields, though, so if
> you really want to, it shouldn't be too hard to write a front-end tool
> to obtain them and then pass those to your XSLT processor as parameters.

This is what I had in mind and described at the end of my original
posting, but I don't have any experience with APIs. I have installed
XMLgawk, which uses Expat, and hope that it might help me, but I have a
hard time digesting the language (I do have some experience with (g)awk
itself, but I find this quite different).

> Or you could just write an indenting tool that uses those APIs to parse
> the document in, explicitly modify it to add the indentation, and
> serialize it back out.

Before I thought about using XSLT for indenting, I was thinking about a
POSIX shell script that uses some awk and sed. I suppose this would give me
more flexibility, since I have no control over the amount of indenting
with 'xsl:output indent="yes"', and when I write something myself I
probably can.
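
As an illustration of controlling the indent width, here is a minimal sketch that assumes Python's standard-library `minidom` instead of a shell script. The function name `indent_xml` and the sample input are hypothetical. In this sketch the DOCTYPE is carried through by the DOM, though minidom reformats it slightly.

```python
from xml.dom import minidom


def indent_xml(xml_text, width=4):
    """Re-serialize xml_text with `width` spaces per nesting level."""
    doc = minidom.parseString(xml_text)
    return doc.toprettyxml(indent=" " * width)


# Hypothetical one-line input with a DOCTYPE declaration.
one_line = '<!DOCTYPE a SYSTEM "a.dtd"><a><b>x</b><c><d/></c></a>'
print(indent_xml(one_line, width=2))
```

The same caveat about significant whitespace applies: this is meant for reading the file, not for feeding the result back to the application.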

> WARNING: Changing indentation means changing the text content of the
> document, and may change its actual meaning. Don't assume the
> pretty-printed version is usable in place of the original; know what the
> requirements are of the program you're working with. (Or you could avoid
> changing the file at all, and use an XML-aware editor to make its
> structure more visible.)

Yes, the same is true for HTML, where white space can be significant. I
hadn't thought about it in this particular case, because all I want to do
is reformat it so I can read it more easily and try to understand it
(it's just for educational purposes). I won't use the indented results
for anything else, but thanks for reminding me.

--
Regards
Harrie
  #5  
Old February 28th, 2006, 09:55 AM
Harrie
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files

Martin Honnen said the following on 2/27/2006 14:01 +0200:
> Harrie wrote:
>
[stripped xsl file]
>> This works like a charm,
>
> Really? xsl:copy-of will copy the element, its attributes and its child
> nodes, then you additionally use xsl:apply-templates to process the
> child nodes again, so you should get a lot of duplicated content that way.

I've read up on xsl:copy-of and see you're quite right. The
xsl:apply-templates is a leftover from my start with xsl:copy, which
didn't work for me because it doesn't copy the attributes and child nodes.

Strangely enough, I don't get duplicated content, and without the
xsl:apply-templates rule I get exactly the same result (I compared it
with "diff").

But thanks for pointing this out to me.

>> but I cannot copy the DOCTYPE declaration (and
>> XML declaration, but that's of less importance to me at this moment).
>
> You can't copy those but you can output them with the xsl:output
> instruction e.g.
> <xsl:output omit-xml-declaration="no" />
>
> <xsl:output encoding="utf-8" />
>
> <xsl:output
> doctype-public="public id here"
> doctype-system="system id here" />

Yes, this is what I have in mind and was at the end of my original posting.

> Of course it will be a problem if you want to use one stylesheet to
> indent lots of different XML documents with different doctype
> declarations but if you know the doctype all those documents need then
> you can make sure that is output with the above instruction.

At this moment I'm looking at the XML files OpenOffice.org's Writer
application is producing, so in this case I can hard code the DOCTYPE,
but I want a general solution.

> In addition to that some XSLT processors have extensions, like Saxon 6
> for instance
> <http://saxon.sourceforge.net/saxon6.5.5/extensions.html#saxon:doctype>
> to allow you to output doctype declarations.

Thanks, but outputting it is not my problem (at least, not yet). I can do
that with XSL already, as we both described earlier, but I need to find
a way to extract it from the source document first. Since
OpenOffice.org's Writer files are (when unzipped) XML files consisting of
one long line, I find it hard to extract only the DOCTYPE (if it had been
multiple lines, I would have used grep with awk).

I've read that XMLgawk can read files by element, so I'm hoping that can
help me, but as I said in my reply to Joe, I have a hard time mastering
its syntax.

Hmmm, if the DOCTYPE is not a part of the document tree, is it an element?
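
One way to check this is to ask a DOM parser. The following is a small sketch assuming Python's standard-library `minidom`, with a hypothetical sample document. In the DOM the declaration appears as a separate DocumentType node, not as an element, while the XPath data model that XSLT works on has no node for it at all.

```python
from xml.dom import minidom, Node

# Hypothetical one-line document with a DOCTYPE declaration.
doc = minidom.parseString('<!DOCTYPE a SYSTEM "a.dtd"><a/>')

# The DOCTYPE is a DocumentType node, a sibling of the root element.
print(doc.doctype.nodeType == Node.DOCUMENT_TYPE_NODE)
print(doc.documentElement.nodeType == Node.ELEMENT_NODE)
print(doc.doctype.name, doc.doctype.systemId)
```

So a tool that reads "by element" will not necessarily report the DOCTYPE; it needs a dedicated hook for declarations.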

--
Regards
Harrie
  #6  
Old February 28th, 2006, 10:25 AM
Harrie
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files

Joe Kesselman said the following on 2/27/2006 01:13 +0200:

> WARNING: Changing indentation means changing the text content of the
> document, and may change its actual meaning. Don't assume the
> pretty-printed version is usable in place of the original; know what the
> requirements are of the program you're working with. (Or you could avoid
> changing the file at all, and use an XML-aware editor to make its
> structure more visible.)

Just out of curiosity:

I just read section 16.1 of the XSLT Recommendation [1] and there is a
warning (NOTE) about using indent with mixed content. I understand that
white space is significant there, but mixed content is not a good way of
writing XML anyway.

Above that, there is a paragraph about indent. I don't understand the
last long line of that paragraph (starting with: "The xml output method
should use an algorithm .." till the end).

Can somebody give an example of what is meant there?

[1] http://www.w3.org/TR/xslt#section-XML-Output-Method
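
One reading of that sentence is that an indenter should only add whitespace where it forms whitespace-only text nodes, so that stripping such nodes from the indented output and from the unindented output gives the same tree. The sketch below illustrates that reading; it assumes Python's standard-library `minidom` as the indenter (not an XSLT processor), and all function names are hypothetical.

```python
from xml.dom import minidom, Node


def strip_whitespace_only(node):
    """Remove text nodes that consist solely of whitespace, recursively."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_only(child)
    return node


def stripped(xml_text):
    doc = minidom.parseString(xml_text)
    return strip_whitespace_only(doc.documentElement).toxml()


def safe_to_indent(xml_text):
    """True if indenting adds nothing that whitespace stripping cannot undo."""
    pretty = minidom.parseString(xml_text).toprettyxml(indent="  ")
    return stripped(pretty) == stripped(xml_text)


element_only = "<a><b>x</b><c><d/></c></a>"
mixed = "<p>Hello <b>big</b> world</p>"
print(safe_to_indent(element_only))
print(safe_to_indent(mixed))
```

With element-only content the added whitespace disappears again after stripping. With mixed content, minidom's indentation ends up inside text nodes that also hold real characters, which is the situation the NOTE in the Recommendation warns about.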

--
Regards
Harrie
  #7  
Old March 1st, 2006, 12:15 PM
Manuel Collado
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files

Harrie wrote:
> ...
> I want to indent existing XML files so they are more readable (at least
> to me)...

http://xmlindent.sourceforge.net/

Hope it helps.

PS: I'm a member of the XMLgawk team. I'll try to contribute an
xmlindent utility. But it may take some time. There are project forums
open to registered users:

http://sourceforge.net/forum/?group_id=133165
--
To reply by e-mail, please remove the extra dot
in the given address: m.collado -> mcollado

  #8  
Old March 21st, 2006, 04:05 PM
Mirco Hilbert
Guest
 
Posts: n/a
Default Re: Copy and indenting XML files

Harrie wrote:
> I want to indent existing XML files so they are more readable (at least
> to me). At this moment I'm looking at the XML files OpenOffice.org's
> Writer application produces in its zipped "SXW" format (and they're one
> line, probably to save space, which I find hard to read).

Try to disable the option "Size optimization for XML format" in
OpenOffice.org.

Quotation from OpenOffice.org Help:
> Size optimization for XML format (no pretty printing)
> When saving the document, OpenOffice.org writes the XML data without indents and extra line breaks. This allows documents to be saved and opened more quickly, and the file size is smaller.

In Version 2.0:
--> Menu: 'Tools'/'Options...'
--> Group: 'Load/Save'/'General'
--> Option: 'Save'/'Size optimization for XML format'


Besides that, there's "XML indent":
http://xmlindent.sourceforge.net/


Regards,

Mirco
 
