high-performance alternative to xsl:number

Hi,

I am trying to allocate a unique ID to every instance of the tag 'foo' in a
large XML document. Currently I'm doing this:

<xsl:variable name="UniqueId">
  <xsl:number count="foo" level="any"/>
</xsl:variable>

But with .NET Framework 1.1 (using XPathDocument) it is very slow for
large documents (say 100 MB with 100,000 foo tags in it). When I say
very slow, I am talking days, and I would like it to take minutes!

The only pure XSL alternative I've seen is to use position(). However,
the <foo> tags can occur at different levels within the document (and
might be nested), so I'm thinking that position() would be difficult to
use. There are also other templates within the XSLT which perform other
processing.

The IDs I generate don't have to be contiguous, but they must increase
the further you go down the document.

Is there any simple, reliable solution, or should I just bite the bullet
and pre-process the document with C# to put in these IDs before running
the rest of the transform?

Thanks

Andy

Nov 22 '06 #1
aj****@blueyonder.co.uk wrote:
> I am trying to allocate a unique ID to every instance of tag 'foo' in a
> large XML document. currently I'm doing this:
>
> <xsl:variable name="UniqueId">
> <xsl:number count="foo" level="any"/>
> </xsl:variable>
>
> but with .Net framework 1.1 (using XPathDocument) it is very slow for
> large documents (say 100mb with 100,000 foo tags in it). when I say
> very slow, I am talking days and I would like it to take minutes !!

A unique id can be generated with generate-id() although the format will
not be a number but rather a string following the XML ID requirements.

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Nov 22 '06 #2
Martin Honnen wrote:
> aj****@blueyonder.co.uk wrote:
>> I am trying to allocate a unique ID to every instance of tag 'foo' in a
>> large XML document. currently I'm doing this:
>>
>> <xsl:variable name="UniqueId">
>> <xsl:number count="foo" level="any"/>
>> </xsl:variable>
>>
>> but with .Net framework 1.1 (using XPathDocument) it is very slow for
>> large documents (say 100mb with 100,000 foo tags in it). when I say
>> very slow, I am talking days and I would like it to take minutes !!
>
> A unique id can be generated with generate-id() although the format will
> not be a number but rather a string following the XML ID requirements.

Thanks Martin, I didn't know about that one.

Strictly, my ID doesn't have to be a number, but I would require that
the IDs are lexically in document order when sorted. I guess this is
unlikely (and the standard certainly doesn't guarantee it), so
unfortunately I don't think it's something I will be able to use this
time.

Nov 22 '06 #3
aj****@blueyonder.co.uk wrote:
> [...]
> the Id's I generate don't have to be contiguous but they must increase
> the further you go down the document
Others have provided a number of solutions to the original request,
but I feel I should take issue with this one. It's probably A Bad Idea
to trespass on the ID space by adding another meaning to it. An ID is
just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Trying to make an ID value mean something in addition is almost always
wrong, and almost always the hallmark of poor data design. It's like the
traditional way of creating customer numbers: two digits for the area,
three digits for the industry code, then a dash because that company we
took over in 1954 always used them, then one digit for this and four
digits for that, then a check digit, and finally a "unique" sequence
number. Accounting offices and marketing offices *love* doing this, when
what they should be doing is recording all that information elsewhere
and assigning an arbitrary unique ID to the customer.

If your customer needs a sequence indicator, create an attribute and
make it reflect the numeric sequence position of each foo element in the
document. If the data is long-term important, just keep the ID as an ID
and your successors will thank you for it.
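
A small sketch of that separation, not taken from the thread: the element names other than foo, the attribute name seq, and the id values are all made up for illustration.

```xml
<doc>
  <!-- id: arbitrary and stable; it says only "this is me" -->
  <!-- seq: document-order position; can be regenerated at any time -->
  <foo id="x7k2" seq="1">
    <foo id="a03q" seq="2"/>
  </foo>
  <bar>
    <foo id="m9d1" seq="3"/>
  </bar>
</doc>
```

Sorting or comparing would then use seq only, and nothing would depend on what the id values look like.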

On the other hand, if like a lot of business data it's only important
for 10-15 minutes while a decision is made, then any old junk will do so
long as it satisfies the immediate conditions :-)

///Peter
--
XML FAQ: http://xml.silmaril.ie
Nov 23 '06 #4
In article <11**********************@k70g2000cwa.googlegroups.com>,
<aj****@blueyonder.co.uk> wrote:
> I am trying to allocate a unique ID to every instance of tag 'foo' in a
> large XML document. currently I'm doing this:
>
> <xsl:variable name="UniqueId">
> <xsl:number count="foo" level="any"/>
> </xsl:variable>

The only reasonably efficient XSLT solution I can come up with is a 3
step process. First assign ids (which won't be numbers, and could be
in any format) to the elements using generate-id(). Then generate a
file mapping ids to sequence numbers (using position() on a node-set
containing all the desired elements). Then use key() to look up the
sequence numbers in the map file. I think this should be order N (or
close) for most stylesheet processors.

Here are the stylesheets, which assume that you want to operate on all
"foo" elements, that you want to call the sequence number attribute
"seq", and that you don't already have attributes called "id".

(1) Assign ids:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity template: copy everything else unchanged -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- add a generated id attribute to every foo element -->
  <xsl:template match="foo">
    <xsl:copy>
      <xsl:attribute name="id"><xsl:value-of select="generate-id()"/></xsl:attribute>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

(2) Create map file from the result of step 1:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <junk>
      <xsl:apply-templates select="//foo"/>
    </junk>
  </xsl:template>

  <xsl:template match="foo">
    <map id="{@id}" seq="{position()}"/>
  </xsl:template>

</xsl:stylesheet>

(3) Map ids to sequence numbers (pass in the URL of the map file as the
"mapfile" parameter, and use the file generated in step 1 as the input):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="mapfile"/>

  <xsl:key name="id" match="map" use="@id"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <xsl:apply-templates select="@*[name() != 'id']"/>
      <xsl:attribute name="seq">
        <xsl:variable name="id" select="@id"/>
        <!-- the for-each is just to set the context node for key() -->
        <xsl:for-each select="document($mapfile)">
          <xsl:value-of select="key('id', $id)/@seq"/>
        </xsl:for-each>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
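
To make the three steps concrete, here is a small worked example that is not part of the original post. The input document, the name attributes and the id values idA, idB and idC are made up; real generate-id() values are processor-specific and will look different. The block shows four separate documents one after another, with whitespace added for readability.

```xml
<!-- Input -->
<doc>
  <foo name="a">
    <foo name="b"/>
  </foo>
  <bar>
    <foo name="c"/>
  </bar>
</doc>

<!-- After step 1: every foo carries a generated id -->
<doc>
  <foo id="idA" name="a">
    <foo id="idB" name="b"/>
  </foo>
  <bar>
    <foo id="idC" name="c"/>
  </bar>
</doc>

<!-- Map file from step 2: //foo is visited in document order,
     so position() numbers the outer foo before the nested one -->
<junk>
  <map id="idA" seq="1"/>
  <map id="idB" seq="2"/>
  <map id="idC" seq="3"/>
</junk>

<!-- After step 3: id dropped, seq looked up through the key -->
<doc>
  <foo name="a" seq="1">
    <foo name="b" seq="2"/>
  </foo>
  <bar>
    <foo name="c" seq="3"/>
  </bar>
</doc>
```

The nested foo gets 2 rather than a per-level number, which is the "increases as you go down the document" behaviour asked for in the question.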

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Nov 23 '06 #5
Peter Flynn wrote:
> It's probably A Bad Idea
> to trespass on the ID space by adding another meaning to it. An ID is
> just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Granted. I was assuming that what was wanted here was just a sequence
identifier, not an ID in the ID/IDREF sense, since the request was
specifically that it be a monotonically increasing numeric value.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 23 '06 #6
On 22 Nov 2006 04:42:13 -0800, aj****@blueyonder.co.uk wrote:
> Hi,
>
> I am trying to allocate a unique ID to every instance of tag 'foo' in a
> large XML document. currently I'm doing this:
>
> <xsl:variable name="UniqueId">
> <xsl:number count="foo" level="any"/>
> </xsl:variable>
>
> but with .Net framework 1.1 (using XPathDocument) it is very slow for
> large documents (say 100mb with 100,000 foo tags in it). when I say
> very slow, I am talking days and I would like it to take minutes !!
>
> the only pure XSL alternative I've seen is to use position(). however,
> the <foo> tags can occur at different levels within the document (and
> might be nested), so I'm thinking that position would be difficult to
> use. There are also other templates within the XSLT which perform other
> processing.
>
> the Id's I generate don't have to be contiguous but they must increase
> the further you go down the document
>
> is there any simple reliable solution, or should I just bite the bullet
> and pre-process the document with C# to put in these Ids before running
> the rest of the transform
>
> Thanks
>
> Andy

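Although it seems to be relatively unknown, you can actually include C#
code in your XSLT. For example, add the following at the bottom of your
XSLT (the ms prefix must be bound to urn:schemas-microsoft-com:xslt, and
ext to a namespace of your own, on the xsl:stylesheet element):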

<ms:script language="C#" implements-prefix="ext">
<![CDATA[

  int currentPosition = 0;

  public string GetPosition() {
    currentPosition = currentPosition + 1;
    return currentPosition.ToString();
  }

]]>
</ms:script>

Now you can get an incrementing ID using something like:
<xsl:value-of select="ext:GetPosition()"/>
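
For reference, a complete stylesheet using this technique might look like the sketch below. It is not from the original post: the namespace URI bound to ext (urn:example:seq) and the attribute name seq are made up for illustration, while urn:schemas-microsoft-com:xslt is the namespace Microsoft's processors use for the script element. It also assumes the processor calls GetPosition() once per foo in document order, which XSLT itself does not promise.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ms="urn:schemas-microsoft-com:xslt"
    xmlns:ext="urn:example:seq"
    exclude-result-prefixes="ms ext">

  <!-- identity template: copy everything else unchanged -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- every foo gets a seq attribute from the script counter;
       attributes are written before any child nodes -->
  <xsl:template match="foo">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="seq">
        <xsl:value-of select="ext:GetPosition()"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <ms:script language="C#" implements-prefix="ext">
  <![CDATA[
    // counter state lives for the duration of one transform
    int currentPosition = 0;

    public string GetPosition() {
      currentPosition = currentPosition + 1;
      return currentPosition.ToString();
    }
  ]]>
  </ms:script>

</xsl:stylesheet>
```

The exclude-result-prefixes attribute only keeps the ms and ext namespace declarations out of the output; it does not affect the numbering.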

Nothing is going to be as fast as opening an XmlTextReader/XmlTextWriter
pair and iterating through the document, adding the attribute whenever you
read an element whose Name is 'foo', as the whole file has to be parsed
and rewritten anyway.

Cheers,
Gadget
Nov 23 '06 #7
Gadget wrote:
> Actually, although it seems to be relatively unknown, you can actually
> include C# code in your XSLT.
Uhm... Only in MSXSL. This is an implementation-specific, nonportable
feature (which is why it's in Microsoft's own namespace). This is
basically the "extension functions" solution, except that MS has
provided a way to inline them. (Personally, I don't like it -- but I'm a
firm believer in sticking with portable solutions unless there is
absolutely no alternative.)

As I said, stateful extensions may work but they have issues. A
sufficiently smart XSLT processor may re-order code as part of its
optimization, counting on the fact that XSLT is a functional language;
extensions break that assumption and thus may either prevent
optimization (if the processor is smart and cautious) or fail to execute
in the way you expected them to (if the processor is assuming that the
extensions will also be functional and have no persistent state).

I honestly think the preprocessor approach is architecturally cleaner.
But, yeah, this may work.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 23 '06 #8
On Thu, 23 Nov 2006 07:43:53 -0500, Joe Kesselman wrote:
> Gadget wrote:
>> Actually, although it seems to be relatively unknown, you can actually
>> include C# code in your XSLT.
>
> Uhm... Only in MSXSL. This is an implementation-specific, nonportable
> feature (which is why it's in Microsoft's own namespace). This is
> basically the "extension functions" solution, except that MS has
> provided a way to inline them. (Personally, I don't like it -- but I'm a
> firm believer in sticking with portable solutions unless there is
> absolutely no alternative.)
>
> As I said, stateful extensions may work but they have issues. A
> sufficiently smart XSLT processor may re-order code as part of its
> optimization, counting on the fact that XSLT is a functional language;
> extensions break that assumption and thus may either prevent
> optimization (if the processor is smart and cautious) or fail to execute
> in the way you expected them to (if the processor is assuming that the
> extensions will also be functional and have no persistent state).
>
> I honestly think the preprocessor approach is architecturally cleaner.
> But, yeah, this may work.

Well, we obviously avoid vendor-specific code where we can, but in
this case we're in a dotnet.xml group, and he's asking for a 'high
performance' solution, so that rules out native XSLT :)
The advantage of XSLT in this case is that it is the most flexible way to
manipulate XML, and does not require recompiling every time a change is
made, which is why including the code in the XSLT is almost
certainly going to be his best flexible 'high performance' option.

If this is a single requirement that does not require flexibility, use an
XmlTextReader and XmlTextWriter, and manipulate the data as you copy it
from one stream to another. This is a 'one shot' solution that requires
compilation but is the fastest 'structured' method.

Insisting on using platform-independent XSLT for code that will be running
under MSXML is a bit of an 'ivory tower' practice, and ideal if you believe
your solution might one day be ported to a Linux box, run another vendor's
engine, or be posted for the scrutiny of the open-source community, but the
chances are that this would require redesigning 90% of your application
anyway, in which case this small part becomes negligible :)

It would be interesting to see if the XSLT processor did reorder any of the
code, but given that this solution was provided by Microsoft, and given
that there are standards for the order in which nodes are traversed, this
is rather unlikely.

I guess it's just a case of prioritizing speed, flexibility, and
standardization.

Cheers,
Gadget
Nov 23 '06 #9
Joe Kesselman wrote:
> Peter Flynn wrote:
>> It's probably A Bad Idea to trespass on the ID space by adding another
>> meaning to it. An ID is just an ID, nothing more: all it says is "This
>> Is Me, I'm Unique".
>
> Granted. I was assuming that what was wanted here was just a sequence
> identifier, not an ID in the ID/IDREF sense, since the request was
> specifically that it be a monotonically increasing numeric value.

Yep, but in that case it would be better to call it SEQ or something,
just in case it gets accidentally misinterpreted as being an ID in the
XML sense of the term.

///Peter
Nov 24 '06 #10
Gadget wrote:
> this case we're in a dotnet.xml group, and he's asking for a 'high
> performance' solution, so that rules out native XSLT :)

Some of you are in a dotnet.xml group; the discussion's being crossposted.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 25 '06 #11
I agree with Gadget.
Cross-platform code that needs no rewriting is still something of a dream.
Still, I'd suggest that the file-mapping solution is the better choice.

"Joe Kesselman" <ke************@comcast.net> wrote
news:bf******************************@comcast.com...
> Gadget wrote:
>> this case we're in a dotnet.xml group, and he's asking for a 'high
>> performance' solution, so that rules out native XSLT :)
>
> Some of you are in a dotnet.xml group; the discussion's being
> crossposted.
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 25 '06 #12
W. Jordan wrote:
> The cross-platform thing without code rewriting is still like a dream.

I'm not surprised to hear that opinion expressed in a Microsoft-specific
group. The rest of the industry seems to be managing it pretty well.

There are certainly times that a portable solution doesn't matter and
extensions are the right answer. Up to the developer to decide whether
this is such a case. I'll let it rest at that.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 25 '06 #13
