high-performance alternative to xsl:number

Hi,

I am trying to allocate a unique ID to every instance of the tag 'foo' in a
large XML document. Currently I'm doing this:

<xsl:variable name="UniqueId">
  <xsl:number count="foo" level="any"/>
</xsl:variable>

but with .NET Framework 1.1 (using XPathDocument) it is very slow for
large documents (say 100 MB with 100,000 foo tags in it). When I say
very slow, I am talking days, and I would like it to take minutes!

The only pure XSL alternative I've seen is to use position(). However,
the <foo> tags can occur at different levels within the document (and
might be nested), so I'm thinking that position() would be difficult to
use. There are also other templates within the XSLT which perform other
processing.

The IDs I generate don't have to be contiguous, but they must increase
the further you go down the document.

Is there any simple, reliable solution, or should I just bite the bullet
and pre-process the document with C# to put in these IDs before running
the rest of the transform?

Thanks

Andy

Nov 22 '06 #1
aj****@blueyonder.co.uk wrote:
> I am trying to allocate a unique ID to every instance of the tag 'foo' in a
> large XML document. Currently I'm doing this:
>
> <xsl:variable name="UniqueId">
>   <xsl:number count="foo" level="any"/>
> </xsl:variable>
>
> but with .NET Framework 1.1 (using XPathDocument) it is very slow for
> large documents (say 100 MB with 100,000 foo tags in it). When I say
> very slow, I am talking days, and I would like it to take minutes!

A unique id can be generated with generate-id() although the format will
not be a number but rather a string following the XML ID requirements.

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Nov 22 '06 #2
Martin Honnen wrote:
> aj****@blueyonder.co.uk wrote:
>> I am trying to allocate a unique ID to every instance of the tag 'foo' in a
>> large XML document. Currently I'm doing this:
>>
>> <xsl:variable name="UniqueId">
>>   <xsl:number count="foo" level="any"/>
>> </xsl:variable>
>>
>> but with .NET Framework 1.1 (using XPathDocument) it is very slow for
>> large documents (say 100 MB with 100,000 foo tags in it). When I say
>> very slow, I am talking days, and I would like it to take minutes!
>
> A unique id can be generated with generate-id() although the format will
> not be a number but rather a string following the XML ID requirements.

Thanks Martin, I didn't know about that one.

Strictly speaking, my ID doesn't have to be a number, but I do require that
the IDs sort lexically into document order. I guess this is
unlikely (and the standard certainly doesn't guarantee it), so
unfortunately I don't think it's something I will be able to use this
time.

Nov 22 '06 #3
aj****@blueyonder.co.uk wrote:
> [...]
> The IDs I generate don't have to be contiguous, but they must increase
> the further you go down the document.

Others have provided a number of solutions to the original request,
but I feel I should take issue with this one. It's probably A Bad Idea
to trespass on the ID space by adding another meaning to it. An ID is
just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Trying to make an ID value mean something in addition is almost always
wrong, and almost always the hallmark of poor data design. It's like the
traditional way of creating customer numbers: two digits for the area,
three digits for the industry code, then a dash because that company we
took over in 1954 always used them, then one digit for this and four
digits for that, then a check digit, and finally a "unique" sequence
number. Accounting offices and marketing offices *love* doing this, when
what they should be doing is recording all that information elsewhere
and assigning an arbitrary unique ID to the customer.

If your customer needs a sequence indicator, create an attribute and
make it reflect the numeric sequence position of each foo element in the
document. If the data is long-term important, just keep the ID as an ID
and your successors will thank you for it.

On the other hand, if like a lot of business data it's only important
for 10-15 minutes while a decision is made, then any old junk will do so
long as it satisfies the immediate conditions :-)

///Peter
--
XML FAQ: http://xml.silmaril.ie
Nov 23 '06 #4
In article <11**********************@k70g2000cwa.googlegroups.com>,
<aj****@blueyonder.co.uk> wrote:
> I am trying to allocate a unique ID to every instance of the tag 'foo' in a
> large XML document. Currently I'm doing this:
>
> <xsl:variable name="UniqueId">
>   <xsl:number count="foo" level="any"/>
> </xsl:variable>

The only reasonably efficient XSLT solution I can come up with is a
three-step process. First assign ids (which won't be numbers, and could be
in any format) to the elements using generate-id(). Then generate a
file mapping ids to sequence numbers (using position() on a node-set
containing all the desired elements). Then use key() to look up the
sequence numbers in the map file. I think this should be O(N) (or
close) for most stylesheet processors.

Here are the stylesheets, which assume that you want to operate on all
"foo" elements, that you want to call the sequence number attribute
"seq", and that you don't already have attributes called "id".

(1) Assign ids:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <xsl:attribute name="id"><xsl:value-of select="generate-id()"/></xsl:attribute>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

(2) Create map file from the result of step 1:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <junk>
      <xsl:apply-templates select="//foo"/>
    </junk>
  </xsl:template>

  <xsl:template match="foo">
    <map id="{@id}" seq="{position()}"/>
  </xsl:template>

</xsl:stylesheet>

(3) Map ids to sequence numbers (pass in the URL of the map file as the
"mapfile" parameter, and use the file generated in step 1 as the input):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="mapfile"/>

  <xsl:key name="id" match="map" use="@id"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <xsl:apply-templates select="@*[name() != 'id']"/>
      <xsl:attribute name="seq">
        <xsl:variable name="id" select="@id"/>
        <!-- the for-each is just to set the context node for key() -->
        <xsl:for-each select="document($mapfile)">
          <xsl:value-of select="key('id', $id)/@seq"/>
        </xsl:for-each>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
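
One way the three passes might be chained from C#, as a sketch rather than a tested recipe. It assumes .NET 2.0 or later (XslCompiledTransform, which is not available in 1.1), and every file name below is a placeholder. The mapfile parameter is passed as an absolute URI because a relative one would be resolved against the stylesheet's location, not the input's.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical driver class; all names here are illustrative.
public class ThreePassNumbering
{
    // Runs one stylesheet over one input file and writes the result to disk.
    static void RunPass(string stylesheet, string input, string output,
                        XsltArgumentList xsltArgs)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        // Step 3 calls document(), which XslCompiledTransform leaves
        // disabled unless it is switched on explicitly.
        XsltSettings settings = new XsltSettings(true, false);
        xslt.Load(stylesheet, settings, new XmlUrlResolver());

        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            // The resolver passed here is the one document() uses at run time.
            xslt.Transform(reader, xsltArgs, writer, new XmlUrlResolver());
        }
    }

    public static void Main()
    {
        // Placeholder file names for the stylesheets in steps (1) to (3).
        RunPass("assign-ids.xsl", "input.xml", "with-ids.xml", null);
        RunPass("make-map.xsl", "with-ids.xml", "map.xml", null);

        XsltArgumentList xsltArgs = new XsltArgumentList();
        string mapUri = new Uri(Path.GetFullPath("map.xml")).AbsoluteUri;
        xsltArgs.AddParam("mapfile", "", mapUri);
        RunPass("apply-map.xsl", "with-ids.xml", "output.xml", xsltArgs);
    }
}
```

Each pass finishes writing its output before the next pass opens it, so the intermediate files double as a record of what each step produced.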

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Nov 23 '06 #5
Peter Flynn wrote:
> It's probably A Bad Idea
> to trespass on the ID space by adding another meaning to it. An ID is
> just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Granted. I was assuming that what was wanted here was just a sequence
identifier, not an ID in the ID/IDREF sense, since the request was
specifically that it be a monotonically increasing numeric value.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 23 '06 #6
On 22 Nov 2006 04:42:13 -0800, aj****@blueyonder.co.uk wrote:
> Hi,
>
> I am trying to allocate a unique ID to every instance of the tag 'foo' in a
> large XML document. Currently I'm doing this:
>
> <xsl:variable name="UniqueId">
>   <xsl:number count="foo" level="any"/>
> </xsl:variable>
>
> but with .NET Framework 1.1 (using XPathDocument) it is very slow for
> large documents (say 100 MB with 100,000 foo tags in it). When I say
> very slow, I am talking days, and I would like it to take minutes!
>
> The only pure XSL alternative I've seen is to use position(). However,
> the <foo> tags can occur at different levels within the document (and
> might be nested), so I'm thinking that position() would be difficult to
> use. There are also other templates within the XSLT which perform other
> processing.
>
> The IDs I generate don't have to be contiguous, but they must increase
> the further you go down the document.
>
> Is there any simple, reliable solution, or should I just bite the bullet
> and pre-process the document with C# to put in these IDs before running
> the rest of the transform?
>
> Thanks
>
> Andy

Although it seems to be relatively unknown, you can actually
include C# code in your XSLT.
For example, add the following at the bottom of your XSLT:

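<!-- Requires xmlns:ms="urn:schemas-microsoft-com:xslt" on xsl:stylesheet,
     plus a declaration binding the "ext" prefix to a namespace URI of
     your choosing (for example xmlns:ext="urn:my-extensions"). -->
<ms:script language="C#" implements-prefix="ext">
<![CDATA[

// Running counter, incremented on every call.
int currentPosition = 0;

// Returns "1", "2", "3", ... on successive calls.
public string GetPosition(){
    currentPosition = currentPosition + 1;
    return currentPosition.ToString();
}

]]>
</ms:script>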

Now you can get an incrementing ID using something like:
<xsl:value-of select="ext:GetPosition()"/>

That said, nothing is going to be as fast as opening an
XmlTextReader/XmlTextWriter pair and iterating through the document, adding
the attribute whenever you read an element whose Name is 'foo', as the whole
file has to be parsed and rewritten anyway.
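
A minimal sketch of that reader/writer pass, for illustration only. It assumes the elements are named foo with no namespace, that the new attribute is called seq and is not already present, and that the class name and command-line arguments are made up for the example. XmlTextReader and XmlTextWriter are the classes available in 1.1; later framework versions would more commonly use XmlReader.Create and XmlWriter.Create.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public class FooSequencer
{
    // Copies every node from reader to writer, except that each <foo>
    // start tag gains seq="1", seq="2", ... in document order.
    // The XML declaration is not copied; the caller writes a fresh one.
    // Returns the number of foo elements numbered.
    public static int AddSequence(XmlReader reader, XmlWriter writer)
    {
        int seq = 0;
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    bool isEmpty = reader.IsEmptyElement; // read before touching attributes
                    bool isFoo = reader.LocalName == "foo" && reader.NamespaceURI == "";
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, false); // existing attributes, unchanged
                    if (isFoo)
                    {
                        seq = seq + 1;
                        writer.WriteAttributeString("seq", XmlConvert.ToString(seq));
                    }
                    if (isEmpty) writer.WriteEndElement();
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.EntityReference:
                    writer.WriteEntityRef(reader.Name);
                    break;
                case XmlNodeType.ProcessingInstruction:
                    writer.WriteProcessingInstruction(reader.Name, reader.Value);
                    break;
                case XmlNodeType.DocumentType:
                    writer.WriteDocType(reader.Name, reader.GetAttribute("PUBLIC"),
                                        reader.GetAttribute("SYSTEM"), reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
            }
        }
        return seq;
    }

    public static void Main(string[] args)
    {
        // args[0] = input path, args[1] = output path (caller-supplied).
        XmlTextReader reader = new XmlTextReader(args[0]);
        XmlTextWriter writer = new XmlTextWriter(args[1], Encoding.UTF8);
        try
        {
            writer.WriteStartDocument(); // declaration matching the output encoding
            int count = AddSequence(reader, writer);
            Console.WriteLine("Numbered {0} foo elements", count);
        }
        finally
        {
            writer.Close();
            reader.Close();
        }
    }
}
```

Because the counter is bumped at each start tag, nested foo elements are numbered in document order without any extra bookkeeping, and memory use stays flat however large the file is.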

Cheers,
Gadget
Nov 23 '06 #7
Gadget wrote:
> Although it seems to be relatively unknown, you can actually
> include C# code in your XSLT.

Uhm... Only in MSXSL. This is an implementation-specific, nonportable
feature (which is why it's in Microsoft's own namespace). This is
basically the "extension functions" solution, except that MS has
provided a way to inline them. (Personally, I don't like it -- but I'm a
firm believer in sticking with portable solutions unless there is
absolutely no alternative.)

As I said, stateful extensions may work but they have issues. A
sufficiently smart XSLT processor may re-order code as part of its
optimization, counting on the fact that XSLT is a functional language;
extensions break that assumption and thus may either prevent
optimization (if the processor is smart and cautious) or fail to execute
in the way you expected them to (if the processor is assuming that the
extensions will also be functional and have no persistent state).
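
For comparison, here is a rough sketch of the same counter as an ordinary (non-inline) extension function, registered from the calling C# code. It assumes .NET 2.0 or later for XslCompiledTransform, and the SequenceCounter class, the urn:example:seq namespace and the command-line arguments are invented for the example. The stylesheet would declare xmlns:seq="urn:example:seq" and call seq:Next(). The reordering caveat applies to this form just as much as to inline script.

```csharp
using System.Xml;
using System.Xml.Xsl;

// Hypothetical extension object: holds the running count between calls.
public class SequenceCounter
{
    private double current = 0;

    // Callable from XPath as seq:Next() once registered below.
    // Returns an XPath number, so value-of prints 1, 2, 3, ...
    public double Next()
    {
        current = current + 1;
        return current;
    }
}

public class NumberWithExtension
{
    public static void Main(string[] args)
    {
        // args[0] = stylesheet, args[1] = input XML, args[2] = output file
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load(args[0]);

        XsltArgumentList xsltArgs = new XsltArgumentList();
        // The namespace URI must match the one bound to the seq prefix
        // in the stylesheet.
        xsltArgs.AddExtensionObject("urn:example:seq", new SequenceCounter());

        using (XmlWriter writer = XmlWriter.Create(args[2], xslt.OutputSettings))
        {
            xslt.Transform(args[1], xsltArgs, writer);
        }
    }
}
```

Keeping the state in a compiled class rather than in the stylesheet means the counter can be unit-tested on its own, at the cost of a recompile whenever it changes.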

I honestly think the preprocessor approach is architecturally cleaner.
But, yeah, this may work.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 23 '06 #8
On Thu, 23 Nov 2006 07:43:53 -0500, Joe Kesselman wrote:
> Gadget wrote:
>> Although it seems to be relatively unknown, you can actually
>> include C# code in your XSLT.
>
> Uhm... Only in MSXSL. This is an implementation-specific, nonportable
> feature (which is why it's in Microsoft's own namespace). This is
> basically the "extension functions" solution, except that MS has
> provided a way to inline them. (Personally, I don't like it -- but I'm a
> firm believer in sticking with portable solutions unless there is
> absolutely no alternative.)
>
> As I said, stateful extensions may work but they have issues. A
> sufficiently smart XSLT processor may re-order code as part of its
> optimization, counting on the fact that XSLT is a functional language;
> extensions break that assumption and thus may either prevent
> optimization (if the processor is smart and cautious) or fail to execute
> in the way you expected them to (if the processor is assuming that the
> extensions will also be functional and have no persistent state).
>
> I honestly think the preprocessor approach is architecturally cleaner.
> But, yeah, this may work.

Well, we obviously avoid vendor-specific code where we can, but in
this case we're in a dotnet.xml group, and he's asking for a 'high
performance' solution, so that rules out native XSLT :)
The advantage of XSLT in this case is that it is the most flexible way to
manipulate XML, and it does not require recompiling every time a change is
made, which is why including the code in the XSLT is almost
certainly going to be his best flexible 'high performance' option.

If this is a one-off requirement that does not need flexibility, use an
XmlTextReader and XmlTextWriter, and manipulate the data as you copy it
from one stream to the other. This is a 'one shot' solution that requires
compilation but is the fastest 'structured' method.

Insisting on platform-independent XSLT for code that will be running
under MSXML is a bit of an 'ivory tower' practise. It is ideal if you believe
your solution might one day be ported to a Linux box, run on another vendor's
engine, or be posted for the scrutiny of the open-source community, but the
chances are that this would require redesigning 90% of your application
anyway, in which case this small part becomes negligible :)

It would be interesting to see whether the XSLT processor did reorder any of
the code, but given that this solution was provided by Microsoft, and given
that there are standards for the order in which nodes are traversed, this
is rather unlikely.

I guess it's just a case of prioritizing speed, flexibility, and
standardization.

Cheers,
Gadget
Nov 23 '06 #9
Joe Kesselman wrote:
> Peter Flynn wrote:
>> It's probably A Bad Idea to trespass on the ID space by adding another
>> meaning to it. An ID is just an ID, nothing more: all it says is "This
>> Is Me, I'm Unique".
>
> Granted. I was assuming that what was wanted here was just a sequence
> identifier, not an ID in the ID/IDREF sense, since the request was
> specifically that it be a monotonically increasing numeric value.

Yep, but in that case it would be better to call it SEQ or something,
just in case it gets accidentally misinterpreted as being an ID in the
XML sense of the term.

///Peter
Nov 24 '06 #10
Gadget wrote:
> this case we're in a dotnet.xml group, and he's asking for a 'high
> performance' solution, so that rules out native XSLT :)

Some of you are in dotnet.xml groups; the discussion's being crossposted.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 25 '06 #11
I agree with Gadget.
The cross-platform thing without code rewriting is still like a dream.
Yet I suggest that the file mapping solution is a better choice.

"Joe Kesselman" <ke************@comcast.net> wrote
news:bf******************************@comcast.com...
> Gadget wrote:
>> this case we're in a dotnet.xml group, and he's asking for a 'high
>> performance' solution, so that rules out native XSLT :)
>
> Some of you are in dotnet.xml groups; the discussion's being
> crossposted.
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 25 '06 #12
W. Jordan wrote:
> The cross-platform thing without code rewriting is still like a dream.

I'm not surprised to hear that opinion expressed in a Microsoft-specific
group. The rest of the industry seems to be managing it pretty well.

There are certainly times that a portable solution doesn't matter and
extensions are the right answer. Up to the developer to decide whether
this is such a case. I'll let it rest at that.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 25 '06 #13
