high-performance alternative to xsl:number

ajfish

Hi,

I am trying to allocate a unique ID to every instance of tag 'foo' in a
large XML document. Currently I'm doing this:

<xsl:variable name="UniqueId">
<xsl:number count="foo" level="any"/>
</xsl:variable>

but with .NET Framework 1.1 (using XPathDocument) it is very slow for
large documents (say 100 MB with 100,000 foo tags in it). When I say
very slow, I am talking days, and I would like it to take minutes!

The only pure XSL alternative I've seen is to use position(). However,
the <foo> tags can occur at different levels within the document (and
might be nested), so I'm thinking that position() would be difficult to
use. There are also other templates within the XSLT which perform other
processing.

The IDs I generate don't have to be contiguous, but they must increase
the further you go down the document.

Is there any simple, reliable solution, or should I just bite the bullet
and pre-process the document with C# to put in these IDs before running
the rest of the transform?

Thanks

Andy

Nov 22 '06 #1

Martin Honnen

aj****@blueyonder.co.uk wrote:

> I am trying to allocate a unique ID to every instance of tag 'foo' in a
> large XML document. currently I'm doing this:
>
> <xsl:variable name="UniqueId">
> <xsl:number count="foo" level="any"/>
> </xsl:variable>
>
> but with .Net framework 1.1 (using XPathDocument) it is very slow for
> large documents (say 100mb with 100,000 foo tags in it). when I say
> very slow, I am talking days and I would like it to take minutes !!

A unique id can be generated with generate-id() although the format will
not be a number but rather a string following the XML ID requirements.
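
A minimal sketch of that suggestion, assuming the ID is written to an attribute on each foo element. The attribute name "uid" is made up for illustration and is not from the original stylesheet. The values returned by generate-id() are implementation-dependent, so nothing here guarantees that they sort in document order:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy anything not handled below -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- generate-id() takes the place of the xsl:number call -->
  <xsl:template match="foo">
    <xsl:variable name="UniqueId" select="generate-id()"/>
    <xsl:copy>
      <!-- "uid" is a hypothetical attribute name -->
      <xsl:attribute name="uid"><xsl:value-of select="$UniqueId"/></xsl:attribute>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```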

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/

Nov 22 '06 #2

ajfish

Martin Honnen wrote:

> aj****@blueyonder.co.uk wrote:
>
>> I am trying to allocate a unique ID to every instance of tag 'foo' in a
>> large XML document. currently I'm doing this:
>>
>> <xsl:variable name="UniqueId">
>> <xsl:number count="foo" level="any"/>
>> </xsl:variable>
>>
>> but with .Net framework 1.1 (using XPathDocument) it is very slow for
>> large documents (say 100mb with 100,000 foo tags in it). when I say
>> very slow, I am talking days and I would like it to take minutes !!
>
> A unique id can be generated with generate-id() although the format will
> not be a number but rather a string following the XML ID requirements.

Thanks Martin, I didn't know about that one.

Strictly, my ID doesn't have to be a number, but I would require that
the IDs sort lexically into document order. I guess this is
unlikely (and the standard certainly doesn't guarantee it), so
unfortunately I don't think it's something I will be able to use this
time.

Nov 22 '06 #3

Peter Flynn

aj****@blueyonder.co.uk wrote:
[...]

> the Id's I generate don't have to be contiguous but they must increase
> the further you go down the document

Others have provided a number of solutions to the original request,
but I feel I should take issue with this one. It's probably A Bad Idea
to trespass on the ID space by adding another meaning to it. An ID is
just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Trying to make an ID value mean something in addition is almost always
wrong, and almost always the hallmark of poor data design. It's like the
traditional way of creating customer numbers: two digits for the area,
three digits for the industry code, then a dash because that company we
took over in 1954 always used them, then one digit for this and four
digits for that, then a check digit, and finally a "unique" sequence
number. Accounting offices and marketing offices *love* doing this, when
what they should be doing is recording all that information elsewhere
and assigning an arbitrary unique ID to the customer.

If your customer needs a sequence indicator, create an attribute and
make it reflect the numeric sequence position of each foo element in the
document. If the data is long-term important, just keep the ID as an ID
and your successors will thank you for it.

On the other hand, if like a lot of business data it's only important
for 10-15 minutes while a decision is made, then any old junk will do so
long as it satisfies the immediate conditions :-)

///Peter
--
XML FAQ: http://xml.silmaril.ie

Nov 23 '06 #4

Richard Tobin

In article <11**********************@k70g2000cwa.googlegroups.com>,
<aj****@blueyonder.co.uk> wrote:

> I am trying to allocate a unique ID to every instance of tag 'foo' in a
> large XML document. currently I'm doing this:
>
> <xsl:variable name="UniqueId">
> <xsl:number count="foo" level="any"/>
> </xsl:variable>

The only reasonably efficient XSLT solution I can come up with is a 3
step process. First assign ids (which won't be numbers, and could be
in any format) to the elements using generate-id(). Then generate a
file mapping ids to sequence numbers (using position() on a node-set
containing all the desired elements). Then use key() to look up the
sequence numbers in the map file. I think this should be order N (or
close) for most stylesheet processors.

Here are the stylesheets, which assume that you want to operate on all
"foo" elements, that you want to call the sequence number attribute
"seq", and that you don't already have attributes called "id".

(1) Assign ids:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <xsl:attribute name="id"><xsl:value-of select="generate-id()"/></xsl:attribute>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

(2) Create map file from the result of step 1:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <junk>
      <xsl:apply-templates select="//foo"/>
    </junk>
  </xsl:template>

  <xsl:template match="foo">
    <map id="{@id}" seq="{position()}"/>
  </xsl:template>

</xsl:stylesheet>

(3) Map ids to sequence numbers (pass in the URL of the map file as the
"mapfile" parameter, and use the file generated in step 1 as the input):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="mapfile"/>

  <xsl:key name="id" match="map" use="@id"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <!-- copy every attribute except the temporary id -->
      <xsl:apply-templates select="@*[name() != 'id']"/>
      <xsl:attribute name="seq">
        <xsl:variable name="id" select="@id"/>
        <!-- for-each switches the context to the map file so that
             key() searches that document rather than the input -->
        <xsl:for-each select="document($mapfile)">
          <xsl:value-of select="key('id', $id)/@seq"/>
        </xsl:for-each>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Nov 23 '06 #5

Joe Kesselman

Peter Flynn wrote:

> It's probably A Bad Idea
> to trespass on the ID space by adding another meaning to it. An ID is
> just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Granted. I was assuming that what was wanted here was just a sequence
identifier, not an ID in the ID/IDREF sense, since the request was
specifically that it be a monotonically increasing numeric value.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 23 '06 #6

Gadget

On 22 Nov 2006 04:42:13 -0800, aj****@blueyonder.co.uk wrote:

> Hi,
>
> I am trying to allocate a unique ID to every instance of tag 'foo' in a
> large XML document. currently I'm doing this:
>
> <xsl:variable name="UniqueId">
> <xsl:number count="foo" level="any"/>
> </xsl:variable>
>
> but with .Net framework 1.1 (using XPathDocument) it is very slow for
> large documents (say 100mb with 100,000 foo tags in it). when I say
> very slow, I am talking days and I would like it to take minutes !!
>
> the only pure XSL alternative I've seen is to use position(). however,
> the <foo> tags can occur at different levels within the document (and
> might be nested), so I'm thinking that position would be difficult to
> use. There are also other templates within the XSLT which perform other
> processing.
>
> the Id's I generate don't have to be contiguous but they must increase
> the further you go down the document
>
> is there any simple reliable solution, or should I just bite the bullet
> and pre-process the document with C# to put in these Ids before running
> the rest of the transform
>
> Thanks
>
> Andy

Although it seems to be relatively unknown, you can include C# code
in your XSLT.
For example, add the following at the bottom of your XSLT:

<ms:script language="C#" implements-prefix="ext">
<![CDATA[

  int currentPosition = 0;

  public string GetPosition() {
    currentPosition = currentPosition + 1;
    return currentPosition.ToString();
  }

]]>
</ms:script>

Now you can get an incrementing ID using something like:
<xsl:value-of select="ext:GetPosition()"/>
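
The fragment above leaves out the namespace declarations it depends on. A fuller sketch, assuming the counter goes into an attribute called "seq" and that the "ext" prefix is bound to a made-up namespace URI (urn:example:counter, not from the original post), might look like this. The "ms" prefix has to be bound to Microsoft's urn:schemas-microsoft-com:xslt namespace. The numbers only follow document order if the processor evaluates the calls in document order, which the XSLT specification does not promise:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ms="urn:schemas-microsoft-com:xslt"
    xmlns:ext="urn:example:counter"
    exclude-result-prefixes="ms ext">

  <!-- Identity template: copy anything not handled below -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Stamp each foo with the next counter value;
       "seq" is a hypothetical attribute name -->
  <xsl:template match="foo">
    <xsl:copy>
      <xsl:attribute name="seq"><xsl:value-of select="ext:GetPosition()"/></xsl:attribute>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Stateful counter, as in the fragment above -->
  <ms:script language="C#" implements-prefix="ext">
  <![CDATA[
    int currentPosition = 0;

    public string GetPosition() {
      currentPosition = currentPosition + 1;
      return currentPosition.ToString();
    }
  ]]>
  </ms:script>

</xsl:stylesheet>
```

In .NET versions that use XslCompiledTransform, script blocks are disabled by default and have to be switched on through XsltSettings before a stylesheet like this will load.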

That said, nothing is going to be as fast as opening an
XmlTextReader/XmlTextWriter pair and iterating through the document, adding
the attribute whenever you read an element whose Name is 'foo', since the
whole file has to be parsed and rewritten anyway.

Cheers,
Gadget

Nov 23 '06 #7

Joe Kesselman

Gadget wrote:

> Actually, although it seems to be relatively unknown, you can actually
> include C# code in your XSLT.

Uhm... Only in MSXSL. This is an implementation-specific, nonportable
feature (which is why it's in Microsoft's own namespace). This is
basically the "extension functions" solution, except that MS has
provided a way to inline them. (Personally, I don't like it -- but I'm a
firm believer in sticking with portable solutions unless there is
absolutely no alternative.)

As I said, stateful extensions may work but they have issues. A
sufficiently smart XSLT processor may re-order code as part of its
optimization, counting on the fact that XSLT is a functional language;
extensions break that assumption and thus may either prevent
optimization (if the processor is smart and cautious) or fail to execute
in the way you expected them to (if the processor is assuming that the
extensions will also be functional and have no persistent state).

I honestly think the preprocessor approach is architecturally cleaner.
But, yeah, this may work.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 23 '06 #8

Gadget

On Thu, 23 Nov 2006 07:43:53 -0500, Joe Kesselman wrote:

> Gadget wrote:
>> Actually, although it seems to be relatively unknown, you can actually
>> include C# code in your XSLT.
>
> Uhm... Only in MSXSL. This is an implementation-specific, nonportable
> feature (which is why it's in Microsoft's own namespace). This is
> basically the "extension functions" solution, except that MS has
> provided a way to inline them. (Personally, I don't like it -- but I'm a
> firm believer in sticking with portable solutions unless there is
> absolutely no alternative.)
>
> As I said, stateful extensions may work but they have issues. A
> sufficiently smart XSLT processor may re-order code as part of its
> optimization, counting on the fact that XSLT is a functional language;
> extensions break that assumption and thus may either prevent
> optimization (if the processor is smart and cautious) or fail to execute
> in the way you expected them to (if the processor is assuming that the
> extensions will also be functional and have no persistent state).
>
> I honestly think the preprocessor approach is architecturally cleaner.
> But, yeah, this may work.

Well we obviously avoid vendor specific code when doing anything, but in
this case we're in a dotnet.xml group, and he's asking for a 'high
performance' solution, so that rules out native XSLT :)
The advantage of XSLT in this case is that it is the most flexible way to
manipulate XML, and does not require recompiling every time a change is
made, which is why the inclusion of the code in the XSLT is almost
certainly going to be his best flexible 'high performance' option.

If this is a single requirement that does not require flexibility, use an
XmlTextReader and XmlTextWriter, and manipulate the data as you copy it
from one stream to another. This is a 'one shot' solution that requires
compilation but is the fastest 'structured' method.

Insisting on using platform independent XSLT for code that will be running
under MSXML is a bit of an 'ivory tower' practise, and ideal if you believe
your solution might one day be ported to a Linux box, run another vendor's
engine, or be posted for the scrutiny of the open-source community, but the
chances are that this would require redesigning 90% of your application
anyway, in which case this small part becomes negligible :)

It would be interesting to see if the XSLT processor did reorder any of the
code, but given that this solution was provided by Microsoft, and given
that there are standards for the order in which nodes are traversed, this
is rather unlikely.

I guess it's just a case of prioritizing speed, flexibility, and
standardization.

Cheers,
Gadget

Nov 23 '06 #9

Peter Flynn

Joe Kesselman wrote:

> Peter Flynn wrote:
>> It's probably A Bad Idea to trespass on the ID space by adding another
>> meaning to it. An ID is just an ID, nothing more: all it says is "This
>> Is Me, I'm Unique".
>
> Granted. I was assuming that what was wanted here was just a sequence
> identifier, not an ID in the ID/IDREF sense, since the request was
> specifically that it be a monotonically increasing numeric value.

Yep, but in that case it would be better to call it SEQ or something,
just in case it gets accidentally misinterpreted as being an ID in the
XML sense of the term.

///Peter

Nov 24 '06 #10

Joe Kesselman

Gadget wrote:

> this case we're in a dotnet.xml group, and he's asking for a 'high
> performance' solution, so that rules out native XSLT :)

Some of you are in a dotnet.xml group; the discussion's being crossposted.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 25 '06 #11

W. Jordan

I agree with Gadget.
The cross-platform thing without code rewriting is still like a dream.
Yet I suggest that the file mapping solution is a better choice.

"Joe Kesselman" <ke************@comcast.net> wrote
news:bf******************************@comcast.com...

> Gadget wrote:
>> this case we're in a dotnet.xml group, and he's asking for a 'high
>> performance' solution, so that rules out native XSLT :)
>
> Some of you are in a dotnet.xml group; the discussion's being
> crossposted.
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 25 '06 #12

Joe Kesselman

W. Jordan wrote:

> The cross-platform thing without code rewriting is still like a dream.

I'm not surprised to hear that opinion expressed in a Microsoft-specific
group. The rest of the industry seems to be managing it pretty well.

There are certainly times that a portable solution doesn't matter and
extensions are the right answer. Up to the developer to decide whether
this is such a case. I'll let it rest at that.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 25 '06 #13
