XSLT Compare two documents and output differences

Greetings,

I am relatively new to what I would call advanced XSLT/XPath and I
am after some advice from those in the know. I am trying to figure
out a mechanism within XSLT to compare two source documents and output
the node-sets which are "different" (changed or new) to new XML files
using xsl:result-document.

To describe the problem I have provided some example data below along
with a portion of my current XSLT. I have changed the meaning of
the data to make it less specific to my project, in case the
suggestions we get here prove useful to others.

OK, so the problem is as follows:

- We have a source document "SourceData.xml" containing a catalogue
of "Fish" provided for us by a partner so that we can update our
internal databases.

- The process requires that we take each <datarecord> node and parse
it into our internal format using our naming conventions.

- We also have to perform a replacement against their "location"
element, which does not map to our "habitat" values. I have done this
by loading a lookup file called "DataMapping.xml" into a global
variable. I then assign an xsl:key to the @clientname attribute of the
<entry> element. When I need to get the value I grab the client's value
into a variable, switch to the lookup document's context using the
xsl:for-each trick and then perform a lookup using key(x,y).

- Each <datarecord> node in the Source will produce a new XML file
containing a single <updateRecord> element with our structure beneath.
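
For anyone who wants to check the mapping step outside XSLT, the list above can be sketched in Python. This is only an illustration under assumptions, not the poster's code: the function names `load_habitat_map` and `to_update_record` are invented here, the dict stands in for the xsl:key on @clientname, and the XML is a trimmed copy of the sample files shown further down (with the <entry> elements self-closed).

```python
import xml.etree.ElementTree as ET

# Trimmed-down copies of the sample files from the post.
MAPPING_XML = """<mapping>
  <mapsection name="oceans">
    <entry clientname="Pacific" internalname="Pacific Ocean"/>
    <entry clientname="Indian" internalname="Indian Ocean"/>
  </mapsection>
</mapping>"""

RECORD_XML = """<datarecord>
  <species>23</species>
  <subspecies>25</subspecies>
  <location>Indian</location>
  <name>Purple Bopper Fish</name>
</datarecord>"""


def load_habitat_map(mapping_xml):
    """Build a dict keyed on @clientname (hypothetical helper, mirrors xsl:key)."""
    root = ET.fromstring(mapping_xml)
    return {
        entry.get("clientname"): entry.get("internalname")
        for entry in root.findall("mapsection[@name='oceans']/entry")
    }


def to_update_record(record, habitat_map):
    """Rename the partner's elements and translate <location> into <habitat>."""
    update = ET.Element("updateRecord")
    ET.SubElement(update, "family").text = record.findtext("species")
    ET.SubElement(update, "genus").text = record.findtext("subspecies")
    # An unmapped location yields an empty habitat, like an empty key() result.
    ET.SubElement(update, "habitat").text = habitat_map.get(
        record.findtext("location"), "")
    ET.SubElement(update, "fullname").text = record.findtext("name")
    return update


habitat_map = load_habitat_map(MAPPING_XML)
update = to_update_record(ET.fromstring(RECORD_XML), habitat_map)
print(ET.tostring(update, encoding="unicode"))
```

The dict is built once and reused for every record, which is the same reason the key() lookup stays cheap in the stylesheet.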

All of this works fine (oddly enough) and we have been quite impressed
with how XSLT handles all this. HOWEVER, we have just been told that
the partner who supplies our Source XML is not able to filter the
records they send us down to only those that are new or recently
modified; in fact they have to send us pretty much their entire
database. There is no option for them to change this, and to make
matters worse the source file could grow to upwards of 50,000 records,
making it over 120MB.

I have been asked to look at ways to compare the previous day's Source
XML against the one coming in and output only those records which are
new or have changed. I am currently doing this in the code wrapping the
XSLT transformation, but it's going to get really slow when there are
50k records.

The rules are:

- Both documents will have an identical structure
- Both documents will have ~95% the same content
- The source document <datarecord> has a compound key to make it
unique: <species> + <subspecies>
- A modified record is one with any change to the payload value of
the elements within the <datarecord>
- A new record is obviously one not found in the previous day's XML
- We want to either produce a single XML file containing the new or
modified records *OR* incorporate the required XSLT into our current
GenerateDataSegments.xsl
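
The rules above can be sketched as a keyed comparison. The Python below is a minimal illustration, assuming the flat record layout of the sample files (text-only child elements); `record_key`, `payload` and `new_or_changed` are hypothetical names, and the data is a compacted copy of the two sample documents from this post.

```python
import xml.etree.ElementTree as ET

CURRENT_XML = """<main>
  <datarecord><species>23</species><subspecies>23</subspecies>
    <location>Pacific</location><name>Blue Bopper Fish</name></datarecord>
  <datarecord><species>23</species><subspecies>25</subspecies>
    <location>Indian</location><name>Purple Bopper Fish</name></datarecord>
  <datarecord><species>17</species><subspecies>3</subspecies>
    <location>Atlantic</location><name>Ringed Oaf Fish</name></datarecord>
</main>"""

PREVIOUS_XML = """<main>
  <datarecord><species>23</species><subspecies>25</subspecies>
    <location>Southern</location><name>Purple Bopper Fish</name></datarecord>
  <datarecord><species>17</species><subspecies>3</subspecies>
    <location>Atlantic</location><name>Ringed Oaf Fish</name></datarecord>
</main>"""


def record_key(record):
    """Compound key: <species> + <subspecies>."""
    return (record.findtext("species"), record.findtext("subspecies"))


def payload(record):
    """Child tag/text pairs in document order, with whitespace trimmed."""
    return [(child.tag, (child.text or "").strip()) for child in record]


def new_or_changed(current_root, previous_root):
    # Index the previous day's records once, so each lookup is a dict access
    # instead of a scan over the whole previous document.
    previous = {record_key(r): payload(r)
                for r in previous_root.iter("datarecord")}
    # A missing key gives None, which never equals a payload list (new record).
    return [r for r in current_root.iter("datarecord")
            if previous.get(record_key(r)) != payload(r)]


result = new_or_changed(ET.fromstring(CURRENT_XML), ET.fromstring(PREVIOUS_XML))
for record in result:
    print(record_key(record), record.findtext("name"))
```

With the sample data this reports the Blue Bopper Fish (new) and the Purple Bopper Fish (location changed), and skips the Ringed Oaf Fish.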

I have been thinking about loading one document as the source and
then using document() to load the previous file (its name passed as a
global param), but frankly I'm a little lost as to how to attack it
after that.

If the answer is that there is no decent way of doing this in XSLT
without killing the load on the machine, does anyone know of a fully
automatable command-line tool or service that can do the "compare and
output differences" bit? Open source or commercial is fine by me. For
the record, I'm currently using the latest build of Saxon-B.
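
On the size concern, one option outside XSLT is to stream both files rather than load them whole. The sketch below is an assumption-laden illustration, not a recommendation of a specific tool: it uses Python's standard `xml.etree.ElementTree.iterparse`, keeps only a digest per compound key for the previous file, and the helper names (`iter_records`, `fingerprint`, `new_or_changed_keys`) are made up for this example.

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def iter_records(source):
    """Yield each <datarecord> as soon as its end tag has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "datarecord":
            yield elem
            elem.clear()  # free the record's children once it has been handled


def fingerprint(record):
    """Return ((species, subspecies), digest of the child tag/text pairs)."""
    key = (record.findtext("species"), record.findtext("subspecies"))
    digest = hashlib.sha256()
    for child in record:
        digest.update(child.tag.encode("utf-8") + b"\0")
        digest.update((child.text or "").strip().encode("utf-8") + b"\0")
    return key, digest.digest()


def new_or_changed_keys(current_source, previous_source):
    # Only one small digest per record of the previous file is kept in memory.
    previous = dict(fingerprint(r) for r in iter_records(previous_source))
    for record in iter_records(current_source):
        key, digest = fingerprint(record)
        if previous.get(key) != digest:
            yield key


# In-memory stand-ins for the two files; file names or open files work too.
previous_file = io.BytesIO(
    b"<main>"
    b"<datarecord><species>23</species><subspecies>25</subspecies>"
    b"<location>Southern</location><name>Purple Bopper Fish</name></datarecord>"
    b"<datarecord><species>17</species><subspecies>3</subspecies>"
    b"<location>Atlantic</location><name>Ringed Oaf Fish</name></datarecord>"
    b"</main>")
current_file = io.BytesIO(
    b"<main>"
    b"<datarecord><species>23</species><subspecies>23</subspecies>"
    b"<location>Pacific</location><name>Blue Bopper Fish</name></datarecord>"
    b"<datarecord><species>23</species><subspecies>25</subspecies>"
    b"<location>Indian</location><name>Purple Bopper Fish</name></datarecord>"
    b"<datarecord><species>17</species><subspecies>3</subspecies>"
    b"<location>Atlantic</location><name>Ringed Oaf Fish</name></datarecord>"
    b"</main>")

changed_keys = list(new_or_changed_keys(current_file, previous_file))
print(changed_keys)
```

The trade-off is that this yields keys rather than full records; a real wrapper would serialize each record before moving on to the next one.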
<!-- SourceData.xml -->

<?xml version="1.0" encoding="UTF-8"?>
<main>
<datarecord>
<species>23</species>
<subspecies>23</subspecies>
<location>Pacific</location>
<name>Blue Bopper Fish</name>
</datarecord>
<datarecord>
<species>23</species>
<subspecies>25</subspecies>
<location>Indian</location>
<name>Purple Bopper Fish</name>
</datarecord>
<datarecord>
<species>17</species>
<subspecies>3</subspecies>
<location>Atlantic</location>
<name>Ringed Oaf Fish</name>
</datarecord>
...
</main>
<!-- DataMapping.xml -->

<?xml version="1.0" encoding="UTF-8"?>
<mapping>
<mapsection name="oceans">
<entry clientname="Pacific" internalname="Pacific Ocean" />
<entry clientname="Atlantic" internalname="Atlantic Ocean" />
<entry clientname="Indian" internalname="Indian Ocean" />
<entry clientname="Southern" internalname="Southern Ocean" />
</mapsection>
</mapping>
<!-- GenerateDataSegments.xsl -->

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">
  <xsl:param name="outputPath" />
  <xsl:variable name="dataMapping" select="document('DataMapping.xml')" />
  <xsl:key name="oceans" match="mapsection[@name='oceans']/entry"
           use="@clientname" />
  <xsl:template match="/">
    <xsl:for-each select="main/datarecord">
      <xsl:result-document
          href="file:///{$outputPath}-{count(ancestor::node()|preceding::*)}.xml">
        <updateRecord>
          <family><xsl:value-of select="species" /></family>
          <genus><xsl:value-of select="subspecies" /></genus>
          <habitat>
            <xsl:variable name="clientHabitat" select="location" />
            <!-- switch context to the lookup document so key() searches it -->
            <xsl:for-each select="$dataMapping">
              <xsl:value-of select="key('oceans', $clientHabitat)/@internalname" />
            </xsl:for-each>
          </habitat>
          <fullname><xsl:value-of select="name" /></fullname>
        </updateRecord>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
<!-- PreviousSourceData.xml - Missing one record and value changed in
another-->

<?xml version="1.0" encoding="UTF-8"?>
<main>
<datarecord>
<species>23</species>
<subspecies>25</subspecies>
<location>Southern</location>
<name>Purple Bopper Fish</name>
</datarecord>
<datarecord>
<species>17</species>
<subspecies>3</subspecies>
<location>Atlantic</location>
<name>Ringed Oaf Fish</name>
</datarecord>
...
</main>

Thanks in advance for your time and assistance,

Al

Jun 22 '07 #1
You could try using the node assertion mechanics of XSLT Unit
(http://xsltunit.org/#notEqual):

<xsltu:test id="test-title">
  <xsl:call-template name="xsltu:assertEqual">
    <xsl:with-param name="id" select="'full-value'"/>
    <xsl:with-param name="nodes1">
      <xsl:apply-templates
          select="document('library.xml')/library/book[isbn='0836217462']/title"/>
    </xsl:with-param>
    <xsl:with-param name="nodes2">
      <h1>Being a Dog Is a Full-Time Job</h1>
    </xsl:with-param>
  </xsl:call-template>
</xsltu:test>

Jun 22 '07 #2
I am trying not to use any extensions. I ended up using the following,
which works perfectly.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />
  <!-- paths of the current and previous source files -->
  <xsl:param name="fileCurrentPath" />
  <xsl:param name="filePreviousPath" />
  <xsl:variable name="fileCurrent" select="document($fileCurrentPath, /)" />
  <xsl:variable name="filePrevious" select="document($filePreviousPath, /)" />
  <xsl:template match="/">
    <main>
      <xsl:apply-templates select="$fileCurrent//datarecord" mode="addedchanged" />
    </main>
  </xsl:template>
  <xsl:template match="datarecord" mode="addedchanged">
    <xsl:variable name="varSpecies" select="species" />
    <xsl:variable name="varSubspecies" select="subspecies" />
    <xsl:choose>
      <!-- a record with the same compound key exists in the previous file -->
      <xsl:when test="$filePrevious//datarecord[species=$varSpecies][subspecies=$varSubspecies]">
        <!-- copy it only if its string value differs from the previous one -->
        <xsl:if test="not(.=$filePrevious//datarecord[species=$varSpecies][subspecies=$varSubspecies])">
          <xsl:copy-of select="." />
        </xsl:if>
      </xsl:when>
      <!-- no record with this key in the previous file: it is new -->
      <xsl:otherwise>
        <xsl:copy-of select="." />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
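
One caveat worth keeping in mind with the `.=` test above: in XPath it compares string values, that is, the text of all descendants joined together, so a change that only moves text from one element to its neighbour (or changes an attribute) would not be noticed. The Python sketch below imitates that comparison to show the effect; the fish name and the unindented layout are invented for the illustration. With indented input like the sample files this is rare, because the whitespace between elements is part of the string value too. Under an XSLT 2.0 processor, `deep-equal()` is a stricter alternative.

```python
import xml.etree.ElementTree as ET


def string_value(elem):
    """XPath-style string value: all descendant text joined in document order."""
    return "".join(elem.itertext())


# Two unindented records whose text is split differently across elements.
old = ET.fromstring(
    "<datarecord><species>23</species><subspecies>25</subspecies>"
    "<location>Indian</location><name>Ocean Perch</name></datarecord>")
new = ET.fromstring(
    "<datarecord><species>23</species><subspecies>25</subspecies>"
    "<location>IndianOcean</location><name> Perch</name></datarecord>")

# What a string-value comparison sees versus an element-by-element comparison.
same_string_value = string_value(old) == string_value(new)
same_children = ([(c.tag, c.text) for c in old]
                 == [(c.tag, c.text) for c in new])
print(same_string_value, same_children)  # True False
```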

Jun 23 '07 #3
