XML diff/merge standard

Andreas Kasparek

Hola!

I'm preparing my master's thesis about an XML merge tool implementation and was
wondering if there is any open standard for XML diff regarding topics like:

- is a diff result computed on the ordered or unordered xml node tree of
the compared documents?
- what identifiers/criteria should be used by default to match elements of
the same type in different documents?
- should a diff tool consider move operations or only insert/delete
(besides update)?
- is there any common default behavior that a user would expect from an XML
diff/merge tool?

I've searched the web (especially the W3C site) and newsgroups but couldn't
find anything useful (or I was too blind). I was told this topic is a common
one in this newsgroup but I didn't find anything reasonable here either.

There are a lot of diff programs and papers on the web but none of them seem
to reference or follow any standard. There is no question that most tools
have to be configured properly by the user to fit the particular structure and
semantics of one's specific XML files in order to yield optimal results.
But is there a standard behavior (say, like sorting a list in alphabetical
order) that one might expect? There are a lot of XML files out there, so
there must be users of XML diff/merge tools who expect some
specific behavior from these tools, mustn't there? So has anyone taken the
time to write a draft or something about how to handle XML diffs/merges in a
default way?

I'm not asking for help with implementing such a tool or for a list of
algorithms; I'm looking for recommendations on how such a tool should behave in
a default setting, without regard to any user preferences.

Any pointers to some (official?) documents/websites/books/groups/conferences
would be nice. Thanks!

Hasta lügo
Andreas

P.S.: Excuse my unqualified use of OE as newsreader, but I'm constrained to
it here. Oh, and this posting is merely a private affair, nothing which
represents the owner of my sender domain in any way :)

Dec 21 '05 #1

Rémi Peyronnet

Hi,

I'm preparing my master's thesis about an XML merge tool implementation and was
wondering if there is any open standard for XML diff regarding topics like:
[...Snip...]

As these topics really depend on what the user needs and on what the
diff software chooses to implement, I cannot see how they could be
standardized. (Some standards for the output format do exist, though: XDL, used
by Microsoft Diff & Patch, and DUL, used by diffxml.)

There are a lot of diff programs and papers on the web but none of them seem
to reference or follow any standard. There is no question that most tools
have to be configured properly by the user to fit the particular structure and
semantics of one's specific XML files in order to yield optimal results.

Most of the diff tools I have seen are quite generic and do not depend
on the XML structure of the files. But the user has to choose the right
tools / options (whitespace, ordering, ...) for his needs. This is much the
same as with the classic "diff" tool, which has many options.

Hth,

--
Rémi Peyronnet

Dec 21 '05 #2

Andreas Kasparek

Hola!

"Rémi Peyronnet" <re*******@via.ecp.fr> wrote:

As these topics really depend on what the user needs and on what the
diff software chooses to implement, I cannot see how they could be
standardized.

Surely things like the ordering of elements or which attributes should be
regarded as IDs are highly dependent on the user's documents, but aren't
there any assumptions about, e.g., whether elements could have moved between
subtrees or have just been deleted/inserted, or whether an element matches
one from the other document because it has the same attribute or because it
has the same child nodes? I tried some free XML diff tools and often got
different results without preconfiguring them. And in some cases I really
wondered why the diff result was as it was and not some other way.
I think it would be nice to have some common default behavior one might
expect, on which to base one's own configuration decisions. How could I know
how to set my preferences if I first have to play around with a tool to find
out its normal mode of operation?

Let's look at a short example:

Doc1:

<root>
  <node>
    <subnode>
      Text
    </subnode>
  </node>
  <node foo="bar">
    Text
  </node>
</root>

----------------------

Doc2:

<root>
  <node foo="bar">
    <subnode>
      Text
    </subnode>
  </node>
  <node>
    Text
  </node>
</root>

So which elements of the two documents are considered to match? Is
it the first node of both docs (and, respectively, the second as well), because
they contain a similar subtree of (sub-/text) nodes? Or does the second
node of Doc1 match the first node of Doc2, because they carry the same
attribute (with the same value, maybe even declared as an ID in some schema)?
And if the order of the node elements has been reversed, was
the subnode part then moved from the non-foo node to the foo node (meaning it is
in fact the same subnode element and not just a similar one)? Or was it
deleted from one node and inserted into the other (is there any case
where that matters?)?

This is a really simple and straightforward example, I'm sure one could
construct more complex structures to show some kind of ambiguity that can't
be resolved as easily.
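
To make the ambiguity concrete, here is a minimal Python sketch that pairs the two `node` elements of Doc1 and Doc2 under two different criteria. The helper names `shape` and `match_by` are made up for this illustration and do not come from any of the tools mentioned in this thread; the greedy pairing is an assumption, not a recommended algorithm.

```python
import xml.etree.ElementTree as ET

# Doc1 and Doc2 from the example above, written on one line each.
DOC1 = "<root><node><subnode>Text</subnode></node><node foo='bar'>Text</node></root>"
DOC2 = "<root><node foo='bar'><subnode>Text</subnode></node><node>Text</node></root>"


def shape(elem):
    """Structural signature: tag, stripped text and the children's signatures."""
    return (elem.tag, (elem.text or "").strip(), tuple(shape(c) for c in elem))


def match_by(key, left, right):
    """Greedily pair child positions of two parents whose key values are equal."""
    pairs, used = [], set()
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if j not in used and key(a) == key(b):
                pairs.append((i, j))
                used.add(j)
                break
    return pairs


root1, root2 = ET.fromstring(DOC1), ET.fromstring(DOC2)
by_structure = match_by(shape, root1, root2)
by_attribute = match_by(lambda e: sorted(e.attrib.items()), root1, root2)
print(by_structure)  # [(0, 0), (1, 1)]
print(by_attribute)  # [(0, 1), (1, 0)]
```

Both pairings are defensible, which is exactly the problem: matching on subtree content keeps the nodes in place, while matching on the `foo` attribute swaps them.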

(Some standards for the output format do exist, though: XDL, used
by Microsoft Diff & Patch, and DUL, used by diffxml.)

And at least a handful of others :)

Most of the diff tools I have seen are quite generic and do not depend
on the XML structure of the files. But the user has to choose the right
tools / options (whitespace, ordering, ...) for his needs. This is much the
same as with the classic "diff" tool, which has many options.

Ok, that's right.
Thanks!

Hasta lügo
Andreas

Dec 22 '05 #3

Rémi Peyronnet

Hi Andreas,

Surely things like the ordering of elements or which attributes should be
regarded as IDs are highly dependent on the user's documents, but aren't
there any assumptions about, e.g., whether elements could have moved between
subtrees or have just been deleted/inserted, or whether an element matches
one from the other document because it has the same attribute or because it
has the same child nodes? I tried some free XML diff tools and often got
different results without preconfiguring them.

Of course it would be possible to define a default behaviour for an XML
diff tool, but I cannot see what it would improve:

In my view, what led authors to write different free XML diff tools
is that they did not find an existing one that matched their needs; there are
too many ways to diff XML files, depending on what you need:
- a systematic diff (as Microsoft's tool and xmldiff do), which
reports additions, deletions, ... Quite useful as input to software, but hardly
usable directly by a human.
- a "minimal differences" diff (as ssddiff does). Useful for text
editing (XHTML & co.).
- a diff specialized for large XML files with the same structure (as
libxmldiff does).

The three respond to different needs and therefore implement
completely different algorithms. If you define a standard behaviour,
that implies at least that all implementations can cope with this
behaviour, i.e. implement the standard algorithm.

The dream would be for every implementation to offer all possible
algorithms for all the needs a user could have, but, well, that
seems somewhat unrealistic :-)

And in some cases I really
wondered why the diff result was as it was and not some other way.
How could I know how to
set my preferences if I first have to play around with a tool to find out
its normal mode of operation?

Read the documentation? :-)

Let's look at a short example: [Snip]

This is a really simple and straightforward example, I'm sure one could
construct more complex structures to show some kind of ambiguity that can't
be resolved as easily.

Quite true; that shows that XML diff involves many more
user-dependent needs than a classic diff.

--
Rémi Peyronnet

Dec 22 '05 #4

erich.schubert

Hi,
Note that I'm the author of ssddiff, so my opinion is biased.

- a "minimal differences" diff (as ssddiff does). Useful for text editing (XHTML & co.).

Other tools also try to find minimal differences, except that usually an
approximation is preferable for speed and memory reasons. The
ssddiff prototype also has an option to do that, but this isn't really
optimized yet (I still have some ideas to improve quality in the fast
mode).
ssddiff also has three different output formats: one is basically a
merge; one is an XUpdate diff script (similar to other XML diff
applications, but a rather well-defined standard, whereas many XML diff
applications use a proprietary format, sometimes not even XML itself,
that may or may not have actual patch applications for it...); and the
third labels both source documents so that a third party can
further process the "best match" to, e.g., produce another output format.

ssddiff is NOT optimal for XHTML, because changes in XHTML are often
within "strings" of the document (sorry for using the programmer's
vocabulary and not the XML vocabulary: cdata). On the other hand, it
can work very well in some cases of XHTML because this format has a
very flexible structure (and the structure is an important part of the
information), whereas in an address book the structure is next to
meaningless and, except for the ordering, a pure classic LCS diff does
the job just fine.
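
For readers who have not met it, here is a small Python sketch of a classic LCS (longest common subsequence) diff applied to the flat entry list of an address book. The `<book>`/`<entry>` layout and the names are invented for the example, and `lcs_diff` is a hypothetical helper written for this post, not code taken from ssddiff or any other tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat address books; the structure carries no information.
OLD = "<book><entry>Alice</entry><entry>Bob</entry><entry>Carol</entry></book>"
NEW = "<book><entry>Alice</entry><entry>Carol</entry><entry>Dave</entry></book>"


def lcs_diff(old, new):
    """Return ('keep' | 'delete' | 'insert', item) steps derived from an LCS table."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    ops, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            ops.append(("keep", old[i]))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("delete", old[i]))
            i += 1
        else:
            ops.append(("insert", new[j]))
            j += 1
    ops += [("delete", item) for item in old[i:]]
    ops += [("insert", item) for item in new[j:]]
    return ops


old_entries = [e.text for e in ET.fromstring(OLD)]
new_entries = [e.text for e in ET.fromstring(NEW)]
ops = lcs_diff(old_entries, new_entries)
print(ops)
# [('keep', 'Alice'), ('delete', 'Bob'), ('keep', 'Carol'), ('insert', 'Dave')]
```

Note that the result depends on the order of the entries, which is the caveat mentioned above: a reordered address book would show up as deletions and insertions.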

So the big difference with ssddiff is that it tries to do a *semantic*
diff on the structure, whereas other diff applications basically do a
text diff on tokens instead of lines (granted, some also do an
inside-out matching and such).
The ssddiff approach could (not the current prototype, but a
straightforward extension of it) also match data in two different XML
formats, or RDF data, or even diff XML against RDF data.
It doesn't rely on the document being a tree structure.
(For best results, the user has to specify which relations in the
document contain relevant information.)

To the original questions:

- is a diff result computed on the ordered or unordered xml node tree of
the compared documents?

If you specify that the following-sibling::* relation is part of the
structural information, matching respects the ordering of
children. Note that in XML the ordering is significant by default,
although in many applications it is not; there is no "ignore-ordering"
attribute in XML, so this has to be handled by applications.
You must also distinguish between ignoring the ordering when
calculating the diff and ignoring it when calculating the edit script. By
default ssddiff will ignore the ordering of children (unless you give an
appropriate XPath); the XUpdate and merge output writers, however, will
try to reconstruct the original ordering.
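
As a rough illustration of the ordered/unordered distinction (not of how ssddiff does it internally), the following Python sketch compares two documents once with sibling order respected and once with it ignored. The `canonical` helper is an assumption made up for this example, and it deliberately ignores tail text and mixed content.

```python
import xml.etree.ElementTree as ET


def canonical(elem, ordered):
    """Nested tuple describing elem; child order is dropped when ordered=False."""
    children = [canonical(child, ordered) for child in elem]
    if not ordered:
        children.sort()  # sibling order no longer influences the result
    attrs = tuple(sorted(elem.attrib.items()))
    return (elem.tag, attrs, (elem.text or "").strip(), tuple(children))


a = ET.fromstring("<list><item>1</item><item>2</item></list>")
b = ET.fromstring("<list><item>2</item><item>1</item></list>")

same_ordered = canonical(a, ordered=True) == canonical(b, ordered=True)
same_unordered = canonical(a, ordered=False) == canonical(b, ordered=False)
print(same_ordered, same_unordered)  # False True
```

The same pair of documents is "changed" or "unchanged" depending only on that one flag, so a tool has to document which of the two it assumes.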

- what identifiers/criteria should be used by default to match elements of
the same type in different documents?

This depends very much on your application. You can have data where there
are rarely ever changes within string parts, and you can have data
where XML is a mere container and all the differences happen inside the
string content.
The ssddiff approach isn't of much use in cases where the structure of
the documents is very restricted and fixed (such as relational
databases exported to XML); there the prototype is next to useless, since it
only supports string equality.
Other implementations might want to add substring or Levenshtein
matches.
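
A minimal sketch of what such a Levenshtein-based match could look like in Python is shown below. The `similar` helper and its 0.8 threshold are arbitrary assumptions chosen for the example, not values used by ssddiff or any other tool.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def similar(a, b, threshold=0.8):
    """Treat two text nodes as matching if they are close enough (assumed cutoff)."""
    longest = max(len(a), len(b))
    return longest == 0 or 1 - levenshtein(a, b) / longest >= threshold


print(levenshtein("kitten", "sitting"))        # 3
print(similar("Harry Potter", "Harry Poter"))  # True
print(similar("Harry Potter", "Moby Dick"))    # False
```

With plain string equality the misspelled title would count as a deletion plus an insertion; with a fuzzy match it becomes an update of the same node.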

- should a diff tool consider move operations or only insert/delete
(besides update)?

This depends a lot on what you allow as insert and delete operations,
and on who your users are. If you make diffs for patching XML files,
an insert/delete-only diff may be useful.
If the diff is to be read by humans, moves are very useful.
If you do not allow subtree insertions and deletions (note that if I
allow subtree insertions and deletions, I can replace a whole document in just
two operations!), a move can save a lot of resources.
In merges, moves are also much more useful (when detected correctly),
but they can also cause confusion (when an independent deletion and
insertion are incorrectly combined into a move).
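
One simple way to get moves out of an insert/delete-only result is to pair up deletions and insertions of identical subtrees afterwards. The Python sketch below assumes a made-up edit representation of `(location, serialized_subtree)` tuples; `fold_moves` is a hypothetical helper for illustration and not the method used by any tool named in this thread.

```python
def fold_moves(deletes, inserts):
    """Pair each deleted subtree with an identical inserted one and call it a move.

    Returns (moves, remaining_deletes, remaining_inserts).
    """
    remaining_inserts = list(inserts)
    moves, remaining_deletes = [], []
    for source, subtree in deletes:
        target = next((ins for ins in remaining_inserts if ins[1] == subtree), None)
        if target is None:
            remaining_deletes.append((source, subtree))
        else:
            remaining_inserts.remove(target)
            moves.append((source, target[0], subtree))
    return moves, remaining_deletes, remaining_inserts


# Hypothetical edit lists; locations are written as XPath-like strings.
deletes = [("/root/node[1]", "<subnode>Text</subnode>"), ("/root/node[1]", "<old/>")]
inserts = [("/root/node[2]", "<subnode>Text</subnode>"), ("/root", "<new/>")]

moves, left_deletes, left_inserts = fold_moves(deletes, inserts)
print(moves)         # [('/root/node[1]', '/root/node[2]', '<subnode>Text</subnode>')]
print(left_deletes)  # [('/root/node[1]', '<old/>')]
print(left_inserts)  # [('/root', '<new/>')]
```

The sketch also shows where the confusion comes from: equal content is the only evidence used, so two unrelated edits that happen to involve the same subtree would be reported as one move.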

- is there any common default behavior that a user would expect from a xml
diff/merge tool?

It just works?
Well, it depends a lot on whether the intended audience for the file is
a human reader or not.
Have a look at the Harry Potter books example in the ssddiff
publication at eXtreme Markup Languages 2005 (available online),
together with slide 27 in
http://ssddiff.alioth.debian.org/tal...-languages.pdf
(the colors make it easier to read, but the explanation was mostly
given orally, so use the publication itself for the comments).
This is an example where there is no way for a computer to decide which
is best; it has to rely on additional information from the user (such as:
the ISBN of a book is its unique identifier and should never change).
Sometimes this could be derived from DTD/Schema information.
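
To show what "the user tells the tool that the ISBN is the identifier" could mean in practice, here is a small Python sketch. The `<books>`/`<book isbn="...">` layout, the fake ISBN values and the `diff_by_key` helper are all invented for this illustration; they are not the markup of the example in the publication, and the sketch assumes every element carries the key attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; the isbn attribute is the user-declared key.
OLD = """<books>
  <book isbn="111"><title>Book One</title></book>
  <book isbn="222"><title>Book Two</title></book>
</books>"""
NEW = """<books>
  <book isbn="222"><title>Book Two, 2nd ed.</title></book>
  <book isbn="333"><title>Book Three</title></book>
</books>"""


def content(elem):
    """Whitespace-insensitive description of an element and its subtree."""
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(), tuple(content(c) for c in elem))


def diff_by_key(old_root, new_root, key_attr):
    """Match children by a key attribute instead of by position or content."""
    old = {e.get(key_attr): e for e in old_root}
    new = {e.get(key_attr): e for e in new_root}
    deleted = sorted(old.keys() - new.keys())
    inserted = sorted(new.keys() - old.keys())
    updated = sorted(k for k in old.keys() & new.keys()
                     if content(old[k]) != content(new[k]))
    return deleted, inserted, updated


deleted, inserted, updated = diff_by_key(ET.fromstring(OLD), ET.fromstring(NEW), "isbn")
print(deleted, inserted, updated)  # ['111'] ['333'] ['222']
```

Without the key, a tool could just as well report that book 111 was renamed and book 222 replaced; with it, the retitled book is unambiguously an update.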

Regards,
Erich Schubert

Dec 30 '05 #5
