XML diff/merge standard

Hello!

I'm preparing my master's thesis on an XML merge tool implementation and was
wondering whether there is any open standard for XML diff covering topics like:

- is a diff result computed on the ordered or unordered XML node tree of
the compared documents?
- what identifiers/criteria should be used by default to match elements of
the same type in different documents?
- should a diff tool consider move operations or only insert/delete
(besides update)?
- is there any common default behavior that a user would expect from an XML
diff/merge tool?

I've searched the web (especially the W3C site) and newsgroups but couldn't
find anything useful (or I was too blind). I was told this topic is a common
one in this newsgroup, but I didn't find anything helpful here either.

There are a lot of diff programs and papers on the web, but none of them
seems to reference or follow any standard. There is no question that most
tools have to be configured properly by the user to fit the particular
structure and semantics of one's own XML files in order to yield optimal
results. But is there a standard behavior (like sorting a list in
alphabetical order, say) that one might expect? There are a lot of XML files
out there, so there must be users of XML diff/merge tools who expect some
specific behavior from these tools, mustn't there? So has anyone taken the
time to write a draft or something similar on how to handle XML diffs/merges
by default?

I'm not asking for help with implementing such a tool or for a list of
algorithms; I'm looking for recommendations on how such a tool should behave
in its default setting, before any user preferences are taken into account.

Any pointers to some (official?) documents/websites/books/groups/conferences
would be nice. Thanks!
See you later,
Andreas
P.S.: Excuse my unqualified use of OE as newsreader, but I'm constrained to
it here. Oh, and this posting is purely a private matter; it does not
represent the owner of my sender domain in any way :)
Dec 21 '05 #1
Hi,
I'm preparing my master's thesis on an XML merge tool implementation and was
wondering whether there is any open standard for XML diff covering topics like:
[...Snip...]

As these topics really depend on what the user needs and on what the diff
software chooses to implement, I cannot see how they could be standardized.
(But some standards for the output format exist: XDL, used by Microsoft Diff
& Patch, and DUL, used by diffxml.)

There are a lot of diff programs and papers on the web, but none of them
seems to reference or follow any standard. There is no question that most
tools have to be configured properly by the user to fit the particular
structure and semantics of one's own XML files in order to yield optimal
results.


Most of the diff tools I have seen are quite generic and do not depend
on the XML structure of the files. But the user has to choose the right
tools/options (whitespace, ordering, ...) for his needs. It is much the
same with the classic "diff" tool, which has many options.

Hth,

--
Rémi Peyronnet
Dec 21 '05 #2
Hello!

"Rémi Peyronnet" <re*******@via. ecp.fr> wrote:
As these topics really depend on what the user needs and on what the diff
software chooses to implement, I cannot see how they could be standardized.

Surely things like the ordering of elements or which attributes should be
regarded as IDs depend heavily on the user's documents. But aren't there any
assumptions about, e.g., whether elements may have moved between subtrees or
have simply been deleted/inserted, or whether an element matches one from
the other document because it has the same attribute or because it has the
same child nodes? I tried some free XML diff tools and often got different
results from them without preconfiguring them. And in some cases I really
wondered why the diff result was what it was and not something else.
I think it would be nice to have some common default behavior one could
expect, on which to base one's own configuration decisions. How can I know
how to set my preferences if I first have to play around with a tool to find
out its normal mode of operation?

Let's look at a short example:

Doc1:

<root>
<node>
<subnode>
Text
</subnode>
</node>
<node foo="bar">
Text
</node>
</root>

----------------------

Doc2:

<root>
<node foo="bar">
<subnode>
Text
</subnode>
</node>
<node>
Text
</node>
</root>

So which two elements from the two documents are considered to match? Is
it the first node of each document (and likewise the second), because
they contain a similar subtree of (sub-/text-)nodes? Or is it the second
node of Doc1 and the first node of Doc2, because they carry the same
attribute (with the same value, perhaps even declared as an ID in some
schema)? And if the order of the node elements has been reversed, was the
subnode part then moved from the non-foo node to the foo node (meaning it is
in fact the same subnode element and not just a similar one)? Or was it
deleted from one node and inserted into the other (is there any case
where that matters)?
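
The ambiguity can be reproduced with a minimal sketch, assuming Python's
standard xml.etree.ElementTree. The helper names shape and match_by are made
up for illustration and are not taken from any existing diff tool, and the
two documents are compacted versions of Doc1 and Doc2 (with the subnode
closed properly). Matching by subtree structure and matching by the foo
attribute give opposite pairings:

```python
import xml.etree.ElementTree as ET

DOC1 = """<root>
<node><subnode>Text</subnode></node>
<node foo="bar">Text</node>
</root>"""

DOC2 = """<root>
<node foo="bar"><subnode>Text</subnode></node>
<node>Text</node>
</root>"""


def shape(elem):
    """Structural signature: tag, stripped text, child signatures (attributes ignored)."""
    return (elem.tag, (elem.text or "").strip(), tuple(shape(c) for c in elem))


def match_by(key, left, right):
    """Pair each left element with the first unused right element that has the same key."""
    pairs, used = [], set()
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if j not in used and key(a) == key(b):
                pairs.append((i, j))
                used.add(j)
                break
    return pairs


nodes1 = list(ET.fromstring(DOC1))
nodes2 = list(ET.fromstring(DOC2))

# Strategy 1: elements match when their subtrees look alike.
by_structure = match_by(shape, nodes1, nodes2)

# Strategy 2: elements match when the (assumed ID-like) attribute "foo" agrees;
# a missing attribute on both sides also counts as agreement here.
by_attribute = match_by(lambda e: e.get("foo"), nodes1, nodes2)

print(by_structure)  # [(0, 0), (1, 1)]
print(by_attribute)  # [(0, 1), (1, 0)]
```

Neither pairing is wrong as such; which one a tool picks without
configuration is exactly the default behavior in question.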

This is a really simple and straightforward example; I'm sure one could
construct more complex structures that show ambiguities which can't be
resolved as easily.

(But some standards for the output format exist: XDL, used by Microsoft Diff
& Patch, and DUL, used by diffxml.)

And at least a handful of others :)

Most of the diff tools I have seen are quite generic and do not depend
on the XML structure of the files. But the user has to choose the right
tools/options (whitespace, ordering, ...) for his needs. It is much the
same with the classic "diff" tool, which has many options.

Ok, that's right.
Thanks!

See you later,
Andreas
Dec 22 '05 #3
Hi Andreas,

Surely things like the ordering of elements or which attributes should be
regarded as IDs depend heavily on the user's documents. But aren't there any
assumptions about, e.g., whether elements may have moved between subtrees or
have simply been deleted/inserted, or whether an element matches one from
the other document because it has the same attribute or because it has the
same child nodes? I tried some free XML diff tools and often got different
results from them without preconfiguring them.

Of course it would be possible to define a default behaviour for an XML
diff tool, but I cannot see what it would improve:

To my mind, what led authors to write so many different free XML diff tools
is that they did not find an existing one that matched their needs; there
are too many ways to diff XML files, depending on what you need:
- a systematic diff (as done by Microsoft's tool and xmldiff), which
reports additions, deletions, ... This is quite useful as input to other
software, but hardly usable directly by a human.
- a "minimal differences" diff (as done by ssddiff), useful for text
editing (XHTML & co).
- a diff specialized for large XML files with the same structure (as
done by libxmldiff).

The three respond to different needs and therefore implement
completely different algorithms. If you define a standard behaviour,
that implies at least that all implementations can cope with this
behaviour, i.e. that they implement the standard algorithm.

The dream would be for every implementation to implement all possible
algorithms for every need a user could have, but well, that
seems somewhat unrealistic :-)

And in some cases I really
wondered why the diff result was what it was and not something else.
How can I know how to set my preferences if I first have to play around with
a tool to find out its normal mode of operation?

Read the documentation? :-)

Let's look at a short example: [Snip]

This is a really simple and straightforward example; I'm sure one could
construct more complex structures that show ambiguities which can't be
resolved as easily.

Quite true; that shows that XML diff involves many more user-dependent
kinds of needs than a classic diff.

--
Rémi Peyronnet
Dec 22 '05 #4
Hi,
Note that I'm the author of ssddiff, so my opinion is biased.

- a "minimal differences" diff (as done by ssddiff), useful for text
editing (XHTML & co).

Other tools also try to produce minimal differences, except that an
approximation is usually preferable for speed and memory reasons. The
ssddiff prototype also has an option for that, but it isn't really
optimized yet (I still have some ideas for improving quality in the fast
mode).
ssddiff also has three different output formats: one is basically a
merge; one is an XUpdate diff script (similar to what other XML diff
applications produce, but in a rather well-defined standard format, whereas
many XML diff applications use a proprietary format, sometimes not even XML
itself, that may or may not have actual patch applications for it); and the
third labels both source documents so that a third party can
further process the "best match", e.g. to produce another output format.

ssddiff is NOT optimal for XHTML, because changes in XHTML are often
within "strings" of the document (sorry for using the programmer's
vocabulary and not the XML vocabulary: cdata). On the other hand, it
can work very well in some cases of XHTML, because this format has a
very flexible structure (and the structure is an important part of the
information), whereas in an address book the structure is next to
meaningless and, except for the ordering, a pure classic LCS diff does
the job just fine.
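
To illustrate the address book case, here is a minimal sketch that flattens
the entries into a sequence of strings and diffs that sequence. It assumes
Python's standard difflib, whose SequenceMatcher uses a longest-matching-block
heuristic rather than a strict LCS, and the address book data and the
function names are made up for illustration:

```python
import difflib
import xml.etree.ElementTree as ET

OLD = "<book><entry>Alice</entry><entry>Bob</entry><entry>Carol</entry></book>"
NEW = "<book><entry>Alice</entry><entry>Carol</entry><entry>Dave</entry></book>"


def entries(xml_text):
    """Flatten the document into one string token per child element."""
    return [ET.tostring(e, encoding="unicode") for e in ET.fromstring(xml_text)]


def sequence_diff(old, new):
    """Return (operation, tokens) pairs for everything that is not unchanged."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            ops.append(("delete", old[i1:i2]))
        if tag in ("insert", "replace"):
            ops.append(("insert", new[j1:j2]))
    return ops


script = sequence_diff(entries(OLD), entries(NEW))
print(script)
# [('delete', ['<entry>Bob</entry>']), ('insert', ['<entry>Dave</entry>'])]
```

Because the entries are compared as opaque strings, any change inside an
entry shows up as a delete plus an insert of the whole entry, and reordered
entries are reported as changes as well.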

So the big difference with ssddiff is that it tries to do a *semantic*
diff on the structure, whereas other diff applications basically do a
text diff on tokens instead of lines. (Granted, some also do an
inside-out matching and such.)
The ssddiff approach could (not the current prototype, but a
straightforward extension of it) also match data in two different XML
formats, or RDF data. Or even diff XML and RDF data.
It doesn't rely on the document being a tree structure.
(For best results, the user has to specify which relations in the
document contain relevant information.)

To the original questions:

- is a diff result computed on the ordered or unordered XML node tree of
the compared documents?

If you specify that the following-sibling::* relation is part of the
structural information, matching respects the ordering of
children. Note that in XML the ordering is significant by default,
although in many applications it is not; there is no "ignore-ordering"
attribute in XML, so this has to be handled by applications.
You must also distinguish between ignoring the ordering when
calculating the diff and when calculating the edit script. By default
ssddiff will ignore the ordering of children (unless you give an
appropriate XPath); the XUpdate and merge output writers, however, will
try to reconstruct the original ordering.
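
The difference between an ordered and an unordered comparison can be
sketched as follows. This is only an illustration using Python's standard
xml.etree.ElementTree, not how ssddiff works internally; the signature
function and the sample documents are assumptions:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def signature(elem, ordered):
    """Hashable signature of a subtree; child order matters only if `ordered` is true."""
    kids = [signature(c, ordered) for c in elem]
    if ordered:
        children = tuple(kids)
    else:
        # Multiset of children: same children, any order.
        children = frozenset(Counter(kids).items())
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(), children)


A = ET.fromstring("<list><item>1</item><item>2</item></list>")
B = ET.fromstring("<list><item>2</item><item>1</item></list>")

same_ordered = signature(A, ordered=True) == signature(B, ordered=True)
same_unordered = signature(A, ordered=False) == signature(B, ordered=False)

print(same_ordered)    # False
print(same_unordered)  # True
```

An application that treats sibling order as insignificant has to make that
choice itself, for example by comparing in the unordered mode shown here.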

- what identifiers/criteria should be used by default to match elements of
the same type in different documents?

This depends very much on your application. You can have data where there
are rarely ever changes within string parts, and you can have data
where XML is a mere container and all the differences occur inside it.
The ssddiff approach isn't of much use in cases where the structure of
the documents is very restricted and fixed (such as relational
databases exported to XML); there the prototype is next to useless, since
it only supports string equality.
Other implementations might want to add substring or Levenshtein
matches.
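
A Levenshtein match of the kind mentioned above could look like the
following minimal sketch. The similar helper and its threshold of 0.8 are
assumptions chosen for illustration, not values from any tool:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as matching if their normalized similarity reaches the threshold."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest >= threshold


print(levenshtein("kitten", "sitting"))        # 3
print(similar("Harry Potter", "Harry Poter"))  # True
print(similar("kitten", "sitting"))            # False
```

With such a predicate, two text nodes that differ only by a small typo can
still be matched and reported as an update instead of a delete plus insert.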

- should a diff tool consider move operations or only insert/delete
(besides update)?

This depends a lot on what you allow as insert and delete operations,
and on what your user base is. If you make diffs for patching XML files,
an insert/delete-only diff may be useful.
If the diff is to be read by humans, moves are very useful.
If you do not allow subtree insertions and deletions (note that if I
allow subtree insertions and deletions I can replace a document in just
two operations!), a move can save a lot of resources.
In merges, moves are also much more useful (when detected correctly),
but can also cause confusion (when an independent deletion and
insertion are incorrectly turned into a move).
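
One naive way to obtain moves is to post-process an insert/delete script and
fold a deletion and an insertion of the same subtree into a single move. The
sketch below assumes a hypothetical edit script made of
(operation, path, subtree) tuples; the format and the paths are invented for
illustration:

```python
def fold_moves(ops):
    """Rewrite matching delete/insert pairs of identical subtrees as move operations."""
    # Collect the source paths of all deleted subtrees, keyed by subtree content.
    deletes = {}
    for op, path, subtree in ops:
        if op == "delete":
            deletes.setdefault(subtree, []).append(path)

    result, consumed = [], set()
    for op, path, subtree in ops:
        if op == "insert" and deletes.get(subtree):
            src = deletes[subtree].pop(0)
            consumed.add((src, subtree))
            result.append(("move", src, path, subtree))
        else:
            result.append((op, path, subtree))

    # Drop the deletions that were folded into a move.
    return [r for r in result
            if not (r[0] == "delete" and (r[1], r[2]) in consumed)]


script = [
    ("delete", "/root/node[1]/subnode", "<subnode>Text</subnode>"),
    ("insert", "/root/node[2]", "<subnode>Text</subnode>"),
    ("delete", "/root/node[3]", "<extra/>"),
]
folded = fold_moves(script)
print(folded)
```

This heuristic also shows the risk mentioned above: it cannot tell a real
move from an independent deletion and insertion that happen to have the same
content.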

- is there any common default behavior that a user would expect from an XML
diff/merge tool?

It just works?
Well, it depends a lot on whether the intended audience for the file is
a human reader or not.
Have a look at the Harry Potter books example in the ssddiff
publication at eXtreme Markup Languages 2005 (available online)
together with slide 27 in
http://ssddiff.alioth.debian.org/tal...-languages.pdf
(the colors make it easier to read, but the explanation was mostly
given orally, so use the publication itself for the comments).
This is an example where there is no way for a computer to decide which
match is best; it has to rely on additional information from the user (such
as: the ISBN of a book is its unique identifier and should never change).
Sometimes this could be derived from DTD/Schema information.
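
How such user-supplied key information changes the result can be sketched
like this. The isbn attribute, the placeholder ISBN values, and the function
names are assumptions for illustration and are not taken from the
publication:

```python
import xml.etree.ElementTree as ET

OLD = """<books>
<book isbn="111"><title>First Title</title></book>
<book isbn="222"><title>Second Title</title></book>
</books>"""

NEW = """<books>
<book isbn="222"><title>Second Title (revised)</title></book>
<book isbn="333"><title>Third Title</title></book>
</books>"""


def index_by_key(xml_text, key="isbn"):
    """Map the value of the key attribute to its book element."""
    return {b.get(key): b for b in ET.fromstring(xml_text).iter("book")}


def keyed_diff(old_text, new_text, key="isbn"):
    """Match books by key only; a changed title under the same key is an update."""
    old, new = index_by_key(old_text, key), index_by_key(new_text, key)
    return {
        "deleted": sorted(old.keys() - new.keys()),
        "inserted": sorted(new.keys() - old.keys()),
        "updated": sorted(k for k in old.keys() & new.keys()
                          if old[k].findtext("title") != new[k].findtext("title")),
    }


report = keyed_diff(OLD, NEW)
print(report)
# {'deleted': ['111'], 'inserted': ['333'], 'updated': ['222']}
```

Without the key, a tool could just as well pair the books by position and
report two title changes plus two ISBN changes.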

Regards,
Erich Schubert

Dec 30 '05 #5
