Identify File Difference

I've been tasked with developing a document/file versioning system of sorts,
similar to a very scaled-down source control system. We're integrating it
very closely with an existing application, so it must be 100% homegrown.
Like many source control systems, I'd like to save only the differences from
the original in each version and then reassemble those differences when the
latest version is requested. I have some vague ideas, but I thought I'd get
some others' opinions. What makes it a little more complex is that I have to
apply this to files that may contain binary data. Can anyone point me in the
direction of some useful framework classes or hints? Thanks!

Josh
Nov 16 '05 #1
You can generate a hash of each file and compare the two hash values.
http://www.dotnetspider.com/technology/KBPages/397.aspx

http://dotnetjunkies.com/WebLog/darr.../20/19820.aspx
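
For what it's worth, the hash check can be sketched in a few lines. This is a minimal illustration in Python rather than .NET, assuming SHA-256 as the hash; the function names `file_digest` and `files_differ` are made up for this example and do not come from the linked articles.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode, so any file content works
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_differ(path_a, path_b):
    """True if the two files hash differently (hypothetical helper)."""
    return file_digest(path_a) != file_digest(path_b)
```

Matching digests only say the contents are almost certainly identical; differing digests say nothing about where the change is.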

--
W.G. Ryan MVP (Windows Embedded)

TiBA Solutions
www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
Nov 16 '05 #2
Thanks for the quick reply :)

This looks like a good starting point, although a hashing algorithm will
only tell me whether there is a difference, not what the differences are.
Nonetheless, it is a very useful check to run before an actual byte-by-byte
comparison. Are you aware of any algorithms that are helpful in this area? I
was initially considering a byte-for-byte comparison to identify the changes
(so only the changes are stored), but I wasn't sure if there was a better
way. I was about to check out the source code of some open source
windiff-type projects, but like most developers I'm looking for some
shortcuts :)
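
To make the byte-for-byte idea concrete, here is a rough sketch in Python (not .NET) that compares two byte strings position by position and reports the ranges that differ. The name `differing_ranges` is hypothetical, and the approach is the naive one described above, not a real diff algorithm.

```python
def differing_ranges(old, new):
    """Return half-open (start, end) ranges where old and new differ,
    comparing the bytes at the same offset in each input."""
    ranges = []
    start = None  # start of the current run of differing bytes, if any
    common = min(len(old), len(new))
    longest = max(len(old), len(new))
    for i in range(common):
        if old[i] != new[i]:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        # The run reaches the end of the shorter input; any extra tail joins it.
        ranges.append((start, longest))
    elif common != longest:
        # Inputs agree up to the shorter length; the extra tail is the difference.
        ranges.append((common, longest))
    return ranges
```

This works for in-place edits, but a single inserted byte shifts everything after it, so `differing_ranges(b"abc", b"Xabc")` flags the whole input. That weakness is the reason to look at a proper diff algorithm.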

Anyway, thanks again for the two links; any further advice you may have
would be appreciated.

Josh
Nov 16 '05 #3
Hi Josh,

Comparing differences between files, especially binary files, is an
interesting topic. It has been described in various places. I did a little
Google searching and found this excerpt:
>> GNU diff was written by Mike Haertel, David Hayes, Richard Stallman, Len
Tower, and Paul Eggert. Wayne Davison designed and implemented the unified
output format. The basic algorithm is described in "An O(ND) Difference
Algorithm and its Variations", Eugene W. Myers, Algorithmica Vol. 1 No. 2,
1986, pp. 251--266; and in "A File Comparison Program", Webb Miller and
Eugene W. Myers, Software--Practice and Experience Vol. 15 No. 11, 1985, pp.
1025--1040. The algorithm was independently discovered as described in
"Algorithms for Approximate String Matching", E. Ukkonen, Information and
Control Vol. 64, 1985, pp. 100--118. <<

The cited articles are probably a good place to start. Unfortunately, I
have not read them, so I cannot comment on the algorithm itself.
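
As a rough illustration of the greedy idea in the Myers paper cited above, here is a sketch in Python that computes only the length of the shortest edit script (inserted plus deleted elements) between two byte strings. The function name `shortest_edit_length` is hypothetical, and this is a simplified reading of the published algorithm, not the GNU diff implementation.

```python
def shortest_edit_length(a, b):
    """Number of single-element insertions and deletions needed to turn
    a into b, found with the greedy O(ND) search described by Myers."""
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: take an element from b
            else:
                x = v[k - 1] + 1    # step right: drop an element from a
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

Because it compares elements only for equality, it works on bytes as well as text. Recovering the actual edit script takes extra bookkeeping that is left out here.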

I will say one thing though: modern source control systems do NOT store the
original file and then store the differences needed to get newer versions.

Modern systems store the MOST RECENT file and store the differences needed
to recreate previous versions (since 99% of the time, you don't want the
first version... you want the most recent one).
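
A minimal sketch of that "latest in full, older versions as differences" layout, in Python rather than .NET. It leans on the standard library's `difflib.SequenceMatcher` to find matching runs; the delta format and the names `make_delta`, `apply_delta` and `ReverseDeltaStore` are invented for this example and do not describe any particular product.

```python
from difflib import SequenceMatcher


def make_delta(source, target):
    """List of operations that rebuild target from source."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse bytes from source
        elif j2 > j1:
            ops.append(("data", target[j1:j2]))   # bytes only target has
        # a pure "delete" contributes nothing to the target
    return ops


def apply_delta(source, ops):
    """Rebuild the target bytes from source and a delta from make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += source[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class ReverseDeltaStore:
    """Keeps the newest version whole and older versions as reverse deltas."""

    def __init__(self, first_version):
        self.latest = first_version
        self.reverse_deltas = []  # entry i rebuilds version i from version i + 1

    def commit(self, new_version):
        self.reverse_deltas.append(make_delta(new_version, self.latest))
        self.latest = new_version

    def get(self, index):
        data = self.latest  # the common case needs no delta work at all
        for i in range(len(self.reverse_deltas) - 1, index - 1, -1):
            data = apply_delta(data, self.reverse_deltas[i])
        return data
```

Fetching the newest version is a plain read; only older versions pay for delta application. `SequenceMatcher` can be slow on large inputs, so treat this as a way to see the data flow rather than something to ship.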

Also, given the low cost of hard drive space, you may want to consider
keeping the entire contents of each version of each file and compressing
the old versions to save space.
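
The "keep everything, just compress it" option is short enough to sketch. This is an illustration in Python using the standard `zlib` module, with an in-memory list standing in for whatever storage the application really uses; the class name `CompressedVersionStore` is hypothetical.

```python
import zlib


class CompressedVersionStore:
    """Stores every version in full, each one compressed independently."""

    def __init__(self):
        self._versions = []  # compressed blobs, oldest first

    def add(self, data):
        """Compress and store a version; return its version number."""
        self._versions.append(zlib.compress(data, 9))  # 9 = best compression
        return len(self._versions) - 1

    def get(self, number):
        """Return the original bytes of the given version."""
        return zlib.decompress(self._versions[number])
```

Every version can be read without touching any other, which is the main simplification over storing differences. How much space it saves depends on the data; files that are already compressed will barely shrink.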

One more thing to look at: if you have Windows Server 2003, you can
download, for free, Windows SharePoint Services. This system gives you
simple document management capabilities, including the ability to set up a
virtual "folder" tree that contains "files" where you can store every
version of any or all files. It's pretty nice, and because it's free, you
would avoid most licensing issues. That's the upside. The downside: it
only runs on Windows Server 2003. If your customers cannot upgrade their
OS, then this can't be used as your back end. Still, it's worth
considering, if for no other reason than to simply Write Less Code.

Good Luck. I hope this helps,

--- Nick
Nov 16 '05 #4
Nick,

I actually downloaded the windiff code, so I'm going to take a look at that,
but GNU diff looks interesting also. Luckily I'm comfortable enough with C
and C++ to get the algorithms I need out of the code, so hopefully that will
be useful. We're also not looking at some of the other features of most
source control systems (branching, merging, etc.), so I'm hoping to keep it
simple. Because of some of the unique aspects of what we are tying it into,
using a product like SharePoint isn't possible. However, you make a very good
point about alternative mechanisms. I had done some googling and found some
references about storing only differences, so, not knowing any better, I
started to pursue that route. Your statement that most modern systems store
the current version in full and keep differences only for historical versions
is very valuable information and makes more sense. I had originally thought
of storing the complete versions but discounted it over space concerns. I'm
now heading in that direction, mainly to reduce complexity, and, as you say,
some compression (which I've used with some remoting sinks in the past) makes
it very feasible.

Thanks for your replies.

Josh
Nov 16 '05 #5
