Identify File Difference

I've been tasked with developing a document/file versioning system of sorts,
similar to a very scaled-down source control system. We're integrating it
very closely with an existing application, so it must be 100% home-grown.
Like many source control systems, I'd like to save only the differences
from the original in each version and then reassemble those differences
when the latest version is requested. I have some vague ideas, but I thought
I'd get some other opinions. What makes it a little more complex is that I
also have to apply this to files that may contain binary data. Can anyone
point me in the direction of some useful framework classes or hints? Thanks!

Josh
Nov 16 '05 #1
You can generate a hash of each file and compare the two hash values.
http://www.dotnetspider.com/technology/KBPages/397.aspx

http://dotnetjunkies.com/WebLog/darr.../20/19820.aspx
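
As a rough illustration of the hash comparison suggested here, the sketch below uses Python's standard hashlib module (the thread itself contains no code, so the language is chosen only for illustration). The function names file_digest and files_differ and the chunk size are assumptions made for this example, not anything from the linked articles. It reads each file in binary mode, so it applies equally to text and binary data.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode: works for any file content
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def files_differ(path_a, path_b):
    """True if the two files have different digests (hypothetical helper)."""
    return file_digest(path_a) != file_digest(path_b)
```

A matching digest only says the contents are (almost certainly) identical; it says nothing about where two differing files diverge.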

--
W.G. Ryan MVP (Windows Embedded)

TiBA Solutions
www.tibasolutions.com | www.devbuzz.com | www.knowdotnet.com
Nov 16 '05 #2
Thanks for the quick reply :)

This looks like a good starting point, although a hashing algorithm will
only tell me whether there is a difference, not what the differences are.
Nonetheless, it is a very useful check to run before an actual byte-by-byte
comparison. Are you aware of any algorithms that are helpful in this area?
I was initially considering a byte-for-byte comparison to identify the
changes (so that only the changes are stored), but I wasn't sure whether
there was a better way. I was about to check out the source code of some
open source WinDiff-type projects, but like most developers I'm looking
for some shortcuts :)
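
One possible shape for "store only the changes" is sketched below, assuming Python and its standard difflib.SequenceMatcher, which accepts bytes as well as text. The delta format (a list of "copy" and "data" operations) and the names make_delta and apply_delta are invented for this illustration. SequenceMatcher can be slow on large inputs, so treat this as a sketch for small files rather than a production binary differ.

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies of ranges from `old` plus literal new bytes."""
    sm = SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))   # bytes that exist only in new
        # "delete" needs no entry: those old bytes are simply never copied
    return delta

def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new version from the old bytes and a delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the "data" entries carry payload, so a small edit to a large file yields a small delta.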

Anyway, thanks again for the two links; any further advice you may have
would be appreciated.

Josh
Nov 16 '05 #3
Hi Josh,

Comparing differences between files, especially binary files, is an
interesting topic. It has been described in various places. I did a little
Google searching and found this excerpt:
>> GNU diff was written by Mike Haertel, David Hayes, Richard Stallman, Len
Tower, and Paul Eggert. Wayne Davison designed and implemented the unified
output format. The basic algorithm is described in "An O(ND) Difference
Algorithm and its Variations", Eugene W. Myers, Algorithmica Vol. 1 No. 2,
1986, pp. 251--266; and in "A File Comparison Program", Webb Miller and
Eugene W. Myers, Software--Practice and Experience Vol. 15 No. 11, 1985, pp.
1025--1040. The algorithm was independently discovered as described in
"Algorithms for Approximate String Matching", E. Ukkonen, Information and
Control Vol. 64, 1985, pp. 100--118. <<

The cited articles are probably a good place to start. Unfortunately, I
have not read them, so I cannot comment on the algorithm itself.
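
For readers who want a feel for the cited algorithm, here is a minimal Python sketch of the greedy forward search from the Myers O(ND) paper as it is commonly presented. It computes only the length of the shortest edit script (number of inserted plus deleted elements), not the script itself, and the function name myers_edit_distance is made up for this example. It works on any indexable sequence, including bytes.

```python
def myers_edit_distance(a, b):
    """Length of the shortest insert/delete edit script turning a into b."""
    n, m = len(a), len(b)
    v = {1: 0}  # v[k] = furthest x reached so far on diagonal k = x - y
    for d in range(n + m + 1):            # d = number of edits tried
        for k in range(-d, d + 1, 2):
            # Step down (insertion) or right (deletion), whichever gets further.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

For the classic pair "ABCABBA" and "CBABAC" this returns 5. A full diff tool would additionally record the path taken so the actual edits can be emitted.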

I will say one thing, though: modern source control systems do NOT store
the original file and then the differences needed to build newer versions.

Modern systems store the MOST RECENT file and the differences needed to
recreate previous versions (since 99% of the time you don't want the first
version; you want the most recent one).
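
The "latest in full, older versions as differences" layout could be sketched as follows. This is an assumption-laden illustration in Python: the class ReverseDeltaStore, its methods, and the copy/data delta format are all hypothetical, and difflib.SequenceMatcher stands in for whatever diff algorithm a real system would use. Each commit keeps the new content whole and stores a backward delta that rebuilds the previous version from it.

```python
from difflib import SequenceMatcher

def make_delta(src: bytes, dst: bytes):
    """Describe dst as copies of ranges from src plus literal bytes."""
    ops = SequenceMatcher(None, src, dst, autojunk=False).get_opcodes()
    delta = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", dst[j1:j2]))
    return delta

def apply_delta(src: bytes, delta) -> bytes:
    out = bytearray()
    for op in delta:
        out += src[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

class ReverseDeltaStore:
    """Keeps the newest version whole; older ones as backward deltas."""
    def __init__(self):
        self.latest = None
        self.back_deltas = []  # back_deltas[i] turns version i+1 into version i

    def commit(self, data: bytes) -> int:
        if self.latest is not None:
            # Delta that rebuilds the OLD content from the NEW content.
            self.back_deltas.append(make_delta(data, self.latest))
        self.latest = data
        return len(self.back_deltas)  # zero-based version number

    def get(self, version: int) -> bytes:
        if self.latest is None or not 0 <= version <= len(self.back_deltas):
            raise IndexError("no such version")
        data = self.latest
        # Walk backwards from the newest version to the requested one.
        for i in range(len(self.back_deltas) - 1, version - 1, -1):
            data = apply_delta(data, self.back_deltas[i])
        return data
```

Fetching the newest version applies no deltas at all; the cost of reconstruction grows only for older versions.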

Also, given the low cost of hard drive space, you may want to consider
keeping the entire contents of each version of each file and compressing
old versions to save space.
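
A minimal sketch of the "keep every version whole, but compressed" alternative, assuming Python's standard zlib module. The class name CompressedVersionStore and its in-memory list are illustrative assumptions; a real system would persist the compressed blobs to disk or a database, and might leave the newest version uncompressed for fast access.

```python
import zlib

class CompressedVersionStore:
    """Stores every version in full, each one zlib-compressed."""
    def __init__(self):
        self._versions = []  # compressed blobs, index = version number

    def commit(self, data: bytes) -> int:
        self._versions.append(zlib.compress(data, 9))  # 9 = best compression
        return len(self._versions) - 1

    def get(self, version: int) -> bytes:
        return zlib.decompress(self._versions[version])

    def stored_size(self) -> int:
        """Total bytes held across all compressed versions."""
        return sum(len(blob) for blob in self._versions)
```

There is no diffing or reassembly logic to get wrong here, which is the main appeal; the trade-off is that files which are already compressed (many image and archive formats) will not shrink much.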

One more thing to look at: if you have Windows Server 2003, you can
download Windows SharePoint Services for free. It gives you simple
document management capabilities, including the ability to set up a
virtual "folder" tree containing "files" for which you can store every
version. It's pretty nice, and because it's free, you would avoid most
licensing issues. That's the upside. The downside: it only runs on
Windows Server 2003. If your customers cannot upgrade their OS, then it
can't be used as your back end. Still, it's worth considering, if for no
other reason than to simply Write Less Code.

Good Luck. I hope this helps,

--- Nick

Nov 16 '05 #4
Nick,

I actually downloaded the WinDiff code, so I'm going to take a look at that,
but GNU diff looks interesting also. Luckily I'm comfortable enough with C
and C++ to pull the algorithms I need out of the code, so hopefully that
will be useful. We're also not looking at some of the other features of
most source control systems (branching, merging, etc.), so I'm hoping to
keep it simple. Because of some of the unique aspects of what we are tying
it into, using a product like SharePoint isn't possible. However, you make
a very good point about alternative mechanisms. I had done some googling and
found some references about storing only differences, so, not knowing any
better, I started to pursue that route. Your point that most modern systems
store the current version in full and keep only the differences for
historical versions is very valuable information and makes more sense. I
had originally thought of storing the complete versions but discounted it
over space concerns. I'm now actually heading in that direction, mainly to
reduce complexity, and as you say, some compression (which I've used with
some remoting sinks in the past) makes it very feasible.

Thanks for your replies.

Josh
Nov 16 '05 #5
