Comparing Files - quickly and definitively

I would appreciate some recommendations for programmatically determining if
files differ.

I'm writing a utility that backs up files that customers upload to Web
sites. Rather than mindlessly copying any/all files from each Web site to
the backup server (and wasting space), I'm looking to copy only files that
have been modified since the last backup took place. The files include
anything from PDF to GIF/JPG to XML, text, etc. Max size is currently under
5MB, but that could be increased later depending on customer demand.

I understand that I can look to the LastModified date or other file
properties, but I would prefer something more reliable. By "more reliable" I
mean this: I have noticed that the time can differ by a couple of seconds
after copying a file from one server to another. If the logic were to
compare using those date/times, we would expect "false positives" - files
that appear to be newer (different) based on Date/Time, but are in fact no
different. At least this scenario would happen if the logic looked to the
last backup (on the backup server) and compared against the current file on
a Web server.

So I'm thinking that there may be a more reliable way to determine if the
file content is actually different. While it would be a no-brainer to open
each file and compare the contents, that could be a rather costly
operation - given the large number of files to potentially compare, and
their potential large sizes.

So I'm looking for a reliable means through which to determine which files
have, in fact, been changed - and make that determination with fast
performance.

Suggestions? Ideas?

Thanks!

-S
Sep 22 '07 #1
Smithers wrote:
> [...]
> So I'm looking for a reliable means through which to determine which files
> have, in fact, been changed - and make that determination with fast
> performance.

Depends on your definition of "reliable". Many backup programs use only
the filename, size, and modified date to determine whether the file has
changed. Some even just use the archive bit. When they use these
things, they make sure that they copy not only the file but also the
file attributes they are checking. So if you are relying on the
modified date, for example, you'd have to copy the modified date too (I
know that the Windows Explorer does this when copying the files by hand).
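
As a rough illustration of the metadata-only approach, here is a minimal Python sketch. The function names `needs_backup` and `backup` are made up for this example, and the 2-second tolerance is an assumption chosen to absorb the "couple of seconds" drift described in the question. It compares size and modified time, and it copies with `shutil.copy2`, which tries to preserve the modified time along with the contents.

```python
import os
import shutil

def needs_backup(src, dst, tolerance_s=2.0):
    """Return True if src looks changed relative to dst, using metadata only."""
    if not os.path.exists(dst):
        return True  # never backed up
    s, d = os.stat(src), os.stat(dst)
    if s.st_size != d.st_size:
        return True  # a size mismatch means the contents must differ
    # Assumed tolerance: treat small timestamp differences as "unchanged".
    return abs(s.st_mtime - d.st_mtime) > tolerance_s

def backup(src, dst):
    # copy2 copies the contents and then attempts to preserve metadata,
    # including the modified time that needs_backup relies on.
    shutil.copy2(src, dst)
```

A same-size edit made within the tolerance window would be missed by this check, which is the sense in which it is only "good enough".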

But since these things aren't tied to the actual file contents, they
aren't 100% reliable, though they are often "good enough".

If you really want to know whether the file is different, you have to
compare it somehow. A common method would be to generate and store an
MD5 hash on the file, and then generate the same hash for the file that
is eligible for copying. If the hash is the same, don't copy.

Of course, you would check the file size first, since a size mismatch is a
quick way to know for sure that the files are different. :)
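
The size-then-hash check could look something like the following Python sketch. The names `md5_of_file` and `has_changed` are hypothetical, and where the size and hash from the last backup are stored (a database, a sidecar file) is left open, so they are simply passed in as arguments here. The file is read in chunks so that larger uploads never have to fit in memory at once.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_changed(path, stored_size, stored_md5):
    """Compare a file against the size and hash recorded at the last backup."""
    if os.path.getsize(path) != stored_size:
        return True  # different size: definitely different, skip hashing
    return md5_of_file(path) != stored_md5
```

Storing the hash at backup time means only the Web server's copy has to be read on later runs, rather than both copies.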

There is a theoretical possibility of hash collisions even using that
technique, so technically speaking it's not 100% reliable. But it's far
more reliable than looking just at date and file size, and is probably
good enough for almost any real-world application.
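
If even the theoretical collision risk matters, the only definitive check is to compare the bytes themselves. A minimal Python sketch, assuming both copies are readable from the machine doing the comparison (the function name `identical_contents` is made up for this example):

```python
import filecmp

def identical_contents(path_a, path_b):
    """Return True only if both files contain exactly the same bytes."""
    # shallow=False makes filecmp compare the actual contents instead of
    # trusting matching os.stat() signatures (file type, size, mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This reads both files in full, which is the cost the original question was hoping to avoid, so it makes more sense as an occasional confirmation than as the main check.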

Pete
Sep 22 '07 #2
Thanks Pete - hadn't thought about the hashing alternatives.

"Good enough" is a criterion I can live with on this. An occasional false
positive won't be the end of the world. It would simply mean that we archive
a file unnecessarily. No big deal. I think I'll go with a comparison of the
date/times after all, do a bunch of testing, and if there are very few false
positives, then we'll be done. We can go with more involved analyses and
possibly hashing if we need to tighten things up later.

-S
Sep 23 '07 #3
