By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
424,679 Members | 2,036 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 424,679 IT Pros & Developers. It's quick & easy.

[perl-python] a program to delete duplicate files

P: n/a
here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which file are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicate.
(including files in sub-directories) More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed the full paths of all other files that is a copy of A. These
duplicates's full paths will be prefixed with "rm " string. A empty
line follows a group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

order does not matter. (i.e. which file will not be "rm " does not
matter.)

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves not considered. That is, all files in dir1 are compared to
all files in dir2. (including subdirectories) And, only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dir. One is certain that there are no duplicates within each dir
themselves. (imagine that del_dup.pl has run on each already) Files in
dir1 has already been categorized into sub directories by human. So
that when there are duplicates among dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except files in later dir will have "rm "
first. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

the c and d will both have "rm " prefix for sure. (which one has "rm "
in dir2 does not matter) Note, although dir2 doesn't compare files
inside itself, but duplicates still may be implicitly found by indirect
comparison. i.e. a==c, b==c, therefore a==b, even though a and b are
never compared.
--------------------------

Write a Perl or Python version of the program.

a absolute requirement in this problem is to minimize the number of
comparison made between files. This is a part of the spec.

feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05 #1
Share this Question
Share on Google+
44 Replies


P: n/a
On 9 Mar 2005 04:56:13 -0800, rumours say that "Xah Lee" <xa*@xahlee.org> might
have written:
Write a Perl or Python version of the program.

a absolute requirement in this problem is to minimize the number of
comparison made between files. This is a part of the spec.


http://groups-beta.google.com/group/...8e292ec9adb82d

The whole thread is about finding duplicate files.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #2

P: n/a
On Wednesday 09 March 2005 06:56 am, Xah Lee wrote:
here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which file are
redundant copies.


For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.

#!/usr/bin/env python
# (C) 2003 Anansi Spaceworks
#---------------------------------------------------------------------------
# find_duplicates
"""
Utility to find duplicate files in a directory tree by
comparing their checksums.
"""
#---------------------------------------------------------------------------
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#---------------------------------------------------------------------------

import os, sys, md5, getopt
def file_walker(tbl, srcpath, files):
"""
Visit a path and collect data (including checksum) for files in it.
"""
for file in files:
filepath = os.path.join(srcpath, file)
if os.path.isfile(filepath):
chksum = md5.new(open(os.path.join(srcpath, file)).read()).digest()
if not tbl.has_key(chksum): tbl[chksum]=[]
tbl[chksum].append(filepath)

def find_duplicates(treeroot, tbl=None):
"""
Find duplicate files in directory.
"""
dup = {}
if tbl is None: tbl = {}
os.path.walk(treeroot, file_walker, tbl)
for k,v in tbl.items():
if len(v) > 1:
dup[k] = v
return dup

usage = """
USAGE: find_duplicates <options> [<path ...]

Find duplicate files (by matching md5 checksums) in a
collection of paths (defaults to the current directory).

Note that the order of the paths searched will be retained
in the resulting duplicate file lists. This can be used
with --exec and --index to automate handling.

Options:
-h, -H, --help
Print this help.

-q, --quiet
Don't print normal report.

-x, --exec=<command string>
Python-formatted command string to act on the indexed
duplicate in each duplicate group found. E.g. try
--exec="ls %s"

-n, --index=<index into duplicates>
Which in a series of duplicates to use. Begins with '1'.
Default is '1' (i.e. the first file listed).

Example:
You've copied many files from path ./A into path ./B. You want
to delete all the ones you've processed already, but not
delete anything else:

% find_duplicates -q --exec="rm %s" --index=1 ./A ./B
"""

def main():
action = None
quiet = 0
index = 1
dup = {}

opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:',
['quiet', 'help', 'exec=', 'index='])

for opt, val in opts:
if opt in ('-h', '-H', '--help'):
print usage
sys.exit()
elif opt in ('-x', '--exec'):
action = str(val)
elif opt in ('-n', '--index'):
index = int(val)
elif opt in ('-q', '--quiet'):
quiet = 1

if len(args)==0:
dup = find_duplicates('.')
else:
tbl = {}
for arg in args:
dup = find_duplicates(arg, tbl=tbl)

for k, v in dup.items():
if not quiet:
print "Duplicates:"
for f in v: print "\t%s" % f
if action:
os.system(action % v[index-1])

if __name__=='__main__':
main()

--
--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks http://www.anansispaceworks.com

Jul 18 '05 #3

P: n/a
I wrote something similar, have a look at
http://www.homepages.lu/pu/fdups.html.
Jul 18 '05 #4

P: n/a
On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
<ha*****@anansispaceworks.com> might have written:
For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.


The drawback of this method is that you have to read everything. For example,
if you have ten files less than 100KiB each and one file more than 2 GiB in
size, there is no need to read the 2 GiB file, is there?

If it's a one-shot attempt, I guess it won't mind a lot.

On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #5

P: n/a
P
I've written a python GUI wrapper around some shell scripts:
http://www.pixelbeat.org/fslint/

the shell script logic is essentially:

exclude hard linked files
only include files where there are more than 1 with the same size
print files with matching md5sum

Pádraig.
Jul 18 '05 #6

P: n/a
On Thu, 10 Mar 2005 10:54:05 +0100, rumours say that Patrick Useldinger
<pu*********@gmail.com> might have written:
I wrote something similar, have a look at
http://www.homepages.lu/pu/fdups.html.


That's fast and good.

A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).

Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #7

P: n/a
Christos TZOTZIOY Georgiou wrote:
On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.


I then have a bug here - I consider all files with the same inode equal,
but according to what you say I need to consider the tuple
(st_dev,ST_ium). I'll have to fix that for 0.13.

Thanks ;-)
-pu
Jul 18 '05 #8

P: n/a
Christos TZOTZIOY Georgiou wrote:
That's fast and good.
Nice to hear.
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
I'll look into that.
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?


No. And even then, I wouldn't know how to find out.

-pu
Jul 18 '05 #9

P: n/a
In article <11**********************@l41g2000cwc.googlegroups .com>,
"Xah Lee" <xa*@xahlee.org> wrote:
a absolute requirement in this problem is to minimize the number of
comparison made between files. This is a part of the spec.


You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #10

P: n/a
David Eppstein wrote:
In article <11**********************@l41g2000cwc.googlegroups .com>,
"Xah Lee" <xa*@xahlee.org> wrote:
a absolute requirement in this problem is to minimize the number of
comparison made between files. This is a part of the spec.


You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.


I did it as follows (some time ago):

is filesize in hash?

calculate md5 (and store), if equal then compare
files.

store info in hash.

In some cases if might be faster to drop the md5 (since it reads all data)

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

Jul 18 '05 #11

P: n/a
On Fri, 11 Mar 2005 01:24:59 +0100, rumours say that Patrick Useldinger
<pu*********@gmail.com> might have written:
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?


No. And even then, I wouldn't know how to find out.


MSDN is our friend for Windows stuff.

http://msdn.microsoft.com/library/de...tehardlink.asp

and then

http://msdn.microsoft.com/library/de...onbyhandle.asp

http://msdn.microsoft.com/library/de...mation_str.asp

The relevant parts from this last page:

st_dev <-> dwVolumeSerialNumber

st_ino <-> (nFileIndexHigh, nFileIndexLow)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #12

P: n/a
On Fri, 11 Mar 2005 01:12:14 +0100, rumours say that Patrick Useldinger
<pu*********@gmail.com> might have written:
On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.
I then have a bug here - I consider all files with the same inode equal,
but according to what you say I need to consider the tuple
(st_dev,ST_ium). I'll have to fix that for 0.13.


I have a bug here too-- I wrote st_inum meaning st_ino, but that would be quick
to find!
Thanks ;-)


You are very welcome.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #13

P: n/a
Christos TZOTZIOY Georgiou wrote:
The relevant parts from this last page:
st_dev <-> dwVolumeSerialNumber
st_ino <-> (nFileIndexHigh, nFileIndexLow)


I see. But if I am not mistaken, that would mean that I
(1) had to detect NTFS volumes
(2) use non-standard libraries to find these information (like the
Python Win extentions).

I am not seriously motivated to do so, but if somebody is interested to
help, I am open to it.

-pu
Jul 18 '05 #14

P: n/a
David Eppstein wrote:
You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.


That's not very efficient. IMO, it only makes sense in network-based
operations such as rsync.

-pu
Jul 18 '05 #15

P: n/a
Christos TZOTZIOY Georgiou wrote:
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).


Changed.
Jul 18 '05 #16

P: n/a
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
You need do no comparisons between files. Just use a sufficiently
strong hash algorithm (SHA-256 maybe?) and compare the hashes.


That's not very efficient. IMO, it only makes sense in network-based
operations such as rsync.


Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.

More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).

I'm assuming of course that there are too many files and/or they're too
large just to keep them all in core.

Anyone have any data on whether reading files and SHA-256'ing them (or
whatever other cryptographic hash you think is strong enough) is
I/O-bound or CPU-bound? That is, is three reads with very little CPU
overhead really cheaper than one read with a strong hash?

I guess it also depends on the number of files you expect to have
duplicates of. If most of the files exist in only one copy, it's clear
that the cheap hash will find them more cheaply than the expensive hash.
In that case you could combine the (file size, cheap hash) filtering
with the expensive hash and get only two reads per copy rather than
three.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #17

P: n/a
On Thursday 10 March 2005 11:02 am, Christos "TZOTZIOY" Georgiou wrote:
On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
<ha*****@anansispaceworks.com> might have written:
For anyone interested in responding to the above, a starting
place might be this maintenance script I wrote for my own use. I don't
think it exactly matches the spec, but it addresses the problem. I wrote
this to clean up a large tree of image files once. The exact behavior
described requires the '--exec="ls %s"' option as mentioned in the help.


The drawback of this method is that you have to read everything. For example,
if you have ten files less than 100KiB each and one file more than 2 GiB in
size, there is no need to read the 2 GiB file, is there?

If it's a one-shot attempt, I guess it won't mind a lot.

On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.


Two good points, but hey, it ran fast enough for me. ;-)

Cheers,
Terry

--
--
Terry Hancock ( hancock at anansispaceworks.com )
Anansi Spaceworks http://www.anansispaceworks.com

Jul 18 '05 #18

P: n/a
David Eppstein wrote:
Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.
That's exactly what my program does.
More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
My point is : forget hashes. If you work with hashes, you do have to
read each file completely, cheap hash or not. My program normally reads
*at most* 100% of the files to analyse, but usually much less. Also, I
do plain comparisons which are much cheaper than hash calculations.
I'm assuming of course that there are too many files and/or they're too
large just to keep them all in core.
I assume that file handles are sufficient to keep one open per file of
the same size. This lead to trouble on Windows installations, but I
guess that's a parameter to change. On Linux, I never had the problem.

Regarding buffer size, I use a maxumim which is then split up between
all open files.
Anyone have any data on whether reading files and SHA-256'ing them (or
whatever other cryptographic hash you think is strong enough) is
I/O-bound or CPU-bound? That is, is three reads with very little CPU
overhead really cheaper than one read with a strong hash?
It also depends on the OS. I found that my program runs much slower on
Windows, probably due to the way Linux anticipates reads and tries to
reduce head movement.
I guess it also depends on the number of files you expect to have
duplicates of. If most of the files exist in only one copy, it's clear
that the cheap hash will find them more cheaply than the expensive hash.
In that case you could combine the (file size, cheap hash) filtering
with the expensive hash and get only two reads per copy rather than
three.


Sorry, but I can still not see a point tu use hashes. Maybe you'll have
a look at my program and tell me where a hash could be useful?

It's available at http://www.homepages.lu/pu/fdups.html.

Regards,
-pu
Jul 18 '05 #19

P: n/a
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.


That's exactly what my program does.


If you're doing any comparisons at all, you're not minimizing the number
of comparisons.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #20

P: n/a
On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:
More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).


The code I posted in another thread (and provided a link in this one) does
exactly that (a quick hash of the first few K before calculating the whole
file's md5 sum). However, Patrick's code is faster, reading only what's
necessary (he does what I intended to do, but I was too lazy-- I actually
rewrote from scratch one of the first programs I wrote in Python, which
obviously was too amateurish code for me to publish :)

It seems your objections are related to Xah Lee's specifications; I have no
objections to your objections (-:) other than that we are just trying to produce
something of practical value out of an otherwise doomed thread...
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #21

P: n/a
On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <ep******@ics.uci.edu> wrote:
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
> Well, but the spec didn't say efficiency was the primary criterion, it
> said minimizing the number of comparisons was.


That's exactly what my program does.


If you're doing any comparisons at all, you're not minimizing the number
of comparisons.

ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known), they will all have to be crunched to the last byte to
eliminate them from possible duplicate-set membership. If there is a-priori knowledge
of highly probable difference locations (e.g., internal date stamp at some offset etc)
then obviously the candidates can be pre-classified into smaller candidate sets by some
simple heuristic probes.

If differences are likely to appear early in a sequential scan, perhaps developing hashes
in parallel would be a good strategy. But I doubt if blindly pre-computing full hashes would be optimal.
Compare to sort|uniq as a sieve. You wouldn't want to read through any files any further than you had to.

Regards,
Bengt Richter
Jul 18 '05 #22

P: n/a

David Eppstein wrote:
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
Well, but the spec didn't say efficiency was the primary criterion, it said minimizing the number of comparisons was.
That's exactly what my program does.


If you're doing any comparisons at all, you're not minimizing the

number of comparisons.


Gah. You two should be lawyers. I don't know why you're arguing about
whatever you are arguing about.

Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

1. d <= S
2. Anyone have a usable hash function that's faster than string
comparison?
3. Anyone have an answer to PU's request for advice where his algorithm
could be improved by using a hash method?

Temporary injunction in favour of PU.

Jul 18 '05 #23

P: n/a
John Machin wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.

Just to explain why I appear to be a lawer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I found myself 2 possible reasons:

1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.

-pu
Jul 18 '05 #24

P: n/a
Patrick Useldinger wrote:
Just to explain why I appear to be a lawer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I found myself 2 possible reasons:

1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.

3) Probability is on your side. If two files match in length, but differ
in contents, a good hash will have probability (1 / max_hash) of
having a matching hash. For exactly two files of the same length,
calculating the hash is clearly a waste. For even three files of
the same length, you are most likely to find them distinct in three
comparisons. Using hashes, three file reads and three comparisons
of hash values. Without hashes, six file reads; you must read both
files to do a file comparison, so three comparisons is six files.
Naturally, as it grows beyond three, the deference grows drastically.

-Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #25

P: n/a
Scott David Daniels wrote:
comparisons. Using hashes, three file reads and three comparisons
of hash values. Without hashes, six file reads; you must read both
files to do a file comparison, so three comparisons is six files.


That's provided you compare always 2 files at a time. I compare n files
at a time, n being the number of files of the same size. That's quicker
than hashes because I have a fair chance of finding a difference before
the end of files. Otherwise, it's like hashes without computation and
without having to have a second go to *really* compare them.

-pu
Jul 18 '05 #26

P: n/a
[Patrick Useldinger]
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's
a large number of identical hashes, you'd still need to read all of
these files to make sure.


Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.

I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was by far, for all practical means, a waste.

Similar arguments apply, say, for the confidence we may have in
probabilistic algorithms for the determination of number primality.

--
François Pinard http://pinard.progiciels-bpi.ca
Jul 18 '05 #27

P: n/a
François Pinard wrote:
Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.
We're talking about md5, sha1 or similar. They are all known not to be
100% perfect. I agree it's a rare case, but still, why settle on
something "about right" when you can have "right"?
I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was by far, for all practical means, a waste.


It was not my only argument for not using hashed. My algorithm also does
less reads, for example.

-pu
Jul 18 '05 #28

P: n/a

Patrick Useldinger wrote:
John Machin wrote:
Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's

a large number of identical hashes, you'd still need to read all of these files to make sure.


Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".

1. You are ahead already even before adding any extra time for
comparison on files with equal hashes.

2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.

Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?
[1] Yes, I am aware of the subject of this thread.

Regards,

John

Jul 18 '05 #29

P: n/a
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".
Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.
Still, if you can get it 100% right automatically, why would you bother
checking manually? Why get back to argments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?

Anyway, fdups does not do anything else than report duplicates.
Deleting, hardlinking or anything else might be an option depending on
the context in which you use fdups, but then we'd have to discuss the
context. I never assumed any context, in order to keep it as universal
as possible.
Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
For the time being, the additional files will be ignored, and a warning
is issued. fdups does not quit, why are you saying this?

A fallback solution would be to open the file before every _block_ read,
and close it afterwards. In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles
in a multitasking environment.

Not difficult to implement, but I first wanted to refactor the code so
that it's a proper class that can be used in other Python programs, as
you also asked. That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I do changes by
(my own) priority.
You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly
less than 500 files of the same length.
Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?


Sorry, I do not understand what you mean by this.

-pu
Jul 18 '05 #30

P: n/a

Patrick Useldinger wrote:
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2) of identical hashes, you'd still need to *RE*-read and *compare* ....".

Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further,
nobody in their right mind [1] would contemplate automatically deleting n-1 out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with different names or in different-but-related directories with the same names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand out as implausible.


Still, if you can get it 100% right automatically, why would you

bother checking manually?
A human in their right mind is required to decide what to do with the
duplicates. The proponents of hashing -- of which I'm not one -- would
point out that any false-positives would be picked up as part of the
human scrutiny.
Why get back to argments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer - yes or no?
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.

Anyway, fdups does not do anything else than report duplicates.
Deleting, hardlinking or anything else might be an option depending on the context in which you use fdups, but then we'd have to discuss the context. I never assumed any context, in order to keep it as universal as possible.
That's very good, but it wasn't under contention.
Different subject: maximum number of files that can be open at once. I raised this issue with you because I had painful memories of having to work around max=20 years ago on MS-DOS and was aware that this magic number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to me compared to the likely number of files with the same size -- but you might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
For the time being, the additional files will be ignored, and a

warning is issued. fdups does not quit, why are you saying this?
I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.

A fallback solution would be to open the file before every _block_ read, and close it afterwards.
Ugh. Better use more memory, so less blocks!!
In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles in a multitasking environment.
The pythonic way is to press ahead optimistically and recover if you
get bad news.

Not difficult to implement, but I first wanted to refactor the code so that it's a proper class that can be used in other Python programs, as you also asked.
I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You had already a singleton class; why
another". What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I do changes by (my own) priority.
You wrote at some stage in this thread that (a) this caused problems on Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly
less than 500 files of the same length.


Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Re (b): famous last words! How long would it take you to do a test and announce the margin of safety that you have?


Sorry, I do not understand what you mean by this.


Test:
!for k in range(1000):
! open('foo' + str(k), 'w')
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."

Cheers,
John

Jul 18 '05 #31

P: n/a
John Machin wrote:
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.
Of course, but it's good to know that the computer is right, isn't it?
That leaves the human to take decisions instead of double-checking.
I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.
Bufferpool is a parameter, and the default values allow for 4096 files
of the same size. It's more likely to run out of file handles than out
of bufferspace, don't you think?
The pythonic way is to press ahead optimistically and recover if you
get bad news.
You're right, that's what I thought about afterwards. Current idea is to
design a second class that opens/closes/reads the files and handles the
situation independantly of the main class.
I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You had already a singleton class; why
another". What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
That is what I have submitted to you. Are you sure that *I* am the
lawyer here?
Re (a): what evidence do you have?


See ;-)
Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Nothing I started manually, but the usual bunch of local firewall, virus
scanner (not doing a complete machine check at that time).
Test:
!for k in range(1000):
! open('foo' + str(k), 'w')
I'll try that.
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."


I wouldn't count on that on a multi-tasking environment, as I said. The
class I described earlier seems a cleaner approach.

Regards,
-pu
Jul 18 '05 #32

P: n/a
John Machin wrote:
Test:
!for k in range(1000):
! open('foo' + str(k), 'w')


I ran that and watched it open 2 million files and going strong ...
until I figured that files are closed by Python immediately because
there's no reference to them ;-)

Here's my code:

#!/usr/bin/env python
import os
print 'max number of file handles today is',
n = 0
h = []
try:
while True:
filename = 'mfh' + str(n)
h.append((file(filename,'w'),filename))
n = n + 1
except:
print n
for handle, filename in h:
handle.close()
os.remove(filename)

On Slackware 10.1, this yields 1021.
On WinXPSP2, this yields 509.

-pu
Jul 18 '05 #33

P: n/a
In article <11**********************@g14g2000cwa.googlegroups .com>,
"John Machin" <sj******@lexicon.net> wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]


I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into a good duplicate detection code, and that any remaining tests that
would be performed are between files that are very likely to be the same
as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.

The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #34

P: n/a
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee.


When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #35

P: n/a
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.


I am not an expert in this field. All I know is that MD5 and SHA1 can
create collisions. Are there stronger algorithms that do not? And, more
importantly, has it been *proved* that they do not?

-pu

Jul 18 '05 #36

P: n/a
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.


If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always significantly less than m.

-pu
Jul 18 '05 #37

P: n/a
On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein <ep******@ics.uci.edu> wrote:
In article <11**********************@g14g2000cwa.googlegroups .com>,
"John Machin" <sj******@lexicon.net> wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]
I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into a good duplicate detection code, and that any remaining tests that
would be performed are between files that are very likely to be the same
as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.

What do you mean by "a comparison based method" ? Are you excluding the
possibility of parallel reading and comparing? The problem then becomes
verifying that m buffers containing corresponding segments of the files
are equal. You could do that by comparing hashes of the buffers (or updated
running hashes for the whole files so far read), or you could compare the
buffers in parallel byte by byte if that turns out efficient to implement
in the given language and os environment. Small buffers would probably
not be efficient for disk access, but too large might not be good either.
Maybe some automatic tuning algorithm could be developed to optimize
comparison operations in the shadow of i/o. But does the OS accept hints
about read-ahead buffering? Are the disks SCSI raid or USB external or
networked? Can automatic tuning achieve an optimum disregarding all that?

What is it that we're trying to optimize? Time to classify the
whole file set into groups of equivalent files? (note that there can be
multiple groups that are internally equal but uneqal to members of other groups).

The problem of limitation on the number of simultaneously open files could
be virtualized away, and only have impact when the limit is actually encountered.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different. So the worst case will be reading all through to the end once in parallel.
But pre-hashing guarantees the worst case for all -- unless disk access patterns
happen to make that the most efficient way to get everything read and compared
when most are equal.
The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.

ISTM 2(m-1) reads is not necessary, so "the question" to me doesn't involve that ;-)

Regards,
Bengt Richter
Jul 18 '05 #38

P: n/a
Patrick Useldinger <pu*********@gmail.com> writes:
John Machin wrote: [...]
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody

[...] Still, if you can get it 100% right automatically, why would you
bother checking manually? Why get back to argments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer
-
yes or no?

[...]

Well, as Francois pointed out, it is strictly not physically possible
to obtain a perfectly reliable answer, even if you *do* do the
comparison.

Even so, you're right on this point (though IIUC it's not practically
important ATM): regardless of wild flukes, people can deliberately
wangle files to get a hash collision. so comparison is better than
hashing from this PoV.
John
Jul 18 '05 #39

P: n/a
Patrick Useldinger <pu*********@gmail.com> writes:
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear
to be the same, requires 2(m-1) reads through the whole files if you
use a comparison based method, or m reads if you use a strong
hashing method. You can't hope to cut the reads off early when
using comparisons, because the files won't be different.


If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always significantly less than
m.


Hmm, Patrick's right, David, isn't he?

Except when m gets really BIG (say, M), in which case I suppose m is
no longer the worst case number of whole-file reads.

And I'm not sure what the trade off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).
John
Jul 18 '05 #40

P: n/a
Patrick Useldinger wrote:
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.


I am not an expert in this field. All I know is that MD5 and SHA1 can
create collisions. Are there stronger algorithms that do not? And, more
importantly, has it been *proved* that they do not?


I'm not an expert either, but I seem to remember reading recently
that, while it's been proven that it's possible for SHA1 to have
collisions, no actual collisions have been found. Even if that's not
completely correct, you're *far* more likely to be killed by a
meteorite than to stumble across a SHA1 collision. Heck, I'd expect
that it's more likely for civilization to be destroyed by a
dinosaur-killer-sized meteor.

With very few exceptions, if you're contorting yourself to avoid SHA1
hash collisions, then you should also be wearing meteor-proof (and
lightning-proof) armor everywhere you go. (Those few exceptions would
be cases where a malicious attacker stands to gain enough from
constructing a single hash collision to make it worthwhile to invest a
*large* number of petaflops of processing power.) Sure it's not "100%
perfect", but... how perfect do you *really* need?

Jeff Shannon

Jul 18 '05 #41

P: n/a
In article <87************@pobox.com>, jj*@pobox.com (John J. Lee)
wrote:
If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always significantly less than
m.
Hmm, Patrick's right, David, isn't he?


Yes, I was only considering pairwise comparisons. As he says,
simultaneously comparing all files in a group would avoid repeated reads
without the CPU overhead of a strong hash. Assuming you use a system
that allows you to have enough files open at once...
And I'm not sure what the trade off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).


Another interesting point.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #42

P: n/a
>> I'll post my version in a few days.
Have I missed something?
Where can I see your version?

Claudio
"Xah Lee" <xa*@xahlee.org> schrieb im Newsbeitrag
news:11**********************@l41g2000cwc.googlegr oups.com...
here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which file are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicate.
(including files in sub-directories) More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed the full paths of all other files that is a copy of A. These
duplicates's full paths will be prefixed with "rm " string. A empty
line follows a group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

order does not matter. (i.e. which file will not be "rm " does not
matter.)

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves not considered. That is, all files in dir1 are compared to
all files in dir2. (including subdirectories) And, only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dir. One is certain that there are no duplicates within each dir
themselves. (imagine that del_dup.pl has run on each already) Files in
dir1 has already been categorized into sub directories by human. So
that when there are duplicates among dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except files in later dir will have "rm "
first. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

the c and d will both have "rm " prefix for sure. (which one has "rm "
in dir2 does not matter) Note, although dir2 doesn't compare files
inside itself, but duplicates still may be implicitly found by indirect
comparison. i.e. a==c, b==c, therefore a==b, even though a and b are
never compared.
--------------------------

Write a Perl or Python version of the program.

a absolute requirement in this problem is to minimize the number of
comparison made between files. This is a part of the spec.

feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05 #43

P: n/a
Sorry i've been busy...

Here's the Perl code. I have yet to clean up the code and make it
compatible with the cleaned spec above. The code as it is performs the
same algorithm as the spec, just doesn't print the output as such. In a
few days, i'll post a clean version, and also a Python version, as well
a sample directory for testing purposes. (The Perl code has gone thru
many testings and is considered correct.)

The Perl code comes in 3 files as it is:

Combo114.pm
Genpair114.pm
del_dup.pl

The main program is del_dup.pl. Run it on the command line as by the
spec. If you want to actually delete the dup files, uncomment the
"unlink" line at the bottom. Note: the module names don't have any
significance.
Note: here's also these python files ready to go for the final python
version. Possibly the final propram should be just a single file...

Combo114.py
Genpair114.py
Here're the files: del_dup.zip
-----
to get the code and full detail with latest update, please see:
http://xahlee.org/perl-python/delete_dup_files.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html
Claudio Grondi wrote:
I'll post my version in a few days.

Have I missed something?
Where can I see your version?

Claudio


Jul 18 '05 #44

P: n/a
Why not try to use NoClone, it finds and deletes duplicate files by
true byte-by-byte comparison. Smart marker filters duplicate files to
delete. With GUI.
http://noclone.net
Xah Lee wrote:
here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which file are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicate.
(including files in sub-directories) More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed the full paths of all other files that is a copy of A. These
duplicates's full paths will be prefixed with "rm " string. A empty
line follows a group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

order does not matter. (i.e. which file will not be "rm " does not
matter.)

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves not considered. That is, all files in dir1 are compared to
all files in dir2. (including subdirectories) And, only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dir. One is certain that there are no duplicates within each dir
themselves. (imagine that del_dup.pl has run on each already) Files in dir1 has already been categorized into sub directories by human. So
that when there are duplicates among dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.
perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except files in later dir will have "rm "
first. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

the c and d will both have "rm " prefix for sure. (which one has "rm " in dir2 does not matter) Note, although dir2 doesn't compare files
inside itself, but duplicates still may be implicitly found by indirect comparison. i.e. a==c, b==c, therefore a==b, even though a and b are
never compared.
--------------------------

Write a Perl or Python version of the program.

a absolute requirement in this problem is to minimize the number of
comparison made between files. This is a part of the spec.

feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html


Jul 18 '05 #45

This discussion thread is closed

Replies have been disabled for this discussion.