Bytes IT Community

how can I make this script shorter?

I have a script which I use to find all duplicates of files within a
given directory and all its subdirectories. It seems like it's longer
than it needs to be but I can't figure out how to shorten it. Perhaps
there are some python features or libraries I'm not taking advantage of.

The way it works is that it puts references to all the files in a
dictionary with file size being the key. The dictionary can hold
multiple values per key. Then it looks at each key and all the
associated files (which are the same size). Then it uses filecmp to see
if they are actually byte-for-byte copies.

It's not 100% complete but it's pretty close.

Lowell

#!/usr/bin/env python
import os, os.path, filecmp, operator, sys

# return a list of lists of duplicates
# ie. partitionDuplicates([1, 2, 3, 1, 1, 3], operator.eq) -> [[1,1,1],[3,3]]
def partitionDuplicates(args, eq=operator.eq):
    numFiles = len(args)
    if numFiles <= 1:
        return []
    partitions = []
    flags = [False] * numFiles
    for i in xrange(numFiles - 1):
        if not flags[i]:  # file hasn't been matched to earlier file
            matches = [args[i]]  # the current file we're comparing others to
            for j in xrange(i+1, numFiles):
                if not flags[j] and eq(args[i], args[j]):
                    matches.append(args[j])
                    flags[j] = True
            if len(matches) > 1:
                partitions.append(matches)
    assert(reduce(operator.add, map(len, partitions), 0)
           <= len(args))
    for partition in partitions:
        for i in xrange(len(partition) - 1):
            assert eq(partition[i], partition[i+1])
    return partitions

assert partitionDuplicates([1, 2, 3, 1, 1, 3], operator.eq) == [[1,1,1],[3,3]]

def enumerateMatches(dir):
    all = {}
    # first, find duplicates strictly by their sizes
    for root, dirs, files in os.walk(dir):
        for file in files:
            fullname = os.path.join(root, file)
            size = os.path.getsize(fullname)
            if size not in all:
                all[size] = []
            all[size].append(fullname)

    # now check which files are really duplicates of each other
    toreturn = []
    for files in all.itervalues():
        partition = partitionDuplicates(files, filecmp.cmp)
        if partition:
            for i in partition:
                toreturn.append(i)
    return toreturn

def fakeDelete(file):
    print 'simulating deleting %s\n' % file

def dealWithMatches(matches, delete=fakeDelete):
    "return true if files were deleted"
    while True:
        print 'Which file do you want to keep? (all others will be deleted)'
        print 'or press "s" to skip this group'
        for i in xrange(len(matches)):
            print '%d) %s' % (i, matches[i])
        input = raw_input().strip().lower()
        if input == 's':
            print
            return False
        try:
            input = int(input)
            if 0 <= input < len(matches):
                for i in xrange(len(matches)):
                    if i != input:
                        delete(matches[i])
                return True
        except ValueError:
            pass
        print 'Invalid entry. Try again\n'

if __name__ == '__main__':
    global matcheses  # should get rid of this line
    if len(sys.argv) != 2:
        print 'usage: %s dirname' % os.path.basename(sys.argv[0])
    else:
        matcheses = enumerateMatches(sys.argv[1])
        print 'Found at least %d matches' % len(matcheses)
        for matches in matcheses:
            dealWithMatches(matches)
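As to the subject-line question, the size-bucketing loop in enumerateMatches can be written more compactly with collections.defaultdict, which removes the explicit membership test. A minimal sketch (in modern Python 3 syntax rather than the Python 2 of the script above; the function name is my own):

```python
import os
from collections import defaultdict

def files_by_size(top):
    """Map each file size to the list of paths under top with that size."""
    by_size = defaultdict(list)
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)
    # only sizes shared by two or more files can hide duplicates
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

The filecmp pass over each bucket stays the same; only the bookkeeping shrinks.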

Jul 18 '05 #1
Share this Question
Share on Google+
7 Replies


On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh
<lk****@cs.ubc.ca> might have written:
I have a script which I use to find all duplicates of files within a
given directory and all its subdirectories. [...]


I can't advise on code length; my dupefind.py script is 361 lines, but the
algorithm is slightly more complex to speed things up (and it also optionally
hardlinks identical files on POSIX and NTFS filesystems). If, in your case,
there are lots of multi-MiB files that often match on size, you could avoid
many full comparisons by matching on a hash (md5 or sha) instead.

You could also compare first on the hash of the first few kiB (I check 8 kiB) to
see if you need to read the whole file or not.

So:

for every file:
    if other files exist with the same size:
        calculate hash for first few kiB
        if file has same "initial" hash as other files:
            calculate full hash
return all files with same full hash

Something like that.
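The sketch above translates fairly directly into code. A minimal version (modern Python 3; the function names are my own, and the 8 KiB head size follows the post):

```python
import hashlib
import os
from collections import defaultdict

def _groups(paths, key):
    """Bucket paths by key(path); keep only buckets holding 2+ files."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]

def _md5(path, limit=-1):
    """md5 of the first `limit` bytes (whole file when limit is -1)."""
    with open(path, "rb") as fp:
        return hashlib.md5(fp.read(limit)).digest()

def find_duplicates(paths):
    """Refine size groups by head hash, then by full hash, as sketched above."""
    result = []
    for same_size in _groups(paths, os.path.getsize):            # same size
        for same_head in _groups(same_size, lambda p: _md5(p, 8192)):  # same first 8 KiB
            result.extend(_groups(same_head, _md5))              # same full hash
    return result
```

This trusts the full hash as the final word; if hash collisions worry you, the byte-for-byte filecmp pass from the original script can still be run over each returned group.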
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #2


Lowell Kirsh wrote:
I have a script which I use to find all duplicates of files within a
given directory and all its subdirectories. [...]


To answer the question in the message subject: 1,$d

And that's not just the completely po-faced literal answer that the
question was begging for: why write something when it's already been
done? Try searching this newsgroup; there was a discussion on this very
topic only a week ago, during which the effbot provided the URL of an
existing python file duplicate detector. There seems to be a discussion
every so often ...

However if you persist in DIY, read the discussions in this newsgroup,
search the net (people have implemented this functionality in other
languages); think about some general principles -- like should you use
a hash (e.g. SHA-n where n is a suitably large number). If there are N
files all of the same size, you have two options (a) do O(N**2) file
comparisons or (b) do N hash calcs followed by O(N**2) hash
comparisons; then deciding on your
need/whim/costs-of-false-negatives/positives you can stop there or you
can do the file comparisons on the ones which match on hashes. You do
however need to consider that calculating the hash involves reading the
whole file, whereas comparing two files can stop when a difference is
detected. Also, do you understand and are you happy with using the
(default) "shallow" option of filecmp.cmp()?
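The shallow question is worth checking concretely: with the default shallow=True, filecmp.cmp trusts the os.stat signature (file type, size, mtime) and never reads the files when the signatures match. A small demonstration (the file names and contents are arbitrary):

```python
import filecmp
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a.txt")
b = os.path.join(d, "b.txt")
with open(a, "wb") as fp:
    fp.write(b"hello world!")
with open(b, "wb") as fp:
    fp.write(b"HELLO WORLD!")   # same size, different bytes

# Give b the same modification time as a, so their stat signatures match.
st = os.stat(a)
os.utime(b, ns=(st.st_atime_ns, st.st_mtime_ns))

print(filecmp.cmp(a, b))                 # True  -- shallow: stat signature only
print(filecmp.cmp(a, b, shallow=False))  # False -- actually compares the bytes
```

So grouping by size and then calling filecmp.cmp with its default can report false duplicates for same-size files that happen to share a timestamp; pass shallow=False for a real byte-for-byte check.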

Jul 18 '05 #3

Good idea about hashing part of the file before comparing entire files.
It will make the script longer but the speed increase will most likely
make it worth it.

Lowell

Christos TZOTZIOY Georgiou wrote:
On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh
<lk****@cs.ubc.ca> might have written: [...]

I can't advise on code length; my dupefind.py script is 361 lines [...]
You could also compare first on the hash of the first few kiB (I check
8 kiB) to see if you need to read the whole file or not. [...]

Jul 18 '05 #4

Thanks for the advice. There are definitely some performance issues I
hadn't thought of before. I guess it's time to go lengthen, not shorten,
the script.

Lowell

John Machin wrote:
Lowell Kirsh wrote:
I have a script which I use to find all duplicates of files within a
given directory and all its subdirectories. [...]

However if you persist in DIY, read the discussions in this newsgroup,
and think about some general principles -- like should you use a hash
(e.g. SHA-n where n is a suitably large number). [...] Also, do you
understand and are you happy with using the (default) "shallow" option
of filecmp.cmp()?

Jul 18 '05 #5

On Wed, 23 Feb 2005 01:56:02 -0800, rumours say that Lowell Kirsh
<lk****@cs.ubc.ca> might have written:
Good idea about hashing part of the file before comparing entire files.
It will make the script longer but the speed increase will most likely
make it worth it.


My dupefind module was one of the first modules I wrote in python (it was a
shell script once...), so it's not appropriate to post. However, rewriting it
was in my RSN list; after your post, I said, "what the heck" :)

I wouldn't describe it as /clean/ Python code, so any comments on style,
methodology (and of course, bugs!) are mostly welcome. If you have any ideas
how to make it more "Pythonic" in style, perhaps we can even send it to ASPN.
On the upside, it's a good demo of Python dynamism and therefore power.

All first spaces are converted to pipe symbols to keep code indents, and sorry
for the assignment operators lacking spacing at the left, but it helped me know
what I meant in C, so I kept the habit in Python.

"""dupefinder.py -- a module to find duplicate files.

find_duplicate_files(*paths):
    A function returning an iterable of lists of duplicate files
    To be used as
    for duplicate_files in dupefinder.find_duplicate_files(dir1, dir2...):
        # process every group of duplicates
"""

import os, md5, errno, sys

IGNORED_ERROR_CODES= frozenset( [errno.EACCES, errno.ENOENT] )

class File(object):
    """A wrapper for files to facilitate their comparison.

    Interface:
    - .get_hash(level) returns appropriate key for the current cmp level
    - .filename
    - .size"""

    __slots__= "filename", "size", "_hashes",

    def __init__(self, filename):
        self.filename= filename
        self.size= os.path.getsize(filename)
        if self.size == 0:
            self._hashes= [0, None, None]
        else:
            self._hashes= [self.size]

    def __repr__(self):
        return "%s('%s')" % (self.__class__.__name__, self.filename)

    def get_hash(self, level):
        """Return appropriate key for level of comparison.
        level == 0 returns file size
        level == 1 returns hash of first few kb
        level == 2 returns md5 sum of whole file"""
        if level >= len(self._hashes):
            if len(self._hashes) == 1:
                self._hashes.append(self._get_partial_hash())
            if level >= 2 and len(self._hashes) == 2:
                self._hashes.append(self._get_full_hash())
        return self._hashes[level]

    def _get_partial_hash(self):
        fp= open(self.filename, "rb")
        try:
            return hash(fp.read(8192))
        finally:
            fp.close()

    def _get_full_hash(self):
        fp= open(self.filename, "rb")
        full_hash= md5.new()
        while 1:
            data= fp.read(65536)
            if not data: break
            full_hash.update(data)
        fp.close()
        return full_hash.digest()

class GenericFilesDict(dict):
    """A virtual wrapper for the dictionaries of file comparisons.
    Not to be used on its own, but through subclassing.

    Each subclass should define a _level class attribute to be used with the
    File.get_hash method, and a _next_class attribute pointing to the class
    managing the next level comparisons."""
    __slots__= ()

    def add_file(self, fileobj):
        """Add a File object to self keyed by the appropriate key based on
        self._level. If another file object exists in the same spot, replace it
        by a _next_class instance containing the pre-existing and new file
        obj."""

        this_hash= fileobj.get_hash(self._level)

        try:
            result = self[this_hash]
        except KeyError:
            self[this_hash]= fileobj
            return

        try: # there was something in slot [this_hash]
            result.add_file(fileobj) # so add fileobj to it
        except AttributeError: # it has no add_file method, so it's a File
            _= self[this_hash]= self._next_class() # make an instance
            _.add_file(result) # add pre-existing
            _.add_file(fileobj) # add new

    def __repr__(self):
        return "%s%s" % (self.__class__.__name__, dict.__repr__(self))

    def __iter__(self):
        "Return all instances of SameFiles (subclass of list)."
        for item, value in self.iteritems():
            try: _next_class= value._next_class
            except AttributeError: continue
            if _next_class is None:
                yield value
            else:
                for item in value:
                    yield item

class SameFiles(list):
    "A list of duplicate files."
    __slots__= ()
    _level= 3
    _next_class= None # search stops here

    def add_file(self, fileobj):
        self.append(fileobj)

class FilesByFullHash(GenericFilesDict):
    """A dictionary keyed on md5 hash of whole file.

    The algorithm assures that all File objects in this dict
    have the same size and the same hash of first few kiB."""
    __slots__= ()
    _level= 2
    _next_class= SameFiles

class FilesByPartialHash(GenericFilesDict):
    """A dictionary keyed on hash of first few kiB of file.

    The algorithm assures that all File objects in this dict
    have the same size."""
    __slots__= ()
    _level= 1
    _next_class= FilesByFullHash

class FilesBySize(GenericFilesDict):
    """A dictionary keyed on file size."""
    __slots__= ()
    _level= 0
    _next_class= FilesByPartialHash

def find_duplicate_files(*directories):
    """Find all duplicate files in the specified directories.

    The returned object can be iterated on, yielding a list of
    duplicate files every time."""
    dupefinder= FilesBySize()
##    count= 0
    for path in directories:
        for dirname, dirnames, filenames in os.walk(path):
            for filename in filenames:
                try:
                    dupefinder.add_file(File(os.path.join(dirname, filename)))
                except EnvironmentError, exc:
                    if exc.errno not in IGNORED_ERROR_CODES: # deleted or b0rken symlink
                        raise
##                count += 1
##                if count & 15 == 0:
##                    sys.stderr.write("\r%d" % count)

    return dupefinder

def pprint(obj, level=0):
    """For testing purposes."""
    indent= " "*level
    if isinstance(obj, File):
        print `obj`
    elif isinstance(obj, list):
        print "list"
        for subobj in obj:
            print indent, "-", repr(subobj)
    else:
        print obj.__class__.__name__
        for key, subobj in obj.iteritems():
            print indent, "+(%r)" % key,
            pprint(subobj, level+1)

if __name__=="__main__":
    result= find_duplicate_files(*sys.argv[1:])
    import pprint
    for group in result:
        pprint.pprint(group)

--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #6

It looks pretty good, but I'll have to take a better look later. Out of
curiosity, why did you convert the first spaces to pipes rather than add
the code as an attachment?

Lowell

Christos TZOTZIOY Georgiou wrote:
On Wed, 23 Feb 2005 01:56:02 -0800, rumours say that Lowell Kirsh
<lk****@cs.ubc.ca> might have written:

Good idea about hashing part of the file before comparing entire files. [...]

My dupefind module was one of the first modules I wrote in python (it was a
shell script once...), so it's not appropriate to post. However, rewriting it
was in my RSN list; after your post, I said, "what the heck" :) [...]

Jul 18 '05 #7

On Thu, 24 Feb 2005 15:32:03 -0800, rumours say that Lowell Kirsh
<lk****@cs.ubc.ca> might have written:
It looks pretty good, but I'll have to take a better look later. Out of
curiosity, why did you convert the first spaces to pipes rather than add
the code as an attachment?


(As you probably noticed, the pipes ended up as dots in the final version, but I
forgot to change the text.)

I considered the code small enough to leave in the message. And using some
non-space character at the start of every line (usually dots) has become
common practice in the group lately. Even if one doesn't use vi for editing,
it's easy enough with Idle to remove the dot at the start of every line.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #8

This discussion thread is closed.