fdups: calling for beta testers

Hi all,

I am looking for beta-testers for fdups.

fdups is a program to detect duplicate files on locally mounted
filesystems. Files are considered equal if their content is identical,
regardless of their filename. Also, fdups ignores symbolic links and is
able to detect and ignore hardlinks, where available.
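
For readers wondering how hardlinks can be told apart from genuine duplicates: on Unix, two names for the same file share a device number and an inode number. The following is only a minimal sketch of that idea in Python 3, not the code fdups ships; the function name `drop_links` is made up for illustration.

```python
import os


def drop_links(paths):
    """Return paths without symlinks, keeping one name per hardlinked file."""
    seen = set()
    unique = []
    for path in paths:
        if os.path.islink(path):
            continue  # symbolic links are ignored entirely
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)  # same device + inode means same file
        if key in seen:
            continue  # another name for a file that is already listed
        seen.add(key)
        unique.append(path)
    return unique
```

Where the platform does not report meaningful inode numbers, a check like this cannot work, which matches the "where available" caveat above.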

In contrast to similar programs, fdups does not rely on md5 sums or
other hash functions to detect potentially identical files. Instead, it
does a direct blockwise comparison and stops reading as soon as
possible, thus reducing the file reads to a minimum.
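
The blockwise comparison described above can be sketched for two files as follows. This is an illustration in Python 3 under assumptions of my own: the name `same_content` and the 8192-byte block size are hypothetical, and fdups itself compares whole groups of same-sized files rather than pairs.

```python
import os

BLOCKSIZE = 8192  # assumed value, not necessarily what fdups uses


def same_content(path_a, path_b, blocksize=BLOCKSIZE):
    """Return True if both files hold identical bytes."""
    # Files of different size can never be identical, so nothing is read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False  # first mismatch: stop reading here
            if not block_a:
                return True  # both files exhausted without a mismatch
```

Two files that differ in their first block cost one read each, whereas a hash-based approach would have to read both files to the end.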

fdups has been developed on Linux but should run on all platforms that
support Python.

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.

I am primarily interested in getting feedback if it produces correct
results. But as I haven't been programming in Python for a year or so,
I'd also be interested in comments on code if you happen to look at it
in detail.

Your help is much appreciated.

-pu
Jul 18 '05 #1

Patrick Useldinger wrote:

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.


"""fdups has no installation program. Just change into a temporary
directory, and type "tar xfj fdups.tar.bz". You should also chown the
files according to your needs, and then copy the executables to your
PATH."""

(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?

(5) if files[subgroup[j]]['flag'] and files[subgroup[i]]['buffer'] ==
files[subgroup[j]]['buffer']:

That's not the most readable code I've ever seen.

(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.

(7)

! def compare(self):
! """ compare all files of the same size - outer loop """
! sizes=self.compfiles.keys()
! sizes.sort()
! for size in sizes:
! self.comparefiles(size,self.compfiles[size])

Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
! self.comparefiles(size, file_list)

(8) global
MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:

! class fDups:
! """ encapsulates the whole logic """

(9) Any good reason why the "executables" don't have ".py" extensions
on their names?

All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
And what is "chown" -- any relation of Perl's "chomp"?

Jul 18 '05 #2
John Machin wrote:
(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?
(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?
(4) Never used them, but a very valid point. I will look into it.
(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'
(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?
Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.
Bill also thinks it is normal that half of service pack 2 lingers twice
on a harddisk. Not sure whether he's my hero ;-)
(7)
Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
! self.comparefiles(size, file_list)
(7) I wanted the output to be sorted by file size, instead of being
random. It's psychological, but if you're chasing dups, you'd want to
start with the largest ones first. If you have more than a screenful
of info, it's the last lines which are the most interesting. And it will
produce the same info in the same order if you run it twice on the same
folders.
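
The two wishes, a short loop and output ordered by size, are not in conflict. A minimal sketch in Python 3 (the thread's code is Python 2, where `iteritems()` and `keys()` followed by `sort()` were the idiom); the helper name `sizes_in_report_order` is made up:

```python
def sizes_in_report_order(compfiles):
    """Yield (size, file_list) pairs in ascending order of size."""
    # sorted() over a dict iterates its keys, so the largest size comes last
    # and the order is the same on every run.
    for size in sorted(compfiles):
        yield size, compfiles[size]
```

The compare loop then becomes `for size, file_list in sizes_in_report_order(self.compfiles): ...`, which keeps the deterministic order without the separate sort step.
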
(8) global
MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:
(8) Agreed. I'll think about that.
(9) Any good reason why the "executables" don't have ".py" extensions
on their names?
(9) Because I am lazy and Linux doesn't care. I suppose Windows does?
All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
As I said, I did not give Windows users much thought. I will improve this.
And what is "chown" -- any relation of Perl's "chomp"?


chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.

Thank you very much for your feedback. Did you actually run it on your
Windows box?

-pu
Jul 18 '05 #3
Patrick Useldinger wrote:
(9) Any good reason why the "executables" don't have ".py" extensions
on their names?


(9) Because I am lazy and Linux doesn't care. I suppose Windows does?


Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.
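
To make the PATHEXT rule concrete, here is a small sketch in Python 3 that checks a filename's extension against a PATHEXT-style list. The function name `runnable_by_extension` is hypothetical, and whether `.PY` is actually listed depends on how Python was installed on the machine.

```python
import os


def runnable_by_extension(filename, pathext=None):
    """Return True if filename's extension appears in a PATHEXT-style list."""
    if pathext is None:
        # On non-Windows systems PATHEXT is normally unset, giving "".
        pathext = os.environ.get("PATHEXT", "")
    # PATHEXT is a semicolon-separated list such as ".COM;.EXE;.BAT".
    extensions = [ext.lower() for ext in pathext.split(";") if ext]
    return os.path.splitext(filename)[1].lower() in extensions
```

With a sample list of `".COM;.EXE;.PY"`, `fdups.py` passes the check and a bare `fdups` does not, which is the situation described above.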

-Peter
Jul 18 '05 #4
Peter Hansen wrote:
Patrick Useldinger wrote:
(9) Any good reason why the "executables" don't have ".py"
extensions on their names?


(9) Because I am lazy and Linux doesn't care. I suppose Windows does?


Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.


Or use exemaker, which IMHO is the best way to handle this
problem.

Serge.
Jul 18 '05 #5

Patrick Useldinger wrote:
John Machin wrote:
(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?
(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?
Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.
(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'
(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?
Test, like I did, to see how many open handles you can get away with. I
was not joking, 20 was the max on MS-DOS at one stage and I vaguely
recall: (a) some low limits on various flavours of *x (b) the "ulimit"
command can be used to vary the per-process limit but (c) there is a
system-wide limit also.
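
On Unix, the per-process limit can also be queried from Python rather than found by trial and error. A minimal sketch in Python 3 using the standard `resource` module, which is not available on Windows; the wrapper name `open_file_limit` is made up:

```python
import resource  # Unix only; importing it fails on Windows


def open_file_limit():
    """Return the soft per-process limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

The soft limit is the one that triggers "Too many open files"; it covers every descriptor the process holds, not just the files being compared, so some headroom has to be left.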

You should consider a fall-back method to be used in this case and in
the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
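
One possible shape for such a fall-back, sketched in Python 3 under my own assumptions (the names `read_block` and `split_by_block` are hypothetical and this is not fdups code): open each file only for the duration of one read, so that at most one handle is in use at a time, at the cost of repeated open and seek calls.

```python
def read_block(path, offset, blocksize):
    """Read one block from path, holding the file open only for this call."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(blocksize)


def split_by_block(paths, offset, blocksize=8192):
    """Group paths by the bytes found at offset; drop groups of one."""
    groups = {}
    for path in paths:
        block = read_block(path, offset, blocksize)
        groups.setdefault(block, []).append(path)
    # Only files that still agree with at least one other file stay candidates.
    return [group for group in groups.values() if len(group) > 1]
```

Calling `split_by_block` again on each surviving group with the next offset narrows the candidates block by block, just as the all-handles-open approach does.
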
And what is "chown" -- any relation of Perl's "chomp"?


chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.
The question was rhetorical. Your irony detector must be on the fritz.
:-)
Did you actually run it on your
Windows box?


Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.

Cheers,
John

Jul 18 '05 #6
John Machin wrote:
Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.
I've added a zip file. It was made in Linux with the zip command-line
tool, the man pages say it's compatible with the Windows zip tools. I
have also added .py extensions to the 2 programs. I did however not use
distutils, because I'm not sure it is really adapted to module-less scripts.
You should consider a fall-back method to be used in this case and in
the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
I've added it to the TODO list.
The question was rhetorical. Your irony detector must be on the fritz.
:-)


I always find it hard to detect irony by mail with people I do not know...
Did you actually run it on your
Windows box?

Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.


I would have been reluctant too. But I've tested it intensively, and
there's strictly no statement that actually alters the file system.

Thanks for your feedback!

-pu
Jul 18 '05 #7
Serge Orlov wrote:
Or use exemaker, which IMHO is the best way to handle this
problem.


Looks good, but I do not use Windows.

-pu
Jul 18 '05 #8
On Sat, 26 Feb 2005 23:53:10 +0100, Patrick Useldinger
<pu*********@gmail.com> wrote:
I've tested it intensively
"Famous Last Words" :-)
Thanks for your feedback!


Here's some more:

(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.

Here's a snippet from a duplicate detection run:

DUP|393216|2|\devel\delimited\build\lib.win32-1.5\delimited.tds|\devel\delimited\build\lib.win32-2.1\delimited.tds
DUP|393216|2|\devel\delimited\build\lib.win32-2.3\delimited.tds|\devel\delimited\build\lib.win32-2.4\delimited.tds

(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size. The above duplicates were detected only when I made the
following changes to your script:
--- fdups Sat Feb 26 06:41:36 2005
+++ fdups_jm.py Sun Feb 27 12:18:04 2005
@@ -29,13 +29,14 @@
self.count = self.totalsize = self.inodecount = self.slinkcount = 0
self.gain = self.bytescompared = self.bytesread = self.inodecount = 0
for toplevel in args:
- os.path.walk(toplevel, self.buildList, None)
+ os.path.walk(toplevel, self.updateDict, None)
if self.count > 0:
self.compare()

- def buildList(self,arg,dirpath,namelist):
- """ build a dictionnary of files to be analysed, indexed by length """
- files = {}
+ def updateDict(self,arg,dirpath,namelist):
+ """ update a dictionary of files to be analysed, indexed by length """
+ # files = {}
+ files = self.compfiles
for filepath in namelist:
fullpath = os.path.join(dirpath,filepath)
if os.path.isfile(fullpath):
@@ -51,20 +52,23 @@
if size >= MIN_FILESIZE:
self.count += 1
self.totalsize += size
+ # is above totalling in the wrong place?
if size not in files:
files[size]=[fullpath]
else:
files[size].append(fullpath)
- for size in files:
- if len(files[size]) != 1:
- self.compfiles[size]=files[size]
+ # for size in files:
+ # if len(files[size]) != 1:
+ # self.compfiles[size]=files[size]

def compare(self):
""" compare all files of the same size - outer loop """
sizes=self.compfiles.keys()
sizes.sort()
for size in sizes:
- self.comparefiles(size,self.compfiles[size])
+ list_of_filenames = self.compfiles[size]
+ if len(list_of_filenames) > 1:
+ self.comparefiles(size, list_of_filenames)

def comparefiles(self,size,filelist):
""" compare all files of the same size - inner loop """
(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:

(1, "'{' is not recognized as an internal or external
command,\noperable program or batch file.")

Why not use the Python filecmp module?
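
Points (2) and (3) can be illustrated together. The sketch below is Python 3 and rests on assumptions of its own: it uses `os.walk` (the `os.path.walk` call in the patch exists only in Python 2), the names `files_by_size` and `confirmed_pairs` are made up, and it is not the fdups code. It collects sizes across every directory before filtering, then checks candidates with `filecmp`, which works on all platforms.

```python
import filecmp
import os
from collections import defaultdict


def files_by_size(toplevels, min_size=1):
    """Map size -> paths, accumulated over all directories under toplevels."""
    by_size = defaultdict(list)
    for top in toplevels:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
                if size >= min_size:
                    by_size[size].append(path)
    # Filter once, after every directory has been visited, so that same-sized
    # files in different directories end up in the same group.
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}


def confirmed_pairs(paths):
    """Return the pairs from paths whose contents filecmp reports as equal."""
    pairs = []
    for i, first in enumerate(paths):
        for second in paths[i + 1:]:
            # shallow=False compares file contents, not just os.stat() data.
            if filecmp.cmp(first, second, shallow=False):
                pairs.append((first, second))
    return pairs
```

Checking every pair is quadratic in the group size, so this is suited to verifying a handful of reported duplicates rather than to replacing the main comparison loop.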

Cheers,
John
Jul 18 '05 #9
