fdups: calling for beta testers

Hi all,

I am looking for beta-testers for fdups.

fdups is a program to detect duplicate files on locally mounted
filesystems. Files are considered equal if their content is identical,
regardless of their filename. Also, fdups ignores symbolic links and is
able to detect and ignore hardlinks, where available.
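
A minimal sketch of how symbolic links and hardlinks could be filtered out, assuming a platform that reports inode numbers. The function name `unique_regular_files` is hypothetical and is not taken from the fdups source.

```python
import os


def unique_regular_files(paths):
    """Yield regular files, dropping symlinks and repeated hardlinks."""
    seen = set()  # (device, inode) pairs already yielded
    for path in paths:
        # Symbolic links are skipped outright, as are non-files.
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        # An inode of 0 means the platform gave no usable inode number,
        # so such files are never treated as hardlinks of each other.
        if st.st_ino and key in seen:
            continue  # another name for a file already yielded
        seen.add(key)
        yield path
```

Two names for the same inode are the same file on disk, so reporting them as duplicates would not point to any space that could be reclaimed.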

In contrast to similar programs, fdups does not rely on md5 sums or
other hash functions to detect potentially identical files. Instead, it
does a direct blockwise comparison and stops reading as soon as
possible, thus reducing file reads to a minimum.
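
The blockwise idea can be sketched for the two-file case as follows. This is only an illustration under stated assumptions: `same_content` and the `BLOCKSIZE` value are hypothetical, and fdups itself compares whole groups of files rather than pairs.

```python
import os

BLOCKSIZE = 8192  # hypothetical block size, not taken from fdups


def same_content(path_a, path_b, blocksize=BLOCKSIZE):
    """Return True if both files hold identical bytes."""
    # Files of different size can never be identical, so nothing is read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False  # first differing block: stop reading here
            if not block_a:
                return True  # both files exhausted without a difference
```

A hash-based tool has to read every byte of every candidate file, whereas this loop gives up at the first block that differs.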

fdups has been developed on Linux but should run on all platforms that
support Python.

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.

I am primarily interested in feedback on whether it produces correct
results. But as I haven't been programming in Python for a year or so,
I'd also be interested in comments on the code if you happen to look at
it in detail.

Your help is much appreciated.

-pu
Jul 18 '05 #1

Patrick Useldinger wrote:

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.


"""fdups has no installation program. Just change into a temporary
directory, and type "tar xfj fdups.tar.bz". You should also chown the
files according to your needs, and then copy the executables to your
PATH."""

(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?

(5) if files[subgroup[j]]['flag'] and files[subgroup[i]]['buffer'] ==
files[subgroup[j]]['buffer']:

That's not the most readable code I've ever seen.

(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.

(7)

! def compare(self):
!     """ compare all files of the same size - outer loop """
!     sizes=self.compfiles.keys()
!     sizes.sort()
!     for size in sizes:
!         self.comparefiles(size,self.compfiles[size])

Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
!     self.comparefiles(size, file_list)

(8) global
MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:

! class fDups:
!     """ encapsulates the whole logic """

(9) Any good reason why the "executables" don't have ".py" extensions
on their names?

All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
And what is "chown" -- any relation of Perl's "chomp"?

Jul 18 '05 #2
John Machin wrote:
(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?
(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?
(4) Never used it, but that is a very valid point. I will look into it.
(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'
(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?
Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.
Bill also thinks it is normal that half of service pack 2 lingers twice
on a harddisk. Not sure whether he's my hero ;-)
(7)
Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
!     self.comparefiles(size, file_list)
(7) I wanted the output to be sorted by file size, instead of being
random. It's psychological, but if you're chasing dups, you'd want to
start with the largest ones first. If you have more than a screenful
of info, it's the last lines which are the most interesting. And it will
produce the same info in the same order if you run it twice on the same
folders.
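
The ordering argument can be kept without the separate sort() call. The sketch below assumes a current Python 3 interpreter, where `sorted()` replaces the Python 2 `keys()` / `sort()` pair quoted in the thread. The helper name `groups_in_size_order` is hypothetical.

```python
def groups_in_size_order(compfiles):
    """Yield (size, file_list) pairs, smallest size first."""
    # sorted() on a dict iterates its keys, so the largest files come last
    # and the order is the same on every run over the same folders.
    for size in sorted(compfiles):
        yield size, compfiles[size]


if __name__ == "__main__":
    example = {393216: ["a.tds", "b.tds"], 1024: ["x", "y"]}
    for size, file_list in groups_in_size_order(example):
        print(size, file_list)
```
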
(8) global
MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:
(8) Agreed. I'll think about that.
(9) Any good reason why the "executables" don't have ".py" extensions
on their names?
(9) Because I am lazy and Linux doesn't care. I suppose Windows does?
All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
As I said, I did not give Windows users much thought. I will improve this.
And what is "chown" -- any relation of Perl's "chomp"?


chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.

Thank you very much for your feedback. Did you actually run it on your
Windows box?

-pu
Jul 18 '05 #3
Patrick Useldinger wrote:
(9) Any good reason why the "executables" don't have ".py" extensions
on their names?


(9) Because I am lazy and Linux doesn't care. I suppose Windows does?


Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.
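
The PATHEXT lookup described above can be imitated in a few lines of Python. This is only a rough sketch of the extension test, not of how Windows actually launches programs. The function name is hypothetical, and whether `.PY` appears in PATHEXT on a given machine depends on how Python was installed there.

```python
import os


def has_executable_extension(filename, pathext=None):
    """Return True if filename ends in one of the PATHEXT extensions."""
    if pathext is None:
        # PATHEXT is a semicolon-separated list such as ".COM;.EXE;.BAT"
        pathext = os.environ.get("PATHEXT", "")
    extensions = [ext.lower() for ext in pathext.split(";") if ext]
    return os.path.splitext(filename)[1].lower() in extensions


if __name__ == "__main__":
    sample = ".COM;.EXE;.BAT;.PY"  # hypothetical PATHEXT value
    print(has_executable_extension("fdups.py", sample))
    print(has_executable_extension("fdups", sample))
```
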

-Peter
Jul 18 '05 #4
Peter Hansen wrote:
Patrick Useldinger wrote:
(9) Any good reason why the "executables" don't have ".py"
extensions on their names?


(9) Because I am lazy and Linux doesn't care. I suppose Windows does?


Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.


Or use exemaker, which IMHO is the best way to handle this
problem.

Serge.
Jul 18 '05 #5

Patrick Useldinger wrote:
John Machin wrote:
(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?
(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?
Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.
(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?
Test, like I did, to see how many open handles you can get away with. I
was not joking, 20 was the max on MS-DOS at one stage and I vaguely
recall: (a) some low limits on various flavours of *x (b) the "ulimit"
command can be used to vary the per-process limit but (c) there is a
system-wide limit also.

You should consider a fall-back method to be used in this case and in
the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
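
One possible shape for such a fall-back, sketched under assumptions: instead of holding every handle open, each block is read by opening the file, seeking, reading and closing again, so only one handle is open at any time. The names `read_block` and `duplicate_groups` and the default block size are hypothetical and do not come from fdups.

```python
from collections import defaultdict


def read_block(path, index, blocksize):
    """Read block number `index` of a file, holding the handle only briefly."""
    with open(path, "rb") as f:
        f.seek(index * blocksize)
        return f.read(blocksize)


def duplicate_groups(paths, size, blocksize=65536):
    """Split files of equal `size` into groups with identical content."""
    groups = [list(paths)]
    index = 0
    while index * blocksize < size and groups:
        next_groups = []
        for group in groups:
            # Files whose current block differs land in different buckets.
            by_block = defaultdict(list)
            for path in group:
                by_block[read_block(path, index, blocksize)].append(path)
            # A bucket with a single file cannot contain a duplicate.
            next_groups.extend(g for g in by_block.values() if len(g) > 1)
        groups = next_groups
        index += 1
    return [g for g in groups if len(g) > 1]
```

The price is one open/close per file per block, which is the very cost the original design avoids, so this would only make sense as the slow path once the handle limit has been hit.
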
And what is "chown" -- any relation of Perl's "chomp"?


chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.
The question was rhetorical. Your irony detector must be on the fritz.
:-)
Did you actually run it on your
Windows box?


Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.

Cheers,
John

Jul 18 '05 #6
John Machin wrote:
Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.
I've added a zip file. It was made in Linux with the zip command-line
tool, the man pages say it's compatible with the Windows zip tools. I
have also added .py extensions to the 2 programs. I did however not use
distutils, because I'm not sure it is really adapted to module-less scripts.
You should consider a fall-back method to be used in this case and in
the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
I've added it to the TODO list.
The question was rhetorical. Your irony detector must be on the fritz.
:-)


I always find it hard to detect irony by mail with people I do not know...
Did you actually run it on your
Windows box?

Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.


I would have been reluctant too. But I've tested it intensively, and
there's strictly no statement that actually alters the file system.

Thanks for your feedback!

-pu
Jul 18 '05 #7
Serge Orlov wrote:
Or use exemaker, which IMHO is the best way to handle this
problem.


Looks good, but I do not use Windows.

-pu
Jul 18 '05 #8
On Sat, 26 Feb 2005 23:53:10 +0100, Patrick Useldinger
<pu*********@gmail.com> wrote:
I've tested it intensively
"Famous Last Words" :-)
Thanks for your feedback!


Here's some more:

(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.

Here's a snippet from a duplicate detection run:

DUP|393216|2|\devel\delimited\build\lib.win32-1.5\delimited.tds|\devel\delimited\build\lib.win32-2.1\delimited.tds
DUP|393216|2|\devel\delimited\build\lib.win32-2.3\delimited.tds|\devel\delimited\build\lib.win32-2.4\delimited.tds

(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size. The above duplicates were detected only when I made the
following changes to your script:
--- fdups Sat Feb 26 06:41:36 2005
+++ fdups_jm.py Sun Feb 27 12:18:04 2005
@@ -29,13 +29,14 @@
         self.count = self.totalsize = self.inodecount = self.slinkcount = 0
         self.gain = self.bytescompared = self.bytesread = self.inodecount = 0
         for toplevel in args:
-            os.path.walk(toplevel, self.buildList, None)
+            os.path.walk(toplevel, self.updateDict, None)
         if self.count > 0:
             self.compare()

-    def buildList(self, arg,dirpath,namelist):
-        """ build a dictionnary of files to be analysed, indexed by length """
-        files = {}
+    def updateDict(self,arg,dirpath,namelist):
+        """ update a dictionary of files to be analysed, indexed by length """
+        # files = {}
+        files = self.compfiles
         for filepath in namelist:
             fullpath = os.path.join(dirpath,filepath)
             if os.path.isfile(fullpath):
@@ -51,20 +52,23 @@
                 if size >= MIN_FILESIZE:
                     self.count += 1
                     self.totalsize += size
+                    # is above totalling in the wrong place?
                     if size not in files:
                         files[size]=[fullpath]
                     else:
                         files[size].append(fullpath)
-        for size in files:
-            if len(files[size]) != 1:
-                self.compfiles[size]=files[size]
+        # for size in files:
+        #     if len(files[size]) != 1:
+        #         self.compfiles[size]=files[size]

     def compare(self):
         """ compare all files of the same size - outer loop """
         sizes=self.compfiles.keys()
         sizes.sort()
         for size in sizes:
-            self.comparefiles(size,self.compfiles[size])
+            list_of_filenames = self.compfiles[size]
+            if len(list_of_filenames) > 1:
+                self.comparefiles(size, list_of_filenames)

     def comparefiles(self,size,filelist):
         """ compare all files of the same size - inner loop """

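
The idea behind the patch, stripped of the class, is roughly the following: keep a single size-indexed dictionary for the whole run and discard singleton sizes only after every directory has been visited. This is a sketch on a current Python 3, where `os.walk` replaces the `os.path.walk` callback used in the script. The name `same_size_groups` and the `min_filesize` default are hypothetical.

```python
import os
from collections import defaultdict


def same_size_groups(toplevels, min_filesize=1):
    """Map size -> paths for sizes shared by two or more files."""
    by_size = defaultdict(list)  # one dictionary for ALL directories
    for toplevel in toplevels:
        for dirpath, dirnames, filenames in os.walk(toplevel):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    size = os.path.getsize(path)
                    if size >= min_filesize:
                        by_size[size].append(path)
    # Filter singletons only at the very end, once every directory is in.
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

Filtering per directory, as the original buildList did, throws away a file that is the only one of its size in its own directory even though a twin may sit elsewhere in the tree.
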
(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:

(1, "'{' is not recognized as an internal or external command,\noperable program or batch file.")

Why not use the Python filecmp module?
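
A portable checker along those lines might look like the sketch below. It assumes that each report line has the form DUP|size|count|path|path..., which is inferred from the snippet earlier in this post rather than from any documentation. The function names are hypothetical.

```python
import filecmp


def all_identical(paths):
    """Return True if every file has the same bytes as the first one."""
    first = paths[0]
    # shallow=False forces a content comparison instead of trusting os.stat()
    return all(filecmp.cmp(first, other, shallow=False) for other in paths[1:])


def check_dup_line(line):
    """Verify one report line, assumed to look like DUP|size|count|path|..."""
    fields = line.rstrip("\n").split("|")
    if fields[0] != "DUP":
        raise ValueError("not a DUP line: %r" % line)
    return all_identical(fields[3:])
```

Unlike the commands module, filecmp does not shell out, so it behaves the same on Windows and Unix.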

Cheers,
John
Jul 18 '05 #9
