
[perl-python] a program to delete duplicate files

here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicates
(including files in sub-directories). More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A. These
duplicates' full paths will be prefixed with the "rm " string. An empty
line follows each group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

order does not matter. (i.e. which file within a group is left without
the "rm " prefix does not matter.)

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared to
all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
dirs. One is certain that there are no duplicates within each dir
itself (imagine that del_dup.pl has already been run on each). Files in
dir1 have already been categorized into sub-directories by a human, so
that when there are duplicates between dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except that files in later dirs get the "rm "
prefix preferentially. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure (which of the two
files in dir2 gets the "rm " prefix does not matter). Note: although files
inside dir2 are not compared with each other, duplicates may still be found
implicitly by indirect comparison, i.e. a==c and b==c, therefore a==b, even
though a and b are never compared directly.
--------------------------

Write a Perl or Python version of the program.

an absolute requirement in this problem is to minimize the number of
comparisons made between files. This is a part of the spec.

feel free to write it however you want. I'll post my version in a few
days.
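For illustration only, here is a minimal Python 3 sketch of one possible
approach (not the author's posted solution): bucket candidate files by size
first, then confirm duplicates by comparing content, so files of different
sizes are never read at all. All function and variable names are my own,
and the sketch treats every directory as one pool rather than implementing
the "later dirs get rm" preference from the spec.

#!/usr/bin/env python3
"""Sketch: report duplicate files under the directories given as arguments."""
import filecmp
import os
import sys

def files_by_size(dirs):
    """Walk the directories and bucket every file path by its size."""
    groups = {}
    for d in dirs:
        for root, _, names in os.walk(d):
            for name in names:
                path = os.path.join(root, name)
                groups.setdefault(os.path.getsize(path), []).append(path)
    return groups

def duplicate_groups(paths):
    """Split a same-size bucket into clusters of files with identical content."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            if filecmp.cmp(cluster[0], path, shallow=False):
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return [c for c in clusters if len(c) > 1]

if __name__ == '__main__':
    for same_size in files_by_size(sys.argv[1:]).values():
        if len(same_size) < 2:
            continue
        for group in duplicate_groups(same_size):
            keep, *rest = group
            print(keep)
            for extra in rest:
                print('rm', extra)
            print()

The size bucketing is what keeps the number of content comparisons small:
only files that could possibly be identical are ever opened.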

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05

Patrick Useldinger wrote:
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2) of identical hashes, you'd still need to *RE*-read and *compare* ....".

Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further,
nobody in their right mind [1] would contemplate automatically deleting n-1 out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with different names or in different-but-related directories with the same names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand out as implausible.


Still, if you can get it 100% right automatically, why would you
bother checking manually?
A human in their right mind is required to decide what to do with the
duplicates. The proponents of hashing -- of which I'm not one -- would
point out that any false-positives would be picked up as part of the
human scrutiny.
Why get back to arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer - yes or no?
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.

Anyway, fdups does not do anything else than report duplicates.
Deleting, hardlinking or anything else might be an option depending on the context in which you use fdups, but then we'd have to discuss the context. I never assumed any context, in order to keep it as universal as possible.
That's very good, but it wasn't under contention.
Different subject: maximum number of files that can be open at once. I raised this issue with you because I had painful memories of having to work around the max=20 limit on MS-DOS years ago, and was aware that this magic number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to me compared to the likely number of files with the same size -- but you might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
For the time being, the additional files will be ignored, and a
warning is issued. fdups does not quit, why are you saying this?
I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.

A fallback solution would be to open the file before every _block_ read, and close it afterwards.
Ugh. Better use more memory, so less blocks!!
In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles in a multitasking environment.
The pythonic way is to press ahead optimistically and recover if you
get bad news.
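A small sketch of that optimistic style (purely illustrative, not fdups'
actual code; the helper name and exception handling are assumptions): try
to keep a handle open, and only fall back to per-block open/close when the
OS refuses to hand out more file handles.

def open_or_defer(path, open_handles):
    """Try to keep the file open; return None if the OS is out of handles."""
    try:
        handle = open(path, 'rb')
    except (IOError, OSError):   # e.g. EMFILE: too many open files
        return None              # caller falls back to open/read/close per block
    open_handles.append(handle)
    return handle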

Not difficult to implement, but I first wanted to refactor the code so that it's a proper class that can be used in other Python programs, as you also asked.
I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You already had a singleton class; why
another? What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I do changes by (my own) priority.
You wrote at some stage in this thread that (a) this caused problems on Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly
less than 500 files of the same length.


Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Re (b): famous last words! How long would it take you to do a test and announce the margin of safety that you have?


Sorry, I do not understand what you mean by this.


Test:
for k in range(1000):
    open('foo' + str(k), 'w')
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."

Cheers,
John

Jul 18 '05 #31
John Machin wrote:
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.
Of course, but it's good to know that the computer is right, isn't it?
That leaves the human to take decisions instead of double-checking.
I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.
Bufferpool is a parameter, and the default values allow for 4096 files
of the same size. It's more likely to run out of file handles than out
of bufferspace, don't you think?
The pythonic way is to press ahead optimistically and recover if you
get bad news.
You're right, that's what I thought about afterwards. Current idea is to
design a second class that opens/closes/reads the files and handles the
situation independently of the main class.
I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You already had a singleton class; why
another? What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
That is what I have submitted to you. Are you sure that *I* am the
lawyer here?
Re (a): what evidence do you have?


See ;-)
Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Nothing I started manually, but the usual bunch of local firewall, virus
scanner (not doing a complete machine check at that time).
Test:
for k in range(1000):
    open('foo' + str(k), 'w')
I'll try that.
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."


I wouldn't count on that in a multi-tasking environment, as I said. The
class I described earlier seems a cleaner approach.

Regards,
-pu
Jul 18 '05 #32
John Machin wrote:
Test:
for k in range(1000):
    open('foo' + str(k), 'w')


I ran that and watched it open 2 million files, still going strong ...
until I figured out that the files are closed by Python immediately because
there's no reference kept to them ;-)

Here's my code:

#!/usr/bin/env python
import os
print 'max number of file handles today is',
n = 0
h = []   # keep references so Python does not close the files
try:
    while True:
        filename = 'mfh' + str(n)
        h.append((file(filename, 'w'), filename))
        n = n + 1
except:
    print n
for handle, filename in h:
    handle.close()
    os.remove(filename)

On Slackware 10.1, this yields 1021.
On WinXPSP2, this yields 509.

-pu
Jul 18 '05 #33
In article <11**********************@g14g2000cwa.googlegroups.com>,
"John Machin" <sj******@lexicon.net> wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]


I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into a good duplicate detection code, and that any remaining tests that
would be performed are between files that are very likely to be the same
as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.

The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.
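For concreteness, the m-read hashing alternative could look roughly like the
sketch below (one full pass per file, using Python's hashlib module; the
block size, digest choice and names are my own assumptions, not anything
proposed in this thread):

import hashlib

def strong_hash(path, blocksize=1 << 20):
    """One full read of the file, hashed block by block."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def group_by_hash(likely_equal_paths):
    """m candidate files -> m full reads; equal digests are reported as equal."""
    groups = {}
    for path in likely_equal_paths:
        groups.setdefault(strong_hash(path), []).append(path)
    return [g for g in groups.values() if len(g) > 1]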

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #34
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee.


When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #35
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.


I am not an expert in this field. All I know is that MD5 and SHA1 can
create collisions. Are there stronger algorithms that do not? And, more
importantly, has it been *proved* that they do not?

-pu

Jul 18 '05 #36
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.


If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always been significantly less than m.

-pu
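A rough sketch of the parallel, block-wise comparison being described
(illustrative only; fdups itself may differ): read every candidate in
lockstep and split the group as soon as blocks differ, so each file is read
at most once and usually only a prefix of it. Note that it keeps all
candidates open at once, which is exactly the file-handle limit discussed
earlier in the thread.

def compare_in_parallel(paths, blocksize=1 << 16):
    """Return clusters of identical files among same-size candidate paths."""
    handles = [(p, open(p, 'rb')) for p in paths]
    groups = [handles]                    # files still possibly equal to each other
    try:
        while any(len(g) > 1 for g in groups):
            at_eof = True
            next_groups = []
            for group in groups:
                if len(group) < 2:
                    next_groups.append(group)
                    continue
                by_block = {}
                for path, fh in group:
                    block = fh.read(blocksize)
                    if block:
                        at_eof = False
                    by_block.setdefault(block, []).append((path, fh))
                next_groups.extend(by_block.values())   # differing blocks split the group
            groups = next_groups
            if at_eof:                    # every remaining candidate reached EOF
                break
        return [[path for path, _ in g] for g in groups if len(g) > 1]
    finally:
        for _, fh in handles:
            fh.close()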
Jul 18 '05 #37
On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein <ep******@ics.uci.edu> wrote:
In article <11**********************@g14g2000cwa.googlegroups.com>,
"John Machin" <sj******@lexicon.net> wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]
I think this misses the point. It's easy to find the files that are
different. Just a file size comparison will get most of them, and most
of the rest can be detected in the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into a good duplicate detection code, and that any remaining tests that
would be performed are between files that are very likely to be the same
as each other.

The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear to
be the same, requires 2(m-1) reads through the whole files if you use a
comparison based method, or m reads if you use a strong hashing method.

What do you mean by "a comparison based method" ? Are you excluding the
possibility of parallel reading and comparing? The problem then becomes
verifying that m buffers containing corresponding segments of the files
are equal. You could do that by comparing hashes of the buffers (or updated
running hashes for the whole files so far read), or you could compare the
buffers in parallel byte by byte if that turns out efficient to implement
in the given language and os environment. Small buffers would probably
not be efficient for disk access, but too large might not be good either.
Maybe some automatic tuning algorithm could be developed to optimize
comparison operations in the shadow of i/o. But does the OS accept hints
about read-ahead buffering? Are the disks SCSI raid or USB external or
networked? Can automatic tuning achieve an optimum disregarding all that?

What is it that we're trying to optimize? Time to classify the
whole file set into groups of equivalent files? (note that there can be
multiple groups that are internally equal but unequal to members of other groups).

The problem of limitation on the number of simultaneously open files could
be virtualized away, and only have impact when the limit is actually encountered.
You can't hope to cut the reads off early when using comparisons,
because the files won't be different.

So the worst case will be reading all through to the end once in parallel.
But pre-hashing guarantees the worst case for all -- unless disk access patterns
happen to make that the most efficient way to get everything read and compared
when most are equal.
The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.

ISTM 2(m-1) reads is not necessary, so "the question" to me doesn't involve that ;-)

Regards,
Bengt Richter
Jul 18 '05 #38
Patrick Useldinger <pu*********@gmail.com> writes:
John Machin wrote: [...]
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody

[...] Still, if you can get it 100% right automatically, why would you
bother checking manually? Why get back to arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?

[...]

Well, as Francois pointed out, it is strictly not physically possible
to obtain a perfectly reliable answer, even if you *do* do the
comparison.

Even so, you're right on this point (though IIUC it's not practically
important ATM): regardless of wild flukes, people can deliberately
wangle files to get a hash collision, so comparison is better than
hashing from this PoV.
John
Jul 18 '05 #39
Patrick Useldinger <pu*********@gmail.com> writes:
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates
really are duplicates. To do so, for a group of m files that appear
to be the same, requires 2(m-1) reads through the whole files if you
use a comparison based method, or m reads if you use a strong
hashing method. You can't hope to cut the reads off early when
using comparisons, because the files won't be different.


If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always been significantly less than
m.


Hmm, Patrick's right, David, isn't he?

Except when m gets really BIG (say, M), in which case I suppose m is
no longer the worst case number of whole-file reads.

And I'm not sure what the trade off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).
John
Jul 18 '05 #40
