
[perl-python] a program to delete duplicate files

Here's a large exercise that uses what we built before.

Suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical to which. Write a program that prints out which files are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicates
(including files in sub-directories). More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A. These
duplicates' full paths will be prefixed with the "rm " string. An empty
line follows a group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

Order does not matter (i.e. which file is left without the "rm " prefix
does not matter).

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared to
all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
directories. One is certain that there are no duplicates within each
directory itself (imagine that del_dup.pl has already been run on each).
Files in dir1 have already been categorized into sub-directories by a
human, so when there are duplicates between dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except that files in later directories will get
the "rm " prefix first. So, if these files are identical:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure (which of the two
files in dir2 gets "rm " does not matter). Note that although dir2 does
not compare files inside itself, duplicates may still be found implicitly
by indirect comparison, i.e. a==c and b==c, therefore a==b, even though
a and b are never compared.
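
For concreteness, here is a small Python sketch of the "rm " rule only
(an illustration, not the promised solution): given one group of
identical files as (dir_index, path) pairs, where dir_index is the
position of the file's top-level directory on the command line, keep a
copy from the earliest directory and mark every other copy.

# Illustration only: print one group of identical files in the
# required format, keeping a copy from the earliest directory.
def print_group(group):
    group = sorted(group)            # lowest directory index sorts first
    print(group[0][1])               # the copy that is kept
    for _dir_index, path in group[1:]:
        print("rm " + path)
    print()                          # a blank line ends the group

print_group([(1, "dir2/a"), (1, "dir2/b"), (3, "dir4/c"), (3, "dir4/d")])
# dir2/a
# rm dir2/b
# rm dir4/c
# rm dir4/d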
--------------------------

Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is part of the spec.

feel free to write it however you want. I'll post my version in a few
days.

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05
On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:
More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash) then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal but you still need to check unless you
use a strong hash).


The code I posted in another thread (and provided a link to in this one) does
exactly that (a quick hash of the first few K before calculating the whole
file's md5 sum). However, Patrick's code is faster, reading only what's
necessary (he does what I intended to do, but I was too lazy -- I actually
rewrote from scratch one of the first programs I wrote in Python, which
obviously was too amateurish code for me to publish :)
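
A rough sketch of that approach (group by size, then by a cheap hash of
the first few K, then by the full md5; the helper names and the 8 KiB
prefix are assumptions, not the code linked above):

import hashlib, os
from collections import defaultdict

def cheap_key(path, prefix=8192):
    # quick hash of only the first few KiB
    with open(path, "rb") as f:
        return hashlib.md5(f.read(prefix)).digest()

def full_key(path, blocksize=1 << 20):
    # md5 of the whole file, read in blocks
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.digest()

def duplicate_groups(paths):
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                      # a unique size cannot be a duplicate
        by_cheap = defaultdict(list)
        for p in same_size:
            by_cheap[cheap_key(p)].append(p)
        for candidates in by_cheap.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for p in candidates:
                by_full[full_key(p)].append(p)
            for group in by_full.values():
                if len(group) > 1:
                    yield group

To match the quoted suggestion exactly, a final byte-for-byte check of
each file against one representative of its group would guard against
the (remote) chance of an md5 collision.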

It seems your objections are related to Xah Lee's specifications; I have no
objections to your objections (-:) other than that we are just trying to produce
something of practical value out of an otherwise doomed thread...
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #21
On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <ep******@ics.uci.edu> wrote:
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
> Well, but the spec didn't say efficiency was the primary criterion, it
> said minimizing the number of comparisons was.


That's exactly what my program does.


If you're doing any comparisons at all, you're not minimizing the number
of comparisons.

ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known), they will all have to be crunched to the last byte to
eliminate them from possible duplicate-set membership. If there is a priori knowledge
of highly probable difference locations (e.g., an internal date stamp at some offset, etc.)
then obviously the candidates can be pre-classified into smaller candidate sets by some
simple heuristic probes.

If differences are likely to appear early in a sequential scan, perhaps developing hashes
in parallel would be a good strategy. But I doubt if blindly pre-computing full hashes would be optimal.
Compare to sort|uniq as a sieve. You wouldn't want to read through any files any further than you had to.
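
One hedged reading of "developing hashes in parallel" (the block size and
names are assumptions): feed each candidate file into its own md5 object
block by block, and split the candidate set whenever the partial digests
diverge, so no file is read past the point where it stops matching the
others.

import hashlib
from collections import defaultdict

def split_by_content(paths, blocksize=64 * 1024):
    # paths are assumed to be files of the same size
    handles = {p: open(p, "rb") for p in paths}
    hashes = {p: hashlib.md5() for p in paths}
    groups = [list(paths)]
    try:
        while True:
            done = True
            next_groups = []
            for group in groups:
                buckets = defaultdict(list)
                for p in group:
                    block = handles[p].read(blocksize)
                    if block:
                        done = False
                    hashes[p].update(block)
                    buckets[hashes[p].digest()].append(p)   # digest of data so far
                next_groups.extend(g for g in buckets.values() if len(g) > 1)
            groups = next_groups
            if done or not groups:
                return groups            # groups of probable duplicates
    finally:
        for f in handles.values():
            f.close()

Whether this beats comparing the raw blocks directly depends on how the
cost of hashing compares with the cost of the comparisons themselves.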

Regards,
Bengt Richter
Jul 18 '05 #22

David Eppstein wrote:
In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
Well, but the spec didn't say efficiency was the primary criterion, it said minimizing the number of comparisons was.
That's exactly what my program does.


If you're doing any comparisons at all, you're not minimizing the

number of comparisons.


Gah. You two should be lawyers. I don't know why you're arguing about
whatever you are arguing about.

Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

1. d <= S
2. Anyone have a usable hash function that's faster than string
comparison?
3. Anyone have an answer to PU's request for advice where his algorithm
could be improved by using a hash method?

Temporary injunction in favour of PU.
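
To put assumed numbers on the estimate above (figures invented purely for
illustration):

N = 10                 # files of the same size
S = 700 * 2**20        # each 700 MiB long
d = 64 * 2**10         # first difference within the first 64 KiB
reads_direct = N * d   # each file is read only up to where it differs (d = S if it doesn't)
reads_hashed = N * S   # full hashing must read every byte of every file
print(reads_hashed // reads_direct)   # -> 11200 times more bytes read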

Jul 18 '05 #23
John Machin wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.

Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with two possible reasons myself:

1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.

-pu
Jul 18 '05 #24
Patrick Useldinger wrote:
Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with two possible reasons myself:

1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.

3) Probability is on your side. If two files match in length but differ
in contents, a good hash will have probability (1 / max_hash) of
producing a matching hash. For exactly two files of the same length,
calculating the hash is clearly a waste. Even for three files of
the same length, you are most likely to find them distinct in three
comparisons. Using hashes: three file reads and three comparisons
of hash values. Without hashes: six file reads; you must read both
files to do a file comparison, so three comparisons means six file reads.
Naturally, as the group grows beyond three, the difference grows drastically.

-Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #25
Scott David Daniels wrote:
comparisons. Using hashes: three file reads and three comparisons
of hash values. Without hashes: six file reads; you must read both
files to do a file comparison, so three comparisons means six file reads.


That's provided you always compare 2 files at a time. I compare n files
at a time, n being the number of files of the same size. That's quicker
than hashes because I have a fair chance of finding a difference before
the end of the files. Otherwise, it's like hashes without the computation
and without having to take a second pass to *really* compare them.
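
A rough sketch of that n-at-a-time comparison (not the actual fdups code;
the block size and names are assumptions): read one block per file,
bucket the files of a same-size group by the block's bytes, and stop
reading any file as soon as its bucket shrinks to one.

def compare_group(paths, blocksize=64 * 1024):
    # paths are assumed to be files of identical size
    files = [(p, open(p, "rb")) for p in paths]
    groups, result = [files], []
    while groups:
        next_groups = []
        for group in groups:
            buckets = {}
            for path, handle in group:
                block = handle.read(blocksize)
                buckets.setdefault(block, []).append((path, handle))
            for block, members in buckets.items():
                if len(members) < 2:
                    members[0][1].close()         # unique so far: stop reading it
                elif block == b"":                # end of file reached, still together
                    result.append([p for p, _h in members])
                    for _p, h in members:
                        h.close()
                else:
                    next_groups.append(members)
        groups = next_groups
    return result                                 # lists of byte-identical files

Every file stays open until it is eliminated, which is where the
file-handle limit discussed later in the thread comes in.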

-pu
Jul 18 '05 #26
[Patrick Useldinger]
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's
a large number of identical hashes, you'd still need to read all of
these files to make sure.


Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.

I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was by far, for all practical means, a waste.

Similar arguments apply, say, to the confidence we may have in
probabilistic algorithms for determining the primality of numbers.
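
A quick birthday-bound calculation along those lines (figures assumed;
it treats md5 as a random 128-bit function and ignores deliberately
crafted collisions):

n, bits = 10**6, 128                          # a million distinct files, md5-sized hash
p_collision = n * (n - 1) / 2.0 / 2.0**bits   # chance that any two hashes collide
print(p_collision)                            # roughly 1.5e-27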

--
François Pinard http://pinard.progiciels-bpi.ca
Jul 18 '05 #27
François Pinard wrote:
Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.
We're talking about md5, sha1 or similar. They are all known not to be
100% perfect. I agree it's a rare case, but still, why settle for
something "about right" when you can have "right"?
I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was by far, for all practical means, a waste.


It was not my only argument for not using hashes. My algorithm also does
fewer reads, for example.

-pu
Jul 18 '05 #28

Patrick Useldinger wrote:
John Machin wrote:
Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S]

PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].

Hashing method: O(NS) reading time, O(NS) hash calc time

Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's

a large number of identical hashes, you'd still need to read all of these files to make sure.


Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".

1. You are ahead already even before adding any extra time for
comparison on files with equal hashes.

2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.

Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around a limit of 20 open files years ago on MS-DOS, and was aware
that this magic number was copied blindly from early Unix. I did tell you
that empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you run out of handles.
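
A throwaway probe for measuring that margin empirically (illustrative
only; it simply opens one scratch file repeatedly until the OS refuses
or a cap is reached):

import os, tempfile

def max_open_files(cap=65536):
    fd, name = tempfile.mkstemp()
    os.close(fd)
    handles = []
    try:
        while len(handles) < cap:
            try:
                handles.append(open(name, "rb"))
            except (OSError, IOError):      # "too many open files"
                break
        return len(handles)
    finally:
        for h in handles:
            h.close()
        os.unlink(name)

print(max_open_files())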

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?
[1] Yes, I am aware of the subject of this thread.

Regards,

John

Jul 18 '05 #29
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".
Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.
Still, if you can get it 100% right automatically, why would you bother
checking manually? Why go back to arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?

Anyway, fdups does not do anything else than report duplicates.
Deleting, hardlinking or anything else might be an option depending on
the context in which you use fdups, but then we'd have to discuss the
context. I never assumed any context, in order to keep it as universal
as possible.
Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around a limit of 20 open files years ago on MS-DOS, and was aware
that this magic number was copied blindly from early Unix. I did tell you
that empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you run out of handles.
For the time being, the additional files will be ignored and a warning
is issued. fdups does not quit; why are you saying this?

A fallback solution would be to open the file before every _block_ read,
and close it afterwards. In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles
in a multitasking environment.
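
A sketch of that fallback (names assumed; this is not the fdups
implementation): re-open the file for every block so that at most one
extra handle is held at any moment, trading open/seek/close overhead for
file handles.

def read_block(path, offset, blocksize=64 * 1024):
    # open, position, read one block, and close again immediately
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(blocksize)

# Inside the comparison loop one would then call read_block(path, offset)
# for each candidate at the current offset instead of keeping n files open.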

Not difficult to implement, but I first wanted to refactor the code so
that it's a proper class that can be used in other Python programs, as
you also asked. That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I make changes
in order of (my own) priority.
You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly
fewer than 500 files of the same length.
Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?


Sorry, I do not understand what you mean by this.

-pu
Jul 18 '05 #30

