Bytes | Software Development & Data Engineering Community

[perl-python] a program to delete duplicate files

here's a large exercise that uses what we built before.

suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.

Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 that are duplicates
(including files in sub-directories). More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A.
These duplicates' full paths will be prefixed with the "rm " string. An
empty line follows each group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

order does not matter. (i.e. which file escapes the "rm " prefix does
not matter.)

------------------------

perl del_dup.pl dir1 dir2

will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared
to all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both
directories. One is certain that there are no duplicates within each
directory itself (imagine that del_dup.pl has already been run on each).
Files in dir1 have already been categorized into subdirectories by a
human, so when there are duplicates between dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except that files in later directories get the
"rm " prefix. So, if these files are identical:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure (which of the two
dir2 files gets "rm " does not matter). Note: although files within dir2
are never compared to each other, duplicates inside it may still be
found implicitly by indirect comparison; i.e., a==c and b==c imply
a==b, even though a and b are never compared directly.
--------------------------
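The indirect-comparison point above (a==c and b==c imply a==b) means
file equality partitions files into equivalence classes, so a
disjoint-set (union-find) structure is a natural way to collect
duplicate groups as comparisons succeed. A minimal Python sketch — the
names and paths are illustrative, not taken from any posted solution:

```python
# Minimal union-find: file equality is transitive, so duplicates form
# equivalence classes that merge as individual comparisons succeed.
parent = {}

def find(x):
    # Follow parent links to the class representative, compressing as we go.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression, one level at a time
        x = parent[x]
    return x

def union(x, y):
    # Merge the classes of x and y after a successful equality comparison.
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[ry] = rx

# Only two comparisons are actually made: a==c and b==c...
union("dir2/a", "dir4/c")
union("dir2/b", "dir4/c")
# ...yet a and b land in the same class without ever being compared.
print(find("dir2/a") == find("dir2/b"))  # True
```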

Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is part of the spec.

feel free to write it however you want. I'll post my version in a few
days.
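One common way to keep the comparison count low — a sketch of the
general technique, not necessarily the author's intended solution — is
to group files by size first (files of different sizes cannot be
equal), and only hash the members of multi-file size groups:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_groups(root):
    """Return groups of identical files under root (cheap tests first)."""
    # Pass 1: bucket every file by size; size is a free, cheap fingerprint.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files whose size is shared with at least one other.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:          # unique size => cannot be a duplicate
            continue
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.sha1(f.read()).hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups

def print_report(groups):
    # Spec output: keeper first, "rm "-prefixed copies, then a blank line.
    for group in groups:
        keeper, *copies = group
        print(keeper)
        for path in copies:
            print("rm " + path)
        print()
```

For huge files one would hash in chunks (or compare only a leading
chunk before hashing the rest), but the size-then-hash pipeline is the
core of minimizing work.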

http://www.xahlee.org/perl-python/python.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html

Jul 18 '05
Patrick Useldinger wrote:
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong
cryptographic hashes, good enough that you can trust equal results to
really be equal without having to verify by a comparison.


I am not an expert in this field. All I know is that MD5 and SHA1 can
create collisions. Are there stronger algorithms that do not? And, more
importantly, has it been *proved* that they do not?


I'm not an expert either, but I seem to remember reading recently
that, while it's been proven that it's possible for SHA1 to have
collisions, no actual collisions have been found. Even if that's not
completely correct, you're *far* more likely to be killed by a
meteorite than to stumble across a SHA1 collision. Heck, I'd expect
that it's more likely for civilization to be destroyed by a
dinosaur-killer-sized meteor.

With very few exceptions, if you're contorting yourself to avoid SHA1
hash collisions, then you should also be wearing meteor-proof (and
lightning-proof) armor everywhere you go. (Those few exceptions would
be cases where a malicious attacker stands to gain enough from
constructing a single hash collision to make it worthwhile to invest a
*large* number of petaflops of processing power.) Sure it's not "100%
perfect", but... how perfect do you *really* need?
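The meteorite comparison can be made concrete with the birthday bound:
for n items hashed into b random bits, the probability of any
accidental collision is roughly n(n-1)/2^(b+1) while that value is
small. Even a billion files under a 160-bit hash gives a vanishing
number:

```python
# Birthday-bound estimate: P(collision) ~ n*(n-1) / 2**(bits+1),
# a good approximation while the result is much less than 1.
def collision_probability(n, bits=160):
    return n * (n - 1) / 2 ** (bits + 1)

p = collision_probability(10**9)   # a billion files, SHA-1-sized hashes
print(p)          # on the order of 1e-31
print(p < 1e-30)  # True
```

This assumes hash outputs behave like uniform random bits, i.e. no
deliberate attacker constructing collisions — exactly the caveat made
above.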

Jeff Shannon

Jul 18 '05 #41
In article <87************ @pobox.com>, jj*@pobox.com (John J. Lee)
wrote:
If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always been significantly less
than m.
Hmm, Patrick's right, David, isn't he?


Yes, I was only considering pairwise comparisons. As he says,
simultaneously comparing all files in a group would avoid repeated reads
without the CPU overhead of a strong hash. Assuming you use a system
that allows you to have enough files open at once...
And I'm not sure what the trade off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).


Another interesting point.
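The group-wise scheme being discussed — read every file in a candidate
group in lockstep and split the group whenever chunks differ — can be
sketched as below. This is an illustrative reconstruction, assuming the
group is small enough to stay under the OS open-file limit:

```python
def split_group_by_content(paths, chunk_size=1 << 16):
    """Compare a group of files in parallel, reading each file only once.

    Returns the sub-groups whose members are byte-for-byte identical.
    Assumes len(paths) stays below the OS open-file limit.
    """
    handles = {p: open(p, "rb") for p in paths}
    try:
        groups = [list(paths)]
        while True:
            next_groups = []
            done = True
            for group in groups:
                if len(group) < 2:
                    continue  # a singleton has no duplicates; drop it
                # Read the next chunk of every file still in this group...
                chunks = {}
                for p in group:
                    chunks.setdefault(handles[p].read(chunk_size), []).append(p)
                # ...and split the group wherever the chunks disagree.
                for chunk, members in chunks.items():
                    next_groups.append(members)
                    if chunk:  # non-empty chunk: not yet at EOF
                        done = False
            groups = next_groups
            if done:  # every surviving group has reached EOF together
                return [g for g in groups if len(g) > 1]
    finally:
        for f in handles.values():
            f.close()
```

No strong hash is computed, and each byte is read at most once per
file, which matches the "at most m reads" claim above; the trade-off is
holding many file handles open and paying extra seeks on spinning
disks.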

--
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
Jul 18 '05 #42
>> I'll post my version in a few days.
Have I missed something?
Where can I see your version?

Claudio
"Xah Lee" <xa*@xahlee.org> wrote in message
news:11**********************@l41g2000cwc.googlegroups.com...
here's a large exercise that uses what we built before. [...]

Jul 18 '05 #43
Sorry i've been busy...

Here's the Perl code. I have yet to clean up the code and make it
compatible with the cleaned spec above. The code as it is performs the
same algorithm as the spec, it just doesn't print the output in that
format. In a few days, i'll post a clean version, and also a Python
version, as well as a sample directory for testing purposes. (The Perl
code has gone through many tests and is considered correct.)

The Perl code comes in 3 files as it is:

Combo114.pm
Genpair114.pm
del_dup.pl

The main program is del_dup.pl. Run it on the command line as per the
spec. If you want to actually delete the dup files, uncomment the
"unlink" line at the bottom. Note: the module names don't have any
significance.

Note: here are also these Python files, ready to go for the final
Python version. Possibly the final program should be just a single
file...

Combo114.py
Genpair114.py

Here are the files: del_dup.zip
-----
to get the code and full detail with latest update, please see:
http://xahlee.org/perl-python/delete_dup_files.html

Xah
xa*@xahlee.org
http://xahlee.org/PageTwo_dir/more.html
Claudio Grondi wrote:
I'll post my version in a few days.

Have I missed something?
Where can I see your version?

Claudio


Jul 18 '05 #44
Why not try NoClone? It finds and deletes duplicate files by true
byte-by-byte comparison. A smart marker filters which duplicate files
to delete. With a GUI.
http://noclone.net
Xah Lee wrote:
here's a large exercise that uses what we built before. [...]


Jul 18 '05 #45

This thread has been closed and replies have been disabled. Please start a new discussion.

