Here's a large exercise that uses what we built before.
Suppose you have tens of thousands of files in various directories.
Some of these files are identical, but you don't know which ones are
identical with which. Write a program that prints out which files are
redundant copies.
Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or
more full paths of directories.
perl del_dup.pl dir1
prints the full paths of all files in dir1 that are duplicates
(including files in sub-directories). More specifically, if file A has
duplicates, A's full path will be printed on a line, immediately
followed by the full paths of all other files that are copies of A.
These duplicates' full paths will be prefixed with the string "rm ".
An empty line follows each group of duplicates.
Here's a sample output.
inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg
inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg
The order does not matter (i.e. which file does not get the "rm "
prefix does not matter).
------------------------
perl del_dup.pl dir1 dir2
will do the same as above, except that duplicates within dir1 or dir2
themselves are not considered. That is, all files in dir1 are compared
to all files in dir2 (including subdirectories), and only files in dir2
will have the "rm " prefix.
One way to understand this is to imagine lots of image files in both
dirs. One is certain that there are no duplicates within each dir
itself (imagine that del_dup.pl has already been run on each). Files in
dir1 have already been categorized into sub-directories by a human, so
that when there are duplicates between dir1 and dir2, one wants the
version in dir2 to be deleted, leaving the organization in dir1 intact.
perl del_dup.pl dir1 dir2 dir3 ...
does the same as above, except that files in later dirs are the ones
given the "rm " prefix. So, if there are these identical files:
dir2/a
dir2/b
dir4/c
dir4/d
then c and d will both have the "rm " prefix for sure (which of the
two files in dir2 gets "rm " does not matter). Note: although dir2
doesn't compare files inside itself, duplicates may still be found
implicitly by indirect comparison, i.e. a==c and b==c, therefore a==b,
even though a and b are never compared directly.
--------------------------
Write a Perl or Python version of the program.
An absolute requirement in this problem is to minimize the number of
comparisons made between files. This is part of the spec.
feel free to write it however you want. I'll post my version in a few
days. http://www.xahlee.org/perl-python/python.html
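For concreteness, here is one way the single-directory case could be
sketched in Python. All names are my own, not from any posted solution;
it compares whole file contents within each size group (correct, but not
minimal in reads), and it omits the multi-directory rule that skips
intra-dir comparisons.

```python
import os
import sys
from collections import defaultdict

def collect_files(dirs):
    """Return (dir_index, full_path) for every file under each given dir."""
    found = []
    for i, d in enumerate(dirs):
        for root, _, names in os.walk(d):
            for name in names:
                found.append((i, os.path.join(root, name)))
    return found

def find_duplicates(dirs):
    """Group files by size, then by content; return groups of identical
    files, each ordered so that files from earlier dirs come first."""
    by_size = defaultdict(list)
    for dir_index, path in collect_files(dirs):
        by_size[os.path.getsize(path)].append((dir_index, path))
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Whole-file read is fine for a sketch, wasteful for huge files.
        by_content = defaultdict(list)
        for dir_index, path in same_size:
            with open(path, 'rb') as f:
                by_content[f.read()].append((dir_index, path))
        for dupes in by_content.values():
            if len(dupes) > 1:
                dupes.sort()  # earlier dirs keep their copy
                groups.append([path for _, path in dupes])
    return groups

if __name__ == '__main__':
    for group in find_duplicates(sys.argv[1:]):
        print(group[0])
        for extra in group[1:]:
            print('rm ' + extra)
        print()
```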
Xah xa*@xahlee.org http://xahlee.org/PageTwo_dir/more.html
Jul 18 '05
On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:

More seriously, the best I can think of that doesn't use a strong slow
hash would be to group files by (file size, cheap hash), then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal, but you still need to check unless you
use a strong hash).
The code I posted in another thread (and provided a link in this one) does
exactly that (a quick hash of the first few K before calculating the whole
file's md5 sum). However, Patrick's code is faster, reading only what's
necessary (he does what I intended to do, but I was too lazy-- I actually
rewrote from scratch one of the first programs I wrote in Python, which
obviously was too amateurish code for me to publish :)
It seems your objections are related to Xah Lee's specifications; I have no
objections to your objections (-:) other than that we are just trying to produce
something of practical value out of an otherwise doomed thread...
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein
<ep******@ics.uci.edu> wrote: In article <42********@news.vo.lu>,
Patrick Useldinger <pu*********@gmail.com> wrote:
> Well, but the spec didn't say efficiency was the primary criterion,
> it said minimizing the number of comparisons was.
That's exactly what my program does.
If you're doing any comparisons at all, you're not minimizing the number of comparisons.
ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known), they will all have to be crunched to the last byte to
eliminate them from possible duplicate-set membership. If there is a-priori knowledge
of highly probable difference locations (e.g., internal date stamp at some offset etc)
then obviously the candidates can be pre-classified into smaller candidate sets by some
simple heuristic probes.
If differences are likely to appear early in a sequential scan, perhaps developing hashes
in parallel would be a good strategy. But I doubt if blindly pre-computing full hashes would be optimal.
Compare to sort|uniq as a sieve. You wouldn't want to read through any files any further than you had to.
Regards,
Bengt Richter
David Eppstein wrote: In article <42********@news.vo.lu>, Patrick
Useldinger <pu*********@gmail.com> wrote:
Well, but the spec didn't say efficiency was the primary
criterion, it said minimizing the number of comparisons was. That's exactly what my program does.
If you're doing any comparisons at all, you're not minimizing the
number of comparisons.
Gah. You two should be lawyers. I don't know why you're arguing about
whatever you are arguing about.
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]
PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].
Hashing method: O(NS) reading time, O(NS) hash calc time
1. d <= S
2. Anyone have a usable hash function that's faster than string
comparison?
3. Anyone have an answer to PU's request for advice where his algorithm
could be improved by using a hash method?
Temporary injunction in favour of PU.
John Machin wrote: Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S]
PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d which is important for small N and large d].
Hashing method: O(NS) reading time, O(NS) hash calc time
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.
Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I myself found 2 possible reasons:
1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."
2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.
-pu
Patrick Useldinger wrote: Just to explain why I appear to be a lawyer: everybody I spoke to about this program told me to use hashes, but nobody has been able to explain why. I myself found 2 possible reasons:
1) it's easier to program: you don't compare several files in parallel, but process one by one. But it's not perfect and you still need to compare afterwards. In the worst case, you end up with 3 files with identical hashes, of which 2 are identical and 1 is not. In order to find this, you'd still have to program the algorithm I use, unless you say "oh well, there's a problem with the hash, go and look yourself."
2) it's probably useful if you compare files over a network and you want to reduce bandwidth. A hash lets you do that at the cost of local CPU and disk usage, which may be OK. That was not my case.
3) Probability is on your side. If two files match in length, but differ
in contents, a good hash will have probability (1 / max_hash) of
having a matching hash. For exactly two files of the same length,
calculating the hash is clearly a waste. For even three files of
the same length, you are most likely to find them distinct in three
comparisons. Using hashes, three file reads and three comparisons
of hash values. Without hashes, six file reads: you must read both
files to do a file comparison, so three comparisons means six file
reads. Naturally, as the group grows beyond three, the difference
grows drastically.
-Scott David Daniels Sc***********@Acm.Org
Scott David Daniels wrote: comparisons. Using hashes, three file reads and three comparisons of hash values. Without hashes, six file reads; you must read both files to do a file comparison, so three comparisons is six files.
That's provided you always compare 2 files at a time. I compare n files
at a time, n being the number of files of the same size. That's quicker
than hashes because I have a fair chance of finding a difference before
the end of files. Otherwise, it's like hashes without computation and
without having to have a second go to *really* compare them.
-pu
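For what it's worth, the n-way, block-at-a-time comparison Patrick
describes could be sketched like this (names are my own, not fdups
code). The point is that no file is read past the block where its
group splits:

```python
def split_by_content(handles, blocksize=65536):
    """handles: open binary file objects, all of equal length.
    Returns lists of handles whose full contents are identical."""
    groups = [list(handles)]
    finished = []
    while groups:
        group = groups.pop()
        blocks = [(h.read(blocksize), h) for h in group]
        if blocks[0][0] == b'':  # equal lengths: all exhaust together
            finished.append([h for _, h in blocks])
            continue
        # Partition the group by this block's contents.
        buckets = {}
        for block, h in blocks:
            buckets.setdefault(block, []).append(h)
        for same in buckets.values():
            if len(same) > 1:
                groups.append(same)      # still candidates: keep reading
            else:
                finished.append(same)    # already unique: stop reading it
    return finished
```

With a group of two files, this degenerates to a plain pairwise compare;
with n files, each block is read once per file and the group shrinks as
soon as any block differs.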
[Patrick Useldinger] Shouldn't you add the additional comparison time that has to be done after hash calculation? Hashes do not give 100% guarantee. If there's a large number of identical hashes, you'd still need to read all of these files to make sure.
Identical hashes for different files? The probability of this happening
should be extremely small, or else, your hash function is not a good one.
I once was over-cautious about relying on hashes only, without actually
comparing files. A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray. So,
my cautious attitude was, for all practical purposes, a waste.
Similar arguments apply, say, for the confidence we may have in
probabilistic algorithms for the determination of number primality.
--
François Pinard http://pinard.progiciels-bpi.ca
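The maths behind that argument is short: by the birthday bound, the
chance that any two of n distinct inputs collide under a b-bit hash is
at most n(n-1)/2 divided by 2^b. A quick sanity check (plain arithmetic,
no claim about any particular hash function's quality):

```python
def collision_bound(n_files, hash_bits=128):
    """Upper bound on the probability that any two of n_files distinct
    inputs share the same value under an ideal hash_bits-bit hash."""
    return n_files * (n_files - 1) / 2 / 2 ** hash_bits

# Even a million distinct files stay far below any plausible rate of
# hardware error during a byte-for-byte comparison.
p = collision_bound(10 ** 6)
```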
François Pinard wrote: Identical hashes for different files? The probability of this happening should be extremely small, or else, your hash function is not a good one.
We're talking about md5, sha1 or similar. They are all known not to be
100% perfect. I agree it's a rare case, but still, why settle on
something "about right" when you can have "right"?
I once was over-cautious about relying on hashes only, without actually comparing files. A friend convinced me, doing maths, that with a good hash function, the probability of a false match was much, much smaller than the probability of my computer returning the wrong answer, despite thorough comparisons, due to some electronic glitch or cosmic ray. So, my cautious attitude was by far, for all practical means, a waste.
It was not my only argument for not using hashes. My algorithm also does
less reads, for example.
-pu
Patrick Useldinger wrote: John Machin wrote:
Just look at the efficiency of processing N files of the same size
S, where they differ after d bytes: [If they don't differ, d = S]
PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d which is important for small N and large d].
Hashing method: O(NS) reading time, O(NS) hash calc time
Shouldn't you add the additional comparison time that has to be done after hash calculation? Hashes do not give 100% guarantee. If there's
a large number of identical hashes, you'd still need to read all of
these files to make sure.
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".
1. You are ahead already even before adding any extra time for
comparison on files with equal hashes.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.
Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around a limit of max=20 open files, years ago on MS-DOS, and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.
Re (a): what evidence do you have?
Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?
[1] Yes, I am aware of the subject of this thread.
Regards,
John
John Machin wrote: Maybe I was wrong: lawyers are noted for irritating precision. You meant to say in your own defence: "If there are *any* number (n >= 2) of identical hashes, you'd still need to *RE*-read and *compare* ...".
Right, that is what I meant.
2. As others have explained, with a decent hash function, the probability of a false positive is vanishingly small. Further, nobody in their right mind [1] would contemplate automatically deleting n-1 out of a bunch of n reportedly duplicate files without further investigation. Duplicate files are usually (in the same directory with different names or in different-but-related directories with the same names) and/or (have a plausible explanation for how they were duplicated) -- the one-in-zillion-chance false-positive should stand out as implausible.
Still, if you can get it 100% right automatically, why would you bother
checking manually? Why fall back on arguments like "impossible",
"implausible", "can't be", if you can have a simple and correct answer -
yes or no?
Anyway, fdups does not do anything else than report duplicates.
Deleting, hardlinking or anything else might be an option depending on
the context in which you use fdups, but then we'd have to discuss the
context. I never assumed any context, in order to keep it as universal
as possible.
Different subject: maximum number of files that can be open at once. I raised this issue with you because I had painful memories of having to work around max=20 years ago on MS-DOS and was aware that this magic number was copied blindly from early Unix. I did tell you that empirically I could get 509 successful opens on Win 2000 [add 3 for stdin/out/err to get a plausible number] -- this seems high enough to me compared to the likely number of files with the same size -- but you might like to consider a fall-back detection method instead of just quitting immediately if you ran out of handles.
For the time being, the additional files will be ignored, and a warning
is issued. fdups does not quit, why are you saying this?
A fallback solution would be to open the file before every _block_ read,
and close it afterwards. In my mind, it would be a command-line option,
because it's difficult to determine the number of available file handles
in a multitasking environment.
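A sketch of that fall-back (the class shape and names are my guesses,
not fdups code): keep only a path and an offset per file, and pay one
open()/close() per block read instead of holding a handle open.

```python
class ReopeningReader:
    """Reads a file block by block without holding its handle open."""

    def __init__(self, path, blocksize=65536):
        self.path = path
        self.blocksize = blocksize
        self.offset = 0

    def next_block(self):
        # One open() per block costs syscalls, but the number of
        # simultaneously open handles stays at one regardless of how
        # many same-size files are being compared.
        with open(self.path, 'rb') as f:
            f.seek(self.offset)
            block = f.read(self.blocksize)
        self.offset += len(block)
        return block
```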
Not difficult to implement, but I first wanted to refactor the code so
that it's a proper class that can be used in other Python programs, as
you also asked. That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I do changes by
(my own) priority.
You wrote at some stage in this thread that (a) this caused problems on Windows and (b) you hadn't had any such problems on Linux.
Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly
less than 500 files of the same length.
Re (b): famous last words! How long would it take you to do a test and announce the margin of safety that you have?
Sorry, I do not understand what you mean by this.
-pu