Bytes | Developer Community

sorting with expensive compares?


Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!

Dec 22 '05 #1

Dan Stromberg wrote:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!


might be simpler to memoize cmp(), look in online cookbook or
something like this decorator

http://mail.python.org/pipermail/pyt...er/303035.html
http://aspn.activestate.com/ASPN/Python/Cookbook/
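A minimal sketch of that memoization idea, in today's Python (the names here are illustrative, not from the cookbook): cache each comparison result keyed by the pair of arguments, so the expensive compare runs at most once per ordered pair.

```python
# Sketch: memoize an expensive cmp-style comparison with a dict cache.
# cmp_values is a cheap stand-in for a genuinely expensive comparison.
import functools

def make_memoized_cmp(expensive_cmp):
    cache = {}
    def cmp_cached(a, b):
        key = (id(a), id(b))
        if key not in cache:
            cache[key] = expensive_cmp(a, b)
        return cache[key]
    return cmp_cached

CALLS = []

def cmp_values(a, b):           # stand-in for an expensive comparison
    CALLS.append((a, b))
    return (a > b) - (a < b)

memo_cmp = make_memoized_cmp(cmp_values)
data = sorted([3, 1, 2], key=functools.cmp_to_key(memo_cmp))
```

Note the cache only pays off if the sort actually asks about the same pair twice; whether it does depends on the sort's internals.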

Dec 23 '05 #2

Dan Stromberg wrote:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!

Sounds like DSU time.

[a] -> [ (hash(a), a) ]
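Spelled out, decorate-sort-undecorate (DSU) pairs each element with a cheap proxy, sorts the pairs, and strips the proxies; including the index as a tie-breaker keeps the sort from ever comparing the large payloads themselves. A sketch, assuming the caller supplies an appropriate proxy function:

```python
# Decorate-sort-undecorate: the sort compares only the cheap proxies
# (the enumerate index breaks ties, so the payloads x are never compared).
def dsu_sort(items, proxy):
    decorated = [(proxy(x), i, x) for i, x in enumerate(items)]
    decorated.sort()
    return [x for _, _, x in decorated]
```

This only produces a correct sort when the proxy preserves the desired ordering of the elements.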

Dec 23 '05 #3

bo****@gmail.com wrote:
Dan Stromberg wrote:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!

Sounds like DSU time.

[a] -> [ (hash(a), a) ]


Aha! OR: take a log of the array, e.g. log base 10 or some other
monotonic transform and permutation order indexes
http://aspn.activestate.com/ASPN/Coo.../Recipe/306862

Dec 23 '05 #4

gene tani wrote:
bo****@gmail.com wrote:
Dan Stromberg wrote:
Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!

Sounds like DSU time.

[a] -> [ (hash(a), a) ]


Aha! OR: take a log of the array, e.g. log base 10 or some other
monotonic transform and permutation order indexes
http://aspn.activestate.com/ASPN/Coo.../Recipe/306862

I may have made a mistake there: hash(a) should be some function that
returns the "order" of a, rather than the built-in hash() function.

Dec 23 '05 #5
Dan Stromberg wrote:
Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?


It's not really practical - if the list is unsorted, it's non-trivial
to determine how many times a given element is duplicated until you've
compared it with everything else. That is roughly an O(N*N/2) operation
whereas sorting typically is O(NlogN). This is why C++'s 'std::unique'
function only operates on sorted lists.

So instead, one alternative would be to use a comparison function that
takes the 2 objects and looks for the pair in a dictionary: if that
pair is not found, perform the normal comparison and store it in the
dictionary for next time, and if it is found, just return it. This way
the actual comparison is only done once for each pair.

Alternatively you might be able to produce a cheap comparison from the
expensive one if you can summarise the information in a simpler form.
Perhaps each sorting criterion can be represented as an integer, and
you can instead sort a list of lists containing integers.
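A sketch of that last idea, with invented record fields for illustration: reduce each expensive criterion to a small integer once per element, then sort on tuples of those integers.

```python
# Sketch: summarise each expensive sort criterion as an integer, then
# sort the cheap tuples.  The record layout here is invented.
records = [
    {"name": "b", "size": 200, "mtime": 5},
    {"name": "a", "size": 100, "mtime": 9},
    {"name": "c", "size": 100, "mtime": 1},
]
records.sort(key=lambda r: (r["size"], r["mtime"]))
order = [r["name"] for r in records]
```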

--
Ben Sizer

Dec 23 '05 #6
bo****@gmail.com wrote:
Dan Stromberg wrote:
Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?


Sounds like DSU time.

[a] -> [ (hash(a), a) ]


This won't work - elements with different hashes will sort by hash and
elements with the same hash will still be compared which is exactly what
the OP is trying to avoid.

If there is some function of the arrays which sorts in the same order as
the natural comparison then that function can be used as a sort key.
sort(arrayList, key=some_proxy_function)
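The key-based form has a further advantage worth making explicit: the proxy is evaluated exactly once per element (N calls) rather than once per comparison (O(N log N) calls). A sketch, with an assumed stand-in proxy:

```python
# key= is computed once per element, so an expensive summary function
# runs N times, not once per comparison.  first_bytes is an assumed
# stand-in for whatever cheap order-preserving summary the data allows.
CALLS = []

def first_bytes(s):
    CALLS.append(s)
    return s[:16]

data = ["banana", "apple", "cherry"]
data.sort(key=first_bytes)
```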

Kent
Dec 23 '05 #7
Kent Johnson <ke**@kentsjohnson.com> writes:
[a] -> [ (hash(a), a) ]


This won't work - elements with different hashes will sort by hash and
elements with the same hash will still be compared which is exactly
what the OP is trying to avoid.


ds = sorted([(hash(c), i) for i,c in enumerate(a)])
dsu = [a[i] for hc,i in ds]

Is that what you mean? I didn't answer the OP because I couldn't
understand the original question. The above brings together clusters
of elements with the same hash, so if the clusters are large you can
finish the sort with relatively few comparisons.
Dec 23 '05 #8
In article <pa****************************@dcs.nac.uci.edu> ,
Dan Stromberg <st******@dcs.nac.uci.edu> wrote:

Python appears to have a good sort method, but when sorting array
elements that are very large, and hence have very expensive compares,
is there some sort of already-available sort function that will merge
like elements into a chain, so that they won't have to be recompared as
many times?


I'll just note that Python's sort is specifically designed to reduce the
number of compares precisely because *all* compares in Python are
relatively expensive. I'll suggest a variant on the previous suggestion
of hash:

[a] -> [hash(a), index(a), a]
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

"Don't listen to schmucks on USENET when making legal decisions. Hire
yourself a competent schmuck." --USENET schmuck (aka Robert Kern)
Dec 23 '05 #9
On Thu, 22 Dec 2005 22:06:42 +0000, Dan Stromberg wrote:

Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!


I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over - I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

Thanks!

def __cmp__(self, other):
    # sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))
    if verbosity >= 1:
        sys.stderr.write('Comparing file_class objects %s and %s\n' %
                         (self.name, other.name))
    if self.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % self.name)
        result = os.stat(self.name)
        self.device = result[stat.ST_DEV]
        self.inode = result[stat.ST_INO]
        self.size = result[stat.ST_SIZE]
    if other.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % other.name)
        result = os.stat(other.name)
        other.device = result[stat.ST_DEV]
        other.inode = result[stat.ST_INO]
        other.size = result[stat.ST_SIZE]
    if self.device == other.device and self.inode == other.inode:
        # same device, same inode, must be identical files
        return 0
    if self.length < other.length:
        return -1
    elif self.length > other.length:
        return 1
    # if we've reached this point, the files are not hardlinks, and their
    # lengths are identical, so slurp in the prefixes if needed, then
    # compare them
    if self.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % self.name)
        file = open(self.name, 'r')
        self.prefix = file.read(self.max_prefix_length)
        file.close()
    if other.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % other.name)
        file = open(other.name, 'r')
        other.prefix = file.read(self.max_prefix_length)
        file.close()
    if self.prefix < other.prefix:
        return -1
    elif self.prefix > other.prefix:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) We haven't yet tried doing a cryptographic hash, so compute them
    #    if needed, and compare them
    if self.hash == '':
        self.gen_hash()
    if other.hash == '':
        other.gen_hash()
    if self.hash < other.hash:
        return -1
    elif self.hash > other.hash:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) The cryptographic hashes are identical
    # 5) All that remains is a "long form" comparison, so bite the bullet
    #    and do the I/O
    if verbosity:
        sys.stderr.write('Doing byte for byte comparison of %s and %s\n' %
                         (self.name, other.name))
    self_fp = open(self.name, 'r')
    other_fp = open(other.name, 'r')
    while 1:
        self_buf = self_fp.read(self.bufsize)
        other_buf = other_fp.read(self.bufsize)
        if self_buf < other_buf:
            self_fp.close()
            other_fp.close()
            return -1
        elif self_buf > other_buf:
            self_fp.close()
            other_fp.close()
            return 1
        if not self_buf and not other_buf:
            self_fp.close()
            other_fp.close()
            return 0
        if not self_buf:
            self_fp.close()
            other_fp.close()
            return -1
        if not other_buf:
            self_fp.close()
            other_fp.close()
            return 1
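One way to cut down the repeated #6 comparisons is to stop comparing files pairwise and instead bucket them by cheap invariants, so each file's contents are read at most once. A sketch (the buffer size is arbitrary, and md5 is used as in the compare function above):

```python
# Sketch: group files into probable-duplicate buckets by (size, md5).
# Files of different sizes can never be equal, so only files sharing a
# size are ever hashed, and each file is read at most once.
import hashlib
import os
from collections import defaultdict

def group_probable_duplicates(paths, bufsize=1 << 16):
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    groups = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) == 1:
            groups[(size, None)] = same_size    # unique size: no hash needed
            continue
        for p in same_size:
            h = hashlib.md5()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(bufsize), b""):
                    h.update(chunk)
            groups[(size, h.hexdigest())].append(p)
    return groups
```

Files landing in the same bucket are identical with overwhelming probability; a final byte-for-byte check within each bucket is still possible if absolute certainty is required.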
Dec 23 '05 #10

Dan Stromberg wrote:
On Thu, 22 Dec 2005 22:06:42 +0000, Dan Stromberg wrote:

Hi folks.

Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?

Thanks!


I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".

I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over - I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

Why would #5 not be enough as an indicator that the files are identical?

Dec 23 '05 #11
Dan Stromberg <st******@dcs.nac.uci.edu> wrote:
...
I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".
OK, very clear.
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file
Makes sense, including the possibility of computing some items on
demand. However, using a key-callable rather than a cmp-callable may
still make sense -- you just need a callable that extracts the
attributes on demand (and caches them thereafter... assuming you have
enough memory to keep all the caches, I guess [6] might be a problem).
I do see that key-callables won't necessarily work easily here, but
since the key-callable's __cmp__ will in turn be called, that might
still work better... but, on reflection, it's probably just a minor
gain.
However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over - I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.


The comparison should never get called twice on the same two objects,
anyway. So, I'm not sure what you think you could gain by optimizing
future comparisons of two objects once you've ascertained they are in
fact equal. Still, assuming for example that self.name is a unique
identifier (i.e. the so-called name is in fact a complete path), the
optimization (memoization) is quite easy to perform. Rename what you
currently have as __cmp__ by another name, say _real_cmp, add a
_compared dict to the class, and code the following __cmp__ ...:

_compared = {}
def __cmp__(self, other):
    try:
        return -self._compared[other.name, self.name]
    except KeyError:
        pass
    key = self.name, other.name
    if key in self._compared:
        return self._compared[key]
    result = self._compared[key] = self._real_cmp(other)
    return result

I would not expect this optimization to matter at all, because no key
should ever be already present in the self._compared dictionary (the
same two files should never, ever get compared twice anyway).

However, it might be possible to extend this idea by using the
properties you know an ordering should have -- if A and B have never
been compared, but both have been compared with C, in some cases you
don't need to compare A and B but may exploit already-known results.
For example, if A==C and B==C, you already know that A==B without any
actual comparison; if A<C and B==C, you already know that A<B; and so
on. Of course in some cases you still must work, e.g. if you know A<C
and B<C that doesn't help. I'm not sure if this would help at all,
either: it's quite possible that the sort algorithm already exploits
these possibilities internally. But, perhaps, it may be worth a try.
Alex
Dec 23 '05 #12
Dan Stromberg wrote:
I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.


Are you really trying to establish an order or do want to eliminate the
duplicates?
File("perfectly_legal.ogg") < File("free_of_charge.mp3")

True

doesn't make that much sense to me, regardless of what the comparison may
actually do.

Peter

Dec 23 '05 #13
bo****@gmail.com wrote:
Dan Stromberg wrote: [...]
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

[...] Why would #5 not be enough as an indicator that the files are identical?

Because it doesn't guarantee that the files are identical. It indicates,
to a very high degree of probability (particularly when the file lengths
are equal), that the two files are the same, but it doesn't guarantee it.

Technically there are an infinite number of inputs that can produce the
same md5 hash.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Dec 23 '05 #14
On Fri, 23 Dec 2005 09:20:55 -0800, bonono wrote:

Dan Stromberg wrote:


[snip]
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over - I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.


Why would #5 not be enough as an indicator that the files are identical?


(1) Because the MD5 algorithm does include collisions. I was going to say
"rare collisions", but there is actually an infinite number of them. The
collisions are just distributed thinly -- because MD5 is a good hash
algorithm, *very* thinly.

(Proof of this comes from the pigeon-hole principle: there is an infinite
number of possible files of arbitrary length, and only a finite number of
possible hashes. Therefore, an infinite number of files must hash to each
possible hash.)

(2) Having said that, unless the original poster is dealing with billions
(plus) of files, it is unlikely that he is finding any of the collisions
unless there is a bug in his sort routine. Since he claims to be doing
more comparisons-by-file-contents than expected (I would suggest *one*
is more than expected), I predict a bug in his code, his algorithm, or
both.
--
Steven.

Dec 23 '05 #15
On Fri, 23 Dec 2005 19:26:11 +0100, Peter Otten wrote:
Dan Stromberg wrote:
I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.


Are you really trying to establish an order or do want to eliminate the
duplicates?
File("perfectly_legal.ogg") < File("free_of_charge.mp3")

True

doesn't make that much sense to me, regardless of what the comparison may
actually do.


If I have understood the poster's algorithm correctly, it gets even
weirder:
Sorted list of files ->

[parrot.ogg, redhat.iso, george.log, fred.log, rhino.ogg, cat.ogg,
debian.iso, sys_restore.iso, adrian.log, fox.ogg, ...]

It seems to this little black duck that by sorting by file contents in
this way, the effect to the human reader is virtually to randomise the
list of file names.

Even if you limit yourself to (say) a set of ogg files, and sort by the
binary contents ->

# album-track
[parrot-6.ogg, rhino-1.ogg, cat-12.ogg, fox-2.ogg, parrot-3.ogg, ...]

most people looking at the list would guess it had been shuffled, not
sorted. So I too don't know what the original poster hopes to accomplish
by sorting on the content of large binary files.

--
Steven.

Dec 23 '05 #16
Dan Stromberg <st******@dcs.nac.uci.edu> writes:
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems


Just don't do any #6 comparisons. If the hashes are identical, the
files are identical--that's the point of a cryptographic hash. OK, to
be pedantic:

1) there's a microscopic chance of an accidental collision, but
it's much lower than the chance of a hardware failure making your
comparison function give the wrong answer.

2) there are known cryptanalytic attacks against md5 that can let
someone deliberately construct two distinct files with the same
hash, with a fair amount of effort. If you care about this, use
sha-1 instead, or better yet, sha-256. (sha-1 has an attack
similar to the md5 attack, but the amount of work required is not
really practical today, unlike the md5 attack).
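In today's Python, swapping in a stronger hash is a one-word change with hashlib; a sketch of hashing a file in fixed-size chunks so that very large files never have to fit in memory (the buffer size is arbitrary):

```python
# Sketch: chunked file hashing via hashlib; pass "md5", "sha1" or
# "sha256" as algo.  Reading in bufsize pieces keeps memory use flat
# regardless of file size.
import hashlib

def file_digest(path, algo="sha256", bufsize=1 << 16):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()
```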
Dec 23 '05 #17
On Fri, 23 Dec 2005 17:10:22 +0000, Dan Stromberg wrote:
I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and "sorting
the strings".
Which is a very strange thing to do, but I'll assume you have a good
reason for doing so.
I've been using the following compare function,
My first suggestion is that you invest a bit of time in building a
small-scale set of data that you can test your sorting on, if you haven't
already done so. On multi-megabyte files, taking the MD5 hash or comparing
the content byte by byte is quite slow. So I would spend a few minutes
populating a directory with a dozen or so fake iso and ogg files, only a
few bytes in length each, and run my tests on them.
which in short checks, in order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the
other file. Also, files that do not have the same length can never be
identical. And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)
My second suggestion is to encapsulate vigorously, at least for testing.
At the moment, you have a single __cmp__ function that does everything
in-line. I would do something like this:

def __cmp__(self, other):
    result = compare_by_device_number_and_inode(self, other)
    if result is None:
        result = compare_by_file_length(self, other)
    if result is None:
        ...
    return result

You can always unroll the function calls later, but for now it helps in
debugging: you only need to think about one type of comparison at a time,
and makes it easier to think about what each function needs to do.

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over - I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.
As others have pointed out, Python's sort never compares the same objects
more than once.

Some more comments follow interleaved with your code:

def __cmp__(self,other):
# sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))
Are you sure you don't want to just sort by file name? That would be much
simpler *wink*
if verbosity >= 1:
sys.stderr.write('Comparing file_class objects %s and %s\n' %
(self.name, other.name))
if self.device == -1:
if verbosity:
sys.stderr.write('Getting stat data for file %s\n' % self.name)
result = os.stat(self.name)
self.device = result[stat.ST_DEV]
self.inode = result[stat.ST_INO]
self.size = result[stat.ST_SIZE]
Since you need to compare stat data for all these objects, why not
simply pre-fetch them when you create the object? That way you can
simplify your __cmp__ function significantly.

[snip]
if self.device == other.device and self.inode == other.inode:
# same device, same inode, must be identical files return 0
At first I thought that my news client had mangled your function, but
going back to the original post, either *your* client has mangled it, or
I've found a bug in your code. "return 0" is commented out.

Is it possible that all the excess file comparisons are being done only
for hard links?

[snip]

if verbosity:
sys.stderr.write('Getting prefix for file %s\n' % self.name)
file = open(self.name, 'r')
self.prefix = file.read(self.max_prefix_length)
file.close()
"prefix" is probably not the best name. Maybe "header" (as in "file
header")?

If the prefix/header is short enough, don't worry about fetching on
demand, at least not while you are still debugging. Just pre-fetch it, get
your code working, and then convert back to fetching the header on demand.

if other.prefix == -1:
if verbosity:
sys.stderr.write('Getting prefix for file %s\n' % other.name)
file = open(other.name, 'r')
other.prefix = file.read(self.max_prefix_length)
file.close()
if self.prefix < other.prefix:
return -1
elif self.prefix > other.prefix:
return 1
# if we've reached this point, we know that:
# 1) The files are not hardlinks of each other
# 2) They have identical sizes
# 3) Their prefixes are identical
# 4) We haven't yet tried doing a cryptographic hash, so compute them
#    if needed, and compare them
if self.hash == '':
self.gen_hash()
if other.hash == '':
other.gen_hash()
Without knowing what gen_hash() does, it is hard to say whether the bug
might lie here. If you are using the tried-and-tested MD5 module, that
should be good, but if you have tried to roll your own, I would seriously
consider the possibility that you have a bug in the gen_hash() method.

if self.hash < other.hash:
return -1
elif self.hash > other.hash:
return 1
# if we've reached this point, we know that:
# 1) The files are not hardlinks of each other
# 2) They have identical sizes
# 3) Their prefixes are identical
# 4) The cryptographic hashes are identical
# 5) All that remains is a "long form" comparison, so bite the bullet
#    and do the I/O
Or just accept that the probability of a collision is so small that you
aren't ever going to find one.

According to this website:

http://www.iusmentis.com/technology/...ttackfaq/hash/

the expected number of random unique files you would need to compare
before finding a single collision in the MD5 hashes is (very roughly)
10**70, or ten billion trillion trillion trillion trillion trillion.

Somehow, I don't think you've got that many legal ogg files *grin*.

If you are finding even a single MD5 collision in a trillion comparisons,
it would be a miracle, or more likely a bug.

if verbosity:
    sys.stderr.write('Doing byte for byte comparison of %s and %s\n' %
                     (self.name, other.name))
self_fp = open(self.name, 'r')
other_fp = open(other.name, 'r')
while 1:
    self_buf = self_fp.read(self.bufsize)
    other_buf = other_fp.read(self.bufsize)
    if self_buf < other_buf:
self_fp.close()
other_fp.close()
return -1
elif self_buf > other_buf:
self_fp.close()
other_fp.close()
return 1
if not self_buf and not other_buf:
self_fp.close()
other_fp.close()
return 0
if not self_buf:
self_fp.close()
other_fp.close()
return -1
if not other_buf:
self_fp.close()
other_fp.close()
return 1


But you've stated that an invariant of the problem is if you get to
the point of comparing file contents, the files must be the same size.
That means that it is impossible for one buffer to be empty when the other
is not. If you find that condition, you should raise an error -- it means
the file has changed size in the time between checking that the lengths
are equal, and reading the file contents into a buffer.

--
Steven.

Dec 24 '05 #18
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
http://www.iusmentis.com/technology/...ttackfaq/hash/

the expected number of random unique files you would need to compare
before finding a single collision in the MD5 hashes is (very roughly)
10**70, or ten billion trillion trillion trillion trillion trillion.


That's not right, the right number is around 2**64 or 2e19. See the
section of that page about the birthday attack. It's still an awful lot.

There are also known ways of deliberately constructing md5 collisions
(i.e. md5 is broken). Whether the OP should care about that depends
on the application.
Dec 24 '05 #19
On Fri, 23 Dec 2005 21:07:57 -0800, Paul Rubin wrote:
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
http://www.iusmentis.com/technology/...ttackfaq/hash/

the expected number of random unique files you would need to compare
before finding a single collision in the MD5 hashes is (very roughly)
10**70, or ten billion trillion trillion trillion trillion trillion.
That's not right, the right number is around 2**64 or 2e19. See the
section of that page about the birthday attack. It's still an awful lot.


Oh poot, a stupid typo! I entered 1e60 in my calculation instead of 1e6,
which is doubly embarrassing because the correct number to use is 1e9:

[quote]
In MD5's case, 2**64 messages need to be tried. This is not a feasible
attack given today's technology. If you could try 1,000,000 messages per
second, it would take 584,942 years to find a collision. A machine that
could try 1,000,000,000 messages per second would take 585 years, on
average.
[end quote]

So the expected number of files (messages) needed to check to find a
collision is:

585*365*24*60*60*1e9 = 1.8e19

or more than ten million million million, which of course equals 2**64.
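The arithmetic can be checked directly:

```python
# 585 years of 10**9 hashes per second lands on the 2**64 birthday
# bound, to within rounding of the quoted 585-year figure.
tries = 585 * 365 * 24 * 60 * 60 * 10**9
assert abs(tries - 2**64) / 2**64 < 0.001
```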

There are also known ways of deliberately constructing md5 collisions
(i.e. md5 is broken). Whether the OP should care about that depends
on the application.


Sure, but I don't think he is deliberately trying to sabotage his own files :-)
--
Steven.

Dec 24 '05 #20
[Steven D'Aprano]
...
As others have pointed out, Python's sort never compares the same objects
more than once.


Others have pointed it out, and it's getting repeated now, but it's
not true. Since I wrote Python's sorting implementation, it's
conceivable that I'm not just making this up ;-)

Here's the shortest counter-example: [10, 30, 20].

The first thing sort() does is compute the length of the longest
natural run (whether increasing or decreasing) starting at index 0.
It compares 10 and 30, sees that the list starts with an increasing
run, then compares 30 and 20 to see whether this run can be extended
to length 3. It can't. Since the list has only 3 elements, it
decides to use a binary insertion sort to move the 3rd element into
place. That requires 2 comparisons, the first of which _again_
compares 30 and 20.

This doesn't bother me, and it would cost more to avoid this than it
could possibly save in general. It's true that the workhorse binary
insertion and merge sorts never compare the same elements more than
once, but there is some potential duplication between those and
comparisons done to identify natural runs.
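Tim's trace is easy to reproduce with an instrumented comparison. This sketch logs every pair that Python's sort compares; the exact sequence is a CPython implementation detail, so treat it as illustrative:

```python
class Noisy:
    """Wrapper that records every comparison the sort performs."""
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        pairs.append((self.value, other.value))
        return self.value < other.value

pairs = []
sorted([Noisy(10), Noisy(30), Noisy(20)])
print(pairs)   # (20, 30) appears twice: once ending the run, once inserting
```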
Dec 24 '05 #21
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
There are also known ways of deliberately constructing md5 collisions
(i.e. md5 is broken). Whether the OP should care about that depends
on the application.


Sure, but I don't think he is deliberately trying to sabotage his own files :-)


He might have downloaded a file created by a saboteur to have the same
md5 as some popular music file, but which contains a subliminal
hypnotic message which will brainwash him if played. Using a stronger
hash, such as sha256, should protect him from this fate.
Dec 24 '05 #22
Tim Peters <ti********@gmail.com> wrote:
[Steven D'Aprano]
...
As others have pointed out, Python's sort never compares the same objects
more than once.
Others have pointed it out, and it's getting repeated now, but it's
not true. Since I wrote Python's sorting implementation, it's
conceivable that I'm not just making this up ;-)

... could possibly save in general. It's true that the workhorse binary
insertion and merge sorts never compare the same elements more than
once, but there is some potential duplication between those and
comparisons done to identify natural runs.


Since I probably was the first one to point out the erroneous factoid in
question, I apologize -- obviously, my memories of the inner workings of
sort were faulty, and didn't consider that potential duplication.

In this case, the tricks I already (though dubiously;-) suggested in
order to avoid any avoidable comparisons might pay for themselves and
then some, if comparisons are indeed extremely costly.
Alex
Dec 24 '05 #23
On Fri, 23 Dec 2005 21:56:44 -0800, Paul Rubin wrote:
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
> There are also known ways of deliberately constructing md5 collisions
> (i.e. md5 is broken). Whether the OP should care about that depends
> on the application.


Sure, but I don't think he is deliberately trying to sabotage his own files :-)


He might have downloaded a file created by a saboteur to have the same
md5 as some popular music file, but which contains a subliminal
hypnotic message which will brainwash him if played. Using a stronger
hash, such as sha256, should protect him from this fate.


But the odds against such a message having the same MD5 as an existing
song on his disk are quite a lot longer than 2**64 to one, unless he has a really,
really large music collection ;) In the case you propose, two files don't
just need to have the same MD5, but they also need to have a whole lot of
other characteristics; both need to be (somewhat) valid MP3s, one needs to
be a piece of music (or other sound) that is somewhat to the target's
liking, and the other needs to be something playable with a subliminal
message the target is likely to respond to.

Calculate-me-odds-on-THAT-<wink>-ly 'yrs,
--
Thomas Wouters <th****@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Dec 25 '05 #24
Thomas Wouters <th****@xs4all.nl> writes:
But the odds against such a message having the same MD5 as an existing
song on his disk are quite a lot longer than 2**64 to one, unless he has a really,
really large music collection ;) In the case you propose, two files don't
just need to have the same MD5, but they also need to have a whole lot of
other characteristics; both need to be (somewhat) valid MP3s, one needs to
be a piece of music (or other sound) that is somewhat to the target's
liking, and the other needs to be something playable with a subliminal
message the target is likely to respond to.


The way the known collision attack works, the saboteur has to
construct both files. However, the attacker does have a fair amount
of control over the content. So he can start an innocent file
circulating, then replace it with a sabotaged file on some network.
A user might possibly somehow end up with both versions.

See: http://www.cits.rub.de/MD5Collisions/ for how that kind of attack
can work.
Dec 25 '05 #25
On Sat, 24 Dec 2005 15:47:17 +1100, Steven D'Aprano wrote:
On Fri, 23 Dec 2005 17:10:22 +0000, Dan Stromberg wrote:
I'm treating each file as a potentially very large string, and "sorting
the strings".


Which is a very strange thing to do, but I'll assume you have a good
reason for doing so.


I believe what the original poster wants to do is eliminate duplicate
content from a collection of ogg/whatever files with different names.
E.g., he has a python script that goes out and collects all the free music
it can find on the web. The same song may appear on many sites under
different names, and he wants only one copy of a given song.

In any case, as others have pointed out, sorting by MD5 is sufficient
except in cases far less probable than hardware failure - and deliberate
collisions. E.g., the RIAA creates collision pairs of MP3 files where one
member carries a freely redistributable license, and the other a "copy
this and we'll sue your ass off" license in an effort to trap the unwary.
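For that use case, hashing each file's content and grouping by digest is enough; here is a minimal sketch (the file names and chunk size are arbitrary choices for the demo):

```python
import hashlib
import os
import tempfile

def find_duplicates(paths):
    """Group files by the MD5 of their content; any group of 2+ is a duplicate set."""
    groups = {}
    for path in paths:
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            # Read in 1 MiB chunks so very large files don't fill memory.
            for chunk in iter(lambda: f.read(1 << 20), b''):
                digest.update(chunk)
        groups.setdefault(digest.hexdigest(), []).append(path)
    return [g for g in groups.values() if len(g) > 1]

# Throwaway demo: two files share content under different names, one differs.
tmpdir = tempfile.mkdtemp()
paths = []
for name, data in [('a.ogg', b'same song'), ('b.ogg', b'same song'),
                   ('c.ogg', b'other song')]:
    p = os.path.join(tmpdir, name)
    with open(p, 'wb') as fh:
        fh.write(data)
    paths.append(p)

dupes = find_duplicates(paths)
```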

--
Stuart D. Gathman <st****@bmsi.com>
Business Management Systems Inc. Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

Dec 28 '05 #26
