Bytes IT Community

Efficient checksum calculating on large files

Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver; ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but on things like images I use shutil's
copyfileobj function, and the hash of a file object is just its
handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?
--
--------------------------------------
Ola Natvig <ol********@infosense.no>
infoSense AS / development
Jul 18 '05 #1
12 Replies


Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]

well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module:

def md5sum(fn):
    import subprocess
    return subprocess.Popen(["md5sum.exe", fn],
                            stdout=subprocess.PIPE).communicate()[0]

import time
t0 = time.time()
print md5sum('test.rml')
t1 = time.time()
print t1-t0

and got

C:\Tmp>md5sum.py
b68e4efa5e5dbca37718414f6020f6ff *test.rml

0.0160000324249
Tried with the original
C:\Tmp>timethis md5sum.exe test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005

b68e4efa5e5dbca37718414f6020f6ff *test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005
TimeThis : End Time : Tue Feb 08 16:12:26 2005
TimeThis : Elapsed Time : 00:00:00.437

C:\Tmp>ls -l test.rml
-rw-rw-rw- 1 user group 996688 Dec 31 09:57 test.rml

C:\Tmp>

--
Robin Becker

Jul 18 '05 #2

Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]


Is there a reason you can't use the sha module? Using a random large file I had
lying around:

import sha
sha.new(file("jdk-1_5_0-linux-i586.rpm").read()).hexdigest()  # loads it all into memory first

If you don't want to load the whole file into memory at once, you can always
call out to the sha1sum utility yourself:

import subprocess
subprocess.Popen(["sha1sum", ".bashrc"],
                 stdout=subprocess.PIPE).communicate()[0].split()[0]

'5c59906733bf780c446ea290646709a14750eaad'
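A middle ground between slurping the whole file and shelling out is to feed the
hash incrementally in fixed-size chunks. A minimal sketch (shown with the newer
hashlib module, which absorbed the md5/sha modules; the helper name and chunk
size are my own choices):

```python
import hashlib

def sha1_file(filename, chunk_size=65536):
    """Hash a file incrementally, never holding more than one chunk in memory."""
    digest = hashlib.sha1()
    with open(filename, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```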
--
Michael Hoffman
Jul 18 '05 #3

Michael Hoffman wrote:
Is there a reason you can't use the sha module?


BTW, I'm using SHA-1 instead of MD5 because of the reported vulnerabilities
in MD5, which may not be important for your application, but I consider it
best to just avoid MD5 entirely in the future.
--
Michael Hoffman
Jul 18 '05 #4

On Tue, 08 Feb 2005 16:13:43 +0000, rumours say that Robin Becker
<ro***@reportlab.com> might have written:
Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]

well md5sum is usable on many systems. I run it on win32 and darwin.


[snip use of some md5sum.exe]

Why not use the md5 module?

The following md5sum.py is in use and tested, but not "failproof":

import sys, os, md5
from glob import glob

for arg in sys.argv[1:]:
    for filename in glob(arg):
        fp = file(filename, "rb")
        md5sum = md5.new()
        while True:
            data = fp.read(65536)
            if not data: break
            md5sum.update(data)
        fp.close()
        print md5sum.hexdigest(), filename

It's fast enough, especially if you cache results.
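The caching idea might be sketched like this for ETag use, keyed on path, size,
and mtime so the digest is only recomputed when the file changes (all names
here are hypothetical, and hashlib stands in for the md5 module):

```python
import hashlib
import os

_etag_cache = {}  # (path, size, mtime) -> hex digest

def etag_for(path, chunk_size=65536):
    """Return a cached MD5-based ETag, recomputing only when the file changes."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime)
    if key not in _etag_cache:
        digest = hashlib.md5()
        with open(path, "rb") as fp:
            while True:
                data = fp.read(chunk_size)
                if not data:
                    break
                digest.update(data)
        _etag_cache[key] = digest.hexdigest()
    return _etag_cache[key]
```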
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #5

Robin Becker wrote:
Does anyone know of a fast way to calculate checksums for a large file.

[snip]

well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module


on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)
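For what it's worth, the map can also be opened read-only, which avoids needing
"r+" write access on the file. A sketch under those assumptions (hashlib
standing in for the md5 module; the helper name is mine, and it assumes a
non-empty file, since mapping zero bytes raises an error):

```python
import hashlib
import mmap
import os

def md5_mmap(fn):
    """Hash a file through a read-only memory map; no write access needed."""
    size = os.path.getsize(fn)  # must be > 0 for the mapping to succeed
    with open(fn, "rb") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            return hashlib.md5(m).hexdigest()
```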

</F>

Jul 18 '05 #6

Ola Natvig <ol********@infosense.no> wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]


Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising since md5sum is
dominated by the CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

I discarded the first run so both tests ran with large_file in the
cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.046s
user 0m0.946s
sys 0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.033s
user 0m0.926s
sys 0m0.108s

$ ls -l large_file
-rw-r--r-- 1 ncw ncw 115933184 Jul 8 2004 large_file
"""
Re-implementation of md5sum in python
"""

import sys
import md5

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    fh = open(filename)
    digest = md5.new()
    while 1:
        buf = fh.read(4096)
        if buf == "":
            break
        digest.update(buf)
    fh.close()
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print "%s %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #7

Nick Craig-Wood <ni**@craig-wood.com> writes:

Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]


Here is an implementation of md5sum in python.

[snip]


Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

But there's a perfect working version in the Python distribution already:
tools/Scripts/md5sum.py

Thomas
Jul 18 '05 #8

On Tue, 8 Feb 2005 17:26:07 +0100, rumours say that "Fredrik Lundh"
<fr*****@pythonware.com> might have written:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")

[snip]

My first reaction was that "r+" should be "r+b"... but then one presumes that an
mmap'ed file does not care about stdio text-binary conventions (on platforms
that matters).
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #9

Fredrik Lundh <fr*****@pythonware.com> wrote:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)


But you won't be able to md5sum a file bigger than about 4 Gb if using
a 32bit processor (like x86) will you? (I don't know how the kernel /
user space VM split works on windows but on linux 3Gb is the maximum
possible size you can mmap.)

$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r-- 1 ncw ncw 8590983168 Feb 9 09:26 z

>>> fn = "z"
>>> import os, md5, mmap
>>> file = open(fn, "rb")
>>> size = os.path.getsize(fn)
>>> size
8590983168L
>>> hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: memory mapped size is too large (limited by C int)


--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #10

Thomas Heller <th*****@python.net> wrote:
Nick Craig-Wood <ni**@craig-wood.com> writes:

Here is an implementation of md5sum in python.

[snip]
Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.


Yes you are correct (good old Windows ;-)
But there's a perfect working version in the Python distribution already:
tools/Scripts/md5sum.py


The above is easier to understand though.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #11

On 09 Feb 2005 10:31:22 GMT, rumours say that Nick Craig-Wood
<ni**@craig-wood.com> might have written:
Fredrik Lundh <fr*****@pythonware.com> wrote:

on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

[snip]


But you won't be able to md5sum a file bigger than about 4 Gb if using
a 32bit processor (like x86) will you? (I don't know how the kernel /
user space VM split works on windows but on linux 3Gb is the maximum
possible size you can mmap.)


Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3GiB, but ICBW :)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #12

Christos TZOTZIOY Georgiou <tz**@sil-tec.gr> wrote:

[snip]

Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3GiB, but ICBW :)


You are certainly right ;-)

However I did want to make the point that while mmap is extremely
attractive for certain things, it does limit you to files < 4 Gb which
is something that people don't always realise.
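One way around that limit, sketched here with names of my own choosing and
hashlib in place of the md5 module, is to try the mmap route first and fall
back to chunked reads when the mapping fails (huge files on 32-bit builds, or
empty files):

```python
import hashlib
import mmap
import os

def md5_best_effort(fn, chunk_size=65536):
    """Try a read-only mmap first; fall back to chunked reads if it fails."""
    with open(fn, "rb") as f:
        size = os.path.getsize(fn)
        try:
            with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
                return hashlib.md5(m).hexdigest()
        except (ValueError, OverflowError, OSError):
            # e.g. empty file, or map too large for the address space;
            # the file position is still 0, so read it in chunks instead
            digest = hashlib.md5()
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                digest.update(data)
            return digest.hexdigest()
```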

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #13
