Bytes | Software Development & Data Engineering Community
Efficient checksum calculation on large files

Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could produce
them. I'm using Python's built-in hash function on dynamically generated
strings (like page content), but for things like images I use shutil's
copyfileobj function, and the hash of a file object is just its
handle's memory address.

Does anyone know of a Python utility I could use, perhaps something
like the md5sum utility on *nix systems?
--
--------------------------------------
Ola Natvig <ol********@infosense.no>
infoSense AS / development
Jul 18 '05 #1
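
One point worth noting before the replies: an ETag does not have to be a
content hash at all. A cheap alternative, sketched below with illustrative
names (not from this thread), derives a weak ETag from the file's size and
modification time:

```python
import os

def make_etag(path):
    # Weak ETag from file metadata -- no hashing, constant time
    # regardless of file size. The W/ prefix marks it as a weak
    # validator in HTTP terms.
    st = os.stat(path)
    return 'W/"%x-%x"' % (int(st.st_mtime), st.st_size)
```

This is roughly what several webservers do for static files: it trades
byte-for-byte accuracy for constant-time generation.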
Ola Natvig wrote:
[snip original question]

well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module

def md5sum(fn):
    import subprocess
    return subprocess.Popen(["md5sum.exe", fn],
                            stdout=subprocess.PIPE).communicate()[0]

import time
t0 = time.time()
print md5sum('test.rml')
t1 = time.time()
print t1-t0

and got

C:\Tmp>md5sum.py
b68e4efa5e5dbca37718414f6020f6ff *test.rml

0.0160000324249
Tried with the original
C:\Tmp>timethis md5sum.exe test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005

b68e4efa5e5dbca37718414f6020f6ff *test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005
TimeThis : End Time : Tue Feb 08 16:12:26 2005
TimeThis : Elapsed Time : 00:00:00.437

C:\Tmp>ls -l test.rml
-rw-rw-rw- 1 user group 996688 Dec 31 09:57 test.rml

C:\Tmp>

--
Robin Becker

Jul 18 '05 #2
Ola Natvig wrote:
[snip original question]


Is there a reason you can't use the sha module? Using a random large file I had
lying around:

import sha
sha.new(file("jdk-1_5_0-linux-i586.rpm").read()).hexdigest() # loads all into memory first

If you don't want to load the whole object into memory at once, you can
always call out to the sha1sum utility yourself:

import subprocess
subprocess.Popen(["sha1sum", ".bashrc"],
                 stdout=subprocess.PIPE).communicate()[0].split()[0]

'5c59906733bf780c446ea290646709a14750eaad'
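
A middle ground between the two: feed the file to the hash incrementally,
so memory use stays constant regardless of file size. Sketched here with
the hashlib module (which has since superseded the md5 and sha modules;
the 64 KB chunk size is an arbitrary choice):

```python
import hashlib

def sha1_file(filename, chunksize=65536):
    # Hash the file in fixed-size chunks; only one chunk is ever
    # resident in memory at a time.
    digest = hashlib.sha1()
    with open(filename, "rb") as fh:
        while True:
            data = fh.read(chunksize)
            if not data:
                break
            digest.update(data)
    return digest.hexdigest()
```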
--
Michael Hoffman
Jul 18 '05 #3
Michael Hoffman wrote:
Is there a reason you can't use the sha module?


BTW, I'm using SHA-1 instead of MD5 because of the reported vulnerabilities
in MD5, which may not be important for your application, but I consider it
best to just avoid MD5 entirely in the future.
--
Michael Hoffman
Jul 18 '05 #4
On Tue, 08 Feb 2005 16:13:43 +0000, rumours say that Robin Becker
<ro***@reportlab.com> might have written:
Ola Natvig wrote:
[snip original question]

well md5sum is usable on many systems. I run it on win32 and darwin.


[snip use of some md5sum.exe]

Why not use the md5 module?

The following md5sum.py is in use and tested, but not "failproof".

import sys, os, md5
from glob import glob

for arg in sys.argv[1:]:
    for filename in glob(arg):
        fp = file(filename, "rb")
        md5sum = md5.new()
        while True:
            data = fp.read(65536)
            if not data: break
            md5sum.update(data)
        fp.close()
        print md5sum.hexdigest(), filename

It's fast enough, especially if you cache results.
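
The caching could be as simple as keying on the file's path, size, and
mtime, so any change to the file invalidates the entry. A hypothetical
sketch (not from this thread; hashlib is used for portability):

```python
import hashlib
import os

_cache = {}

def cached_md5(path):
    """Return the file's md5 hex digest, recomputing only when the
    file's size or mtime has changed since the last call."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime)
    if key not in _cache:
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            # Read in 64 KB chunks until the sentinel b"" (EOF).
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
        _cache[key] = digest.hexdigest()
    return _cache[key]
```

A production version would also bound the cache's size and evict stale
keys for files that change often.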
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #5
Robin Becker wrote:
Does anyone know of a fast way to calculate checksums for a large file. I need a way to generate
ETag keys for a webserver, the ETag of large files are not realy nececary, but it would be nice
if I could do it. I'm using the python hash function on the dynamic generated strings (like in
page content) but on things like images I use the shutil's copyfileobject function and the hash
of a fileobject's hash are it's handlers memmory address.

Does anyone know a python utility which is possible to use, perhaps something like the md5sum
utility on *nix systems.

well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module


on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)
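
A read-only variant of the same idea (sketched with the modern hashlib
module; access=mmap.ACCESS_READ lets the file be opened "rb" instead of
"r+", so no write permission is needed):

```python
import hashlib
import mmap
import os

def md5_mmap(fn):
    # Map the file read-only and hash the mapping in one call; the OS
    # pages the data in as the hash consumes it.
    size = os.path.getsize(fn)
    with open(fn, "rb") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            return hashlib.md5(m).hexdigest()
```

Note that mapping an empty file fails, and, as the thread goes on to
discuss, a 32-bit address space caps the mappable file size.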

</F>

Jul 18 '05 #6
Ola Natvig <ol********@infosense.no> wrote:
[snip original question]


Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

I discarded the first run so both tests ran with large_file in the
cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.046s
user 0m0.946s
sys 0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.033s
user 0m0.926s
sys 0m0.108s

$ ls -l large_file
-rw-r--r-- 1 ncw ncw 115933184 Jul 8 2004 large_file
"""
Re-implementation of md5sum in python
"""

import sys
import md5

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    fh = open(filename)
    digest = md5.new()
    while 1:
        buf = fh.read(4096)
        if buf == "":
            break
        digest.update(buf)
    fh.close()
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print "%s %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #7
Nick Craig-Wood <ni**@craig-wood.com> writes:
Ola Natvig <ol********@infosense.no> wrote:
[snip original question]


[snip python md5sum implementation]


Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

But there's a perfectly good working version in the Python distribution
already: Tools/scripts/md5sum.py

Thomas
Jul 18 '05 #8
On Tue, 8 Feb 2005 17:26:07 +0100, rumours say that "Fredrik Lundh"
<fr*****@pythonware.com> might have written:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")

[snip]

My first reaction was that "r+" should be "r+b"... but then one presumes
that an mmap'ed file does not care about stdio text/binary conventions (on
platforms where that matters).
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #9
Fredrik Lundh <fr*****@pythonware.com> wrote:
[snip md5+mmap example]


But you won't be able to md5sum a file bigger than about 4 GB if using
a 32-bit processor (like x86), will you? (I don't know how the kernel /
user-space VM split works on Windows, but on Linux 3 GB is the maximum
possible size you can mmap.)

$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r-- 1 ncw ncw 8590983168 Feb 9 09:26 z
>>> fn = "z"
>>> import os, md5, mmap
>>> file = open(fn, "rb")
>>> size = os.path.getsize(fn)
>>> size
8590983168L
>>> hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: memory mapped size is too large (limited by C int)


--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #10
Thomas Heller <th*****@python.net> wrote:
Nick Craig-Wood <ni**@craig-wood.com> writes:
Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).
Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.


Yes you are correct (good old Windows ;-)
But there's a perfect working version in the Python distribution already:
tools/Scripts/md5sum.py


The above is easier to understand though.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #11
On 09 Feb 2005 10:31:22 GMT, rumours say that Nick Craig-Wood
<ni**@craig-wood.com> might have written:
Fredrik Lundh <fr*****@pythonware.com> wrote:
[snip md5+mmap example]

But you won't be able to md5sum a file bigger than about 4 Gb if using
a 32bit processor (like x86) will you? (I don't know how the kernel /
user space VM split works on windows but on linux 3Gb is the maximum
possible size you can mmap.)


Indeed... but the context was efficiently calculating checksums for large
files to be /served/ by a webserver. I deduce it's almost certain that the
files won't be larger than 3 GiB, but ICBW :)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #12
Christos TZOTZIOY Georgiou <tz**@sil-tec.gr> wrote:
On 09 Feb 2005 10:31:22 GMT, rumours say that Nick Craig-Wood
<ni**@craig-wood.com> might have written:
But you won't be able to md5sum a file bigger than about 4 Gb if using
a 32bit processor (like x86) will you? (I don't know how the kernel /
user space VM split works on windows but on linux 3Gb is the maximum
possible size you can mmap.)


Indeed... but the context was calculating efficiently checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3GiB, but ICBW :)


You are certainly right ;-)

However, I did want to make the point that while mmap is extremely
attractive for certain things, it does limit you to files < 4 GB, which
is something people don't always realise.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #13
