Bytes IT Community

Efficient checksum calculating on large files

Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver; ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but on things like images I use shutil's
copyfileobj function, and the hash of a file object is just its
handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?
--
--------------------------------------
Ola Natvig <ol********@infosense.no>
infoSense AS / development
Jul 18 '05 #1
12 Replies


Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]

well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module:

def md5sum(fn):
    import subprocess
    return subprocess.Popen(["md5sum.exe", fn],
                            stdout=subprocess.PIPE).communicate()[0]

import time
t0 = time.time()
print md5sum('test.rml')
t1 = time.time()
print t1-t0

and got

C:\Tmp>md5sum.py
b68e4efa5e5dbca37718414f6020f6ff *test.rml

0.0160000324249
Tried with the original
C:\Tmp>timethis md5sum.exe test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005

b68e4efa5e5dbca37718414f6020f6ff *test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005
TimeThis : End Time : Tue Feb 08 16:12:26 2005
TimeThis : Elapsed Time : 00:00:00.437

C:\Tmp>ls -l test.rml
-rw-rw-rw- 1 user group 996688 Dec 31 09:57 test.rml

C:\Tmp>

--
Robin Becker

Jul 18 '05 #2

Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]


Is there a reason you can't use the sha module? Using a random large file I had
lying around:

import sha
sha.new(file("jdk-1_5_0-linux-i586.rpm").read()).hexdigest()  # loads it all into memory first

If you don't want to load the whole file into memory at once, you can always
call out to the sha1sum utility yourself:

import subprocess
subprocess.Popen(["sha1sum", ".bashrc"],
                 stdout=subprocess.PIPE).communicate()[0].split()[0]

'5c59906733bf780c446ea290646709a14750eaad'
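A middle ground between slurping the whole file and shelling out is to feed the
hash incrementally in fixed-size chunks. A minimal sketch (shown with the newer
hashlib module, which absorbed the md5/sha modules; the helper name and chunk
size are my own choices):

```python
import hashlib

def sha1_file(filename, chunk_size=65536):
    """Hash a file incrementally, never holding more than one chunk in memory."""
    digest = hashlib.sha1()
    with open(filename, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```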
--
Michael Hoffman
Jul 18 '05 #3

Michael Hoffman wrote:
Is there a reason you can't use the sha module?


BTW, I'm using SHA-1 instead of MD5 because of the reported vulnerabilities
in MD5, which may not be important for your application, but I consider it
best to just avoid MD5 entirely in the future.
--
Michael Hoffman
Jul 18 '05 #4

On Tue, 08 Feb 2005 16:13:43 +0000, rumours say that Robin Becker
<ro***@reportlab.com> might have written:
Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]

well md5sum is usable on many systems. I run it on win32 and darwin.


[snip use of some md5sum.exe]

Why not use the md5 module?

The following md5sum.py is in use and tested, but not "failproof":

import sys, os, md5
from glob import glob

for arg in sys.argv[1:]:
    for filename in glob(arg):
        fp = file(filename, "rb")
        md5sum = md5.new()
        while True:
            data = fp.read(65536)
            if not data: break
            md5sum.update(data)
        fp.close()
        print md5sum.hexdigest(), filename

It's fast enough, especially if you cache results.
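The caching idea might be sketched like this for ETag use, keyed on path, size,
and mtime so the digest is only recomputed when the file changes (all names
here are hypothetical, and hashlib stands in for the md5 module):

```python
import hashlib
import os

_etag_cache = {}  # (path, size, mtime) -> hex digest

def etag_for(path, chunk_size=65536):
    """Return a cached MD5-based ETag, recomputing only when the file changes."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime)
    if key not in _etag_cache:
        digest = hashlib.md5()
        with open(path, "rb") as fp:
            while True:
                data = fp.read(chunk_size)
                if not data:
                    break
                digest.update(data)
        _etag_cache[key] = digest.hexdigest()
    return _etag_cache[key]
```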
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #5

Robin Becker wrote:
Does anyone know of a fast way to calculate checksums for a large file.

[snip]

well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module


on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)
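For what it's worth, the map can also be opened read-only, which avoids needing
"r+" write access on the file. A sketch under those assumptions (hashlib
standing in for the md5 module; the helper name is mine, and it assumes a
non-empty file, since mapping zero bytes raises an error):

```python
import hashlib
import mmap
import os

def md5_mmap(fn):
    """Hash a file through a read-only memory map; no write access needed."""
    size = os.path.getsize(fn)  # must be > 0 for the mapping to succeed
    with open(fn, "rb") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            return hashlib.md5(m).hexdigest()
```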

</F>

Jul 18 '05 #6

Ola Natvig <ol********@infosense.no> wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]


Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising since md5sum is
dominated by the CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

I discarded the first run so both tests ran with large_file in the
cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.046s
user 0m0.946s
sys 0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.033s
user 0m0.926s
sys 0m0.108s

$ ls -l large_file
-rw-r--r-- 1 ncw ncw 115933184 Jul 8 2004 large_file
"""
Re-implementation of md5sum in python
"""

import sys
import md5

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    fh = open(filename)
    digest = md5.new()
    while 1:
        buf = fh.read(4096)
        if buf == "":
            break
        digest.update(buf)
    fh.close()
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print "%s %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #7

Nick Craig-Wood <ni**@craig-wood.com> writes:

Ola Natvig wrote:

Does anyone know of a fast way to calculate checksums for a large file.

[snip]


Here is an implementation of md5sum in python.

[snip]


Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

But there's a perfect working version in the Python distribution already:
tools/Scripts/md5sum.py

Thomas
Jul 18 '05 #8

On Tue, 8 Feb 2005 17:26:07 +0100, rumours say that "Fredrik Lundh"
<fr*****@pythonware.com> might have written:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")

[snip]

My first reaction was that "r+" should be "r+b"... but then one presumes that an
mmap'ed file does not care about stdio text-binary conventions (on platforms
that matters).
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #9

Fredrik Lundh <fr*****@pythonware.com> wrote:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)


But you won't be able to md5sum a file bigger than about 4 Gb if using
a 32bit processor (like x86) will you? (I don't know how the kernel /
user space VM split works on windows but on linux 3Gb is the maximum
possible size you can mmap.)

$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r-- 1 ncw ncw 8590983168 Feb 9 09:26 z

>>> fn = "z"
>>> import os, md5, mmap
>>> file = open(fn, "rb")
>>> size = os.path.getsize(fn)
>>> size
8590983168L
>>> hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: memory mapped size is too large (limited by C int)


--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #10

Thomas Heller <th*****@python.net> wrote:
Nick Craig-Wood <ni**@craig-wood.com> writes:

Here is an implementation of md5sum in python.

[snip]
Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.


Yes you are correct (good old Windows ;-)
But there's a perfect working version in the Python distribution already:
tools/Scripts/md5sum.py


The above is easier to understand though.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #11

On 09 Feb 2005 10:31:22 GMT, rumours say that Nick Craig-Wood
<ni**@craig-wood.com> might have written:
Fredrik Lundh <fr*****@pythonware.com> wrote:

on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

[snip]


But you won't be able to md5sum a file bigger than about 4 Gb if using
a 32bit processor (like x86) will you? (I don't know how the kernel /
user space VM split works on windows but on linux 3Gb is the maximum
possible size you can mmap.)


Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3GiB, but ICBW :)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #12

Christos TZOTZIOY Georgiou <tz**@sil-tec.gr> wrote:

[snip]

Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3GiB, but ICBW :)


You are certainly right ;-)

However I did want to make the point that while mmap is extremely
attractive for certain things, it does limit you to files < 4 Gb which
is something that people don't always realise.
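One way around that limit, sketched here with names of my own choosing and
hashlib in place of the md5 module, is to try the mmap route first and fall
back to chunked reads when the mapping fails (huge files on 32-bit builds, or
empty files):

```python
import hashlib
import mmap
import os

def md5_best_effort(fn, chunk_size=65536):
    """Try a read-only mmap first; fall back to chunked reads if it fails."""
    with open(fn, "rb") as f:
        size = os.path.getsize(fn)
        try:
            with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
                return hashlib.md5(m).hexdigest()
        except (ValueError, OverflowError, OSError):
            # e.g. empty file, or map too large for the address space;
            # the file position is still 0, so read it in chunks instead
            digest = hashlib.md5()
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                digest.update(data)
            return digest.hexdigest()
```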

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #13
