Fatest standard way to sum bytes (and their squares)?

For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

    ordinalSum = sum(ord(x) for x in data)
    ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Whoever named it necking was a poor judge of anatomy.
-- Groucho Marx
Aug 12 '07 #1
And of course I meant fastest, not "fatest." Fattest wouldn't be good,
either.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Whoever named it necking was a poor judge of anatomy.
-- Groucho Marx
Aug 12 '07 #2
Erik Max Francis <ma*@alcyone.com> writes:

> So far the fastest way I've found is using the `sum` builtin and
> generators::
>
>     ordinalSum = sum(ord(x) for x in data)
>     ordinalSumSquared = sum(ord(x)**2 for x in data)
>
> This is about twice as fast as an explicit loop, but since it's going
> to be processing massive amounts of data, the faster the better.
Is this any faster?

ordSum, ordSumSq = (lambda c: (c.real, c.imag))(
    sum(complex(ord(x), ord(x) ** 2) for x in data))
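
A quick sanity check of the single-pass trick against the two separate sums
(untested sketch; both results come back as floats, which are exact only
while the sums stay below 2**53):

data = "spam"
ordSum, ordSumSq = (lambda c: (c.real, c.imag))(
    sum(complex(ord(x), ord(x) ** 2) for x in data))
assert ordSum == sum(ord(x) for x in data)
assert ordSumSq == sum(ord(x) ** 2 for x in data)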

'as
Aug 12 '07 #3
Erik Max Francis wrote:
> So far the fastest way I've found is using the `sum` builtin and
> generators::
>
>     ordinalSum = sum(ord(x) for x in data)
>     ordinalSumSquared = sum(ord(x)**2 for x in data)
>
> This is about twice as fast as an explicit loop, but since it's going
> to be processing massive amounts of data, the faster the better.
Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))

def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

Your actual code may be even faster because you can read the bytes directly
from the file.
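
Something along these lines, for instance (untested sketch; `file_sums` is
just an illustrative name, and it assumes the whole file fits in memory):

import array

squares = [i**2 for i in range(256)]

def file_sums(path, squares=squares):
    # Read the whole file as unsigned bytes, then sum them and their squares.
    a = array.array("B")
    f = open(path, "rb")
    try:
        a.fromstring(f.read())  # array.fromfile() would also work here
    finally:
        f.close()
    return sum(a), sum(squares[x] for x in a)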

Peter
Aug 12 '07 #4
Peter Otten wrote:
> Two ideas:
>
> Use a lookup-table for ord(c)**2
> Use array.array()
>
> lookup = dict((i, i**2) for i in range(256))
Oops, make that a list:

lookup = [i**2 for i in range(256)]
> $ python -m timeit -s'from summit import summit_array as summit' 'summit()'
> 100 loops, best of 3: 13.4 msec per loop

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

Not a big gain, but still...
Peter
Aug 12 '07 #5
Erik Max Francis <ma*@alcyone.com> wrote:

> This is about twice as fast as an explicit loop, but since it's going
> to be processing massive amounts of data, the faster the better. Are
> there any tricks I'm not thinking of, or perhaps helper functions in
> other modules that I'm not thinking of?
I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
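
A related single-pass variant, untimed, is to let str.count() do the byte
scanning in C, once per possible byte value (256 scans of the data, but each
one runs at C speed):

counts = [data.count(chr(i)) for i in range(256)]
ordinalSum = sum(i * counts[i] for i in range(256))
ordinalSumSquared = sum(i * i * counts[i] for i in range(256))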
Aug 12 '07 #6
Erik Max Francis <ma*@alcyone.com> writes:

> So far the fastest way I've found is using the `sum` builtin and
> generators::
>
>     ordinalSum = sum(ord(x) for x in data)
>     ordinalSumSquared = sum(ord(x)**2 for x in data)
For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
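
One way to keep the squared sum free of Python-level loops too is to chain
imap through a precomputed table (untimed sketch, assuming `data` is a byte
string):

from itertools import imap

squares = [i**2 for i in range(256)]
ordinalSum = sum(imap(ord, data))
# squares.__getitem__ looks up squares[ord(c)] for each byte, all in C.
ordinalSumSquared = sum(imap(squares.__getitem__, imap(ord, data)))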
Aug 12 '07 #7
Hrvoje Niksic wrote:
> For ordinalSum, using imap is almost twice as fast:
>
> $ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
> 10000 loops, best of 3: 92.4 usec per loop
> $ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
> 10000 loops, best of 3: 55.4 usec per loop

You're timing data that is a list of one-character strings, rather than a
single string, which is the format the data actually comes in. The imap
optimization doesn't appear to work quite as dramatically well for me with
strings instead of lists, but it certainly is an improvement.

> Of course, that optimization doesn't work for the squared sum; using a
> lambda only pessimizes it.
Right.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Experience is the name everyone gives to their mistakes.
-- Oscar Wilde
Aug 13 '07 #8
Duncan Booth wrote:
I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
count[chr(i)] = 0
for x in data:
count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
This approach is definitely faster than using a generator in my tests,
and was second fastest overall. I'm actually a bit surprised that it's
as fast as it is; it never would have occurred to me to try it this way.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
It is morale that wins the victory.
-- Gen. George C. Marshall
Aug 13 '07 #9
Erik Max Francis <ma*@alcyone.com> writes:

> Hrvoje Niksic wrote:
>> For ordinalSum, using imap is almost twice as fast:
>>
>> $ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
>> 10000 loops, best of 3: 92.4 usec per loop
>> $ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
>> 10000 loops, best of 3: 55.4 usec per loop
>
> You're timing data that is a list of one-character strings, rather than
> a single string, which is the format the data actually comes in. The
> imap optimization doesn't appear to work quite as dramatically well for
> me with strings instead of lists, but it certainly is an improvement.

I wouldn't expect to see any difference between strings and lists. In this
simple test I get approximately the same ~1.7x speedup:

$ python -m timeit 'sum(ord(x) for x in "abcdefghijklmnopqrstuvwxyz")'
100000 loops, best of 3: 12.7 usec per loop
$ python -m timeit -s 'from itertools import imap' 'sum(imap(ord, "abcdefghijklmnopqrstuvwxyz"))'
100000 loops, best of 3: 7.42 usec per loop
Aug 13 '07 #10
On Sun, 12 Aug 2007 02:26:59 -0700, Erik Max Francis wrote:
> This is about twice as fast as an explicit loop, but since it's going
> to be processing massive amounts of data, the faster the better. Are
> there any tricks I'm not thinking of, or perhaps helper functions in
> other modules that I'm not thinking of?
I see a lot of messages attacking the CPU optimization, but what about the
I/O optimization, which, admittedly, the question seems to sidestep?

You might experiment with using mmap() instead of read()... If it helps,
it may help big, because the I/O time is likely to dominate the CPU time.
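
For instance (untested sketch; `sums_via_mmap` is just an illustrative name,
and mapping with length 0 means "the whole file", which raises an error for
empty files):

import mmap
from itertools import imap

squares = [i**2 for i in range(256)]

def sums_via_mmap(path, squares=squares):
    # Let the OS page the file in rather than copying it with read().
    f = open(path, "rb")
    try:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            data = m[:]  # a full slice yields an ordinary byte string
            return sum(imap(ord, data)), sum(squares[ord(c)] for c in data)
        finally:
            m.close()
    finally:
        f.close()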
Aug 19 '07 #11
