For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.
So far the fastest way I've found is using the `sum` builtin and
generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Whoever named it necking was a poor judge of anatomy.
-- Groucho Marx
And of course I meant fastest, not "fatest." Fattest wouldn't be good,
either.
--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Whoever named it necking was a poor judge of anatomy.
-- Groucho Marx
Erik Max Francis <ma*@alcyone.com> writes:
For a file hashing system (finding similar files, rather than identical ones),
I need to be able to efficiently and quickly sum the ordinals of the bytes of
a file and their squares. Because of the nature of the application, it's a
requirement that I do it in Python, or only with standard library modules (if
such facilities exist) that might assist.
So far the fastest way I've found is using the `sum` builtin and generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
This is about twice as fast as an explicit loop, but since it's going to be
processing massive amounts of data, the faster the better. Are there any
tricks I'm not thinking of, or perhaps helper functions in other modules that
I'm not thinking of?
Is this any faster?
# real part accumulates the ordinals, imaginary part their squares
ordSum, ordSumSq = (lambda c: (c.real, c.imag))(
    sum(complex(ord(x), ord(x)**2) for x in data))
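For what it's worth, a quick sanity check of that one-pass idea against the
two separate sums (the test string is made up; note that the results come
back as floats, so exactness for really huge inputs is something to verify):
data = "dere gewizzede bizzede bizz" * 1000 + chr(255)
c = sum(complex(ord(x), ord(x)**2) for x in data)
assert (int(c.real), int(c.imag)) == (sum(ord(x) for x in data),
                                      sum(ord(x)**2 for x in data))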
'as
Erik Max Francis wrote:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.
So far the fastest way I've found is using the `sum` builtin and
generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
Two ideas:
Use a lookup-table for ord(c)**2
Use array.array()
$ cat summit.py
import array
data = "dere gewizzede bizzede bizz" * 1000 + chr(255)
lookup = dict((i, i**2) for i in range(256))
def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()
$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop
Your actual code may be even faster because you can read the bytes directly
from the file.
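For instance, a rough sketch of what that might look like, reading the file
in chunks straight into an array (the function name, path argument and chunk
size here are made up; Python 2 style, like the rest of the thread):
import array

lookup = dict((i, i**2) for i in range(256))

def summit_file(path, chunksize=1 << 16, lookup=lookup):
    # Accumulate both sums chunk by chunk instead of slurping the whole file.
    total = totalsq = 0
    f = open(path, "rb")
    try:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            a = array.array("B")
            a.fromstring(chunk)   # bytes become small ints, so no ord() calls
            total += sum(a)
            totalsq += sum(lookup[x] for x in a)
    finally:
        f.close()
    return total, totalsq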
Peter
Peter Otten wrote:
Erik Max Francis wrote:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.
So far the fastest way I've found is using the `sum` builtin and generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
This is about twice as fast as an explicit loop, but since it's going to be
processing massive amounts of data, the faster the better. Are there any
tricks I'm not thinking of, or perhaps helper functions in other modules that
I'm not thinking of?
Two ideas:
Use a lookup-table for ord(c)**2
Use array.array()
$ cat summit.py
import array
data = "dere gewizzede bizzede bizz" * 1000 + chr(255)
lookup = dict((i, i**2) for i in range(256))
Oops, make that a list:
lookup = [i**2 for i in range(256)]
def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()
$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop
Not a big gain, but still...
Your actual code may be even faster because you can read the bytes
directly from the file.
Peter
Erik Max Francis <ma*@alcyone.com> wrote:
So far the fastest way I've found is using the `sum` builtin and
generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
I haven't timed it, but at the very least I'd avoid going through all the
data twice:
count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1
ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
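A variant sketch of the same single-pass counting (not necessarily faster,
since it calls ord() inside the loop), indexing a plain 256-entry list by
ordinal instead of a dict keyed by character; the test string is borrowed
from Peter's summit.py example, and the asserts check it against the direct
sums:
data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

# One pass over the data builds a histogram of byte values ...
count = [0] * 256
for x in data:
    count[ord(x)] += 1

# ... then 256 iterations turn the histogram into both sums.
ordinalSum = sum(i * count[i] for i in range(256))
ordinalSumSquared = sum(i**2 * count[i] for i in range(256))

assert ordinalSum == sum(ord(x) for x in data)
assert ordinalSumSquared == sum(ord(x)**2 for x in data)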
Erik Max Francis <ma*@alcyone.com> writes:
So far the fastest way I've found is using the `sum` builtin and
generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
For ordinalSum, using imap is almost twice as fast:
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop
Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
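One hedged possibility for the squared sum is to combine the imap idea with
a precomputed lookup list instead of a lambda; whether it actually beats the
plain generator expression would need timing (test data is made up, Python 2
as elsewhere in the thread):
from itertools import imap

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)
squares = [i**2 for i in range(256)]

# ord() gives each character's ordinal and squares.__getitem__ looks up its
# precomputed square, so no Python-level lambda is called per byte.
ordinalSumSquared = sum(imap(squares.__getitem__, imap(ord, data)))

assert ordinalSumSquared == sum(ord(x)**2 for x in data)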
Hrvoje Niksic wrote:
For ordinalSum, using imap is almost twice as fast:
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop
You're using data which is a list of single-character strings, rather than
a string itself, which is the format the data is actually in. The imap
optimization doesn't appear to work quite as dramatically well for me with
strings instead of lists, but it certainly is an improvement.
Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
Right.
--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Experience is the name everyone gives to their mistakes.
-- Oscar Wilde
Duncan Booth wrote:
I haven't timed it, but at the very least I'd avoid going through all the
data twice:
count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1
ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
This approach is definitely faster than using a generator in my tests,
and was second fastest overall. I'm actually a bit surprised that it's
as fast as it is; it never would have occurred to me to try it this way.
--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
It is morale that wins the victory.
-- Gen. George C. Marshall
Erik Max Francis <ma*@alcyone.com> writes:
Hrvoje Niksic wrote:
For ordinalSum, using imap is almost twice as fast:
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop
You're using data which is a list of single-character strings, rather
than a string itself, which is the format the data is actually in. The
imap optimization doesn't appear to work quite as dramatically well for
me with strings instead of lists, but it certainly is an improvement.
I wouldn't expect to see any difference between strings and lists. In this
simple test I get approximately the same ~1.7x speedup:
$ python -m timeit 'sum(ord(x) for x in "abcdefghijklmnopqrstuvwxyz")'
100000 loops, best of 3: 12.7 usec per loop
$ python -m timeit -s 'from itertools import imap' 'sum(imap(ord, "abcdefghijklmnopqrstuvwxyz"))'
100000 loops, best of 3: 7.42 usec per loop
On Sun, 12 Aug 2007 02:26:59 -0700, Erik Max Francis wrote:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.
So far the fastest way I've found is using the `sum` builtin and
generators::
ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
I see a lot of messages attacking the CPU optimization, but what about the
I/O optimization - which, admittedly, the question seems to sidestep.
You might experiment with using mmap() instead of read()... If it helps,
it may help big, because the I/O time is likely to dominate the CPU time.
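A hedged sketch of that idea, mapping the file read-only and summing over
fixed-size slices of the mapping (the function name and chunk size are
illustrative; Python 2 style, matching the thread):
import mmap

def mapped_sums(path, chunksize=1 << 16):
    f = open(path, "rb")
    try:
        # Length 0 maps the whole file; slices of the mapping are plain strings.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            total = totalsq = 0
            for start in xrange(0, len(m), chunksize):
                chunk = m[start:start + chunksize]
                total += sum(ord(x) for x in chunk)
                totalsq += sum(ord(x)**2 for x in chunk)
            return total, totalsq
        finally:
            m.close()
    finally:
        f.close()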