473,320 Members | 1,846 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

Fatest standard way to sum bytes (and their squares)?

For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Whoever named it necking was a poor judge of anatomy.
-- Groucho Marx
Aug 12 '07 #1
10 5180
And of course I meant fastest, not "fatest." Fattest wouldn't be good,
either.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Whoever named it necking was a poor judge of anatomy.
-- Groucho Marx
Aug 12 '07 #2
Erik Max Francis <ma*@alcyone.comwrites:
For a file hashing system (finding similar files, rather than identical ones),
I need to be able to efficiently and quickly sum the ordinals of the bytes of
a file and their squares. Because of the nature of the application, it's a
requirement that I do it in Python, or only with standard library modules (if
such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to be
processing massive amounts of data, the faster the better. Are there any
tricks I'm not thinking of, or perhaps helper functions in other modules that
I'm not thinking of?
Is this any faster?

ordSum, orsSumSq = (lambda c:c.real,c.imag)(sum(complex(ord(x),ord(x)<<1)
for x in data))

'as
Aug 12 '07 #3
Erik Max Francis wrote:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))

def summit_str(data=data):
return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
a = array.array("B")
a.fromstring(data)
return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

Your actual code may be even faster because you can read the bytes directly
from the file.

Peter
Aug 12 '07 #4
Peter Otten wrote:
Erik Max Francis wrote:
>For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))
Oops, make that a list:

lookup = [i**2 for i in range(256)]
def summit_str(data=data):
return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
a = array.array("B")
a.fromstring(data)
return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit'
'summit()' 100 loops, best of 3: 13.4 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

Not a big gain, but still...
Your actual code may be even faster because you can read the bytes
directly from the file.
Peter
Aug 12 '07 #5
Erik Max Francis <ma*@alcyone.comwrote:
So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
count[chr(i)] = 0
for x in data:
count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
Aug 12 '07 #6
Erik Max Francis <ma*@alcyone.comwrites:
So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)
For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
Aug 12 '07 #7
Hrvoje Niksic wrote:
For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop
You're using data which is a list of chars (strings), rather than a
string itself, which is what the format is in. The imap optimization
doesn't appear to work quite as dramatically well for me with strings
instead of lists, but it certainly is an improvement.
Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
Right.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
Experience is the name everyone gives to their mistakes.
-- Oscar Wilde
Aug 13 '07 #8
Duncan Booth wrote:
I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
count[chr(i)] = 0
for x in data:
count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
This approach is definitely faster than using a generator in my tests,
and was second fastest overall. I'm actually a bit surprised that it's
as fast as it is; it never would have occurred to me to try it this way.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
It is morale that wins the victory.
-- Gen. George C. Marshall
Aug 13 '07 #9
Erik Max Francis <ma*@alcyone.comwrites:
Hrvoje Niksic wrote:
>For ordinalSum, using imap is almost twice as fast:
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]'
'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

You're using data which is a list of chars (strings), rather than a
string itself, which is what the format is in. The imap
optimization doesn't appear to work quite as dramatically well for
me with strings instead of lists, but it certainly is an
improvement.
I wouldn't expect to see any difference in strings and lists. In this
simple test I get approximately the same ~1.7x speedup:

$ python -m timeit 'sum(ord(x) for x in "abcdefghijklmnopqrstuvwxyz")'
100000 loops, best of 3: 12.7 usec per loop
$ python -m timeit -s 'from itertools import imap' 'sum(imap(ord, "abcdefghijklmnopqrstuvwxyz"))'
100000 loops, best of 3: 7.42 usec per loop
Aug 13 '07 #10
On Sun, 12 Aug 2007 02:26:59 -0700, Erik Max Francis wrote:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
I see a lot of messages attacking the CPU optimization, but what about the
I/O optimization - which admittedly, the question seems to sidestep.

You might experiment with using mmap() instead of read()... If it helps,
it may help big, because the I/O time is likely to dominate the CPU time.
Aug 19 '07 #11

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

99
by: David MacQuigg | last post by:
I'm not getting any feedback on the most important benefit in my proposed "Ideas for Python 3" thread - the unification of methods and functions. Perhaps it was buried among too many other less...
7
by: Steven T. Hatton | last post by:
I haven't looked very closely, but from what I'm seeing, it looks like there are no buffered I/O streams in the Standard Library. There are stream buffers, but not buffered streams. I don't have...
1
by: John Bailo | last post by:
For some reason, blue squares with rounded corners started appearing on the left side of VS.NET -- in the same place were the brown circles that represent breakpoints are. What do these...
1
by: Reiiiii | last post by:
Not sure if this fits under this topic, sorry if it does not. Anyways, everytime I try to see special characters such as "♥金沙巧克力專賣 : 我熱愛的馬來西亞"(No clue what it says, my friend gave it to me as an...
1
by: jyohere | last post by:
I have a problem to solve.I want to draw squares and lines connecting them.The number of squares is only known at runtime.How can it be done.if someone has some code like this,pls help.I am able to...
41
by: c | last post by:
Hi every one, Me and my Cousin were talking about C and C#, I love C and he loves C#..and were talking C is ...blah blah...C# is Blah Blah ...etc and then we decided to write a program that...
0
by: Sebarry | last post by:
Hi, I have the following XML file which looks fine rendered in a browser but when its read by a SWF as a news ticker I get all these weird squares appearing. Any ideas? <?xml version="1.0"?>...
2
jwwicks
by: jwwicks | last post by:
C/C++ Programs and Debugging in Linux This tutorial will give you a basic idea how to debug a program in Linux using GDB. As you are aware Visual Studio doesn’t run on Linux so you have to use...
4
by: Evelien | last post by:
Dear python-users, I am trying to do a non-linear least squares fitting. Maybe trying is not the best word, as I already succeeded in that. At the moment I am using leastSquaresFit from...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
0
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.