Feature suggestion: sum() ought to use a compensated summation algorithm


I did the following calculation: Generated a list of a million random
numbers between 0 and 1, constructed a new list by subtracting the mean
value from each number, and then calculated the mean again.

The result should be 0, but of course it will differ from 0 slightly
because of rounding errors.

However, I noticed that the simple Python program below gives a result
of ~ 10^-14, while an equivalent Mathematica program (also using double
precision) gives a result of ~ 10^-17, i.e. three orders of magnitude
more precise.

Here's the program (pardon my style, I'm a newbie/occasional user):

from random import random

data = [random() for x in xrange(1000000)]

mean = sum(data)/len(data)
print sum(x - mean for x in data)/len(data)

A little research shows that Mathematica uses a "compensated summation"
algorithm. Indeed, using the algorithm described at
http://en.wikipedia.org/wiki/Kahan_summation_algorithm
gives us a result around ~ 10^-17:

def compSum(arr):
    s = 0.0  # running sum
    c = 0.0  # compensation: low-order bits lost in previous additions
    for x in arr:
        y = x - c        # correct the next term by the running error
        t = s + y        # add it; low-order bits of y may be lost here
        c = (t - s) - y  # (t - s) recovers what was actually added
        s = t
    return s

mean = compSum(data)/len(data)
print compSum(x - mean for x in data)/len(data)
I thought that it would be very nice if the built-in sum() function used
this algorithm by default. Has this been brought up before? Would this
have any disadvantages (apart from a slight performance impact, but
Python is a high-level language anyway ...)?

Szabolcs Horvát
Jun 27 '08 #1
Szabolcs Horvát <sz******@gmail.com> writes:

[...]
A little research shows that Mathematica uses a "compensated
summation" algorithm. Indeed, using the algorithm described at
http://en.wikipedia.org/wiki/Kahan_summation_algorithm
gives us a result around ~ 10^-17:

def compSum(arr):
    s = 0.0
    c = 0.0
    for x in arr:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

mean = compSum(data)/len(data)
print compSum(x - mean for x in data)/len(data)
I thought that it would be very nice if the built-in sum() function
used this algorithm by default. Has this been brought up before?
Would this have any disadvantages (apart from a slight performance
impact, but Python is a high-level language anyway ...)?

Szabolcs Horvát
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.

--
Arnaud
Jun 27 '08 #2
On Sat, 03 May 2008 18:50:34 +0200, Szabolcs Horvát wrote:
[...]

I thought that it would be very nice if the built-in sum() function used
this algorithm by default. Has this been brought up before? Would this
have any disadvantages (apart from a slight performance impact, but
Python is a high-level language anyway ...)?
Built-in sum should work with everything, not just floats. I think it
would be a useful addition to the standard math module.

If you know C, you could write a patch to mathmodule.c and submit it to
the Python devs.

--
Ivan
Jun 27 '08 #3
Szabolcs Horvát <sz******@gmail.com> wrote:
[...]
A little research shows that Mathematica uses a "compensated summation"
algorithm. Indeed, using the algorithm described at
http://en.wikipedia.org/wiki/Kahan_summation_algorithm
gives us a result around ~ 10^-17:
I was taught in my numerical methods course (about 20 years ago now!)
that the way to do this sum with the most accuracy is to sum from the
smallest magnitude to the largest magnitude.

Eg:

>>> from random import random
>>> data = [random() for x in xrange(1000000)]
>>> mean = sum(data)/len(data)
>>> print sum(x - mean for x in data)/len(data)
1.64152134108e-14
>>> mean = sum(sorted(data))/len(data)
>>> print sum(x - mean for x in data)/len(data)
5.86725534824e-15
It isn't as good as the compensated sum but it is easy!
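The data here is all positive, so plain sorted() already orders by
magnitude; with mixed signs, "smallest magnitude first" means sorting by
absolute value. A minimal sketch of that variant, with a made-up helper
name:

def magnitude_sorted_sum(arr):
    # Accumulate small-magnitude terms first so they are not swamped
    # by a large running total.
    return sum(sorted(arr, key=abs))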
I thought that it would be very nice if the built-in sum() function used
this algorithm by default. Has this been brought up before? Would this
have any disadvantages (apart from a slight performance impact, but
Python is a high-level language anyway ...)?
sum() gets used for any numerical types not just floats...

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jun 27 '08 #4
Arnaud Delobelle wrote:
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.
This occurred to me also, but then I tried

sum(['abc', 'efg'], '')

and it did not work. Or is this just a special exception to prevent the
misuse of sum to join strings? (As I said, I'm only an occasional user.)

Generally, sum() seems to be most useful for numeric types (i.e. those
that form a group with respect to __add__ and __neg__/__sub__), which
may be either exact (e.g. integers) or inexact (e.g. floating point
types). For exact types it does not make sense to use compensated
summation (though it wouldn't give an incorrect answer, either), and
sum() cannot decide whether a user-defined type is exact or inexact.

But I guess that it would still be possible to make sum() use
compensated summation for built-in floating point types (float/complex).

Or, to go further, provide a switch to choose between the two methods,
and use compensated summation for float/complex by default. (But
perhaps people would consider this last alternative a bit too messy.)

(Just some thoughts ...)
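As a rough sketch of what that might look like (a made-up wrapper name,
not the actual built-in; it reuses compSum from above):

def sum_with_compensation(seq, start=0):
    # Use Kahan summation only when every element is a plain float;
    # otherwise fall back to the ordinary built-in sum().
    items = list(seq)
    if items and all(isinstance(x, float) for x in items):
        return start + compSum(items)
    return sum(items, start)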
Jun 27 '08 #5
On May 3, 3:44 pm, Szabolcs Horvát <szhor...@gmail.com> wrote:
[...]
The benefits should be weighed against the common case. For example, do
you find an error of 10^-14 unbearable? How often will someone add
1,000,000 numbers in a typical scientific application written in Python?

I believe that moving this to a third-party library could be better.
What about numpy? Doesn't it already have something similar?
Jun 27 '08 #6
Szabolcs Horvát <sz******@gmail.com> writes:
Arnaud Delobelle wrote:
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.

This occurred to me also, but then I tried

sum(['abc', 'efg'], '')

and it did not work. Or is this just a special exception to prevent
the misuse of sum to join strings? (As I said, I'm only an occasional
user.)
I think that's right: anything with an __add__ method, apart from
strings, can be sum()ed.
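A quick sketch illustrating both halves of that (the TypeError text is
the CPython 2.5 message, quoted from memory):

class Acc(object):
    """Trivial wrapper whose __add__ makes it summable."""
    def __init__(self, v):
        self.v = v
    def __add__(self, other):
        return Acc(self.v + other.v)

print sum([Acc(1), Acc(2), Acc(3)], Acc(0)).v   # -> 6

sum(['abc', 'efg'], '')
# TypeError: sum() can't sum strings [use ''.join(seq) instead]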

--
Arnaud
Jun 27 '08 #7
Szabolcs Horvát wrote:
Arnaud Delobelle wrote:
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.

This occurred to me also, but then I tried

sum(['abc', 'efg'], '')

and it did not work. Or is this just a special exception to prevent the
misuse of sum to join strings? (As I said, I'm only an occasional user.)
What you wrote is nonsensical there, no different from 'a' + 1 -- which
is why it quite rightly raises a TypeError.

You're trying to add a list to a string, which is nonsensical. You add
strings to strings, or lists to lists, but mixing them up doesn't make
sense. Python can't guess what you mean when you write something like
['abc', 'def'] + '' -- which is the functional equivalent of your call
to sum -- and so doesn't try. It indicates this by raising a TypeError.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 18 N 121 57 W && AIM, Y!M erikmaxfrancis
Jun 27 '08 #8
Hello!

Erik Max Francis writes:
Szabolcs Horvát wrote:
>Arnaud Delobelle wrote:
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.

This occurred to me also, but then I tried

sum(['abc', 'efg'], '')

and it did not work. Or is this just a special exception to
prevent the misuse of sum to join strings? (As I said, I'm only
an occasional user.)

What you wrote is nonsensical there, no different from 'a' + 1 --
which is why it quite rightly raises a TypeError.
No, the above expression should yield '' + 'abc' + 'efg'; look at the
signature of sum in the docs.

Cheers,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: br*****@jabber.org
(See http://ime.webhop.org for further contact info.)
Jun 27 '08 #9
Torsten Bronger wrote:
No, the above expression should yield ''+'abc'+'efg', look for the
signature of sum in the docs.
You're absolutely right, I misread it. Sorry about that.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 18 N 121 57 W && AIM, Y!M erikmaxfrancis
Jun 27 '08 #10
On Sat, 03 May 2008 20:44:19 +0200, Szabolcs Horvát wrote:
Arnaud Delobelle wrote:
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.

This occurred to me also, but then I tried

sum(['abc', 'efg'], '')
Interesting, I always thought that sum is like a shortcut for
reduce(operator.add, ...), but I was mistaken.

reduce() is more forgiving:

>>> reduce(operator.add, ['abc', 'efg'], '')  # it works
'abcefg'

--
Ivan
Jun 27 '08 #11
On May 3, 10:13 pm, hdante <hda...@gmail.com> wrote:
I believe that moving this to a third-party library could be better.
What about numpy? Doesn't it already have something similar?
Yes, Kahan summation makes sense for numpy arrays. But the problem
with this algorithm is optimizing compilers. The programmer will be
forced to use tricks like inline assembly to get around the optimizer.
If not, the optimizer would remove the desired features of the
algorithm. But then we have a serious portability problem.
Jun 27 '08 #12

On Sat, 2008-05-03 at 21:37 +0000, Ivan Illarionov wrote:
On Sat, 03 May 2008 20:44:19 +0200, Szabolcs Horvát wrote:
Arnaud Delobelle wrote:
sum() works for any sequence of objects with an __add__ method, not
just floats! Your algorithm is specific to floats.
This occurred to me also, but then I tried

sum(['abc', 'efg'], '')

Interesting, I always thought that sum is like shortcut of
reduce(operator.add, ...), but I was mistaken.

reduce() is more forgiving:
reduce(operator.add, ['abc', 'efg'], '' ) # it works
'abcefg'
Hm, it works for lists:

>>> sum(([1], [2]), [])
[1, 2]

However, I find the second-argument hack ugly.
Does the sum way have any performance advantages over the reduce way?

--
Best Regards,
Med Venlig Hilsen,
Thomas

Jun 27 '08 #13
On Sun, 04 May 2008 00:31:01 +0200, Thomas Dybdahl Ahle wrote:
[...]
However, I find the second-argument hack ugly. Does the sum way have any
performance advantages over the reduce way?
Yes, sum() is faster:

$ python -m timeit "" "sum([[1], [2], [3, 4]], [])"
100000 loops, best of 3: 6.16 usec per loop

$ python -m timeit "import operator" \
"reduce(operator.add, [[1], [2], [3, 4]])"
100000 loops, best of 3: 11.9 usec per loop

--
Ivan
Jun 27 '08 #14
On May 3, 7:12 pm, Ivan Illarionov <ivan.illario...@gmail.com> wrote:
[...]
Hm, it works for lists:
>>> sum(([1], [2]), [])
[1, 2]
So it's not reduce() that is more forgiving, it's sum() that makes an
exception for strings only.

However, I find the second-argument hack ugly. Does the sum way have
any performance advantages over the reduce way?

Yes, sum() is faster:

$ python -m timeit "" "sum([[1], [2], [3, 4]], [])"
100000 loops, best of 3: 6.16 usec per loop

$ python -m timeit "import operator" \
"reduce(operator.add, [[1], [2], [3, 4]])"
100000 loops, best of 3: 11.9 usec per loop
Not really; you measure the import and the attribute access time in
the second case. sum() is 2x faster for adding numbers only:

# Adding integers
python -mtimeit --setup="x=[1]*100" "sum(x)"
100000 loops, best of 3: 4.87 usec per loop
python -mtimeit --setup="from operator import add; x=[1]*100"
"reduce(add,x)"
100000 loops, best of 3: 10.1 usec per loop

# Adding floats
python -mtimeit --setup="x=[1.0]*100" "sum(x)"
100000 loops, best of 3: 5.14 usec per loop
python -mtimeit --setup="from operator import add; x=[1.0]*100"
"reduce(add,x)"
100000 loops, best of 3: 10.1 usec per loop

# Adding tuples
python -mtimeit --setup="x=[(1,)]*100" "sum(x,())"
10000 loops, best of 3: 61.6 usec per loop
python -mtimeit --setup="from operator import add; x=[(1,)]*100"
"reduce(add,x,())"
10000 loops, best of 3: 66.3 usec per loop

# Adding lists
python -mtimeit --setup="x=[[1]]*100" "sum(x,[])"
10000 loops, best of 3: 79.9 usec per loop
python -mtimeit --setup="from operator import add; x=[[1]]*100"
"reduce(add,x,[])"
10000 loops, best of 3: 84.3 usec per loop

George
Jun 27 '08 #15
On Sat, 03 May 2008 17:43:57 -0700, George Sakkis wrote:
[...]
Thanks for the correction. I forgot about --setup.

--
Ivan
Jun 27 '08 #16
On May 3, 7:05 pm, sturlamolden <sturlamol...@yahoo.no> wrote:
On May 3, 10:13 pm, hdante <hda...@gmail.com> wrote:
I believe that moving this to a third-party library could be better.
What about numpy? Doesn't it already have something similar?

Yes, Kahan summation makes sense for numpy arrays. But the problem
with this algorithm is optimizing compilers. The programmer will be
No, optimizing compilers shouldn't discard floating point operations
by default, since it changes program behavior. If they do, at least
there should be a flag to turn it off.
forced to use tricks like inline assembly to get around the optimizer.
If not, the optimizer would remove the desired features of the
algorithm. But then we have a serious portability problem.
Jun 27 '08 #17
Szabolcs Horvát <sz******@gmail.com> wrote:
I thought that it would be very nice if the built-in sum() function used
this algorithm by default. Has this been brought up before? Would this
have any disadvantages (apart from a slight performance impact, but
Python is a high-level language anyway ...)?
There's a thread you might find interesting:

http://groups.google.com/group/comp....7cef7d4429aa3a

In that thread I suggested that Python ought to implement sum by adding
together each pair of values, then each pair of results, and so on. This
means that for input values of a similar magnitude the intermediate results
stay balanced. The implementation doesn't actually do all the first-level
sums first; it builds up a stack containing the sums of 2^k, 2^(k-1), ..., 2
values. It also works for other types, and for strings or lists it has the
advantage of minimising the number of large string or list concatenations.
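A sketch of that stack-based scheme, reconstructed from the description
above (not Duncan's original code):

def sumpairs(seq):
    # stack[k] holds a partial sum of 2^k consecutive items, so values
    # of similar magnitude get added to each other.
    stack = []
    count = 0
    for x in seq:
        stack.append(x)
        count += 1
        n = count
        while n % 2 == 0:  # merge pairs while the item count is even
            top = stack.pop()
            stack[-1] = stack[-1] + top
            n //= 2
    if not stack:
        return 0
    total = stack.pop()
    while stack:  # fold the remaining O(log n) partial sums, keeping order
        total = stack.pop() + total
    return total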

If you follow that thread far enough you'll find a post by Jussi Salmela
which shows that summing adjacent pairs performs as well as Kahan summation
(he says 'almost as good' but in fact the only time they get different
results the 'correct' answer is 500000.000005 and Kahan and sumpairs get
the two nearest representable results either side: 500000.00000499998 and
500000.00000500004 respectively).

I never did get round to re-coding my sumpairs() function in C; I would be
interested to see just how much overhead it introduces (and I suspect it
would be faster than the existing sum in some cases, such as for lists).
Jun 27 '08 #18
On Sun, 04 May 2008 12:58:25 -0300, Duncan Booth <du**********@invalid.invalid> wrote:
Szabolcs Horvát <sz******@gmail.com> wrote:
>I thought that it would be very nice if the built-in sum() function used
this algorithm by default. Has this been brought up before? Would this
have any disadvantages (apart from a slight performance impact, but
Python is a high-level language anyway ...)?

There's a thread you might find interesting:

http://groups.google.com/group/comp....7cef7d4429aa3a

In that thread I suggested that Python ought to implement sum by adding
together each pair of values, then each pair of results and so on. This
means that for input values of a similar magnitude the intermediate results
stay balanced. The implementation doesn't actually do all the first level
Python doesn't require __add__ to be associative, so this should not be used as a general sum replacement. But if you know that you're adding floating point numbers you can use whatever algorithm best fits you. Or use numpy arrays; I think they implement Kahan summation or a similar algorithm.

--
Gabriel Genellina

Jun 27 '08 #19
Duncan Booth wrote:
Szabolcs Horvát <sz******@gmail.com> wrote:
[...]
Thanks for this link! Though some of the thread is missing, it was an
interesting read.

This sounds like a very good idea: it is more generally applicable than
the Kahan summation because it does not require subtraction/negation, so
it would work with user-defined types, too.

While now I am convinced that using Kahan summation by default is not
such a good idea, I cannot see any serious disadvantages/problems with
pairwise summation!
Jun 27 '08 #20
Gabriel Genellina wrote:
Python doesn't require __add__ to be associative, so this should not be used as a general sum replacement.
It does not _require_ this, but using an __add__ that is not commutative
and associative, or has side effects, would qualify as a serious misuse,
anyway. So I think that this isn't a real disadvantage (it can always
be documented that sum() expects __add__ to have these properties).

But you are right that summing floats with a requirement of high
precision is a quite specific use case, so there are no serious
arguments to do this, either.
Jun 27 '08 #21
On May 3, 9:50 am, Szabolcs Horvát <szhor...@gmail.com> wrote:
I did the following calculation: Generated a list of a million random
numbers between 0 and 1, constructed a new list by subtracting the mean
value from each number, and then calculated the mean again.

The result should be 0, but of course it will differ from 0 slightly
because of rounding errors.
See:
http://aspn.activestate.com/ASPN/Coo.../Recipe/393090
Here's the program (pardon my style, I'm a newbie/occasional user):
. . .
mean = sum(data)/len(data)
print sum(x - mean for x in data)/len(data)
See:
http://svn.python.org/view/sandbox/t...81&view=markup
Raymond

Jun 27 '08 #22
However, I find the second-argument hack ugly. Does the sum way have any
performance advantages over the reduce way?
Yes, sum() is faster:
....
Not really; you measure the import and the attribute access time in
the second case. sum() is 2x faster for adding numbers only:
Try it with Py2.6 or Py3.0. The sum() function has been optimized and
should be at least 6x faster for summing ints and floats. It should
even beat psyco-optimized sums!
Raymond
Jun 27 '08 #23
On May 5, 9:37 am, Szabolcs Horvát <szhor...@gmail.com> wrote:
Gabriel Genellina wrote:
Python doesn't require __add__ to be associative, so this should not be used as a general sum replacement.

It does not _require_ this, but using an __add__ that is not commutative
and associative, or has side effects, would qualify as a serious misuse,
anyway.
Sorry, I meant associative only, not commutative.
Jun 27 '08 #24
"Gabriel Genellina" <ga*******@yahoo.com.arwrote:
On Sun, 04 May 2008 12:58:25 -0300, Duncan Booth
<du**********@invalid.invalid> wrote:
[...]
Python doesn't require __add__ to be associative, so this should not be
used as a general sum replacement. But if you know that you're adding
floating point numbers you can use whatever algorithm best fits you. Or
use numpy arrays; I think they implement Kahan summation or a similar
algorithm.
Yes, my argument is more along the lines that it should have been
implemented that way in the first place, but changing things now would
require time machine intervention. :)
Jun 27 '08 #25
Szabolcs <sz******@gmail.com> wrote:
On May 5, 9:37 am, Szabolcs Horvát <szhor...@gmail.com> wrote:
Gabriel Genellina wrote:
Python doesn't require __add__ to be associative, so this should not be
used as a general sum replacement.
It does not _require_ this, but using an __add__ that is not
commutative and associative, or has side effects, would qualify as a
serious misuse, anyway.

Sorry, I meant associative only, not commutative.
Unfortunately an __add__ which is not associative is actually perfectly
reasonable. For example you could have a Tree class where adding two trees
returns a new tree, so (a+b)+c gives you:

    .
   / \
  .   c
 / \
a   b

but a+(b+c) gives:

    .
   / \
  a   .
     / \
    b   c

Jun 27 '08 #26
On May 5, 12:24 pm, Duncan Booth <duncan.bo...@invalid.invalid> wrote:
[...]

Unfortunately an __add__ which is not associative is actually perfectly
reasonable.
Well, 'reasonable' is a subjective notion in this case :-) I have
never seen a non-associative use of the plus sign in maths, so I would
tend to take associativity for granted when seeing a + b + c in code,
therefore I would avoid such uses. (Multiplication signs are a
different story.) Of course others may feel differently about this,
and technically there's nothing in the way of using a non-associative
__add__! :)
For example you could have a Tree class where adding two trees
returns a new tree, so (a+b)+c gives you:

    .
   / \
  .   c
 / \
a   b

but a+(b+c) gives:

    .
   / \
  a   .
     / \
    b   c
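A minimal sketch of such a Tree class (an illustration, not code from
the thread):

class Tree(object):
    """Binary tree; __add__ joins two trees under a new root node."""
    def __init__(self, left=None, right=None, label=None):
        self.left = left
        self.right = right
        self.label = label
    def __add__(self, other):
        return Tree(left=self, right=other)

a, b, c = Tree(label='a'), Tree(label='b'), Tree(label='c')
t1 = (a + b) + c  # the root's left child is the subtree (a+b)
t2 = a + (b + c)  # the root's right child is the subtree (b+c)
# t1 and t2 are structurally different, so this __add__ is not associative.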
Jun 27 '08 #27
On May 3, 4:31 pm, Thomas Dybdahl Ahle <lob...@gmail.com> wrote:
[...]
However, I find the second-argument hack ugly.
Does the sum way have any performance advantages over the reduce way?
The relative differences are irrelevant. The important thing is they
both perform quadratically, which makes them the wrong choice (unless
you can guarantee your inputs are trivial).

$ python2.5 -m timeit 'x = [[0]*1000]*10' 'sum(x, [])'
1000 loops, best of 3: 566 usec per loop
$ python2.5 -m timeit 'x = [[0]*1000]*100' 'sum(x, [])'
10 loops, best of 3: 106 msec per loop
$ python2.5 -m timeit 'x = [[0]*1000]*1000' 'sum(x, [])'
10 loops, best of 3: 18.1 sec per loop

Per element of the outer list, that's 0.0566, 1.06, 18.1 msec.
Nasty. Use this instead:

$ python2.5 -m timeit -s 'x = [[0]*1000]*10' 'y = []' 'for a in x:
y.extend(a)'
10000 loops, best of 3: 80.2 usec per loop
$ python2.5 -m timeit -s 'x = [[0]*1000]*100' 'y = []' 'for a in x:
y.extend(a)'
1000 loops, best of 3: 821 usec per loop
$ python2.5 -m timeit -s 'x = [[0]*1000]*1000' 'y = []' 'for a in x:
y.extend(a)'
10 loops, best of 3: 24.3 msec per loop
$ python2.5 -m timeit -s 'x = [[0]*1000]*10000' 'y = []' 'for a in x:
y.extend(a)'
10 loops, best of 3: 241 msec per loop
$ python2.5 -m timeit -s 'x = [[0]*1000]*100000' 'y = []' 'for a in x:
y.extend(a)'
10 loops, best of 3: 2.35 sec per loop

Per element of the outer list that's 0.00802, 0.00821, 0.0243,
0.0241, 0.0235 msec. At 1000 elements it seemed to cross some
threshold (maybe cache related, maybe allocation related), but overall
it scales linearly. Know your algorithm's complexity.
Jun 27 '08 #28
Gabriel Genellina wrote:
Python doesn't require __add__ to be associative, so this should not be used as a general sum replacement. But if you know that you're adding floating point numbers you can use whatever algorithm best fits you. Or use numpy arrays; I think they implement Kahan summation or a similar algorithm.
No, we do a straightforward sum, too. Less magic in the default case.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '08 #29
On Mon, 05 May 2008 05:08:15 -0300, Raymond Hettinger <py****@rcn.com>
wrote:
On May 3, 9:50 am, Szabolcs Horvát <szhor...@gmail.com> wrote:
I did the following calculation: Generated a list of a million random
numbers between 0 and 1, constructed a new list by subtracting the mean
value from each number, and then calculated the mean again.

The result should be 0, but of course it will differ from 0 slightly
because of rounding errors.

See:
http://aspn.activestate.com/ASPN/Coo.../Recipe/393090
Awesome!
See:
http://svn.python.org/view/sandbox/t...81&view=markup
Wonderful!

--
Gabriel Genellina

Jun 27 '08 #30

"Szabolcs Horvát" <sz******@gmail.comwrote in message
news:fv***********@toralf.uib.no...
| Gabriel Genellina wrote:
| >
| Python doesn't require __add__ to be associative, so this should not be
used as a general sum replacement.
|
| It does not _require_ this, but using an __add__ that is not commutative
| and associative, or has side effects, would qualify as a serious misuse,

seq + seq is not commutative ;-)

Jun 27 '08 #31
On May 8, 3:09 pm, "Terry Reedy" <tjre...@udel.edu> wrote:
"Szabolcs Horvát" <szhor...@gmail.comwrote in message

news:fv***********@toralf.uib.no...| Gabriel Genellina wrote:

| >
| Python doesn't require __add__ to be associative, so this should not be
used as a general sum replacement.
|
| It does not _require_ this, but using an __add__ that is not commutative
| and associative, or has side effects, would qualify as a serious misuse,

seq + seq is not commutative ;-)
There's no point switching to a special algorithm unless a float is
detected. What you're really arguing is seq + seq + float is not
commutative. In either case we may have to hurt you. ;)
Jun 27 '08 #32
"Terry Reedy" <tj*****@udel.eduwrote:
>
"Szabolcs Horvát" <sz******@gmail.comwrote in message
news:fv***********@toralf.uib.no...
| Gabriel Genellina wrote:
| >
| Python doesn't require __add__ to be associative, so this should
| not be
used as a general sum replacement.
|
| It does not _require_ this, but using an __add__ that is not
| commutative and associative, or has side effects, would qualify as a
| serious misuse,

seq + seq is not commutative ;-)
And three days ago Szabolcs already posted a followup to that
comment saying: "Sorry, I meant associative only, not commutative.".
Jun 27 '08 #33
