By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
448,652 Members | 1,698 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 448,652 IT Pros & Developers. It's quick & easy.

Lists: Converting Double to Single

P: n/a
I start with a list of tuples retrieved from a database table. These
tuples are extracted and put into individual lists. So I have lists that
look like this: [1, 2, 3, 4, 5]. When I concatenate lists, I end up with a
list of lists that looks like this: [[1, 2, 3. 4, 5]. [6, 7. 8, 9. 10]].
Then, I average the column values so I end up with a single list, but with
two brackets on each end, for example, [[3.5, 4.5, 5.5, 6.5, 7.5]].

Unfortunately, when I try to use that last list in a NumPy function, I'm
told that it cannot be broadcast to the correct shape. So, what I want to do
is strip the extra brackes from each end to leave just [3.5, 4.5, 5.5, 6.5,
7.5].

How do I do this, please?

Rich
Feb 27 '07 #1
Share this Question
Share on Google+
11 Replies


P: n/a
rs******@nospam.appl-ecosys.com wrote:
So I have lists that look like this: [1, 2, 3, 4, 5]. When I
concatenate lists, I end up with a list of lists that looks like
this: [[1, 2, 3. 4, 5]. [6, 7. 8, 9. 10]].
Really?
>>[1, 2, 3, 4, 5] + [6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Then, I average the column values so I end up with a single list, but
with two brackets on each end, for example, [[3.5, 4.5, 5.5, 6.5,
7.5]].

Unfortunately, when I try to use that last list in a NumPy function,
I'm told that it cannot be broadcast to the correct shape. So, what I
want to do is strip the extra brackes from each end to leave just
[3.5, 4.5, 5.5, 6.5, 7.5].
l = l[0]

Feb 27 '07 #2

P: n/a
rs******@nospam.appl-ecosys.com wrote:
I end up with a single list, but with two brackets on each end,
for example, [[3.5, 4.5, 5.5, 6.5, 7.5]].

Unfortunately, when I try to use that last list in a NumPy
function, I'm
told that it cannot be broadcast to the correct shape. So, what I
want to do is strip the extra brackes from each end to leave just
[3.5, 4.5, 5.5, 6.5, 7.5].

How do I do this, please?
Easy :)

I'll show you by example:
>>ham = [[1,2,3],[4,5,6]]
ham
[[1, 2, 3], [4, 5, 6]]
>>ham[0]
[1, 2, 3]
>>ham.remove([4,5,6])
ham[0]
[1, 2, 3]
>>ham
[[1, 2, 3]]
>>>
HTH&Regards,
Björn

--
BOFH excuse #453:

Spider infestation in warm case parts

Feb 27 '07 #3

P: n/a
On 2007-02-27, Leif K-Brooks <eu*****@ecritters.bizwrote:

Lief, Bjoern:
l = l[0]
Of course! If I had let it work in my mind overnight I would almost
certainly have seen this.

Thank you both for your patient responses,

Rich
Feb 27 '07 #4

P: n/a
Leif K-Brooks wrote:
rs******@nospam.appl-ecosys.com wrote:
>...
Unfortunately, when I try to use that last list in a NumPy function,
I'm told that it cannot be broadcast to the correct shape. So, what I
want to do is strip the extra brackes from each end to leave just
[3.5, 4.5, 5.5, 6.5, 7.5].

l = l[0]
Or, if you want to simultaneously assert there is only one list,
use unpacking like:
[l] = l

Although I'd prefer to use a better name than l; something like:
lst = [[3.5, 4.5, 5.5, 6.5, 7.5]]
[inner] = lst
print inner, lst

--
--Scott David Daniels
sc***********@acm.org
Feb 27 '07 #5

P: n/a
Dennis Lee Bieber <wl*****@ix.netcom.comwrote:
Something like (pseudo-code):
cnt = 0
for rw in cursor():
if cnt:
for i,v in enumerate(rw):
sum[i] += v #accumulate next row
else:
sum = rw #initialize to first row
cnt += 1
avg = [ float(v) / cnt for v in sum]

Don't know if this is faster than numpy operations, but it
definitely reduces the amount of memory consumed by that list of
lists...
A lot of course depends on the input data, and the example given wasn't
even syntactically valid so I can't tell if he started with integers or
floats, but if the lists are long and the OP is at all worried about
accuracy, this code may be a bad idea. Adding up a long list of values
and then dividing by the number of values is the classic computer
science example of how to get an inaccurate answer from a floating point
calculation.

I wouldn't have mentioned it, but since the OP said numpy it might just
be that he cares about the result. What I really don't understand though
is why, if he wants the final result in NumPy, he doesn't just use it
for the calculation?
>>from numpy import array
lst = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
array(lst).mean(axis=0)
array([ 3.5, 4.5, 5.5, 6.5, 7.5])

What I haven't tried to figure out is whether, having said all that,
NumPy's mean is any more accurate than the naive sum and divide
calculation? So far as I can tell, it isn't any more accurate although
it does give different results.
Feb 27 '07 #6

P: n/a
On Tue, 27 Feb 2007 10:06:29 +0000, Duncan Booth wrote:
Adding up a long list of values
and then dividing by the number of values is the classic computer
science example of how to get an inaccurate answer from a floating point
calculation.
I'm not entirely ignorant when it comes to computational mathematics, but
I must admit this is a new one to me.

If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?

The only alternative I thought of was to divide each value by the count
before summing, but that would be horribly inaccurate.

Or would it?
>>def mean1(*args):
.... return sum(args)/len(args)
....
>>def mean2(*args):
.... n = len(args)
.... return sum([x/n for x in args])
....
>>L = range(25, 597) # 572 values
L = [x/3.3 for x in L]

mean1(*L)
94.090909090909108
>>mean2(*L)
94.090909090909108

The first calculation has 571 additions and one division; the second
calculation has 571 additions and 572 divisions, but they both give the
same result to 15 decimal places.

--
Steven.

Feb 27 '07 #7

P: n/a
On 2007-02-27, Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.auwrote:
On Tue, 27 Feb 2007 10:06:29 +0000, Duncan Booth wrote:
>Adding up a long list of values and then dividing by the
number of values is the classic computer science example of
how to get an inaccurate answer from a floating point
calculation.

I'm not entirely ignorant when it comes to computational
mathematics, but I must admit this is a new one to me.

If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?

The only alternative I thought of was to divide each value by the count
before summing, but that would be horribly inaccurate.

Or would it?
>>>def mean1(*args):
... return sum(args)/len(args)
...
>>>def mean2(*args):
... n = len(args)
... return sum([x/n for x in args])
...
>>>L = range(25, 597) # 572 values
L = [x/3.3 for x in L]

mean1(*L)
94.090909090909108
>>>mean2(*L)
94.090909090909108

The first calculation has 571 additions and one division; the
second calculation has 571 additions and 572 divisions, but
they both give the same result to 15 decimal places.
How about this?
>>L = range(5, 10)
L = [pow(2, -x) for x in L]
"%40.40f" % mean1(*L)
'0.0121093750000000000000000000000000000000'
>>"%40.40f" % mean2(*L)
'0.0121093750000000020000000000000000000000'

Offhand, I think the first is "righter". Weird!

--
Neil Cerutti
Feb 27 '07 #8

P: n/a
Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.auwrote:
If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?
It doesn't have to be smallest to largest, but the important thing is not
to be adding the millionth element to the sum of the preceding 999999
values. So if I remember correctly one option is to add pairs of numbers,
then sum the pairs and so on.

I think my point really is that nobody is going to bother doing this when
they just want to sum a few numbers, and I'm not really surprised that the
builtin sum doesn't do it either, but I was slightly surprised that numpy
doesn't try to minimise loss of precision.

Here's a quick attempt at implementing it which demonstrates how the
builtin sum loses precision somewhat faster than a pairwise sum (and is
rather dependant on the input order). sumpair is of course a lot slower
than the builtin sum, but it might be interesting to see how a C coded
version compared. I can't see that it would necessarily be much slower: it
just adds some counters, bit twiddling and a short temporary array to the
calculation.

C:\temp>sumpairs.py
builtin sum 500000.00000083662
pairwise sum 500000.00000499998
sorted
builtin sum 500000.00000499998
pairwise sum 500000.00000499998
reverse sorted
builtin sum 500000.0
pairwise sum 500000.00000500004
all the same
builtin sum 1000000.0000016732
pairwise sum 1000000.00001

------------ sumpairs.py ------------------------------------------
# Sum implemented so that intermediate results sum a similar number
# of the input values.
def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

tmp.reverse()
return sum(tmp)

v = [1000000, 0.00001] * 1000000
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))

print "sorted"
v.sort()
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))

print "reverse sorted"
v.reverse()
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))

print "all the same"
v = [1000000.00001] * 1000000
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))-----------------------------
----------------------------------

In case the code is a bit hard to follow this version makes it clearer what
is happening:

------------- sumpair2.py ---------------
def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
print "add ", tmp
i = i + 1
n = i & -i
while n 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
print "reduce", tmp
else:
tmp.append(v)
print "append", tmp

tmp.reverse()
return sum(tmp)

print sumpairs([1]*40)
-----------------------------------------
which gives the output:

append [1]
add [2]
append [2, 1]
add [2, 2]
reduce [4]
append [4, 1]
add [4, 2]
append [4, 2, 1]
add [4, 2, 2]
reduce [4, 4]
reduce [8]
append [8, 1]
add [8, 2]
append [8, 2, 1]
add [8, 2, 2]
reduce [8, 4]
append [8, 4, 1]
add [8, 4, 2]
append [8, 4, 2, 1]
add [8, 4, 2, 2]
reduce [8, 4, 4]
reduce [8, 8]
reduce [16]
append [16, 1]
add [16, 2]
append [16, 2, 1]
add [16, 2, 2]
reduce [16, 4]
append [16, 4, 1]
add [16, 4, 2]
append [16, 4, 2, 1]
add [16, 4, 2, 2]
reduce [16, 4, 4]
reduce [16, 8]
append [16, 8, 1]
add [16, 8, 2]
append [16, 8, 2, 1]
add [16, 8, 2, 2]
reduce [16, 8, 4]
append [16, 8, 4, 1]
add [16, 8, 4, 2]
append [16, 8, 4, 2, 1]
add [16, 8, 4, 2, 2]
reduce [16, 8, 4, 4]
reduce [16, 8, 8]
reduce [16, 16]
reduce [32]
append [32, 1]
add [32, 2]
append [32, 2, 1]
add [32, 2, 2]
reduce [32, 4]
append [32, 4, 1]
add [32, 4, 2]
append [32, 4, 2, 1]
add [32, 4, 2, 2]
reduce [32, 4, 4]
reduce [32, 8]
40
Feb 27 '07 #9

P: n/a
Dennis Lee Bieber <wl*****@ix.netcom.comwrote:
For floating point, smallest magnitude to largest IS the most
precise.

Pretend you only have 2 significant digits of precision...

10 + 0.4 + 0.4 + 0.4 =10

0.4 + 0.4 + 0.4 + 10 =11
and if you try the way I suggested then initial input order doesn't
matter:

(10 + 0.4) = 10, (0.4 + 0.4) = 0.8, (10 + 0.8) = 11
(0.4 + 0.4) = 0.8, (10 + 0.4) = 10, (0.8 + 10) = 11
Pretend you ran the example code I posted. Specifically the bit where
the output is:

all the same
builtin sum 1000000.0000016732
pairwise sum 1000000.00001

It isn't so much the order of the initial values that matters
(especially if the values are all similar). What *really* matters is
keeping the magnitude of the intermediate results as small as possible.

Summing in pairs ensures that you keep the precision as long as
possible. The order of the initial values is less important than this
(although it does still have a minor effect). For a solution which works
with a sequence of unknown length and doesn't require lookahead I don't
think you can do much better than that.

BTW, in case someone thinks my example numbers are a bit too weird you
can show the same problem averaging a list of 10 digit floating point
numbers with exact representations:

v = [9999999999.] * 10000000
print "builtin sum ", (sum(v)/len(v))
print "pairwise sum", (sumpairs(v)/len(v))
gives:
builtin sum 9999999999.91
pairwise sum 9999999999.0

In both cases the total is large enough to go beyond the range of
integers that can be expressed exactly in floating point and as soon as
that happens the builtin sum loses precision on every subsequent
addition. pairwise summation only loses precision on a very few of the
additions.

P.S. Apologies for the sloppy coding in the sumpairs function. It should
of course do the final addition manually otherwise it is going to give
odd results summing lists or strings or anything else where the order
matters. Revised version:

def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

while len(tmp) 1:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
return tmp[0]
Feb 28 '07 #10

P: n/a
Duncan Booth kirjoitti:
Dennis Lee Bieber <wl*****@ix.netcom.comwrote:
> For floating point, smallest magnitude to largest IS the most
precise.

Pretend you only have 2 significant digits of precision...

10 + 0.4 + 0.4 + 0.4 =10

0.4 + 0.4 + 0.4 + 10 =11

and if you try the way I suggested then initial input order doesn't
matter:

(10 + 0.4) = 10, (0.4 + 0.4) = 0.8, (10 + 0.8) = 11
(0.4 + 0.4) = 0.8, (10 + 0.4) = 10, (0.8 + 10) = 11
Pretend you ran the example code I posted. Specifically the bit where
the output is:

all the same
builtin sum 1000000.0000016732
pairwise sum 1000000.00001

It isn't so much the order of the initial values that matters
(especially if the values are all similar). What *really* matters is
keeping the magnitude of the intermediate results as small as possible.

Summing in pairs ensures that you keep the precision as long as
possible. The order of the initial values is less important than this
(although it does still have a minor effect). For a solution which works
with a sequence of unknown length and doesn't require lookahead I don't
think you can do much better than that.

BTW, in case someone thinks my example numbers are a bit too weird you
can show the same problem averaging a list of 10 digit floating point
numbers with exact representations:

v = [9999999999.] * 10000000
print "builtin sum ", (sum(v)/len(v))
print "pairwise sum", (sumpairs(v)/len(v))
gives:
builtin sum 9999999999.91
pairwise sum 9999999999.0

In both cases the total is large enough to go beyond the range of
integers that can be expressed exactly in floating point and as soon as
that happens the builtin sum loses precision on every subsequent
addition. pairwise summation only loses precision on a very few of the
additions.

P.S. Apologies for the sloppy coding in the sumpairs function. It should
of course do the final addition manually otherwise it is going to give
odd results summing lists or strings or anything else where the order
matters. Revised version:

def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

while len(tmp) 1:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
return tmp[0]

Warning: I'm no floating point guru! Awfully long post following!

I've run a couple of tests and it seems to me that Dennis Lee Bieber is
on the trail of the truth when he claims that smallest magnitude to
the largest is the way to do the summation. Actually it isn't THE way
although it diminishes the error. I was sort of astonished because I
also had somewhere along the years formed the same notion as Dennis.

"Your" pairwise method beats the former method by a large margin
although I don't understand why. To tell you the truth: I started out to
show you were wrong because intuitively (for me at least) the former
method should work better than yours.

I also found a method called "Kahan summation algorithm" which probably
is the best way to do this.

What have I done then? I used your examples and code and added a
straight summation and the Kahan method using the pseudocode presented
in the Wikipedia article. I also made an experiment using randomized
floats merged wildly into a crazy list.

The results (as I interpret them):
1. The straight summation is the same method that the builtin sum uses
because their results are equal(ly poor).
2. Sorting to increasing order helps these two methods to have better
accuracy.
3. The pairwise method is almost as good as the Kahan method.

Here's the code and the results:

================================================== =======================
CODE:
================================================== =======================
import random

def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

while len(tmp) 1:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
return tmp[0]

def straightSum(seq):
result = 0.0
for elem in seq: result += elem
return result

def kahanSum(input):
sum = input[0]
n = len(input)
c = 0.0 # A running compensation for lost low-order
bits.
for i in range(1, n):
y = input[i] - c # So far, so good: c is zero.
t = sum + y # Alas, sum is big, y small,
# so low-order digits of y are lost.
c = (t - sum) - y # (t - sum) recovers the high-order part of y;
# subtracting y recovers -(low part of y)
sum = t # Algebraically, c should always be zero.
# Beware eagerly optimising compilers!
continue # Next time around, the lost low part will be
# added to y in a fresh attempt.
return sum

print '\n======================v = [1000000, 0.00001] * 1000000'
v = [1000000, 0.00001] * 1000000
print "CORRECT ", '500000.000005 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "sorted"
v.sort()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "reverse sorted"
v.reverse()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================v = [1000000]*1000000 + [0.00001]*1000000'
v = [1000000]*1000000 + [0.00001]*1000000
print "CORRECT ", '500000.000005 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================v = [0.00001]*1000000 + [1000000]*1000000'
v = [0.00001]*1000000 + [1000000]*1000000
print "CORRECT ", '500000.000005 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================v = [1000000.00001] * 1000000'
v = [1000000.00001] * 1000000
print "CORRECT ", '1000000.00001 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================Randomized lists'
print '========lst1 = [random.random() for i in range(1000000)]'
print '========lst2 = [1000000*random.random() for i in range(1000000)]'
N = 100000
lst1 = [random.random() for i in range(N)]
lst2 = [1000000.0 + random.random() for i in range(N)]
v = []
for i in range(N):
v.append(lst1[i])
v.append(lst2[i])
v.append(lst2[i])
v.append(lst2[i])
v.append(lst1[i])
v.append(lst2[i])
v.append(lst2[i])
v.append(lst2[i])
v.append(lst2[i])
print '========v = "Crazy merge of lst1 and lst2'
print 'min, max'
print 'lst1:', min(lst1), max(lst1)
print 'lst2:', min(lst2), max(lst2)
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "sorted"
v.sort()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "reverse sorted"
v.reverse()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))
================================================== =======================
RESULTS:
================================================== =======================

======================v = [1000000, 0.00001] * 1000000
CORRECT 500000.000005 <========
builtin 500000.00000083662
straight 500000.00000083662
pairwise 500000.00000499998
Kahan 500000.00000499998
sorted
builtin 500000.00000499998
straight 500000.00000499998
pairwise 500000.00000499998
Kahan 500000.00000499998
reverse sorted
builtin 500000.0
straight 500000.0
pairwise 500000.00000500004
Kahan 500000.00000499998

======================v = [1000000]*1000000 + [0.00001]*1000000
CORRECT 500000.000005 <========
builtin 500000.0
straight 500000.0
pairwise 500000.00000500004
Kahan 500000.00000499998

======================v = [0.00001]*1000000 + [1000000]*1000000
CORRECT 500000.000005 <========
builtin 500000.00000499998
straight 500000.00000499998
pairwise 500000.00000499998
Kahan 500000.00000499998

======================v = [1000000.00001] * 1000000
CORRECT 1000000.00001 <========
builtin 1000000.0000016732
straight 1000000.0000016732
pairwise 1000000.00001
Kahan 1000000.00001

======================Randomized lists
========lst1 = [random.random() for i in range(1000000)]
========lst2 = [1000000*random.random() for i in range(1000000)]
========v = "Crazy merge of lst1 and lst2
min, max
lst1: 1.59494420963e-005 0.999990993581
lst2: 1000000.00001 1000001.0
builtin 777778.27774196747
straight 777778.27774196747
pairwise 777778.27774199261
Kahan 777778.27774199261
sorted
builtin 777778.2777420698
straight 777778.2777420698
pairwise 777778.27774199261
Kahan 777778.27774199261
reverse sorted
builtin 777778.27774202416
straight 777778.27774202416
pairwise 777778.27774199261
Kahan 777778.27774199261
================================================== =======================
Cheers,
Jussi
Mar 1 '07 #11

P: n/a
Jussi Salmela <ti*********@hotmail.comwrote:
I've run a couple of tests and it seems to me that Dennis Lee Bieber
is
on the trail of the truth when he claims that smallest magnitude to
the largest is the way to do the summation. Actually it isn't THE way
although it diminishes the error. I was sort of astonished because I
also had somewhere along the years formed the same notion as Dennis.

"Your" pairwise method beats the former method by a large margin
although I don't understand why. To tell you the truth: I started out
to
show you were wrong because intuitively (for me at least) the former
method should work better than yours.
I think Dennis is spot on when he says that smallest to largest is the
best way. What he has missed though is that if you have a list of
similar sized values, then after you have added the first two together
the next addition still needs to follow the smallest to largest rule so
it shouldn't include the result of the first addition. One option, if
you have access to the full list when you start, would be to reinsert
each partial sum back into the sorted list as you calculate it: a heap
queue might be an appropriate implementation for that.

The pairwise sum gives you most, although not quite all, of the benefit
of sorting the data: obviously if the input values are identical it is
almost exactly equivalent to adding the smallest values and then
inserting the partial result back into the list. Even in the 1000000,
0.00001 case it avoids most of the precision-losing additions such as
500000000005.0 + 0.00001 which the simple sum hits.

Also it has the further advantages of still working with other summable
objects (e.g. lists) so long as they are associative and not requiring
you to know all the values or even the length of the sequence in
advance.

For floating point you can of course use methods to preserve the lost
bits such as the kahan sum, or even just do all imtermediate
calculations to a higher internal precision.

By the way, I found another bug in my sumpairs: += is a very bad idea
here if the input can contain mutables, so the relevant line should be:

tmp[-1] = tmp[-1] + v

Once I fixed that one, so it no longer attempts to modify lists in-
place, it appears that the more balanced intermediate results give a
benefit when summing lists as well. sumpairs on a list of lists seems to
beat the builtin sum as soon as the length of the list exceeds about 210
items. At 1000 items it is about 3 times faster and 30 times faster at
10,000 items. If I get the time I think I really should code up a C
version of sumpairs and see how that compares against Pythons builtin
sum for numerical sums.

I guess either everyone else missed that computer science lecture I
thought I remembered, or maybe I was the one who was asleep and I
dreamed it all. :)
Mar 2 '07 #12

This discussion thread is closed

Replies have been disabled for this discussion.