I made an array of 10 million floats and timed how long it takes to sum
the elements. Here's what I got (millis):
gcc -O2: 21
Python with numarray: 104
Python with Numeric: 302
Java: 325
gcc: 348
Python with Psyco: 1317
Pure Python using sum: 2312
Pure Python: 5631

http://yaroslav.hopto.org/russianwik...numericspeed
numarray overtakes Numeric at about 1100 floats.
I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.
Yaroslav

bu*****@engr.orst.edu (Yaroslav Bulatov) wrote:
> I made an array of 10 million floats and timed how long it takes to
> sum the elements, here's what I got (millis):
I just had a brief look at the code you posted. Are you not concerned about
accuracy in any of your calculations? Summing a 10 million element array by
simply accumulating each element into a running total is guaranteed to give
a lousy result.
Also, without running it, I don't see how most of your Python examples
work: you seem to call the timing function with the result of the
calculation, which doesn't make sense.
On 1 Aug 2004, Duncan Booth wrote:
> I just had a brief look at the code you posted. Are you not concerned
> about accuracy in any of your calculations? Summing a 10 million
> element array by simply accumulating each element into a running total
> is guaranteed to give a lousy result.
Lousy or not, I believe that's how numarray is implemented internally, so
at least all the benchmarks are the same. If you want accuracy summing
that many numbers, you're going to have to do it in software (likely by
summing each mantissa bit individually and reconstructing the float
afterward), so it will be abysmally slow (obviously not what the OP
wants).
On 31 Jul 2004, Yaroslav Bulatov wrote:
> I'm doing intensive computation on arrays in Python, so if you have
> suggestions on Python/C solutions that could push the envelope, please
> let me know.
If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups). I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).
Regarding your results:
numarray should be just as fast as the -O2 C version. I was puzzled at
first as to where the speed discrepancy came from, but the culprit is
the -O2 flag: gcc -O2 notices that sum is never used, and thus removes
the loop entirely. As a matter of fact, there isn't even any fadd
instruction in the assembler output:
    call clock
    movl %eax, %esi
    movl $9999999, %ebx
.L11:
    decl %ebx
    jns .L11
    subl $16, %esp
    call clock
As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0. To obtain correct results, add a line such as
'printf("%f\n",sum);' after the main loop in the C version. This will
force gcc to leave the actual calculation in place and give you accurate
results.
The above fix will likely render numarray faster than the C version.
Using gcc -O3 rather than gcc -O2 will get fairer results, as this is
what numarray uses.
Is there any reason why in the Python/numarray version, you use
Numeric's RandomArray rather than numarray.random_array? It shouldn't
affect your results, but it would speed up initialization time a bit.
There are a few inefficiencies in the pytime module (mostly involving
range() and *args/**kwargs), but I don't think they'll have too big of an
impact on your results. Instead, I'd suggest running the numarray/Numeric
tests using Psyco to remove much of the Python overhead.
For completeness, I'd also suggest both running the Java version using a
JIT compiler such as Kaffe, and compiling it natively using gcj (the
latter should approach the speed of C).
Christopher T King <sq******@WPI.EDU> wrote:
> On 1 Aug 2004, Duncan Booth wrote:
> > I just had a brief look at the code you posted. Are you not
> > concerned about accuracy in any of your calculations? Summing a 10
> > million element array by simply accumulating each element into a
> > running total is guaranteed to give a lousy result.
>
> Lousy or not, I believe that's how numarray is implemented internally,
> so at least all the benchmarks are the same. If you want accuracy
> summing that many numbers, you're going to have to do it in software
> (likely by summing each mantissa bit individually and reconstructing
> the float afterward), so it will be abysmally slow (obviously not what
> the OP wants).
My point being that speed isn't everything. Most applications doing large
floating point calculations should be concerned about accuracy, and how not
to add up a large array of floating point numbers was one of the first
things I was taught in computer science classes. The fastest way to sum 10
million numbers (albeit at some considerable loss of accuracy):
return 0
The 'correct' way to sum a large array of floats is, of course, to
calculate a lot of partial sums and sum them. For example, naively, we
might say that to sum the array we sum the first half, sum the second half
and add the results together. This definition being recursive ensures that
if the inputs are all of a similar size then all the intermediate
calculations involve summing similarly sized values. A more sophisticated
technique might also try to handle the case where not all values are a
similar size.
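As a rough sketch of that recursive-halving scheme (my illustration, not code posted in the thread; the name pairwise_sum is mine):

```python
def pairwise_sum(xs):
    """Sum a list of floats by recursively summing the two halves.

    Each addition then combines similarly sized partial sums, which
    keeps the accumulated rounding error far smaller than a single
    left-to-right running total.
    """
    n = len(xs)
    if n <= 2:  # small base case: sum directly
        return sum(xs)
    mid = n // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])
```

The recursion depth is only log2(n), so the error grows roughly with log n rather than with n.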
If you are worried about speed, then calculating partial sums is also the
way to go: the naive technique used by the original poster leaves no scope
for parallel calculation. It should be possible to write a slightly more
complex loop in the C version that runs a lot faster on systems that are
capable of performing multiple floating point instructions in parallel. bu*****@engr.orst.edu (Yaroslav Bulatov) wrote in message news:<4d**************************@posting.google. com>... I made an array of 10 million floats timed how long it takes to sum the elements, here's what I got (millis):
gcc O2: 21 Python with numarray: 104 Python with Numeric: 302 java: 325 gcc: 348 Python with Psyco: 1317 Pure Python using sum: 2312 Pure Python: 5631
http://yaroslav.hopto.org/russianwik...numericspeed
numarray takes over Numeric at about 1100 floats
I'm doing intensive computation on arrays in Python, so if you have suggestions on Python/C solutions that could push the envelope, please let me know.
What hardware are you using?
On a 2.8 GHz Intel Pentium 4, a Fortran 95 code compiled with Compaq
Visual Fortran 6.6C using the optimize:5 option takes 40
milliseconds. Probably you could improve the speed of the C program by
compiling it with the Intel C compiler. The Fortran program speed is
constrained by array accesses, not the additions (which are done in
hardware). A program to compute sum(xx**4) takes the same time as one
to compute sum(xx).
program xsum_double
  ! sum 1e7 doubles
  implicit none
  integer, parameter :: n = 10000000
  real(kind=8) :: xx(n)
  real :: t1,t2,xsum
  call random_seed()
  call random_number(xx)
  call cpu_time(t1)
  xsum = sum(xx)
  call cpu_time(t2)
  print*,1000*(t2-t1),xsum
end program xsum_double
Duncan Booth <me@privacy.net> wrote:
> Christopher T King <sq******@WPI.EDU> wrote:
> > On 1 Aug 2004, Duncan Booth wrote:
> > > I just had a brief look at the code you posted. Are you not
> > > concerned about accuracy in any of your calculations? Summing a 10
> > > million element array by simply accumulating each element into a
> > > running total is guaranteed to give a lousy result.
> >
> > Lousy or not, I believe that's how numarray is implemented
> > internally, so at least all the benchmarks are the same. If you want
> > accuracy summing that many numbers, you're going to have to do it in
> > software (likely by summing each mantissa bit individually and
> > reconstructing the float afterward), so it will be abysmally slow
> > (obviously not what the OP wants).
>
> My point being that speed isn't everything. Most applications doing
> large floating point calculations should be concerned about accuracy,
> and how not to add up a large array of floating point numbers was one
> of the first things I was taught in computer science classes. The
> fastest way to sum 10 million numbers (albeit at some considerable
> loss of accuracy):
>
> return 0
>
> The 'correct' way to sum a large array of floats is, of course, to
> calculate a lot of partial sums and sum them. For example, naively, we
> might say that to sum the array we sum the first half, sum the second
> half and add the results together. This definition being recursive
> ensures that if the inputs are all of a similar size then all the
> intermediate calculations involve summing similarly sized values. A
> more sophisticated technique might also try to handle the case where
> not all values are a similar size.
>
> If you are worried about speed, then calculating partial sums is also
> the way to go: the naive technique used by the original poster leaves
> no scope for parallel calculation. It should be possible to write a
> slightly more complex loop in the C version that runs a lot faster on
> systems that are capable of performing multiple floating point
> instructions in parallel.
You are right, naive summing generates significant accuracy losses.
I estimated the error by summing [0.123456789012345]*10000000 and
comparing it to 1234567.89012345. All methods have an error of about
1e-4. The method below sums the array at the same speed as a regular
Python sum loop, but reports an error < 1e-15.
from math import ceil, log

def sum2(arr):
    # in-place pairwise summation: at each pass, fold element i+2**itr
    # into element i, doubling the stride every pass
    size = len(arr)
    for itr in range(int(ceil(log(size)/log(2)))):
        for i in xrange(0, size, 2**(itr+1)):
            next_i = i+2**itr
            if next_i < size:
                arr[i] += arr[next_i]
    return arr[0]
When arbitrary precision numbers are being used, this method has
performance O(nd) vs. regular summing's O(n(d+log n)), where n is the
size of the array and d is the number of digits of the elements. In
practice I got about a 20% performance increase when summing an array
of 1000 Decimals using the method above.
A better estimate of the error difference would use random digits as
opposed to [x]*10000000, but I don't know how I would calculate the
exact answer in any reasonable amount of time. (Using Decimals takes
over a second for a 4000-element array (with conversion).)
Yaroslav
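One way to get that exact reference answer (a sketch using the fractions module, which postdates this thread; the name exact_sum is mine): every IEEE double is an exact rational number, so converting each element to Fraction and summing incurs no rounding error at all. It is slow bignum arithmetic, but fine for checking a few thousand random values.

```python
from fractions import Fraction

def exact_sum(xs):
    """Mathematically exact sum of a list of floats.

    Each double converts to an exact Fraction, and Fraction addition
    is exact, so the result is the true sum of the inputs.
    """
    return sum(Fraction(x) for x in xs)
```

Calling float() on the result rounds only once, at the very end, giving a correctly rounded answer to compare the other methods against.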
Christopher T King <sq******@WPI.EDU> wrote:
> On 31 Jul 2004, Yaroslav Bulatov wrote:
> > I'm doing intensive computation on arrays in Python, so if you have
> > suggestions on Python/C solutions that could push the envelope,
> > please let me know.
>
> If you're doing mostly vector calculations as opposed to summing,
> I've been doing some work on adding SIMD support to numarray, with
> pleasing results (around 2x speedups). I've also done some work
> adding local parallel processing support to numarray, with
> not-so-pleasing results (mostly due to Python overhead).
>
> Regarding your results:
>
> numarray should be just as fast as the -O2 C version. I was puzzled
> at first as to where the speed discrepancy came from, but the culprit
> is in the -O2 flag: gcc -O2 notices that sum is never used, and thus
> removes the loop entirely. As a matter of fact, there isn't even any
> fadd instruction in the assembler output:
>
>     call clock
>     movl %eax, %esi
>     movl $9999999, %ebx
> .L11:
>     decl %ebx
>     jns .L11
>     subl $16, %esp
>     call clock
>
> As you can see, the 21ms you're seeing is the time spent counting
> down from 9,999,999 to 0. To obtain correct results, add a line such
> as 'printf("%f\n",sum);' after the main loop in the C version. This
> will force gcc to leave the actual calculation in place and give you
> accurate results.
>
> The above fix will likely render numarray faster than the C version.
> Using gcc -O3 rather than gcc -O2 will get fairer results, as this is
> what numarray uses.
You are right, how silly of me! Fixing the script now results in 130
millis mean, 8.42 millis standard deviation, which is slower than
numarray (104, 2.6 respectively). I wonder why numarray gives faster
results on such a simple task?
> Is there any reason why in the Python/numarray version, you use
> Numeric's RandomArray rather than numarray.random_array? It shouldn't
> affect your results, but it would speed up initialization time a bit.
There isn't a good reason, I simply didn't know about
numarray.random_array.

> There are a few inefficiencies in the pytime module (mostly involving
> range() and *args/**kwargs), but I don't think they'll have too big
> of an impact on your results. Instead, I'd suggest running the
> numarray/Numeric tests using Psyco to remove much of the Python
> overhead.
>
> For completeness, I'd also suggest both running the Java version
> using a JIT compiler such as Kaffe, and compiling it natively using
> gcj (the latter should approach the speed of C).
[Yaroslav Bulatov]
> ... A better estimate of the error difference would use random digits
> as opposed to [x]*10000000, but I don't know how I would calculate
> the exact answer in any reasonable amount of time. (Using Decimals
> takes over a second for a 4000-element array (with conversion).)
Google on
floating point distillation
It's possible to compute the exact sum of a list of floats using float
arithmetic, where, ignoring overflow, the result is a minimal set of
fp numbers whose mathematical (infinite precision) sum exactly equals
the mathematical sum of the inputs.
In the worst case the best of these methods can require O(log N)
passes over the list of N inputs, but in real life they usually
converge to the result in two passes independent of N, sometimes with
a third pass but over a very short list of floats.
The better-known "Kahan summation" technique is essentially a
partially-broken computation of the two highest-order components of
one distillation pass.
A case like summing [1e100] + [1.0] * (N-1) shows that left-to-right
summing can deliver arbitrarily bad results (the fp sum of that is
1e100 regardless of N).
On 2 Aug 2004, Yaroslav Bulatov wrote:
> You are right, how silly of me! Fixing the script now results in 130
> millis mean, 8.42 millis standard deviation, which is slower than
> numarray (104, 2.6 respectively). I wonder why numarray gives faster
> results on such a simple task?
The reason is that your C loop uses array indexing, whereas numarray's
increments the array pointer on each iteration through the loop and
forgoes any indexing. Even though this translates into about the same
number of opcodes on x86, the latter is slightly faster because it
skips the (implied) bit-shift operation. Also, don't forget to compile
your C version with -O3, as this can make a big difference in that
little loop (maybe not in your case though).
In principle, for simple access patterns like summation, etc., numarray
will always be faster than your "typical" C implementation, since its core
vector functions were designed to be very efficient. Of course, an
efficient C implementation will be as fast or faster, but you'd have to
use those efficiency tricks in every vector operation you perform, which
could be quite tedious without writing your own library. (The above
generalization of course only applies to very large arrays.)
Either way, your results are quite promising -- they show that Python
with numarray is as good as or better than C when it comes to vector
processing applications, all while being easier to use. That's good
publicity!