473,738 Members | 7,110 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Microbenchmark: Summing over array of doubles

I made an array of 10 million floats timed how long it takes to sum
the elements, here's what I got (millis):

gcc -O2: 21
Python with numarray: 104
Python with Numeric: 302
java: 325
gcc: 348
Python with Psyco: 1317
Pure Python using sum: 2312
Pure Python: 5631

http://yaroslav.hopto.org/russianwik...-numeric-speed

numarray takes over Numeric at about 1100 floats

I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.

Yaroslav
Jul 18 '05 #1
9 2572
bu*****@engr.or st.edu (Yaroslav Bulatov) wrote in
news:4d******** *************** ***@posting.goo gle.com:
I made an array of 10 million floats timed how long it takes to sum
the elements, here's what I got (millis):


I just had a brief look at the code you posted. Are you not concerned about
accuracy in any of your calculations? Summing a 10 million element array by
simply accumulating each element into a running total is guaranteed to give
a lousy result.

Also, without running it I don't see how most of your Python examples work
since you seem to call the timing function with the result of the
calculation which doesn't make sense.

Jul 18 '05 #2
On 1 Aug 2004, Duncan Booth wrote:
I just had a brief look at the code you posted. Are you not concerned about
accuracy in any of your calculations? Summing a 10 million element array by
simply accumulating each element into a running total is guaranteed to give
a lousy result.


Lousy or not, I believe that's how numarray is implemented internally, so
at least all the benchmarks are the same. If you want accuracy summing
that many numbers, you're going to have to do it in software (likely by
summing each mantissa bit individually and reconstructing the float
afterward), so it will be abysmally slow (obviously not what the OP
wants).

Jul 18 '05 #3
On 31 Jul 2004, Yaroslav Bulatov wrote:
I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.


If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups). I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).

Regarding your results:

numarray should be just as fast as the -O2 C version. I was puzzled at
first as to where the speed discrepancy came from, but the culprit is in
the -O2 flag: gcc -O2 noticies that sum is never used, and thus removes
the loop entirely. As a matter of fact, there isn't even any fadd
instruction in the assembler output:

call clock
movl %eax, %esi
movl $9999999, %ebx
..L11:
decl %ebx
jns .L11
subl $16, %esp
call clock

As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0. To obtain correct results, add a line such as
'printf("%f\n", sum);' after the main loop in the C version. This will
force gcc to leave the actual calculation in place and give you accurate
results.

The above fix will likely render numarray faster than the C version.
Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what
numarray uses.

Is there any reason why in the Python/numarray version, you use
Numeric's RandomArray rather than numarray.random _array? It shouldn't
affect your results, but it would speed up initialization time a bit.

There are a few inefficiences in the pytime module (mostly involving
range() and *args/**kwargs), but I don't think they'll have too big of an
impact on your results. Instead, I'd suggest running the numarray/Numeric
tests using Psyco to remove much of the Python overhead.

For completeness, I'd also suggest both running the Java version using a
JIT compiler such as Kaffe, and compiling it natively using gcj (the
latter should approach the speed of C).

Jul 18 '05 #4
Christopher T King <sq******@WPI.E DU> wrote in
news:Pi******** *************** *************** @ccc4.wpi.edu:
On 1 Aug 2004, Duncan Booth wrote:
I just had a brief look at the code you posted. Are you not concerned
about accuracy in any of your calculations? Summing a 10 million
element array by simply accumulating each element into a running
total is guaranteed to give a lousy result.


Lousy or not, I believe that's how numarray is implemented internally,
so at least all the benchmarks are the same. If you want accuracy
summing that many numbers, you're going to have to do it in software
(likely by summing each mantissa bit individually and reconstructing
the float afterward), so it will be abysmally slow (obviously not what
the OP wants).


My point being that speed isn't everything. Most applications doing large
floating point calculations should be concerned about accuracy, and how not
to add up a large array of floating point numbers was one of the first
things I was taught in computer science classes. The fastest way to sum 10
million numbers (albeit at some considerable loss of accuracy):

return 0
The 'correct' way to sum a large array of floats is, of course, to
calculate a lot of partial sums and sum them. For example, naively, we
might say that to sum the array we sum the first half, sum the second half
and add the results together. This definition being recursive ensures that
if the inputs are all of a similar size then all the intermediate
calculations involve summing similarly sized values. A more sophisticated
technique might also try to handle the case where not all values are a
similar size.

If you are worried about speed, then calculating partial sums is also the
way to go: the naive technique used by the original poster leaves no scope
for parallel calculation. It should be possible to write a slightly more
complex loop in the C version that runs a lot faster on systems that are
capable of performing multiple floating point instructions in parallel.
Jul 18 '05 #5
bu*****@engr.or st.edu (Yaroslav Bulatov) wrote in message news:<4d******* *************** ****@posting.go ogle.com>...
I made an array of 10 million floats timed how long it takes to sum
the elements, here's what I got (millis):

gcc -O2: 21
Python with numarray: 104
Python with Numeric: 302
java: 325
gcc: 348
Python with Psyco: 1317
Pure Python using sum: 2312
Pure Python: 5631

http://yaroslav.hopto.org/russianwik...-numeric-speed

numarray takes over Numeric at about 1100 floats

I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.


What hardware are you using?

On a 2.8 GHz Intel Pentium 4, a Fortran 95 code compiled with Compaq
Visual Fortran 6.6C using the -optimize:5 option takes 40
milliseconds. Probably you could improve the speed of the C program by
compiling it with the Intel C compiler. The Fortran program speed is
constrained by array accesses, not the additions (which are done in
hardware). A program to compute sum(xx**4) takes the same time as one
to compute sum(xx).

program xsum_double
! sum 1e7 doubles
implicit none
integer, parameter :: n = 10000000
real(kind=8) :: xx(n)
real :: t1,t2,xsum
call random_seed()
call random_number(x x)
call cpu_time(t1)
xsum = sum(xx)
call cpu_time(t2)
print*,1000*(t2-t1),xsum
end program xsum_double
Jul 18 '05 #6
Duncan Booth <me@privacy.net > wrote in message news:<Xn******* *************** ****@127.0.0.1> ...
Christopher T King <sq******@WPI.E DU> wrote in
news:Pi******** *************** *************** @ccc4.wpi.edu:
On 1 Aug 2004, Duncan Booth wrote:
I just had a brief look at the code you posted. Are you not concerned
about accuracy in any of your calculations? Summing a 10 million
element array by simply accumulating each element into a running
total is guaranteed to give a lousy result.


Lousy or not, I believe that's how numarray is implemented internally,
so at least all the benchmarks are the same. If you want accuracy
summing that many numbers, you're going to have to do it in software
(likely by summing each mantissa bit individually and reconstructing
the float afterward), so it will be abysmally slow (obviously not what
the OP wants).


My point being that speed isn't everything. Most applications doing large
floating point calculations should be concerned about accuracy, and how not
to add up a large array of floating point numbers was one of the first
things I was taught in computer science classes. The fastest way to sum 10
million numbers (albeit at some considerable loss of accuracy):

return 0
The 'correct' way to sum a large array of floats is, of course, to
calculate a lot of partial sums and sum them. For example, naively, we
might say that to sum the array we sum the first half, sum the second half
and add the results together. This definition being recursive ensures that
if the inputs are all of a similar size then all the intermediate
calculations involve summing similarly sized values. A more sophisticated
technique might also try to handle the case where not all values are a
similar size.

If you are worried about speed, then calculating partial sums is also the
way to go: the naive technique used by the original poster leaves no scope
for parallel calculation. It should be possible to write a slightly more
complex loop in the C version that runs a lot faster on systems that are
capable of performing multiple floating point instructions in parallel.


You are right, naive summing generates significant accuracy losses.
I estimated the error by summing [0.1234567890123 45]*1000000 and
comparing it to
1234567.8901234 5. All methods have error about 1e-4 .

The method below sums the array at the same speed as regular Python
sum loop, but reports error < 1e-15 .

def sum2(arr):
size = len(arr)
for itr in range(int(ceil( log(size)/log(2)))):
for i in xrange(0, size, 2**(itr+1)):
next_i = i+2**itr
if next_i<size:
arr[i]+=arr[next_i]
return arr[0]
When arbitrary precision numbers are being used, this method has
performance O(nd) vs. regular summing O(n(d+log d)) where n is size of
array, d is number of digits of the elements. In practice I got about
20% performance increase when summing an array of 1000 Decimals using
method above.

A better estimate of error difference would use random digits as
opposed to [x]*10000000, but I don't know how I would calculate exact
answer in any reasonable amount of time. (using Decimals takes over a
second for 4000 array (with conversion))

Yaroslav
Jul 18 '05 #7
Christopher T King <sq******@WPI.E DU> wrote in message news:<Pi******* *************** *************** *@ccc4.wpi.edu> ...
On 31 Jul 2004, Yaroslav Bulatov wrote:
I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.
If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups). I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).

Regarding your results:

numarray should be just as fast as the -O2 C version. I was puzzled at
first as to where the speed discrepancy came from, but the culprit is in
the -O2 flag: gcc -O2 noticies that sum is never used, and thus removes
the loop entirely. As a matter of fact, there isn't even any fadd
instruction in the assembler output:

call clock
movl %eax, %esi
movl $9999999, %ebx
.L11:
decl %ebx
jns .L11
subl $16, %esp
call clock

As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0. To obtain correct results, add a line such as
'printf("%f\n", sum);' after the main loop in the C version. This will
force gcc to leave the actual calculation in place and give you accurate
results.

The above fix will likely render numarray faster than the C version.
Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what
numarray uses.


You are right, how silly of me! Fixing the script now results in 130
millis mean, 8.42 millis standard deviation, which is slower than
numarray (104, 2.6 respectively). I wonder why numarray gives faster
results on such a simple task?

Is there any reason why in the Python/numarray version, you use
Numeric's RandomArray rather than numarray.random _array? It shouldn't
affect your results, but it would speed up initialization time a bit.
There isn't a good reason, I simply didn't know about
numarray.random _array

There are a few inefficiences in the pytime module (mostly involving
range() and *args/**kwargs), but I don't think they'll have too big of an
impact on your results. Instead, I'd suggest running the numarray/Numeric
tests using Psyco to remove much of the Python overhead.

For completeness, I'd also suggest both running the Java version using a
JIT compiler such as Kaffe, and compiling it natively using gcj (the
latter should approach the speed of C).

Jul 18 '05 #8
[Yaroslav Bulatov]
....
A better estimate of error difference would use random digits as
opposed to [x]*10000000, but I don't know how I would calculate exact
answer in any reasonable amount of time. (using Decimals takes over a
second for 4000 array (with conversion))


Google on

floating point distillation

It's possible to compute the exact sum of a list of floats using float
arithmetic, where, ignoring overflow, the result is a minimal set of
fp numbers whose mathematical (infinite precision) sum exactly equals
the mathematical sum of the inputs.

In the worst case the best of these methods can require O(log N)
passes over the list of N inputs, but in real life they usually
converge to the result in two passes independent of N, sometimes with
a third pass but over a very short list of floats.

The better-known "Kahan summation" technique is essentially a
partially-broken computation of the two highest-order components of
one distillation pass.

A case like summing [1e100] + [1.0] * (N-1) shows that left-to-right
can deliver arbitrarily bad results (the fp sum of that is 1e100
regardless of N).
Jul 18 '05 #9
On 2 Aug 2004, Yaroslav Bulatov wrote:
You are right, how silly of me! Fixing the script now results in 130
millis mean, 8.42 millis standard deviation, which is slower than
numarray (104, 2.6 respectively). I wonder why numarray gives faster
results on such a simple task?


The reason is that your C loop uses array indexing, whereas numarray's
increments the array pointer on each iteration through the loop an forgoes
any indexing. Even though this translates into about same number of
opcodes on x86, the latter is slightly faster because it skips the
(implied) bitshift operation. Also don't forget to compile your C version
with -O3, as this can make a big difference in that little loop (maybe not
in your case though).

In principle, for simple access patterns like summation, etc., numarray
will always be faster than your "typical" C implementation, since its core
vector functions were designed to be very efficient. Of course, an
efficient C implementation will be as fast or faster, but you'd have to
use those efficiency tricks in every vector operation you perform, which
could be quite tedious without writing your own library. (The above
generaliztion of course only applies to very large arrays.)

Either way, your results are quite promising -- they show that Python with
numarray is as good or better than C when it comes to vector processing
applications, all while being easier to use. That's good publicity!

Jul 18 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
2246
by: SunMan | last post by:
Hello! I am trying to create a program that will ask for a user to enter a series of letters (codes) and then print out a table that shows the codes in decending frequency. Only letters will be read into the array. I have created a struct and array like below. struct record { char inputChar;
18
2480
by: bsder | last post by:
Hi, Can anyone please tell me how to calculate the size of the following 4-dimensional array, and now to use qsort for sorting on this array? double sp = { 4.0, 5.0, 6.0 }; double spa = { { 4.0, 2.0 }, { 5.0, 8.0 }, { 6.0, 6.0 },
1
3538
by: Adi Lazar | last post by:
Hi, I'm using an ActiveX control that has a read/write property named InsertionPoint (type = object). In the ActiveX documentation : Variant (three-element array of doubles); I can write to it : double insertPoint = new double{2, 5, 1}; myObj.InsertionPoint = insertPoint; => works fine !!!
5
4501
by: Tales Normando | last post by:
The title says it all. Anyone?
2
2458
by: MrL8Knight | last post by:
I am building a simple shopping cart and I am having problems trying to add the costs of the items to generate a total cost. I am new at this so forgive me if my technical verbiage isn’t the greatest! I am trying to sum specific values in my arrays (or the price portions of the arrays). I have tried every imaginable way with the array_sum command and nothing has worked for me! Here is what my cookie information looks like after I put 2...
14
3130
by: Peter Hallett | last post by:
I would like to set up a string array as a class member, or field, and then populate this array by reading in from a text file, but I cannot find the appropriate syntax. The getter and setter are very unhappy with the idea and the compiler refuses to play ball with any of the attempts I have made to build such a structure. There is no problem when using a single string but a two dimensional array of strings appears to be a very different...
10
2156
by: Neville Lang | last post by:
Hi all, Here is a problem I first struck a while back in the Compact Framework. I am now using the full Framework (v2.0) and wanted to know whether there is anyway around this issue without building separate shim code in another DLL. Here is the layout of the external C DLL function: int thisFunction( double a, int b, double* c)
12
4062
by: neeraj | last post by:
Hi Can any body give me the syntax for summing the elements of the an array , without looping thanks
35
5893
by: Lee Crabtree | last post by:
This seems inconsistent and more than a little bizarre. Array.Clear sets all elements of the array to their default values (0, null, whatever), whereas List<>.Clear removes all items from the list. That part makes a reasonable amount of sense, as you can't actually take items away from an Array. However, there doesn't seem to be a way to perform the same operation in one fell swoop on a List<>. For example:
0
9335
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9263
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
1
6751
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6053
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4570
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4825
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3279
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
2745
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2193
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.