473,387 Members | 1,517 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

weird benchmarks with c arrays

Hi there to everyone!
I've got a problem which may be interesting to some of you, and I'd be very grateful if there is someone who can give me some advice (or maybe redirect me to some other place).

Very brief introduction.
Since I'm learning vectorization with SSE (I use gcc 4.1.3 on Ubuntu 7.10) I've built a simple test program in C for comparing running times. I wanted to see how much improvement could give vectorization to simple iterated operations. So I made also some tests with standard arrays, pointers, and such (with and without SSE), and I stumbled upon some strange facts I cannot naively explain.

The (apparently) weird thing.
Let me show you an example. One of the test is the multiplication of a random-filled array of doubles for a fixed scalar.
The array has length L and the multiplication is repeated R times. (The repetition is the outer loop, while the array operation is in the inner loop). For example, the code for the indexed iteration is:
Expand|Select|Wrap|Line Numbers
  1. for (r = 0; r < repeat; r++) for (i = 0; i < length; i++) v[i] *= m; 
Amongst the benchmark I've tried, there is one test in which I modify the values of R and L keeping the product R * L constant.
These are the execution times in seconds (on a AMD processor). The program has been compiled with optimization -O2 and SSE flags on.

operation - repeat - array length -- SSE / indexed loop / pointer loop
sclmul 5000000 32 -- 0.19 0.31 0.24
sclmul 2500000 64 -- 0.16 0.29 0.22
sclmul 1250000 128 -- 0.15 0.28 0.21
sclmul 500000 320 -- 0.14 0.27 0.20
sclmul 250000 640 -- 0.14 0.27 0.20
sclmul 125000 1280 -- 0.13 0.26 0.20
sclmul 50000 3200 -- 0.13 0.27 0.20
sclmul 25000 6400 -- 0.14 0.27 0.20
sclmul 12500 12800 -- 0.24 0.35 0.32
sclmul 5000 32000 -- 0.24 0.35 0.32
sclmul 2500 64000 -- 0.25 0.36 0.32
sclmul 1250 128000 -- 0.44 0.55 0.51
sclmul 1000 160000 -- 0.60 0.71 0.65
sclmul 800 200000 -- 0.72 0.84 0.78
sclmul 500 320000 -- 0.72 0.84 0.77
sclmul 250 640000 -- 0.72 0.84 0.78
sclmul 125 1280000 -- 0.72 0.84 0.78
sclmul 62 2560000 -- 0.73 0.86 0.79
sclmul 50 3200000 -- 0.72 0.84 0.78

The strange fact is that as the array length increases, the execution time worsens a lot. In some other benchmarks (with copy operations), it feels like there is a cutoff above which the execution goes almost ten times slower (from 0.40 to 3.26)!
The fact is not linked to optimization: it happens even with no optimization on. And it is not linked to a specific processor, since I've run it on three different machines and I found similar results (with different "cutoffs", but there is always an array length for which the phenomenon occurs).
After some tinkering, I've found that _adding_ an extra extern loop -- which in fact "segments" the array as if it was the union of smaller arrays -- the slowness vanishes.

Expand|Select|Wrap|Line Numbers
  1. for (k = 0; k < kmax; k++) {
  2. start = k*length/kmax;
  3. for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m; 
  4. }
(The piece of code above is awkward but it is actually up to ten time faster than a normal array sweep for very very large arrays -- kmax must be fine-tuned.)
Actually, I'm not able to explain it with a violation of locality of reference or other simple optimization techniques, since we are talking about a very simple operation which is simply iterated in a (very very) long array (remember that this happens even with optimization turned off).
I feel like I'm missing something fundamental.

Any idea?
Thank you for your time!

Luigi
Aug 20 '08 #1
6 1872
newb16
687 512MB
(6400-12800 )*sizeof(double) looks like L1 cache size, and next pit is like to be about L2 cache size.
Aug 20 '08 #2
weaknessforcats
9,208 Expert Mod 8TB
Put your array on the heap rather than the stack.

Also, I hope this code:
for (k = 0; k < kmax; k++) {
start = k*length/kmax;
for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m;
}
does not have k, kmax, r and repeat as doubles. Operators like < don't work as expected for floating point. That could cause loop problems.

Also, avoid calculations in the condition part of a loop since these will be redone on each cycle. Better to calculate once, put the result in a variable and use the variable in the loop.
Aug 20 '08 #3
newb16
687 512MB
So when you process array slice-by-slice it is loaded into cache only once per each k iteration. Google for 'cache trashing'
Aug 20 '08 #4
(6400-12800 )*sizeof(double) looks like L1 cache size, and next pit is like to be about L2 cache size.
Touché. Yes, that must be the answer, thank you.
Is there a way to know cache sizes at runtime? (I need to run the program on different machines.)
Put your array on the heap rather than the stack.
I'm dynamically allocating arrays with malloc (actually, with a function which calls malloc, since I need to manually align the addresses to 16 bytes, to use them with SSE registers). So, I think that in my implementation (gcc) they are already on the heap, why do you guess they are on the stack?
Aug 20 '08 #5
weaknessforcats
9,208 Expert Mod 8TB
why do you guess they are on the stack?
You would be surprised the number o C programmers that can't use malloc. Then they create huge arrays on the stack, exceed stack memory limits and die with all sorts of strange errors.
Aug 20 '08 #6
newb16
687 512MB
Is there a way to know cache sizes at runtime? (I need to run the program on different machines.)
1) CPUID command. or 2) run several attempts at startup with different array sizes measuring performance.
Aug 20 '08 #7

Sign in to post your reply or Sign up for a free account.

Similar topics

0
by: Dmitry Borisov | last post by:
Guys, Just get back to the shootout site and was redirected here: http://shootout.alioth.debian.org/index.php After I saw psyco having the same benchmarks as Python I tested that by myself and...
0
by: aaronwmail-usenet | last post by:
I've been wondering about benchmarks recently. What is a fair benchmark? How should benchmarks be vetted or judged? I decided to see what you folks thought, so for discussion I compared two...
82
by: nobody | last post by:
Howdy, Mike! mikecoxlinux@yahoo.com (Mike Cox) wrote in message news:<3d6111f1.0402271647.c20aea3@posting.google.com>... > I'm a C++ programmer, and have to use lisp because I want to use >...
3
by: Holger (David) Wagner | last post by:
Hi all, we're currently developing an application in the .NET environment that needs to access a DLL implemented in Fortran. There is one procedure with about 17 parameters, most of them arrays....
4
by: Sahil Malik [MVP] | last post by:
Okay so lets say I have a valuetype - lets say DateTime. Look at this code . List<DateTime> dt = new List<DateTime>() ; dt.Add(new dateTime(1999,12,1)) dt.AddDays(1) ; <--- This statement...
13
by: Shane Wright | last post by:
Hi, I've been trying to spec a new server for my company's database for a few weeks and one of the biggest problems I've had is trying to find meaningful performance information about how...
15
by: roberts.noah | last post by:
Are there any decent benchmarks of the difference between using c strings and std::string out there? Google isn't being friendly about it. Obviously this would be dependant on different...
4
by: Dibur | last post by:
I have been told to set the performance benchmarks for all the db2 databases hosted on all servers. I have no clue as to how to approach this problem. Can this be done theoritically through...
14
by: WStoreyII | last post by:
the following code is supposed to read a whole line upto a new line char from a file. however it does not work. it is producing weird results. please help. I had error checking in there for...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.