I've got a problem which may be interesting to some of you, and I'd be very grateful if there is someone who can give me some advice (or maybe redirect me to some other place).
Very brief introduction.
Since I'm learning vectorization with SSE (I use gcc 4.1.3 on Ubuntu 7.10) I've built a simple test program in C for comparing running times. I wanted to see how much improvement could give vectorization to simple iterated operations. So I made also some tests with standard arrays, pointers, and such (with and without SSE), and I stumbled upon some strange facts I cannot naively explain.
The (apparently) weird thing.
Let me show you an example. One of the test is the multiplication of a random-filled array of doubles for a fixed scalar.
The array has length L and the multiplication is repeated R times. (The repetition is the outer loop, while the array operation is in the inner loop). For example, the code for the indexed iteration is:
Expand|Select|Wrap|Line Numbers
- for (r = 0; r < repeat; r++) for (i = 0; i < length; i++) v[i] *= m;
These are the execution times in seconds (on a AMD processor). The program has been compiled with optimization -O2 and SSE flags on.
operation - repeat - array length -- SSE / indexed loop / pointer loop
sclmul 5000000 32 -- 0.19 0.31 0.24
sclmul 2500000 64 -- 0.16 0.29 0.22
sclmul 1250000 128 -- 0.15 0.28 0.21
sclmul 500000 320 -- 0.14 0.27 0.20
sclmul 250000 640 -- 0.14 0.27 0.20
sclmul 125000 1280 -- 0.13 0.26 0.20
sclmul 50000 3200 -- 0.13 0.27 0.20
sclmul 25000 6400 -- 0.14 0.27 0.20
sclmul 12500 12800 -- 0.24 0.35 0.32
sclmul 5000 32000 -- 0.24 0.35 0.32
sclmul 2500 64000 -- 0.25 0.36 0.32
sclmul 1250 128000 -- 0.44 0.55 0.51
sclmul 1000 160000 -- 0.60 0.71 0.65
sclmul 800 200000 -- 0.72 0.84 0.78
sclmul 500 320000 -- 0.72 0.84 0.77
sclmul 250 640000 -- 0.72 0.84 0.78
sclmul 125 1280000 -- 0.72 0.84 0.78
sclmul 62 2560000 -- 0.73 0.86 0.79
sclmul 50 3200000 -- 0.72 0.84 0.78
The strange fact is that as the array length increases, the execution time worsens a lot. In some other benchmarks (with copy operations), it feels like there is a cutoff above which the execution goes almost ten times slower (from 0.40 to 3.26)!
The fact is not linked to optimization: it happens even with no optimization on. And it is not linked to a specific processor, since I've run it on three different machines and I found similar results (with different "cutoffs", but there is always an array length for which the phenomenon occurs).
After some tinkering, I've found that _adding_ an extra extern loop -- which in fact "segments" the array as if it was the union of smaller arrays -- the slowness vanishes.
Expand|Select|Wrap|Line Numbers
- for (k = 0; k < kmax; k++) {
- start = k*length/kmax;
- for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m;
- }
Actually, I'm not able to explain it with a violation of locality of reference or other simple optimization techniques, since we are talking about a very simple operation which is simply iterated in a (very very) long array (remember that this happens even with optimization turned off).
I feel like I'm missing something fundamental.
Any idea?
Thank you for your time!
Luigi