Hi all,
I have a strange optimization problem. I have written a small program,
basically a matrix-vector multiplication at its core, that needs to
run as fast as possible.
The relevant code snippet is:
  for (e = start_e; e < end_e; e++)
      for (s = start_s; s < end_s; )
          *e += (*r++) * (*s++);
where all variables are float pointers, 'r' is the matrix, 's' the
vector and 'e' the result vector. I call the operation on the last
line a MAC (multiply-accumulate), a common measure of performance on
DSPs. The 'r' matrix is very large (about 15 MB) and does not fit into
the cache.
The program does a thousand iterations, each consisting of some setup
and of the matrix mult. above, and prints out the speed.
On Linux, using an Intel Xeon at 2.6 GHz (512 KB cache), I get the
following result:
Done in 0.42 seconds (2398.55 iterations/sec) (487.00 Mmac/sec)
The above result was with the optimizing Intel compiler v9.0, which
auto-vectorizes loops using SSE. The non-SSE version was only about
20% slower.
On Windows, using my Athlon 2700+, I get this:
Done in 1.81 seconds (551.61 iterations/sec) (112.00 Mmac/sec)
I then learned that my non-professional copy of Visual C++ does not
optimize binaries (!), so I downloaded the Microsoft Visual C++
Toolkit 2003, which claims to include the same optimizing compiler
featured in the professional version of Microsoft Visual C++. The
result is even worse:
Done in 1.85 seconds (540.53 iterations/sec) (109.75 Mmac/sec)
The Windows version was compiled with this command line:
  cl /O2 test2.c
Adding flags for SSE instructions did not help.
Does anyone have a clue what I'm doing wrong? The numbers are very
repeatable. Using a smaller 'r' matrix pushed the speed on the Linux
Xeon up to 1.5 Gmac/sec (!), while the Windows version on the Athlon
never went over 250 Mmac/sec.
Thanks for the answers :-)
Alfio