weird benchmarks with c arrays

Hi there to everyone!
I've got a problem which may be interesting to some of you, and I'd be very grateful if there is someone who can give me some advice (or maybe redirect me to some other place).

Very brief introduction.
Since I'm learning vectorization with SSE (I use gcc 4.1.3 on Ubuntu 7.10) I've built a simple test program in C for comparing running times. I wanted to see how much improvement could give vectorization to simple iterated operations. So I made also some tests with standard arrays, pointers, and such (with and without SSE), and I stumbled upon some strange facts I cannot naively explain.

The (apparently) weird thing.
Let me show you an example. One of the test is the multiplication of a random-filled array of doubles for a fixed scalar.
The array has length L and the multiplication is repeated R times. (The repetition is the outer loop, while the array operation is in the inner loop). For example, the code for the indexed iteration is:

Expand|Select|Wrap|Line Numbers

 for (r = 0; r < repeat; r++) for (i = 0; i < length; i++) v[i] *= m; 
 

Amongst the benchmark I've tried, there is one test in which I modify the values of R and L keeping the product R * L constant.
These are the execution times in seconds (on a AMD processor). The program has been compiled with optimization -O2 and SSE flags on.

operation - repeat - array length -- SSE / indexed loop / pointer loop
sclmul 5000000 32 -- 0.19 0.31 0.24
sclmul 2500000 64 -- 0.16 0.29 0.22
sclmul 1250000 128 -- 0.15 0.28 0.21
sclmul 500000 320 -- 0.14 0.27 0.20
sclmul 250000 640 -- 0.14 0.27 0.20
sclmul 125000 1280 -- 0.13 0.26 0.20
sclmul 50000 3200 -- 0.13 0.27 0.20
sclmul 25000 6400 -- 0.14 0.27 0.20
sclmul 12500 12800 -- 0.24 0.35 0.32
sclmul 5000 32000 -- 0.24 0.35 0.32
sclmul 2500 64000 -- 0.25 0.36 0.32
sclmul 1250 128000 -- 0.44 0.55 0.51
sclmul 1000 160000 -- 0.60 0.71 0.65
sclmul 800 200000 -- 0.72 0.84 0.78
sclmul 500 320000 -- 0.72 0.84 0.77
sclmul 250 640000 -- 0.72 0.84 0.78
sclmul 125 1280000 -- 0.72 0.84 0.78
sclmul 62 2560000 -- 0.73 0.86 0.79
sclmul 50 3200000 -- 0.72 0.84 0.78

The strange fact is that as the array length increases, the execution time worsens a lot. In some other benchmarks (with copy operations), it feels like there is a cutoff above which the execution goes almost ten times slower (from 0.40 to 3.26)!
The fact is not linked to optimization: it happens even with no optimization on. And it is not linked to a specific processor, since I've run it on three different machines and I found similar results (with different "cutoffs", but there is always an array length for which the phenomenon occurs).
After some tinkering, I've found that _adding_ an extra extern loop -- which in fact "segments" the array as if it was the union of smaller arrays -- the slowness vanishes.

Expand|Select|Wrap|Line Numbers

 for (k = 0; k < kmax; k++) {

start = k*length/kmax;

for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m; 

}

(The piece of code above is awkward but it is actually up to ten time faster than a normal array sweep for very very large arrays -- kmax must be fine-tuned.)
Actually, I'm not able to explain it with a violation of locality of reference or other simple optimization techniques, since we are talking about a very simple operation which is simply iterated in a (very very) long array (remember that this happens even with optimization turned off).
I feel like I'm missing something fundamental.

Any idea?
Thank you for your time!

Luigi

Aug 20 '08 #1

Subscribe Post Reply

1872

newb16

687

512MB

(6400-12800 )*sizeof(double) looks like L1 cache size, and next pit is like to be about L2 cache size.

Aug 20 '08 #2

weaknessforcats

9,208

Expert Mod 8TB

Put your array on the heap rather than the stack.

Also, I hope this code:

for (k = 0; k < kmax; k++) {
start = k*length/kmax;
for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m;
}

does not have k, kmax, r and repeat as doubles. Operators like < don't work as expected for floating point. That could cause loop problems.

Also, avoid calculations in the condition part of a loop since these will be redone on each cycle. Better to calculate once, put the result in a variable and use the variable in the loop.

Aug 20 '08 #3

newb16

687

512MB

So when you process array slice-by-slice it is loaded into cache only once per each k iteration. Google for 'cache trashing'

Aug 20 '08 #4

Emiurgo

(6400-12800 )*sizeof(double) looks like L1 cache size, and next pit is like to be about L2 cache size.

Touché. Yes, that must be the answer, thank you.
Is there a way to know cache sizes at runtime? (I need to run the program on different machines.)

Put your array on the heap rather than the stack.

I'm dynamically allocating arrays with malloc (actually, with a function which calls malloc, since I need to manually align the addresses to 16 bytes, to use them with SSE registers). So, I think that in my implementation (gcc) they are already on the heap, why do you guess they are on the stack?

Aug 20 '08 #5

weaknessforcats

9,208

Expert Mod 8TB

why do you guess they are on the stack?

You would be surprised the number o C programmers that can't use malloc. Then they create huge arrays on the stack, exceed stack memory limits and die with all sorts of strange errors.

Aug 20 '08 #6

newb16

687

512MB

Is there a way to know cache sizes at runtime? (I need to run the program on different machines.)

1) CPUID command. or 2) run several attempts at startup with different array sizes measuring performance.

Aug 20 '08 #7

Similar topics

Some weird benchmarking numbers

by: Dmitry Borisov | last post by:

Guys, Just get back to the shootout site and was redirected here: http://shootout.alioth.debian.org/index.php After I saw psyco having the same benchmarks as Python I tested that by myself and...

Python

On benchmarks, heaps, priority queues

by: aaronwmail-usenet | last post by:

I've been wondering about benchmarks recently. What is a fair benchmark? How should benchmarks be vetted or judged? I decided to see what you folks thought, so for discussion I compared two...

Python

off-topic: Why is lisp so weird?

by: nobody | last post by:

Howdy, Mike! mikecoxlinux@yahoo.com (Mike Cox) wrote in message news:<3d6111f1.0402271647.c20aea3@posting.google.com>... > I'm a C++ programmer, and have to use lisp because I want to use >...

C / C++

P/Invoke to Fortran DLL - weird parameters...

by: Holger (David) Wagner | last post by:

Hi all, we're currently developing an application in the .NET environment that needs to access a DLL implemented in Fortran. There is one procedure with about 17 parameters, most of them arrays....

C# / C Sharp

Weird Valuetype behavior

by: Sahil Malik [MVP] | last post by:

Okay so lets say I have a valuetype - lets say DateTime. Look at this code . List<DateTime> dt = new List<DateTime>() ; dt.Add(new dateTime(1999,12,1)) dt.AddDays(1) ; <--- This statement...

C# / C Sharp

disk performance benchmarks

by: Shane Wright | last post by:

Hi, I've been trying to spec a new server for my company's database for a few weeks and one of the biggest problems I've had is trying to find meaningful performance information about how...

PostgreSQL Database

benchmarks of char* ops vs. std::string

by: roberts.noah | last post by:

Are there any decent benchmarks of the difference between using c strings and std::string out there? Google isn't being friendly about it. Obviously this would be dependant on different...

C / C++

Setting Performance Benchmarks

by: Dibur | last post by:

I have been told to set the performance benchmarks for all the db2 databases hosted on all servers. I have no clue as to how to approach this problem. Can this be done theoritically through...

DB2 Database

read line function outputs weird results

by: WStoreyII | last post by:

the following code is supposed to read a whole line upto a new line char from a file. however it does not work. it is producing weird results. please help. I had error checking in there for...

C / C++

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++