I have a C program I'm optimizing, and I would like to hear any idea about what is more important in here.
The program is very simple, I have a huge table of long integers (123MB aprox, on my PC that has Vista 32 bits with a Core 2 Duo the long integer is 32 bytes), who's format currently is:
Table = PART 0 PART 1 PART 2 PART 3 PART 4 PART 5 PART 6
Where PART i is a segment of the table.
These are the sizes of each part of the table, in bytes:
PART 1: 208
PART 2: 10816
PART 3: 276016
PART 4: 4597008
PART 5: 17565392
PART 6: 32352896
PART 7: 74718128
p is the pointer to the beggining of the table.
I later traverse the table like this:
Expand|Select|Wrap|Line Numbers
- clock_t clock_tTimer;
- long c1, c2, c3, c4, c5, c6, c7;
- long *p1, *p2, *p3, *p4, *p5, *p6;
- long i;
- long iSum=0;
- iSum=0;
- clock_tTimer=clock();
- for (i=0;i<NUMBER_OF_TESTS;i++) {
- for (c1 = 1; c1 < 47; c1++) {
- p1 = p+p[BASE_ADDRESS+c1];
- for (c2 = c1+1; c2 < 48; c2++) {
- p2 = p+p1[c2];
- for (c3 = c2+1; c3 < 49; c3++) {
- p3 = p+p2[c3];
- for (c4 = c3+1; c4 < 50; c4++) {
- p4 = p+p3[c4];
- for (c5 = c4+1; c5 < 51; c5++) {
- p5 = p+p4[c5];
- for (c6 = c5+1; c6 < 52; c6++) {
- p6 = p+p5[c6];
- for (c7 = c6+1; c7 < 53; c7++) {
- iSum+=p6[c7];
- }
- }
- }
- }
- }
- }
- }
- }
- clock_tTimer=clock()-clock_tTimer;
In the first one the states are generated in each part (the part is where I jump with cx, meaning that with c1 I jump to what I called part 1, with c2 to part 2, and so on) in the same order they are traversed later.
The work of the second version is still in progress and I can improve it further, but the idea is that it tries to minimize the distance between the jumps from one loop to the inner loop (for example the distance from p4 to p5.
For now the faster version is the first one (the one that has the states in each part ordered in the same way they are traversed), and the second version is trailing like for 20% of the speed that seems to be too much for me.
What do you think might be doing the first version to be faster?
Might this be related with a cache of the processor?
Do you think the second version with further organization to minimize the distances may beat the first version?
Do I have to understand something about the cache of the processor to better understand what to do?
Any idea on how to further optimize this?
TIA & Regards ...