Hi Jerry,
thanks for your thorough response.
Jerry Coffin wrote:
In article <g2**********@nserver.hrz.tu-freiberg.de>, wa***@chemie.uni-
halle.de says...
>How does this:
>>>>x = myList[i].value1 + myList[i].value2
look to you?
This looks like it requires operator[] to be valid for whatever sort of
things myList is -- but that doesn't tell us a thing. Overloading
operator[] for a linked list is trivial.
OK
>>[...] optimizing away the
difference between array-style and pointer-style access has been routine
among nearly all compilers for years -- all versions of "Visual Studio"
(and versions of Microsoft C++ and Microsoft C since long before they
started using the "Visual" brand) have known how to do this kind of
optimization.
Not entirely true. The code I posted *will* show significant
differences between the pointer and array formulations, both between
gcc 3.x and 4.x *and* between VC++6 and VC++9. I tested it ;-)
I glanced through your code, and frankly I wasn't much impressed.
What the ... ;-)
First of all, what you timed was substantially different from what the OP
asked about. Second, you called the functions through a pointer
(repeatedly) and included the time to call the function via a pointer in
the overall time for the function. Third, you included the first run
through the data in the time for executing the code -- on the first run,
the data will frequently be in main memory, but on the second and
subsequent runs, it's likely to be in the cache, so the first run will
often be substantially slower than the second and subsequent runs.
All three points are somewhat unfounded, imho.
1) What was timed was simply interleaved sequential access to
a block of data that is non-trivial with respect to the L2 cache
of the underlying architecture. Taking out one
"dimension", as you did, also makes things too simple
for any optimizing compiler and wouldn't support my view ... ,
2) the call through a pointer happens 2x per test scenario, which
has to be compared with the NxN loop calculation inside the function;
that leaves an error ratio of only 2 / (20,000 x 20,000), even if
we assume the indirect call takes as many cycles as one inner-loop
calculation (which it obviously doesn't),
3) this doesn't really count because the data block initialization is
done immediately before the first test function invocation. BTW,
due to the "indirect call dispatcher design", a warm-up run was easy to
add (and I added it -- it didn't change any results, see enclosed source).
Here's a bit of code that gets rid of at least those specific problems:
#include <cstdlib>
[snipped]
...
struct listitem { int value1, value2; };
...
Why not include some other data item in between?
struct listitem { int value1, dummy, value2; };
int test_array(listitem const *list, size_t nitems)
{
    int x = 0;
    for (size_t i = 0; i < nitems; ++i)
        x += list[i].value1 + list[i].value2;
    return x;
}
This is just sequential block access, which can
be translated to assembler using only one index
register, one counter, and one accumulator.
[snipped]
std::cout << "Array value: " << val1;
std::cout << "\nPointer value: " << val2;
std::cout << "\nArray Time: " << time1;
std::cout << "\nPointer Time : " << time2;
On any of my machines, your code (using your
parameters) runs through within a few clock
ticks. But I see your point. On such simple
constructs, almost any compiler can deduce
the optimal access sequence and can easily
avoid repeated index offset multiplications.
Compiled with VC++ 6.0, using the default "Release" settings for a
console application, the results I get from a few runs of this are:
...
[snipped]
ohh, this is nice:
test_array:
... $L10257:
mov edi, DWORD PTR [ecx+esi*8+4]
add edi, DWORD PTR [ecx+esi*8]
add esi, 1
add eax, edi
cmp esi, edx
jb SHORT $L10246
...
and
test_pointer:
... $L10255:
mov esi, DWORD PTR [ecx+4]
add esi, DWORD PTR [ecx]
add ecx, 8
add eax, esi
cmp ecx, edx
jb SHORT $L10255
...
This makes my point absolutely clear: the
array access is *not* optimized away by
your compiler (maybe this is the asm of the
"debug" settings?). If the above is really
faster on my machine, that's a problem of
the instruction decoding/reordering or
whatever.
I also did a quick test by stripping the source code down to the parts
we care about:
[simplified code snipped]
...
and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
Visual C++ 1.0). Here's the assembly language it produces for the two
inner loops:
OK
test_array:
$F344:
; Line 8
mov ax,WORD PTR [bx-2]
add ax,WORD PTR [bx]
add cx,ax
add bx,4
dec dx
jne $F344
test_pointer:
$F355:
; Line 17
mov ax,WORD PTR [si+2]
add ax,WORD PTR [si]
add dx,ax
add si,4
cmp WORD PTR [bp-4],si ;list
ja $F355
Yes, that's simply using the adjacent positions
of the struct members:
mov ax, WORD PTR [bx-2]
add ax, WORD PTR [bx]
and advancing the pointer into the data block
by the struct size (16-bit ints?):
add bx,4
The compiler obviously converted the
"array access" into a "pointer access".
In this case, there may be a small difference between the two -- but
it's the array version that's faster. Even on a machine from that time
frame, the difference would be too small to care much about though --
given the memory constraints of MS-DOS, the time difference for the
largest array you could create would still have been less than a
millisecond.
Let's see what the new gcc 4.3.1 thinks about the inner loop
when full optimization is set (g++-4.3 -O3 -dA -c stuff.cxx -S):
[array]
..L6:
# basic block 3
addl 4(%ecx,%edx,8), %eax # displacement(base, index, scale)
addl (%ecx,%edx,8), %eax
addl $1, %edx
cmpl %edx, %ebx
jg .L6
[pointer]
..L14:
# basic block 3
addl 4(%edx), %eax
addl (%edx), %eax
addl $8, %edx
cmpl %ecx, %edx
jb .L14
What do we see here? The "array-loop" uses a different
addressing scheme (register-indirect with base and displacement)
compared to the "pointer-loop". This proves my point again:
the pointer loop is optimized into simpler and
shorter machine instructions. Due to modern processor
architectures and instruction translation/reordering,
one can't deduce from the above code which is faster --
but a difference might *possibly* show up on some architectures.
I modified my code and did rerun systematical comparisons
on some boxes:
--- 8< -----------------------------------------------------------
[C2Q-9300/3.33GHz gcc-4.3.1 -O3 -march=pentiumpro]
array: 1.1 sec pointer: 0.91 sec arr/ptr: 120.879%
[C2Q-9300/3.33GHz icc-10.1 -fast]
array: 1.1 sec pointer: 1.09 sec arr/ptr: 100.917%
[C2Q-6600/3.15GHz gcc-4.3.1 -O3 -march=pentiumpro]
array: 1.09 sec pointer: 0.91 sec arr/ptr: 119.78%
[C2Q-6600/3.15GHz icc-10.1 -fast]
array: 1.1 sec pointer: 1.11 sec arr/ptr: 99.0991%
[P4/2.6GHz gcc-4.3.1 -O3 -march=pentiumpro]
array: 1.61 sec pointer: 1.74 sec arr/ptr: 92.5287%
[P4/2.6GHz icc 10.1 -O3 -xN]
array: 1.97 sec pointer: 1.97 sec arr/ptr: 100%
[Athlon64/2.3GHz VC++6 /O2 /Zp4]
array: 1.875 sec pointer: 1.922 sec arr/ptr: 97.5546%
[Athlon64/2.3GHz VC++9 /O2 /Oy /Ot /Zp4]
array: 2.376 sec pointer: 1.89 sec arr/ptr: 125.714%
[Athlon64/2.3GHz gcc-mingw-3.4.2 -O3 -march=pentiumpro]
array: 3.656 sec pointer: 3.297 sec arr/ptr: 110.889%
--- 8< -----------------------------------------------------------
(using the code in the appendix). The *same* program shows
different behaviour, e.g. on P4 vs. Core2Q, obviously
due to the different chip internals ...
Regards
Mirco
code, modified: ==>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
struct listitem { int value1, dummy, value2; };
double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int n)
{
    std::cout << s << ' ' << std::flush; // print some progress message ...
    clock_t t = clock();  // the indirect call does not contribute to run-time,
    int x = p(list, n);   // it's just a dispatcher (and prevents inlining)
    return double(clock() - t + x) / CLOCKS_PER_SEC; // fool optimizer! (x == zero)
}
int test_array(listitem *list, int nitems)
{
    int x = 0;
    for (int j = 0; j < nitems; ++j)
        for (int i = 0; i < nitems; ++i)
            x += (list[i].value1 - list[j].value1) + (list[i].value2 - list[j].value2);
    return x;
}
int test_pointer(listitem *list, int nitems)
{
    int x = 0;
    for (listitem *pj = list; pj < list + nitems; ++pj)
        for (listitem *pi = list; pi < list + nitems; ++pi)
            x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
    return x;
}
int main(int argc, char *argv[])
{
    int nitems = (argc > 1) ? std::atoi(argv[1]) : 22222;
    listitem *list = new listitem[nitems];
    std::memset(list, 0x1A, sizeof(listitem) * nitems);
    timethis("warmup", test_array, list, nitems);
    double ta = 0, tp = 0;
    tp += timethis("pointer", test_pointer, list, nitems);
    ta += timethis("array", test_array, list, nitems);
    tp += timethis("pointer", test_pointer, list, nitems);
    ta += timethis("array", test_array, list, nitems);
    std::cout << "length: " << nitems << std::endl
              << "array: " << ta << " sec" << '\t'
              << "pointer: " << tp << " sec" << '\t'
              << "arr/ptr: " << (ta / tp) * 100. << "%" << std::endl;
    delete [] list;
    return 0;
}
<==