
Curious about loop optimization C++ - assembly

Hi,

Out of curiosity, I sometimes look at the assembly produced after compilation in release mode.

What you often see is that the C++ compiler always addresses registers explicitly to copy values from a to b, whereas STOSB/STOSW/STOSD (and likewise MOVS[x]) are single instructions that implicitly use the ESI and EDI registers (source and destination) to copy data.

This seems (imho) more efficient, yet the C++ compiler never uses this construct; it always emits many more instructions.

Imagine this loop (I simplified the idea; of course memcpy would normally be used):

DWORD anArray[10000];
DWORD somesource[10000];
int element = 0;   // start at the first element

// copy the array while skipping the odd element positions
for (int mycounter = 5000; mycounter != 0; mycounter--, element += 2)
    anArray[element] = somesource[element];
This could be optimized to:

// set up source, destination and counter
MOV EDI, [anArray]
MOV ESI, [somesource]
MOV ECX, mycounter
DEC ECX
CLD // forward copy

mylabel:
MOVSD // the actual copy instruction
LOOP mylabel // decrement ECX and loop until ECX == 0
Q: Is the construct mentioned above simply not that efficient, or is there a reason the C++ compiler team decided not to optimize to this level?
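
For reference: if I'm not mistaken, MSVC also exposes the string-move form as a compiler intrinsic, so it can be requested without hand-written assembly. A minimal sketch, assuming <intrin.h> and the __movsd intrinsic are available (the function name copy_block is made up):

#include <intrin.h>
#include <cstddef>

// rep movsd via the compiler intrinsic: copies `count` DWORDs from src to dst.
// Note this is a plain contiguous copy, so it does not reproduce the
// skip-every-other-element behaviour of the loop above.
void copy_block(unsigned long* dst, const unsigned long* src, std::size_t count)
{
    __movsd(dst, src, count);
}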

Mar 20 '06 #1
3 Replies


Egbert Nierop (MVP for IIS) wrote:
[...]
Q: Is the construct I mentioned simply not that efficient, or is there a reason the C++ compiler team decided not to optimize to this level?


The LOOP and MOVS instructions are horribly slow on modern CPUs because they
don't make effective use of the deep pipeline in the CPU. The longer
instruction sequence actually executes many times faster.
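
To illustrate (a rough sketch, not actual compiler output): a plain counted loop such as the one below compiles to simple register moves plus an add/compare/branch, possibly unrolled, and those instructions flow through a pipelined CPU far better than LOOP/MOVSD do.

#include <cstddef>

// Illustrative only: a straightforward DWORD copy loop. In release mode the
// compiler typically emits MOV reg,[mem] / MOV [mem],reg pairs plus a simple
// counter update and branch, and may unroll the loop on its own.
void copy_dwords(unsigned long* dst, const unsigned long* src, std::size_t count)
{
    for (std::size_t i = 0; i != count; ++i)
        dst[i] = src[i];
}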

IIRC, VC++ did generate LOOP/MOVS years ago (VC1-4 maybe?), but has gone
away from using those constructs since maybe the Pentium.

-cd
Mar 20 '06 #2


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:el**************@TK2MSFTNGP11.phx.gbl...
[...]

The LOOP and MOVS instructions are horribly slow on modern CPUs because
they don't make effective use of the deep pipeline in the CPU. The longer
instruction sequence actually executes many times faster.


Interesting!

This seems to confirm the remark of some C++/ASM programmer somewhere on the web, who stated that he could no longer optimize code better than the C++ compiler did. Normally I'd think "ok, so he was not up to the task", but it seems I was wrong (again :-) ).
Mar 20 '06 #3


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:el**************@TK2MSFTNGP11.phx.gbl...
[...]


Nope,
I've beaten the C++ optimization by 25% (testing on 100 MB!), though this might only hold on an Athlon 64 and not on other CPUs...

Anyway, you were right that one cannot state "the fewer ASM instructions, the faster"!

PS: The function below is not meant to 'decode' for real (it skips proper Unicode conversion). Just for fun...

void __stdcall AnsiToBstr(PCSTR ansi, BSTR bstr, int writtenLen)
{
//#ifdef _M_IX86
DWORD ticks = GetTickCount();

__asm XOR AH, AH              // clear the high byte of AX (the upper half of our 2-byte unicode char)
__asm MOV ECX, writtenLen     // initialize our loop counter
__asm DEC ECX
__asm MOV EDI, [bstr]         // destination index
__asm MOV ESI, [ansi]         // source index
__asm labell:
__asm MOV AL, BYTE PTR [ESI]  // load one string byte
__asm MOV [EDI], AX           // store it as a 2-byte wide char (high byte is zero)
__asm INC EDI
__asm INC EDI
__asm INC ESI
__asm DEC ECX
__asm JNZ labell

//#else
wprintf(L"%d\n", GetTickCount() - ticks);   // time taken by the asm loop
ticks = GetTickCount();

for (int loopit = writtenLen - 1; loopit != 0; loopit--, bstr++, ansi++)
    bstr[0] = ansi[0];                      // plain C++ version of the same copy

wprintf(L"%d\n", GetTickCount() - ticks);   // time taken by the C++ loop
//#endif
}
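
For completeness, a rough sketch of how the function above might be exercised; the test string and buffer handling are made up for illustration (link against oleaut32.lib for the BSTR helpers):

#include <windows.h>
#include <oleauto.h>
#include <cstdio>
#include <cstring>

int main()
{
    const char* ansi = "hello, world";
    int len = (int)strlen(ansi) + 1;            // writtenLen, including the terminator
    BSTR wide = SysAllocStringLen(NULL, len);   // uninitialized BSTR with room for `len` wide chars

    AnsiToBstr(ansi, wide, len);                // copies len-1 characters and prints both timings
    wide[len - 1] = L'\0';                      // terminate explicitly before printing

    wprintf(L"%s\n", wide);
    SysFreeString(wide);
    return 0;
}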
Mar 20 '06 #4
