Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results of my
first experiment were surprising, to say the least. After reading the chapter
on loops in my ASM book I wanted to test whether modern C compilers are
actually as smart as commonly claimed. I chose a very simple loop: calling
putchar 100 times. The first function (foo) uses a typical C-style loop to
test the assumption that "the compiler will optimize that better than any
human could". The second function (bar) is based on my newly gained knowledge;
the loop is basically ASM written in C. Now, I am certain that if I asked here
which one is more efficient, all you guys would reply "the compiler will most
likely generate the same code in both cases" (I have read such claims
countless times here). Well, look at the ASM output below to see how wrong
your assumption is.

/* C Code */

void foo(void)
{
    int i;

    for (i = 0; i < 100; i++) putchar('a');
}

void bar(void)
{
    int i = 100;

    do {
        putchar('a');
    } while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimal
code in both cases. But see the result:
/* GCC 4.3.0 on x86/Windows, -O2 */

/* foo */
L7:
subl $12, %esp
pushl $97
call _putchar

incl %ebx
addl $16, %esp
cmpl $100, %ebx
jne L7
/* bar */
L2:
subl $12, %esp
pushl $97
call _putchar
addl $16, %esp

decl %ebx
jne L2
Comment: See, even the most recent version of probably the most widely used
compiler cannot correctly optimize this most simple of loops! At least GCC
understood the bar loop, so my "write C like ASM" optimization worked.

At this point you might wonder what horrible things an average C compiler
will do when GCC already fails so badly. Here is the gruesome result:
/* lccwin32, optimize on */

/* foo */
_$4:
pushl $97
call _putchar
popl %ecx

incl %edi
cmpl $100,%edi
jl _$4
/* bar */
_$10:
pushl $97
call _putchar
popl %ecx

movl %edi,%eax
decl %eax
movl %eax,%edi
or %eax,%eax
jne _$10
Comment: lcc is unable to optimize the loop, just like GCC, but it adds
insult to injury by actually generating worse code for the ASM-style loop!
So you cannot even optimize the loop yourself!
/* MS Visual C++ 6 /O2 */

For this compiler I had to replace the putchar call with a call to a custom
my_putchar function, otherwise the compiler replaces the putchar calls with
direct OS API stuff. While this is a good optimization, it is not the
subject of this test and only makes the resulting asm harder to read, so I
suppressed it.
/* foo */

jmp SHORT $L833
$L834:
mov eax, DWORD PTR _i$[ebp]
add eax, 1
mov DWORD PTR _i$[ebp], eax
$L833:
cmp DWORD PTR _i$[ebp], 100
jge SHORT $L835

push 97
call _my_putchar
add esp, 4

jmp SHORT $L834
$L835:
/* bar */

$L840:
push 97
call _my_putchar
add esp, 4

mov eax, DWORD PTR _i$[ebp]
sub eax, 1
mov DWORD PTR _i$[ebp], eax
cmp DWORD PTR _i$[ebp], 0
jne SHORT $L840
Comment: Amazingly enough, this compiler has found yet another way to screw
up. Would you have thought that each compiler generates different code for
such a simple construct?
I hope you agree that the compiler of the beast deserves the award "Worst of
Show" for this mess. Are MS compilers still this bad?


Jun 27 '08 #1
copx wrote:
/* C Code */

void foo(void)
{
int i;

for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
int i = 100;

do {
putchar('a');
} while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
You didn't say what you expected to see. Did you optimise for speed or
space, or use a default?

The first compiler I tried partly unrolled both loops, generating near
identical code for both, which is what I'd expect for a default
optimisation.

The only difference was the starting condition and test.

--
Ian Collins.
Jun 27 '08 #2

"Ian Collins" <ia******@hotmail.com> wrote in message
news:67*************@mid.individual.net...
copx wrote:
>/* C Code */

void foo(void)
{
int i;

for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
int i = 100;

do {
putchar('a');
} while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
You didn't say what you expected to see.
I wasn't sure what to expect, that's why I tested it.
Did you optimise for speed or
space, or use a default?
Read the post again; I specified the compiler flags I used (e.g. -O2 for GCC).
When there was a choice (lcc's UI doesn't offer one) I chose "optimize for
speed".
The first compiler I tried partly unrolled both loops generating near
identical code for both, which is what I'd expect for a default
optimisation.
And which compiler was that? Just curious.
The only difference was the starting condition and test.
The test is the whole point of optimizing this loop on x86. You do not need
a test (cmp instruction) in the loop if you decrement towards zero instead
of incrementing towards 100. This saves one instruction in the body of the loop.
The other obvious optimizations are using a register to hold the counter,
and skipping the first check because it is known at compile time that the
loop will always execute at least once. Not one compiler managed to do all
of that when fed with the "the optimizer will understand" version (foo).
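To spell the transformation out, here is a minimal sketch of my own (not output
from any compiler) for a general, non-constant trip count. Note that with a
variable bound the compiler would also have to guard against a zero-trip loop
before switching to the do/while form, which is exactly why a constant bound of
100 makes this case so easy:

#include <stdio.h>

void foo_n(int n)
{
    int i;

    if (n > 0) {        /* guard: a for loop may run zero times */
        i = n;
        do {
            putchar('a');
        } while (--i);  /* decrement and branch; no separate compare */
    }
}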

Loop unrolling is a trickier optimization. You sacrifice code size for
speed. Or let's say the hope of speed, because the increased code size might
end up making the code slower in the end.
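For illustration, a hand-unrolled version of foo would look roughly like this
(my own sketch; the factor of four is arbitrary and relies on 100 dividing
evenly, so no clean-up loop is needed):

#include <stdio.h>

void foo_unrolled(void)
{
    int i;

    for (i = 0; i < 100; i += 4) {  /* four calls per iteration */
        putchar('a');
        putchar('a');
        putchar('a');
        putchar('a');
    }
}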


Jun 27 '08 #3
copx wrote:
"Ian Collins" <ia******@hotmail.com> wrote in message
news:67*************@mid.individual.net...
>copx wrote:
>>/* C Code */

void foo(void)
{
int i;

for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
int i = 100;

do {
putchar('a');
} while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
You didn't say what you expected to see.

I wasn't sure what to expect, that's why I tested it.
>Did you optimise for speed or
space, or use a default?

Read the post again, I specified the used compiler flags (e.g. -O2 for GCC).
When there was a choice (lcc's UI doesn't offer one) I chose "optimize for
speed".
Now everyone knows what a specific compiler's flags do.
>The first compiler I tried partly unrolled both loops generating near
identical code for both, which is what I'd expect for a default
optimisation.

And which compiler was that? Just curious.
Sun c99.
>The only difference was the starting condition and test.

The test is the whole point of optimizing this loop on x86. You do not need
a test (cmp instruction) in the loop if you decrement towards zero instead
of incrementing towards 100. This saves one instruction in the body of loop.
You still have to test for 0, which may or may not be faster.
The other obvious optimizations are using a register to hold the counter,
and skipping the first check because it is known at compile time that the
loop will always execute at least once. Not one compiler managed to do all
that when feed with the "the optimizer will understand" version (foo).
c99 appears to; it generated:

movl $100,%ebx ;/ line : 18
leaq __iob+128(%rip),%r12 ;/ line : 18
.align 16
..CG6.21:
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
addl $-5,%ebx ;/ line : 19
..LU7.69:
testl %ebx,%ebx ;/ line : 19
jne .CG6.21 ;/ line : 19

--
Ian Collins.
Jun 27 '08 #4
On Apr 22, 6:38 pm, "copx" <c...@gazeta.pl> wrote:
Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results of my
first experiment were surprising to say at least. After reading the chapter
on loops in my ASM book I wanted to test whether modern C compilers are
actually as smart as commonly claimed. I chose a most simple loop: calling
putchar 100 times. The first function (foo) uses a typical C style loop to
test the assumption that "the compiler will optimize that better than any
human could". The second function (bar) is based my newly gained knowledge,
the loop is basically ASM written in C. Now I am certain if I asked here
which one is more efficient all you guys would reply "the compiler will most
likely generate the same code in both cases" (I have read such claims
countless times here). Well, look at the ASM output below to see how wrong
your assumption is.

/* C Code */

void foo(void)
{
 int i;

 for (i = 0; i < 100; i++) putchar('a');

}

void bar(void)
{
 int i = 100;

 do {
  putchar('a');
 } while (--i);

}
...(GCC and LccWin stuff snipped)...
>
/* MS Visual C++ 6 /O2 */

For this compiler I had to replace the putchar call with a call to a custom
my_putchar function otherwise the compiler replaces the putchar calls with
direct OS API stuff. While this is a good optimization it is not the
subject of this test, and only makes the resulting asm harder to read, so I
supressed that.

/* foo */

        jmp     SHORT $L833
$L834:
        mov     eax, DWORD PTR _i$[ebp]
        add     eax, 1
        mov     DWORD PTR _i$[ebp], eax
$L833:
        cmp     DWORD PTR _i$[ebp], 100
        jge     SHORT $L835

        push    97
        call    _my_putchar
        add     esp, 4

        jmp     SHORT $L834
$L835:

/* bar */

$L840:
        push    97
        call    _my_putchar
        add     esp, 4

        mov     eax, DWORD PTR _i$[ebp]
        sub     eax, 1
        mov     DWORD PTR _i$[ebp], eax
        cmp     DWORD PTR _i$[ebp], 0
        jne     SHORT $L840

Comment: Amazingly enough, this compiler has found yet another way to screw
up. Would you have thought that each compiler generates different code for
such a simple construct?
I hope you agree that the compiler of the beast deserves the award "Worst of
Show" for this mess. Are MS compilers still this bad?

So just how much abuse do *you* think you deserve for testing, and
complaining about, a compiler over a *decade* old? Especially when
you can download a current version for free.

But be that as it may, you've clearly not managed to run the compiler
correctly, because my copy of VC6 generates the following when run
with -O2 or -Ox. In fact, the output you included appears to be the
default, non-optimized output for VC6.
MSVC6 ("cl -Ox -c -Fa test48.c"):

_foo PROC NEAR
; File test48.c
; Line 2
push esi
; Line 6
mov esi, 100 ; 00000064H
$L90:
push 97 ; 00000061H
call _putchar
add esp, 4
dec esi
jne SHORT $L90
pop esi
; Line 10
ret 0
_foo ENDP
_TEXT ENDS
PUBLIC _bar
_TEXT SEGMENT
_bar PROC NEAR
; Line 14
push esi
; Line 15
mov esi, 100 ; 00000064H
$L98:
; Line 18
push 97 ; 00000061H
call _putchar
add esp, 4
; Line 19
dec esi
jne SHORT $L98
pop esi
; Line 23
ret 0
_bar ENDP

Jun 27 '08 #5
"copx" <co**@gazeta.plwrote in news:fu**********@inews.gazeta.pl:
Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results
of my first experiment were surprising to say at least. After reading
the chapter on loops in my ASM book I wanted to test whether modern C
compilers are actually as smart as commonly claimed. I chose a most
simple loop: calling putchar 100 times.
And how many programs have you written which just print 100 a's?

The argument for relying on the compiler has never been that it is
perfect. Rather, in most cases, given a clean algorithm, you will get
good enough results.

Then, if there are performance sensitive areas of the code that are
being executed too slowly given your criteria, you think hard about the
algorithm and see if you can improve performance where it matters most
through algorithmic changes.

Maybe then it is time to figure out if there are any remaining
bottlenecks that can be solved through hand tuning.

Sure, you can hand-tune every single part of a large application but
the world will have moved on by then.

An example of this is seen in the example you chose. Any optimizations
you make in decrementing/incrementing loop counters will be dwarfed by
the fact that you have to make 100 IO calls.

Constructing the string once and making one puts call should improve
things more than fiddling with the loop counter especially if this
function is called repeatedly.

Look at the tests below. It takes about 3 seconds to run your version
(t1.c) 1000 times. Whereas the version with just a single IO call per
invocation (t2.c) (albeit with a longer string) takes only 0.2 seconds
to finish the 1000 calls.

As another check, I changed this second version to randomize the
contents of the string. Even including the calls to rand, 1000
invocations still took about 0.2 seconds.

E:\Test>cat t1.c
#include <stdio.h>

void foo(void) {
int i;
for (i = 0; i < 100; i++) {
putchar('a');
}
putchar('\n');
}

int main(void) {
int i;
for ( i = 0; i < 1000; ++i ) {
foo();
}
return 0;
}
E:\Test>cl /O2 /nologo t1.c
t1.c

TimeThis : Command Line : t1
TimeThis : Start Time : Tue Apr 22 21:46:28 2008
TimeThis : End Time : Tue Apr 22 21:46:31 2008
TimeThis : Elapsed Time : 00:00:03.047

E:\Test>cat t2.c
#include <stdio.h>
#include <string.h>

void foo(void) {
char x[101];
memset( x, 'a', 100 );
x[100] = 0;
puts( x );
return;
}

int main(void) {
int i;
for ( i = 0; i < 1000; ++i) {
foo();
}
return 0;
}
E:\Test>cl /O2 /nologo t2.c
t2.c

TimeThis : Command Line : t2
TimeThis : Start Time : Tue Apr 22 21:50:20 2008
TimeThis : End Time : Tue Apr 22 21:50:20 2008
TimeThis : Elapsed Time : 00:00:00.187

For control, here is what I get with a dummy function:

E:\Test>cat t3.c
void foo(void) { return; }

int main(void) {
int i = 0;
for ( i = 0; i < 1000; ++i) {
foo();
}
return 0;
}

E:\Test>cl /O2 /nologo t3.c
t3.c

TimeThis : Command Line : t3
TimeThis : Start Time : Tue Apr 22 21:52:39 2008
TimeThis : End Time : Tue Apr 22 21:52:39 2008
TimeThis : Elapsed Time : 00:00:00.125
In between each invocation, I ran the following program to combat any
kind of cache effects:

E:\Test>cat flushmem.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
char *p;
size_t bufsize = 1024;

while ( p = malloc( bufsize ) ) {
memset( p, 0xda, bufsize );
free( p );
bufsize *= 2;
printf("%x\n", bufsize/1024);
}

return 0;
}

--
A. Sinan Unur <1u**@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
Jun 27 '08 #6

"A. Sinan Unur" <1u**@llenroc.ude.invalid> wrote in message
news:Xn****************************@127.0.0.1...
"copx" <co**@gazeta.plwrote in news:fu**********@inews.gazeta.pl:
>Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results
of my first experiment were surprising to say at least. After reading
the chapter on loops in my ASM book I wanted to test whether modern C
compilers are actually as smart as commonly claimed. I chose a most
simple loop: calling putchar 100 times.

And how many programs have you written which just print 100 a's?
Come on! Obviously the point of this test was to figure out whether the
compilers are smart enough to optimize a for (i = 0; i < x; i++) loop into an
i = x; do {...} while (--i); loop. That is a simple micro-optimization you can do
(on x86 and any other platform where the dec/jump trick works) and exactly
the kind of stuff most regulars here would claim "is done by the compiler
anyway". I just proved that this common assumption is wrong.

[snip stuff about IO]

As I said the point of this was not printing characters. Maybe I should have
made it clearer by calling a function with no specified purpose in the loop.
Jun 27 '08 #7

"Ian Collins" <ia******@hotmail.com> wrote in message
news:67*************@mid.individual.net...
>The test is the whole point of optimizing this loop on x86. You do not
need
a test (cmp instruction) in the loop if you decrement towards zero
instead
of incrementing towards 100. This saves one instruction in the body of
loop.

You still have to test for 0, which may or may not be faster.
No you don't, and that's the point. Look at the output of GCC for the "bar"
function, for example; see any cmp? The dec instruction sets the necessary
flags to terminate the loop when we reach zero. That's a feature of the x86
instruction set (and others) that a compiler could/should exploit to create more
efficient loops.
>The other obvious optimizations are using a register to hold the counter,
and skipping the first check because it is known at compile time that the
loop will always execute at least once. Not one compiler managed to do
all
that when feed with the "the optimizer will understand" version (foo).
c99 appears to, generated

movl $100,%ebx ;/ line : 18
leaq __iob+128(%rip),%r12 ;/ line : 18
.align 16
.CG6.21:
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
addl $-5,%ebx ;/ line : 19
.LU7.69:
testl %ebx,%ebx ;/ line : 19
jne .CG6.21 ;/ line : 19
Yet another way to translate this simple construct. Compilers certainly have
personality!

Jun 27 '08 #8
"copx" <co**@gazeta.plwrote in news:fu**********@inews.gazeta.pl:
>
"A. Sinan Unur" <1u**@llenroc.ude.invalid> wrote in message
news:Xn****************************@127.0.0.1...
>"copx" <co**@gazeta.plwrote in news:fu**********@inews.gazeta.pl:
>>Optimizers are overrated

I started learning ASM not long ago to improve my understanding of
the hardware architecture and my ability to optimize C code. The
results of my first experiment were surprising to say at least.
After reading the chapter on loops in my ASM book I wanted to test
whether modern C compilers are actually as smart as commonly
claimed. I chose a most simple loop: calling putchar 100 times.

And how many programs have you written which just print 100 a's?

Come on! Obviously the point of this test was to figure out whether
the compilers are smart enough to optimize
And the point of my post was that the programmer ought to be smart enough
to understand when the effort spent in beating the compiler is
worth it.
loop to a i = x do {}while (--i) loop. That is a simple
micro-optimization you can do (on x86 and any other platform where the
dec/jump trick works) and exactly the kind of stuff most regulars here
would claim "is done by the compiler anyway". I just proved that this
common assumption in wrong.
And I don't see why anyone ought to care even if you are correct.

Sinan
--
A. Sinan Unur <1u**@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
Jun 27 '08 #9

<ro***********@yahoo.com> wrote in message
news:10**********************************@f36g2000hsa.googlegroups.com...
[snip]
>So just how much abuse do *you* think you deserve for testing, and
complaining about, a compiler over a *decade* old?
Eh, none? Since when is testing old software offensive?
I did not try to claim that this reflects the current performance of the MS
compiler. In fact, I explicitly asked "Are MS compilers still this bad?"
>Especially when you can download a current version for free.
The current version is not simply better than the old one. AFAIK software
built with VS2008 won't run on older versions of Windows or so I have been
told.
>But be that as it may, you've clearly not managed to run the compiler
correctly

Maybe, I have never used the command line version before and rarely use VC
anyway.
>because my copy of VC6 generates the following when run
with -O2 or -Ox. In fact, the output you included appears to be the
default, non-optimized output for VC6.
I compiled with /O2. So -O2 is the correct form? Strange, I could swear cl
/help suggested the /O2 form. But maybe I confused something there.
MSVC6 ("cl -Ox -c -Fa test48.c"):
[snip]

If that is true, VC6 moves from the bottom to the top. The first compiler
which actually manages to properly optimize the for loop!

Jun 27 '08 #10
copx wrote:
"A. Sinan Unur" <1u**@llenroc.ude.invalid> wrote in message
news:Xn****************************@127.0.0.1...
>"copx" <co**@gazeta.plwrote in news:fu**********@inews.gazeta.pl:
>>Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results
of my first experiment were surprising to say at least. After reading
the chapter on loops in my ASM book I wanted to test whether modern C
compilers are actually as smart as commonly claimed. I chose a most
simple loop: calling putchar 100 times.
And how many programs have you written which just print 100 a's?

Come on! Obviously the point of this test was to figure out whether the
compilers are smart enough to optimize a for (i = 0; i < x; i++) loop to a i
= x do {}while (--i) loop. That is a simple micro-optimization you can do
(on x86 and any other platform where the dec/jump trick works) and exactly
the kind of stuff most regulars here would claim "is done by the compiler
anyway". I just proved that this common assumption in wrong.
If you give your car a nice fresh coat of wax, do you
improve its fuel efficiency by reducing air drag or hurt
the efficiency by increasing weight?

Compiler writers do not stay up nights trying to figure
out how to optimize silly loops; they give their attention
to getting more "serious" programs to run well. They worry
about how to map many local variables to a few CPU registers;
you don't have enough variables to notice. They worry about
eliminating common sub-expressions; you don't have any and
again can't see any optimization. They worry about strength
reductions; your sample has no strength to be reduced. They
worry about cache lines, about prefetching, about filling the
various instruction pipelines, about branch prediction ...
and you are oblivious to all of these.

You'd probably call Bach overrated because he never wrote
any good kazoo concertos.

--
Eric Sosman
es*****@ieee-dot-org.invalid
Jun 27 '08 #11

"A. Sinan Unur" <1u**@llenroc.ude.invalid> wrote in message
news:Xn****************************@127.0.0.1...
[snip]
And the point of my post that the programmer ought to be smart enough to
be able understand when the effort spent in beating the compiler is
worth it.
I wasn't trying to beat the compiler. I just measured its performance.

I disagree with your point that optimization should be limited to profiled
bottlenecks and choosing the right algorithm. In fact, I suspect that this
common belief (which really isn't news to me - you hear that from 90% of
all programmers these days) is responsible for software disasters like
Microsoft Vista. If you ignore efficiency issues completely while writing
your program you will not be able to just shave away all the wasted RAM and
CPU cycles at the end by rewriting a single central algorithm after
profiling. Whatever, when to optimize or not is not the topic of this
thread. In the professional world the answer to that question is determined
by market forces anyway I guess.

What I am trying to discuss here is what you can expect a C compiler to
optimize and what not. I posted my results to counter the misinformation
which has been spread here in the past (without bad intent most of the time,
for sure).

[snip]

Jun 27 '08 #12
On Apr 22, 10:41 pm, "copx" <c...@gazeta.pl> wrote:
I wasn't trying to beat the compiler. I just measured its performance.

Of course, you didn't do that either - just counting generated
instructions is hardly definitive on modern processors. For example,
dec/jne is faster on many processors than a single loop instruction.
Also the current versions of MSVC generate "sub esi,1" rather than
"dec esi", since the former is faster on many CPUs.

Modern CPUs are complex enough that actually measuring performance is
the only way to tell if a particular optimization is successful.
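A minimal harness for that kind of measurement could look like the sketch
below (an illustration only: it reuses the foo/bar functions from the original
post, the repeat count is arbitrary, and you would want to redirect stdout to a
file so you are not mostly timing the console):

#include <stdio.h>
#include <time.h>

void foo(void);   /* for-loop version from the original post */
void bar(void);   /* do/while count-down version */

int main(void)
{
    clock_t t0, t1;
    int i;

    t0 = clock();
    for (i = 0; i < 10000; i++) foo();
    t1 = clock();
    fprintf(stderr, "foo: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (i = 0; i < 10000; i++) bar();
    t1 = clock();
    fprintf(stderr, "bar: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}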
Jun 27 '08 #13

"Eric Sosman" <es*****@ieee-dot-org.invalid> wrote in message
news:8P******************************@comcast.com...
[snip]
If you give your car a nice fresh coat of wax, do you
improve its fuel efficiency by reducing air drag or hurt
the efficiency by increasing weight?

Compiler writers do not stay up nights trying to figure
out how to optimize silly loops; they give their attention
to getting more "serious" programs to run well.
"Serious" programs probably contain many such "silly loops". And what
exactly is "silly" about loops based on incrementing/decrementing an integer
value counting towards a maximum/minimum? Have you written many "serious"
programs without them?
A loop like this executed, let's say, a million times means one million wasted
instructions. Of course, that won't matter most of the time; I am not
trying to argue with that.
They worry about how to map many local variables to a few CPU registers;
you don't have enough variables to notice. They worry about
eliminating common sub-expressions; you don't have any and
again can't see any optimization. They worry about strength
reductions; your sample has no strength to be reduced. They
worry about cache lines, about prefetching, about filling the
various instruction pipelines, about branch prediction ...
and you are oblivious to all of these.
Thanks for giving me ideas for what to test next!
You'd probably call Bach overrated because he never wrote
any good kazoo concertos.
Error: analogy mismatch

Jun 27 '08 #14

<ro***********@yahoo.com> wrote in message
news:de**********************************@z72g2000hsb.googlegroups.com...
On Apr 22, 10:41 pm, "copx" <c...@gazeta.plwrote:
I wasn't trying to beat the compiler. I just measured its performance.

Of course, you didn't do that either - just counting generated
instructions is hardly definitive on modern processors.
Ok, good point. I will measure execution time next time, too.

Jun 27 '08 #15
On Apr 23, 6:41 am, "copx" <c...@gazeta.pl> wrote:
"A. Sinan Unur" <1...@llenroc.ude.invalid> wrote in message news:Xn****************************@127.0.0.1...
[snip]
And the point of my post that the programmer ought to be smart enough to
be able understand when the effort spent in beating the compiler is
worth it.

I wasn't trying to beat the compiler. I just measured its performance.

I disagree with your point that optimization should be limited to profiled
bottlenecks and choosing the right algorithm. In fact, I suspect that this
common belief (which really isn't news to me - you hear that from 90% of
all programmers these days) is responsible for software disasters like
Microsoft Vista.
So, you read a few pages of an assembly book, and suddenly you know
better than compiler and OS devs?
Optimization is hardly about the smallest output possible.
Your remark about the 'common belief' that leads to 'software
disasters' is just stupid, to the point that I'd claim you are either a
troll or very new to programming.
My suggestion is that you finish the book you are reading, then read
your processor's manuals, then read about various compiler
optimization methods, and then start this discussion again.
If you want a compiler for Intel that optimizes very well, try icc
(the Intel C++ compiler).
It's developed by Intel, and I'd expect it to be the best compiler
for Intel products.
Jun 27 '08 #16
On Apr 22, 4:38 pm, "copx" <c...@gazeta.pl> wrote:
Optimizers are overrated
...(original post and compiler output snipped)...
I hope you agree that the compiler of the beast deserves the award "Worst of
Show" for this mess. Are MS compilers still this bad?
From this:
#include <stdio.h>
/* C Code */
void foo(void)
{
int i;

for (i = 0; i < 100; i++)
putchar('a');
}

void bar(void)
{
int i = 100;
do {
putchar('a');
} while (--i);
}

int main(void)
{
foo();
bar();
return 0;
}

I got this:
; Listing generated by Microsoft (R) Optimizing Compiler Version
14.00.50727.762

TITLE c:\tmp\foo.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB OLDNAMES

EXTRN __imp__putchar:PROC
PUBLIC @bar@0
; Function compile flags: /Ogtpy
; File c:\tmp\foo.c
; COMDAT @bar@0
_TEXT SEGMENT
@bar@0 PROC ; COMDAT

; 12 : {

00000 56 push esi

; 13 : int i = 100;

00001 8b 35 00 00 00
00 mov esi, DWORD PTR __imp__putchar
00007 57 push edi
00008 bf 64 00 00 00 mov edi, 100 ; 00000064H
0000d 8d 49 00 npad 3
$LL3@:

; 14 : do {
; 15 : putchar('a');

00010 6a 61 push 97 ; 00000061H
00012 ff d6 call esi
00014 83 c4 04 add esp, 4

; 16 : } while (--i);

00017 83 ef 01 sub edi, 1
0001a 75 f4 jne SHORT $LL3@
0001c 5f pop edi
0001d 5e pop esi

; 17 : }

0001e c3 ret 0
@bar@0 ENDP
_TEXT ENDS
PUBLIC @foo@0
; Function compile flags: /Ogtpy
; COMDAT @foo@0
_TEXT SEGMENT
@foo@0 PROC ; COMDAT

; 4 : {

00000 56 push esi

; 5 : int i;
; 6 :
; 7 : for (i = 0; i < 100; i++)

00001 8b 35 00 00 00
00 mov esi, DWORD PTR __imp__putchar
00007 57 push edi
00008 bf 64 00 00 00 mov edi, 100 ; 00000064H
0000d 8d 49 00 npad 3
$LL3@:

; 8 : putchar('a');

00010 6a 61 push 97 ; 00000061H
00012 ff d6 call esi
00014 83 c4 04 add esp, 4
00017 83 ef 01 sub edi, 1
0001a 75 f4 jne SHORT $LL3@
0001c 5f pop edi
0001d 5e pop esi

; 9 : }

0001e c3 ret 0
@foo@0 ENDP
_TEXT ENDS
PUBLIC _main
; Function compile flags: /Ogtpy
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 20 : {

00000 56 push esi

; 21 : foo();

00001 8b 35 00 00 00
00 mov esi, DWORD PTR __imp__putchar
00007 57 push edi
00008 bf 64 00 00 00 mov edi, 100 ; 00000064H
0000d 8d 49 00 npad 3
$LL5@main:
00010 6a 61 push 97 ; 00000061H
00012 ff d6 call esi
00014 83 c4 04 add esp, 4
00017 83 ef 01 sub edi, 1
0001a 75 f4 jne SHORT $LL5@main

; 22 : bar();

0001c bf 64 00 00 00 mov edi, 100 ; 00000064H
$LL10@main:
00021 6a 61 push 97 ; 00000061H
00023 ff d6 call esi
00025 83 c4 04 add esp, 4
00028 83 ef 01 sub edi, 1
0002b 75 f4 jne SHORT $LL10@main
0002d 5f pop edi

; 23 : return 0;

0002e 33 c0 xor eax, eax
00030 5e pop esi

; 24 : }

00031 c3 ret 0
_main ENDP
_TEXT ENDS
END

I would be interested to see some example of a C algorithm, then the
same algorithm in assembly, and then a demonstration that yours is actually
measurably faster.
Jun 27 '08 #17
copx wrote:
Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results of my
first experiment were surprising to say at least. After reading the
chapter on loops in my ASM book I wanted to test whether modern C
compilers are actually as smart as commonly claimed. I chose a most simple
loop: calling putchar 100 times. The first function (foo) uses a typical C
style loop to test the assumption that "the compiler will optimize that
better than any human could". The second function (bar) is based my newly
gained knowledge, the loop is basically ASM written in C. Now I am certain
if I asked here which one is more efficient all you guys would reply "the
compiler will most likely generate the same code in both cases" (I have
read such claims countless times here). Well, look at the ASM output below
to see how wrong your assumption is.

/* C Code */

void foo(void)
{
int i;

for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
int i = 100;

do {
putchar('a');
} while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
/* GCC 4.3.0 on x86/Windows, -O2 */

/* foo */
L7:
subl $12, %esp
pushl $97
call _putchar

incl %ebx
addl $16, %esp
cmpl $100, %ebx
jne L7
/* bar */
L2:
subl $12, %esp
pushl $97
call _putchar
addl $16, %esp

decl %ebx
jne L2
Comment: See, even the most recent version of the probably most widely
used compiler can not correctly optimize a most simple loop! At least GCC
understood the bar loop, so my "write C like ASM" optimization worked.
gcc tends to produce horrible code (I could show you examples that are
downright silly) and x86 is a horrible architecture. But curiously gcc
3.4.1 *does* "optimise" your two functions to the same code on PA-RISC
(various assembler directives and stackframe setup code snipped):

foo
[...]
ldi 100,%r3
ldo -1(%r3),%r3
L$0011
bl putchar,%r2
ldi 97,%r26
comib,<>,n 0,%r3,L$0011
ldo -1(%r3),%r3
[...]

bar
[...]
ldi 100,%r3
ldo -1(%r3),%r3
L$0018
bl putchar,%r2
ldi 97,%r26
comib,<>,n 0,%r3,L$0018
ldo -1(%r3),%r3
[...]

Note: This code looks strange due to branch delay slots.

So one has to wonder why the same optimisation is not done on x86. Maybe it
simply doesn't matter nowadays.
Jun 27 '08 #18
"copx" <co**@gazeta.plwrote in message
news:fu**********@inews.gazeta.pl...
Optimizers are overrated
void foo(void)
{
int i;
for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
int i = 100;

do {
putchar('a');
} while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
This is not very interesting code to optimise; there's nothing much for the
compiler to get its teeth into. So perhaps the attempts were half-hearted.

I never believed compilers were getting as good as hand-coded asm, but
recently I almost came to believe the hype.

Just yesterday I finished porting to C, the key module of a byte-code
interpreter. Testing this on only one input program, the results on two
compilers I tried (x86 target) were:

lccwin32 3000ms with -O
gcc 3.4.5 1600ms with no optimisation (any -O switches made much slower)
original 800ms, using tight mix of asm and compiled code (not C).

And these times were /after/ considerable tweaking of the C code.

Have some avenues to explore yet, but so far disappointing.

--
Bart

Jun 27 '08 #19
copx wrote:
>
.... snip ...
>
Come on! Obviously the point of this test was to figure out whether
the compilers are smart enough to optimize a for (i = 0; i < x; i++)
loop to a i = x do {}while (--i) loop. That is a simple micro-
optimization you can do (on x86 and any other platform where the
dec/jump trick works) and exactly the kind of stuff most regulars
here would claim "is done by the compiler anyway". I just proved
that this common assumption in wrong.
No you didn't. You proved that makeing the termination depend on
the variable reaching 0 differs from having the variable reach 100.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
** Posted from http://www.teranews.com **
Jun 27 '08 #20
copx wrote:
[...]
A loop like this executed lets say a million times means one million wasted
instructions. Of course, that won't matter most the the time, I am not
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
trying to argue with that.
Well put.

--
Eric Sosman
es*****@ieee-dot-org.invalid
Jun 27 '08 #21


copx wrote:
Optimizers are overrated

The first function (foo) uses a typical C style loop to
test the assumption that "the compiler will optimize that better than any
human could". The second function (bar) is based my newly gained knowledge,
the loop is basically ASM written in C. Now I am certain if I asked here
which one is more efficient all you guys would reply "the compiler will most
likely generate the same code in both cases"
You made the assumption that the two pieces of code are equivalent
they are not.

Regards,

--
Walter Banks
Byte Craft Limited
Tel. (519) 888-6911
http://www.bytecraft.com
wa****@bytecraft.com


Jun 27 '08 #22
Walter wrote:
) copx wrote:
)Optimizers are overrated
)>
)The first function (foo) uses a typical C style loop to
)test the assumption that "the compiler will optimize that better than any
)human could". The second function (bar) is based my newly gained knowledge,
)the loop is basically ASM written in C. Now I am certain if I asked here
)which one is more efficient all you guys would reply "the compiler will most
)likely generate the same code in both cases"
)
) You made the assumption that the two pieces of code are equivalent
) they are not.

They are too. The variable 'i' is not used for anything but
counting the loop. I've seen GCC produce the same code for
the two snippets and that was several years back. Before 3.0.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Jun 27 '08 #23
Bartc wrote, On 23/04/08 11:34:

<snip>
I never believed compilers were getting as good as hand-coded asm, but
recently I almost came to believe the hype.

Just yesterday I finished porting to C, the key module of a byte-code
interpreter. Testing this on only one input program, the results on two
compilers I tried (x86 target) were:

lccwin32 3000ms with -O
gcc 3.4.5 1600ms with no optimisation (any -O switches made much slower)
original 800ms, using tight mix of asm and compiled code (not C).

And these times were /after/ considerable tweaking of the C code.

Have some avenues to explore yet, but so far disappointing.
Why do you think comparing your original code against using a compiler
*without* optimisation has any bearing on how good the compiler is if you
tell it to produce good code? Try actually telling the compiler to
optimise and telling it what type of processor to optimise for and what
the minimum processor it can assume is (this defaults to an 80386 for
Intel, I believe; do you really need to support processors that old?).
Jun 27 '08 #24
On Wed, 23 Apr 2008 01:38:18 +0200, copx wrote:
Optimizers are overrated
In some ways yes. In others, no.

For example, in the code you posted, GCC inserts various alignment
instructions. In larger segments of code, it'll reorder instructions to
make pipelining more efficient (non-x86 will benefit more from this,
especially embedded processors that don't do their own instruction
reordering at runtime). I doubt that any human could write tens of
millions of lines of assembly to carefully consider where to insert nops
to align loop boundaries, or how to reorder instructions to prevent data
dependency stalls in the pipeline.

On the flip side of the coin, for a small function, a human will almost
always be able to do a better job of optimizing than the compiler,
especially since the semantics of the code are in the user's mind, and
they can ignore the corner cases that the C compiler has to get right
when compiling C code, and doing certain optimizations.

Finally, the loop optimization you mentioned is quite unimportant. Why?
Because a single cmp instruction is more or less lost in the noise when
you benchmark it. For a billion iterations of an _empty_ loop (which I
had to write by hand, or GCC's optimizer would remove the loop entirely),
it leads to a 15% difference. For any code which did a single memory
access within the loop (writing to a volatile int), there was on average
a 3% performance difference.

Conclusion:
Draw your own! (ok, ok... I think both assembly and C each have their
place. I won't spend my time writing lots of dumb assembly, but I have no
problems optimizing a hot-spot in my code by dropping into it for an
inner loop.)
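For reference, the loop bodies I timed were roughly of this shape (a
reconstruction for illustration, not the exact code; the count-down version
assumes n > 0):

volatile int sink;   /* volatile so the store cannot be optimized away */

void count_up(long n)
{
    long i;
    for (i = 0; i < n; i++) sink = 0;   /* increment, then compare against n */
}

void count_down(long n)
{
    long i = n;
    do { sink = 0; } while (--i);       /* decrement; the zero test is implicit */
}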
Jun 27 '08 #25
On Wed, 23 Apr 2008 09:54:37 -0700, Keith Thompson wrote:
The latter raises a (possibly) interesting point. The value of some
optimizations might depend on which model of the x86 family the code is
going to run on.
Of course. On the Pentium 4, for example, the 'inc' and 'dec'
instructions are slower than 'add $1,%reg', and 'sub $1,%reg'
respectively. However, on the Core architecture, the inc and dec are
preferred, as they aren't any slower than add/sub, but they are smaller,
and thus reduce code size.
Jun 27 '08 #26
Hey, I am still waiting for the Super Compiler that would understand
that
for (i = 0; i < 100; i++) putchar('a');
is the same as
putchar('a');
Is that too much to hope for?
Jun 27 '08 #27
copx wrote, On 23/04/08 00:38:
Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results of my
first experiment were surprising to say at least. After reading the chapter
on loops in my ASM book I wanted to test whether modern C compilers are
actually as smart as commonly claimed. I chose a most simple loop: calling
putchar 100 times. The first function (foo) uses a typical C style loop to
test the assumption that "the compiler will optimize that better than any
human could". The second function (bar) is based my newly gained knowledge,
the loop is basically ASM written in C. Now I am certain if I asked here
which one is more efficient all you guys would reply "the compiler will most
likely generate the same code in both cases" (I have read such claims
countless times here). Well, look at the ASM output below to see how wrong
your assumption is.

/* C Code */

void foo(void)
{
int i;

for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
int i = 100;

do {
putchar('a');
} while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
/* GCC 4.3.0 on x86/Windows, -O2 */
<snip>

Add -funroll-loops and gcc will do some loop unrolling for you,
potentially speeding it up. This is likely to have a bigger impact than
the optimisation you thought of.

The larger the function/module/program, the more likely it is for the
compiler to beat hand-coded assembler and the more it will cost to
produce hand-coded assembler. I have beaten a compiler where it mattered
(admittedly 10 years or so ago) but I've also speeded up code by more
than a factor of 2 by removing stupidities in what people had done in a
high level language (and I know others who have also done this).

Generally writing large chunks of assembler code is simply too expensive
and error prone. There are exceptions where you really do have to get
the last cycle, but I believe they are rare when you move away from the
really small embedded processors.
--
Flash Gordon
Jun 27 '08 #28
On Apr 23, 12:59 pm, Knight <knightt...@yahoo.com> wrote:
Hey, I am still waiting for the Super Compiler that would understand
that
for (i = 0; i < 100; i++) putchar('a');
is the same as
putchar('a');
Is that too much to hope for?
D'oh!
It's not.
I'll throw some tomatoes on myself now.

To be fair, substitute
for (i = 0; i < 100; i++) a = b;
And find a compiler that would substitute this with
a = b;
Jun 27 '08 #29
On Apr 23, 12:15 pm, Flash Gordon <s...@flash-gordon.me.uk> wrote:
copx wrote, On 23/04/08 00:38:


Optimizers are overrated
I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results of my
first experiment were surprising to say at least. After reading the chapter
on loops in my ASM book I wanted to test whether modern C compilers are
actually as smart as commonly claimed. I chose a most simple loop: calling
putchar 100 times. The first function (foo) uses a typical C style loop to
test the assumption that "the compiler will optimize that better than any
human could". The second function (bar) is based my newly gained knowledge,
the loop is basically ASM written in C. Now I am certain if I asked here
which one is more efficient all you guys would reply "the compiler will most
likely generate the same code in both cases" (I have read such claims
countless times here). Well, look at the ASM output below to see how wrong
your assumption is.
/* C Code */
void foo(void)
{
 int i;
 for (i = 0; i < 100; i++) putchar('a');
}
void bar(void)
{
 int i = 100;
 do {
  putchar('a');
 } while (--i);
}
As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimial
code in both cases. But see the result:
/* GCC 4.3.0 on x86/Windows, -O2 */

<snip>

Add -funroll-loops and gcc will do some loop unrolling for you
potentially speeding it up. This is likely to have a bigger impact that
the optimisation you thought of.

The larger the function/module/program the more likely it is for the
compiler to beet hand coded assembler and the more it will cost to
produce hand coded assembler. I have beet a compiler where it mattered
(admittedly 10 years or so ago) but I've also speeded up code by more
than a factor of 2 by removing stupidities in what beople had done in a
high level language (and I know others who have also done this).

Generally writing large chunks of assembler code is simply too expensive
and error prone. There are exceptions where you really do have to get
the last cycle, but I believe they are rare when you move away from the
really small embedded processors.
Instead of unrolling loops, how about memset with the char of choice
for 'count' characters, null termination with a zero byte, and puts()?
I guess that the choice of the right optimization is better than all
the silly tweaks in the world.
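Something along these lines, say (just a sketch; the function name and buffer
size are made up for illustration, and note that puts appends a newline the
putchar loop did not):

#include <stdio.h>
#include <string.h>

void print_run(char c, size_t count)
{
    char buf[256];

    if (count >= sizeof buf)    /* keep the sketch simple: clamp long runs */
        count = sizeof buf - 1;
    memset(buf, c, count);      /* fill the whole run in one library call */
    buf[count] = '\0';
    puts(buf);                  /* one I/O call instead of 'count' putchar calls */
}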
Jun 27 '08 #30
Keith Thompson <ks***@mib.orgwrote:
>
The latter raises a (possibly) interesting point. The value of some
optimizations might depend on which model of the x86 family the code
is going to run on.
In fact, the x86 family has been notorious for having optimization
techniques for one generation of processor performing horribly on the
next generation of processor.

-Larry Jones

I like maxims that don't encourage behavior modification. -- Calvin
Jun 27 '08 #31
On Wed, 23 Apr 2008 13:02:18 -0700, Knight wrote:
On Apr 23, 12:59 pm, Knight <knightt...@yahoo.com> wrote:
>Hey, I am still waiting for the Super Compiler that would understand
that
for (i = 0; i < 100; i++) putchar('a'); is the same as
putchar('a');
Is that too much to hope for?

D'oh!
It's not.
I'll throw some tomatoes on myself now.

To be fair, substitute
for (i = 0; i < 100; i++) a = b;
And find a compiler that would substitute this with a = b;
gcc 4.2.3 does it for me.
actually, if I do:

void foo()
{
int i, a, b;
b = 3;
for (i = 0; i < 100; i++)
a = b;
}

it optimizes it to:

void foo()
{
}

since none of the code in the function actually affects program state.
Jun 27 '08 #32
Knight said:
On Apr 23, 12:59 pm, Knight <knightt...@yahoo.comwrote:
>Hey, I am still waiting for the Super Compiler that would understand
that
for (i = 0; i < 100; i++) putchar('a');
is the same as
putchar('a');
Is that too much to hope for?

D'oh!
It's not.
I'll throw some tomatoes on myself now.

To be fair, substitute
for (i = 0; i < 100; i++) a = b;
And find a compiler that would substitute this with
a = b;
Presumably you mean:

i = 100;
a = b;

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Jun 27 '08 #33
"Richard Heathfield" <rj*@see.sig.invalidwrote in message
news:Q_******************************@bt.com...
Knight said:
>On Apr 23, 12:59 pm, Knight <knightt...@yahoo.comwrote:
>>Hey, I am still waiting for the Super Compiler that would understand
that
for (i = 0; i < 100; i++) putchar('a');
is the same as
putchar('a');
Is that too much to hope for?

D'oh!
It's not.
I'll throw some tomatoes on myself now.

To be fair, substitute
for (i = 0; i < 100; i++) a = b;
And find a compiler that would substitute this with
a = b;

Presumably you mean:

i = 100;
a = b;
I guess it would look something like this:

#include <stdio.h>
int main (void)
{
int i, a=0, b;
b = 31415;
for (i = 0; i < 100; i++)
a = b;
printf ("a=%d,i=%d\n", a, i);
return 0;
}

; Listing generated by Microsoft (R) Optimizing Compiler Version
14.00.50727.762

TITLE c:\tmp\foo.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB OLDNAMES

EXTRN __imp__printf:PROC
$SG-4 DB 'a=%d,i=%d', 0aH, 00H
PUBLIC _main
; Function compile flags: /Ogtpy
; File c:\tmp\foo.c
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 4 : int i, a=0, b;
; 5 : b = 31415;
; 6 : for (i = 0; i < 100; i++)
; 7 : a = b;
; 8 : printf ("a=%d,i=%d\n", a, i);

00000 6a 64 push 100 ; 00000064H
00002 68 b7 7a 00 00 push 31415 ; 00007ab7H
00007 68 00 00 00 00 push OFFSET $SG-4
0000c ff 15 00 00 00
00 call DWORD PTR __imp__printf
00012 83 c4 0c add esp, 12 ; 0000000cH

; 9 : return 0;

00015 33 c0 xor eax, eax

; 10 : }

00017 c3 ret 0
_main ENDP
_TEXT ENDS
END
** Posted from http://www.teranews.com **
Jun 27 '08 #34
copx wrote:
>
The test is the whole point of optimizing this loop on x86. You do not need
a test (cmp instruction) in the loop if you decrement towards zero instead
of incrementing towards 100. This saves one instruction in the body of loop.
You're assuming that this and your other optimisations are actually
faster. Did you test this? It's entirely possible that there's some other
h/w magic which you don't know about but which the compiler writer does.

Also, not meaning to be funny, but you admit you just started ASM
programming recently. Do you /really/ think you know more already than
people who write compilers for a living?

--
Mark McIntyre

CLC FAQ <http://c-faq.com/>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>
Jun 27 '08 #35
copx wrote:
"Eric Sosman" <es*****@ieee-dot-org.invalid> wrote in message
news:8P******************************@comcast.com...
[snip]
> If you give your car a nice fresh coat of wax, do you
improve its fuel efficiency by reducing air drag or hurt
the efficiency by increasing weight?

Compiler writers do not stay up nights trying to figure
out how to optimize silly loops; they give their attention
to getting more "serious" programs to run well.

"Serious" programs probably contain many of such "silly loops".
And how much experience do you have of serious programmes, compared to
Eric?

Please stop embarrassing yourself.
>And what
exactly is "silly" about loops based on incrementing/decrementing an integer
value counting towards a maximum/minimum? Have you written many "serious"
programs without them?
Yes, but it's irrelevant. I never wrote a programme that did nothing but
count down to zero and print a meaningless statement....

--
Mark McIntyre

CLC FAQ <http://c-faq.com/>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>
Jun 27 '08 #36
Mark McIntyre wrote:
>copx wrote:
>"Eric Sosman" <es*****@ieee-dot-org.invalid> wrote in message
news:8P******************************@comcast.com...
[snip]
>> If you give your car a nice fresh coat of wax, do you
improve its fuel efficiency by reducing air drag or hurt
the efficiency by increasing weight?

Compiler writers do not stay up nights trying to figure
out how to optimize silly loops; they give their attention
to getting more "serious" programs to run well.

"Serious" programs probably contain many of such "silly loops".

And how much experience do you have of serious programmes, compared to
Eric?
Even when it's made on my behalf I find the "Experience
counts" argument only mildly persuasive. We all know people
of vast experience and half-vast ability ...

The point, still, is that one observation (not even a
measurement!) of one tiny program is insufficient basis for
a sweeping statement like "optimizers are overrated." The
sample is too small and the observation too inexact to support
such a conclusion. Using similar methodology I could show
that penicillin is overrated: Give one dose to one healthy
person, ask him how much better he feels, and draw^H^H^H^H
leap to your conclusion. (I think election pollsters use
this general approach ...)

--
Er*********@sun.com
Jun 27 '08 #37
On Apr 23, 5:54 pm, Keith Thompson <ks...@mib.org> wrote:
On the other hand, copx is assuming that a test against 0 is going to
be better than a test against 100. Even if the test against 0 is
implicit (the decrement instruction implicitly sets a flag), that may
or may not be the case. Without either measurement (which copx has
not done) or intimate knowledge of how the x86 CPU behaves (which I
lack, and which might vary across models anyway), we can't tell
whether the optimization would be worthwhile.
We can get a hint from the fact that the decrement and increment
instructions have been removed from the IA-64 instruction set, so both
Intel and AMD seem to think that having this instruction is not
worthwhile. Also, the increment and decrement instructions don't
affect the carry flag, unlike an add or subtract instruction, which
means inc and dec force a partial condition code register update,
which means the instruction will run slower.
Jun 27 '08 #38
On Apr 23, 5:35 pm, "christian.bau" <christian....@cbau.wanadoo.co.uk>
wrote:
> On Apr 23, 5:54 pm, Keith Thompson <ks...@mib.org> wrote:
> > On the other hand, copx is assuming that a test against 0 is going to
> > be better than a test against 100. Even if the test against 0 is
> > implicit (the decrement instruction implicitly sets a flag), that may
> > or may not be the case. Without either measurement (which copx has
> > not done) or intimate knowledge of how the x86 CPU behaves (which I
> > lack, and which might vary across models anyway), we can't tell
> > whether the optimization would be worthwhile.
>
> We can get a hint from the fact that the decrement and increment
> instructions have been removed from the IA-64 instruction set, so both
> Intel and AMD seem to think that having this instruction is not
> worthwhile. Also, the increment and decrement instructions don't
> affect the carry flag, unlike an add or subtract instruction, which
> means inc and dec force a partial condition code register update,
> which means the instruction will run slower.

That's not really correct.

First, you almost certainly don't mean IA-64 (especially since you
mention AMD). IA-64 is the IPF or Itanium ISA. What you want is
x86-64, which is the most common name for what AMD calls AMD64 and
Intel calls EM64T or Intel 64, which are all names for the 64 bit
version of good old x86.

Anyway, AMD removed the one byte forms of the inc and dec
instructions, since they needed a range of single byte opcodes so they
could encode the new REX prefix. The longer forms of inc and dec,
which are functionally identical, are still there (as they have been since
the 8086). And given that Intel copied AMD64, it doesn't really matter
what they thought of the one byte forms of those instructions - they
were stuck with AMD's decision.

Anyway, the performance of incs and decs is back up to par with the
Core generations of CPUs.
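
[To tie this back to C: when the loop counter is used only to bound the loop, the compiler is free to choose the counting direction itself, and, per the correction above, whether it ends up with inc/cmp or a lone dec is unlikely to matter on recent cores. The sketch below is illustrative only; repeat_char() is a hypothetical helper, not something from the thread.]

/* The counter i is never used in the body, so nothing in the C source
   forces the compiler to count upwards; it may count down and test
   against zero if it judges that cheaper on the target. */
#include <stdio.h>

static void repeat_char(int c, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++)
        putchar(c);
}

/* Usage: repeat_char('a', 100); */
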
Jun 27 '08 #39
Eric Sosman <es*****@ieee-dot-org.invalid> wrote in
news:8P******************************@comcast.com:
You'd probably call Bach overrated because he never wrote
any good kazoo concertos.
See here, now! P.D.Q. Bach notoriously used kazoos in his works. You
can't seriously doubt the ... the, um ... the QUALITY of his
compositions, now can you?

--
rzed
.... also never wrote a good kazoo concerto.

Jun 27 '08 #40
user923005 <dc*****@connx.com> wrote in news:3554078d-425c-4da3-b98d-
ea**********@34g2000hsh.googlegroups.com:
Instead of unrolling loops, how about memset with the char of choice
for 'count' characters, null termination with a zero byte, and puts()?
I guess that the choice of the right optimization is better than all
the silly tweaks in the world.
See

http://groups.google.com/group/comp....5762477174dde4
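
[A minimal sketch of the memset-and-single-write idea quoted above; it is illustrative only and not from the linked discussion. Note that puts() appends a newline of its own, so fputs() is used here to stay closer to a plain run of characters.]

/* Build the run of characters once, then hand the whole buffer to
   stdio in a single call instead of 100 separate putchar() calls. */
#include <stdio.h>
#include <string.h>

#define COUNT 100

void put_run(void)
{
    char buf[COUNT + 1];

    memset(buf, 'a', COUNT);   /* fill the buffer with the chosen char */
    buf[COUNT] = '\0';         /* null-terminate it                    */
    fputs(buf, stdout);        /* one library call for the whole run   */
}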

--
A. Sinan Unur <1u**@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
Jun 27 '08 #41
CBFalconer <cb********@yahoo.com> writes:
> No you didn't. You proved that makeing the termination depend on
> the variable reaching 0 differs from having the variable reach 100.
You have used the word "makeing". No such word exists in the language
standard. Perhaps you were looking for the word "petard".

--
Tony

Jun 27 '08 #42

"Flash Gordon" <sp**@flash-gordon.me.ukwrote in message
news:l1************@news.flash-gordon.me.uk...
Bartc wrote, On 23/04/08 11:34:

<snip>
>I never believed compilers were getting as good as hand-coded asm, but
recently I almost came to believe the hype.

Just yesterday I finished porting to C, the key module of a byte-code
interpreter. Testing this on only one input program, the results on two
compilers I tried (x86 target) were:

lccwin32 3000ms with -O
gcc 3.4.5 1600ms with no optimisation (any -O switches made much
slower)
original 800ms, using tight mix of asm and compiled code (not C).

And these times were /after/ considerable tweaking of the C code.

Have some avenues to explore yet, but so far disappointing.

Why do you think comparing your original code against using a compiler
*without* optimisation has any bearing on how good the compiler is if you
tell it to produce good code? Try actually telling the compiler to
optimise and telling it what type of processor to optimise for and what
the minimum processor it can assume is (this defaults to an 80386 for
Intel I believe, do you really need to support processors that old?).
I've tried extra options, with little effect.

I've since looked in more detail at gcc code output. And decided to write my
main dispatcher in ASM! Which will also take over some of the simplest
operations (this is a byte-code interpreter).

I will keep the C dispatcher (for portability reasons), but see no reason to
have to run it at half-speed (or one-third speed on another machine) because
the compiler can't make the global strategic decisions that I can make in
ASM.

(BTW I don't know what's the matter with lccwin32; I also have a version in
my own language+compiler (no asm), that does no optimisation /whatsoever/,
yet performs between gcc and lccwin32.)
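
[For readers who have not written one: the portable C fall-back being kept here is typically a switch-based dispatch loop along the lines below. This is a hypothetical sketch, not Bartc's code; the opcodes and VM layout are invented for illustration.]

/* A minimal switch-based byte-code dispatcher */
#include <stddef.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

typedef struct {
    const unsigned char *ip;   /* instruction pointer         */
    long stack[256];
    size_t sp;                 /* index of the next free slot */
} VM;

static long run(VM *vm)
{
    for (;;) {
        switch (*vm->ip++) {
        case OP_PUSH:                         /* push the next byte      */
            vm->stack[vm->sp++] = *vm->ip++;
            break;
        case OP_ADD:                          /* pop two, push their sum */
            vm->sp--;
            vm->stack[vm->sp - 1] += vm->stack[vm->sp];
            break;
        case OP_HALT:                         /* return the top of stack */
            return vm->stack[vm->sp - 1];
        default:
            return -1;                        /* unknown opcode          */
        }
    }
}

/* Example program computing 2 + 3:
   static const unsigned char prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
   VM vm = { prog, {0}, 0 };
   run(&vm) returns 5. */
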
--
Bart

Jun 27 '08 #43
On 23 Apr, 20:15, Flash Gordon <s...@flash-gordon.me.uk> wrote:
The larger the function/module/program the more likely it is for the
compiler to beet hand coded assembler
club it to death with a root vegetable?
--
Nick Keighley
Jun 27 '08 #44


rzed wrote:
Eric Sosman <es*****@ieee-dot-org.invalid> wrote in
news:8P******************************@comcast.com:
You'd probably call Bach overrated because he never wrote
any good kazoo concertos.

See here, now! P.D.Q. Bach notoriously used kazoos in his works. You
can't seriously doubt the ... the, um ... the QUALITY of his
compositions, now can you?
I once went to a P.D.Q. Bach concert and the two old ladies who
sat behind me failed to see the humour until after the intermission
drinks had taken effect. They expected . . . . Bach.

w..
Jun 27 '08 #45
rzed wrote:
Eric Sosman <es*****@ieee-dot-org.invalid> wrote in
news:8P******************************@comcast.com:
> You'd probably call Bach overrated because he never wrote
any good kazoo concertos.

See here, now! P.D.Q. Bach notoriously used kazoos in his works. You
can't seriously doubt the ... the, um ... the QUALITY of his
compositions, now can you?
<off-topic>

Long years ago my high-school calculus teacher (who had
many interests outside mathematics) played for his class a
tape of a mini-concert given by a singing group to which he
belonged, on the occasion of their annual party. A speaker
soberly introduced the program as the first fruits of a piece
of ground-breaking musical scholarship, which had recently
discovered that J.S. Bach's "Die Kunst der Fuge" ("The Art
of the Fugue") had been mis-identified for all this time,
the word "Fuge" being a mis-transcription of Bach's intended
"Ruge," which the speaker explained was an Old German word
meaning "kazoo." A quartet of kazooists, my teacher among
them, then gave the first "original instruments" performance
of this long-misunderstood masterpiece of, er, casuistry.

</off-topic>

--
Er*********@sun.com
Jun 27 '08 #46


Willem wrote:
Walter wrote:

) You made the assumption that the two pieces of code are equivalent
) they are not.

They are too. The variable 'i' is not used for anything but
counting the loop. I've seen GCC produce the same code for
the two snippets and that was several years back. Before 3.0.
I understand about i and the results are the same. I also understand
that it could be optimized to be the same and many compilers would.

The for and while implementations have different meaning. It is a subtle
point that I am making essentially about the care that is needed in
creating benchmarks like this.
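
[One concrete case where the two loop shapes really do diverge, offered only as an illustrative sketch: a run-time count of zero. With the fixed constant 100 of the original snippets the two coincide, which is why a compiler may legitimately emit identical code for both.]

/* Prints exactly n characters, including none at all when n == 0. */
#include <stdio.h>

static void put_n_for(int c, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++)
        putchar(c);
}

/* Tests only after the body has run: for n == 0 the counter wraps
   around and the loop prints UINT_MAX + 1 characters, not zero. */
static void put_n_do(int c, unsigned n)
{
    unsigned i = n;

    do {
        putchar(c);
    } while (--i);
}
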

w..


Jun 27 '08 #47
Bartc wrote:
>
"Flash Gordon" <sp**@flash-gordon.me.ukwrote in message
news:l1************@news.flash-gordon.me.uk...
>Bartc wrote, On 23/04/08 11:34:

<snip>
>>I never believed compilers were getting as good as hand-coded asm,
but recently I almost came to believe the hype.

Just yesterday I finished porting to C, the key module of a
byte-code interpreter. Testing this on only one input program, the
results on two compilers I tried (x86 target) were:

lccwin32 3000ms with -O
gcc 3.4.5 1600ms with no optimisation (any -O switches made much
slower)
original 800ms, using tight mix of asm and compiled code (not
C).

And these times were /after/ considerable tweaking of the C code.

Have some avenues to explore yet, but so far disappointing.

Why do you think comparing your original code against using a
compiler *without* optimisation has any bearing on how good the
compiler is if you tell it to produce good code? Try actually telling
the compiler to optimise and telling it what type of processor to
optimise for and what the minimum processor it can assume is (this
defaults to an 80386 for Intel I believe, do you really need to
support processors that old?).

I've tried extra options, with little effect.

I've since looked in more detail at gcc code output. And decided to
write my main dispatcher in ASM! Which will also take over some of the
simplest operations (this is a byte-code interpreter).

I will keep the C dispatcher (for portability reasons), but see no
reason to have to run it at half-speed (or one-third speed on another
machine) because the compiler can't make the global strategic
decisions that I can make in ASM.

(BTW I don't know what's the matter with lccwin32; I also have a
version in my own language+compiler (no asm), that does no
optimisation /whatsoever/, yet performs between gcc and lccwin32.)
If you want heavy optimisations the choice is between gcc and intel (or
perhaps another compiler like DigitalMars). jacob navia will be (I'm
sure) the first to admit that lcc-win32 doesn't do all the optimisations
under the Sun. He has admitted in the past to doing peephole optimisation,
though.

In my experience for Intel x86 chips and their clones, executables
produced by the Intel compilers usually outperform others.

But yes, if it's feasible, a manual design can always perform high-level
optimisations that compilers will struggle to match. I suppose your
interpreter is a case study in this. It's still surprising to hear that
the C version is about twice as slow as your assembler one. Is the
speed gain of the assembler code spread out uniformly over the source
code or is it confined to one or more specific constructs/routines?

Jun 27 '08 #48
Walter wrote:
) I understand about i and the results are the same. I also understand
) that it could be optimized to be the same and many compilers would.
)
) The for and while implementations have different meaning. It is a subtle
) point that I am making essentially about the care that is needed in
) creating benchmarks like this.

I don't get your point. Could you elaborate ?

The C standard allows that a conforming compiler creates
exactly the same machine code for the two code snippets.
To me, that implies that the two have exactly the same meaning.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Jun 27 '08 #49
>>>>"ES" == Eric Sosman <Er*********@sun.comwrites:

ES<off-topic>

ESA quartet of kazooists, my teacher among
ESthem, then gave the first "original instruments" performance
ESof this long-misunderstood masterpiece of, er, casuistry.

Did they seriously perform something from Die Kunst der Fuge on kazoo?
That takes an incredible amount of skill to get reasonably right.

Charlton

(who played in a jazz combo once that was known for doing whimsical
things like playing a verse on kazoo and slide whistle and thus knows
how difficult it is to actually play the kazoo in such a context)
ES> </off-topic>

--
Charlton Wilbur
cw*****@chromatico.net
Jun 27 '08 #50
