Bytes IT Community

The strange speed of a program!

hi all,
I have a program whose speed is strange to me. It is mainly used to
calculate an output image so from four images s0, s1, s2, s3, where
so = (s0-s2)^2 + (s1-s3)^2. I compile it with gcc (no optimization).
The code between the /***********/ markers is the initialization code.
What surprises me a lot is that the code with initialization (io==1)
is much faster than without initialization (io!=1). The initialization
code should take some time and slow the program down. But the result
confuses me. Why?

Code is listed below
==================================================================
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
clock_t start,end;
double cpu_time_used;
short *s0,*s1,*s2,*s3;
int *so;
unsigned int i,j;
unsigned int times;
unsigned int length;
int io;
if( argc<4 )
{
fprintf( stderr, "USAGE: %s times width io (1 = initialize, "
         "otherwise no initialization)\n", argv[0] );
exit(1);
}
else
{
times = atoi( argv[1] );
length = atoi( argv[2] );
length = length*length;
io = atoi( argv[3] );
}
s0 = (short *)malloc( length*sizeof(short) );
s1 = (short *)malloc( length*sizeof(short) );
s2 = (short *)malloc( length*sizeof(short) );
s3 = (short *)malloc( length*sizeof(short) );
s3 = (short *)malloc( length*sizeof(short) );
so = (int *)malloc( length*sizeof(int) );
start = clock();
for( i=0; i<times; ++i)
{
/**************************************************/
if( io==1 )
{
for( j=0; j<length; ++j )
{
s0[j] = i+j;
s1[j] = length-1-j;
s2[j] = 2*j;
s3[j] = 3*j;
}
}
/**************************************************/
for( j=0; j<length; ++j )
{
int tmp1,tmp2;
tmp1 = s0[j]-s2[j];
tmp1 = tmp1*tmp1;
tmp2 = s1[j]-s3[j];
tmp2 = tmp2*tmp2;
so[j] = tmp1+tmp2;
}
}
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time: %f sec\n",cpu_time_used);
free( s0 );
free( s1 );
free( s2 );
free( s3 );
free( so );
return 0;
}

Nov 14 '05 #1


bear wrote:
hi all,
I have a program whose speed is strange to me. What surprises
me a lot is that the code with initialization (io==1) is much faster
than without initialization (io!=1). The initialization code should
take some time and slow the program down.
But the result confuses me. Why?
Can you post the actual timings you observed?
The only thing that comes to mind is that the operations in the
io==1 version are all on small numbers, so maybe they don't
take as long as those in the io==0 version, which are on arbitrary
values that could be large and/or negative. Also, the large-number
operations might be causing integer overflow exceptions.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
clock_t start,end;
double cpu_time_used;
short *s0,*s1,*s2,*s3;
int *so;
unsigned int i,j;
unsigned int times;
unsigned int length;
int io;
if( argc<4 )
{
fprintf( stderr, "USAGE: %s times width io (1 = initialize, "
         "otherwise no initialization)\n", argv[0] );
exit(1);
}
else
{
times = atoi( argv[1] );
length = atoi( argv[2] );
atoi causes undefined behaviour if the number is bigger than
can fit in an int
length = length*length;
You should check that length * length will not overflow, before
doing this.
io = atoi( argv[3] );
}
s0 = (short *)malloc( length*sizeof(short) );
Lose the cast (search this newsgroup for 'malloc' to learn
more about why):

s0 = malloc( length * sizeof *s0 );
s1 = (short *)malloc( length*sizeof(short) );
s2 = (short *)malloc( length*sizeof(short) );
s3 = (short *)malloc( length*sizeof(short) );
s3 = (short *)malloc( length*sizeof(short) );
You just leaked the first s3 malloc's results.
You should also check that all of these mallocs succeeded.
so = (int *)malloc( length*sizeof(int) );


Rest of the code looks OK.

Nov 14 '05 #2

Thanks for your advice.

I compile basic.c using gcc-3.3.5 with no optimization. My CPU is an
Athlon XP, running Debian.

./basic 50 1024 0 #no init
results in CPU time: 9.690000 sec
./basic 50 1024 1 #init
results in CPU time: 3.280000 sec
Old Wolf wrote:
bear wrote:
hi all,
I have a program whose speed is strange to me. What surprises
me a lot is that the code with initialization (io==1) is much faster
than without initialization (io!=1). The initialization code should
take some time and slow the program down.
But the result confuses me. Why?
Can you post the actual timings you observed?
The only thing that comes to mind is, the operations in the
io==1 version are all on small numbers, so maybe they don't
take as long as the io==0 version which are all on random numbers
which could be large and/or negative. Also, the large number
operations might be causing integer overflow exceptions.


Do calculations on random numbers run more slowly than on small
numbers? I think they use the same instructions and the same operand
width. Also, can these overflow exceptions cause a slowdown?
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
clock_t start,end;
double cpu_time_used;
short *s0,*s1,*s2,*s3;
int *so;
unsigned int i,j;
unsigned int times;
unsigned int length;
int io;
if( argc<4 )
{
fprintf( stderr, "USAGE: %s times width io (1 = initialize, "
         "otherwise no initialization)\n", argv[0] );
exit(1);
}
else
{
times = atoi( argv[1] );
length = atoi( argv[2] );
atoi causes undefined behaviour if the number is bigger than
can fit in an int
length = length*length;


You should check that length * length will not overflow, before
doing this.
io = atoi( argv[3] );
}
s0 = (short *)malloc( length*sizeof(short) );


Lose the cast (search this newsgroup for 'malloc' to learn
more about why):


I searched but did not find any clue about why this cast is
incorrect, so I just left it there.

s0 = malloc( length * sizeof *s0 );
s1 = (short *)malloc( length*sizeof(short) );
s2 = (short *)malloc( length*sizeof(short) );
s3 = (short *)malloc( length*sizeof(short) );
s3 = (short *)malloc( length*sizeof(short) );


You just leaked the first s3 malloc's results.
You should also check that all of these mallocs succeeded.

I have corrected this

so = (int *)malloc( length*sizeof(int) );


Rest of the code looks OK.


Nov 14 '05 #3

bear wrote:
Old Wolf wrote:
bear wrote:
.... snip ...
io = atoi( argv[3] );
}
s0 = (short *)malloc( length*sizeof(short) );


Lose the cast (search this newsgroup for 'malloc' to learn
more about why):


I search but do not find any clue on why this cast is incorrect.
So I just leave it there.


In general ALL casts, other than to arguments to variadic
functions, are suspect and should be avoided. Unnecessary casts
only serve to suppress error messages, by announcing "I know what I
am doing, so don't complain". The usual form for a malloc call is:

if (!(ptr = malloc(number * sizeof *ptr))) {
/* handle the "out of memory" condition */
}
else {
/* successful allocation, do whatever with ptr */
}

which secures space for an array of number items of whatever type
ptr has been declared to point to. The declarations and values of
ptr and number (the latter may be 1, and thus omitted) control all
the action, and no error messages are suppressed. A very common
error that can be suppressed by a silly cast is failure to #include
<stdlib.h>.

Make sure you read the first three references below.

--
Some useful references about C:
<http://www.ungerhu.com/jxh/clc.welcome.txt>
<http://www.eskimo.com/~scs/C-faq/top.html>
<http://benpfaff.org/writings/clc/off-topic.html>
<http://anubis.dkuug.dk/jtc1/sc22/wg14/www/docs/n869/> (C99)
<http://www.dinkumware.com/refxc.html> (C library)
<http://gcc.gnu.org/onlinedocs/> (GNU docs)
Nov 14 '05 #4

In comp.lang.c bear <xg***@mails.tsinghua.edu.cn> wrote:
Thanks for your advice. I compile basic.c using gcc-3.3.5 with no
optimization. My CPU is an Athlon XP, running Debian.
./basic 50 1024 0 #no init
results in CPU time: 9.690000 sec
./basic 50 1024 1 #init
results in CPU time: 3.280000 sec
gcc 3.3.4 (Slackware)
gcc -O3 -static

The same binary was run on 4 different processors, my results are:
./a.out 50 1024 [01]

init:                              0          1
Pentium MMX (200MHz)       10.860000  37.870000
Pentium II (400MHz)         1.330000   7.090000
mobile Athlon XP (1400MHz)  6.950000   4.060000
Pentium4 (2.80GHz)          9.290000   0.710000

(Look at the last result - I couldn't believe it myself!)

It looks as if the problem is not C (or gcc) specific (although there
might be some interesting morals for C programmers).
Old Wolf wrote:
bear wrote:
Can you post the actual timings you observed?
The only thing that comes to mind is, the operations in the
io==1 version are all on small numbers, so maybe they don't
take as long as the io==0 version which are all on random numbers
which could be large and/or negative. Also, the large number
operations might be causing integer overflow exceptions.
The calculation of random number runs slowly than small numbers?
I think they use the same instruction and the same width of operand.
Also, can these overflow exceptions cause slow down?
Interesting idea, but my first results didn't confirm it (i.e. that
overflows might cause a slow-down); I'll investigate this further.

There's a big difference when you pre-initialize the allocated
memory before the big loop. I used this code:

#include <stdlib.h>  /* malloc, rand */

void *randmalloc(size_t s)
{
size_t i;
char *pc = malloc(s);
if (pc)
for (i=0; i<s; ++i)
pc[i] = rand();
return pc;
}

#define malloc randmalloc

(Okay, okay, I know it's not legal to redefine malloc, but it's
quicker than modifying the source.)
s0 = (short *)malloc( length*sizeof(short) );


Lose the cast (search this newsgroup for 'malloc' to learn
more about why):

I search but do not find any clue on why this cast is incorrect.
So I just leave it there.


It's not incorrect per se, but it is malpractice.
See FAQ 7.7.

--
Stan Tobias
mailx `echo si***@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`
Nov 14 '05 #5



S.Tobias wrote:
In comp.lang.c bear <xg***@mails.tsinghua.edu.cn> wrote:
Thanks for your advice.

I compile basic.c using gcc-3.3.5 with no optimization. My CPU is
athlon-xp,installed Debian.

./basic 50 1024 0 #no init
results in CPU time: 9.690000 sec
./basic 50 1024 1 #init
results in CPU time: 3.280000 sec


gcc 3.3.4 (Slackware)
gcc -O3 -static

The same binary was run on 4 different processors, my results are:
./a.out 50 1024 [01]

init: 0 1
Pentium MMX (200MHz) 10.860000 37.870000
Pentium II (400MHz) 1.330000 7.090000
mobile Athlon XP (1400MHz) 6.950000 4.060000
Pentium4 (2.80GHz) 9.290000 0.710000

(Look at the last result - I couldn't believe it myself!)

It looks as if the problem is not C (or gcc) specific (although there
might be some interesting morals for C programmers).
Old Wolf wrote:
bear wrote: Can you post the actual timings you observed?
The only thing that comes to mind is, the operations in the
io==1 version are all on small numbers, so maybe they don't
take as long as the io==0 version which are all on random numbers
which could be large and/or negative. Also, the large number
operations might be causing integer overflow exceptions.
The calculation of random number runs slowly than small numbers?
I think they use the same instruction and the same width of operand.
Also, can these overflow exceptions cause slow down?


Interesting idea, but my first results didn't confirm this (ie.
that overflows might cause slow-down), but I'll yet investigate this.

There's big difference when you pre-initialize the allocated
memory before the big loop. I used this code:

void *randmalloc(size_t s)
{
size_t i;
char *pc = malloc(s);
if (pc)
for (i=0; i<s; ++i)
pc[i] = rand();
return pc;
}

#define malloc randmalloc

(Okay, okay, I know it's not legal to redefine malloc, but it's
quicker that modifying the source.)

I changed the code to
==============================================
s0 = malloc( length*sizeof(short) );
s1 = malloc( length*sizeof(short) );
s2 = malloc( length*sizeof(short) );
s3 = malloc( length*sizeof(short) );
so = malloc( length*sizeof(int) );
for( j=0; j<length; ++j )
{
s0[j] = 0;
s1[j] = 0;
s2[j] = 0;
s3[j] = 0;
}
===============================================
to zero the four input images. The speed seems explainable now.
On my Athlon XP:
with init: 4.020000 sec
without init: 2.190000 sec
It seems that it is the overflow that slows things down. Am I right?


> s0 = (short *)malloc( length*sizeof(short) );

Lose the cast (search this newsgroup for 'malloc' to learn
more about why):

I search but do not find any clue on why this cast is incorrect.
So I just leave it there.


It's not incorrect per se, but it is malpractice.
See FAQ 7.7.

--
Stan Tobias
mailx `echo si***@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`


Nov 14 '05 #6


In comp.lang.c bear <xg***@mails.tsinghua.edu.cn> wrote:
S.Tobias wrote:
In comp.lang.c bear <xg***@mails.tsinghua.edu.cn> wrote:
The calculation of random number runs slowly than small numbers?
I think they use the same instruction and the same width of operand.
Also, can these overflow exceptions cause slow down?
Interesting idea, but my first results didn't confirm this (ie.
that overflows might cause slow-down), but I'll yet investigate this.

There's big difference when you pre-initialize the allocated
memory before the big loop. I used this code:
[snip code]

I change the code to
==============================================

[snip code] ===============================================
zero the four input images. It seems that the speed is explainable now.
On my Athlon-xp:
with init: 4.020000 sec
without init 2.190000 sec
It seems that it is the overflow slows down the speed. Am I right?


I don't know. It might be it, it might be a few things together.
I tried putting random values and it worked "as it should", too.
The results seem to depend strongly on the `length' parameter
(array size) - processor cache is a suspect too. Finally, my
results (snipped) showed a large diversity on different architectures
(all PC-compatible).

This is too off-topic in comp.lang.c, where I read it, and I guess
gcc people won't be interested in this subject either (as I have shown,
the problem is - most probably - not specific to gcc). I'm not
going to discuss this in c.l.c. any longer.

Please, make a summary of what has been said until now, show your
(corrected) code, the results (mine too), and let's move this
discussion somewhere else. I think the first best group to try
would be comp.programming (unless someone can hint a better group).

I'm very interested in finding the solution to the problem myself.

--
Stan Tobias
mailx `echo si***@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`
Nov 14 '05 #8

[Newsgroups restricted to clc]

In comp.lang.c bear <xg***@mails.tsinghua.edu.cn> wrote:

[snippage]

short *s0,*s1,*s2,*s3;
allocation: s0 = (short *)malloc( length*sizeof(short) );
s1 = (short *)malloc( length*sizeof(short) ); [...]

initialization of s0[n]: for( j=0; j<length; ++j )
{
s0[j] = i+j;
s1[j] = length-1-j; [...]

use: for( j=0; j<length; ++j )
{
int tmp1,tmp2;
tmp1 = s0[j]-s2[j];


If the initialization step is skipped, then reading from uninitialized
memory technically might invoke UB. (But this, of course, defeats
the purpose of the question, which was not quite topical in clc anyway.)

Question to others:
In C99 an "indeterminate value" is an unspecified value or a trap
representation. I think C99 no longer says explicitly that accessing
indeterminate values automatically invokes UB. So if, on a given
implementation, integers have no trap representations (i.e. every bit
pattern corresponds to some valid value), is accessing uninitialized
integer objects then defined behaviour (yielding only unspecified
values)?

--
Stan Tobias
mailx `echo si***@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`
Nov 14 '05 #9

On Wed, 08 Jun 2005 10:18:34 +0000, S.Tobias wrote:
In comp.lang.c bear <xg***@mails.tsinghua.edu.cn> wrote:
Thanks for your advice.
I compile basic.c using gcc-3.3.5 with no optimization. My CPU is
athlon-xp,installed Debian.

./basic 50 1024 0 #no init
results in CPU time: 9.690000 sec
./basic 50 1024 1 #init
results in CPU time: 3.280000 sec


gcc 3.3.4 (Slackware)
gcc -O3 -static

The same binary was run on 4 different processors, my results are:


The operating system and particularly the VM strategies it uses could be a
big factor here. Were you running the identical OS on these 4
processors?
./a.out 50 1024 [01]

init: 0 1
Pentium MMX (200MHz) 10.860000 37.870000
Pentium II (400MHz) 1.330000 7.090000
mobile Athlon XP (1400MHz) 6.950000 4.060000
Pentium4 (2.80GHz) 9.290000 0.710000

(Look at the last result - I couldn't believe it myself!)
The last is what I would expect for a VM system that uses copy-on-write.
It is common for such VM systems to set up a single zero-filled page of
physical memory. When memory is first allocated, all the virtual pages
are initially mapped to that single zero-filled page as copy-on-write.
That means that if you don't write to the memory, you will always be
reading zeros from a single page in physical memory, no matter how large
your data structure is. So all reads come from (very likely) level-1
cache and are very fast. If you write to memory first, then each page in
your virtual memory is mapped to its own page in physical memory, and
suddenly large data structures don't fit in the processor cache and
access gets a lot slower.

That doesn't explain why


It looks as if the problem is not C (or gcc) specific (although there
might be some interesting morals for C programmers).
Old Wolf wrote:
> bear wrote: > Can you post the actual timings you observed?
> The only thing that comes to mind is, the operations in the
> io==1 version are all on small numbers, so maybe they don't
> take as long as the io==0 version which are all on random numbers
> which could be large and/or negative. Also, the large number
> operations might be causing integer overflow exceptions.
The calculation of random number runs slowly than small numbers?
I think they use the same instruction and the same width of operand.
Also, can these overflow exceptions cause slow down?


Interesting idea, but my first results didn't confirm this (ie.
that overflows might cause slow-down), but I'll yet investigate this.

There's big difference when you pre-initialize the allocated
memory before the big loop. I used this code:

void *randmalloc(size_t s)
{
size_t i;
char *pc = malloc(s);
if (pc)
for (i=0; i<s; ++i)
pc[i] = rand();
return pc;
}

#define malloc randmalloc

(Okay, okay, I know it's not legal to redefine malloc, but it's
quicker that modifying the source.)
> > s0 = (short *)malloc( length*sizeof(short) );
>
> Lose the cast (search this newsgroup for 'malloc' to learn
> more about why):

I search but do not find any clue on why this cast is incorrect.
So I just leave it there.


It's not incorrect per se, but it is malpractice.
See FAQ 7.7.


Nov 14 '05 #10

On Sat, 11 Jun 2005 18:41:56 +0100, Lawrence Kirby wrote:

Please ignore this last posting which is incomplete and incorrect. It was
an old working copy that was posted by accident.

Lawrence

Nov 14 '05 #11
