Bytes | Software Development & Data Engineering Community
Is this the optimal FIR filter on all platforms?

Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)

The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients are just
rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.

My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)

The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see there
were very different behaviors on the different platforms!

The SunOS program performs extremely well without the critical line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is
included, about 28 times slower!

As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and
only about twice as much with the critical line.

Can someone help me to speed up the program on the Linux and SunOS
platforms?

Regards,
Johan
#include <malloc.h>
int main()
{
    const int nrof_lags = 10000;
    const int nrof_taps = 10000;
    const int * const coeff_ptr =
        (const int * const) malloc(nrof_taps*sizeof(int));
    const int * const input_ptr =
        (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
    int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
    const int * tmp_coeff_ptr;
    const int * tmp_input_ptr;
    int * tmp_output_ptr;
    int sum;
    int lag, tap;
    tmp_output_ptr = (int *) output_ptr;
    for (lag=0; lag<nrof_lags; lag++)
    {
        tmp_coeff_ptr = (const int *) coeff_ptr;
        tmp_input_ptr = (const int *) input_ptr + lag;
        sum = 0;
        for (tap=0; tap<nrof_taps; tap++)
        {
            sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
        }
        *tmp_output_ptr++ = sum;
    }
}
Nov 13 '05 #1
25 replies, 6580 views
Johan Bergman wrote:
Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
.... snip ....
Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)
.... snip ....
Can someone help me to speed up the program on the Linux and SunOS
platforms?

Assuming that you are using basically the same compiler and the same
optimization options in each case, you must be running on different
hardware. You've given no reason to believe that the OS is responsible.
If one or more of your platforms supports SSE/SSE2 instructions, and you
can change the operands to float/double, you should be able to get much
better performance. Split the sum into 4 or 8 partial sums, thus
parallelizing the sum reduction. This becomes OT if you use a C compiler
which supports parallel instructions only in asm.
--
Tim Prince
Nov 13 '05 #2

[Warning: long style critique ahead]

On Sat, 25 Oct 2003, Johan Bergman wrote:

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
What in the world is an FIR filter? (Finite impulse-response, yes,
yes, I know, I looked it up. Point is, practically nobody in this
newsgroup will know, either, and they'll *all* have to look it up
if they really want to help you. You're making it harder for people
to help you when you don't define your terms.)
My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)
.... snip ....
Can someone help me to speed up the program on the Linux and SunOS
platforms?
Well, first let's clean up your code and see what it looks like
then. Clean code is always easier to operate on. :-)

#include <malloc.h>
Non-standard header. 'malloc' is declared in <stdlib.h>.

#include <stdlib.h>
int main()
[Some regulars strongly believe in 'int main(void)'. I
don't have a strong preference either way, but it's wise
to consider why you picked the style you did, even on
little things like this.]
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
Okay, yeah, 'const' is great. But really, this is ridiculous.
You don't think that all these consts are somehow making your
program more efficient, do you? (They're not hurting, but
over-constifying is Bad mostly because it's Ugly, and Ugly code
is hard to make Right.)
(const int * const) malloc(nrof_taps*sizeof(int));
malloc(nrof_taps * sizeof *coeff_ptr);

No chance for error using this idiom.
Also note that casting a malloc() result to 'const' anything
is just silly, because free() takes a non-const pointer
argument. How are you going to free() this memory, hmm?

const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
Way too many similarly-named variables, in my opinion.
Again, purely a stylistic issue, but again, clean code is
nice code. Note that in the cleaned-up version, I've pulled
in their scope to where they're actually used. This may
actually help with the efficiency of the program, since some
optimizers look for "windows" in which objects can be moved
into registers and suchlike. The wider their scopes are, the
harder it is for the optimizer to see them.
tmp_output_ptr = (int *) output_ptr;
Casting away constness is Bad. Especially when you've just
finished jumping through hoops to make it const in the first
place. Geez.
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
STOP PERFORMING RANDOM CASTS! For Pete's sake!
Get a hold of yourself here!
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
This line (and the one below) are just begging to be
simplified. Let's expand it so we can see the order
of operations.
}
*tmp_output_ptr++ = sum;
}
/* always */ return 0; /* from main */
}


Okay, let's put it all together: no casts, clean names,
clean mallocs, and see what we get!

You see how all those tmp_*_ptr objects just melt away
once we start to see the outline of the program clearly.
In fact, we uncover a subtle bug in the memory allocation,
and fix it ('input' was being malloc'ed one element short)!
Finally, we strive for idiomatic identifiers: using 'i' and
'j' for the nested loop controls, and losing the silly
'_ptr' suffixes on the main data arrays.
#include <stdlib.h>

int main(void)
{
int nrof_lags = 10000; /* Could both of these be #defines? */
int nrof_taps = 10000; /* That would help a little. */

int *coeffs = malloc(nrof_taps * sizeof *coeffs);
int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
int *output = malloc(nrof_lags * sizeof *output);
int i, j;

for (i=0; i < nrof_lags; ++i)
{
int sum = 0;
for (j=0; j < nrof_taps; ++j)
sum += coeffs[j] * input[i+j];
output[i] = sum;
}

free(coeffs);
free(input);
free(output);
return 0;
}
There -- isn't that a heck of a lot simpler and easier to
read? I bet it performs comparably fast on all your machines.
If not, you might try changing 'int' to 'unsigned int' (in
fact, you might do that anyway, to guard against overflow
conditions); reversing the order of either or both loops;
if many of the 'coeffs[j]' are zero, try pre-processing
'coeffs' a little bit; little tweaks like that.

I think most of your performance differences are due to
different cache layouts on the different machines, but I
could easily be wrong.

How's the code do now?

HTH,
-Arthur
Nov 13 '05 #3
"Johan Bergman" <us**@inter.net> wrote in message
news:_y********************@newsc.telia.net...
Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
.... snip ....
The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see
there were very different behaviors on the different platforms!
.... snip ....

No idea what the problem is, but if you comment out
*tmp_output_ptr++ = sum;
compiler may optimize away calculation of sum in previous statement,
which could explain performance differences. Try
1. optimize the code a little bit; instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it's
not used somewhere later)
2. use calloc() instead of malloc(). I'm not sure what the alignment
implications are (some additional instructions for access to data via a
dereferenced ptr in a malloc'd block?) in case of malloc() use on any
particular platform.

I would appreciate a posting of the proper explanation and solution,
when you find it.
Nov 13 '05 #4
Johan Bergman wrote:
.... snip ...
Can someone help me to speed up the program on the Linux and SunOS
platforms?

#include <malloc.h>
No such standard header. Use stdlib.h.
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
Don't cast malloc output.
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
Same here.
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
Same here.
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
Uninitialized data. Anything can happen.
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}


I suggest you first create a legal program before measuring
performance.

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Nov 13 '05 #5

On Sat, 25 Oct 2003, nobody wrote:

"Johan Bergman" <us**@inter.net> wrote...

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
Try
1. optimize code little bit, instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it's
not used somewhere later)
That will almost certainly slow down the code immensely; remember
that 'tmp_output_ptr' points to a 100000-element array, which is highly
unlikely to be readily accessible. 'sum', on the other hand, should
have been optimized by the compiler to be stored very accessibly in
a register or suchlike.
2. use calloc() instead of malloc(). I'm not sure what are alignment
implications (some additional instructions for acces to data via
dereferenced
ptr in malloc'd block?) in case of malloc() use on any particular platform.


On many modern systems, there won't even be a significant difference
between the performance of calloc and malloc -- but since malloc doesn't
need to write zeroes everywhere, it's faster in theory. OTOH, if the
OP used calloc, he would have a program that actually exhibited defined,
reproducible behavior. :)

-Arthur

Nov 13 '05 #6
"Arthur J. O'Dwyer" <aj*@nospam.andrew.cmu.edu> wrote in message
news:Pi**********************************@unix49.a ndrew.cmu.edu...

On Sat, 25 Oct 2003, nobody wrote:

"Johan Bergman" <us**@inter.net> wrote...

Maybe someone can help me to optimize this C/C++ implementation of a FIR filter (although I realize that I should probably consider an FFT approach instead.)
Try
1. optimize code little bit, instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it's
not used somewhere later)
That will almost certainly slow down the code immensely; remember
that 'tmp_output_ptr' points to a 100000-element array, which is highly
unlikely to be readily accessible. 'sum', on the other hand, should
have been optimized by the compiler to be stored very accessibly in
a register or suchlike.

I was thinking of that too, but the array has 10K elements, not 100K, so
assuming 32-bit integers, something below 160K is malloc'd from the
heap (2 arrays of 10K ints, 1 of 20K ints). I assumed that the
OP is running SunOS on hardware with more memory than that.
But notwithstanding, you may be right.
2. use calloc() instead of malloc(). I'm not sure what are alignment
implications (some additional instructions for acces to data via
dereferenced
ptr in malloc'd block?) in case of malloc() use on any particular

platform.
On many modern systems, there won't even be a significant difference
between the performance of calloc and malloc -- but since malloc doesn't
need to write zeroes everywhere, it's faster in theory. OTOH, if the
That is true, but allocation is done outside of the loop, so this difference
would IMHO be practically immeasurable in the given program (10K
multiplications and additions in the inner loop times 10K iterations
of the outer loop). But I was really speculating, because
I can't explain those huge execution time differences (assuming
correct methodology was used to measure this time).
OP used calloc, he would have a program that actually exhibited defined,
reproducible behavior. :)

-Arthur

Nov 13 '05 #7
On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:
.... snip ....
I was thinking of that too, but the array has 10K elements, not 100K, so
assuming 32-bit integers, something below 160K is malloc'd from the
heap (2 arrays of 10K ints, 1 of 20K ints). I assumed that the
OP is running SunOS on hardware with more memory than that.
But notwithstanding, you may be right.


For what it's worth, I compiled both versions on my machine (which
has plenty of memory). The version with sum was faster:

with sum:
[sheldon@wsxyz]$ gcc -Wall -W -O3 -o test1 test1.c
[sheldon@wsxyz]$ time ./test1

real 0m0.627s
user 0m0.560s
sys 0m0.000s
without sum:
[sheldon@wsxyz]$ gcc -Wall -W -O3 -o test2 test2.c
[sheldon@wsxyz]$ time ./test2

real 0m0.710s
user 0m0.630s
sys 0m0.000s
Nov 13 '05 #8
"Sheldon Simms" <sh**********@yahoo.com> wrote in message
news:pa****************************@yahoo.com...
.... snip ....
For what it's worth, I compiled both versions on my machine (which
has plenty of memory). The version with sum was faster:
.... snip ....

Thanks. So I was wrong. My sincere apologies. Had I a SunOS box
at my disposal, I would have tested it before posting :-)
Nov 13 '05 #9

On Sat, 25 Oct 2003, nobody wrote:

.... snip ....


I was thinking of that too, but the array has 10K elements, not 100K, so
assuming 32-bit integers, something below 160K is malloc'd from the
heap (2 arrays of 10K ints, 1 of 20K ints). I assumed that the
OP is running SunOS on hardware with more memory than that.


More main memory, yes. More cache memory, no. I was referring to
the size of the cache in this case. You're right, though, that if
the arrays get really ungodly huge, then the OP might need to look
at the effects of whatever virtual memory facilities his OS is
trying to use.

This is way OT; I'll stop now.

-Arthur
Nov 13 '05 #10
"Johan Bergman" <us**@inter.net> writes:
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));


You are not initializing the input. Apart from the fact that the
behavior is undefined, as pointed out by previous posters, you ought
to be aware that floating-point arithmetic on infinities and NaN's on
Intel processors is much slower than floating-point arithmetic on
finite floating-point numbers. Your observed slowdown is likely
caused by a floating-point NaN that happens to be in the uninitialized
array.

Independently of this problem, to answer the question in the title,
your FIR filter is not optimal on *any* platform that I can think of,
but the reason is way off-topic for this newsgroup. (It is also
suboptimal from a roundoff error point of view.)

Cheers,
Matteo Frigo

Nov 13 '05 #11
Hi Tim, thanks for your reply. Yes, I am running on different hardware:

Cygwin on a 400-MHz x86
Linux on a 1620-MHz x86
SunOS on a 333-MHz sparc

I compiled with gcc -O3. I get similar behavior with gcc versions
2.95.3 and 3.2.2. I check the execution time with the 'time' command.

Thanks for the tip to change from int to float/double! Here are the results:

Linux: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
SunOS: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
Cygwin: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double

So changing from int to float/double brought down the execution time a lot
in Linux and SunOS. Now I have similar performance on the two x86 platforms,
and the sparc platform is now only two times worse in the float case. :-)

Regards,
Johan
Assuming that you are using basically the same compiler and the same
optimization options in each case, you must be running on different
hardware.
.... snip ....

Nov 13 '05 #12
Hi Arthur, thanks for your reply!
What in the world is an FIR filter? (Finite impulse-response, yes,
yes, I know, I looked it up. Point is, practically nobody in this
newsgroup will know, either, and they'll *all* have to look it up
if they really want to help you. You're making it harder for people
to help you when you don't define your terms.)
Sorry, maybe I should have described the background a bit better. An FIR
filter is one of the basic filter types in digital signal processing, and my
example shows one of the ways to realize it. For an FIR filter this long
(10000 taps), it would actually make more sense to choose another realization
based on FFTs (Fast Fourier Transforms). But first I would like to see how
fast I can make the realization in the example.
Well, first let's clean up your code and see what it looks like
then. Clean code is always easier to operate on. :-)
OK then! :-)
.... snip ....
Thanks for the cleanup!
How's the code do now?


Sorry, it is slightly slower than my (admittedly terrible) original code.

BR,
Johan
Nov 13 '05 #13
> No idea what the problem is, but if you comment out
*tmp_output_ptr++ = sum;
compiler may optimize away calculation of sum in previous statement,
which could explain performance differences.
You are right, of course!
Try
1. optimize code little bit, instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it's
not used somewhere later)
You probably meant something like this (since we don't want to step
tmp_output_ptr inside the inner loop):

for (tap=0; tap<nrof_taps; tap++)
    *tmp_output_ptr += *tmp_coeff_ptr++ * *tmp_input_ptr++;
tmp_output_ptr++;

I tried this, but unfortunately the result was about 40% longer execution
time (at least on the Sun platform).
2. use calloc() instead of malloc(). I'm not sure what are alignment
implications (some additional instructions for acces to data via
dereferenced
ptr in malloc'd block?) in case of malloc() use on any particular platform.

Ok, I tried changing malloc() to calloc() but it didn't affect the execution
time.
I would appreciate posting of proper explanation and solution,
when you find it.


I will. See in my answer to Tim Prince about the results of changing int to
float/double.

BR,
Johan
Nov 13 '05 #14
Hi Sheldon, what kind of processor do you have? What is the clock frequency?

BR,
Johan

"Sheldon Simms" <sh**********@yahoo.com> wrote in message
news:pa****************************@yahoo.com...
.... snip ....

Nov 13 '05 #15
Hi again,

I said in my post to you that your program performs slightly worse (at least
on the Sun platform). However, when I changed from int to float/double, as
proposed by Tim Prince, there was no longer any difference in execution time
between your code and mine. Funny!

By the way, here are the results (with my code):

Linux/x86: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
SunOS/sparc: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
Cygwin/x86: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double

So with float/double I get the same performance on the two x86 platforms.
And in the float case, the sparc platform doesn't fall too much behind.

Regards,
Johan
Nov 13 '05 #16
Hi again, again! Now I have tried your code and mine on the Cygwin/x86
platform as well, and this time your code was more than two times slower
than mine.

So it seems like this:
    sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
often performs better than this:
    sum += coeffs[tap] * input[lag+tap];

Regards,
Johan
Nov 13 '05 #17
In article <bn**********@newstree.wise.edt.ericsson.se>, Johan Bergman wrote:

Cygwin on a 400-MHz x86
Linux on a 1620-MHz x86
SunOS on a 333-MHz sparc

I compiled with gcc -O3. I get similar behavior with gcc versions
2.95.3 and 3.2.2. I check the execution time with the 'time' command.

Linux: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
SunOS: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
Cygwin: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double


Comparing clock cycles confuses the issue. Processor speeds, cache
speeds/sizes, and main memory access/write times are not necessarily correlated.
You have to get off your clock-cycle kick, and learn about memory subsystem
performance. There's a lot to learn! And operating systems have small
or zero effect on the result, as you could verify if you took the time
to boot Linux on that 400-MHz x86, Windows on that 1620-MHz x86, or Linux
on that 333-MHz sparc.

- Larry
Nov 13 '05 #18
> Comparing clock cycles confuses the issue. Processor speeds, cache
> speeds/sizes, and main memory access/write times are not necessarily
> correlated. You have to get off your clock-cycle kick, and learn about
> memory subsystem performance. There's a lot to learn! And operating
> systems have small or zero effect on the result, as you could verify if
> you took the time to boot Linux on that 400-MHz x86, Windows on that
> 1620-MHz x86, or Linux on that 333-MHz sparc.


It's not like I am saying that clock frequency is all that matters... I am
just saying that this program takes this many clock cycles to execute on
that platform.

What I wanted was a program that runs fast on any platform, but it seemed to
perform very differently on the platforms that I tried out. In order to
convince myself of that, I had to at least take the different clock
frequencies into account, right? And when I had convinced myself, I posted
my question in this group, to see if someone could help me optimize my code.

Regards,
Johan
Nov 13 '05 #19
Hi Matteo, thanks for your reply!
You are not initializing the input. Apart from the fact that the
behavior is undefined, as pointed out by previous posters, you ought
to be aware that floating-point arithmetic on infinities and NaN's on
Intel processors is much slower than floating-point arithmetic on
finite floating-point numbers. Your observed slowdown is likely
caused by a floating-point NaN that happens to be in the uninitialized
array.
But in my example, I only worked with integers! (By the way, float/double
actually proved to be a lot faster than int on all tested platforms.)
Independently of this problem, to answer the question in the title,
your FIR filter is not optimal on *any* platform that I can think of,
but the reason is way off-topic for this newsgroup. (It is also
suboptimal from a roundoff error point of view.)


Well, as I wrote, I realize that an FFT approach would be beneficial for
such a long FIR filter. But apart from that, are there any other
optimizations that you can think of? In that case I would be most interested
in hearing them! The best would be to get a piece of code, of course! :-)

Regards,
Johan
Nov 13 '05 #20
On Sun, 26 Oct 2003 18:06:59 +0100, Johan Bergman wrote:
Hi Sheldon, what kind of processor do you have? What is the clock frequency?


That's an Athlon XP 1800+ / Red Hat 9

Nov 13 '05 #21
> That's an Athlon XP 1800+ / Red Hat 9

Ok, you got about the same execution time as I did on my Linux/x86 platform,
about 1100 million cycles.

Regards,
Johan
Nov 13 '05 #22
Johan Bergman wrote: (and eliminated all attributions)
You are not initializing the input. Apart from the fact that the
behavior is undefined, as pointed out by previous posters, you
ought to be aware that floating-point arithmetic on infinities
and NaN's on Intel processors is much slower than floating-point
arithmetic on finite floating-point numbers. Your observed
slowdown is likely caused by a floating-point NaN that happens
to be in the uninitialized array.


But in my example, I only worked with integers! (By the way,
float/double actually proved to be a lot faster than int on all
tested platforms.)


You still have code with undefined behaviour. Why don't you first
fix the code and stop wasting all our time with this nonsense.

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 13 '05 #23
> > Independently of this problem, to answer the question in the title,
> > your FIR filter is not optimal on *any* platform that I can think of,
> > but the reason is way off-topic for this newsgroup. (It is also
> > suboptimal from a roundoff error point of view.)
Well, as I wrote, I realize that an FFT approach would be beneficial for
such a long FIR filter. But apart from that, are there any other
optimizations that you can think of? In that case I would be most
interested in hearing them! The best would be to get a piece of code, of
course! :-)


In your last post, you forgot to answer this question. What are the
optimizations you are thinking of?

Regards,
Johan
Nov 13 '05 #24
Hi Chuck, forget about the question in my last post:
In your last post, you forgot to answer this question. What are the
optimizations you are thinking of?


I thought you were someone else (Matteo).

Regards,
Johan
Nov 13 '05 #25
Some of you requested a cleaned-up program. Here it is. The first lines
might contain some C++, sorry about that.

I also changed the data type from int to float since it seems to give better
performance on some popular platforms (sun4u sparc and newer x86
processors).

Note: I am aware of the benefits of an FFT approach for such long FIR
filters.

Regards,
Johan
#include <stdlib.h>

int main(void)
{
    /* enum constants (rather than const int) so the array sizes are
       valid constant expressions in C as well as C++ */
    enum { nrof_lags = 10000, nrof_taps = 10000 };

    /* static so the ~160 KB of data is zero-initialized and does not
       land on the stack */
    static float coeff[nrof_taps];
    static float input[nrof_taps + nrof_lags - 1];
    static float output[nrof_lags];

    float sum;
    int lag, tap;
    float *tmp_coeff_ptr;
    float *tmp_input_ptr;
    float *tmp_output_ptr = output;

    for (lag = 0; lag < nrof_lags; lag++)
    {
        tmp_coeff_ptr = coeff;
        tmp_input_ptr = input + lag;
        sum = 0;
        for (tap = 0; tap < nrof_taps; tap++)
        {
            sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
        }
        *tmp_output_ptr++ = sum;
    }

    return 0;
}
Nov 13 '05 #26
