Connecting Tech Pros Worldwide Help | Site Map

Help in optimizing branches

John Malek
Guest
 
Posts: n/a
#1: Jul 19 '05
Hi,

I ran a profiler against this complex app that I'm trying to opimize.
This is an application I'm doing to test image processing. Even
though it does a lot of computation, two simple lines take 30% of the
running times! Both these lines are from Intel's OpenCV library.
Note that mhi, mask8u, and mask are arrays with one entry per pixel in
a 640x480 image.

If anyone has any hints on how to optimize this, it would be greatly
appreciated.

Thanks,
John


const int cts = (int&)ts;
for( y = 0; y < mhi->rows; y++ )
{
int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;

for( x = 0; x < mhi->cols; x++ )
{[color=blue][color=green][color=darkred]
>>>>>>> if( mhi_row[x] == cts && mask8u_row[x] == 0 )[/color][/color][/color]
THE LINE ABOVE TAKES 20% of the time


uchar* mask_row = mask->data.ptr + mask->step;
for( i = 1; i <= size.height; i++, mask_row += mask->step )
{[color=blue][color=green][color=darkred]
>>>>>>> mask_row[0] = mask_row[size.width+1] = (uchar)1;[/color][/color][/color]
THE LINE ABOVE TAKES 10% of the time
Gianni Mariani
Guest
 
Posts: n/a
#2: Jul 19 '05

re: Help in optimizing branches


John Malek wrote:[color=blue]
> Hi,
>
> I ran a profiler against this complex app that I'm trying to opimize.
> This is an application I'm doing to test image processing. Even
> though it does a lot of computation, two simple lines take 30% of the
> running times! Both these lines are from Intel's OpenCV library.
> Note that mhi, mask8u, and mask are arrays with one entry per pixel in
> a 640x480 image.
>
> If anyone has any hints on how to optimize this, it would be greatly
> appreciated.[/color]

This question is off-topic for comp.std.c++ - try comp.programming.

However, start by posting the whole routine or send us ap pointer to the
library.

As for some possible answers:-

You could try folding some constants out of the loop but it's possible
that the optimizer has already done this. The other thing is loop
unrolling. Finally you need to look at cache.
[color=blue]
>
> const int cts = (int&)ts;
> for( y = 0; y < mhi->rows; y++ )
> {
> int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
> uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;
>[/color]
[color=blue]
> for( x = 0; x < mhi->cols; x++ )
> {
>[color=green][color=darkred]
>>>>>>>> if( mhi_row[x] == cts && mask8u_row[x] == 0 )[/color][/color]
>
> THE LINE ABOVE TAKES 20% of the time
>
>
> uchar* mask_row = mask->data.ptr + mask->step;[/color]

add:
uchar* mask_row_plus_width_plus_1 = mask_row + size.width+1;
int step = mask->step;
[color=blue]
> for( i = 1; i <= size.height; i++, mask_row += mask->step )[/color]
// comparison with 0 is faster - i is only used to limit the loop
// count so you can reverse the loop.
replace:

for(
i = size.height;
i >= 0;
i--,
mask_row += step,
mask_row_plus_width_plus_1 += step
)
// you could theoretically also unroll this loop easily
[color=blue]
> {
>[color=green][color=darkred]
>>>>>>>> mask_row[0] = mask_row[size.width+1] = (uchar)1;[/color][/color][/color]

replace:
mask_row_plus_width_plus_1[0] = (uchar)1;
mask_row[0] = (uchar)1;
[color=blue]
>
> THE LINE ABOVE TAKES 10% of the time[/color]

Both of these seem like cache thrashers depending on the value of
"step". Basically the cache optimizations are changing the algorithm to
limit the memory footprint to "blocks" at a time. Hence the name
"blocking". This may require a change in the data structure. That's
why many image algorithms work with "tiles".



Nick Savoiu
Guest
 
Posts: n/a
#3: Jul 19 '05

re: Help in optimizing branches


"John Malek" <john_malek@hotmail.com> wrote in message
news:9eb574a1.0309301452.36d17b3a@posting.google.c om...[color=blue]
> Hi,
>
> I ran a profiler against this complex app that I'm trying to opimize.
> This is an application I'm doing to test image processing. Even
> though it does a lot of computation, two simple lines take 30% of the
> running times! Both these lines are from Intel's OpenCV library.
> Note that mhi, mask8u, and mask are arrays with one entry per pixel in
> a 640x480 image.
>
> If anyone has any hints on how to optimize this, it would be greatly
> appreciated.[/color]

John,

I don't think you can expect much improvement.

Those lines are in nested loops and will get executed many times. You can
try to do some constant propagation, loop invariant code motions by hand but
probably the compiler can do that too if the code's not too complicated.

Nick


Dan McLeran
Guest
 
Posts: n/a
#4: Jul 19 '05

re: Help in optimizing branches


> If anyone has any hints on how to optimize this, it would be greatly[color=blue]
> appreciated.
>
> Thanks,
> John
>
>
> const int cts = (int&)ts;
> for( y = 0; y < mhi->rows; y++ )
> {
> int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
> uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;
>
> for( x = 0; x < mhi->cols; x++ )
> {[color=green][color=darkred]
> >>>>>>> if( mhi_row[x] == cts && mask8u_row[x] == 0 )[/color][/color]
> THE LINE ABOVE TAKES 20% of the time
>
>
> uchar* mask_row = mask->data.ptr + mask->step;
> for( i = 1; i <= size.height; i++, mask_row += mask->step )
> {[color=green][color=darkred]
> >>>>>>> mask_row[0] = mask_row[size.width+1] = (uchar)1;[/color][/color]
> THE LINE ABOVE TAKES 10% of the time[/color]

I'm wondering if moving the pointer dereference out of the loop logic
would help. I assume that mhi->cols is not changing within the loop?
Try moving this out of the for loop. The compiler may already be doing
this for you. A quick look at the asm output would tell whether this
could buy you some time.
Ron Natalie
Guest
 
Posts: n/a
#5: Jul 19 '05

re: Help in optimizing branches



"John Malek" <john_malek@hotmail.com> wrote in message news:9eb574a1.0309301452.36d17b3a@posting.google.c om...
[color=blue]
> int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);[/color]

Just what is mhi->data.ptr? Hopefully, it's something that converts
to int* nicely.

[color=blue][color=green][color=darkred]
> >>>>>>> if( mhi_row[x] == cts && mask8u_row[x] == 0 )[/color][/color][/color]

Depending on the platform, using pointers might be faster:
if (*mhi_row++ == cts && *mask8u_row++ == 0)
you may need to put the increments elsewhere depending on what the rest
of the code does with these values.
[color=blue]
> uchar* mask_row = mask->data.ptr + mask->step;
> for( i = 1; i <= size.height; i++, mask_row += mask->step )
> {[color=green][color=darkred]
> >>>>>>> mask_row[0] = mask_row[size.width+1] = (uchar)1;[/color][/color][/color]

You might precompute size.width+1 into a local variable rather than preforming
the addition each time, this looks invariant to the loop. There's no reason to cast
the literal 1.



Closed Thread