re: Help in optimizing branches
John Malek wrote:[color=blue]
> Hi,
>
> I ran a profiler against this complex app that I'm trying to optimize.
> This is an application I'm doing to test image processing. Even
> though it does a lot of computation, two simple lines take 30% of the
> running times! Both these lines are from Intel's OpenCV library.
> Note that mhi, mask8u, and mask are arrays with one entry per pixel in
> a 640x480 image.
>
> If anyone has any hints on how to optimize this, it would be greatly
> appreciated.[/color]
This question is off-topic for comp.std.c++ - try comp.programming.
However, start by posting the whole routine or send us a pointer to the
library.
As for some possible answers:-
You could try hoisting loop-invariant values (such as mhi->cols and
mask->step) out of the loops, though the optimizer may already have done
this. Loop unrolling is another option. Finally, you need to look at how
the loops use the cache.
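As a rough illustration of the first two suggestions, here is a minimal sketch rather than a rewrite of the OpenCV routine. The function count_matches, its parameters, and the choice to count matching pixels are all assumptions made for the example; strides are given in elements, not bytes. Row pointers are computed once per row, the inner loop is unrolled by four, and the two tests are combined with a bitwise & so there is no branch per pixel.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an OpenCV function): counts pixels whose value
// equals `target` and whose mask byte is zero. Strides are in elements.
std::size_t count_matches(const std::int32_t* values, std::size_t value_stride,
                          const std::uint8_t* mask, std::size_t mask_stride,
                          int rows, int cols, std::int32_t target)
{
    std::size_t hits = 0;
    for (int y = 0; y < rows; ++y) {
        // Row pointers are loop invariants of the inner loop, so compute
        // them once per row.
        const std::int32_t* v = values + static_cast<std::size_t>(y) * value_stride;
        const std::uint8_t* m = mask + static_cast<std::size_t>(y) * mask_stride;

        int x = 0;
        // Unrolled by four; each comparison yields 0 or 1, and & combines
        // them without a conditional jump.
        for (; x + 4 <= cols; x += 4) {
            hits += (v[x]     == target) & (m[x]     == 0);
            hits += (v[x + 1] == target) & (m[x + 1] == 0);
            hits += (v[x + 2] == target) & (m[x + 2] == 0);
            hits += (v[x + 3] == target) & (m[x + 3] == 0);
        }
        // Remainder loop for widths that are not a multiple of four.
        for (; x < cols; ++x) {
            hits += (v[x] == target) & (m[x] == 0);
        }
    }
    return hits;
}
```

Whether this beats what the compiler already generates depends on the optimizer and the target, so measure before and after.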
[color=blue]
>
> const int cts = (int&)ts;
> for( y = 0; y < mhi->rows; y++ )
> {
> int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
> uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;
>[/color]
[color=blue]
> for( x = 0; x < mhi->cols; x++ )
> {
>[color=green][color=darkred]
>>>>>>>> if( mhi_row[x] == cts && mask8u_row[x] == 0 )[/color][/color]
>
> THE LINE ABOVE TAKES 20% of the time
>
>
> uchar* mask_row = mask->data.ptr + mask->step;[/color]
add:
uchar* mask_row_plus_width_plus_1 = mask_row + size.width+1;
int step = mask->step;
[color=blue]
> for( i = 1; i <= size.height; i++, mask_row += mask->step )[/color]
// Comparison with 0 is often faster, and i is only used to limit the
// loop count, so you can reverse the loop.
replace:
for(
i = size.height;
i > 0;
i--,
mask_row += step,
mask_row_plus_width_plus_1 += step
)
// you could theoretically also unroll this loop easily
[color=blue]
> {
>[color=green][color=darkred]
>>>>>>>> mask_row[0] = mask_row[size.width+1] = (uchar)1;[/color][/color][/color]
replace:
mask_row_plus_width_plus_1[0] = (uchar)1;
mask_row[0] = (uchar)1;
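Put together, the changes above might look like the following sketch. The function name mark_side_borders and its parameters are made up for illustration. It assumes the mask has height+2 rows of at least width+2 bytes each, with step bytes between rows, as the original indexing suggests.

```cpp
#include <cstddef>

typedef unsigned char uchar;

// Hypothetical helper: sets the first and last column of rows 1..height
// of a (height+2) x (width+2) mask to 1.
void mark_side_borders(uchar* mask_data, std::ptrdiff_t step,
                       int width, int height)
{
    uchar* left = mask_data + step;    // row 1, column 0
    uchar* right = left + width + 1;   // row 1, column width+1

    // Count down to zero; both pointers advance by one row per iteration,
    // so no index arithmetic is left inside the loop body.
    for (int i = height; i > 0; --i, left += step, right += step) {
        *left = 1;
        *right = 1;
    }
}
```

The loop runs exactly height times, the same as the original i = 1 .. size.height loop.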
[color=blue]
>
> THE LINE ABOVE TAKES 10% of the time[/color]
Both of these look like cache thrashers, depending on the value of
"step". The usual cache optimization is to restructure the algorithm so
that it works on one small "block" of memory at a time, hence the name
"blocking". This may require a change in the data structure, which is
why many image algorithms work with "tiles".
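To show what blocking looks like in code, here is a small sketch using a transpose, a standard case where a plain row-by-row loop strides through memory badly. It is not taken from OpenCV; the function name transpose_tiled and the default tile size of 32 are assumptions, and the best tile size depends on the cache of the machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical example: transposes a rows x cols byte image one tile at a
// time, so the source and destination tiles both stay cache-resident.
std::vector<unsigned char> transpose_tiled(const std::vector<unsigned char>& src,
                                           std::size_t rows, std::size_t cols,
                                           std::size_t tile = 32)
{
    std::vector<unsigned char> dst(rows * cols);
    for (std::size_t y0 = 0; y0 < rows; y0 += tile) {
        const std::size_t y1 = std::min(y0 + tile, rows);
        for (std::size_t x0 = 0; x0 < cols; x0 += tile) {
            const std::size_t x1 = std::min(x0 + tile, cols);
            // Work is confined to one tile before moving to the next.
            for (std::size_t y = y0; y < y1; ++y) {
                for (std::size_t x = x0; x < x1; ++x) {
                    dst[x * rows + y] = src[y * cols + x];
                }
            }
        }
    }
    return dst;
}
```

The same two outer loops over tile origins can wrap other per-pixel work. Whether it helps depends on the access pattern, so profile again after the change.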