Optimisation needed

spasmous

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Jul 22 '05 #1

Subscribe Post Reply

1545

Ron Natalie

"spasmous" <sp******@yahoo.com> wrote in message news:2f**************************@posting.google.c om...

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

It would be ncie if you told us the types of some of these things like px, py
and array_1,2,3.

Also if you're on a Pentium chip, you may wish to ask in a intel specific
newsgroup. Most compilers generate absolutely horrific code for float/double
to int conversion.

Jul 22 '05 #2

Andre Kostur

sp******@yahoo.com (spasmous) wrote in news:2f066762.0401261506.2c3522a2
@posting.google.com:

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It
doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.

Jul 22 '05 #3

Thomas Matthews

spasmous wrote:

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
Let us perform some substition:
Given X0 == px - m; from above,
dx = px - (px - m);
dx = px - px + m;
dx = m;
Similarly with dy:
dy = n;
deltad = dx*dx + dy*dy;
This is simplified to:
deltad = m * m + n * n;
This simplification removes the variables dx and dy.

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library

Jul 22 '05 #4

spasmous

Andre Kostur <nn******@kostur.net> wrote in message news:<Xn*******************************@207.35.177 .135>...

Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It
doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.

Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

Jul 22 '05 #5

Andre Kostur

Thomas Matthews <Th*************************@sbcglobal.net> wrote in
news:40**************@sbcglobal.net:

spasmous wrote:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
Let us perform some substition:
Given X0 == px - m; from above,
dx = px - (px - m);
dx = px - px + m;
dx = m;

Error. Your replacements aren't identical to the original:

Note that x0 is assigned to the integral part of px, minus m. Not just
px. So later on, when dx is calculated... the px mentioned in there is
the float px, not just the integral part. Thus you can't drop the px term
from the equations.
Similarly with dy:
dy = n;

Same error.

deltad = dx*dx + dy*dy;

This is simplified to:
deltad = m * m + n * n;
This simplification removes the variables dx and dy.

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Jul 22 '05 #6

Patrik Stellmann

for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{

The compiler probably optimizes it anyway and for numeric values might
be no real difference in general but usually ++m is faster than m++
since the pseudo-code for the first looks like:

increment m;
return m;

and for the second:

temp = m;
increment m;
return temp;

so in for loops where you are not interested in the result you might
prefer the prefix ++ operator before the postfix...

Jul 22 '05 #7

Kevin Goodsell

spasmous wrote:

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

I haven't really looked at your code, but here's a few tips.

The first thing you should do (and this applies to ALL attempts at
optimization) is profile the code. This serves 2 purposes: 1) It can
help you determine if optimization is actually necessary - there's no
reason to optimize code that's already fast enough. 2) It allows you to
properly target optimization effort. Never try to guess which part of
the code is slow, and never try to guess what tweaks will make it
faster. These things have to be determined empirically.

The second thing you should do is look for a better algorithm.
Algorithmic improvements almost always outperform fine-tuning, often
dramatically.

The third thing you should probably do is double check your compiler's
optimization settings. See if there's a more appropriate set of options
for your purposes.

Only after doing those should you look at fine-tuning the code. In
general, hand-optimized code is uglier, more error-prone, more difficult
to debug, and more difficult to maintain, so it should be avoided until
there are no other options. Many of the things you are likely to try are
probably already done by decent optimizing compilers, so make sure you
profile to see if your changes make a difference - there's no reason to
use uglier code that's just as slow.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.

Jul 22 '05 #8

NKOBAYE027

if the range is only 1 or 2 then you're probably best of to not bother with
loops - if speed (not size) is of the essence just write the code out
straight.

definitely get rid of all those declarations and casts inside the loops -
there has to be a better way of representing the data than that
hodgepodge...use register counters in the loops (though most compilers
probably optimise to that anyway)

depending upon the accuracy of your exp requirements use your own
approximation algorithm - just be sure to calculate the factorial
denominators beforehand and save them in an array for later use...

"spasmous" <sp******@yahoo.com> wrote in message
news:2f**************************@posting.google.c om...

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Jul 22 '05 #9

Chris Theis

"spasmous" <sp******@yahoo.com> wrote in message
news:2f************************@posting.google.com ...

Andre Kostur <nn******@kostur.net> wrote in message news:<Xn*******************************@207.35.177 .135>...
[SNIP]
A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

Well the exponential function is certainly a bottleneck. However, are you
sure that your optimizations are necessary and that this is the right place
to apply them. You should profile first and see where the time is lost. If
it's the code you gave then think whether the algorithm is the best one. If
you can answer both of these questions with yes you might consider
"hands-on" optimization. If you're dealing with constants you might consider
meta-template programming for loop unrolling and the calculation of the
exponential. Still, premature optimization might be an evil thing. Thus,
fire up your profiler first!

Regards
Chris

Jul 22 '05 #10

Andre Kostur

sp******@yahoo.com (spasmous) wrote in
news:2f************************@posting.google.com :

Andre Kostur <nn******@kostur.net> wrote in message
news:<Xn*******************************@207.35.177 .135>...

Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop.
It doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.

Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

Yep... that's why I said "might" :) Those are simply moving loop
invariants outside of the loop. A decent optimizer should be able to spot
those and move them outside on it's own. If you change the actual source
code, it makes it more obvious to other people reading your code.

Jul 22 '05 #11

Nick Hounsome

"spasmous" <sp******@yahoo.com> wrote in message
news:2f************************@posting.google.com ...

Andre Kostur <nn******@kostur.net> wrote in message

news:<Xn*******************************@207.35.177 .135>...

Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.

Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

It is quite possible that operations on double are faster than floats as
the implementation will be optimized for double and
floating point hardware may only actually
work on doubles. You may actually be paying for a lot of
double/float conversions.

I'm not up on how these things are implemented but
expf(deltad*norm) == pow( deltad, expf(norm) )
so precalculating expf(norm) may help.

Jul 22 '05 #12

David Fisher

"spasmous" <sp******@yahoo.com> wrote:

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations? for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));
Instead of multiplying index by 0 or 1, you could say:

if (index < 0 || index > fftrowsSquared)
{
index = 0;
}

and define fftrowsSquared as (kpar->fftrows*kpar->fftrows) before the
beginning of the first loop.

Values of argument to expf():

(square(px - (float) ((int)px - m)) + square(py - (float) ((int) py - n))) *
norm

float pxVal = px - (float) ((int) px)
float pyVal = py - (float) ((int) py)

square(pxVal - (-range to range)) + square(pyVal - (-range to range))

square(pxVal -1) + square(pyVal - 1)
square(pxVal -1) + square(pyVal - 1)
square(pxVal -1) + square(pyVal - 1)
square(pxVal -1) + square(pyVal - 1)

/* calculate distance */
float dx = px - (float) ((int) px - m);
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Jul 22 '05 #13

David Fisher

"David Fisher" <no****@nospam.nospam.nospam> wrote:

Oops, previous mail got sent before it was finished ...

I was trying to say, the arguments to expf() (without the multiplication by
"norm") look something like this:

square(pxVal -1) + square(pyVal + 1) == (square(pxVal) - 2 * pxVal + 1) +
(square(pyVal) + 2 * pyVal + 1)
square(pxVal - 2) + square(pyVal) == (square(pxVal) - 4 * pxVal + 4) +
square(pyVal)
etc.

where

float pxVal = px - (float) ((int) px);
float pyVal = py - (float) ((int) py);

So the differences in the arguments to expf() are a multiple of pxVal plus a
multiple of pyVal plus a constant ...

exp(a + b) = exp(a) * exp(b)

so you can calculate the following values and use them to find the required
values of expf():

exp(square(pxVal))
exp(square(pyVal))
exp(norm) => constant ?
exp(1) => constant, once only
exp(4) => constant, once only
exp(pxVal * 2) => same as square(exp(pxVal)) => just calculate exp(pxVal)
once
exp(pxVal * -2) => same as sqrt(exp(pxVal))
exp(pxVal * 4) => same as square(square(exp(pxVal)))
exp(pxVal * -4) => same as sqrt(sqrt(exp(pxVal)))
etc.

You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

Hope this is helpful,

David F

Jul 22 '05 #14

David Fisher

"David Fisher" <no****@nospam.nospam.nospam> wrote:

values of expf():

exp(square(pxVal))
exp(square(pyVal))
exp(norm) => constant ?
exp(1) => constant, once only
exp(4) => constant, once only
exp(pxVal * 2) => same as square(exp(pxVal)) => just calculate exp(pxVal)
once
exp(pxVal * -2) => same as sqrt(exp(pxVal))
exp(pxVal * 4) => same as square(square(exp(pxVal)))
exp(pxVal * -4) => same as sqrt(sqrt(exp(pxVal)))
etc.

You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

Hmm ... still leaves the problem of multiplying by "norm":

exp(a * b) == pow(exp(a), b)

You wouldn't want to say pow(whatever, norm) for each expression - that
would cost as much as calling expf() in the first place ...

I guess the solution is to multiply all of the above-listed values by
"norm", ie. calculate the values of:

exp(square(pxVal) * norm)
exp(4 * norm)
etc.

(I hope the maths is right :-)

David F

Jul 22 '05 #15

Buster

"Nick Hounsome" wrote

I'm not up on how these things are implemented but
expf(deltad*norm) == pow( deltad, expf(norm) )
so precalculating expf(norm) may help.

You mean expf (deltad * norm) == pow (expf (norm), deltad);
But in any case,
pow (exp (norm), deltad) == exp (deltad * log (exp (norm)));
exp and pow are interdefinable and as such it's hard to say
a priori whether either is faster.

Regards,
Buster.

Jul 22 '05 #16

Buster

"David Fisher" wrote

exp(pxVal * -2) => same as sqrt(exp(pxVal))
No; same as inverse (square (exp (pxVal))).
You should end up with a set of expressions that are faster to calculate than expf(): square, sqrt, *, etc
I'm not sure that sqrt will be much faster than exp,
but luckily need it's not needed.
Hope this is helpful,

David F

Jul 22 '05 #17

spasmous

"David Fisher" <no****@nospam.nospam.nospam> wrote in message news:<A3*****************@nasal.pacific.net.au>...

"Buster" <no***@nowhere.com> wrote:

<snip good suggestions...>

Thanks guys, I played around on paper with the equations and I think I
can greatly reduce the expf() calls the way you suggest. However it's
a bit messy and the calls to pow() are *very* expensive. Slower than
expf() in my tests.

So... another way (similar) is to precompute the dx and dy expf().
I've implemented it like this and expf() only takes up 50% of the
runtime rather than 75% before, with an overall time reduction of
about 25%. Not spectacular but I'll take it :)
/* Storage */
float * EXP_X = (float*)malloc(...);
float * EXP_Y = (float*)malloc(...);
/* Indices */
int x0 = (int)px;
int y0 = (int)py;
float dx = px - x0;
float dy = py - y0;
/* Precompute expf */
for(m=-range;m<range;m++)
{
EXP_X[m+range] = expf(norm*(dx+m)*(dx+m));
EXP_Y(m+range] = expf(norm*(dy+m)*(dy+m));
}
for(m=-range;m<range;m++)
{
for(n=-range;n<range;n++)
{
index = x0*kpar->fftrows + y0;

/* zero out of bounds. NB. This is quicker than using 'if' */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate coefficient */
fscale = EXP_X[m+range] * EXP_Y[n+range];

/* Store values */
....
}
}

Jul 22 '05 #18

by: Simon Elliott | last post by:

What optimisation do compilers typically provide when passing STL containers around? For example, if I do something like this: struct Tbloggs { std::string s1; }; typedef...

C / C++

Magic Optimisation

by: simonwittber | last post by:

Hello People. I've have a very tight inner loop (in a game app, so every millisecond counts) which I have optimised below: def loop(self): self_pool = self.pool self_call_exit_funcs =...

Python

Albert Kallal's QuantityOnHand - Speed Optimisation

by: David Mitchell | last post by:

I use the above function in queries for a number of forms and reports. The reports take approx 20 seconds to open. There are only 100 product id's in tblProducts. My concern is that the time will...

Microsoft Access / VBA

Optimisation techniques

by: EC-AKD | last post by:

Hi All, I am new to the concept of optimising codes in C. I was wondering if C level inlining of code is any way comparable to macros. I believe that inlining is equivalent to writing macros....

C / C++

C# very optimisation

by: Ennixo | last post by:

hi, do you know where i can find some ebooks or websites talking about C# optimisation ? for exemple, i just learned that ++i is faster than i++. i would like to know more about the things...

C# / C Sharp

DB optimisation vs OOP layer separation

by: Jon Maz | last post by:

Hi, I'm facing a code-optimisation issue on an asp.net/vb.net/SQL Server 2000 project. A web page containing not much more than 3 DropDownLists is taking nigh on 6 seconds to load, because each...

Visual Basic .NET

Empty base member optimisation with multiple inheritance

by: David Welch | last post by:

Hi, I have a bit of code where I am relying on empty base member optimisation. The bit of code is below: template<typename Enum> struct EncodePrefix { template<Enum e> struct Apply

C / C++

Program performance/optimisation

by: grid | last post by:

Hi, I was exploring the affect of cache on program performance/optimisation.Is it the compilers responsibility only to consider this kind of optimisation or the programmer can do his bit in this...

C / C++

Help with optimisation

by: special_dragonfly | last post by:

Hello, I know this might be a little cheeky, and if it is, please say, but I need a little hand optimising some code. For the simple reason that this is 'company' code and I have no idea what I'm...

Python

Access Europe: Command bars, the Access Shortcut Tool and a simple Audit Log - Wed 3 April

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

General

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Optimisation needed

Similar topics