
Hi, how can I optimize the following code?

I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?

Dec 11 '06 #1

kenneth wrote:
I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
You can quite often optimize an O(N^2) operation by coming up with an
algorithm that is not O(N^2). The compiler will optimize your loops
quite a bit, but that doesn't resolve the problem of having an
algorithm that is quadratic in N. I don't know what you are trying to
do; you might describe the problem in comp.programming and see if they
have some ideas.

In other words, look to the algorithms you are using not ways to
micro-optimize.

Dec 11 '06 #2
kenneth wrote:
int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
I can't say how to optimize for multithreading, but
don't use the pow function to compute squares.
pow takes the exponent as a floating-point number, and it may be
implemented in terms of exponential and logarithm functions.

Use something like
----
inline double SquareDouble(double r)
{
    return r * r;
}
----
or
----
#define square(x) ((x)*(x))
----

If you use GNU C, consider the const attribute (on a function
definition the attribute has to come before the declarator):
----
__attribute__((const)) inline double SquareDouble(double r)
{
    return r * r;
}
----
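
Applied to the original loop, a minimal sketch (the extern
declarations are my guess at how r and pot are defined in the
OP's code, not something that was posted):
----
#include <math.h>

extern double r[3][500];
extern double pot;

static inline double SquareDouble(double x)
{
    return x * x;
}

void computePot(void)
{
    int i, j;
    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            double distx = SquareDouble( r[0][j] - r[0][i] );
            double disty = SquareDouble( r[1][j] - r[1][i] );
            double distz = SquareDouble( r[2][j] - r[2][i] );
            pot += 1.0 / sqrt( distx + disty + distz );
        }
    }
}
----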

--
Tommi Höynälänmaa
sähköposti / e-mail: to***************@iki.fi
kotisivu / homepage: http://www.iki.fi/tohoyn/
Dec 11 '06 #3
kenneth wrote:
I was trying to use multiple thread to optimize my following code, but
met some problems, anyone can help me?
I'll try with a single thread. Multiple threads would require
platform-specific solutions, so they are outside the scope of this NG.
k[3][500] are initialized.
What's k?
int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
Sure it can. First of all, forget that 'pow' thing.
To get the square, just multiply the value by itself.
This will definitely make it much faster.

Another thing is that you don't need to index into
the second vector inside the inner loop; you can just
as well do it outside, like in:

for(i) {
    register double r0i=r[0][i]; // now evaluated only 500 times
    register double r1i=r[1][i];
    register double r2i=r[2][i];
    for(j) {
        distx = (r[0][j]-r0i)*(r[0][j]-r0i);
        // ...
    }
}

That should give your code a boost.
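
Spelled out completely, a sketch of the whole function with
both changes applied (the extern declarations are my
assumption about your globals, not something you posted):

#include <math.h>

extern double r[3][500];
extern double pot;

void computePot(void)
{
    int i, j;
    for( i=0; i<500; i++ ) {
        register double r0i = r[0][i];  /* loaded once per i */
        register double r1i = r[1][i];
        register double r2i = r[2][i];
        for( j=0; j<i-1; j++ ) {
            double dx = r[0][j] - r0i;
            double dy = r[1][j] - r1i;
            double dz = r[2][j] - r2i;
            pot += 1.0 / sqrt( dx*dx + dy*dy + dz*dz );
        }
    }
}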

HTH,
- J.
Dec 11 '06 #4
"kenneth" <ja*************@gmail.comwrote in message
news:11**********************@79g2000cws.googlegroups.com...
>I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
Hi,

in principle I would let the compiler worry about that and not try to
micro-optimize. However, you can move the assignments out into register
variables, as Jacek already showed, and get rid of the pow(). Additionally,
it's cheap to check whether some "manual" loop-unrolling via meta-templates
might get you some speed; a sketch of the idea follows.
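
To sketch the meta-template idea (Unroll and PotBody are names I
made up for illustration, not a library facility):

#include <cmath>

// Compile-time unroller: run(body, base) expands to
// body(base); body(base+1); ... body(base+N-1);
template <int N>
struct Unroll {
    template <typename Body>
    static void run(Body& body, int base) {
        Unroll<N - 1>::run(body, base);
        body(base + N - 1);
    }
};

template <>
struct Unroll<0> {
    template <typename Body>
    static void run(Body&, int) {}   // terminates the recursion
};

// Example body: accumulates the 1/dist terms for a fixed i.
struct PotBody {
    const double *rx, *ry, *rz;      // rows r[0], r[1], r[2]
    double r0i, r1i, r2i;            // hoisted r[.][i]
    double sum;
    void operator()(int j) {
        double dx = rx[j] - r0i;
        double dy = ry[j] - r1i;
        double dz = rz[j] - r2i;
        sum += 1.0 / std::sqrt(dx*dx + dy*dy + dz*dz);
    }
};

The inner loop then steps j by four and calls Unroll<4>::run(body, j)
for the bulk, finishing the leftover iterations with a plain loop.
Whether this actually beats the compiler's own unrolling is something
only a measurement will tell.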

Cheers
Chris
Dec 11 '06 #5
Jacek Dziedzic wrote:
Another thing is that you don't need to index into
the second vector inside the inner loop; you can just
as well do it outside, like in:

for(i) {
    register double r0i=r[0][i]; // now evaluated only 500 times
    register double r1i=r[1][i];
    register double r2i=r[2][i];
    for(j) {
        distx = (r[0][j]-r0i)*(r[0][j]-r0i);
        // ...
    }
}

That should give your code a boost.
Doubtful. Any compiler worth its salt would have hoisted the evaluations out
of the loop by itself. Just as any compiler worth its salt will convert
'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into 'distx = (r[0][j]-r0i); distx *=
distx;'.

But the fact is that code like what the OP showed /rarely/ benefits from
micro-optimizations; that's not to say that making such optimizations isn't
necessary at times -- only that it is a much better idea to focus on the
algorithm and try to make it simpler.

-n
Dec 11 '06 #6
kenneth wrote:
I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
If you have access to an MP system and your compiler supports OpenMP,
you might be able to get some gains.
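
For instance, a minimal sketch (the extern declarations are my
assumption about the OP's globals; compile with the compiler's
OpenMP switch, e.g. -fopenmp for gcc or -openmp for icc):

#include <math.h>

extern double r[3][500];
extern double pot;

void computePot(void)
{
    int i, j;
    double sum = 0.0;

    /* dynamic schedule because the triangular loop makes the
       iterations unbalanced; the reduction avoids races on sum */
    #pragma omp parallel for private(j) reduction(+:sum) schedule(dynamic)
    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            double dx = r[0][j] - r[0][i];
            double dy = r[1][j] - r[1][i];
            double dz = r[2][j] - r[2][i];
            sum += 1.0 / sqrt( dx*dx + dy*dy + dz*dz );
        }
    }
    pot += sum;
}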

--
Ian Collins.
Dec 12 '06 #7
Nikolaos D. Bougalis wrote:
Jacek Dziedzic wrote:
> Another thing is that you don't need to index into
the second vector inside the inner loop; you can just
as well do it outside, like in:

for(i) {
    register double r0i=r[0][i]; // now evaluated only 500 times
    register double r1i=r[1][i];
    register double r2i=r[2][i];
    for(j) {
        distx = (r[0][j]-r0i)*(r[0][j]-r0i);
        // ...
    }
}

That should give your code a boost.

Doubtful. Any compiler worth its salt would have hoisted the
evaluations out of the loop by itself. Just as any compiler worth its
salt will convert 'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into 'distx =
(r[0][j]-r0i); distx *= distx;'.
I disagree. First of all, I doubt any compiler would turn
pow(double,2) into double*double.

Secondly, the conversion of distx=something*something into
distx=something; distx*=distx does not necessarily speed
anything up; that depends on the computer architecture,
so I doubt "any compiler worth its salt will convert" it.

Thirdly, even though a good compiler is supposed to
move out of the loop whatever does not depend on the loop
variable, my experience shows it is worth trying
to help it. I have recently spent a few days, if not weeks,
having to micro-optimize tight loops, working with a compiler
that is rather good at optimizing, the Intel compiler for
Itanium 2 (this platform _needs_ good optimization at
the compiler level). I think you would be surprised at how
dumb the compiler sometimes was, or rather, at how typical,
simple, out-of-the-book optimizations made the code
faster even at -O3.

Therefore, since the OP asked for ways to optimize, I gave
him two hints. I'm 99% sure they won't slow the code down
and pretty sure they will speed it up. I'm too lazy to try
it out, though; anyway, it depends on the platform.
But the fact is that code like what the OP showed /rarely/ benefits
from micro-optimizations; that's not to say that making such
optimizations isn't necessary at times -- only that it is a much better
idea to focus on the algorithm and try to make it simpler.
Of course. But I assumed the OP had run out of options on this one.
For example, this looked like a calculation of a Coulomb-type potential.
Typically, for potential calculations one assumes a cutoff
radius, beyond which the interactions are assumed to vanish.
Here, such an algorithmic optimization would turn this O(N^2) into
O(N). With this potential, however, it is impossible,
because 1/r decays too slowly. That led me to the conclusion that
the OP had already run out of options for algorithmic optimization
and is left with micro-optimizations.
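
(For a short-range potential, where the cutoff does apply, the
per-pair test is just a comparison of squared distances, so it does
not even cost a sqrt. A sketch, with rcut standing for whatever
radius the model prescribes:

#include <math.h>

/* contribution of one pair; zero beyond the cutoff radius rcut */
static double pairTerm(double dx, double dy, double dz, double rcut)
{
    double d2 = dx*dx + dy*dy + dz*dz;
    if (d2 >= rcut*rcut)         /* compare squares: no sqrt needed */
        return 0.0;
    return 1.0 / sqrt(d2);
}

The real O(N) saving then comes from neighbour lists or a cell
decomposition, so that pairs beyond rcut are never even visited.)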

howgh :),
- J.
Dec 12 '06 #8
Jacek Dziedzic wrote:
Nikolaos D. Bougalis wrote:
>Jacek Dziedzic wrote:
>> Another thing is that you don't need to index into
the second vector inside the inner loop; you can just
as well do it outside, like in:

for(i) {
    register double r0i=r[0][i]; // now evaluated only 500 times
    register double r1i=r[1][i];
    register double r2i=r[2][i];
    for(j) {
        distx = (r[0][j]-r0i)*(r[0][j]-r0i);
        // ...
    }
}

That should give your code a boost.

Doubtful. Any compiler worth its salt would have hoisted the
evaluations out of the loop by itself. Just as any compiler worth its
salt will convert 'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into 'distx
= (r[0][j]-r0i); distx *= distx;'.

I disagree. First of all, I doubt any compiler would turn
pow(double,2) into double*double.
I said the compiler would hoist the r[0][i], r[1][i] and r[2][i] out of the
loop. I said nothing about converting the pow(d,2) into d*d, although I don't
see why a compiler could not do that, if it had intrinsic support for pow()
and understood the semantics of the pow() call.

Secondly, the conversion of distx=something*something into
distx=something; distx*=distx does not necessarily speed
anything up; that depends on the computer architecture,
so I doubt "any compiler worth its salt will convert" it.
You are misrepresenting what I said. In your code "something" was a complex
expression that required evaluation. Do you honestly believe that any decent
compiler would calculate "a-b" twice to do (a-b)*(a-b)? Common subexpression
elimination is one of the most basic optimizations a compiler can and does
perform.

Basic optimizations aside, you would really be surprised at what compilers
can do nowadays. I was quite stunned to see the latest CL.EXE, for example, do
some crazy tricks in a math-intensive function, shaving off two divisions by
using multiplications and additions instead and managing to keep both the U
and V pipes going at full speed, significantly improving throughput. I
couldn't even come close to matching it, and I'm not a newbie at the
optimization game.

Thirdly, even though a good compiler is supposed to
move out of the loop whatever does not depend on the loop
variable, my experience shows it is worth trying
to help it. I have recently spent a few days, if not weeks,
having to micro-optimize tight loops, working with a compiler
that is rather good at optimizing, the Intel compiler for
Itanium 2 (this platform _needs_ good optimization at
the compiler level). I think you would be surprised at how
dumb the compiler sometimes was, or rather, at how typical,
simple, out-of-the-book optimizations made the code
faster even at -O3.
I don't disagree that hand-optimization and a helping hand to the compiler
are necessary even today. What I am saying is that the manual hoist operation
you showed, and most such trivial optimizations, are unlikely to have any
significant effect. The compiler will take care of simple things like that if
you compile with optimizations enabled and you don't use volatile every other
line.

A very good example of an optimization that programmers can make but
compilers cannot is easily demonstrated with this little function that sums a
floating-point array (extern float *huge_array):

The naive approach is this:

float array_sum()
{
    float sum = 0;

    for(int i = 0; i < huge_array_size; i++)
        sum += huge_array[i];

    return sum;
}

The better approach takes advantage of the commutativity and associativity
of addition, a reordering the compiler cannot make on floats because of
restrictions placed by the ANSI/ISO C and C++ standards:

float array_sum()
{
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    /* assumes huge_array_size is a multiple of 4 */
    for(int i = 0; i < huge_array_size; i += 4)
    {
        sum0 += huge_array[i];
        sum1 += huge_array[i + 1];
        sum2 += huge_array[i + 2];
        sum3 += huge_array[i + 3];
    }

    return (sum0 + sum2) + (sum1 + sum3);
}
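
(The sketch assumes huge_array_size is a multiple of 4; a short scalar
loop would have to mop up the remainder. To be fair, a compiler that is
allowed to relax the floating-point rules, e.g. gcc with -ffast-math,
may perform this reassociation by itself.)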

-n
Dec 12 '06 #9
Nikolaos D. Bougalis wrote:
Jacek Dziedzic wrote:
>Nikolaos D. Bougalis wrote:
>>Jacek Dziedzic wrote:

Another thing is that you don't need to index into
the second vector inside the inner loop; you can just
as well do it outside, like in:

for(i) {
    register double r0i=r[0][i]; // now evaluated only 500 times
    register double r1i=r[1][i];
    register double r2i=r[2][i];
    for(j) {
        distx = (r[0][j]-r0i)*(r[0][j]-r0i);
        // ...
    }
}

That should give your code a boost.

Doubtful. Any compiler worth its salt would have hoisted the
evaluations out of the loop by itself. Just as any compiler worth
its salt will convert 'distx = (r[0][j]-r0i)*(r[0][j]-r0i);' into
'distx = (r[0][j]-r0i); distx *= distx;'.

I disagree. First of all, I doubt any compiler would turn
pow(double,2) into double*double.

I said the compiler would hoist the r[0][i], r[1][i] and r[2][i] out
of the loop. I said nothing about converting the pow(d,2) into d*d,
Yes, but you said "doubtful" to my statement "that should give
your code a boost". What I meant was: for this code to give no
boost, the compiler would have to turn pow(double,2) into
double*double. Therefore, by being doubtful about this code's
potential boost, you implied that action of the compiler.
although I don't see why a compiler could not do that, if it had
intrinsic support for pow() and understood the semantics of the pow()
call.
OK, perhaps a smart compiler could do that. But honestly
I wouldn't expect it.
> Secondly, the conversion of distx=something*something into
distx=something; distx*=distx does not necessarily speed
anything up; that depends on the computer architecture,
so I doubt "any compiler worth its salt will convert" it.

You are misrepresenting what I said. In your code "something" was a
complex expression that required evaluation. Do you honestly believe
that any decent compiler would calculate "a-b" twice to do (a-b)*(a-b)?
No, of course I don't believe that. This was a misunderstanding.
I meant that (a-b) would be stored in a register, squared there,
and sent to distx. I somehow thought you implied that a memory
location would be squared, rather than a register.
Common subexpression elimination is one of the most basic optimizations
a compiler can and does perform.
Yes, I don't claim a-b will be evaluated twice.
Basic optimizations aside, you would really be surprised at what
compilers can do nowadays. I was quite stunned to see the latest CL.EXE,
for example, do some crazy tricks in a math-intensive function, shaving
off two divisions by using multiplications and additions instead and
managing to keep both the U and V pipes going at full speed, significantly
improving throughput. I couldn't even come close to matching it, and I'm
not a newbie at the optimization game.
Yes, I also usually trust the compiler to optimize better, especially
since for the Itanium the compilers are supposed to be good at it. Maybe
I was lucky, then, that I managed to squeeze some extra ticks from the
code I was working on by simple tricks like changing

for(i) a[i]=something;
into
while(i++) *(a_ptr+i)=something;

or the other way round.
I don't disagree that hand-optimization and a helping hand to the
compiler are necessary even today. What I am saying is that the manual
hoist operation you showed, and most such trivial optimizations, are
unlikely to have any significant effect. The compiler will take care of
simple things like that if you compile with optimizations enabled and
you don't use volatile every other line.
OK, then that was just a suggestion for the OP to try and see for
himself. I think neither I can make the general statement "taking
things out of the loop will make things faster" nor you the general
statement "taking things out of the loop will not make things faster,
because the compiler has done it already", since this all depends on
the compiler.

We'll never know unless the OP tries.

I could just as well have said "make sure you have -O3", and then
somebody could say "come on, it's on by default" -- just a
suggestion to try out.
>
A very good example of an optimization that programmers can make but
compilers cannot is easily demonstrated with this little function that
sums a floating-point array (extern float *huge_array):

The naive approach is this:

float array_sum()
{
    float sum = 0;

    for(int i = 0; i < huge_array_size; i++)
        sum += huge_array[i];

    return sum;
}

The better approach takes advantage of the commutativity and associativity
of addition, a reordering the compiler cannot make on floats because of
restrictions placed by the ANSI/ISO C and C++ standards:

float array_sum()
{
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    /* assumes huge_array_size is a multiple of 4 */
    for(int i = 0; i < huge_array_size; i += 4)
    {
        sum0 += huge_array[i];
        sum1 += huge_array[i + 1];
        sum2 += huge_array[i + 2];
        sum3 += huge_array[i + 3];
    }

    return (sum0 + sum2) + (sum1 + sum3);
}
OK.

- J.
Dec 12 '06 #10

[ After re-reading this thread, I think I might have come off a bit too
aggressive. It was not my intent, and I hope there are no hard feelings. ]

A lot of times, simple tricks can help the compiler along and result in much
better code. While slightly off-topic, I'd be interested in knowing what kind
of gains you saw in Itanium builds after the micro-optimizations you made. You
can contact me in private if you don't mind sharing.

-n
Dec 12 '06 #11
kenneth wrote:
I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
I think you're too dum for the explanation.
Dec 17 '06 #12
Uenal S. Mutlu wrote:
kenneth wrote:
>I was trying to use multiple threads to optimize the following code, but
ran into some problems. Can anyone help me?

k[3][500] are initialized.

int computePot() {
    int i, j;

    for( i=0; i<500; i++ ) {
        for( j=0; j<i-1; j++ ) {
            distx = pow( (r[0][j] - r[0][i]), 2 );
            disty = pow( (r[1][j] - r[1][i]), 2 );
            distz = pow( (r[2][j] - r[2][i]), 2 );
            dist = sqrt( distx + disty + distz );
            pot += 1.0 / dist;
        }
    }
}

I doubt whether this code can really be optimized. Can it?
I think you're too dum for the explanation.
A bit rich from one who can't spell.

--
Ian Collins.
Dec 17 '06 #13

Uenal S. Mutlu wrote:
>
I think you're too dum for the explanation.
Hey! don't break wind???...
[NOTE: As long as you post this kind of stuff in this group, I shall
follow up with this advice!]

Dec 17 '06 #14
Ian Collins wrote:
A bit rich from one who can't spell.
I arege taht his cmmonet was ianpprorpaite, but dsylxeia has nthonig to
do wtih inetlilegnece.

Dec 17 '06 #15
frame wrote:
Uenal S. Mutlu wrote:
>>
I think you're too dum for the explanation.

Hey! don't break wind???...
[NOTE: As long as you post this kind of stuff in this group, I shall
follow up with this advice!]
Please don't feed the troll. I have already reported his destructive
actions to his ISP and his NSP.

--
WW aka Attila
:::
Constant change is here to stay
Dec 19 '06 #16

