Bytes IT Community

C++ inlining as a multithreading optimization technique?

The question is about the possible use of inlining to improve
performance in a heavily multithreaded environment (200+ threads).
If we have to work with applications in which threads aren't I/O-bound
or user-time-bound (e.g. Windows GUI applications) but are concurrently
involved in the execution of the same parallelized task (e.g. a
matrix-matrix multiplication), we must ensure that the advantage gained
from parallelization is not less than the overhead introduced by the
context switching needed to suspend, wake up and run threads.
My opinion is that we could use inlining as a mechanism to improve
context-switching performance.
An inlined function doesn't need to push its call onto the stack along
with its parameters. It is simply expanded into the calling code with a
temporary copy of any parameters it carries. Restoring a thread's
execution at a point where an inlined function was called simply means
restoring the program counter to that code line, instead of popping
from the stack the currently called function along with its parameters.
That seems to me a great saving of memory space and time.
Does anyone have comments on that, or has anyone already tried such a
solution, with any results?

Gianguglielmo

Jul 22 '05 #1
10 Replies


"gianguz" <gi*****************@noze.it> wrote in
news:11**********************@c13g2000cwb.googlegr oups.com:
The question is about the possible use of inlining to improve
performance in a heavy multithreading environment (200+ threads).
If we have to work with applications in which threads aren't I/O
bounded or user-time bounded (i.e. windows based applications) but
they are concurrently involved in the execution of the same
parallelized task (i.e. a matrix-matrix multiplication), we must ensure
that the advantage obtained by the parallelization is not lesser than
the overhead introduced by the context switching needed to suspend/wake
up/run threads.
My opinion is that we could use inline as a mechanism to improve contex
switching performance.
An inlined function doesn't need to save into the stack its calling
with its own parameter. It is simply expanded into the code with a
temporary copy of any parameter it carries. Restoring a thread
execution in a point when an inlined
function was called simply means restoring the program counter at that
code line instead of popping from the stack the current function called
with its own parameter. It seems to me a great save of memory space and
time.
Anyone has comments about that or has already experience such solution
with some result?


The only answer we can give is: profile, profile, profile. _You_ need
to measure both mechanisms to determine which will be faster. Inlining
may remove the function-call overhead, but it may also cause the size of
your functions to balloon to the point that what you saved in function
calls you're now paying in additional page swaps, cache misses, or
whatever.
Jul 22 '05 #2

You are right about profiling; it is the only way to get a formal
analysis. But I'm also trying to figure out a robustness aspect. Suppose
you have 1000 threads continuously running in the same application (for
a system like a multi-agent framework, in which each agent is
implemented through a thread object, this can be a normal figure) and
they are calling a predefined sequence of functions. A simple calling
sequence like f(g(h(k(x, y, z)))) will produce a great number of
temporaries and function return points to be saved on the stack.
Even if the performance could be better, equal or worse with inlining,
the amount of memory saved seems evident. Moreover, the chance of
reaching the limits of the run-time environment's data structures is
lower than in the non-inlined case.

Gianguglielmo

Jul 22 '05 #3

On 13 Dec 2004 02:06:24 -0800, "gianguz" <gi*****************@noze.it>
wrote:
> [original question quoted above; in particular:]
>
> Restoring a thread execution in a point when an inlined function was
> called simply means restoring the program counter at that code line
> instead of popping from the stack the current function called with its
> own parameter.

I'm not sure about this. If you context-switch into an inlined function
call, it is like context-switching into the calling context of the
inlined call: you are still switching into a function (in assembler
terms), it's just the calling one, not the inlined one.

> Anyone has comments about that or has already experience such solution
> with some result?


Well, inlining may or may not reduce the size of the stack required by
a thread, and it may or may not reduce the size of the code (the saving
from not setting up a function call is offset against possible multiple
copies of the same code).

However, I'm not sure inlining has any influence on multithreaded code
that it doesn't have on single-threaded code. Context switches between
threads (and between processes, on some architectures like QNX) just
involve restoring a load of registers, as far as I know; you have to do
this whether or not you are switching into an inlined function.

Tom
Jul 22 '05 #4

Let me show the following code:

#include <zthread/Thread.h>
#include <zthread/ThreadedExecutor.h>
#include <iostream>

#include <cstdlib>
#include <ctime>

using namespace std;
using namespace ZThread;

#define MAX_VALUE 32000
#define STRESSING_TEST_LIMIT 10000000

#ifdef _INLINE_
#define CURRENT_INLINING inline
#else
#define CURRENT_INLINING
#endif

class MyTask : public Runnable {

private:

    long* _x;
    long* _y;
    long* _result;

    MyTask() : _x(NULL), _y(NULL), _result(NULL) { }

    CURRENT_INLINING long k(const long a, const long b, const long c);
    CURRENT_INLINING long h(const long a, const long b, const long c);
    CURRENT_INLINING long g(const long a, const long b, const long c);
    CURRENT_INLINING long f(const long a, const long b, const long c);
    CURRENT_INLINING void stressTest(const long i);

    long _run(const long i) {
        stressTest(i);
        (*_result) = (*_x) + (*_y);
        return (*_result);
    }

public:

    MyTask(long* x, long* y, long* result) : _x(x), _y(y), _result(result) { }

    void run() {
        _run(STRESSING_TEST_LIMIT);
    }

    ~MyTask() { }
};

long MyTask::k(const long a, const long b, const long c) {
    return a - b + c;
}

long MyTask::h(const long a, const long b, const long c) {
    return k((a / b) / c, (c + a) - b, (a * c) / b);
}

long MyTask::g(const long a, const long b, const long c) {
    return h(a / b, b / a, (a + b) * c);
}

long MyTask::f(const long a, const long b, const long c) {
    return g(a * b, b * a, c);
}

void MyTask::stressTest(const long i) {
    // c must be nonzero here: g() passes (a + b) * c on to h(), which divides by it
    for (long j = 1; j < i; j++) f(i, j, 1);
}

int main(int argc, char* argv[]) {

    if (argc > 1) {

        int size = atoi(argv[1]);

        long* inputX = new long[size];
        long* inputY = new long[size];
        long* output = new long[size];

        MyTask** tasks = new MyTask*[size];

        time_t T;            // time() needs a time_t, not a long
        time(&T);
        srand(T);

        ThreadedExecutor executor;

        cout << "Creating tasks... " << endl;

        for (int i = 0; i < size; i++) {
            inputX[i] = rand() % MAX_VALUE;
            inputY[i] = rand() % MAX_VALUE;
            output[i] = 0;
            tasks[i] = new MyTask(&inputX[i], &inputY[i], &output[i]);
            cout << "Input " << i << " : " << inputX[i] << "," << inputY[i] << endl;
        }

        cout << "Executing tasks... " << endl;

        time_t start0, end0;
        clock_t start1, end1;

        time(&start0);
        start1 = clock();

        for (int i = 0; i < size; i++) executor.execute(tasks[i]);

        executor.wait();

        time(&end0);
        end1 = clock();

        cout << "Collecting results... " << endl;

        for (int i = 0; i < size; i++)
            cout << "Output " << i << " : " << output[i] << endl;

        cout << "Computation time (sec): " << (end0 - start0) << endl;
        cout << "Computation time (clock ticks): " << (end1 - start1) << endl;

        delete[] inputX;
        delete[] inputY;
        delete[] output;
        delete[] tasks;
    }

    return 0;
}
It uses a robust open-source multithreading library (ZThread) and
performs the test both with and without inlining the functions. The
results on a dual Athlon MP 2400+ with 2 GB of RAM running Gentoo Linux
were interesting: the more threads I tried to run concurrently (with
the executor pattern), the more evident the performance gap between the
two executables became. The inlined code achieved roughly 30% better
timings than the non-inlined one. I'm curious to see what happens on
other machine and OS configurations.

Gianguglielmo

Jul 22 '05 #5

GianGuz wrote:
> Let me show the following code:
>
> [stress-test code quoted above snipped]
You should use the result of your stress test somewhere (print it to the
console or whatever) to ensure the compiler can't optimize the whole
thing away to nothing (some compilers are really smart).
> [test results on the dual Athlon quoted above snipped]


Just some additional thoughts:

The more threads you use, the more memory will be used (stack) and the
slower things will get (cache pollution). What outperforms what (inline
vs. non-inline) will strongly depend on which compiler you use, which
optimization options, etc., and of course also on the OS, threading
library and hardware used. But one thing is for sure: running that
stress test with more threads than there are logical CPUs will almost
certainly result in slower performance than just using one thread per
CPU. So 2 threads should be optimal in your case.
If there is no need to do the work in parallel, then just create 2
threads (for the dual Athlon machine) and do the work there. If you
need to do the work in parallel, you might look for a fiber
implementation, or implement the context switching on your own.
And if you must use heavy multithreading for some reason, increase the
duration of a time slice to something like 100 ms (should be fine for
most server stuff) or even more. That saves you a lot.

And one more thing: if that stress test is anything like the actual
code you need to optimize, then try handwritten assembly code. Pay
attention to stack usage (or memory usage in general: use as little as
possible) and get something like VTune (or a similar tool for AMD CPUs)
to help you optimize the code. In some special cases handwritten code
can outperform the code generated by modern C++ compilers 1:2 or even more.
BTW: have a look at "Cilk"; maybe you'll find it interesting:
http://supertech.lcs.mit.edu/cilk/
Rather old, but I like the idea :-)

swamp'
Jul 22 '05 #6

Swampmonster wrote:
> You should use the result of your stress-test somewhere (print it to
> the console or whatever) to ensure the compiler can't optimize the
> whole thing to nothing (some compilers are really smart).

Ok, I will try it! ;)

> [remaining advice quoted above snipped]

A lot of interesting considerations, swamp! ;)
The problem with working with a lot of threads (200+) is not just a
matter of parallelization and optimization. Sometimes, as in my case,
you also want to rely on a non-deterministic mechanism that, in a
natural way, is able to make things truly concurrent. Threads are the
nicest way to do that, I think, because they map directly onto the
'machine semantics'.
"I need a world where I don't know who is coming first to do something!"

Jul 22 '05 #7

On 13 Dec 2004 10:57:42 -0800, "GianGuz" <gi*****************@noze.it>
wrote:
> [test results on the dual Athlon quoted above snipped]


Still, I'm not sure that it is the context switching that is slower
(because I can't see how inlining affects that directly); rather, it is
the additional cache misses from the increased code (and probably
stack) size that are causing the problem.

I'm sure an example can be written where turning on inlining has the
opposite effect as the number of threads increases. Try making your
inlined functions quite a bit longer, and make sure each function is
called from, say, 10 different places in the thread function. That
should ensure that the inlined version has larger code.

Tom
Jul 22 '05 #8

This is in part true. Obviously, larger code can have side effects
(cache misses) that could overwhelm the benefits of inlining. But it is
interesting to notice that the performance gain, when inlining was used
'correctly', grows with the number of concurrent threads running.
I think this should be taken into consideration.

Cheers,

Gianguglielmo

Jul 22 '05 #9

On 14 Dec 2004 06:23:10 -0800, "GianGuz" <gi*****************@noze.it>
wrote:
> This is in part true.

What is?

> Obviously a greater code can have side effects (cache missess) that
> could overwhelm the benefits of inlining. But is interesting to notice
> that the performance gain when the inlining was used 'correctly' grows
> up with the number of concurrent threads running.

Yes, I noticed, and it was quite interesting.

> I think this should be taken into consideration.
I think the differences can be explained in terms of cache misses
though. AFAIK a context switch involves (kernel switching left out):

- Halt execution of the current thread.
- Save the state of all the registers in the CPU (especially the
program counter!).
- Find the thread to run next (using the scheduling algorithm).
- Load the state of all the registers for the thread that is scheduled
next (which may be in the cache, or may involve going to main memory)
- Resume execution from the program counter of the next thread.

The only difference I can see for the inline version for the context
switch itself is that the thread state of the next thread is more
likely to be in the cache. As the number of threads increases, the
cache misses for the version with the larger stacks are only going to
get worse.

Your original comment:
"An inlined function doesn't need to save into the stack its calling
with its own parameter. It is simply expanded into the code with a
temporary copy of any parameter it carries. Restoring a thread
execution in a point when an inlined function was called simply means
restoring the program counter at that code line instead of popping
from the stack the current function called with its own parameter. It
seems to me a great save of memory space and time."

I think it is wrong, since I don't think any stack popping is involved
in a context switch; rather, registers are saved and restored.

Tom
Jul 22 '05 #10

GianGuz wrote:
> But is interesting to notice that the performance gain when the
> inlining was used 'correctly' grows up with the number of concurrent
> threads running.
> I think this should be taken into consideration.


I think it's natural. All 200+ threads actually run the same code, at
the same virtual and physical memory location, and it will surely fit
within what the L1 cache of an Athlon can hold (in this simple example:
a very tight loop, not much code, all threads doing the same thing). So
there are no instruction-cache misses just because more threads run the
same code; there are only more data-cache misses, because of the
increased space for the thread contexts used to "swap" those registers
and state information. The non-inlined version is likely to use more
memory for data than the inlined version, because of all those stack
frames and return addresses, along with saved register values and
parameters pushed onto the stack. So instruction-cache pollution does
not count here, and data-cache pollution will be worse in the
non-inlined version, growing even worse as more threads are created.
So I think it's just natural.

Anyway, dump the threads, they just hurt performance :~)

bye,
'monster
Jul 22 '05 #11

This discussion thread is closed.