Bytes | Software Development & Data Engineering Community
C++ inlining as a multithreading optimization technique?

The question is about the possible use of inlining to improve
performance in a heavily multithreaded environment (200+ threads).
If we have to work with applications in which threads aren't I/O-bound
or user-time-bound (e.g. Windows GUI applications) but are concurrently
involved in executing the same parallelized task (e.g. a matrix-matrix
multiplication), we must ensure that the advantage gained from
parallelization is not outweighed by the overhead of the context
switching needed to suspend, wake up, and run threads.

My opinion is that we could use inlining as a mechanism to improve
context-switching performance. An inlined function doesn't need to push
its call and its parameters onto the stack: it is simply expanded into
the calling code, with a temporary copy of any parameters it carries.
Resuming a thread at a point where an inlined function was called would
simply mean restoring the program counter to that code location, instead
of popping the current function and its parameters from the stack. That
seems to me a great saving of both memory and time.

Does anyone have comments on this, or experience with such a solution
and some results?

Gianguglielmo

Jul 22 '05 #1
"gianguz" <gi*****************@noze.it> wrote in
news:11**********************@c13g2000cwb.googlegr oups.com:
[original question snipped]


The only answer we can give is profile, profile, profile. _You_ need
to measure both mechanisms to determine which will be faster. Inlining
may remove the function-call overhead, but it may also cause the size of
your functions to balloon to the point that what you saved in function
calls you're now paying in additional page swaps, or cache misses, or
whatever.
Jul 22 '05 #2
You are right about profiling. It is the only way to get a formal
analysis, but I'm also trying to figure out a robustness aspect. Suppose
you have 1,000 threads continuously running in the same application
(e.g. in a multi-agent framework in which each agent is implemented
through a thread object, this can be a normal value) and they are
calling a predefined sequence of functions. A simple calling sequence
like f(g(h(k(x, y, z)))) will produce a great number of temporaries and
function return addresses to be saved on the stack. Even if the
performance gain with inlining could be better, equal, or worse, the
amount of memory saved seems evident. Moreover, the possibility of
hitting the limits of the run-time environment's data structures is
lower than in the non-inlined case.

Gianguglielmo

Jul 22 '05 #3
On 13 Dec 2004 02:06:24 -0800, "gianguz" <gi*****************@noze.it>
wrote:
[snip]
An inlined function doesn't need to push its call and its parameters
onto the stack: it is simply expanded into the calling code, with a
temporary copy of any parameters it carries. Resuming a thread at a
point where an inlined function was called would simply mean restoring
the program counter to that code location, instead of popping the
current function and its parameters from the stack. That seems to me a
great saving of both memory and time.
I'm not sure about this. If you context-switch into an inlined function
call, it is like context-switching into the calling context of the
inline call: you are still switching into a function (in assembler
terms), it's just the calling one, not the inlined one.

Does anyone have comments on this, or experience with such a solution
and some results?

Well, inlining may or may not reduce the size of the stack required by
a thread, and it may or may not reduce the size of the code (the saving
from not setting up a function call is offset against possible multiple
copies of the same code).

However, I'm not sure inlining has any influence on multithreaded code
that it doesn't have on single-threaded code. Context switches between
threads (and processes on some architectures, like QNX) just involve
restoring a load of registers as far as I know; you have to do this
whether you are switching into an inline function or not.

Tom
Jul 22 '05 #4
Let me show the following code:

#include <zthread/Thread.h>
#include <zthread/ThreadedExecutor.h>
#include <string>
#include <iostream>

extern "C" {
#include <stdlib.h>
#include <time.h>
}

using namespace std;
using namespace ZThread;

#define MAX_VALUE 32000
#define STRESSING_TEST_LIMIT 10000000

#ifdef _INLINE_
#define CURRENT_INLINING inline
#else
#define CURRENT_INLINING
#endif

class MyTask : public Runnable {

private:

    long* _x;
    long* _y;
    long* _result;

    MyTask() : _x(NULL), _y(NULL), _result(NULL) { }

    CURRENT_INLINING long k(const long a, const long b, const long c);
    CURRENT_INLINING long h(const long a, const long b, const long c);
    CURRENT_INLINING long g(const long a, const long b, const long c);
    CURRENT_INLINING long f(const long a, const long b, const long c);
    CURRENT_INLINING void stressTest(const long i);

    long _run(const long i) {
        stressTest(i);
        (*_result) = (*_x) + (*_y);
        return (*_result);
    }

public:

    MyTask(long* x, long* y, long* result) : _x(x), _y(y), _result(result) { }

    void run() {
        _run(STRESSING_TEST_LIMIT);
    }

    ~MyTask() { }
};

long MyTask::k(const long a, const long b, const long c) {
    return a - b + c;
}

long MyTask::h(const long a, const long b, const long c) {
    return k((a / b) / c, (c + a) - b, (a * c) / b);
}

long MyTask::g(const long a, const long b, const long c) {
    return h(a / b, b / a, (a + b) * c);
}

long MyTask::f(const long a, const long b, const long c) {
    return g(a * b, b * a, c);
}

void MyTask::stressTest(const long i) {
    // c must be nonzero: h() divides by its third argument
    for (long j = 1; j < i; j++) f(i, j, 1);
}

int main(int argc, char* argv[]) {

    if (argc > 1) {

        int size = atoi(argv[1]);

        long* inputX = new long[size];
        long* inputY = new long[size];
        long* output = new long[size];

        MyTask** tasks = new MyTask*[size];

        time_t T;
        time(&T);
        srand((unsigned)T);

        ThreadedExecutor executor;

        cout << "Creating tasks... " << endl;

        for (int i = 0; i < size; i++) {
            inputX[i] = rand() % MAX_VALUE;
            inputY[i] = rand() % MAX_VALUE;
            output[i] = 0;
            tasks[i] = new MyTask(&inputX[i], &inputY[i], &output[i]);
            cout << "Input " << i << " : " << inputX[i] << "," << inputY[i] << endl;
        }

        cout << "Executing tasks... " << endl;

        time_t start0, end0;
        clock_t start1, end1;

        time(&start0);
        start1 = clock();

        for (int i = 0; i < size; i++) {
            executor.execute(tasks[i]);
        }

        executor.wait();

        time(&end0);
        end1 = clock();

        cout << "Collecting results... " << endl;

        for (int i = 0; i < size; i++) {
            cout << "Output " << i << " : " << output[i] << endl;
        }

        cout << "Computation time (sec) " << (end0 - start0) << endl;
        cout << "Computation time (msec) "
             << (end1 - start1) * 1000.0 / CLOCKS_PER_SEC << endl;

        delete[] inputX;
        delete[] inputY;
        delete[] output;
        delete[] tasks;
    }

    return 0;
}
It uses a robust open-source multithreading library (ZThread) and
performs the test with both inlined and non-inlined functions. The
results on a dual Athlon MP 2400+ with 2 GB RAM and a Gentoo Linux
distribution were interesting: the more threads I tried to run
concurrently (with the executor pattern), the more evident the
performance gap between the two executables became. The inlined code ran
about 30% faster than the non-inlined one. I'm curious to see what
happens on other machine and OS configurations.

Gianguglielmo

Jul 22 '05 #5
GianGuz wrote:
Let me show the following code:

(...)

long MyTask::f(const long a, const long b, const long c) {
    return g(a * b, b * a, c);
}

void MyTask::stressTest(const long i) {
    for (long j = 1; j < i; j++) f(i, j, 1);
}

(...)

You should use the result of your stress test somewhere (print it to the
console or whatever) to ensure the compiler can't optimize the whole
thing away to nothing (some compilers are really smart).
[description of the test and results snipped]


Just some additional thoughts:

The more threads you use, the more memory will be used (stack space) and
the slower things will get (cache pollution). What outperforms what
(inline vs. non-inline) will strongly depend on the compiler you use,
the optimization options, etc., and of course also on the OS, threading
library, and hardware. But one thing is for sure: running that stress
test with more threads than there are logical CPUs will almost certainly
result in slower performance than just using one thread per CPU. So 2
threads should be optimal in your case.

If there is no need to do the work in parallel, then just create 2
threads (for the dual Athlon machine) and do the work there. If you need
to do the work in parallel, you might look for some fiber
implementation, or implement the context switching on your own. And if
you must use heavy multithreading for some reason, increase the duration
of a time slice to something like 100 ms (should be fine for most server
stuff) or even more. That saves you a lot.

And one thing: if that stress test is anything similar to the actual
code you need to optimize, then try handwritten assembly. Pay attention
to stack usage (or general memory usage: use as little as possible) and
get something like VTune (or a similar tool for AMD CPUs) to help you
optimize the code. In some special cases handwritten code can outperform
the code generated by modern C++ compilers by a factor of two or even
more.
BTW: have a look at "cilk" - maybe you find it interesting:
http://supertech.lcs.mit.edu/cilk/
Rather old, but I like the idea :-)

swamp'
Jul 22 '05 #6
Swampmonster wrote:
You should use the result of your stress test somewhere (print it to the
console or whatever) to ensure the compiler can't optimize the whole
thing away to nothing (some compilers are really smart).

Ok, I will try it! ;)

[rest of Swampmonster's post snipped]
A lot of interesting considerations, swamp! ;)
The problem with working with a lot of threads (200+) is not just a
matter of parallelization and optimization. Sometimes, as in my case,
you also want to rely on some non-deterministic mechanism that naturally
makes things really concurrent. Threads are the nicest way to do that, I
think, because they map directly onto 'machine semantics'.
"I need a world where I don't know who is coming first to do something!"
Jul 22 '05 #7
On 13 Dec 2004 10:57:42 -0800, "GianGuz" <gi*****************@noze.it>
wrote:
[GianGuz's test description and 30% result snipped]


Still, I'm not sure that it is the context switching that is slower
(because I can't see how inlining affects that directly); rather, it is
the additional cache misses from the increased code (and probably stack)
size that are causing the difference.

I'm sure an example can be written where turning on inlining has the
opposite effect as the number of threads increases. Try making your
inlined functions quite a bit longer, and make sure that each function
is called from, say, 10 different places in the thread function. That
should ensure that the inlined version has larger code.

Tom
Jul 22 '05 #8
This is in part true. Obviously larger code can have side effects (cache
misses) that could overwhelm the benefits of inlining. But it is
interesting to notice that the performance gain, when inlining was used
'correctly', grew with the number of concurrent threads running.
I think this should be taken into consideration.

Cheers,

Gianguglielmo

Jul 22 '05 #9
On 14 Dec 2004 06:23:10 -0800, "GianGuz" <gi*****************@noze.it>
wrote:
This is in part true.

What is?

Obviously larger code can have side effects (cache misses) that could
overwhelm the benefits of inlining. But it is interesting to notice that
the performance gain, when inlining was used 'correctly', grew with the
number of concurrent threads running.

Yes, I noticed, and it was quite interesting.

I think this should be taken into consideration.


I think the differences can be explained in terms of cache misses,
though. AFAIK a context switch involves (kernel-mode switching left
out):

- Halt execution of the current thread.
- Save the state of all the registers in the CPU (especially the
program counter!).
- Find the thread to run next (using the scheduling algorithm).
- Load the state of all the registers for the thread that is scheduled
next (which may be in the cache, or may involve going to main memory).
- Resume execution from the program counter of the next thread.

The only difference I can see for the inlined version in the context
switch itself is that the thread state of the next thread is more likely
to be in the cache. As the number of threads increases, the cache misses
for the version with the larger stacks are only going to get worse.

Your original comment:
"An inlined function doesn't need to push its call and its parameters
onto the stack: it is simply expanded into the calling code, with a
temporary copy of any parameters it carries. Resuming a thread at a
point where an inlined function was called would simply mean restoring
the program counter to that code location, instead of popping the
current function and its parameters from the stack. That seems to me a
great saving of both memory and time."

I think this is wrong, since I don't think any stack popping is involved
in a context switch; rather, registers are saved and restored.

Tom
Jul 22 '05 #10
GianGuz wrote:
But it is interesting to notice that the performance gain, when inlining
was used 'correctly', grew with the number of concurrent threads
running. I think this should be taken into consideration.

I think it's natural. All the 200+ threads actually run the same code,
at the same virtual and physical memory locations, and it will surely
fit in what the L1 cache of an Athlon can hold (in this simple example:
very tight loop, not much code, all threads doing the same thing). So
there are no instruction-cache misses just because more threads run the
same code; there are just more data-cache misses because of the
increased space for the thread contexts used to "swap" those registers
and state information. The non-inlined version is likely to use more
memory for data than the inlined version, because of all those stack
frames and return addresses, along with saved register values and
parameters pushed onto the stack and all that. So instruction-cache
pollution does not count, and data-cache pollution will be worse in the
non-inlined version, growing even worse as more threads are created. So
I think it's just natural.

Anyway, dump the threads, they just hurt performance :~)

Anyway, dump the threads, they just hurt performance :~)

bye,
'monster
Jul 22 '05 #11
