object on stack/heap performance problems

Hi!

I was developing some number-crunching algorithms for my university,
and I put the processor into a class.
While testing, I found a quite *severe performance problem* when the
object was created on the stack.

I uploaded a test archive here: http://digitus.itk.ppke.hu/~oroba/stack_test.zip

Inside you'll find the number cruncher class (CNN in cnn.h and
cnn.cpp), as well as two test files: test_slow.cpp and test_fast.cpp.
They differ ONLY in where the processor object is created. In one, it
is created on the stack, in the other, it is created on the heap. Yet,
when I call the member function process(), the performance difference
is 5x!!!

Can someone who knows more about object layout and the like
tell me why this is happening?

--
Thanks in advance,
B.

Jul 1 '07 #1
On Sun, 01 Jul 2007 05:51:29 -0700, orobalage wrote:
> [...]
> They differ ONLY in where the processor object is created. In one, it
> is created on the stack, in the other, it is created on the heap. Yet,
> when I call the member function process(), the performance difference
> is 5x!!!

Weird. Compiled with g++ on my system yields a difference of almost 20x.

--
Obnoxious User
Jul 1 '07 #2
or*******@gmail.com wrote:
> [...]
> They differ ONLY in where the processor object is created. In one, it
> is created on the stack, in the other, it is created on the heap. Yet,
> when I call the member function process(), the performance difference
> is 5x!!!

Well, it seems that I cannot reproduce what you just described:

benben@watersidem $ g++ main_fast.cpp cnn.cpp -O2 -o fast
benben@watersidem $ g++ main_slow.cpp cnn.cpp -O2 -o slow
benben@watersidem $ ./fast
520000
benben@watersidem $ ./slow
520000

Theoretically there shouldn't be any difference in performance between
operations on an object on the stack and the same operations on an object
on the heap. At least on my machine I cannot reproduce such a difference.

In the highly unlikely event that the stack memory is swapped out before
the process() call, it would have to be swapped back in. But this
is unlikely judging by the straightforward manner of your program, and
swapping can happen to heap memory just as easily anyway...

Regards,
benben
Jul 1 '07 #3
On 2007-07-01 14:51, or*******@gmail.com wrote:
> [...]
> They differ ONLY in where the processor object is created. In one, it
> is created on the stack, in the other, it is created on the heap. Yet,
> when I call the member function process(), the performance difference
> is 5x!!!

Results when I compile/run your code with Visual C++ Codename Orcas
Express Beta1 (Visual C++ 2008)

Debug:
heap: 12868
stack: 13118
Release:
heap: 38666
stack: 4383

That makes the stack version about 8.8 times faster in the release build. I
have not used any profilers or such, but there is some stuff in your
code that I find highly dubious, especially the allocation for the
RowMatrix. From what I can understand of the code, you do some "magic" to
make sure the data is aligned properly, but does it work? Are you sure
your computer (or the one it will run on) really works best with 32-byte
boundaries? This also makes your code totally unportable; I had to change
    data = (float*) ((((long)(real_data))+31L) & (-32L));
to
    data = (float*) ((((long long)(real_data))+31L) & (-32L));
before my compiler would let it through, and I'm still not sure what you
are trying to achieve with it.
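
For what it's worth, the round-up-to-32 trick in that line can be written without guessing which integer type is as wide as a pointer. This is only a minimal sketch, assuming a C++11 compiler; `align_up` and `AlignedBuffer` are hypothetical names that are not part of the original cnn.cpp, and the 32-byte figure is simply carried over from the code above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not from the original cnn.cpp): rounds a raw pointer
// up to the next multiple of `alignment`, which must be a power of two.
inline float* align_up(void* raw, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // uintptr_t is an unsigned integer type wide enough to hold a pointer,
    // unlike long, which is only 32 bits on 64-bit Windows.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return reinterpret_cast<float*>((addr + mask) & ~mask);
}

// Hypothetical stand-in for the buffer handling in RowMatrix.
struct AlignedBuffer {
    void*  real_data;  // what malloc returned; this is what must be freed
    float* data;       // 32-byte aligned view into real_data

    // Over-allocate by 31 bytes so an aligned start always fits.
    explicit AlignedBuffer(std::size_t size)
        : real_data(std::malloc(size * sizeof(float) + 31)),
          data(real_data ? align_up(real_data, 32) : nullptr) {}
    ~AlignedBuffer() { std::free(real_data); }

    // Copying would free the same block twice, so forbid it.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;
};
```

Constructing `AlignedBuffer buf(size);` and working through `buf.data` mirrors the original pair of pointers: `real_data` is what gets freed, `data` is what the loops would touch.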

Another thing that strikes me is that you use malloc. While I'm no
expert, I think this will cause your program to use two heaps, one for
new'ed memory and one for malloc'ed, which might slow things down.

I'm not sure what your number-crunching algorithm is supposed to do, so
I can't give you any better advice than to try to make the RowMatrix
simpler and try again.

--
Erik Wikström
Jul 1 '07 #4
Obnoxious User wrote:
> On Sun, 01 Jul 2007 05:51:29 -0700, orobalage wrote:
>> [...]
>> when I call the member function process(), the performance difference
>> is 5x!!!
>
> Weird. Compiled with g++ on my system yields a difference of almost 20x.

I am using g++, too; but I cannot confirm your observation. I do:

stack_test> ls
Makefile  cnn.h  main_fast.cpp  main_slow.cpp  test_fast.exe
cnn.cpp   cnn.o  main_fast.o    main_slow.o    test_slow.exe
stack_test> make
g++ -Wall -O3 cnn.cpp -c
g++ -Wall -O3 main_slow.cpp -c
g++ -s -o test_slow.exe cnn.o main_slow.o
g++ -Wall -O3 main_fast.cpp -c
g++ -s -o test_fast.exe cnn.o main_fast.o
stack_test> time test_slow.exe
640000

real 0m0.713s
user 0m0.648s
sys 0m0.012s
stack_test> time test_fast.exe
640000

real 0m0.705s
user 0m0.644s
sys 0m0.008s
stack_test>

Best

Kai-Uwe Bux
Jul 1 '07 #5
On Sun, 01 Jul 2007 15:46:45 +0200, Kai-Uwe Bux wrote:
> Obnoxious User wrote:
>> Weird. Compiled with g++ on my system yields a difference of almost 20x.
>
> I am using g++, too; but I cannot confirm your observation. I do:
> [...]
> stack_test> time test_slow.exe
> 640000
>
> real 0m0.713s
> user 0m0.648s
> sys 0m0.012s
> stack_test> time test_fast.exe
> 640000
>
> real 0m0.705s
> user 0m0.644s
> sys 0m0.008s

~/stack_test$ make
g++ -Wall -O3 cnn.cpp -c
g++ -Wall -O3 main_slow.cpp -c
g++ -s -o test_slow.exe cnn.o main_slow.o
g++ -Wall -O3 main_fast.cpp -c
g++ -s -o test_fast.exe cnn.o main_fast.o
~/stack_test$ time ./test_slow.exe
8200000

real 0m8.209s
user 0m8.205s
sys 0m0.000s
~/stack_test$ time ./test_fast.exe
310000

real 0m0.315s
user 0m0.316s
sys 0m0.000s
~/stack_test$

Although the test results for test_slow.exe vary some,
between 4160000 - 8640000, being most at the upper part,
while test_fast.exe produces stable test results.

--
Obnoxious User
Jul 1 '07 #6
Thanks for your comments Erik.

I wrote that code almost 2 years ago. Actually, you don't need to be
sure about what the algorithm does :). The point is simply that the
exact same thing performs very differently depending on whether the object
is on the stack or on the heap. What makes it more complicated is that for
some people the heap is better, for some the stack, and for others it
makes no difference at all.

I want to have an answer to why this happens.

--
Greets,
B.

Jul 1 '07 #7
And by the way: if I change the code to use new instead of malloc, and
I remove all the "magic" from the RowMatrix code, the stack version is
still slower on my computer. You can do that too, just replace
(cnn.cpp around line 40)

real_data = (float*) malloc( size * sizeof(float) + 31L );
data = (float*) ((((long)(real_data))+31L) & (-32L));

with

real_data = new float[size];
data = real_data;

and replace (cnn.cpp around line 20)

free( real_data );

with

delete[] real_data;

In the meantime, I did some profiling with gprof, and it shows that
for me, the stack version spends a LOT more time in
CNN::NonLinearity().

Still I'm puzzled, perhaps this has something to do with how the
member variables come one after another in the class definition?

--
Greets,
B.

Jul 1 '07 #8
* or*******@gmail.com:
> I wrote that code almost 2 years ago. Actually, you don't need to be
> sure about what the algorithm does :)

I tend to disagree. It wouldn't surprise me if what you're seeing is the
result of uninitialised data. Most importantly, CNN::lower_limit,
CNN::upper_limit, CNN::Z and a few others never seem to be initialised,
but I could be wrong. Have you tried checking the outcome (whatever
that may be) of Process() in both cases?

Most of your calculations are done on the .data members of the various
rows. Since these are always malloc-ed (in a nasty premature-optimisation
way), the location of the instance of the CNN class has little influence.

Uninitialised data (along with subtle out-of-bounds errors) would also
explain why some people don't seem to be having this "problem", while
others do.
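
To illustrate the point about initialisation, here is a minimal sketch of a class whose constructor gives every member a known value, so that a stack instance and a heap instance start out identical. The member names lower_limit, upper_limit and Z follow the ones mentioned above; the class name `TinyCnn`, the member types, the default limits and the body of NonLinearity() are assumptions made up for this example and are not taken from cnn.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of the CNN class. Only the member names come from
// the thread; types, sizes and default values are assumptions.
class TinyCnn {
public:
    explicit TinyCnn(std::size_t cells, float lo = -1.0f, float hi = 1.0f)
        : lower_limit(lo),
          upper_limit(hi),
          Z(0.0f),
          state(cells, 0.0f) {   // every cell starts from a known value
        assert(lo <= hi);
    }

    // Adds the bias Z to each cell, then clamps it into
    // [lower_limit, upper_limit].
    void NonLinearity() {
        for (float& x : state) {
            x += Z;
            if (x < lower_limit) x = lower_limit;
            if (x > upper_limit) x = upper_limit;
        }
    }

    float lower_limit;
    float upper_limit;
    float Z;
    std::vector<float> state;
};
```

With this in place, `TinyCnn a(8);` and `TinyCnn* b = new TinyCnn(8);` hold the same values before and after NonLinearity(), which is the kind of check on the outcome suggested above.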

(fwiw:

atlas(1):~/stacktest/stack_test% time ./test_fast.exe
56
./test_fast.exe  0.57s user 0.01s system 97% cpu 0.600 total
atlas(1):~/stacktest/stack_test% time ./test_slow.exe
358
./test_slow.exe  3.59s user 0.02s system 98% cpu 3.687 total
atlas(1):~/stacktest/stack_test% gcc --version
gcc (GCC) 4.1.2 20070110 prerelease (NetBSD nb1 20070603)
[..]
atlas(1):~/stacktest/stack_test% uname -a
NetBSD atlas 4.99.20 NetBSD 4.99.20 (ATLAS) #0: Sat Jun 9 02:53:14 CEST 2007 martijnb@atlas:/usr/obj/sys/arch/amd64/compile/ATLAS amd64
)

--
Martijn van Buul - pi**@dohd.org
Jul 1 '07 #9
* Obnoxious User:
> Although the test results for test_slow.exe vary some,
> between 4160000 - 8640000, being most at the upper part,
> while test_fast.exe produces stable test results.

A tell-tale sign of uninitialised data. The "results" of these test programs
are the CPU time used for the calculation, measured with clock(), which
*should* be the actual CPU time used by this very process. While other
activity on the system will skew the results a little, a variation in runtime
like this in an algorithm that takes no user input, does no I/O, doesn't do
wild memory allocation and always uses the same arguments clearly indicates
that it's *not* doing the same job on every invocation.
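
The repeatability argument can be sketched as a small harness that runs one deterministic job several times and reports the spread in clock() time. This is only an illustration: `workload`, `Timing` and `time_runs` are hypothetical names, and the loop is a stand-in, not the actual process() code from the archive:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical deterministic workload standing in for process():
// fixed input, no I/O, no allocation inside the timed region.
inline double workload(std::vector<float>& v) {
    double sum = 0.0;
    for (int pass = 0; pass < 200; ++pass) {
        for (std::size_t i = 0; i < v.size(); ++i) {
            v[i] = v[i] * 0.5f + 1.0f;
            sum += v[i];
        }
    }
    return sum;
}

struct Timing {
    double min_s;  // fastest run, in seconds of CPU time
    double max_s;  // slowest run, in seconds of CPU time
};

// Runs the workload `runs` times from the same starting state and records
// the fastest and slowest CPU time as measured by clock().
inline Timing time_runs(int runs) {
    assert(runs > 0);
    Timing t = {1e300, 0.0};
    for (int r = 0; r < runs; ++r) {
        std::vector<float> v(100000, 1.0f);   // identical input every run
        std::clock_t start = std::clock();
        volatile double sink = workload(v);   // volatile keeps the call alive
        (void)sink;
        std::clock_t stop = std::clock();
        double s = static_cast<double>(stop - start) / CLOCKS_PER_SEC;
        t.min_s = std::min(t.min_s, s);
        t.max_s = std::max(t.max_s, s);
    }
    return t;
}
```

If `time_runs(10)` came back with max_s at twice min_s for a job like this, that would point at the program doing different work from run to run rather than at system noise.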

--
Martijn van Buul - pi**@dohd.org
Jul 1 '07 #10
