object on stack/heap performance problems

Hi!

I was developing some number-crunching algorithms for my university,
and I put the processor into a class.
While testing, I found a quite *severe performance problem* when the
object was created on the stack.

I uploaded a test archive here: http://digitus.itk.ppke.hu/~oroba/stack_test.zip

Inside you'll find the number cruncher class (CNN in cnn.h and
cnn.cpp), as well as two test files: main_slow.cpp and main_fast.cpp.
They differ ONLY in where the processor object is created. In one, it
is created on the stack, in the other, it is created on the heap. Yet,
when I call the member function process(), the performance difference
is 5x!!!

Can someone with deeper knowledge of object layout and the like
tell me why this is happening?

--
Thanks in advance,
B.

Jul 1 '07 #1
On Sun, 01 Jul 2007 05:51:29 -0700, orobalage wrote:
> [...] when I call the member function process(), the performance
> difference is 5x!!!

Weird. Compiled with g++ on my system yields a difference of almost 20x.

--
Obnoxious User
Jul 1 '07 #2
or*******@gmail.com wrote:
> [...] when I call the member function process(), the performance
> difference is 5x!!!

Well, it seems that I cannot reproduce what you just described:

benben@watersidem $ g++ main_fast.cpp cnn.cpp -O2 -o fast
benben@watersidem $ g++ main_slow.cpp cnn.cpp -O2 -o slow
benben@watersidem $ ./fast
520000
benben@watersidem $ ./slow
520000

In theory there should be no performance difference between an
operation on an object on the stack and the same operation on an object
on the heap. At least on my machine, I cannot reproduce such a difference.

In a highly unlikely case the stack memory could be swapped out before
the process() call and have to be swapped back in. But that is unlikely,
judging by the straightforward manner of your program, and swapping can
happen to heap memory just as easily anyway...

Regards,
benben
Jul 1 '07 #3
On 2007-07-01 14:51, or*******@gmail.com wrote:
> [...] when I call the member function process(), the performance
> difference is 5x!!!

Results when I compile/run your code with Visual C++ Codename Orcas
Express Beta1 (Visual C++ 2008)

Debug:
heap: 12868
stack: 13118
Release:
heap: 38666
stack: 4383

That makes the stack version about 8.8 times faster in the release
build. I have not used a profiler, but there are some things in your
code that I find highly dubious, especially the allocation for the
RowMatrix. From what I can understand of the code, you do some "magic" to
make sure the data is aligned properly, but does it work? Are you sure
your computer (or the one it will run on) really works best with 32-byte
boundaries? The cast also makes your code unportable; I had to change
data = (float*) ((((long)(real_data))+31L) & (-32L));
to
data = (float*) ((((long long)(real_data))+31L) & (-32L));
before my compiler would let it through, and I'm still not sure what you
are trying to achieve with it.
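
For what it's worth, the same rounding can be written without guessing
which integer type is as wide as a pointer. The following is only a
sketch, not the code from cnn.cpp: align_up and AlignedFloats are
hypothetical names, the 32-byte default is simply the figure used in the
thread, and it assumes the implementation provides std::uintptr_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: round p up to the next multiple of alignment.
// alignment must be a power of two.
inline float* align_up(void* p, std::size_t alignment) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return reinterpret_cast<float*>((addr + mask) & ~mask);
}

// Hypothetical owner of an over-allocated buffer with an aligned view into it.
struct AlignedFloats {
    void* real_data;  // pointer returned by malloc; this is the one to free
    float* data;      // aligned pointer inside the same allocation

    explicit AlignedFloats(std::size_t size, std::size_t alignment = 32)
        // Over-allocate by alignment - 1 bytes so that size floats still fit
        // after rounding the start address up.
        : real_data(std::malloc(size * sizeof(float) + alignment - 1)),
          data(real_data ? align_up(real_data, alignment) : nullptr) {}

    ~AlignedFloats() { std::free(real_data); }

    // Copying would free the same block twice, so it is disabled.
    AlignedFloats(const AlignedFloats&) = delete;
    AlignedFloats& operator=(const AlignedFloats&) = delete;
};
```

Constructing AlignedFloats buf(1000) and then using buf.data[0] through
buf.data[999] stays inside the allocation. Whether 32-byte alignment
actually helps is a separate question that only measurement on the
target machine can answer.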

Another thing that strikes me is that you use malloc, and while I'm no
expert I think this will cause your program to use two heaps, one for
new'ed memory and one for malloc'ed, this might slow things down.

I'm not sure what your number-crunching algorithm is supposed to do, so
I can't give you any better advice than to try to make the RowMatrix
simpler and try again.

--
Erik Wikström
Jul 1 '07 #4
Obnoxious User wrote:
> Weird. Compiled with g++ on my system yields a difference of almost 20x.

I am using g++, too; but I cannot confirm your observation. I do:

stack_test> ls
Makefile cnn.h main_fast.cpp main_slow.cpp test_fast.exe
cnn.cpp cnn.o main_fast.o main_slow.o test_slow.exe
stack_test> make
g++ -Wall -O3 cnn.cpp -c
g++ -Wall -O3 main_slow.cpp -c
g++ -s -o test_slow.exe cnn.o main_slow.o
g++ -Wall -O3 main_fast.cpp -c
g++ -s -o test_fast.exe cnn.o main_fast.o
stack_test> time test_slow.exe
640000

real 0m0.713s
user 0m0.648s
sys 0m0.012s
stack_test> time test_fast.exe
640000

real 0m0.705s
user 0m0.644s
sys 0m0.008s
stack_test>

Best

Kai-Uwe Bux
Jul 1 '07 #5
On Sun, 01 Jul 2007 15:46:45 +0200, Kai-Uwe Bux wrote:
> I am using g++, too; but I cannot confirm your observation.
> [...]

~/stack_test$ make
g++ -Wall -O3 cnn.cpp -c
g++ -Wall -O3 main_slow.cpp -c
g++ -s -o test_slow.exe cnn.o main_slow.o
g++ -Wall -O3 main_fast.cpp -c
g++ -s -o test_fast.exe cnn.o main_fast.o
~/stack_test$ time ./test_slow.exe
8200000

real 0m8.209s
user 0m8.205s
sys 0m0.000s
~/stack_test$ time ./test_fast.exe
310000

real 0m0.315s
user 0m0.316s
sys 0m0.000s
~/stack_test$

The results for test_slow.exe vary quite a bit, though, between
4160000 and 8640000 and mostly toward the upper end, while
test_fast.exe produces stable results.

--
Obnoxious User
Jul 1 '07 #6
Thanks for your comments Erik.

I wrote that code almost 2 years ago. Actually, you don't need to be
sure about what the algorithm does :), it's just the fact, that the
exact same thing performs very differently when the object is on the
stack or on the heap. And what makes it more complicated, is that for
some people the heap is better, for some the stack, yet for others it
makes no difference at all.

I want to have an answer to why this happens.

--
Greets,
B.

Jul 1 '07 #7
And by the way: if I change the code to use new instead of malloc, and
I remove all the "magic" from the RowMatrix code, the stack version is
still slower on my computer. You can do that too, just replace
(cnn.cpp around line 40)

real_data = (float*) malloc( size * sizeof(float) + 31L );
data = (float*) ((((long)(real_data))+31L) & (-32L));

with

real_data = new float[size];
data = real_data;

and replace (cnn.cpp around line 20)

free( real_data );

with

delete[] real_data;

In the meantime, I did some profiling with gprof, and it shows that,
for me, the stack version spends a LOT more time in
CNN::NonLinearity().

Still I'm puzzled, perhaps this has something to do with how the
member variables come one after another in the class definition?

--
Greets,
B.

Jul 1 '07 #8
* or*******@gmail.com:
I wrote that code almost 2 years ago. Actually, you don't need to be
sure about what the algorithm does :)
I tend to disagree. It wouldn't surprise me if what you're seeing is the
result of uninitialised data. Most importantly, CNN::lower_limit,
CNN::upper_limit, CNN::Z and a few others never seem to be initialised,
but I could be wrong. Have you tried checking the outcome (whatever
that may be) of Process() in both cases?

Most of your calculations are done on the .data members of the various
rows. Since these are always malloc-ed (in a nasty premature-optimisation
way), the location of the instance of the CNN class has little influence.

Uninitialised data (Along with subtle out-of-bounds errors) would also
explain why some people don't seem to be having this "problem", while
others do.
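
To make that concrete, here is a minimal sketch of the kind of fix this
points to. The Limits struct and its values are invented for
illustration (the real CNN class is only in the linked archive); the
member names are borrowed from the post above. The idea is that every
member has a defined value in every instance, so a stack object and a
heap object take exactly the same branches.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a class with scalar parameters; not the real CNN.
struct Limits {
    // Default member initializers give every instance the same starting state,
    // wherever it lives. Without them, both `Limits s;` and `new Limits`
    // would leave these floats with indeterminate values.
    float lower_limit = -1.0f;
    float upper_limit = 1.0f;

    // Clamp x into [lower_limit, upper_limit]. Which branch runs depends on
    // the members, so indeterminate members mean unpredictable control flow.
    float clamp(float x) const {
        if (x < lower_limit) return lower_limit;
        if (x > upper_limit) return upper_limit;
        return x;
    }
};
```

With the initializers in place, a Limits declared as a local variable
and one created with new give the same answer for the same input, which
is the property the two test programs were assumed to have.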

(fwiw:

atlas(1):~/stacktest/stack_test% time ./test_fast.exe
56
./test_fast.exe 0.57s user 0.01s system 97% cpu 0.600 total
atlas(1):~/stacktest/stack_test% time ./test_slow.exe
358
./test_slow.exe 3.59s user 0.02s system 98% cpu 3.687 total
atlas(1):~/stacktest/stack_test% gcc --version
gcc (GCC) 4.1.2 20070110 prerelease (NetBSD nb1 20070603)
[..]
atlas(1):~/stacktest/stack_test% uname -a
NetBSD atlas 4.99.20 NetBSD 4.99.20 (ATLAS) #0: Sat Jun 9 02:53:14 CEST 2007 martijnb@atlas:/usr/obj/sys/arch/amd64/compile/ATLAS amd64
)

--
Martijn van Buul - pi**@dohd.org
Jul 1 '07 #9
* Obnoxious User:
Although the test results for test_slow.exe varies some,
between 4160000 - 8640000, being most at the upper part,
while test_fast.exe produces stable test results.
A tell-tale sign of uninitialised data. The "results" of these test programs
are the CPU time used for the calculation, measured with clock(), which
*should* be the actual CPU time used by this very process. While other
activity on the system will skew the results a little, a variation in runtime
like this in an algorithm that takes no user input, does no I/O, doesn't do
wild memory allocation and always uses the same arguments clearly indicates
that it's *not* doing the same job on every invocation.
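
The measurement pattern described here can be sketched as follows.
busy_sum is a made-up deterministic workload standing in for process(),
and cpu_seconds is a hypothetical wrapper; neither comes from the
archive. The sketch only shows how std::clock brackets the work and how
the computed value is handed back so two runs can be compared.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical workload standing in for process(): deterministic, no I/O.
inline double busy_sum(long iterations) {
    double sum = 0.0;
    for (long i = 1; i <= iterations; ++i) {
        sum += 1.0 / static_cast<double>(i);
    }
    return sum;
}

// Runs busy_sum, stores its value in result, and returns the processor
// time std::clock reports for the call, converted to seconds.
inline double cpu_seconds(long iterations, double& result) {
    const std::clock_t start = std::clock();
    result = busy_sum(iterations);
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Calling cpu_seconds twice with the same argument should give the same
result value and roughly the same time. If the times differ by a factor
of two while the input is identical, the program is probably not doing
identical work.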

--
Martijn van Buul - pi**@dohd.org
Jul 1 '07 #10
Martijn van Buul wrote:
> It wouldn't surprise me if what you're seeing is the
> result of uninitialised data. [...]
> Uninitialised data (Along with subtle out-of-bounds errors) would also
> explain why some people don't seem to be having this "problem", while
> others do.

That is an interesting idea. This is what valgrind has to say:

==15156== by 0x414C825: (below main) (in /lib/libc-2.5.so)
==15156==
==15156== Conditional jump or move depends on uninitialised value(s)
==15156== at 0x8049328:
(within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x8049AD6:
(within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x414C825: (below main) (in /lib/libc-2.5.so)
==15156==
==15156== Conditional jump or move depends on uninitialised value(s)
==15156== at 0x80491EC:
(within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x8049AD6:
(within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x414C825: (below main) (in /lib/libc-2.5.so)
==15156==
==15156== Conditional jump or move depends on uninitialised value(s)

And it goes on like this forever. Similar output for test_slow.exe

Best

Kai-Uwe Bux
Jul 1 '07 #11
Thanks Martijn!

Indeed, it was uninitialized data!
Problem is solved it seems, and another big experience in my bag,
thanks for it! :)

What made it somewhat obscure for me, is that if I rearranged the
members, it also became fast sometimes.

Anyway, we all make mistakes, and I made a huge one; I hope others
will learn from this too.

Good day to you, and thanks again everyone!

--
Greets,
B.

Jul 1 '07 #12
* or*******@gmail.com:
Thanks Martijn!

Indeed, it was uninitialized data!
Problem is solved it seems, and another big experience in my bag,
thanks for it! :)
Glad I could help.

--
Martijn van Buul - pi**@dohd.org
Jul 2 '07 #13
On Jul 1, 9:45 pm, Martijn van Buul <p...@dohd.org> wrote:
* orobal...@gmail.com:
I wrote that code almost 2 years ago. Actually, you don't need to be
sure about what the algorithm does :)

I tend to disagree. It wouldn't surprise me if what you're seeing is the
result of uninitialised data.
I wrote some functions to compare the structures member-by-member,
comparing the data arrays element by element.

The answers come out different when Process() is called with the same
initial data on a CNN allocated on the stack and a CNN allocated on
the heap. Also, the differences are in different places and the values
from each method itself are different on different runs.

So I would tend to agree that it is uninitialized data and/or maybe
some other bug which is the culprit.
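
An element-by-element check of this kind might look like the sketch
below. It is not Jyotirmoy's code, which was not posted; first_difference
is a hypothetical helper, and it assumes the data arrays have been copied
into std::vector<float> for comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first element where a and b differ, the length
// of the shorter vector if only the sizes differ, or -1 if they match.
// The comparison is exact (==) rather than tolerance-based, because two runs
// of the same deterministic code on the same input should agree exactly.
// Note that a NaN compares unequal to itself, so NaNs show up as differences.
inline long first_difference(const std::vector<float>& a,
                             const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return static_cast<long>(i);
    }
    if (a.size() != b.size()) return static_cast<long>(n);
    return -1;
}
```

Running this on the output of a stack instance and a heap instance, and
again on two runs of the same instance, separates "stack and heap
disagree" from "no two runs agree", which is the distinction drawn above.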

Jyotirmoy

Jul 2 '07 #14
