473,795 Members | 3,331 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Interesting results in speed comparison with C++

I wrote two trivial test programs that do a billion iterations of a
virtual method call, first in C# (Visual Studio 2005):

Thing t = new DerivedThing();
for (System.Int64 n = 0; n < 10000000000; n++)
t.method();

Then in C++ (using Visual C++ 2005):

Thing *t = new DerivedThing();
for (__int64 n = 0; n < 10000000000L; n++)
t->method();

.... with appropriate declarations in each case for Thing (abstract
base class) and DerivedThing (method increments a counter).

C# took 47 seconds, C++ took 58 seconds. Both were release builds.

Now, given that the C++ implementation of virtual method dispatch is
very "close to the metal", this must mean that by the time the C#
version is running, there is no virtual method dispatch happening. The
CLR JIT must be inlining the method call, right? (I looked at the IL
and it's not being inlined by the C# compiler).

Then I tried moving the allocation of the DerivedThing inside the loop
- for the C++ program this also meant putting a 'delete t' after the
method call. Note that DerivedThing is a class in C#, not a struct,
and it holds some data.

C# took 13 seconds, C++ took 175 seconds. I was a bit shocked by this,
so ran both a few more times, with identical results.

I thought maybe the JIT looks at what I'm doing with the object and
realises that I'm not holding onto a reference to it outside of the
loop scope, and so it doesn't need to be allocated on the garbage
collected heap in the same way as a long-lived object. Of course to
know that, it would have to look at the code of method(), because it
could be stashing the 'this' reference somewhere.

So I modified DerivedThing's method so it stored the 'this' reference
in a static member, but only on the forteenth time (out of a billion!)
that it was called. Now the CLR has to allocate a garbage collected
object each time around the loop, right?

But this merely increased the running time to 16 seconds, still less
than 10% of the C++ result.

So maybe it inlines method(), then looks at what it does and
completely rewrites it to produce the same effect without allocating a
billion objects?

Are there any articles that will tell me what the CLR's garbage
collected heap (and/or the JIT) is actually doing in this case? How
can it be more than ten times faster than the non-garbage collected C+
+ heap?

Sep 7 '07
12 4492
Willy Denoyette [MVP] wrote:
"Arne Vajhøj" <ar**@vajhoej.d kwrote in message
news:46******** *************** @news.sunsite.d k...
>Willy Denoyette [MVP] wrote:
>>news:46****** *************** **@news.sunsite .dk...
And not possible IMHO, using a long instead of an int must result in
slower code on 32 bit windows.

32 bit windows mean that addresses are 32 bit. It does not say anything
about whether operations on 64 bit integers are fast or slow.

This is not true
Yes. It is true.
,the code produced by the JIT when the counter type is a
long is not the same as the code produced when it's an int.

for(long n = 0; n < 1000000000; n++) !!! note the 1000000000 value!!!
...

// 32 bit version:
// Increments ecx two times until ecx overflow (that is 2 * 2^32).
// increment ecx for the remaining of 10000000000 - 4(2^32) (that is the
value 540BE400h)

...
001f00d8 83c101 add ecx,1
001f00db 83d600 adc esi,0
001f00de 85f6 test esi,esi
001f00e0 7f0a jg 001f00ec
001f00e2 7cf4 jl 001f00d8
001f00e4 81f900ca9a3b cmp ecx,3B9ACA00h
001f00ea 72ec jb 001f00d8
...

while a 32 bit increment like this:
for(int n = 0; n < 100000000; n++) !!! note the 1000000000 value!!!
produces this:

...
002100d6 83c001 add eax,1
002100d9 3d00ca9a3b cmp eax,3B9ACA00h
002100de 7cf6 jl 002100d6
...
Quite different and quite faster (~2X) to execute ?
Very interesting.

But completely irrelevant.

We are discussing whether the speed of 64 bit operations depend on
whether Windows is 32 or 64 bit - not whether 32 bit operations are
faster than 64 bit operations.
>The point is that C++ with 64 bit counter without /Ox seems to be
unexplainabl e slow.

And my hypothesis is that the results the original poster was seeing
was due to the long counter without /Ox and not due to virtual methods
optimization s.

This is guesswork as long as the OP doesn't post the whole code and the
compiler flags used to compile he programs.
The C# compiler removes the call to the method when compiled with o+,
unless it returns a value that is used outside the loop, you are simply
counting is the time needed to increment a registered value 10000000000
times.
The 32 bit C++ compiler (using the Ox flag) does exactly the same, it
removes the call if the return value is not used outside the loop, the
code produced is exactly the same as what the JIT32 produces.
It is guesswork, but I think it is the best guess so far.

Arne
Sep 9 '07 #11
"Arne Vajhøj" <ar**@vajhoej.d kwrote in message
news:46******** *************** @news.sunsite.d k...
Willy Denoyette [MVP] wrote:
>"Arne Vajhøj" <ar**@vajhoej.d kwrote in message
news:46******* *************** *@news.sunsite. dk...
>>Willy Denoyette [MVP] wrote:
news:46***** *************** ***@news.sunsit e.dk...
And not possible IMHO, using a long instead of an int must result in
slower code on 32 bit windows.

32 bit windows mean that addresses are 32 bit. It does not say anything
about whether operations on 64 bit integers are fast or slow.

This is not true

Yes. It is true.
Yes it's true, why do you ignore the code I posed for both 32 bit and 64 bit
integers, the JIT produces different and less optimum code for 64 bit types.
32 bit windows has no 64 registers to use for the counter, it has to use a
32 bit register and some extra instructions to account for the overflow whe
he has to deal with long counter types.
>
> ,the code produced by the JIT when the counter type is
a long is not the same as the code produced when it's an int.

for(long n = 0; n < 1000000000; n++) !!! note the 1000000000 value!!!
...

// 32 bit version:
// Increments ecx two times until ecx overflow (that is 2 * 2^32).
// increment ecx for the remaining of 10000000000 - 4(2^32) (that is the
value 540BE400h)

...
001f00d8 83c101 add ecx,1
001f00db 83d600 adc esi,0
001f00de 85f6 test esi,esi
001f00e0 7f0a jg 001f00ec
001f00e2 7cf4 jl 001f00d8
001f00e4 81f900ca9a3b cmp ecx,3B9ACA00h
001f00ea 72ec jb 001f00d8
...

while a 32 bit increment like this:
for(int n = 0; n < 100000000; n++) !!! note the 1000000000 value!!!
produces this:

...
002100d6 83c001 add eax,1
002100d9 3d00ca9a3b cmp eax,3B9ACA00h
002100de 7cf6 jl 002100d6
...
Quite different and quite faster (~2X) to execute ?

Very interesting.

But completely irrelevant.
We are discussing whether the speed of 64 bit operations depend on
whether Windows is 32 or 64 bit - not whether 32 bit operations are
faster than 64 bit operations.
Where did you ever mention 64 bit Windows in this thread?
You were comparing 32 bit with 64 bit counter types didn't you?
I clearly showed you that Int64 operations (loop counters) are slower than
Int32 on 32 bit windows.
>>The point is that C++ with 64 bit counter without /Ox seems to be
unexplainab le slow.

And my hypothesis is that the results the original poster was seeing
was due to the long counter without /Ox and not due to virtual methods
optimizations .

This is guesswork as long as the OP doesn't post the whole code and the
compiler flags used to compile he programs.
The C# compiler removes the call to the method when compiled with o+,
unless it returns a value that is used outside the loop, you are simply
counting is the time needed to increment a registered value 10000000000
times.
The 32 bit C++ compiler (using the Ox flag) does exactly the same, it
removes the call if the return value is not used outside the loop, the
code produced is exactly the same as what the JIT32 produces.

It is guesswork, but I think it is the best guess so far.
No it's not, you did not prove anything, Just post your "complete code",
else this discussion makes no sense.
Willy.
Sep 10 '07 #12
"Daniel Earwicker" <da************ **@gmail.comwro te in message
news:11******** **************@ w3g2000hsg.goog legroups.com...
>I wrote two trivial test programs that do a billion iterations of a
virtual method call, first in C# (Visual Studio 2005):

Thing t = new DerivedThing();
for (System.Int64 n = 0; n < 10000000000; n++)
t.method();

Then in C++ (using Visual C++ 2005):

Thing *t = new DerivedThing();
for (__int64 n = 0; n < 10000000000L; n++)
t->method();

... with appropriate declarations in each case for Thing (abstract
base class) and DerivedThing (method increments a counter).

C# took 47 seconds, C++ took 58 seconds. Both were release builds.

Now, given that the C++ implementation of virtual method dispatch is
very "close to the metal", this must mean that by the time the C#
version is running, there is no virtual method dispatch happening. The
CLR JIT must be inlining the method call, right? (I looked at the IL
and it's not being inlined by the C# compiler).

Then I tried moving the allocation of the DerivedThing inside the loop
- for the C++ program this also meant putting a 'delete t' after the
method call. Note that DerivedThing is a class in C#, not a struct,
and it holds some data.

C# took 13 seconds, C++ took 175 seconds. I was a bit shocked by this,
so ran both a few more times, with identical results.

I thought maybe the JIT looks at what I'm doing with the object and
realises that I'm not holding onto a reference to it outside of the
loop scope, and so it doesn't need to be allocated on the garbage
collected heap in the same way as a long-lived object. Of course to
know that, it would have to look at the code of method(), because it
could be stashing the 'this' reference somewhere.

So I modified DerivedThing's method so it stored the 'this' reference
in a static member, but only on the forteenth time (out of a billion!)
that it was called. Now the CLR has to allocate a garbage collected
object each time around the loop, right?

But this merely increased the running time to 16 seconds, still less
than 10% of the C++ result.

So maybe it inlines method(), then looks at what it does and
completely rewrites it to produce the same effect without allocating a
billion objects?

Are there any articles that will tell me what the CLR's garbage
collected heap (and/or the JIT) is actually doing in this case? How
can it be more than ten times faster than the non-garbage collected C+
+ heap?


Running this on Windows 32 bit (Win2K3 SP2):

// C# code
using System;
using System.Diagnost ics;
namespace Willys
{
abstract class Thing
{
int i;
internal virtual int Method()
{
return i++;
}
}
sealed class DerivedThing : Thing
{}
class Program
{
static long oneBillion = 1000000000;
static void Main()

{
Test1();
GC.Collect();
GC.WaitForPendi ngFinalizers();
Test2();
}
static void Test1()
{
DerivedThing dt = new DerivedThing();
Stopwatch watch = Stopwatch.Start New();
for (long n = 0; n < oneBillion; n++)
dt.Method();
Console.WriteLi ne ("Test1: {0} msecs.", watch.ElapsedMi lliseconds);
}
static void Test2()
{
Stopwatch watch = Stopwatch.Start New();
for (long n = 0; n < oneBillion; n++)
{
DerivedThing dt = new DerivedThing();
dt.Method();
}
Console.WriteLi ne ("Test2: {0} msecs.", watch.ElapsedMi lliseconds);
}
}
}

compiled with /o+, results in:
Test1: 3620 msecs.
Test2: 11325 msecs.

While running this:

// CPP code
#include <windows.h>
#include <cstdio>
class B
{
protected:
int i;
virtual int Method() = 0; // = 0 =pure virtual function

};

class C : B
{
public:
virtual int Method() {return i++;}
};

static __int64 oneBillion = 1000000000;
void Test1()
{
C *c = new C;
LARGE_INTEGER start, stop;
QueryPerformanc eCounter(&start );
for(__int64 n = 0; n < oneBillion ; n++)
c->Method();
QueryPerformanc eCounter(&stop) ;
printf_s("Test1 : %I64d msecs.\n", (stop.QuadPart - start.QuadPart) /
10000);
}
void Test2()
{
LARGE_INTEGER start, stop;
QueryPerformanc eCounter(&start );
for(__int64 n = 0; n < oneBillion ; n++)
{
C *c = new C();
c->Method();
delete c;
}
QueryPerformanc eCounter(&stop) ;
printf_s("Test2 : %I64d msecs.\n", (stop.QuadPart - start.QuadPart) /
10000);
}
int main()
{
Test1();
Test2();
}

compiled with /O2 or Ox, results in:

Test1: 1135 msecs.
Test2: 157780 msecs.

You see that C++ is 3X faster than C# for Test1, the reasons are:
1. some better optimized Method (5 instructions for C# vs. 4 for C++)
2. faster virtual dispatch for C++

Here are the disassemblies for both C# and C++ methods and call sites
(partial)
Method cs:
001e01f0 8b5104 mov edx,dword ptr [ecx+4]
001e01f3 8d4201 lea eax,[edx+1]
001e01f6 894104 mov dword ptr [ecx+4],eax
001e01f9 8bc2 mov eax,edx
001e01fb c3 ret

Call site cs:
....
001e0164 8b4de8 mov ecx,dword ptr [ebp-18h]
001e0167 8b01 mov eax,dword ptr [ecx]
001e0169 ff5038 call dword ptr [eax+38h]
001e016c 83c601 add esi,1
001e016f 83d700 adc edi,0
001e0172 3b3d20301600 cmp edi,dword ptr ds:[163020h]
001e0178 7f0a jg 001e0184
001e017a 7ce8 jl 001e0164
001e017c 3b351c301600 cmp esi,dword ptr ds:[16301Ch]
001e0182 72e0 jb 001e0164
....
Method cpp:
00401000 8b4104 mov eax,dword ptr [ecx+4]
00401003 8d5001 lea edx,[eax+1]
00401006 895104 mov dword ptr [ecx+4],edx
00401009 c3 ret

Call site cpp:
....
00401060 8b17 mov edx,dword ptr [edi]
00401062 8b02 mov eax,dword ptr [edx]
00401064 8bcf mov ecx,edi
00401066 ffd0 call eax
00401068 83c301 add ebx,1
0040106b 83d600 adc esi,0
0040106e 3b3504d04000 cmp esi,dword ptr [image00400000+0 xd004
(0040d004)]
00401074 7cea jl image00400000+0 x1060 (00401060)
00401076 7f08 jg image00400000+0 x1080 (00401080)
00401078 3b1d00d04000 cmp ebx,dword ptr [image00400000+0 xd000
(0040d000)]
0040107e 72e0 jb image00400000+0 x1060 (00401060)
....

On the other end, Test2 is much faster in C#, this is because of the GC,
which can delay the collection of the garbage till after thousands of
instatiations. These collections are extremely fast, the net result is that
Test2 is ~12X faster in C# despite the tiny slower code and dispatch.

Willy.
Sep 10 '07 #13

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
2717
by: Greg Brunet | last post by:
In doing some testing of different but simple algorithms for getting a list of prime numbers, I ended up getting some results that seem a bit contradictory. Given the following test program (testPrimes.py) with two algorithms that both check for primes by testing only odd numbers using factors up to the square root of the value, where Primes1 is based on all of the existing primes so far, and Primes2 is based on all odd numbers, I would...
29
3606
by: Bart Nessux | last post by:
Just fooling around this weekend. Wrote and timed programs in C, Perl and Python. Each Program counts to 1,000,000 and prints each number to the console as it counts. I was a bit surprised. I'm not an expert C or Perl programming expery, I'm most familiar with Python, but can use the others as well. Here are my results: C = 23 seconds Python = 26.5 seconds
15
2104
by: Nick Coghlan | last post by:
Thought some folks here might find this one interesting. No great revelations, just a fairly sensible piece on writing readable code :) The whole article: http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=271&page=1 The section specifically on white space: http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=271&page=3 Cheers,
3
1677
by: main\(\){}; | last post by:
I can't ignore the speed of .NET managed applications in manipulating string, I/O and arithmetic operations. However, we can never compare the speed of a C/C++ program with its .NET counterpart when it comes to some heavy operations, like long loops, graphics, load time and many other issues. The dream is; having an intermediate language (IL) run in the virtual machine at the speed of an unmanaged code. Java has in many ways solved this...
74
4611
by: aruna.mysore | last post by:
Hi all, I have a simple definitioin in a C file something like this. main() { char a; ....... int k; }
27
2341
by: Frederick Gotham | last post by:
I thought it might be interesting to share experiences of tracking down a subtle or mysterious bug. I myself haven't much experience with tracking down bugs, but there's one in particular which comes to mind. I was writing usable which dealt with strings. As per usual with my code, I made it efficient to the extreme. One thing I did was replace, where possible, any usages of "strlen" with something like: struct PtrAndLen { char *p;
27
3249
by: SQL Learner | last post by:
Hi all, I have an Access db with two large tables - 3,100,000 (tblA) and 7,000 (tblB) records. I created a select query using Inner Join by partial matching two fields (X from tblA and Y from tblB). The size of the db is about 200MBs. Now my issue is, the query has been running for over 3 hours already - I have no idea when it will end. I am using Access 2003. Are there ways to improve the speed performance? (Also, would the...
40
2731
by: nufuhsus | last post by:
Hello all, First let me appologise if this has been answered but I could not find an acurate answer to this interesting problem. If the following is true: C:\Python25\rg.py>python Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) on win32 Type "help", "copyright", "credits" or "license" for more
4
2343
by: wang frank | last post by:
Hi, While comparing the speed of octave and matlab, I decided to do a similar test for python and matlab. The result shows that python is slower than matlab by a factor of 5. It is not bad since octave is about 30 time slower than matlab. Here is the result in matlab: Elapsed time is 0.015389 seconds.
0
9672
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10435
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10213
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10163
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
1
7538
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6779
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5436
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5563
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
2920
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.