"Daniel Earwicker" <da**************@gmail.comwrote in message
news:11**********************@w3g2000hsg.googlegro ups.com...
>I wrote two trivial test programs that do a billion iterations of a
> virtual method call, first in C# (Visual Studio 2005):
>
> Thing t = new DerivedThing();
> for (System.Int64 n = 0; n < 1000000000; n++)
>     t.method();
>
> Then in C++ (using Visual C++ 2005):
>
> Thing *t = new DerivedThing();
> for (__int64 n = 0; n < 1000000000; n++)
>     t->method();
>
> ... with appropriate declarations in each case for Thing (abstract
> base class) and DerivedThing (method increments a counter).
>
> C# took 47 seconds, C++ took 58 seconds. Both were release builds.
>
> Now, given that the C++ implementation of virtual method dispatch is
> very "close to the metal", this must mean that by the time the C#
> version is running, there is no virtual method dispatch happening. The
> CLR JIT must be inlining the method call, right? (I looked at the IL
> and it's not being inlined by the C# compiler.)
>
> Then I tried moving the allocation of the DerivedThing inside the loop
> - for the C++ program this also meant putting a 'delete t' after the
> method call. Note that DerivedThing is a class in C#, not a struct,
> and it holds some data.
>
> C# took 13 seconds, C++ took 175 seconds. I was a bit shocked by this,
> so I ran both a few more times, with identical results.
>
> I thought maybe the JIT looks at what I'm doing with the object and
> realises that I'm not holding onto a reference to it outside of the
> loop scope, and so it doesn't need to be allocated on the garbage
> collected heap in the same way as a long-lived object. Of course, to
> know that, it would have to look at the code of method(), because it
> could be stashing the 'this' reference somewhere.
>
> So I modified DerivedThing's method so it stored the 'this' reference
> in a static member, but only on the fourteenth time (out of a billion!)
> that it was called. Now the CLR has to allocate a garbage collected
> object each time around the loop, right?
>
> But this merely increased the running time to 16 seconds, still less
> than 10% of the C++ result.
>
> So maybe it inlines method(), then looks at what it does and
> completely rewrites it to produce the same effect without allocating a
> billion objects?
>
> Are there any articles that will tell me what the CLR's garbage
> collected heap (and/or the JIT) is actually doing in this case? How
> can it be more than ten times faster than the non-garbage collected
> C++ heap?
Running this on Windows 32 bit (Win2K3 SP2):
// C# code
using System;
using System.Diagnostics;

namespace Willys
{
    abstract class Thing
    {
        int i;
        internal virtual int Method()
        {
            return i++;
        }
    }

    sealed class DerivedThing : Thing
    {}

    class Program
    {
        static long oneBillion = 1000000000;

        static void Main()
        {
            Test1();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Test2();
        }

        static void Test1()
        {
            DerivedThing dt = new DerivedThing();
            Stopwatch watch = Stopwatch.StartNew();
            for (long n = 0; n < oneBillion; n++)
                dt.Method();
            Console.WriteLine("Test1: {0} msecs.", watch.ElapsedMilliseconds);
        }

        static void Test2()
        {
            Stopwatch watch = Stopwatch.StartNew();
            for (long n = 0; n < oneBillion; n++)
            {
                DerivedThing dt = new DerivedThing();
                dt.Method();
            }
            Console.WriteLine("Test2: {0} msecs.", watch.ElapsedMilliseconds);
        }
    }
}
compiled with /o+, results in:
Test1: 3620 msecs.
Test2: 11325 msecs.
While running this:
// CPP code
#include <windows.h>
#include <cstdio>

class B
{
protected:
    int i;
    B() : i(0) {}                 // initialize the counter (was indeterminate)
    virtual int Method() = 0;     // = 0: pure virtual function
};

class C : public B
{
public:
    virtual int Method() { return i++; }
};

static __int64 oneBillion = 1000000000;

void Test1()
{
    C *c = new C;
    LARGE_INTEGER start, stop;
    QueryPerformanceCounter(&start);
    for (__int64 n = 0; n < oneBillion; n++)
        c->Method();
    QueryPerformanceCounter(&stop);
    // Dividing by 10000 assumes a 10 MHz performance counter on this box;
    // use QueryPerformanceFrequency for an exact conversion.
    printf_s("Test1: %I64d msecs.\n", (stop.QuadPart - start.QuadPart) / 10000);
}

void Test2()
{
    LARGE_INTEGER start, stop;
    QueryPerformanceCounter(&start);
    for (__int64 n = 0; n < oneBillion; n++)
    {
        C *c = new C();
        c->Method();
        delete c;
    }
    QueryPerformanceCounter(&stop);
    printf_s("Test2: %I64d msecs.\n", (stop.QuadPart - start.QuadPart) / 10000);
}

int main()
{
    Test1();
    Test2();
}
compiled with /O2 or /Ox, results in:
Test1: 1135 msecs.
Test2: 157780 msecs.
You see that C++ is about 3X faster than C# for Test1; the reasons are:
1. a slightly better-optimized Method (5 instructions for C# vs. 4 for C++)
2. faster virtual dispatch in C++
Here are the (partial) disassemblies of the C# and C++ methods and their
call sites.
Method cs:
001e01f0 8b5104 mov edx,dword ptr [ecx+4]
001e01f3 8d4201 lea eax,[edx+1]
001e01f6 894104 mov dword ptr [ecx+4],eax
001e01f9 8bc2 mov eax,edx
001e01fb c3 ret
Call site cs:
....
001e0164 8b4de8 mov ecx,dword ptr [ebp-18h]
001e0167 8b01 mov eax,dword ptr [ecx]
001e0169 ff5038 call dword ptr [eax+38h]
001e016c 83c601 add esi,1
001e016f 83d700 adc edi,0
001e0172 3b3d20301600 cmp edi,dword ptr ds:[163020h]
001e0178 7f0a jg 001e0184
001e017a 7ce8 jl 001e0164
001e017c 3b351c301600 cmp esi,dword ptr ds:[16301Ch]
001e0182 72e0 jb 001e0164
....
Method cpp:
00401000 8b4104 mov eax,dword ptr [ecx+4]
00401003 8d5001 lea edx,[eax+1]
00401006 895104 mov dword ptr [ecx+4],edx
00401009 c3 ret
Call site cpp:
....
00401060 8b17 mov edx,dword ptr [edi]
00401062 8b02 mov eax,dword ptr [edx]
00401064 8bcf mov ecx,edi
00401066 ffd0 call eax
00401068 83c301 add ebx,1
0040106b 83d600 adc esi,0
0040106e 3b3504d04000    cmp esi,dword ptr [image00400000+0xd004 (0040d004)]
00401074 7cea jl image00400000+0x1060 (00401060)
00401076 7f08 jg image00400000+0x1080 (00401080)
00401078 3b1d00d04000    cmp ebx,dword ptr [image00400000+0xd000 (0040d000)]
0040107e 72e0 jb image00400000+0x1060 (00401060)
....
On the other hand, Test2 is much faster in C#. This is because of the GC,
which can delay the collection of the garbage until after thousands of
instantiations; allocation is just a pointer bump, and the collections
themselves are extremely fast. The net result is that Test2 is ~14X faster
in C# despite the slightly slower code and dispatch.
Willy.