
C++/CLI the fastest compiler? Yes, at least for me. :-)

Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
and it forked over into another rant about which was the faster
compiler. Some said C# was just as fast as C++/CLI, whereas others said
C++/CLI was more optimized.

Anyway, I wrote up some very simple test code, and at least on my
computer C++/CLI came out the fastest. Here's the sample code, and just
for good measure I wrote one in Java, and it was the slowest! ;-) Also,
I did no optimizing compiler switches and compiled the C++/CLI with
/clr:safe only, to compile to pure verifiable .NET.

//C++/CLI code
using namespace System;

int main()
{
    long start = Environment::TickCount;
    for (int i = 0; i < 10000000; ++i) {}
    long end = Environment::TickCount;
    Console::WriteLine(end - start);
}

//C# code
using System;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long start = Environment.TickCount;
        for (int i = 0; i < 10000000; ++i) {}
        long end = Environment.TickCount;
        Console.WriteLine(end - start);
    }
}

//Java code
public class Performance
{
    public static void main(String args[])
    {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; ++i) {}
        long end = System.currentTimeMillis();
        System.out.println(end - start);
    }
}

Results:

C++/CLI -> 15-18 ms
C# -> 31-48 ms
Java -> 65-72 ms

I know, I know, these kinds of tests are not always foolproof, and results
can vary from computer to computer, but at least on my system, C++/CLI had
the fastest results.

Maybe C++/CLI is the most optimized compiler?

-Don Kim
Mar 12 '06 #1
44 Replies


Don Kim wrote:
C++/CLI -> 15-18 ms
C# -> 31-48 ms
Java -> 65-72 ms

I know, I know, these kinds of tests are not always foolproof, and
results can vary from computer to computer, but at least on my system,
C++/CLI had the fastest results.

Maybe C++/CLI is the most optimized compiler?


After increasing the length of the loops by a factor of 100, I see about a
2X speed advantage for C++/CLI as well. Looking at the IL produced by the
two compilers for the respective main functions:

C++:

.method assembly static int32 main() cil managed
{
// Code size 40 (0x28)
.maxstack 2
.locals (int32 V_0,
int32 V_1,
int32 V_2)
IL_0000: call int32 [mscorlib]System.Environment::get_TickCount()
IL_0005: stloc.2
IL_0006: ldc.i4.0
IL_0007: stloc.0
IL_0008: br.s IL_000e
// start of loop
IL_000a: ldloc.0
IL_000b: ldc.i4.1
IL_000c: add
IL_000d: stloc.0
IL_000e: ldloc.0
IL_000f: ldc.i4 0x3b9aca00
IL_0014: bge.s IL_0018
IL_0016: br.s IL_000a
// end of loop
IL_0018: call int32 [mscorlib]System.Environment::get_TickCount()
IL_001d: stloc.1
IL_001e: ldloc.1
IL_001f: ldloc.2
IL_0020: sub
IL_0021: call void [mscorlib]System.Console::WriteLine(int32)
IL_0026: ldc.i4.0
IL_0027: ret
} // end of method 'Global Functions'::main

C#:

.method public hidebysig static void Main(string[] args) cil managed
{
.entrypoint
// Code size 47 (0x2f)
.maxstack 2
.locals init (int64 V_0,
int32 V_1,
int64 V_2,
bool V_3)
IL_0000: nop
IL_0001: call int32 [mscorlib]System.Environment::get_TickCount()
IL_0006: conv.i8
IL_0007: stloc.0
IL_0008: ldc.i4.0
IL_0009: stloc.1
IL_000a: br.s IL_0012
// start of loop
IL_000c: nop
IL_000d: nop
IL_000e: ldloc.1
IL_000f: ldc.i4.1
IL_0010: add
IL_0011: stloc.1
IL_0012: ldloc.1
IL_0013: ldc.i4 0x3b9aca00
IL_0018: clt
IL_001a: stloc.3
IL_001b: ldloc.3
IL_001c: brtrue.s IL_000c
// end of loop
IL_001e: call int32 [mscorlib]System.Environment::get_TickCount()
IL_0023: conv.i8
IL_0024: stloc.2
IL_0025: ldloc.2
IL_0026: ldloc.0
IL_0027: sub
IL_0028: call void [mscorlib]System.Console::WriteLine(int64)
IL_002d: nop
IL_002e: ret
} // end of method ForLoopTest::Main
The C++ compiler did generate more optimized IL. It's surprising to me that
the JIT didn't do a better job of optimizing the C#-produced code.

Note that the C# code converted the time to a 64 bit value (C#'s long is 64
bits, while C++'s long is 32 bits), but that occurred outside the loop so it
should have next to no impact on the overall speed of the code.

-cd
Mar 12 '06 #2

Hi Carl!
The C++ compiler did generate more optimized IL. It's surprising to me that
the JIT didn't do a better job of optimizing the C#-produced code.


Wasn't there a statement that the JIT for .NET 2.0 is not doing
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but
couldn't find it anymore...
--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
Mar 12 '06 #3

>> The C++ compiler did generate more optimized IL. It's surprising to
me that the JIT didn't do a better job of optimizing the C#-produced
code.


Wasn't there a statement that the JIT for .NET 2.0 is not doing
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but
couldn't find it anymore...


Currently I could only find the confirmation of the "missing"
optimization for the CF. But I thought the same was true for the
"desktop" framework...

http://blogs.msdn.com/stevenpr/archi...12/502978.aspx

<quote>
Because the CLR can throw away native code under memory pressure or when
an application moves to the background, it is quite possible that the
same IL code may need to be jit compiled again when the application
continues running. This fact leads to our second major jit compiler
design decision: the time it takes to compile IL code often takes
precedence over the quality of the resulting native code. As with all
good compilers, the Compact Framework jit compiler does some basic
optimizations, but because of the need to regenerate code quickly in
order for applications to remain responsive, more extensive
optimizations generally take a back seat to sheer compilation speed.
</quote>

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
Mar 12 '06 #4

Jochen Kalmbach [MVP] wrote:

Currently I could only find the confirmation of the "missing"
optimization for the CF. But I thought the same was true for the
"desktop" framework...

http://blogs.msdn.com/stevenpr/archi...12/502978.aspx

<quote>
Because the CLR can throw away native code under memory pressure or when
an application moves to the background, it is quite possible that the
same IL code may need to be jit compiled again when the application
continues running. This fact leads to our second major jit compiler
design decision: the time it takes to compile IL code often takes
precedence over the quality of the resulting native code. As with all
good compilers, the Compact Framework jit compiler does some basic
optimizations, but because of the need to regenerate code quickly in
order for applications to remain responsive, more extensive
optimizations generally take a back seat to sheer compilation speed.
</quote>


That may be true. But I wonder why there cannot be both:
a fast IL compiler and one that is slow but optimizes much better. E.g.
"ngen" could have a command line switch to generate more optimized code.

Andre

Mar 12 '06 #5

Don Kim wrote:
I did no optimizing compiler switches


[...]

Then the test is meaningless. If you don't ask the compiler to optimize why
should it spend any effort on making your code fast?

[I don't have any stake in C++/CLI, C# or Java -- they can all die as far as
I am concerned -- my objection as an outsider is only about how you tested.]

--
Eugene
http://www.gershnik.com
Mar 12 '06 #6

> Wasn't there a statement that the JIT for .NET 2.0 is not doing
> optimizations (only simple optimizations)?
> I just remember a blog entry from someone at blogs.msdn.com... but
> couldn't find it anymore...


No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't produce optimized IL code and emits roughly what the
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler is the best optimized for IL of the MS stack. That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than optimizing in
the front end.
Thanks,
Shawn
Mar 12 '06 #7

Eugene Gershnik wrote:
Then the test is meaningless. If you don't ask the compiler to optimize why
should it spend any effort on making your code fast?


That was the whole point. If I were to use optimizing options, there
would invariably be arguments that either I did not use the correct
ones, not in the proper order, that certain compiler switches are not
equivalent, etc., etc. Therefore, I compiled as is, without any options, to
see how each compiler would compile on its own. I also made the test as
simple as possible so as to time how each compiler internally optimizes
a straight iteration of a common for loop.

In this case, it seems C++/CLI is the fastest in the managed Windows
environment.

-Don Kim
Mar 12 '06 #8

Hi Shawn!
Wasn't there a statement that the JIT for .NET 2.0 is not doing
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but
couldn't find it anymore...


No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't produce optimized IL code and emits roughly what the
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler is the best optimized for IL of the MS stack. That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than optimizing in
the front end.


Really?
I thought the C++/CLI compiler does not care what code it is generating.
It always tries to optimize the "pseudo-code".

Nevertheless... I found neither documentation saying that the JIT compiler
does optimization, nor documentation saying that it does not...

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
Mar 12 '06 #9

Shawn B. wrote:
Wasn't there a statement that the JIT for .NET 2.0 is not doing
optimizations (only simple optimizations)?
I just remember a blog entry from someone at blogs.msdn.com... but
couldn't find it anymore...
No, but there is a recent thread in this group where some MVPs insist that
the C++ compiler doesn't produce optimized IL code and emits roughly what the


You mean the sample where W.D. [MVP] gives an example showing that the
C++/CLI compiler doesn't do global optimization on IL code?

It does. IMHO the example is wrong. If I interpret the given example
correctly it's based on a call to an external DLL. So the C++/CLI
compiler must do an optimization over DLL boundaries ?! Since the DLL is
loaded dynamically, how should the C++/CLI compiler do any optimization ?

Why should the C++/CLI compiler not optimize the code ? I don't know how
the C++/CLI compiler is implemented, but I assume that the code
generation of native or CLI code is done by optimizing the generated
intermediate code, before native or managed code is generated. So that
(nearly) the same optimizer is used for "native code compiled to IL
code" and "native x86 code". If my assumption is true it would be plain
nonsense to revert this optimization, already done.
C# compiler does, despite the fact that your test, some VC++ devs,
publications, and my own internal software production have shown that the
C++/CLI compiler is the best optimized for IL of the MS stack. That said,
the same MVP insists that some MS employees have stated that the C++/CLI
compiler leaves all the optimization to the JIT rather than optimizing in
the front end.
If he gives a valid link to the statements, I will believe it. Which
doesn't mean that the statements are true.
Thanks,
Shawn


Andre
Mar 12 '06 #10

Andre Kaufmann wrote:
Why should the C++/CLI compiler not optimize the code ? I don't know
how the C++/CLI compiler is implemented, but I assume that the code
generation of native or CLI code is done by optimizing the generated
intermediate code, before native or managed code is generated. So that
(nearly) the same optimizer is used for "native code compiled to IL
code" and "native x86 code". If my assumption is true it would be
plain nonsense to revert this optimization, already done.


That is indeed the case. There's a single front-end for both native and
managed code. That front end produces CIL ('C' Intermediate Language) which
is then fed to the back end. The back-end consists of target independent
parts (e.g. CIL optimizations) and target dependent parts (e.g. code
generation).

-cd
Mar 12 '06 #11


"Don Kim" <in**@nospam.donkim.info> wrote in message
news:OY**************@TK2MSFTNGP09.phx.gbl...
| Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
| and it forked over into another rant about which was the faster
| compiler. Some said C# was just as fast as C++/CLI, whereas others said
| C++/CLI was more optimized.
|
| Anyway, I wrote up some very simple test code, and at least on my
| computer C++/CLI came out the fastest. Here's the sample code, and just
| for good measure I wrote one in java, and it was the slowest! ;-) Also,
| I did no optimizing compiler switches and compiled the C++/CLI with
| /clr:safe only to compile to pure verifiable .net.
|
| //C++/CLI code
| using namespace System;
|
| int main()
| {
| long start = Environment::TickCount;
| for (int i = 0; i < 10000000; ++i) {}
| long end = Environment::TickCount;
| Console::WriteLine(end - start);
| }
|
|
| //C# code
| using System;
|
| public class ForLoopTest
| {
| public static void Main(string[] args)
| {
| long start = Environment.TickCount;
| for (int i =0;i < 10000000; ++i) {}
| long end = Environment.TickCount;
| Console.WriteLine((end-start));
| }
| }
|
| //Java code
| public class Performance
| {
| public static void main(String args[])
| {
| long start = System.currentTimeMillis();
| for (int i=0; i < 10000000; ++i) {}
| long end = System.currentTimeMillis();
| System.out.println(end-start);
| }
| }
|
| Results:
|
| C++/CLI -> 15-18 ms
| C# -> 31-48 ms
| Java -> 65-72 ms
|
| I know, I know, these kinds of tests are not always foolproof, and results
| can vary from computer to computer, but at least on my system, C++/CLI had
| the fastest results.
|
| Maybe C++/CLI is the most optimized compiler?
|
| -Don Kim

Such a micro-benchmark has little value; an empty loop will be hoisted in
optimized builds (you aren't going to do this in real code, are you?).
More important, the way you measure execution time here is wrong. The
reason for this is that Environment.TickCount is updated with the real-time
clock tick, that is, every 10 msec, 15.6 msec or more, depending on the
CPU type (Intel, AMD, variants...). For instance, an AMD 64 ticks at an
interval of 15.5 msec, most Intel-based systems have an interval of 10 msec,
and most SMP systems tick at 20 msec or higher.
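
The effect of a coarse clock can be observed directly. The sketch below is in Java, since the thread already includes a Java sample; it assumes that System.currentTimeMillis() advances in discrete steps much as Environment.TickCount does, and the class and method names are made up for illustration. It waits for the millisecond clock to change twice and reports the size of that step, then times the same loop with System.nanoTime(), which is intended for elapsed-time measurement.

```java
// Sketch: estimate how coarsely System.currentTimeMillis() advances.
class TimerGranularity {
    // Spins until the millisecond clock changes twice and returns the size
    // of the second change; the first change only aligns us to a tick edge.
    static long observedStepMillis() {
        long t0 = System.currentTimeMillis();
        long t1;
        while ((t1 = System.currentTimeMillis()) == t0) { /* wait for an edge */ }
        long t2;
        while ((t2 = System.currentTimeMillis()) == t1) { /* wait one full step */ }
        return t2 - t1;
    }

    public static void main(String[] args) {
        System.out.println("currentTimeMillis step: " + observedStepMillis() + " ms");

        // Time the loop with the nanosecond-resolution elapsed-time source.
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 10000000; ++i) { sum += i; } // keep the loop body live
        long elapsed = System.nanoTime() - start;
        System.out.println("loop: " + elapsed + " ns (sum=" + sum + ")");
    }
}
```

If the reported step is comparable to the run time being measured (for example a 15 ms step against a loop of a few tens of milliseconds), the reading mostly reflects the clock rather than the code.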

To get accurate results you need to use the high performance counters or the
Stopwatch class in V2.
Here is the adapted code:

// C# code
// csc /o- bcs.cs
using System;
using System.Diagnostics;
public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;

        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10000000; ++i) {}
        sw.Stop();
        long ticks = sw.Elapsed.Ticks;
        Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
    }
}

// C++/CLI code
// cl /CLR:safe /Od bcc.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;

int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
        System::Diagnostics::Stopwatch::Frequency;

    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    for (int i = 0; i < 10000000; ++i) {}
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}

On my system, the above code, built with the command line arguments given in
the source (both non-optimized), shows the following results:

C#
37714104 nanoseconds
C++/CLI
37389069 nanoseconds

That means both are equally fast, but again this means nothing; such
micro-benchmarks have no value.

Note that an optimized C++ build will hoist the empty loop (remove it
completely from the IL). This kind of hoisting is not done by the C# compiler,
and there is a reason for it.
That doesn't mean there is no loop optimization; it's just done at the JIT
level!

Willy.

Mar 12 '06 #12

Willy Denoyette [MVP] wrote:
wrong. The reason or this is that Environment.TickCount is updated
with the real time clock tick. That is every 10 msec or 15,6 msec or
higher, depending on the CPU type (Intel AMD, variants...). For
instance an AMD 64 ticks at an interval of 15.5 msec, most intel
based systems have an interval of 10msec, most SMP systems tick at
20msec or higher.

To get accurate results you need to use the high performance counters
or the Stopwatch class in V2.

On my system using above code and the command line arguments as
specified in the source (both non optimized) show following results:

C#
37714104 nanoseconds
C++/CLI
37389069 nanoseconds


Interesting. I took Don's sample and increased the loop count by a factor of
100 and consistently got execution times of about 530ms for the C++ code and
1200ms for the C# code.

Granted, the resolution of GetTickCount is poor - but that's a large enough
difference to be significant.

Your results are actually much closer to what I expected - nearly identical
performance, but I can't see why replacing GetTickCount with StopWatch would
have any effect other than to increase the resolution of the time
measurement.

But... here's what I found with your examples: First, I changed both to
calculate nanosecPerTick as a double instead of a long - on a system with a
tick rate higher than 1 GHz, your calculation results in 0 all the time.
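
The truncation is easy to reproduce. A minimal sketch in Java (whose long is also a 64-bit integer), using the two frequency values reported in this thread; the class and method names are illustrative only:

```java
// Sketch: integer vs. floating-point "nanoseconds per tick".
class NanosPerTick {
    // Integer division, as in the adapted samples: truncates toward zero.
    static long truncated(long frequencyHz) {
        return (1000L * 1000L * 1000L) / frequencyHz;
    }

    // Floating-point division keeps the fractional part.
    static double exact(long frequencyHz) {
        return 1e9 / frequencyHz;
    }

    public static void main(String[] args) {
        long boardTimerHz = 3579545L;    // frequency reported on the AMD64 machine
        long cpuClockHz   = 3052420000L; // frequency reported on the Pentium D

        System.out.println(truncated(boardTimerHz) + " vs " + exact(boardTimerHz));
        System.out.println(truncated(cpuClockHz) + " vs " + exact(cpuClockHz));
    }
}
```

With a frequency above 1 GHz the integer version yields 0, so every elapsed time multiplied by it is reported as 0 nanoseconds. Even below 1 GHz the truncation discards the fractional part of the factor.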

With that change, I get a time of 15.8us for the C++ code and 42.3us for
the C# code - about the same difference I saw with GetTickCount.

It seems that there's something significantly different about your machine
as compared to mine & Don's when it comes to the performance of this code -
and that is very interesting!

What's your machine hardware? I'm running on a 3 GHz P4 with 1GB of RAM
under XP SP2. I'm suspicious of your times (and mine as well) as I doubt my
machine is 2000 times faster than yours.

-cd
Mar 12 '06 #13


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:ux**************@tk2msftngp13.phx.gbl...
| Willy Denoyette [MVP] wrote:
| > wrong. The reason or this is that Environment.TickCount is updated
| > with the real time clock tick. That is every 10 msec or 15,6 msec or
| > higher, depending on the CPU type (Intel AMD, variants...). For
| > instance an AMD 64 ticks at an interval of 15.5 msec, most intel
| > based systems have an interval of 10msec, most SMP systems tick at
| > 20msec or higher.
| >
| > To get accurate results you need to use the high performance counters
| > or the Stopwatch class in V2.
| >
| > On my system using above code and the command line arguments as
| > specified in the source (both non optimized) show following results:
| >
| > C#
| > 37714104 nanoseconds
| > C++/CLI
| > 37389069 nanoseconds
|
| Interesting. I took Don's sample and increased the loop count by a factor of
| 100 and consistently got execution times of about 530ms for the C++ code and
| 1200ms for the C# code.
|

Are you sure you compiled non-optimized (/Od and /o-)? As I said, the C++
compiler will hoist the loop when optimization is on (/O1, /O2 or whatever).

| Granted, the resolution of GetTickCount is poor - but that's a large enough
| difference to be significant.
|

It's not the resolution, it's the interval which is the culprit.

| Your results are actually much closer to what I expected - nearly identical
| performance, but I can't see why replacing GetTickCount with StopWatch would
| have any effect other than to increase the resolution of the time
| measurement.
|

Stopwatch uses the QueryPerformanceCounter and QueryPerformanceFrequency
high resolution counters of the OS.

| But... here's what I found with your examples: First, I changed both to
| calculate nanosecPerTick as a double instead of a long - on a system with a
| tick rate higher than 1 GHz, your calculation results in 0 all the time.
|

That's very surprising. QueryPerformanceFrequency (Stopwatch.Frequency)
should not be that high. Notice that this frequency is not the CPU clock
frequency; it's the output of a CPU clock divider, and its frequency is much
lower. On my system it's 3579545MHz (try with:
Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);)
If on your system it's much higher than 1GHz, you might have an issue with
your system.

| With that change, I get a time of 15.8us for the C++ code and 42.3us for
| the C# code - about the same difference I saw with GetTickCount.
|

Hmmm, 15.8 usec. for 10000000 loops in which you execute 6 instructions
[1] per loop, that would mean 60000000 instructions in 15.8 usec or 0.000263
nanosecs/instruction, or ~4.000.000.000.000 instructions/sec. Not possible
really; looks like the loop is hoisted or your clock is broken ;-).

| It seems that there's something significantly different about your machine
| as compared to mine & Don's when it comes to the performance of this
code -
| and that is very interesting!

Looks like you have to investigate the Frequency value returned first, and
inspect your code.

|
| What's your machine hardware? I'm running on a 3Ghz P4 with 1GB of RAM
| under XP SP2. I'm suspicious of your times (and mine as well) as I doubt
my
| machine is 2000 times faster than yours.
|

I have it running on an AMD64 Athlon 3500+, 2GB, XP SP2, with CPU clock
throttling disabled.
Increasing the loop count by a factor 100 gives me:

3737032857 nanoseconds

or 3.7 seconds.
or 3737032857/1000000000 = 3.737032857 nsec/loop or ~0.63 nsec. per
instruction (avg.)
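
The arithmetic above, written out as a small Java sketch. The figure of 6 instructions per iteration comes from the disassembly in [1]; the class and method names are only for illustration.

```java
// Sketch: average cost per loop iteration and per instruction.
class LoopCost {
    static double nanosPerIteration(long totalNanos, long iterations) {
        return (double) totalNanos / iterations;
    }

    static double nanosPerInstruction(long totalNanos, long iterations,
                                      int instructionsPerIteration) {
        return nanosPerIteration(totalNanos, iterations) / instructionsPerIteration;
    }

    public static void main(String[] args) {
        long totalNanos = 3737032857L; // measured time for the 100x loop
        long iterations = 1000000000L; // 10,000,000 * 100

        System.out.println(nanosPerIteration(totalNanos, iterations) + " ns/loop");
        System.out.println(nanosPerInstruction(totalNanos, iterations, 6)
                + " ns/instruction");
    }
}
```

The same two functions applied to 15800 ns and 10,000,000 iterations give the implausibly small per-instruction time discussed earlier in this post.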

| -cd
|
|

[1]
00d100d2 83c201 add edx,0x1
00d100d5 81fa80969800 cmp edx,0x989680
00d100db 0f9cc0 setl al
00d100de 0fb6c0 movzx eax,al
00d100e1 85c0 test eax,eax
00d100e3 75ed jnz 00d100d2

notes:
- 0x989680 = 10.000.000 decimal
- this is native code, generated by the JIT in a non-optimized build.

Willy.


Mar 12 '06 #14

Don Kim wrote:
Eugene Gershnik wrote:
Then the test is meaningless. If you don't ask the compiler to
optimize why should it spend any effort on making your code fast?

[Rearranging your post a little]
That was the whole point.
[...]
In this case, it seems C++/CLI is the fastest in the managed Windows
environment.
Let me see. Take a world record holder for the 100m dash and take me. Put us
both before a 100m track and ask us to get to the end at whatever pace we
want. He walks. I run. I get there before him. Your conclusion seems to be
that I am a faster runner.
If I were to use optimizing options, there
would invariably be arguments that either I did not use the correct
ones, not in the proper order, that certain compiler switches are not
equivalent, etc., etc.
Yes, measuring compiler performance is hard. If you want to get meaningful
results you will need to study each one's options in detail, determine what
people usually set in their optimized builds, create a meaningful test set
etc. etc. If you don't do all this, announcing to the world that X compiler is
faster is a waste of electrons.
I also made the
test as simple as possible so as to time how each compiler internally
optimizes a straight iteration of a common for loop.


You didn't ask them to optimize the loop.

--
Eugene
http://www.gershnik.com

Mar 12 '06 #15

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:ul**************@TK2MSFTNGP11.phx.gbl...
"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++
compiler will hoist the loop when optimization is on (O&, O2 or whatever).
Quite certain - I used the exact command lines given in your posting
(optimization is off by default as well, so specifying nothing is equivalent
to /Od).
It's not the resolution it's the interval which is the cullprit.
We're talking about the same thing - 15ms precision is quite sufficient for
measuring intervals of 500ms or more and certainly won't account for a 50%
measurement error for such intervals - only 3% or so.
That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency)
should not be that high, notice that this Frequency is not the CPU clock
frequency, it's the output of a CPU clock divider, it's frequency is much
lower, on my System it's 3579545MHz (try with:
Console::WriteLine(System::Diagnostics::Stopwatch: :Frequency);)
If on your system it's much higher than 1GHz, you might have an issue with
your system.
(You made a typo - on your system it's 3579545Hz, not MHz)

If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
instruction which does report actual CPU core clocks. If your system
doesn't use the MP HAL, then QPC uses the system board timer, which
generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of
3.57954545 MHz. Note that this rate has absolutely nothing to do with your
CPU clock - it's a completely independent crystal oscillator on the MB.
| With that change, I get a time of 15.8us for the C++ code and 42.3us for
| the C# code - about the same difference I saw with GetTickCount.
|

Hmmm, 15.8 usec. for 10000000 loops in which you execute 6 instructions
[1] per loop, that would mean 60000000 instructions in 15.8 usec or 0.000263
nanosecs/instruction, or ~4.000.000.000.000 instructions/sec. Not possible
really; looks like the loop is hoisted or your clock is broken ;-).
I agree - it doesn't add up. I'm quite sure that I did unoptimized builds,
and the results are 100% reproducible. But see below.
| It seems that there's something significantly different about your machine
| as compared to mine & Don's when it comes to the performance of this code -
| and that is very interesting!

Looks like you have to investigate the Frequency value returned first, and
inspect your code.


Well, it's your code - not mine. The Frequency value is right on for this
machine.

I'm at my office right now, on a different computer. This one's a 3GHz
Pentium D. I modified the samples as before to make nanosecPerTick double
instead of Int64 and added code to print the value of Stopwatch.Frequency
and the raw Ticks and nanosecPerTick. Here are the results:

C:\Dev\Misc\fortest>fortest0312cs
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
240117 ticks
78664.4695028862 nanoseconds

C:\Dev\Misc\fortest>fortest0312cpp
Stopwatch frequency=3052420000
0.327608913583321 ns/tick
49225 ticks
16126.548771139 nanoseconds

Increasing the loop count by a factor of 10 increases the times by a factor
of 10. Decreasing by a factor of 10 decreases the times by a factor of 10.
Clearly the loop has not been optimized out, but that still doesn't explain
the apparent execution speed of more than 200 adds per clock cycle (I know
modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think
so!)
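
The "200 adds per clock" figure follows from dividing the iteration count by the reported tick count. A small Java sketch, assuming the runs above used the original 10,000,000 iterations and treating one reported tick as one CPU clock (which is exactly the premise being questioned); names are illustrative:

```java
// Sketch: loop iterations completed per reported timer tick.
class AddsPerClock {
    static double iterationsPerTick(long iterations, long ticks) {
        return (double) iterations / ticks;
    }

    public static void main(String[] args) {
        long iterations = 10000000L; // assumed loop count for these runs

        System.out.println("C++: " + iterationsPerTick(iterations, 49225L));
        System.out.println("C#:  " + iterationsPerTick(iterations, 240117L));
    }
}
```

A result far above 1 iteration per tick suggests that the tick count and the frequency are not in the same unit, rather than that the CPU is that fast.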

I don't know what's going on here, but two things seem to be true:

1. The C++ code is faster on these machines. If I increase the loop count
to 1,000,000,000 I can clearly see the difference in execution time with my
eyes.
2. The Stopwatch class doesn't appear to work correctly on these machines -
it's measuring times that are orders of magnitude too short, yet still
proportional to the actual time spent.

Working on the assumption that #2 is true, I modified the code to call
QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are the
results:

C:\Dev\Misc\fortest>fortest0312cpp
QPC frequency=3052420000
0.327608913583321 ns/tick
22388910 ticks
7334806.48141475 nanoseconds

C:\Dev\Misc\fortest>fortest0312cs
QPC frequency=3052420000
0.327608913583321 ns/tick
58980368 ticks
19322494.2832245 nanoseconds

The times are now much more reasonable - Stopwatch apparently doesn't work
correctly with such a high value from QPF (it's apparently off by a factor
of 1000). The ratio of times remains about equal though- the C++ code is
still nearly 2X faster on this machine (despite the fact that that makes no
sense at all, it seems to be true).
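
The conversion behind these figures is ticks multiplied by 1e9 / frequency. A Java sketch that recomputes the two results from the tick counts and the QPC frequency printed above (class and method names are hypothetical):

```java
// Sketch: convert raw performance-counter ticks to nanoseconds.
class QpcConversion {
    static double ticksToNanos(long ticks, long frequencyHz) {
        double nanosPerTick = 1e9 / frequencyHz; // floating point, no truncation
        return ticks * nanosPerTick;
    }

    public static void main(String[] args) {
        long frequencyHz = 3052420000L; // QPC frequency reported above
        double cppNanos = ticksToNanos(22388910L, frequencyHz);
        double csNanos  = ticksToNanos(58980368L, frequencyHz);

        System.out.println("C++: " + cppNanos + " ns");
        System.out.println("C#:  " + csNanos + " ns");
        System.out.println("C# / C++ ratio: " + (csNanos / cppNanos));
    }
}
```

Because both runs share one frequency, the ratio of the two times equals the ratio of the raw tick counts, so it does not depend on the conversion factor at all.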

-cd

Mar 12 '06 #16

"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message
The times are now much more reasonable - Stopwatch apparently doesn't work
correctly with such a high value from QPF (it's apparently off by a factor
of 1000). The ratio of times remains about equal though- the C++ code is
still nearly 2X faster on this machine (despite the fact that that makes
no sense at all, it seems to be true).


Follow-up -

It appears that Stopwatch scales the QPF/QPC values internally if the
frequency is "high", causing Stopwatch.ElapsedTicks to report a scaled
value, but Stopwatch.Frequency still reports the full resolution value
returned by QPF.

Stopwatch.ElapsedMilliseconds and Stopwatch.Elapsed both return correctly
scaled values.

This is clearly a bug in the Stopwatch class.
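
A possible alternative reading, offered as an assumption rather than a diagnosis: if the test programs printed Elapsed.Ticks, as the adapted code earlier in the thread does, that value is a TimeSpan tick count, and a TimeSpan tick is 100 ns regardless of Stopwatch.Frequency, whereas Stopwatch.ElapsedTicks counts raw timer ticks of 1/Frequency seconds. The Java sketch below only illustrates the unit arithmetic with the 49225-tick figure reported above; the names are hypothetical and no .NET API is called.

```java
// Sketch: the same tick count interpreted in two different units.
class TickUnits {
    // Assumption: the count is a .NET TimeSpan tick count (100 ns per tick).
    static long timeSpanTicksToNanos(long ticks) {
        return ticks * 100L;
    }

    // Assumption: the count is a raw timer tick count at frequencyHz.
    static double rawTicksToNanos(long ticks, long frequencyHz) {
        return ticks * (1e9 / frequencyHz);
    }

    public static void main(String[] args) {
        long ticks = 49225L;            // tick count reported for the C++ run
        long frequencyHz = 3052420000L; // Stopwatch.Frequency reported above

        System.out.println("as TimeSpan ticks: " + timeSpanTicksToNanos(ticks) + " ns");
        System.out.println("as raw ticks:      " + rawTicksToNanos(ticks, frequencyHz) + " ns");
    }
}
```

Under the first interpretation the run lasted a few milliseconds, the same order of magnitude as the direct QPC measurement; under the second it comes out in microseconds, which matches the implausible figures seen earlier.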

-cd
Mar 12 '06 #17


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:OB**************@TK2MSFTNGP09.phx.gbl...
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:ul**************@TK2MSFTNGP11.phx.gbl...
| > "Carl Daniel [VC++ MVP]"
<cp*****************************@mvps.org.nospam >
| > Are you sure you compiled non optimized (/Od and /o-)? As I said, the C++
| > compiler will hoist the loop when optimization is on (O&, O2 or whatever).
|
| Quite certain - I used the exact command lines given in your posting
| (optimization if off by default as well, so specifying nothing is equivalent
| to /Od).
|
| > It's not the resolution it's the interval which is the cullprit.
|
| We're talking about the same thing - 15ms precision is quite sufficient for
| measuring intervals of 500ms or more and certainly won't account for a 50%
| measurement error for such intervals - only 3% or so.

Yes, but not for a loop of 10.000.000 (as in Don's code), which takes only
37 msecs. to complete. And as I said, on SMP systems this interval can
be as large as 60 msecs. (as I have measured here on a Compaq Proliant 8-way
system).

|
| > That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency)
| > should not be that high, notice that this Frequency is not the CPU clock
| > frequency, it's the output of a CPU clock divider, it's frequency is much
| > lower, on my System it's 3579545MHz (try with:
| > Console::WriteLine(System::Diagnostics::Stopwatch::Frequency);)
| > If on your system it's much higher than 1GHz, you might have an issue with
| > your system.
|
| (You made a typo - on your system it's 3579545Hz, not MHz)
|

Right, sorry for that.

| If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
| instruction which does report actual CPU core clocks. If your system
| doesn't use the MP HAL, then QPC uses the system board timer, which
| generally has a clock speed of 1X or 0.5X the NTSC color burst frequency of
| 3.57954545 Mhz. Note that this rate has absolutely nothing to do with your
| CPU clock - it's a completely independent crystal oscillator on the MB.
|
True, the MP HAL uses the external CPU clock (yours runs at 3.052420000 GHz),
but the 3.57954545 MHz clock is derived from a divider or, otherwise stated,
the CPU clock (internal) is always a multiple of this 3.57954545 MHz. For
instance, an Intel PIII 1GHz stepping 5 clocks at 3.57954545 MHz * 278 =
995MHz. The stepping number is important here, as it may change the divider's
value.

No, my current test machine is not MP or HT, so it doesn't use an MP HAL,
and you didn't specify that in your previous reply either. It's quite
important, as I know about the MP HAL.

| > | With that change, I get a time of 15.8us for the C++ code and 42.3us
| > | for the C# code - about the same difference I saw with GetTickCount.
| > |
| >
| > Hmmm, 15.8 usec for 10000000 loops in which you execute 6 instructions
| > [1] per loop, that would mean 60000000 instructions in 15.8 usec or
| > 0.000263 nanosecs/instruction, or ~4.000.000.000.000 instructions/sec -
| > not really possible, looks like the loop is hoisted or your clock is
| > broken ;-).
|
| I agree - it doesn't add up. I'm quite sure that I did unoptimized
| builds, and the results are 100% reproducible. But see below.
|
| > | It seems that there's something significantly different about your
| > | machine as compared to mine & Don's when it comes to the performance
| > | of this code - and that is very interesting!
| >
| > Looks like you have to investigate the Frequency value returned first,
| > and inspect your code.
|
| Well, it's your code - not mine. The Frequency value is right on for this
| machine.
|

Well ..., it's Don's code. What do you mean by "the Frequency value is
right"? The Frequency is also right on mine :-).

| I'm at my office right now, on a different computer. This one's a 3GHz
| Pentium D. I modified the samples as before to make nanosecPerTick double
| instead of Int64 and added code to print the value of Stopwatch.Frequency
| and the raw Ticks and nanosecPerTick. Here are the results:
|

| C:\Dev\Misc\fortest>fortest0312cs
| Stopwatch frequency=3052420000
| 0.327608913583321 ns/tick
| 240117 ticks
| 78664.4695028862 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cpp
| Stopwatch frequency=3052420000
| 0.327608913583321 ns/tick
| 49225 ticks
| 16126.548771139 nanoseconds
|

That's for 10000000 loops I assume.

| Increasing the loop count by a factor of 10 increases the times by a
| factor of 10. Decreasing by a factor of 10 decreases the times by a
| factor of 10. Clearly the loop has not been optimized out, but that
| still doesn't explain the apparent execution speed of more than 200 adds
| per clock cycle (I know modern CPUs are somewhat super-scalar, but 200
| adds/clock? I don't think so!)
|

That's not possible: an Intel Pentium IV CPU fetches and executes 2
instructions per cycle.
The AMD Athlon 64 fetches and executes a max. of 3 instructions per cycle
(mine clocks at 2.2GHz).
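
The "200 adds per clock" figure can be checked directly from the Stopwatch readings quoted above. The sketch below is only that arithmetic, in Java; the class name is made up, the loop count of 10,000,000 is the one assumed in this reply, and treating one tick as one CPU clock relies on Carl's statement that his QPC frequency equals the core clock.

```java
// Sketch: loop iterations per timer tick implied by a measurement.
// Assumes one tick corresponds to one CPU clock on the measured machine.
class ImpliedRate {
    static double iterationsPerTick(long iterations, long ticks) {
        return (double) iterations / ticks;
    }

    public static void main(String[] args) {
        long loops = 10_000_000L; // assumed loop count for both runs
        System.out.println("C++ run: " + iterationsPerTick(loops, 49_225L) + " iterations/tick");
        System.out.println("C# run:  " + iterationsPerTick(loops, 240_117L) + " iterations/tick");
    }
}
```

Both results come out far above the 2 to 3 instructions per cycle mentioned here, which points at the tick values rather than at the CPU.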

These are the results on PIV 3GHz not HT running W2K3 R2.
C#
Frequency = 3579545
46632867 nanoseconds

C++
Frequency = 3579545
40659177 nanoseconds

Notice the difference between C++ and C#: it looks like the X86 JIT'd code
is not exactly the same, I have to check this.
Remember the result on AMD 64 bit (XP SP2), 37368702 nanoseconds; that
means that the AMD and the Intel 3GHz show comparable results, as expected.

| I don't know what's going on here, but two things seem to be true:
|
| 1. The C++ code is faster on these machines. If I increase the loop count
| to 1,000,000,000 I can clearly see the difference in execution time with
| my eyes.

Assuming the timings are correct, it's simply not possible to execute that
number of instructions in that time, so there must be something going on
here.

| 2. The Stopwatch class doesn't appear to work correctly on these
| machines - it's measuring times that are orders of magnitude too short,
| yet still proportional to the actual time spent.
|
| Working on the assumption that #2 is true, I modified the code to call
| QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are the
| results:
|
| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|

How many loops here?

| The times are now much more reasonable - Stopwatch apparently doesn't work
| correctly with such a high value from QPF (it's apparently off by a factor
| of 1000).

This is really strange, as Stopwatch uses the same QueryPerformanceCounter
and QueryPerformanceFrequency under the hood.

| The ratio of times remains about equal though - the C++ code is
| still nearly 2X faster on this machine (despite the fact that that makes
| no sense at all, it seems to be true).
|

Time to inspect the Stopwatch code, and I'll try to prepare a multicore or HT
box to do some more tests.

wd.
Mar 12 '06 #18

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:eG**************@TK2MSFTNGP14.phx.gbl...
"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
| If your machine uses the MP HAL (which mine does), then QPC uses the
| RDTSC instruction which does report actual CPU core clocks. If your
| system doesn't use the MP HAL, then QPC uses the system board timer,
| which generally has a clock speed of 1X or 0.5X the NTSC color burst
| frequency of 3.57954545 MHz. Note that this rate has absolutely nothing
| to do with your CPU clock - it's a completely independent crystal
| oscillator on the MB.
|
> True, the MP HAL uses the external CPU clock (yours runs at 3.052420000
> GHz), but the 3.57954545 MHz clock is derived from a divider; or,
> otherwise stated, the (internal) CPU clock is always a multiple of this
> 3.57954545 MHz. For instance, an Intel PIII 1GHz stepping 5 clocks at
> 3.57954545 MHz * 278 = 995 MHz. The stepping number is important here,
> as it may change the divider's value.

Not (necessarily) true. For example, this Pentium D machine uses a BCLK
frequency of 200 MHz with a multiplier of 15. There's no requirement
(imposed by the CPU or MCH) that the CPU clock be related to color burst
frequency at all.

Now, it's entirely possible that the motherboard generates that 200 MHz BCLK
by multiplying a color burst crystal by 56 (200.45 MHz), but that's a
motherboard detail that's unrelated to the CPU. Without really digging,
there's no way I can tell one way or another - just looking at the MB, I see
at least 4 different crystal oscillators of unknown frequency. Historically,
the only reason color burst crystals are used is that they're cheap -
they're manufactured by the gazillion for NTSC televisions.

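
The clock figures traded in this exchange are simple multiplications, and a few lines of Java reproduce them. This is only a sketch of the arithmetic: the class, constant and method names are hypothetical, and the multipliers are the ones quoted in the posts, not values read from any data sheet.

```java
// Sketch: reproduce the clock arithmetic quoted in the two posts.
class ClockMath {
    static final double COLOR_BURST_MHZ = 3.57954545; // NTSC color burst

    static double multiply(double baseMhz, int multiplier) {
        return baseMhz * multiplier;
    }

    public static void main(String[] args) {
        // Pentium D: 200 MHz BCLK times a multiplier of 15
        System.out.println("BCLK x 15        = " + multiply(200.0, 15) + " MHz");
        // Color burst times 56, the hypothetical source of a ~200 MHz BCLK
        System.out.println("color burst x 56  = " + multiply(COLOR_BURST_MHZ, 56) + " MHz");
        // Color burst times 278, the PIII example
        System.out.println("color burst x 278 = " + multiply(COLOR_BURST_MHZ, 278) + " MHz");
    }
}
```

The numbers are consistent with both posts, so the arithmetic alone cannot settle whether a given board really derives its clocks from a color burst crystal.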
| Working on the assumption that #2 is true, I modified the code to call
| QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are
| the results:
|
| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|

> How many loops here?

That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
reasonable rate to me - certainly not off by orders of magnitude.

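
For readers who want to redo the conversion, here is a short Java sketch that turns the raw QPC tick counts quoted above into nanoseconds and into ticks per loop iteration. The class and method names are made up for illustration; the frequency, tick counts and loop count are the ones given in the posts, and reading "ticks per iteration" as "clock cycles per iteration" assumes the QPC frequency equals the core clock, as stated for this machine.

```java
// Sketch: convert raw QPC tick counts into elapsed time and per-iteration cost.
class QpcMath {
    // Elapsed nanoseconds for a tick count at the given counter frequency (Hz).
    static double nanoseconds(long ticks, long frequencyHz) {
        return ticks * 1e9 / frequencyHz;
    }

    // Average number of counter ticks spent per loop iteration.
    static double ticksPerIteration(long ticks, long iterations) {
        return (double) ticks / iterations;
    }

    public static void main(String[] args) {
        long frequency = 3_052_420_000L; // QPC frequency reported above
        long loops = 10_000_000L;        // loop count stated in the reply
        long cppTicks = 22_388_910L;
        long csTicks = 58_980_368L;

        System.out.println("C++: " + nanoseconds(cppTicks, frequency) + " ns, "
                + ticksPerIteration(cppTicks, loops) + " ticks/iteration");
        System.out.println("C#:  " + nanoseconds(csTicks, frequency) + " ns, "
                + ticksPerIteration(csTicks, loops) + " ticks/iteration");
    }
}
```

The C++ run works out to a little over 2.2 ticks per iteration and the C# run to just under 5.9, which matches the figures being discussed.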
| I don't know what's going on here, but two things seem to be true:
|
| 1. The C++ code is faster on these machines. If I increase the loop
| count to 1,000,000,000 I can clearly see the difference in execution
| time with my eyes.

> Assuming the timings are correct, it's simply not possible to execute that
> number of instructions in that time, so there must be something going on
> here.


It's completely reasonable based on the times reported directly by QPC, not
the bogus values from Stopwatch, which is off by a factor of 1000 on these
machines.

So, any theory why the C++ code consistently runs faster than the C# code on
both of my machines? I can't think of any reasonable argument why having a
dual core or HT CPU would make the C++ code run faster. Clearly the JIT'd
code is different for the two loops - maybe there's some pathological code
in the C# case that the P4 executes much more slowly than AMD, or some
optimal code in the C++ case that the P4 executes much more quickly than
AMD. I'd be curious to hear the details of Don's machine - Intel/AMD,
Single/HT/Dual, etc.

-cd
Mar 12 '06 #19


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:eP**************@TK2MSFTNGP10.phx.gbl...
| "Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
| wrote in message
| > The times are now much more reasonable - Stopwatch apparently doesn't
| > work correctly with such a high value from QPF (it's apparently off by
| > a factor of 1000). The ratio of times remains about equal though - the
| > C++ code is still nearly 2X faster on this machine (despite the fact
| > that that makes no sense at all, it seems to be true).
|
| Follow-up -
|
| It appears that Stopwatch scales the QPF/QPC values internally if the
| frequency is "high", causing Stopwatch.ElapsedTicks to report a scaled
| value, but Stopwatch.Frequency still reports the full resolution value
| returned by QPF.
|
| Stopwatch.ElapsedMilliseconds and Stopwatch.Elapsed both return correctly
| scaled values.
|
| This is clearly a bug in the Stopwatch class.
|
| -cd
|
|

I see, but it still doesn't explain this:

| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|

Why is C++ almost 3 times faster than C#? Are we sure the ticks are
accurate? Are we sure the OS counter is updated for every tick? Are we sure
the OS goes to the HAL to read the HW clock tick value at each call of
QueryPerformanceCounter (this must be quite expensive)?

And why is it 2 and 5 times faster than on my AMD box, while the results are
comparable (AMD a little faster) when I run it on Intel 3GHz non HT (see my
previous post)?

That means that the native code must be different, while it is the same on
my AMD box (despite the fact that the IL is different):
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2

Which is not really the best algorithm for X86, I wonder what it looks like
on Intel. Grr.. micro benchmarks, what a mess ;-)

Willy.

Mar 12 '06 #20

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:ug*************@TK2MSFTNGP12.phx.gbl...
> That means that the native code must be different, while it is the same
> on my AMD box (despite the fact that the IL is different).
> add edx,0x1
> cmp edx,0x989680
> setl al
> movzx eax,al
> test eax,eax
> jnz 00d100d2
>
> Which is not really the best algorithm for X86, I wonder what it looks
> like on Intel. Grr.. micro benchmarks, what a mess ;-)


Here's what I see (loops going 1 billion times):

The JIT'd C++ code:
// ---------------------------------------------------
for (int i = 0; i < 1000000000; ++i) {}
00000077 xor edx,edx
00000079 mov dword ptr [esp],edx
0000007c nop
0000007d jmp 00000082
// start of loop
0000007f inc dword ptr [esp]
00000082 cmp dword ptr [esp],3B9ACA00h
00000089 jge 0000008E
0000008b nop
0000008c jmp 0000007F
// end of loop

The JIT'd C# code:
// ---------------------------------------------------
for (int i =0;i < 1000000000; ++i) {}
00000098 xor ebx,ebx
0000009a nop
0000009b jmp 000000A0
// start of loop
0000009d nop
0000009e nop
0000009f inc ebx
000000a0 cmp ebx,3B9ACA00h
000000a6 setl al
000000a9 movzx eax,al
000000ac mov dword ptr [ebp-6Ch],eax
000000af cmp dword ptr [ebp-6Ch],0
000000b3 jne 0000009D
// end of loop

Neither of these represents ideal code by any stretch of the imagination -
but instruction count alone probably accounts for the bulk of the difference
between the two programs on this machine. Why the results are so different
from what you see on your AMD machine I can't even guess.
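
As a rough check of the instruction-count argument, the Java sketch below counts the instructions executed per iteration in each listing (everything between the "start of loop" and "end of loop" markers) and compares the ratio with the QPC tick ratio reported earlier in the thread. The class name and arrays are purely illustrative, and counting instructions says nothing about how many cycles each one costs.

```java
// Sketch: compare per-iteration instruction counts from the two listings.
class LoopBodySize {
    // Mnemonics between "start of loop" and "end of loop" in each listing.
    static final String[] CPP_LOOP = { "inc", "cmp", "jge", "nop", "jmp" };
    static final String[] CS_LOOP =
            { "nop", "nop", "inc", "cmp", "setl", "movzx", "mov", "cmp", "jne" };

    static double ratio(String[] slower, String[] faster) {
        return (double) slower.length / faster.length;
    }

    public static void main(String[] args) {
        System.out.println("C++ instructions per iteration: " + CPP_LOOP.length);
        System.out.println("C#  instructions per iteration: " + CS_LOOP.length);
        System.out.println("instruction ratio C#/C++: " + ratio(CS_LOOP, CPP_LOOP));
        // QPC tick counts quoted earlier in the thread (10,000,000 loops)
        System.out.println("measured tick ratio C#/C++: " + 58_980_368.0 / 22_388_910.0);
    }
}
```

The instruction ratio is 1.8 while the measured ratio is about 2.6, so the count explains most of the gap but not all of it; the two memory accesses to [ebp-6Ch] in the C# loop are a plausible candidate for the rest, though the thread does not confirm that.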

-cd

Mar 12 '06 #21

Carl Daniel [VC++ MVP] wrote:
> So, any theory why the C++ code consistently runs faster than the C# code on
> both of my machines? I can't think of any reasonable argument why having a
> dual core or HT CPU would make the C++ code run faster. Clearly the JIT'd
> code is different for the two loops - maybe there's some pathological code
> in the C# case that the P4 executes much more slowly than AMD, or some
> optimal code in the C++ case that the P4 executes much more quickly than
> AMD. I'd be curious to hear the details of Don's machine - Intel/AMD,
> Single/HT/Dual, etc.


Wow, this is becoming interesting. We're getting down to discussions of
CPU architecture and instruction sets. Talk about getting down to the
metal!

Anyway, I just reran my test code with larger loop factors, as well as
the other code with my original and larger loop factors, and C++/CLI
still came out around 2X faster.
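
For anyone repeating the test with larger loop counts, here is one way the Java version could be restructured so that the clock granularity matters less. It is a sketch under assumptions, not the code Don ran: it swaps System.currentTimeMillis for System.nanoTime, repeats each measurement and keeps the best run, and writes the counter to a volatile field in the hope that the JIT keeps the loop. The class, field and method names are invented for illustration, and a JIT is still free to simplify an empty loop.

```java
// Sketch: time an empty counting loop several times and keep the best run.
class LoopTimer {
    // The final counter value is stored here so the loop has a visible effect.
    static volatile int sink;

    static long timeLoopNanos(int iterations) {
        long start = System.nanoTime();
        int i = 0;
        for (; i < iterations; ++i) { }
        sink = i;
        return System.nanoTime() - start;
    }

    // Repeat the measurement and return the shortest elapsed time seen.
    static long bestOf(int runs, int iterations) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; ++r) {
            best = Math.min(best, timeLoopNanos(iterations));
        }
        return best;
    }

    public static void main(String[] args) {
        int[] counts = { 10_000_000, 100_000_000 };
        for (int count : counts) {
            System.out.println(count + " iterations: best of 5 = "
                    + bestOf(5, count) + " ns");
        }
    }
}
```

Taking the best of several runs reduces the influence of JIT warm-up and of other processes; if the time does not grow with the loop count, the loop has most likely been optimized away.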

I ran these both on my laptop and desktop. Here's the configuration:

Laptop: Pentium Centrino 1.86 GHz, 1 GB RAM, Windows XP Pro, SP2
Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2

I know someone who has an AMD computer, and I'm going to run my programs
on that computer to see if there's something in the CPU that's causing
the discrepancies.

-Don Kim
Mar 13 '06 #22


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:uc**************@TK2MSFTNGP14.phx.gbl...
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:eG**************@TK2MSFTNGP14.phx.gbl...
| > "Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
| > | If your machine uses the MP HAL (which mine does), then QPC uses the
| > | RDTSC instruction which does report actual CPU core clocks. If your
| > | system doesn't use the MP HAL, then QPC uses the system board timer,
| > | which generally has a clock speed of 1X or 0.5X the NTSC color burst
| > | frequency of 3.57954545 MHz. Note that this rate has absolutely
| > | nothing to do with your CPU clock - it's a completely independent
| > | crystal oscillator on the MB.
| > |
| > True, the MP HAL uses the external CPU clock (yours runs at 3.052420000
| > GHz), but the 3.57954545 MHz clock is derived from a divider; or,
| > otherwise stated, the (internal) CPU clock is always a multiple of this
| > 3.57954545 MHz. For instance, an Intel PIII 1GHz stepping 5 clocks at
| > 3.57954545 MHz * 278 = 995 MHz. The stepping number is important here,
| > as it may change the divider's value.
|
| Not (necessarily) true. For example, this Pentium D machine uses a BCLK
| frequency of 200Mhz with a multiplier of 15. There's no requirement
| (imposed by the CPU or MCH) that the CPU clock be related to color burst
| frequency at all.
|
Carl, I'm not saying this is the case for all types of CPUs and
motherboards, I only say that it's true for Pentiums up to III; things are
different for other types of CPUs. See, AMD clocks at 200MHz with a
multiplier of 11 or 12 depending on the type (and CPU id). This 200MHz clock
can be adjusted (overclocked or underclocked) and the Frequency returned by
QueryPerformanceFrequency stays the same; the same is true for recent PIVs,
Pentium M and D. So here it's true that both aren't related, and the
3.57954545MHz clock is derived from the on-board graphics controller, or
from an external clock source (on mobo or not) when there is no on-board
graphics controller, but the value remains the same 3.57954545MHz unless you
are using an MP HAL.

| Now, it's entirely possible that the motherboard generates that 200 MHz
| BCLK by multiplying a color burst crystal by 56 (200.45 MHz), but that's
| a motherboard detail that's unrelated to the CPU. Without really digging,
| there's no way I can tell one way or another - just looking at the MB, I
| see at least 4 different crystal oscillators of unknown frequency.
| Historically, the only reason color burst crystals are used is that
| they're cheap - they're manufactured by the gazillion for NTSC televisions.
|

I know, Carl, I've been working for IHVs (HP, before that Compaq, before
that DEC ...), I know what you are talking about. Even on DEC Alpha (AXP)
systems, the QueryPerformance frequency was 3.57954545MHz using the mono CPU
HAL, while on SMP boxes like the Alpha 8400 range (with the MP HAL) it was
also not the case. Jeez, what a bunch of problems we had when porting W2K
(never released, for well-known reasons) from Intel code to AXP, just
because some drivers and core OS components did not expect
QueryPerformanceCounter speeds higher than 1GHz (that is, when we
overclocked an 800MHz CPU).

| > | Working on the assumption that #2 is true, I modified the code to call
| > | QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are
| > | the results:
| > |
| > | C:\Dev\Misc\fortest>fortest0312cpp
| > | QPC frequency=3052420000
| > | 0.327608913583321 ns/tick
| > | 22388910 ticks
| > | 7334806.48141475 nanoseconds
| > |
| > | C:\Dev\Misc\fortest>fortest0312cs
| > | QPC frequency=3052420000
| > | 0.327608913583321 ns/tick
| > | 58980368 ticks
| > | 19322494.2832245 nanoseconds
| > |
| >
| > How many loops here?
|
| That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
| reasonable rate to me - certainly not off by orders of magnitude.
|

Sure it is, I misread the tick values (it's well past midnight here, time
to go to bed).

| > | I don't know what's going on here, but two things seem to be true:
| > |
| > | 1. The C++ code is faster on these machines. If I increase the loop
| > | count to 1,000,000,000 I can clearly see the difference in execution
| > | time with my eyes.
| >
| > Assumed the timings are correct, it's simply not possible to execute
| > that number instructions during that time, so there must be something
| > going on here.
|
| It's completely reasonable based on the times reported directly by QPC,
| not the bogus values from Stopwatch, which is off by a factor of 1000 on
| these machines.
|
| So, any theory why the C++ code consistently runs faster than the C# code
| on both of my machines? I can't think of any reasonable argument why
| having a dual core or HT CPU would make the C++ code run faster. Clearly
| the JIT'd code is different for the two loops - maybe there's some
| pathological code in the C# case that the P4 executes much more slowly
| than AMD, or some optimal code in the C++ case that the P4 executes much
| more quickly than AMD. I'd be curious to hear the details of Don's
| machine - Intel/AMD, Single/HT/Dual, etc.
|
| -cd
|

Well, I have investigated the native code generated on the Intel PIV (see
previous post).
Here is (part of) the disassembly (VS2005) for C++:
....
0000001f 46 inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03 jge 0000002B
00000028 90 nop ---> not sure what this one is good for, it's ignored by the CPU anyway
00000029 EB F4 jmp 0000001F
....

That means 4 instructions per loop (not counting the nop) compared to 6 on
AMD. And the results are comparable to yours (for C++).
I did not look at the C# code and its result, but the above shows that the
JIT compiler generates (better?) code for PIV (I don't know what the __cpuid
call returns, but I know the CLR checks it when booting). Again, notice this
is an unoptimized code build (/Od flag set); optimized code is a totally
different story.

Willy.




Mar 13 '06 #23


"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:eg**************@TK2MSFTNGP11.phx.gbl...
Last follow-up (before my spouse pulls the plug).
Here is the X86 output of a C# release build on both AMD and Intel PIV:
[1]
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl

This results in 6.235684 msec on AMD and 7.023547 msec on PIV (10.000.000
loops).

And this is the debug build on Intel:

00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

Note that the release build is about the most compact X86 code possible for
the loop. The C++/CLI compiler in an optimized build hoists the loop
completely, so I can't compare.
Carl, could you look at the disassembly on your box? Not a problem if you
can't (it doesn't mean that much anyway); it looks like on your box the
C++/CLI output looks more like [1] above.
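
The release-build timings above can be turned into a per-iteration cost with a couple of divisions, sketched here in Java. The class and method names are hypothetical. The clock speeds are nominal values taken from earlier posts (2.2 GHz for the Athlon 64, 3 GHz for the PIV), and it is an assumption that the AMD timing comes from that same 2.2 GHz box.

```java
// Sketch: per-iteration cost of the 3-instruction release loop.
class ReleaseTimings {
    static double nsPerIteration(double elapsedMs, long iterations) {
        return elapsedMs * 1e6 / iterations;
    }

    // Nanoseconds times GHz gives clock cycles.
    static double cyclesPerIteration(double elapsedMs, long iterations, double clockGHz) {
        return nsPerIteration(elapsedMs, iterations) * clockGHz;
    }

    public static void main(String[] args) {
        long loops = 10_000_000L;
        System.out.println("AMD: " + nsPerIteration(6.235684, loops) + " ns, "
                + cyclesPerIteration(6.235684, loops, 2.2) + " cycles per iteration");
        System.out.println("PIV: " + nsPerIteration(7.023547, loops) + " ns, "
                + cyclesPerIteration(7.023547, loops, 3.0) + " cycles per iteration");
    }
}
```

Under those assumptions the loop costs roughly 1.4 cycles per iteration on the AMD and about 2.1 on the PIV, so the two machines end up close in wall-clock time despite the different clock speeds.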

Willy.


Mar 13 '06 #24

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:eg**************@TK2MSFTNGP11.phx.gbl...
> Pentium M and D. So here it's true that both aren't related, and the
> 3.57954545MHz clock is derived from the on-board graphics controller or an
> external clock source (on mobo or not) when no on-board graphics
> controller, but the value remains the same 3.57954545MHz unless you are
> using a MP HAL.


I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
color burst, or 1.7897727 MHz. But this particular branch has drifted far
from the real point of this thread - interesting though (made me go look at
the Pentium D data sheet, after all!)

-cd
Mar 13 '06 #25


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:O$**************@tk2msftngp13.phx.gbl...
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:ug*************@TK2MSFTNGP12.phx.gbl...
| > That means that the native code must be different, while it is the same
| > on my AMD box (despite the fact that the IL is different).
| > add edx,0x1
| > cmp edx,0x989680
| > setl al
| > movzx eax,al
| > test eax,eax
| > jnz 00d100d2
| >
| > Which is not really the best algorithm for X86, I wonder what it looks
| > like on Intel. Grr.. micro benchmarks, what a mess ;-)
|
| Here's what I see (loops going 1 billion times):
|
| The JIT'd C++ code:
| // ---------------------------------------------------
| for (int i = 0; i < 1000000000; ++i) {}
| 00000077 xor edx,edx
| 00000079 mov dword ptr [esp],edx
| 0000007c nop
| 0000007d jmp 00000082
| // start of loop
| 0000007f inc dword ptr [esp]
| 00000082 cmp dword ptr [esp],3B9ACA00h
| 00000089 jge 0000008E
| 0000008b nop
| 0000008c jmp 0000007F
| // end of loop
|
| The JIT'd C# code:
| // ---------------------------------------------------
| for (int i =0;i < 1000000000; ++i) {}
| 00000098 xor ebx,ebx
| 0000009a nop
| 0000009b jmp 000000A0
| // start of loop
| 0000009d nop
| 0000009e nop
| 0000009f inc ebx
| 000000a0 cmp ebx,3B9ACA00h
| 000000a6 setl al
| 000000a9 movzx eax,al
| 000000ac mov dword ptr [ebp-6Ch],eax
| 000000af cmp dword ptr [ebp-6Ch],0
| 000000b3 jne 0000009D
| // end of loop
|
| Neither of these represents ideal code by any stretch of the imagination -
| but instruction count alone probably accounts for the bulk of the
| difference between the two programs on this machine. Why the results are
| so different from what you see on your AMD machine I can't even guess.
|
| -cd
|

Thanks, that's almost exactly what I noticed; see my previous reply.

C# Intel...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030

C# AMD...
add edx,0x1
cmp edx,0x989680
setl al
movzx eax,al
test eax,eax
jnz 00d100d2

Conclusion: the JIT takes the CPU type into account even in debug builds! So
it generates different X86 code even from the same IL.
This is extremely weird; for instance, the inc esi used on Intel is an
add edx,1 on AMD, so different register allocations and a different
instruction. Well, I know add is preferred over inc on AMD (according to
their "Optimization guide for AMD64 Processors"), but can you believe MSFT
went that far with the JIT (in debug builds)?

Willy.

Mar 13 '06 #26


"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:%2***************@TK2MSFTNGP12.phx.gbl...
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:eg**************@TK2MSFTNGP11.phx.gbl...
| > Pentium M and D. So here it's true that both aren't related, and the
| > 3.57954545MHz clock is derived from the on-board graphics controller or
| > an external clock source (on mobo or not) when no on-board graphics
| > controller, but the value remains the same 3.57954545MHz unless you are
| > using a MP HAL.
|
| I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
| color burst, or 1.7897727 MHz. But this particular branch has drifted far
| from the real point of this thread - interesting though (made me go look
| at the Pentium D data sheet, after all!)
|

Can't remember this, but I guess you are right, much depends on the chipset
used. I was on the Alpha team at that time (where we built the AXP HALs and
drivers); I moved to Intel architectures after the Compaq merger ;-). Digital
had their own chipsets for Alpha systems (that's why they were too
expensive, right?), nothing commodity like what is available now.

Willy.

Mar 13 '06 #27

Willy Denoyette [MVP] wrote:
> "Optimization guide for AMD64 Processors"), can you believe MSFT went
> that far with the JIT (in debug builds)?


Well, yeah. Maybe. I'm under the (possibly misguided) impression that
debug primarily stops the JIT from inlining and hoisting - things that
change the relative order of the native code compared to the IL code.
Within those guidelines, I guess it still picks the best codegen it can
based on the machine.

My belief is that there are multiple full-time Intel and AMD employees at
MSFT that do nothing but work on the compiler back-ends, including the CLR
JIT.

-cd
Mar 13 '06 #28

"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote:

> I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
> color burst, or 1.7897727 MHz.


Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The original
PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
12 for the counter.
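
For anyone who wants to check the arithmetic, here is a minimal sketch in
standard C++. It is not code from anyone in this thread, and the function
names are made up for illustration. It assumes the nominal NTSC color burst
of 315/88 MHz, which matches the 3.57954545 MHz figure quoted earlier, and
shows that crystal/12 and color burst/3 land on the same 1.193182 MHz.

```cpp
#include <cstdio>

// Nominal NTSC color burst in Hz (assumption: 315/88 MHz = 3.579545... MHz).
constexpr double color_burst_hz() { return 315.0e6 / 88.0; }

// Original PC crystal: 4x the color burst (14.31818 MHz).
constexpr double pc_crystal_hz() { return 4.0 * color_burst_hz(); }

// Counter input: crystal divided by 12, which is 1/3 of the color burst.
constexpr double pc_timer_hz() { return pc_crystal_hz() / 12.0; }

// Prints the three frequencies in MHz.
void print_pc_clocks() {
    std::printf("color burst: %.6f MHz\n", color_burst_hz() / 1e6);
    std::printf("crystal:     %.6f MHz\n", pc_crystal_hz() / 1e6);
    std::printf("counter:     %.6f MHz\n", pc_timer_hz() / 1e6);
}
```

Half the color burst would be about 1.789773 MHz, so the two candidate
values differ by exactly a factor of 1.5.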
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Mar 13 '06 #29

P: n/a
Tim Roberts wrote:
> "Carl Daniel [VC++ MVP]"
> <cp*****************************@mvps.org.nospam > wrote:
>
>> I'm 99.99% sure that my old P-II machine produced a QPC frequency of
>> 1/2 color burst, or 1.7897727 MHz.
>
> Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The
> original PC had a 14.31818 MHz crystal (4x the color burst), and they
> divided it by 12 for the counter.


Yep. That sounds right - 1.789 just didn't feel quite right :)

-cd
Mar 13 '06 #30

P: n/a

"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:OP**************@TK2MSFTNGP11.phx.gbl...
| Willy Denoyette [MVP] wrote:
| > "Optimization guide for AMD64 Processors"), can you believe MSFT went
| > that far with the JIT (in debug builds)?
|
| Well, yeah. Maybe. I'm under the (possibly misguided) impression that
| debug primarily stops the JIT from inlining and hoisting - things that
| change the relative order of the native code compared to the IL code.
| Within those guidelines, I guess it still picks the best codegen it can
| based on the machine.
|
| My belief is that there are multiple full-time Intel and AMD employees at
| MSFT that do nothing but work on the compiler back-ends, including the CLR
| JIT.
|

Well, I would expect this for the C++ compiler back-end, but not directly
for the JIT compiler which is more time constrained, but I guess I'm wrong.

Willy.
Mar 13 '06 #31

P: n/a

"Tim Roberts" <ti**@probo.com> wrote in message
news:1r********************************@4ax.com...
| "Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
| wrote:
| >
| >I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
| >color burst, or 1.7897727 MHz.
|
| Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The original
| PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
| 12 for the counter.
| --
| - Tim Roberts, ti**@probo.com
| Providenza & Boekelheide, Inc.

Yep, an old 200 MHz (199.261 MHz) P6 "Model 1, Stepping 7" of mine gives a
QPC frequency of 1.193182 MHz, that is, CPU clock/167.

Willy.
Mar 13 '06 #32

P: n/a

"Don Kim" <in**@nospam.donkim.info> wrote in message
news:ez**************@TK2MSFTNGP09.phx.gbl...
| Carl Daniel [VC++ MVP] wrote:
| > So, any theory why the C++ code consistently runs faster than the C#
| > code on both of my machines? I can't think of any reasonable argument
| > why having a dual core or HT CPU would make the C++ code run faster.
| > Clearly the JIT'd code is different for the two loops - maybe there's
| > some pathological code in the C# case that the P4 executes much more
| > slowly than AMD, or some optimal code in the C++ case that the P4
| > executes much more quickly than AMD. I'd be curious to hear the details
| > of Don's machine - Intel/AMD, Single/HT/Dual, etc.
|
| Wow, this is becoming interesting. We're getting down to discussions of
| CPU architecture and instruction sets. Talk about getting down to the
| metal!
|

That's true. If you are running empty loops, you are not only comparing
compiler optimizations, you are also measuring architectural differences at
the CPU, L1/L2 cache and memory controller level. That's also why such
micro-benchmarks have little or no value.

| Anyway, I just reran my test code with larger loop factors, as well as
| the other code with my original and larger loop factors, and C++/CLI
| still came out around 2X faster.
|
| I ran these both on my laptop and desktop. Here's the configuration:
|
| Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2
| Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2
|

Just curious what the QPC frequency is on the Centrino.

| I know someone who has an AMD computer, and I'm going to run my programs
| on that computer to see if there's something in the CPU that's causing
| the discrepancies.

Well, I noticed that for debug builds, C++/CLI produces smaller IL, and the
JIT produces different X86 code for C# and C++/CLI. Here are the for
loops...

X86 for C# (debug)
...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030
....

X86 for C++/CLI (debug)

0000001f 46 inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03 jge 0000002B
00000028 90 nop
00000029 EB F4 jmp 0000001F

An optimized C# build produces an even shorter code path:
...
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl 0000001C
...

Now, one would expect the shorter code paths to run faster, but they do
not: all three take the same time to finish.

The reason for this (AFAIK) is that superscalar CPUs like the AMD prefer
longer code paths (longer than a cache line) in order to feed the
instruction pipeline with longer bursts. I don't know how this plays out on
the Intel Centrino and the P4 HT, but it looks like they behave
differently. (I'll try this with an assembly code program.)

Anyway, I don't care that much about this; empty loops are not that common,
I guess (and C++ will hoist them anyway). Once you do something reasonable
inside the loop, the loop overhead is reduced to dust and the pipeline gets
filled in a more optimal way.
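
Since an optimizing compiler is free to remove an empty loop, a loop
benchmark needs a side effect that cannot be dropped. Here is a minimal
sketch of that idea in standard C++ (not C++/CLI). The names run_loop and
time_loop_ns are made up for illustration, and std::chrono::steady_clock
stands in for Stopwatch, so treat it as an assumption-laden sketch rather
than the code measured in this thread.

```cpp
#include <chrono>
#include <cstdint>

// Runs `iterations` passes and returns the sum 0 + 1 + ... + (iterations-1).
// Every pass reads and writes a volatile object, so the compiler has to
// keep the loop even in an optimized build.
std::uint64_t run_loop(std::uint64_t iterations) {
    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sink = sink + i;  // volatile access, performed on every pass
    }
    return sink;
}

// Times run_loop with a monotonic clock and returns elapsed nanoseconds.
std::int64_t time_loop_ns(std::uint64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    run_loop(iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Calling time_loop_ns(10000000) several times and averaging gives a number
that at least measures a loop that really executed, although the volatile
traffic changes what is being measured compared with a truly empty loop.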

Willy.


Mar 13 '06 #33

P: n/a

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@tk2msftngp13.phx.gbl...
|
| "Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
| wrote in message news:OP**************@TK2MSFTNGP11.phx.gbl...
|| Willy Denoyette [MVP] wrote:
|| > "Optimization guide for AMD64 Processors"), can you believe MSFT went
|| > that far with the JIT (in debug builds)?
||
|| Well, yeah. Maybe. I'm under the (possibly misguided) impression that
|| debug primarily stops the JIT from inlining and hoisting - things that
|| change the relative order of the native code compared to the IL code.
|| Within those guidelines, I guess it still picks the best codegen it can
|| based on the machine.
||
|| My belief is that there are multiple full-time Intel and AMD employees at
|| MSFT that do nothing but work on the compiler back-ends, including the
|| CLR JIT.
||
|
| Well, I would expect this for the C++ compiler back-end, but not directly
| for the JIT compiler which is more time constrained, but I guess I'm
| wrong.
|
| Willy.
|
|

Some more fun.

Consider this program:

//C++/CLI code
// File : EmptyLoop.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;
#pragma unmanaged
void ForLoopTest( void )
{
__asm {
xor esi,esi; 0 -> esi
jmp begin;
iter:;
inc esi; i++
begin:;
cmp esi,989680h ; i < 10000000?
jl iter; no
}
return;
}
#pragma managed
int main()
{
Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
System::Diagnostics::Stopwatch::Frequency;
Stopwatch^ sw = gcnew Stopwatch;
sw->Start();
ForLoopTest();
sw->Stop();
Int64 ticks = sw->Elapsed.Ticks;
Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}

Compiled with:
cl /clr /O2 EmptyLoop.cpp
output:
24935346 nanoseconds

cl /clr /Od EmptyLoop.cpp
output:
37636821 nanoseconds

See, the loop is in assembly, pure unmanaged X86 code. The code produced by
the C++ compiler [1] is the same in both builds except for the function
prolog and epilog, although the results are different. Any takers?

[1]
/Od build

void ForLoopTest( void )
{
00401000 55 push ebp
00401001 8B EC mov ebp,esp
00401003 56 push esi
__asm {
xor esi,esi; 0 -> esi
00401004 33 F6 xor esi,esi
jmp begin;
00401006 EB 01 jmp begin (401009h)
iter:;
inc esi; i++
00401008 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401009 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100F 7C F7 jl iter (401008h)
}
return;
}
00401011 5E pop esi
00401012 5D pop ebp
00401013 C3 ret

/O2 build

void ForLoopTest( void )
{
00401000 56 push esi
xor esi,esi; 0 -> esi
00401001 33 F6 xor esi,esi
jmp begin;
00401003 EB 01 jmp begin (401006h)
iter:;
inc esi; i++
00401005 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401006 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100C 7C F7 jl iter (401005h)
__asm {
0040100E 5E pop esi
}
return;
}
0040100F C3 ret

Willy.

Mar 13 '06 #34

P: n/a
Ok, final update.
The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
on all platforms.

Using Stopwatch.Elapsed.Milliseconds gives the following results.

Values are averages over 10 runs.

C# ~12.8 msec. for 10.000.000 loops
C++/CLI ~9.1 msec.

Release build:

C# ~9.1 msec.
C++/CLI - loop hoisted by C++/CLI compiler (no IL body)

The X86 code for the loop in the C++/CLI /Od build and in the C# optimized
build is nearly the same (different registers allocated, and inc instead of
add).

Now this:

#using <System.dll>
using namespace System;
using namespace System::Diagnostics;
#pragma unmanaged
void ForLoopTest( void )
{
__asm {
xor esi,esi; 0 -> esi
jmp begin;
iter:;
inc esi; i++
begin:;
cmp esi,100000000 ; < 100000000?
jl iter; no
}
return;
}
#pragma managed
int main()
{

Stopwatch^ sw = gcnew Stopwatch;
sw->Reset();
sw->Start();
ForLoopTest();
sw->Stop();

Int64 ms = sw->Elapsed.Milliseconds;
Console::WriteLine("{0} msec.", ms);
}

compiled with:
cl /clr /Od bcca.cpp
output: for 100.000.000 loops!!
avg. 135 msec.

cl /clr /O2 bcca.cpp
output: for 100.000.000 loops!!
avg. 91 msec.
Notice that the C# optimized build gives the same result (per iteration) as
the optimized C++/CLI build with the loop in assembly.
The question remains why the debug build is that much slower. I guess this
is due to the CLR starting some actions when running debug builds; IMO
there is a GC/finalizer run after the call to Stopwatch.Start and before
running the loop. That would explain the different behavior (better
results) on an HT CPU: the finalizer runs on a second core or logical CPU,
so it doesn't disturb the user thread running on the other one, whereas on
a single-core CPU the finalizer pre-empts the user thread.
I'll try to get a HW analyzer from the lab to check this; it simply cannot
be checked with SW tools alone.

Willy.

Mar 13 '06 #35

P: n/a
Richard Grimes's article 'Is Managed Code Slower than Unmanaged Code' might
be of interest.
http://www.grimes.demon.co.uk/dotnet/man_unman.htm

It seems to indicate that there isn't much to choose between C# and
C++/CLI; C# can be faster in some circumstances.

Michael
"Don Kim" <in**@nospam.donkim.info> wrote in message
news:OY**************@TK2MSFTNGP09.phx.gbl...
Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
and it forked over into another rant about which was the faster compiler.
Some said C# was just as fast as C++/CLI, whereas others said C++/CLI was
more optimized.

Anyway, I wrote up some very simple test code, and at least on my computer
C++/CLI came out the fastest. Here's the sample code, and just for good
measure I wrote one in java, and it was the slowest! ;-) Also, I did no
optimizing compiler switches and compiled the C++/CLI with /clr:safe only
to compile to pure verifiable .net.

//C++/CLI code
using namespace System;

int main()
{
long start = Environment::TickCount;
for (int i = 0; i < 10000000; ++i) {}
long end = Environment::TickCount;
Console::WriteLine(end - start);
}
//C# code
using System;

public class ForLoopTest
{
public static void Main(string[] args)
{
long start = Environment.TickCount;
for (int i =0;i < 10000000; ++i) {}
long end = Environment.TickCount;
Console.WriteLine((end-start));
}
}

//Java code
public class Performance
{
public static void main(String args[])
{
long start = System.currentTimeMillis();
for (int i=0; i < 10000000; ++i) {}
long end = System.currentTimeMillis();
System.out.println(end-start);
}
}

Results:

C++/CLI -> 15-18 secs
C# -> 31-48 secs
Java -> 65-72 secs

I know, I know, these kind of test are not always foolproof, and results
can vary by computer to computer, but at least on my system, C++/CLI had
the fastest results.

Maybe C++/CLI is the most optimized compiler?

-Don Kim

Mar 13 '06 #36

P: n/a

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:eb**************@tk2msftngp13.phx.gbl...
| Ok, final update.
| The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
| on all platforms.
|

Followup.
!!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

One should not use Elapsed.Ticks to calculate the elapsed time in
nanoseconds.
The only correct way to get this high precision count is by using
Stopwatch.ElapsedTicks like this:

long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
....
long ticks = sw.ElapsedTicks;
Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);

or use Stopwatch.ElapsedMilliseconds.

Note that the Stopwatch code is not broken, the code I posted used
Stopwatch.Elapsed.Ticks which is wrong in this context.
Sorry for all the confusion.
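
To spell out why the two tick counts cannot be mixed, here is a minimal
sketch in standard C++. It is not the .NET API, and the function names are
made up for illustration. It assumes what is said in this thread: TimeSpan
ticks are fixed 100 ns units, while raw counter ticks only mean something
relative to the counter frequency.

```cpp
#include <cstdint>

// TimeSpan-style ticks (what Elapsed.Ticks returns): fixed 100 ns units,
// independent of the hardware counter.
std::int64_t timespan_ticks_to_ns(std::int64_t timespan_ticks) {
    return timespan_ticks * 100;
}

// Raw counter ticks (what ElapsedTicks returns): one tick lasts
// 1/frequency_hz seconds, so the conversion needs the frequency.
double counter_ticks_to_ns(std::int64_t counter_ticks,
                           std::int64_t frequency_hz) {
    return static_cast<double>(counter_ticks) * 1e9 /
           static_cast<double>(frequency_hz);
}
```

Feeding a TimeSpan tick count into the second formula, as the earlier
program did, scales a 100 ns count by a per-machine factor, which is why
the printed nanoseconds were off.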
Willy.

Mar 13 '06 #37

P: n/a

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@TK2MSFTNGP11.phx.gbl...
|
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:eb**************@tk2msftngp13.phx.gbl...
|| Ok, final update.
|| The Stopwatch.Ticks is broken, so the calculated nanoseconds are
|| incorrect on all platforms.
||
|
| Followup.
| !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!
|
| One should not use Elapsed.Ticks to calculate the elapsed time in
| nanoseconds.
| The only correct way to get this high precision count is by using
| Stopwatch.ElapsedTicks like this:
|
| long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
| ...
| long ticks = sw.ElapsedTicks;
| Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
|
| or use Stopwatch.ElapsedMilliseconds.
|
| Note that the Stopwatch code is not broken, the code I posted used
| Stopwatch.Elapsed.Ticks which is wrong in this context.
| Sorry for all the confusion.
|
|
| Willy.
|
|
|

Mystery solved, finally :-).

A C++/CLI debug build (/Od flag - the default) does not generate sequence
points in the IL; however, it does generate optimized IL.
A sequence point marks a spot in the IL code that corresponds to a specific
location in the original source. If you look at the IL generated by C# when
compiled with /o-, you'll notice nops inserted in the stream; the JIT uses
these nops to produce sequence points. The /o- flag also means the IL is
not optimized. To get the same behavior in C# as /Od in C++/CLI, you need
to set /debug+ /o+. This generates debug builds without the nops that
trigger sequence points, just like C++/CLI does.
The "empty loop" C# sample compiled with /debug+ /o+ runs just as fast as
the C++/CLI sample built with /Od. The IL produced is identical.

Willy.


Mar 13 '06 #38

P: n/a
"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@TK2MSFTNGP11.phx.gbl...
> Followup.
> !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

A ha! I obviously hadn't looked at the code closely enough to realize that
it was using Elapsed.Ticks and not ElapsedTicks.

> One should not use Elapsed.Ticks to calculate the elapsed time in
> nanoseconds.

True - one should use it to calculate the elapsed time in 0.1us units, since
that's what TimeSpan.Ticks is expressed as.

> The only correct way to get this high precision count is by using
> Stopwatch.ElapsedTicks like this:
>
> long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;


but make this a double. Stopwatch.Frequency is more than 1E9 on modern
machines using the MP HAL.

double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;
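
The truncation is easy to see with plain integers. A minimal sketch in
standard C++ (the function names are made up, and the 2.8 GHz frequency is
only an assumed example of a counter faster than 1E9 Hz):

```cpp
#include <cstdint>

// Integer version: the division truncates toward zero.
std::int64_t ns_per_tick_integer(std::int64_t frequency_hz) {
    return 1000000000LL / frequency_hz;
}

// Floating-point version: keeps the fractional nanoseconds.
double ns_per_tick_double(std::int64_t frequency_hz) {
    return 1e9 / static_cast<double>(frequency_hz);
}
```

With a 3,579,545 Hz counter the integer version gives 279 where the double
gives about 279.365, a small error. With an assumed 2,800,000,000 Hz
counter the integer version gives 0, so every elapsed time would print as
zero, while the double gives about 0.357.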

-cd

Mar 13 '06 #39

P: n/a
"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
> Mystery solved, finally :-).
>
> A C++/CLI debug build ( /Od flag - the default), does not generate
> sequence points in IL, however it generates optimized IL.
> A sequence point is used to mark a spot in the IL code that corresponds
> to a specific location in the original source. If you look at the IL
> generated by C# when compiled with /o-, you'll notice the nop's inserted
> in the stream, these nop's are used by the JIT to produce sequence
> points, but the /o- flags doesn't produce optimized IL. To have the same
> behavior in C# as /Od in C++/CLI, you need to set /debug+ /o+. This
> generates debug builds without nop's to trigger the sequence point, just
> like C++/CLI does.
> The "empty loop" C# sample compiled with /debug+ /o+, runs just as fast
> as the C++/CLI sample built with /Od. The IL produced is identical.


Good sleuthing! In the end, they really ought to be about the same -
having the C++ code execute 2x faster just didn't make sense.

-cd
Mar 13 '06 #40

P: n/a
This is very useful info. It was causing confusion given mixed
information coming from MSFT itself.

--------
Ajay Kalra
aj*******@yahoo.com

Mar 13 '06 #41

P: n/a

"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:%2****************@TK2MSFTNGP10.phx.gbl...
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:%2****************@TK2MSFTNGP11.phx.gbl...
| > Followup.
| > !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!
|
| A ha! I obviously hadn't looked at the code closely enough to realize
| that it was using Elapsed.Ticks and not ElapsedTicks.
|
| > One should not use Elapsed.Ticks to calculate the elapsed time in
| > nanoseconds.
|
| True - one should use it to calculate the elapsed time in 0.1us units,
| since that's what TimeSpan.Ticks is expressed as.
|
| > The only correct way to get this high precision count is by using
| > Stopwatch.ElapsedTicks like this:
| >
| > long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
|
| but make this a double. Stopwatch.Frequency is more than 1E9 on modern
| machines using the MP HAL.
|
| double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;
|
Sure, or use picoseconds :-)

long picosecPerTick = 1000L * 1000L * 1000L * 1000L / Stopwatch.Frequency;

90614831400 picoseconds
Looks really crazy, doesn't it?
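
For what it's worth, that figure can be reproduced with integer arithmetic.
A minimal sketch in standard C++ (made-up function names). The 3,579,545 Hz
frequency is the value discussed earlier in the thread, and the tick count
of 324360 is back-computed from the printed number, so both are assumptions
about the machine that produced it:

```cpp
#include <cstdint>

// Whole picoseconds per counter tick (integer division, truncating).
std::int64_t picos_per_tick(std::int64_t frequency_hz) {
    return 1000000000000LL / frequency_hz;
}

// Elapsed picoseconds for a raw counter tick count.
std::int64_t ticks_to_picos(std::int64_t ticks, std::int64_t frequency_hz) {
    return ticks * picos_per_tick(frequency_hz);
}
```

Under those assumptions picos_per_tick(3579545) is 279365, and
ticks_to_picos(324360, 3579545) is 90614831400, i.e. roughly 90.6 msec,
which lines up with the ~91 msec average reported for the optimized build.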
Willy.


Mar 13 '06 #42

P: n/a

"Ajay Kalra" <aj*******@yahoo.com> wrote in message
news:11**********************@i40g2000cwc.googlegroups.com...
| This is very useful info. It was causing confusion given mixed
| information coming from MSFT itself.
|
| --------
| Ajay Kalra
| aj*******@yahoo.com
|

Well, the C++/CLI team did not want to generate explicit sequence points in
the IL, so the JIT compiler can only rely on the implicit sequence points
(that is, points where the evaluation stack is empty). That also means that
it's not possible to synchronize the IL with the actual code while
debugging C++/CLI in managed mode, and you need the PDB to set breakpoints
in your code. Not a big deal IMO.
Willy.
Mar 13 '06 #43

P: n/a
"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:Om**************@TK2MSFTNGP09.phx.gbl...
> Sure, or use picoseconds :-)


Nah - that's short sighted. Let's standardize on Attoseconds :)

long attosecPerTick = 1000L * 1000L * 1000L * 1000L * 1000L * 1000L
/Stopwatch.Frequency;

now that's just getting silly... for the next few decades at least.

-cd
Mar 14 '06 #44

P: n/a

"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:OJ**************@tk2msftngp13.phx.gbl...
| "Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
| news:Om**************@TK2MSFTNGP09.phx.gbl...
| > Sure, or use picoseconds :-)
|
| Nah - that's short sighted. Let's standardize on Attoseconds :)
|
| long attosecPerTick = 1000L * 1000L * 1000L * 1000L * 1000L * 1000L
| /Stopwatch.Frequency;
|
| now that's just getting silly... for the next few decades at least.
|
| -cd
|
|

LOL, I'll keep it in mind for a next life maybe :-)

Willy.
Mar 15 '06 #45
