C++/CLI the fastest compiler? Yes, at least for me. :-)

Ok, so I posted a rant earlier about the lack of marketing for C++/CLI,
and it forked over into another rant about which was the faster
compiler. Some said C# was just as fast as C++/CLI, whereas others said
C++/CLI was more optimized.

Anyway, I wrote up some very simple test code, and at least on my
computer C++/CLI came out the fastest. Here's the sample code, and just
for good measure I wrote one in Java, and it was the slowest! ;-) Also,
I used no optimizing compiler switches and compiled the C++/CLI with
/clr:safe only, to compile to pure verifiable .NET.

//C++/CLI code
using namespace System;

int main()
{
    long start = Environment::TickCount;
    for (int i = 0; i < 10000000; ++i) {}
    long end = Environment::TickCount;
    Console::WriteLine(end - start);
}

//C# code
using System;

public class ForLoopTest
{
    public static void Main(string[] args)
    {
        long start = Environment.TickCount;
        for (int i = 0; i < 10000000; ++i) {}
        long end = Environment.TickCount;
        Console.WriteLine(end - start);
    }
}

//Java code
public class Performance
{
    public static void main(String args[])
    {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; ++i) {}
        long end = System.currentTimeMillis();
        System.out.println(end - start);
    }
}

Results:

C++/CLI -> 15-18 msec
C# -> 31-48 msec
Java -> 65-72 msec

I know, I know, these kinds of tests are not always foolproof, and results
can vary from computer to computer, but at least on my system, C++/CLI had
the fastest results.

Maybe C++/CLI is the most optimized compiler?

-Don Kim
Mar 12 '06

"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam>
wrote in message news:OP**************@TK2MSFTNGP11.phx.gbl...
| Willy Denoyette [MVP] wrote:
| > "Optimization guide for AMD64 Processors"), can you believe MSFT went
| > that far with the JIT (in debug builds)?
|
| Well, yeah. Maybe. I'm under the (possibly misguided) impression that
| debug primarily stops the JIT from inlining and hoisting - things that
| change the relative order of the native code compared to the IL code.
| Within those guidelines, I guess it still picks the best codegen it can
| based on the machine.
|
| My belief is that there are multiple full-time Intel and AMD employees at
| MSFT that do nothing but work on the compiler back-ends, including the CLR
| JIT.
|

Well, I would expect this for the C++ compiler back-end, but not directly
for the JIT compiler which is more time constrained, but I guess I'm wrong.

Willy.
Mar 13 '06 #31

"Tim Roberts" <ti**@probo.com> wrote in message
news:1r********************************@4ax.com...
| "Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam>
| wrote:
| >
| >I'm 99.99% sure that my old P-II machine produced a QPC frequency of 1/2
| >color burst, or 1.7897727 MHz.
|
| Nope, it was actually 1/3 of the color burst, 1.193182 MHz. The original
| PC had a 14.31818 MHz crystal (4x the color burst), and they divided it by
| 12 for the counter.
| --
| - Tim Roberts, ti**@probo.com
| Providenza & Boekelheide, Inc.

Yep, an old 200 MHz (199.261) P6 "Model 1, Stepping 7" of mine gives a QPC
frequency of 1.193182 MHz, which is the CPU clock divided by 167.
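
As a quick check of the arithmetic in this exchange, here is a minimal sketch in Java. The class and method names are made up for illustration; only the 14.31818 MHz crystal, the divide-by-12 and the 199.261 MHz / 167 figures come from the posts above.

```java
// Hypothetical helper; the names are illustrative only.
final class ColorBurstClock {
    // Crystal frequency of the original PC in MHz, as quoted in the post above.
    static final double CRYSTAL_MHZ = 14.31818;

    // The crystal is 4x the color burst, so the color burst is crystal / 4.
    static double colorBurstMHz() {
        return CRYSTAL_MHZ / 4.0;
    }

    // The counter input is the crystal divided by 12, i.e. 1/3 of the color burst.
    static double timerMHz() {
        return CRYSTAL_MHZ / 12.0;
    }

    public static void main(String[] args) {
        System.out.printf("color burst:       %.6f MHz%n", colorBurstMHz());
        System.out.printf("counter (QPC):     %.6f MHz%n", timerMHz());
        System.out.printf("199.261 MHz / 167: %.6f MHz%n", 199.261 / 167.0);
    }
}
```

The last line lands close to, but not exactly on, 1.193182 MHz, since 199.261 is itself a rounded figure.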

Willy.
Mar 13 '06 #32

"Don Kim" <in**@nospam.donkim.info> wrote in message
news:ez**************@TK2MSFTNGP09.phx.gbl...
| Carl Daniel [VC++ MVP] wrote:
| > So, any theory why the C++ code consistently runs faster than the C# code
| > on both of my machines? I can't think of any reasonable argument why
| > having a dual core or HT CPU would make the C++ code run faster. Clearly
| > the JIT'd code is different for the two loops - maybe there's some
| > pathological code in the C# case that the P4 executes much more slowly
| > than AMD, or some optimal code in the C++ case that the P4 executes much
| > more quickly than AMD. I'd be curious to hear the details of Don's
| > machine - Intel/AMD, Single/HT/Dual, etc.
|
| Wow, this is becoming interesting. We're getting down to discussions of
| CPU architecture and instruction sets. Talk about getting down to the
| metal!
|

That's true: if you are running empty loops, you are not only comparing
compiler optimizations, you are also measuring architectural differences at
the CPU, L1/L2 cache and memory controller level. That's also why such
micro-benchmarks have little or no value.

| Anyway, I just reran my test code with larger loop factors, as well as
| the other code with my original and larger loop factors, and C++/CLI
| still came out around 2X faster.
|
| I ran these both on my laptop and desktop. Here's the configuration:
|
| Laptop: Pentium Centrino 1.86 GHz, 1 GB Ram, Windows XP Pro, SP 2
| Desktop: Pentium 4, 2.8 GHz, 1 GB RAM, Windows XP Pro, SP2
|

Just curious what the QPC frequency is on the Centrino.

| I know someone who has an AMD computer, and I'm going to run my programs
| on that computer to see if there's something in the CPU that's causing
| the discrepancies.

Well, I noticed that for debug builds C++/CLI produces smaller IL, and that
the JIT produces different X86 code for C# and C++/CLI. Here are the for
loops...

X86 for C# (debug)
...
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030
....

X86 for C++/CLI (debug)

0000001f 46 inc esi
00000020 81 FE 80 96 98 00 cmp esi,989680h
00000026 7D 03 jge 0000002B
00000028 90 nop
00000029 EB F4 jmp 0000001F

An optimized C# build produces an even shorter code path:
...
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl 0000001C
...

Now, while one would expect the shorter code paths to run faster, they do
not; all three take the same time to finish.

The reason for this (AFAIK) is that superscalar CPUs like the AMD prefer
longer code paths (longer than a cache line) in order to feed the instruction
pipeline with longer bursts. I don't know how this behaves on the Intel
Centrino and P4 HT, but it looks like they behave differently. (I'll try this
with an assembly code program.)

Anyway, I don't care that much about this; empty loops are not that common, I
guess (and C++ will hoist them anyway). Once you do something reasonable
inside the loop, the loop overhead is reduced to dust and the pipeline gets
filled in a more optimal way.

Willy.


Mar 13 '06 #33

Some more fun.

Consider this program:

//C++/CLI code
// File : EmptyLoop.cpp
#using <System.dll>
using namespace System;
using namespace System::Diagnostics;
#pragma unmanaged
void ForLoopTest( void )
{
    __asm {
        xor esi,esi; 0 -> esi
        jmp begin;
    iter:;
        inc esi; i++
    begin:;
        cmp esi,989680h ; i < 10000000?
        jl iter; no
    }
    return;
}
#pragma managed
int main()
{
    Int64 nanosecPerTick = (1000L * 1000L * 1000L) /
        System::Diagnostics::Stopwatch::Frequency;
    Stopwatch^ sw = gcnew Stopwatch;
    sw->Start();
    ForLoopTest();
    sw->Stop();
    Int64 ticks = sw->Elapsed.Ticks;
    Console::WriteLine("{0} nanoseconds", ticks * nanosecPerTick);
}

Compiled with:
cl /clr /O2 EmptyLoop.cpp
output:
24935346 nanoseconds

cl /clr /Od EmptyLoop.cpp
output:
37636821 nanoseconds

Note that the loop is in assembly, pure unmanaged X86 code. The code produced
by the C++ compiler [1] is the same except for the function prolog and epilog,
although the timings are different. Any takers?

[1]
/Od build

void ForLoopTest( void )
{
00401000 55 push ebp
00401001 8B EC mov ebp,esp
00401003 56 push esi
__asm {
xor esi,esi; 0 -> esi
00401004 33 F6 xor esi,esi
jmp begin;
00401006 EB 01 jmp begin (401009h)
iter:;
inc esi; i++
00401008 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401009 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100F 7C F7 jl iter (401008h)
}
return;
}
00401011 5E pop esi
00401012 5D pop ebp
00401013 C3 ret

/O2 build

void ForLoopTest( void )
{
00401000 56 push esi
xor esi,esi; 0 -> esi
00401001 33 F6 xor esi,esi
jmp begin;
00401003 EB 01 jmp begin (401006h)
iter:;
inc esi; i++
00401005 46 inc esi
begin:;
cmp esi,989680h ; < 10000000?
00401006 81 FE 80 96 98 00 cmp esi,989680h
jl iter; no
0040100C 7C F7 jl iter (401005h)
__asm {
0040100E 5E pop esi
}
return;
}
0040100F C3 ret

Willy.

Mar 13 '06 #34
Ok, final update.
The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
on all platforms.

Using Stopwatch.Elapsed.Milliseconds gives the following results.

Values are averages for 10 runs.

C# ~12.8 msec. for 10.000.000 loops
C++/CLI ~9.1 msec.

Release build:

C# ~9.1 msec.
C++/CLI - loop hoisted by C++/CLI compiler (no IL body)

The X86 code for the loop in the C++/CLI /Od build and the C# optimized build
is nearly the same (different registers allocated, and inc instead of add).
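
Averaging over several runs, as done for the figures above, could be sketched like this in Java. This is only an illustration under assumptions: LoopTiming, averageMillis and sink are hypothetical names, System.nanoTime stands in for Stopwatch, and a JIT may still remove an empty loop entirely, so the numbers say little about real code.

```java
// Hypothetical timing helper; not taken from any post in this thread.
final class LoopTiming {
    // Storing the loop counter here keeps the result observable.
    static volatile int sink;

    // Runs the task 'runs' times and returns the mean elapsed time in milliseconds.
    static double averageMillis(Runnable task, int runs) {
        long totalNanos = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        Runnable loop = () -> {
            int i = 0;
            for (; i < 10_000_000; ++i) { }
            sink = i;
        };
        averageMillis(loop, 5); // warm-up runs, result discarded
        System.out.printf("%.3f msec (mean of 10 runs)%n", averageMillis(loop, 10));
    }
}
```

The warm-up call is there so that the measured runs are less likely to include JIT compilation time.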

Now this:

#using <System.dll>
using namespace System;
using namespace System::Diagnostics;
#pragma unmanaged
void ForLoopTest( void )
{
    __asm {
        xor esi,esi; 0 -> esi
        jmp begin;
    iter:;
        inc esi; i++
    begin:;
        cmp esi,100000000 ; < 100000000?
        jl iter; no
    }
    return;
}
#pragma managed
int main()
{
    Stopwatch^ sw = gcnew Stopwatch;
    sw->Reset();
    sw->Start();
    ForLoopTest();
    sw->Stop();

    Int64 ms = sw->Elapsed.Milliseconds;
    Console::WriteLine("{0} msec.", ms);
}

compiled with:
cl /clr /Od bcca.cpp
output: for 100.000.000 loops!!
avg. 135 msec.

cl /clr /O2 bcca.cpp
output: for 100.000.000 loops!!
avg. 91 msec.

Notice that the C# optimized build gives the same result as the optimized
C++/CLI build with the loop in assembly.

The question remains why the debug build is that much slower. My guess is that
this is due to the CLR starting some actions when running debug builds; IMO
there is a GC/finalizer run after the call to Stopwatch.Start and before the
loop runs. That would explain the different behavior (better results) on an HT
CPU: there the finalizer runs on a second core or logical CPU and doesn't
disturb the user thread, whereas on a single-core CPU the finalizer pre-empts
the user thread.

I'll try to get a HW analyzer from the lab to check this; it is simply not
possible to check with SW tools alone.

Willy.

Mar 13 '06 #35
Richard Grimes's article 'Is Managed Code Slower than Unmanaged Code' might
be of interest.
http://www.grimes.demon.co.uk/dotnet/man_unman.htm

It seems to indicate that there isn't much to choose between C# and C++/CLI;
C# can be faster in some circumstances.

Michael

Mar 13 '06 #36

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:eb**************@tk2msftngp13.phx.gbl...
| Ok, final update.
| The Stopwatch.Ticks is broken, so the calculated nanoseconds are incorrect
| on all platforms.
|

Followup.
!!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

One should not use Elapsed.Ticks to calculate the elapsed time in
nanoseconds.
The only correct way to get this high precision count is by using
Stopwatch.ElapsedTicks like this:

long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;
....
long ticks = sw.ElapsedTicks;
Console.WriteLine("{0} nanoseconds", ticks * nanosecPerTick);

or use Stopwatch.ElapsedMilliseconds.

Note that the Stopwatch code is not broken; the code I posted used
Stopwatch.Elapsed.Ticks, which is wrong in this context.
Sorry for all the confusion.
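
To make the difference concrete, here is a small sketch in Java rather than C#. It assumes a counter frequency of 1,193,182 Hz (the QPC value mentioned earlier in the thread; other machines will differ), and the names TickUnits, rawTicksToNanos and timeSpanTicksToNanos are hypothetical. ElapsedTicks counts raw counter ticks, while Elapsed.Ticks counts TimeSpan ticks of 100 ns each, so the two need different conversions.

```java
// Hypothetical helper for illustration; not part of any .NET or Java library.
final class TickUnits {
    // A TimeSpan tick (what Elapsed.Ticks returns) is always 100 ns.
    static final long NANOS_PER_TIMESPAN_TICK = 100L;

    // Raw counter ticks (what ElapsedTicks returns) depend on the counter frequency.
    // Multiplying first avoids truncating the per-tick factor, but it can
    // overflow for tick counts above roughly 9.2e9.
    static long rawTicksToNanos(long rawTicks, long frequencyHz) {
        return rawTicks * 1_000_000_000L / frequencyHz;
    }

    static long timeSpanTicksToNanos(long timeSpanTicks) {
        return timeSpanTicks * NANOS_PER_TIMESPAN_TICK;
    }

    public static void main(String[] args) {
        long frequencyHz = 1_193_182L;  // assumed counter frequency
        long timeSpanTicks = 100_000L;  // 10 ms expressed in 100 ns units
        // The mix-up: TimeSpan ticks multiplied by nanoseconds per raw tick.
        long mixed = timeSpanTicks * (1_000_000_000L / frequencyHz);
        long correct = timeSpanTicksToNanos(timeSpanTicks);
        System.out.println("mixed units:   " + mixed + " ns");
        System.out.println("correct units: " + correct + " ns");
    }
}
```

With these assumed values the mixed-unit figure comes out about 8.4 times too large.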
Willy.

Mar 13 '06 #37

Mystery solved, finally :-).

A C++/CLI debug build (/Od flag - the default) does not generate sequence
points in IL; however, it does generate optimized IL.
A sequence point is used to mark a spot in the IL code that corresponds to a
specific location in the original source. If you look at the IL generated
by C# when compiled with /o-, you'll notice the nops inserted in the
stream. These nops are used by the JIT to produce sequence points, but the
/o- flag doesn't produce optimized IL. To get the same behavior in C# as
/Od in C++/CLI, you need to set /debug+ /o+. This generates debug builds
without the nops that trigger the sequence points, just like C++/CLI does.
The "empty loop" C# sample compiled with /debug+ /o+ runs just as fast as
the C++/CLI sample built with /Od. The IL produced is identical.

Willy.


Mar 13 '06 #38

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@TK2MSFTNGP11.phx.gbl...
| Followup.
| !!! Stopwatch.Elapsed.Ticks != Stopwatch.ElapsedTicks !!!

Aha! I obviously hadn't looked at the code closely enough to realize that
it was using Elapsed.Ticks and not ElapsedTicks.

| One should not use Elapsed.Ticks to calculate the elapsed time in
| nanoseconds.

True - one should use it to calculate the elapsed time in 0.1us units, since
that's what TimeSpan.Ticks is expressed as.

| The only correct way to get this high precision count is by using
| Stopwatch.ElapsedTicks like this:
|
| long nanosecPerTick = (1000L*1000L*1000L) / Stopwatch.Frequency;

but make this a double. Stopwatch.Frequency is more than 1E9 on modern
machines using the MP HAL.

double nanosecPerTick = 1000.0 * 1000L * 1000L / Stopwatch.Frequency;
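
The truncation is easy to see in a few lines. A sketch in Java with made-up frequencies (2.8 GHz is just an assumed TSC-style value, and NanosPerTick and its methods are hypothetical names): integer division gives 0 ns per tick once the frequency exceeds 1E9, while the double keeps the fraction.

```java
// Hypothetical illustration of integer vs. floating-point nanoseconds per tick.
final class NanosPerTick {
    // Integer division truncates toward zero, as in the long version above.
    static long integerNanosPerTick(long frequencyHz) {
        return 1_000_000_000L / frequencyHz;
    }

    // Floating-point division keeps the fractional nanoseconds.
    static double exactNanosPerTick(long frequencyHz) {
        return 1_000_000_000.0 / frequencyHz;
    }

    public static void main(String[] args) {
        // Assumed example frequencies in Hz; real values depend on the machine.
        long[] frequencies = { 1_193_182L, 3_579_545L, 2_800_000_000L };
        for (long f : frequencies) {
            System.out.println(f + " Hz: long = " + integerNanosPerTick(f)
                    + ", double = " + exactNanosPerTick(f));
        }
    }
}
```

Even below 1E9 the long version loses the fraction, so the error grows with the number of ticks counted.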

-cd

Mar 13 '06 #39

"Willy Denoyette [MVP]" <wi*************@telenet.be> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
| Mystery solved, finally :-).
|
| A C++/CLI debug build (/Od flag - the default) does not generate sequence
| points in IL; however, it does generate optimized IL.
| A sequence point is used to mark a spot in the IL code that corresponds to
| a specific location in the original source. If you look at the IL generated
| by C# when compiled with /o-, you'll notice the nops inserted in the
| stream. These nops are used by the JIT to produce sequence points, but the
| /o- flag doesn't produce optimized IL. To get the same behavior in C# as
| /Od in C++/CLI, you need to set /debug+ /o+. This generates debug builds
| without the nops that trigger the sequence points, just like C++/CLI does.
| The "empty loop" C# sample compiled with /debug+ /o+ runs just as fast as
| the C++/CLI sample built with /Od. The IL produced is identical.


Good sleuthing! In the end, they really ought to be about the same -
having the C++ code execute 2x faster just didn't make sense.

-cd
Mar 13 '06 #40
