Why does the 'without ref' test run faster than the 'using ref' test?
As I understand it, in the 'without ref' test the parameter is passed by
value, so a new storage location is created each time we enter the Test
method. In the 'using ref' test we pass the parameter by reference, so rather
than creating a new storage location for the variable in the function member
declaration, the same storage location is used.
From my point of view, it must work faster.
There are two points of overhead that keep ref parameters from working faster:
1. The address of the value must be retrieved before passing it as a ref
parameter.
2. That pointer has to be dereferenced in the called method before the value
can be used.
Can you give some comments about this situation plz. Thanx.
The only real difference between passing a parameter by value and passing a
parameter by reference is that a pointer is used to pass by reference. So,
the overhead is getting the address of the value to pass and dereferencing
the pointer to get the value in the method that is called.
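To make that distinction concrete, here is a minimal sketch (the class and method names are made up purely for illustration). The by-value method works on its own copy, while the ref method works through a pointer to the caller's variable:

```csharp
using System;

public static class RefSemanticsDemo
{
    // Receives a copy of the caller's value; the caller's variable is untouched.
    public static void IncrementByValue(int k)
    {
        k++;
    }

    // Receives the address of the caller's variable; k++ reads and writes
    // through that address, so the caller sees the change.
    public static void IncrementByRef(ref int k)
    {
        k++;
    }

    public static void Main()
    {
        int a = 1;
        IncrementByValue(a);
        Console.WriteLine(a); // prints 1

        int b = 1;
        IncrementByRef(ref b);
        Console.WriteLine(b); // prints 2
    }
}
```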
The tests that you ran were not optimal for getting accurate timings. Here
is the code that I used.
First, this is my HighResolutionTimer class:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Security;

namespace RefTest
{
    public class HighResolutionTimer
    {
        // private fields...
        private long m_Frequency;
        private long m_StartCounter;
        private long m_StopCounter;

        // constructors...
        public HighResolutionTimer() : this(false) { }

        public HighResolutionTimer(bool start)
        {
            if (!QueryPerformanceFrequency(out m_Frequency))
            {
                Debug.WriteLine("HighResolutionTimer.ctor(): Error occurred while calling QueryPerformanceFrequency.");
                return;
            }
            if (start)
                Start();
        }

        // win32 api methods...
        [SuppressUnmanagedCodeSecurity]
        [DllImport("kernel32.dll")]
        [return: MarshalAs(UnmanagedType.Bool)]
        private static extern bool QueryPerformanceCounter(
            [Out] out long lpPerformanceCount);

        [SuppressUnmanagedCodeSecurity]
        [DllImport("kernel32.dll")]
        [return: MarshalAs(UnmanagedType.Bool)]
        private static extern bool QueryPerformanceFrequency(
            [Out] out long lpFrequency);

        // private methods...
        private double CalcDuration()
        {
            // counter ticks divided by ticks-per-second gives seconds
            return ((double)(m_StopCounter - m_StartCounter)) / (double)m_Frequency;
        }

        // public methods...
        public void Reset()
        {
            m_StartCounter = 0;
            m_StopCounter = 0;
        }

        public void Start()
        {
            Reset();
            if (!QueryPerformanceCounter(out m_StartCounter))
                Debug.WriteLine("HighResolutionTimer.Start(): Error occurred while calling QueryPerformanceCounter.");
        }

        public double Stop()
        {
            if (!QueryPerformanceCounter(out m_StopCounter))
            {
                Debug.WriteLine("HighResolutionTimer.Stop(): Error occurred while calling QueryPerformanceCounter.");
                return Double.NaN;
            }
            return Duration;
        }

        // public overridden methods...
        public override string ToString()
        {
            return CalcDuration().ToString("0.######") + " seconds";
        }

        // public properties...
        public double Duration
        {
            get
            {
                return CalcDuration();
            }
        }
    }
}
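As an aside, if you'd rather not P/Invoke, the Stopwatch class in System.Diagnostics (new in .NET 2.0) can stand in for a timer like this; it uses the high-resolution performance counter when Stopwatch.IsHighResolution is true. A minimal sketch, assuming you only need elapsed seconds (StopwatchTiming and TimeSeconds are names invented for this example, not part of my test project):

```csharp
using System;
using System.Diagnostics;

public static class StopwatchTiming
{
    public delegate void TimingCode();

    // Runs the code once and returns the elapsed time in seconds.
    public static double TimeSeconds(TimingCode code)
    {
        if (code == null)
            throw new ArgumentNullException("code");

        Stopwatch watch = Stopwatch.StartNew();
        code();
        watch.Stop();

        // ElapsedTicks is in timer ticks; Frequency is ticks per second.
        return (double)watch.ElapsedTicks / Stopwatch.Frequency;
    }

    public static void Main()
    {
        Console.WriteLine("High resolution: {0}", Stopwatch.IsHighResolution);
        double seconds = TimeSeconds(delegate { System.Threading.Thread.Sleep(50); });
        Console.WriteLine("{0:0.######} seconds", seconds);
    }
}
```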
Second, here is a helper CodeTimer class that I use for timing code:
using System;

namespace RefTest
{
    public static class CodeTimer
    {
        private static double Average(double[] values)
        {
            if (values == null)
                throw new ArgumentNullException("values");
            int valueCount = values.Length;
            if (valueCount == 0)
                return 0.0d;
            double sum = 0.0d;
            for (int i = 0; i < valueCount; i++)
                sum += values[i];
            return sum / valueCount;
        }

        public delegate void TimingCode();

        public static double Execute(TimingCode code)
        {
            if (code == null)
                throw new ArgumentNullException("code");
            const int NUM_SAMPLES = 100;
            double[] timings = new double[NUM_SAMPLES];
            HighResolutionTimer timer = new HighResolutionTimer();
            for (int i = 0; i < NUM_SAMPLES; i++)
            {
                timer.Reset();
                // force a full collection so the GC is less likely to
                // run in the middle of the timed code
                GC.Collect();
                GC.WaitForPendingFinalizers();
                GC.Collect();
                timer.Start();
                code();
                timer.Stop();
                timings[i] = timer.Duration;
            }
            return Average(timings);
        }
    }
}
And finally, here's the Program class for my test console application:
using System;
using System.Runtime.CompilerServices;

namespace RefTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("TestWithoutRef: {0:###,###,##0.000000}",
                CodeTimer.Execute(TestWithoutRefLoop));
            Console.WriteLine("TestWithRef: {0:###,###,##0.000000}",
                CodeTimer.Execute(TestWithRefLoop));
            Console.ReadLine();
        }

        static void TestWithRefLoop()
        {
            int result;
            for (int i = 0; i < 50000000; i++)
                result = TestWithRef(ref i);
        }

        static void TestWithoutRefLoop()
        {
            int result;
            for (int i = 0; i < 50000000; i++)
                result = TestWithoutRef(i);
        }

        // NoInlining keeps the JIT from inlining the call away,
        // so the parameter passing is actually measured.
        [MethodImpl(MethodImplOptions.NoInlining)]
        static int TestWithRef(ref int k)
        {
            return k;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static int TestWithoutRef(int k)
        {
            return k;
        }
    }
}
In VS 2005, create a new console application and add those files to get a
more reliable test. Here are the timings that I get:
TestWithoutRef: 0.192421 seconds
TestWithRef: 0.194921 seconds
So, according to my results, passing a parameter by reference 50,000,000
times adds approximately 2.5 milliseconds in total. Yee-ha! This is not
something to worry about. :-)
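To put that in per-call terms, here is a quick back-of-the-envelope sketch. OverheadMath is just an illustrative name; the inputs are the two timings above. It works out to roughly 0.05 nanoseconds of extra cost per call:

```csharp
using System;

public static class OverheadMath
{
    // Extra cost per call in nanoseconds, given two total timings in seconds.
    public static double PerCallNanoseconds(
        double withRefSeconds, double withoutRefSeconds, int calls)
    {
        return (withRefSeconds - withoutRefSeconds) / calls * 1e9;
    }

    public static void Main()
    {
        // (0.194921 - 0.192421) s = 0.0025 s spread over 50,000,000 calls
        double ns = PerCallNanoseconds(0.194921, 0.192421, 50000000);
        Console.WriteLine("{0:0.000} ns per call", ns);
    }
}
```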
----
If you're interested in seeing what is going on under the covers, let's take
a look at the IL that is generated:
static void TestWithoutRefLoop()
{
    int result;
    for (int i = 0; i < 50000000; i++)
        result = TestWithoutRef(i);
}
.method private hidebysig static void TestWithoutRefLoop() cil managed
{
.maxstack 2
.locals init (
[0] int32 i)
L_0000: ldc.i4.0
L_0001: stloc.0
L_0002: br.s L_000f
L_0004: ldloc.0
L_0005: call int32 RefTest.Program::TestWithoutRef(int32)
L_000a: pop
L_000b: ldloc.0
L_000c: ldc.i4.1
L_000d: add
L_000e: stloc.0
L_000f: ldloc.0
L_0010: ldc.i4 50000000
L_0015: blt.s L_0004
L_0017: ret
}
static void TestWithRefLoop()
{
    int result;
    for (int i = 0; i < 50000000; i++)
        result = TestWithRef(ref i);
}
.method private hidebysig static void TestWithRefLoop() cil managed
{
.maxstack 2
.locals init (
[0] int32 i)
L_0000: ldc.i4.0
L_0001: stloc.0
L_0002: br.s L_0010
L_0004: ldloca.s i
L_0006: call int32 RefTest.Program::TestWithRef(int32&)
L_000b: pop
L_000c: ldloc.0
L_000d: ldc.i4.1
L_000e: add
L_000f: stloc.0
L_0010: ldloc.0
L_0011: ldc.i4 50000000
L_0016: blt.s L_0004
L_0018: ret
}
These methods only differ by one byte in length and the reason is found at
L_0004. In TestWithoutRefLoop, the "ldloc.0" instruction is used. This simply
loads the local variable at index 0 ('i') onto the stack. Because we're passing
by value, that's all that's needed to make the call to TestWithoutRef(int32).
However, in TestWithRefLoop, the "ldloca.s i" instruction is used. This is
one byte larger because there is a byte for the instruction and a byte to
indicate the index of the local to use. And, instead of loading the specified
local variable onto the stack, it loads the *address* of said local variable
in order to set up the TestWithRef(int32&) method call. On my machine, when
I look at the optimized JITted code for these methods, I see the following
x86:
TestWithoutRefLoop:
00000000 push esi
00000001 xor esi,esi
00000003 mov ecx,esi
00000005 call dword ptr ds:[00913070h]
0000000b inc esi
0000000c cmp esi,2FAF080h
00000012 jl 00000003
00000014 pop esi
00000015 ret
TestWithRefLoop:
00000000 push eax
00000001 xor eax,eax
00000003 mov dword ptr [esp],eax
00000006 xor edx,edx
00000008 mov dword ptr [esp],edx
0000000b cmp dword ptr [esp],2FAF080h
00000012 jge 00000029
00000014 lea ecx,[esp]
00000017 call dword ptr ds:[0091306Ch]
0000001d inc dword ptr [esp]
00000020 cmp dword ptr [esp],2FAF080h
00000027 jl 00000014
00000029 pop ecx
0000002a ret
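Obviously, more work is necessary at the x86 level once the address of the
local is needed: 'i' can no longer live in a register (esi above) and has to
be kept on the stack so that "lea ecx,[esp]" can pass its address.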
Now, let's look at the methods that get called.
static int TestWithoutRef(int k)
{
    return k;
}
.method private hidebysig static int32 TestWithoutRef(int32 k) cil managed noinlining
{
.maxstack 8
L_0000: ldarg.0
L_0001: ret
}
static int TestWithRef(ref int k)
{
    return k;
}
.method private hidebysig static int32 TestWithRef(int32& k) cil managed noinlining
{
.maxstack 8
L_0000: ldarg.0
L_0001: ldind.i4
L_0002: ret
}
In this case, TestWithRef has one additional instruction: "ldind.i4". This
instruction takes the managed pointer on the top of the evaluation stack
and loads the int32 value indirectly from it (hence "ldind"). IOW, this is
the pointer dereference that needs to happen before the value can be used
(in this case, returned).
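That dereference is observable from C#, too. In the following sketch (all names are hypothetical), the caller's variable changes after the call has started but before the parameter is read. The by-value method returns the copy made at the call site, while the ref method loads through the pointer and sees the new value:

```csharp
using System;

public static class DereferenceDemo
{
    public delegate void Mutator();

    // k is a copy made at the call site; later changes to the caller's
    // variable are not visible here.
    public static int ReadByValue(int k, Mutator mutate)
    {
        mutate();
        return k;
    }

    // k is a managed pointer; "return k" loads through it, so it sees
    // whatever the caller's variable holds at that moment.
    public static int ReadByRef(ref int k, Mutator mutate)
    {
        mutate();
        return k;
    }

    public static void Main()
    {
        int x = 1;
        Console.WriteLine(ReadByValue(x, delegate { x = 42; })); // prints 1

        x = 1;
        Console.WriteLine(ReadByRef(ref x, delegate { x = 42; })); // prints 42
    }
}
```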
For completeness, here's the x86 of the optimized JITted code:
TestWithoutRef
00000000 mov eax,ecx
00000002 ret
TestWithRef
00000000 mov eax,dword ptr [ecx]
00000002 ret
Obviously, there is a lot less going on here than at the call site. The
only difference is the pointer dereference. So, most of the overhead that
we observed takes place at the call site. But, IMO, it is negligible. There's
nothing to get worked up about. Take a deep breath. If you need to be concerned
about performance at this low a level, you probably shouldn't be working
in a garbage-collected environment. :-)
Best Regards,
Dustin Campbell
Developer Express Inc.