C# (float) cast is costly for speed if not used appropriately

Folks,

We ran into a pretty significant performance penalty when casting floats.
We've identified a code workaround that we wanted to pass along, and we were
also wondering if others had experience with this and if there is a better
solution.

-jeff
.....
I'd like to share findings regarding C# (float) cast.

As we convert double to float, we found several slow down issues.
We realized C# (float) cast can be costly if not used appropriately.

------------------------------------------------------------
Slow cases
------------------------------------------------------------
(A)
private void someMath(float[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        output[i] = (float)Math.Log10(input[i]); // <--- inline (float) cast is slow!
    }
}

(B)
private void Copy(double[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        output[i] = (float)input[i]; // <--- inline (float) cast is slow!
    }
}

In these examples, the "inline" (float) casts are executed on the same line as
other operations, such as Math.Log10() or a simple fetch from the input array.

These are slow, even with a Release build.
(A): It takes 3 to 6% longer than the double[] case. ;-)
(B): It takes twice(!) as long as the double[] case. ;-)

In my understanding and articles on the Net, the slow down comes from
writing intermediate value
back to memory as follows. The extra trips are costly.
(A)  CPU/FPU   +-- fetch -- Math.Log10 --+      +-- (float) --+
               |                         |      |             |
               |                         V      |             V
     memory  input                 written back to heap    output
                                   Extra memory access!

(B)  CPU/FPU   +-- fetch --+      +-- (float) --+
               |           |      |             |
               |           V      |             V
     memory  input   written back to heap    output
                     Extra memory access!

------------------------------------------------------------
Fast cases
------------------------------------------------------------

To avoid the extra memory access, we can use a temporary variable to store
the intermediate data.
The temporary variable is allocated in a CPU register, which keeps the
speed up.

(C)
private void someMath(float[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        double tmp = Math.Log10(input[i]); // <-- store in a temporary variable in CPU register
        output[i] = (float)tmp;            // <-- then (float) cast. Fast!
    }
}

(D)
private void Copy(double[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        double tmp = input[i];  // <-- store in a temporary variable in CPU register
        output[i] = (float)tmp; // <-- then (float) cast. Fast!
    }
}

In these improved versions, the intermediate data are not written back to
the memory.
The improved versions are actually slightly faster than the double[] case.
(C): 1% faster than double[] case. :)
(D): 3% faster than double[] case. :)

(C)  CPU/FPU   +-- fetch -- Math.Log10 -- stays in CPU register -- (float) --+
               |                          Fast!                              |
               |                                                             V
     memory  input                                                        output

(D)  CPU/FPU   +-- fetch -- stays in CPU register -- (float) --+
               |            Fast!                              |
               |                                               V
     memory  input                                          output

OK, this is what we found from benchmarking and googling.
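
Before applying the temporary-variable rewrite by hand across a code base, it may be worth confirming that the two forms produce the same output. What follows is a minimal sketch, not code from our system: the names CastEquivalence, Log10Inline, Log10Temp and SameBits are hypothetical, and the input values are arbitrary.

```csharp
using System;

public static class CastEquivalence
{
    // Form (A): the cast sits on the same line as the Math.Log10 call.
    public static void Log10Inline(float[] input, float[] output)
    {
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = (float)Math.Log10(input[i]);
        }
    }

    // Form (C): the result goes into a local double first, then is cast.
    public static void Log10Temp(float[] input, float[] output)
    {
        for (int i = 0; i < input.Length; i++)
        {
            double tmp = Math.Log10(input[i]);
            output[i] = (float)tmp;
        }
    }

    // Returns true when both arrays have the same length and every
    // element has the same 32-bit pattern (so NaNs compare by bits too).
    public static bool SameBits(float[] a, float[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            int x = BitConverter.ToInt32(BitConverter.GetBytes(a[i]), 0);
            int y = BitConverter.ToInt32(BitConverter.GetBytes(b[i]), 0);
            if (x != y) return false;
        }
        return true;
    }

    public static void Main()
    {
        // Arbitrary sample values, chosen only for illustration.
        float[] input = { 0.001f, 1f, 2.5f, 10f, 12345.678f };
        float[] inline = new float[input.Length];
        float[] temp = new float[input.Length];

        Log10Inline(input, inline);
        Log10Temp(input, temp);

        Console.WriteLine("Bit-identical: {0}", SameBits(inline, temp));
    }
}
```

Since both forms perform the same double-to-float conversion, one would expect the check to pass. If it does, the rewrite is a pure performance change and can be applied (or reverted) without affecting results.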

The same thing can be said for ArraySegment<float> arrays as well.
This is because the issue relates to float variables in the array, not the
array itself.

You could say this is a .NET compiler optimization issue.
If you know of optimization flags or anything else that can fix this issue on
the compiler side, please let us know.
That would be a great help!
(By the way, a simple release build does not help.)

Otherwise, we will need to optimize our code by hand using the temporary
variable technique as in the example.
Well, we have many instances of this kind of "inline" cast in our code.
Dec 31 '07 #1
Arnie <je*****************@msn.com> wrote:
> We ran into a pretty significant performance penalty when casting floats.

To be honest, it doesn't really sound that significant to me. Read
on...

> We've identified a code workaround that we wanted to pass along, and we were
> also wondering if others had experience with this and if there is a better
> solution.

<snip>

> I'd like to share findings regarding C# (float) cast.
>
> As we convert double to float, we found several slow down issues.
> We realized C# (float) cast can be costly if not used appropriately.

<snip>

> In my understanding and articles on the Net, the slow down comes from
> writing intermediate value back to memory as follows. The extra trips
> are costly.

I see no reason to believe that there's an extra value written to the
*heap* (rather than the stack), and no reason why the JIT shouldn't use
a register for the intermediate value without an explicit local
variable.

<snip>

I have included a short but complete program below which uses an array
of a million elements and iterates each method a thousand times. Here
are the results on my laptop:

Log10Fast: 64489ms
Log10Slow: 70420ms
CopyFast: 3841ms
CopySlow: 4070ms

So your optimisation improves things by about 10% for the Log10 case
and about 5% for the Copy case.
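
A minimal sketch of a comparable benchmark follows. It is not the program referred to above, which is not reproduced here. The class and method names (CastBenchmark, CopySlow, CopyFast, Time) are hypothetical, the array contents are arbitrary, and only the Copy pair is shown; the array size and iteration count follow the figures quoted above.

```csharp
using System;
using System.Diagnostics;

public static class CastBenchmark
{
    // "Slow" form from the original post: cast on the same line as the read.
    public static void CopySlow(double[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            output[i] = (float)input[i];
        }
    }

    // "Fast" form: read into a local double first, then cast.
    public static void CopyFast(double[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            double tmp = input[i];
            output[i] = (float)tmp;
        }
    }

    // Runs `copy` once as a warm-up (so JIT compilation is not timed),
    // then times `iterations` further calls and returns elapsed milliseconds.
    public static long Time(Action<double[], float[]> copy,
                            double[] input, float[] output, int iterations)
    {
        copy(input, output);
        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            copy(input, output);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // One million elements, one thousand iterations per method.
        double[] input = new double[1000000];
        float[] output = new float[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            input[i] = i + 1.0; // arbitrary non-zero data
        }

        Console.WriteLine("CopyFast: {0}ms", Time(CopyFast, input, output, 1000));
        Console.WriteLine("CopySlow: {0}ms", Time(CopySlow, input, output, 1000));
    }
}
```

The Log10 pair can be timed the same way by passing float[] inputs through an equivalent Time overload. Timings only mean something from an optimised build run outside the debugger, and they will vary by machine and runtime.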

> Otherwise, we will need to optimize our code by hand using the temporary
> variable technique as in the example.
> Well, we have many instances of this kind of "inline" cast in our code.

And have you any reason to believe that's *actually* the bottleneck in
your code? Do you regularly convert a billion floats and care about
200ms of performance loss?

I don't understand why the results are as they are (it would be worth
looking at the JITted, optimised code to find out) - but even so, I
certainly wouldn't start micro-optimising all over the place. Find out
where the *actual* bottleneck in your code is, and consider reducing
readability/simplicity for the sake of performance just in the most
significant parts. Don't start doing it all over the place, which
sounds like the course of action you're considering at the moment.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Jan 1 '08 #2
Thanks for the feedback, Jon.

This is a mature system where they are "wringing" out the last bit of
performance.

It is a scientific test instrument (spectrum analyzer), so they are acquiring
and converting extremely large chunks of data (waveforms). Some runs can
acquire as much as 500MB of data at a time.

So they have "progressed" to the point where they are looking at the right
optimization spots in their code. Casting from double[] is indeed 2x as slow
without the optimization, and quite different than the 5-10% case you
demonstrated. Again, the only thing they changed was assigning a local
variable, hence their curiosity about what the C#/JIT compiler is doing.

I think our premise is that, given this single change, it would seem that
there would be no performance difference if the compiler were taking
advantage of every reasonable performance optimization.

Time to look at the IL and see what is going on.

-jeff
Jan 3 '08 #3
Arnie <je*****************@msn.com> wrote:
> This is a mature system where they are "wringing" out the last bit of
> performance.

Hmm... it still sounds dubious to me. I doubt you'll really see
performance benefits which are significant in the context of the whole
app. Mind you, it sounds like you're not seeing the same behaviour as
me to start with, so hey...

> It is a scientific test instrument (spectrum analyzer), so they are acquiring
> and converting extremely large chunks of data (waveforms). Some runs can
> acquire as much as 500MB of data at a time.

500MB isn't that much though, in the context of the tests I was doing -
it was using a billion points of data, which would be 8GB. The copy was
then only taking 4 seconds, and making the change only shaved off a
very small amount.
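
The scaling argument can be worked through as plain arithmetic. The sketch below assumes linear scaling with element count, that 500MB means 500,000,000 bytes, and that every value is an 8-byte double; none of this is confirmed for the instrument. The names ScaleEstimate and ScaleMs are hypothetical; the timings are the CopyFast and CopySlow figures posted earlier in the thread.

```csharp
using System;

public static class ScaleEstimate
{
    // Linearly scales a time measured over `measuredCount` elements
    // to an estimate for `targetCount` elements.
    public static double ScaleMs(double measuredMs, double measuredCount, double targetCount)
    {
        return measuredMs * targetCount / measuredCount;
    }

    public static void Main()
    {
        // Posted benchmark: 1,000,000 elements x 1,000 iterations.
        const double measuredCount = 1e9;
        const double copyFastMs = 3841;
        const double copySlowMs = 4070;

        // Assumption: 500 MB = 500e6 bytes, all of it 8-byte doubles.
        double targetCount = 500e6 / sizeof(double);

        double fast = ScaleMs(copyFastMs, measuredCount, targetCount);
        double slow = ScaleMs(copySlowMs, measuredCount, targetCount);

        Console.WriteLine("Elements:   {0}", targetCount);
        Console.WriteLine("CopyFast:   {0:F1} ms", fast);
        Console.WriteLine("CopySlow:   {0:F1} ms", slow);
        Console.WriteLine("Difference: {0:F1} ms", slow - fast);
    }
}
```

Under those assumptions, 500MB is 62.5 million doubles, and the posted timings scale to roughly 240 ms versus 254 ms per acquisition, a difference of about 14 ms. A genuine 2x slowdown would of course change that picture.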

> So they have "progressed" to the point where they are looking at the right
> optimization spots in their code. Casting from double[] is indeed 2x as slow
> without the optimization, and quite different than the 5-10% case you
> demonstrated.

So can you give a short but complete program which *does* demonstrate
the 2x difference?

Just as a thought, which CLR are you using? I'm on the 2.0, on x86. If
you're using 1.0, 1.1, or 2.0 on x64, that could account for some
differences.
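
When comparing timings between machines, it can help to print the runtime details next to the results. This is a minimal sketch using standard framework properties; the class name RuntimeInfo and the method Describe are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class RuntimeInfo
{
    // Summarises the facts most likely to explain differing timings:
    // CLR version, process bitness, and whether a debugger is attached.
    public static string Describe()
    {
        return string.Format(
            "CLR {0}, {1}-bit process, debugger attached: {2}",
            Environment.Version,   // version of the running CLR
            IntPtr.Size * 8,       // 32 on x86, 64 on x64
            Debugger.IsAttached);  // a debugger can change JIT behaviour
    }

    public static void Main()
    {
        Console.WriteLine(RuntimeInfo.Describe());
    }
}
```

The debugger flag matters because JIT optimisations are usually suppressed by default when a process is launched under the Visual Studio debugger, which can distort a benchmark.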

> Again, the only thing they changed was assigning a local
> variable, hence their curiosity about what the C#/JIT compiler is doing.
>
> I think our premise is that, given this single change, it would seem that
> there would be no performance difference if the compiler were taking
> advantage of every reasonable performance optimization.
>
> Time to look at the IL and see what is going on.

The IL doesn't show much. It's the optimised assembly you need to be
looking at, really. cordbg is your friend - but don't forget to tell it
to perform JIT optimisations. SOS may help too. I don't envy you...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Jan 3 '08 #4
