
double to float rounding error in 8th digit

I have a 32-bit Intel machine and a 64-bit AMD machine, and the same code
gives results that differ in the 8th digit. Unfortunately, because of the
algorithm we use, the errors percolate into higher digits.

C++ code is
------------------
b[2][12] += (float)(mode * val);

On 32-bit (Intel, VS 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12]                               7.5312500             float
mode*val                               -4.7720763683319092   double
(float)(mode*val)                      -4.7720766            float
(float)(b[2][12]+mode*val)             2.7591736             float
b[2][12]+(float)(mode*val)             2.7591733932495117    double
(float)(b[2][12]+(float)(mode*val))    2.7591734             float
After increment it becomes
b[2][12]                               2.7591736             float   <-------------- This is the different value

On 64-bit (AMD, VS 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12]                               7.5312500             float
mode*val                               -4.7720763683319092   double
(float)(mode*val)                      -4.7720766            float
(float)(b[2][12]+mode*val)             2.7591736             float
b[2][12]+(float)(mode*val)             2.7591733932495117    double
(float)(b[2][12]+(float)(mode*val))    2.7591734             float
After increment it becomes
b[2][12]                               2.7591734             float   <-------------- This is the different value

Feb 8 '07 #1
13 Replies


"Shirsoft" <sh******@gmail.com> wrote in message
news:11*********************@k78g2000cwa.googlegroups.com...
> I have a 32 bit intel and 64 bit AMD machine. There is a rounding
> error in the 8th digit. Unfortunately because of the algorithm we use,
> the errors percolate into higher digits.
> [ ... ]

This is a statement, not a question.

Yes, floating point numbers are not accurate to all decimal places. If you
need 100% accuracy you'll need to use something else.
Feb 8 '07 #2

I am sorry for the confusion. My question is why the 32-bit machine
rounds the result to x.xxxxx36 instead of x.xxxxx34, while the 64-bit
machine gets it right. Is there some way to fix it?

On Feb 8, 3:42 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
> [ ... ]
> This is a statement, not a question.
>
> Yes, floating point numbers are not accurate to all decimal places. If you
> need 100% accuracy you'll need to use something else.

Feb 8 '07 #3

On Feb 8, 3:56 am, "Shirsoft" <shirs...@gmail.com> wrote:
> I am sorry for the confusion, but my question is that why 32 bit
> machines rounds it off to x.xxxxx36 instead of 34. The 64 bit machines
> does it right. Is there some way to fix it.

Most machines use IEEE 754 floating point numbers and represent a float
in the 32-bit version of that format. As such, you only have about seven
significant decimal digits of accuracy, so anything beyond that will be
subject to errors.
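
To put numbers on the "seven digits" above, here is a minimal sketch, assuming an IEEE 754 single-precision float (which the standard does not require). The helper name ulp_above and the two constants are made up for illustration. A float carries a 24-bit significand, which is a little over seven decimal digits, although only six are guaranteed to survive a round trip through decimal text. Near 2.76 adjacent floats are 2^-22 (about 2.4e-7) apart, so the two results in the original post are neighbouring float values.

```cpp
#include <cmath>
#include <limits>

// Gap between f and the next representable float above it (one ULP at f).
// Assumes f is finite and is not the largest float.
float ulp_above(float f) {
    return std::nextafter(f, std::numeric_limits<float>::infinity()) - f;
}

// Binary digits in the float significand, including the implicit leading bit.
constexpr int float_binary_digits = std::numeric_limits<float>::digits;

// Decimal digits guaranteed to survive a decimal -> float -> decimal round trip.
constexpr int float_decimal_digits = std::numeric_limits<float>::digits10;
```

Calling ulp_above(2.7591734f) gives the spacing at the value in question; anything smaller than half of that is lost when a result is stored into a float.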

Feb 8 '07 #4

Shirsoft wrote [top-posting corrected]:
> On Feb 8, 3:42 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
>> [ ... ]
>> Yes, floating point numbers are not accurate to all decimal places. If
>> you need 100% accuracy you'll need to use something else.
>
> I am sorry for the confusion, but my question is that why 32 bit
> machines rounds it off to x.xxxxx36 instead of 34. The 64 bit machines
> does it right. Is there some way to fix it.

a) Please don't top post.

b) As far as converting from a higher to a lower precision floating point
type is concerned, the standard says [4.8/1]:

An rvalue of floating point type can be converted to an rvalue of another
floating point type. If the source value can be exactly represented in the
destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination
values, the result of the conversion is an implementation-defined choice
of either of those values. Otherwise, the behavior is undefined.

Assuming that your compiler is compliant, that leaves the following possible
explanations for the behavior you observe:

1) The float types in both versions have different precision.
2) One version is rounding down the value, the other is rounding it up.

In either case, there is nothing you can do about it short of writing your
own bit-fiddling rounding function.

Why do you want to convert to float in the first place? Maybe you should do
the whole computation in double.
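
The "two adjacent destination values" in the quoted rule can be made visible with a small sketch. The function name bracketing_floats is hypothetical, and the code assumes the input is finite and within the range of float. It only reports which two floats surround a double; which of them a conversion picks remains implementation-defined as far as the standard is concerned.

```cpp
#include <cmath>
#include <limits>
#include <utility>

// Returns the two adjacent float values (lower, upper) that bracket x.
// If x is exactly representable as a float, both members equal that value.
// Assumes x is finite and lies within the range of float.
std::pair<float, float> bracketing_floats(double x) {
    const float inf = std::numeric_limits<float>::infinity();
    const float f = static_cast<float>(x);  // one of the two candidates
    if (static_cast<double>(f) == x) {
        return {f, f};                       // exact, nothing to choose
    }
    if (static_cast<double>(f) < x) {
        return {f, std::nextafter(f, inf)};  // conversion went down
    }
    return {std::nextafter(f, -inf), f};     // conversion went up
}
```

For the value in this thread, bracketing_floats(-4.7720763683319092) returns two floats whose midpoint is that double exactly, so the conversion is a tie and a tiny difference in an earlier intermediate result is enough to change the outcome.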

Best

Kai-Uwe Bux
Feb 8 '07 #5

On Feb 8, 9:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
> I have a 32 bit intel and 64 bit AMD machine. There is a rounding
> error in the 8th digit. Unfortunately because of the algorithm we use,
> the errors percolate into higher digits.

As others have already explained the whys, I will not; instead I'll
give you some advice, since I recently had the same problem. First off,
if possible, try not to use float unless you really have to; double is
more precise and on most computers just as fast. If you, like me, are
forced to use float (for space concerns perhaps), you can probably
still perform many of the calculations in double by introducing
temporary variables.

> b[2][12] += (float)(mode *val);

Here you could perhaps do something like

double tmp = b[2][12];
tmp += mode * val;
b[2][12] = static_cast<float>(tmp);

For some calculations there will be no difference, but if you perform
"compound" calculations (many operations) and/or multiplications and
divisions, you will probably notice the difference. The same goes for
summations like

for (int i = 0; i < MAX; ++i)
    sum += arr[i];
--
Erik Wikström

Feb 8 '07 #6

"Shirsoft" <sh******@gmail.com> wrote in message
news:11*********************@k78g2000cwa.googlegroups.com...
> I have a 32 bit intel and 64 bit AMD machine. There is a rounding
> error in the 8th digit. Unfortunately because of the algorithm we use,
> the errors percolate into higher digits.

Just another idea: does your hardware use a numeric coprocessor? If so, the
calculation is partly done in the coprocessor, and it uses its mantissa size
regardless of your variable declarations. This could introduce a difference
in precision.

Franz
Feb 8 '07 #7

On Feb 8, 4:32 pm, Kai-Uwe Bux <jkherci...@gmx.net> wrote:
> [ ... ]
> a) Please don't top post.
> [ ... ]

a) I am sorry about that; I am not very familiar with newsgroup
etiquette.

Feb 8 '07 #8

On Feb 8, 5:36 pm, "Erik Wikström" <eri...@student.chalmers.se> wrote:
> [ ... ]
> If you, like me, are forced to use float (for space concerns perhaps)
> you can probably still perform many of the calculations with doubles
> by introducing temporary variables.
> [ ... ]

What you mention is exactly what I am facing. I have a very large
dataset, so I have to use float. I guess that if I use a temporary
double variable, the problem will be fixed.
Could you please tell me whether you found any speed-up by using doubles
instead of floats? I have also read that doubles are as fast as floats,
but I don't quite understand why.

Regards
Shireesh

Feb 8 '07 #9

On Feb 8, 2:50 pm, "Shirsoft" <shirs...@gmail.com> wrote:
> [ ... ]
> Could you please tell me if you found some speed up by using doubles
> instead of floats?

No. I didn't make any exact measurements, but I haven't noticed any
difference in speed.

> I have also read the double are as fast as float but i dont quite understand why.

With no guarantee of correctness: a long time ago (relatively speaking),
float was faster than double because the processor could keep a whole
float in one register but not a double, so a double had to be split
across two registers and both halves operated on. Processors are
designed to perform operations on registers and can perform most
simple operations on values stored in registers in one clock cycle.
This was in the age of 16-bit processors; the floating point registers
are usually larger than the integer ones, so while integers were 16 bits,
floats were 32.

Fast forward to now and processors are at least 32-bit, meaning that
integer registers are 32 bits and floating point registers are larger
still, about 64 bits in fact, which just happens to be the size of a
double. So nowadays you can fit a double into a register and work on
it in just one clock cycle, just like a float. Should that not be
enough, there's long double, which is 128 bits on some platforms, but
then you have to split it across more than one register again.

Keep in mind that all of this is very much platform specific and
probably not the whole truth, but it's probably close enough. As an
example, I seem to recall that x86 computers use 80 bits internally for
FP calculations.
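
Since all of this is platform specific, one way to see what a given compiler and target actually do is to ask them. The following is a small sketch; the function name describe_floating_point is made up, while FLT_EVAL_METHOD and std::numeric_limits are standard (C++11 and later). A FLT_EVAL_METHOD of 0 means each operation is evaluated in the precision of its type, 1 means float operations are evaluated as double, 2 means everything is evaluated as long double, and a negative value means it is indeterminable.

```cpp
#include <cfloat>
#include <limits>
#include <sstream>
#include <string>

// Builds a short report of how this platform sizes and evaluates
// its floating point types.
std::string describe_floating_point() {
    std::ostringstream out;
    out << "FLT_EVAL_METHOD: " << FLT_EVAL_METHOD << '\n'
        << "float:       " << sizeof(float) << " bytes, "
        << std::numeric_limits<float>::digits << " significand bits\n"
        << "double:      " << sizeof(double) << " bytes, "
        << std::numeric_limits<double>::digits << " significand bits\n"
        << "long double: " << sizeof(long double) << " bytes, "
        << std::numeric_limits<long double>::digits << " significand bits\n";
    return out.str();
}
```

Printing the returned string on both the 32-bit and the 64-bit build would show whether the two builds differ in how intermediates are evaluated, without relying on anyone's recollection of the hardware.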

--
Erik Wikström

Feb 8 '07 #10

Shirsoft wrote:
> ...
> On 32 bit(intel , vs 2003, C++), some watch variables are
> ...
> After increment it becomes
> b[2][12] 2.7591736 float <-------------- This is the different value
>
> On 64 bit(amd , vs 2005, C++), some watch variables are
> ...
> After increment it becomes
> b[2][12] 2.7591734 float <-------------- This is the different value

What you provide in your message is not the actual values. What you provide is
the decimal representations of the actual values generated by the built-in
debugger of your development environment (at least that's how I understood your
"watch variables"). Since you are using different development environments in
each case (VS 2003 in the first, VS 2005 in the second) it is quite possible
that the difference in decimal representations is actually caused (at least
partially) by the difference in the binary-to-decimal conversion algorithms.

BTW, I'm not saying that what you are describing is not really happening. I just
think that it would be useful to find the exact step, the exact operation that
produces the difference in the result. From the history of watch variables you
provided, it appears that it happens at the last stage - the rounding. It is
quite possible that in reality it happens [much] earlier. You just can't see it
from the debugger output, because the binary-to-decimal conversion algorithm
hides the problem. Maybe you should try watching the actual binary
representations of the intermediate results.
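
One way to watch the binary representation instead of the debugger's decimal rendering is to copy the bits of the float into an integer and print that in hex. This is only a sketch: the function names are made up, and it assumes float is a 32-bit IEEE 754 value (checked with a static_assert on the size).

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Returns the raw bit pattern of a float as a 32-bit unsigned integer.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to reinterpret
    return bits;
}

// Formats the bit pattern as eight hex digits, e.g. for logging.
std::string float_bits_hex(float f) {
    char buffer[9];
    std::snprintf(buffer, sizeof buffer, "%08lX",
                  static_cast<unsigned long>(float_bits(f)));
    return std::string(buffer);
}
```

Logging float_bits_hex of each intermediate on both machines would show the first step at which the two builds disagree, independent of how either debugger converts binary to decimal. The two final values reported in this thread, 2.7591734f and 2.7591736f, have bit patterns that differ by exactly one in the last place.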

--
Best regards,
Andrey Tarasevich
Feb 8 '07 #12

On Feb 8, 2:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
> I have a 32 bit intel and 64 bit AMD machine. There is a rounding
> error in the 8th digit. Unfortunately because of the algorithm we use,
> the errors percolate into higher digits.
>
> C++ code is
> ------------------
> b[2][12] += (float)(mode *val);
>
> On 32 bit(intel , vs 2003, C++), some watch variables are
> ----------------------------------------------------------------------------------
> b[2][12] 2.7591736 float <-------------- This is the different value
>
> On 64 bit(amd , vs 2005, C++), some watch variables are
> ----------------------------------------------------------------------------------
> b[2][12] 2.7591734 float <-------------- This is the different value

This is likely due to a difference in the way intermediate results are
rounded on the two platforms. The 32-bit code is probably using the
x87 instructions, which by default compute all results to 64 bits of
precision (extended real format), which means that your intermediates
have more precision than you'd expect (IOW, the intermediate result
for (float)(mode*val) will still be -4.7720763683319092, and it will
be subtracted from 7.5312500 with 64-bit precision). The 64-bit
version is using SSE2 scalar instructions, which do not extend
intermediates like that, so a subtraction between floats gets 24 bits
of precision (23 stored bits plus the implicit leading bit), not 64.

In VS03 the "Improve Floating Point Consistency" option may improve
things in the 32-bit version. In VS05 the /fp:strict option is even
more restrictive.
Feb 9 '07 #13

In article <11**********************@p10g2000cwp.googlegroups.com>,
sh******@gmail.com says...

[ ... ]
> What you mention is exactly what i am facing. I have a very huge
> dataset so i have to use float. I guess if i use a temp double var,
> then the problem would be fixed.

Probably. If you could tell us a bit more about what you're doing, it
might be helpful. It sounds like you're writing floating point data out
to a file, then reading it back in and need maximum accuracy, but you're
concerned about the amount of space occupied when you store the data.

Unless you're already doing so, I'd advise storing the data in binary
format -- that'll typically let you store numbers with about 16 digits
of precision in only 8 bytes.

> Could you please tell me if you found some speed up by using doubles
> instead of floats? I have also read the double are as fast as float
> but i dont quite understand why.

That depends on the hardware, but the general idea is that many CPUs
work with double precision numbers internally anyway, and the format you
specify only controls how the data is stored, not how the math is done.

For example, the x86 and compatible CPUs have 80-bit floating point
registers. The precision of calculations is typically set in the startup
code and left fixed throughout a program. Changing the precision is
possible, but slow enough that it's a net loss unless you do quite a bit
of work at the lower precision (and it only affects the more complex
calculations like division, square root, and trig functions, not things
like addition and subtraction).

OTOH, depending on the sorts of things you're doing, the speed of the
calculation may not matter much. If you do a small amount of calculation
on each of a large number of data items, the bandwidth to storage may be
what really matters. An extreme case would be something like adding 1.0
to each number in a multi-gigabyte database. In this case, the real
bottleneck is likely to be reading/writing the data from/to disk.

In a case like that, using single-precision is likely to make a big
difference in overall speed -- simply because it only requires reading
and writing half as much data.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Feb 10 '07 #14
