
# double to float rounding error in 8th digit

I have a 32-bit Intel and a 64-bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately, because of the algorithm we use,
the errors percolate into higher digits.

C++ code is
------------------
b[2][12] += (float)(mode * val);

On 32 bit (Intel, VS 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12]                               7.5312500            float
mode*val                               -4.7720763683319092  double
(float)(mode*val)                      -4.7720766           float
(float)(b[2][12]+mode*val)             2.7591736            float
b[2][12]+(float)(mode*val)             2.7591733932495117   double
(float)(b[2][12]+(float)(mode*val))    2.7591734            float
After increment it becomes
b[2][12]                               2.7591736            float   <-------------- This is the different value

On 64 bit (AMD, VS 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12]                               7.5312500            float
mode*val                               -4.7720763683319092  double
(float)(mode*val)                      -4.7720766           float
(float)(b[2][12]+mode*val)             2.7591736            float
b[2][12]+(float)(mode*val)             2.7591733932495117   double
(float)(b[2][12]+(float)(mode*val))    2.7591734            float
After increment it becomes
b[2][12]                               2.7591734            float   <-------------- This is the different value

Feb 8 '07 #1
This is a statement, not a question.

Yes, floating point numbers are not accurate to all decimal places. If you
need 100% accuracy you'll need to use something else.
Feb 8 '07 #2
I am sorry for the confusion, but my question is why the 32-bit
machine rounds it off to x.xxxxx36 instead of x.xxxxx34. The 64-bit
machine does it right. Is there some way to fix it?

Feb 8 '07 #3
On Feb 8, 3:56 am, "Shirsoft" <shirs...@gmail.com> wrote:
> I am sorry for the confusion, but my question is why the 32-bit
> machine rounds it off to x.xxxxx36 instead of x.xxxxx34. The 64-bit
> machine does it right. Is there some way to fix it?

Most machines use IEEE 754 floating-point numbers and represent a float
in the 32-bit format. As such, you only have about seven significant
decimal digits of accuracy, so anything beyond that will be subject
to errors.
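
To make the seven-digit limit concrete, here is a minimal sketch (not from the original posts), assuming float is the IEEE 754 32-bit format with a 24-bit significand. The helper names `float_spacing_near` and `is_next_float_up` are made up for illustration. They measure the gap between a float and the next representable float above it.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next representable float above it.
// The subtraction is exact because the two values are neighbours.
float float_spacing_near(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Hypothetical helper: true when b is the float immediately above a.
bool is_next_float_up(float a, float b) {
    return std::nextafter(a, std::numeric_limits<float>::infinity()) == b;
}
```

Under that assumption the spacing near 2.76 is 2^-22, roughly 2.4e-7, so `is_next_float_up(2.7591734f, 2.7591736f)` holds: the two results in the question are neighbouring floats and differ by a single unit in the last place.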

Feb 8 '07 #4
Shirsoft wrote [top-posting corrected]
> I am sorry for the confusion, but my question is why the 32-bit
> machine rounds it off to x.xxxxx36 instead of x.xxxxx34. The 64-bit
> machine does it right. Is there some way to fix it?

a) Please don't top post.

b) As far as converting from a higher to a lower precision floating point
type is concerned, the standard says [4.8/1]:

An rvalue of floating point type can be converted to an rvalue of another
floating point type. If the source value can be exactly represented in the
destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination
values, the result of the conversion is an implementation-defined choice
of either of those values. Otherwise, the behavior is undefined.

Assuming that your compiler is compliant, that leaves the following possible
explanations for the behavior you observe:

1) The float types in both versions have different precision.
2) One version is rounding down the value, the other is rounding it up.

In either case, there is nothing you can do about it short of writing your
own bit-fiddling rounding function.

Why do you want to convert to float in the first place? Maybe you should do
the whole computation in double.
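
To check which of the two explanations applies, one could look at the two adjacent float values around a given double and see which one the conversion picks. The following is only a sketch: `bracketing_floats` and `conversion_rounded_up` are hypothetical names, and the code assumes the input is finite and within the range of float.

```cpp
#include <cmath>
#include <limits>
#include <utility>

// Hypothetical helper: returns the adjacent floats (lo, hi) with lo <= d <= hi.
// If d is exactly representable as a float, lo == hi.
std::pair<float, float> bracketing_floats(double d) {
    const float inf = std::numeric_limits<float>::infinity();
    const float f = static_cast<float>(d);  // implementation-defined choice of neighbour
    const double back = f;                   // float -> double is always exact
    if (back == d) return {f, f};
    if (back < d) return {f, std::nextafter(f, inf)};
    return {std::nextafter(f, -inf), f};
}

// Hypothetical helper: true if converting d to float picked the upper neighbour.
bool conversion_rounded_up(double d) {
    return static_cast<double>(static_cast<float>(d)) > d;
}
```

Running both builds on the same double and comparing `conversion_rounded_up` would show whether the conversion itself differs, or whether the difference comes from somewhere else in the expression.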

Best

Kai-Uwe Bux
Feb 8 '07 #5
On Feb 8, 9:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
> I have a 32-bit Intel and a 64-bit AMD machine. There is a rounding
> error in the 8th digit. Unfortunately, because of the algorithm we use,
> the errors percolate into higher digits.

As others have already explained the whys, I will not; instead I'll
give you some advice, since I recently had the same problem. First off,
if possible try not to use float unless you really have to; double is
much better and on most computers just as fast. If you, like me, are
forced to use float (for space concerns perhaps), you can probably
still perform many of the calculations with doubles by introducing
temporary variables.

> b[2][12] += (float)(mode * val);

Here you could perhaps do something like

double tmp = b[2][12];
tmp += mode * val;
b[2][12] = static_cast<float>(tmp);

For some calculations there will be no difference, but if you perform
"compound" calculations (many operations) and/or multiplications and
divisions you will probably notice the difference. The same goes for
summations like

for (int i = 0; i < MAX; ++i)
    sum += arr[i];
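
The summation case can be sketched as follows. This is an illustration, not code from the thread: the function names are made up, and the comments assume a platform that evaluates float arithmetic in float precision.

```cpp
#include <vector>

// Sum with a float accumulator: every += rounds the running total to float,
// so small terms can be lost once the total is large.
float sum_in_float(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) sum += v;
    return sum;
}

// Sum with a double accumulator and a single rounding to float at the end.
// The data stays float; only the temporary is wider.
float sum_via_double(const std::vector<float>& values) {
    double sum = 0.0;
    for (float v : values) sum += v;
    return static_cast<float>(sum);
}
```

For example, with the values 16777216 (2^24) followed by four 1s, the float accumulator cannot represent 2^24 + 1 and stays at 16777216, while the double accumulator reaches 16777220 before the final conversion.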

--
Erik Wikström

Feb 8 '07 #6
"Shirsoft" <sh******@gmail.com> wrote in message
news:11*********************@k78g2000cwa.googlegroups.com...
> I have a 32-bit Intel and a 64-bit AMD machine. There is a rounding
> error in the 8th digit. Unfortunately, because of the algorithm we use,
> the errors percolate into higher digits.

Just another idea: does your hardware use a numeric coprocessor? If so, the
calculation is partly done in the coprocessor, and it uses its mantissa size
regardless of your variable declarations. This could introduce a difference
in precision.
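
A rough way to see this effect without the original hardware, offered as a sketch only: the standard macro `FLT_EVAL_METHOD` from `<cfloat>` reports how the compiler evaluates floating-point expressions, and the two hypothetical functions below mimic the two orders of rounding. Whether the 32-bit build actually kept the product in a wide register is an assumption, not something the posts confirm.

```cpp
#include <cfloat>  // FLT_EVAL_METHOD, FLT_EPSILON, DBL_EPSILON

// 0 = operations are evaluated in the operand type, 1 = float is evaluated
// as double, 2 = everything is evaluated as long double (typical of x87
// code), negative = indeterminate.
int evaluation_method() {
    return FLT_EVAL_METHOD;
}

// acc += (float)term, with the term rounded to float before the addition.
float add_rounding_term_first(float acc, double term) {
    const float t = static_cast<float>(term);
    return acc + t;
}

// The same update with the addition done in double and rounded to float
// only once, which is what keeping the intermediate in a wider register
// amounts to.
float add_rounding_once(float acc, double term) {
    return static_cast<float>(static_cast<double>(acc) + term);
}
```

With the values from the first post, `add_rounding_once(7.53125f, -4.7720763683319092)` gives 2.7591736f, the 32-bit result, while the watch window shows the rounded-first expression giving 2.7591734, the 64-bit result.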

Franz
Feb 8 '07 #7
On Feb 8, 4:32 pm, Kai-Uwe Bux <jkherci...@gmx.net> wrote:
> a) Please don't top post.

a) I am sorry about that; I am not very familiar with newsgroup
etiquette.

Feb 8 '07 #8
What you mention is exactly what I am facing. I have a very large
dataset, so I have to use float. I guess if I use a temporary double
variable, the problem would be fixed.
Could you please tell me if you found some speed-up by using doubles
instead of floats? I have also read that doubles are as fast as floats,
but I don't quite understand why.

Regards
Shireesh

Feb 8 '07 #9
On Feb 8, 2:50 pm, "Shirsoft" <shirs...@gmail.com> wrote:
> What you mention is exactly what I am facing. I have a very large
> dataset, so I have to use float. I guess if I use a temporary double
> variable, the problem would be fixed.
> Could you please tell me if you found some speed-up by using doubles
> instead of floats?

No. I didn't take any exact measurements, but I haven't noticed any
difference in speed.

> I have also read that doubles are as fast as floats, but I don't
> quite understand why.

With no guarantee of correctness: a (relatively) long time ago float
was faster than double because the processor could keep a whole float
in one register but not a double, so a double had to be split across
two registers and operated on in pieces. Processors are designed to
perform operations on registers and can perform most simple operations
on values stored in registers in one clock cycle. This was in the age
of 16-bit processors; the floating-point registers are usually larger
than the integer ones, so while integers were 16 bits, floats were 32.

Fast forward to now and processors are at least 32-bit, meaning that
integer registers are 32 bits and floating-point registers are larger
still, about 64 bits in fact, which just happens to be the size of a
double. So nowadays a double fits into a register and can be worked on
as quickly as a float. Should that not be enough, there's long double,
but its size is platform-specific (often 80 or 128 bits, and with some
compilers no larger than double), so it may have to be split across
more than one register again.

Keep in mind that all of this is very much platform-specific and
probably not the whole truth, but it's probably close enough. As an
example, I seem to recall that x86 computers use 80 bits internally for
FP calculations.
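
Since the sizes are platform-specific, it is easy to ask the compiler directly. A small sketch, where the struct `FloatingTypeInfo` and the function `describe` are hypothetical names introduced for illustration:

```cpp
#include <cstddef>
#include <limits>

// Hypothetical summary type, for illustration only.
struct FloatingTypeInfo {
    std::size_t bytes;     // storage size, including any padding
    int significand_bits;  // precision in bits, counting the implicit leading bit
};

// Reports storage size and precision of a floating-point type T.
template <typename T>
FloatingTypeInfo describe() {
    return {sizeof(T), std::numeric_limits<T>::digits};
}
```

On IEEE 754 platforms `describe<float>()` reports 4 bytes and 24 bits, and `describe<double>()` reports 8 bytes and 53 bits. `describe<long double>()` varies: GCC and Clang on x86 typically use the 80-bit x87 format (64 significand bits) padded to 12 or 16 bytes, while Visual C++ gives long double the same representation as double.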

--
Erik Wikström

Feb 8 '07 #10
