double to float rounding error in 8th digit

I have a 32-bit Intel machine and a 64-bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately, because of the algorithm we use,
the errors percolate into higher digits.

C++ code is
------------------
b[2][12] += (float)(mode *val);
On 32-bit (Intel, VS 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591736 float <-------------- This is the different value
On 64-bit (AMD, VS 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591734 float <-------------- This is the different value

Feb 8 '07 #1
"Shirsoft" <sh******@gmail.comwrote in message
news:11*********************@k78g2000cwa.googlegro ups.com...
>I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.

C++ code is
------------------
b[2][12] += (float)(mode *val);
On 32 bit(intel , vs 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591736 float <-------------- This is the different value
On 64 bit(amd , vs 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591734 float <-------------- This is the different value
This is a statement, not a question.

Yes, floating point numbers are not accurate to all decimal places. If you
need 100% accuracy, you'll need to use something else.
Feb 8 '07 #2
I am sorry for the confusion, but my question is: why does the 32-bit
machine round it off to x.xxxxx36 instead of 34? The 64-bit machine
does it right. Is there some way to fix it?

On Feb 8, 3:42 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
"Shirsoft" <shirs...@gmail.com> wrote in message

news:11*********************@k78g2000cwa.googlegroups.com...
I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.
C++ code is
------------------
b[2][12] += (float)(mode *val);
On 32 bit(intel , vs 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591736 float <-------------- This is the different value
On 64 bit(amd , vs 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591734 float <-------------- This is the different value

This is a statement, not a question.

Yes, floating point numbers are not accurate to all decimal places. If you
need 100% accuracy you'll need to use something else.

Feb 8 '07 #3
On Feb 8, 3:56 am, "Shirsoft" <shirs...@gmail.com> wrote:
I am sorry for the confusion, but my question is that why 32 bit
machines rounds it off to x.xxxxx36 instead of 34. The 64 bit machines
does it right. Is there some way to fix it.
Most machines use IEEE 754 floating point numbers and represent a float
as a 32-bit version of that format. As such, you only have about seven
significant decimal digits of accuracy, so anything beyond that will be
subject to errors.
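Just to make that concrete (a minimal sketch, not from the original post; the
constant is simply the double product shown in the watch window above):

#include <cstdio>

int main()
{
    double d = -4.7720763683319092;   // the double product from the watch window
    float  f = static_cast<float>(d); // a float keeps only ~7 significant decimal digits
    std::printf("double: %.16f\n", d);
    std::printf("float : %.16f\n", static_cast<double>(f));
    return 0;
}

The extra digits printed for the float are artifacts of converting back to
decimal; only the first seven or so are meaningful.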

Feb 8 '07 #4
Shirsoft wrote [top-posting corrected]
On Feb 8, 3:42 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
>"Shirsoft" <shirs...@gmail.com> wrote in message

news:11*********************@k78g2000cwa.googlegroups.com...
>I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.
C++ code is
------------------
b[2][12] += (float)(mode *val);
On 32 bit(intel , vs 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591736 float <-------------- This is the different value
On 64 bit(amd , vs 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591734 float <-------------- This is the different value

This is a statement, not a question.

Yes, floating point numbers are not accurate to all decimal places. If
you need 100% accuracy you'll need to use something else.
I am sorry for the confusion, but my question is that why 32 bit
machines rounds it off to x.xxxxx36 instead of 34. The 64 bit machines
does it right. Is there some way to fix it.
a) Please don't top post.

b) As far as converting from a higher to a lower precision floating point
type is concerned, the standard says [4.8/1]:

An rvalue of floating point type can be converted to an rvalue of another
floating point type. If the source value can be exactly represented in the
destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination
values, the result of the conversion is an implementation-defined choice
of either of those values. Otherwise, the behavior is undefined.

Assuming that your compiler is compliant, that leaves the following possible
explanations for the behavior you observe:

1) The float types in both versions have different precision.
2) One version is rounding down the value, the other is rounding it up.
In either case, there is nothing you can do about it short of writing your
own bit-fiddling rounding function.
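For what it's worth, here is a small sketch (my own, not part of the standard
text) of how to see the two adjacent float values the double falls between,
i.e. what the implementation-defined choice is between; the constant is the
double sum from the watch window:

#include <cmath>
#include <cstdio>

int main()
{
    double d  = 2.7591733932495117;     // exact double sum from the watch window
    float  lo = static_cast<float>(d);  // whatever your compiler's conversion picked
    // The other candidate is the neighbouring float on the side of the double value
    // (if the conversion happened to be exact, this just picks the float below).
    float  hi = std::nextafterf(lo, lo < d ? INFINITY : -INFINITY);
    std::printf("candidate 1: %.9g\n", lo);
    std::printf("candidate 2: %.9g\n", hi);
    return 0;
}

Comparing the output of both builds against these two candidates tells you
which way each compiler rounded.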
Why do you want to convert to float in the first place? Maybe you should do
the whole computation in double.

Best

Kai-Uwe Bux
Feb 8 '07 #5
On Feb 8, 9:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.
As others have already explained the whys, I will not; instead I'll
give you some advice, since I recently had the same problem. First off,
if possible try not to use float unless you really have to; double is
much better and on most computers just as fast. If you, like me, are
forced to use float (for space concerns, perhaps), you can probably
still perform many of the calculations with doubles by introducing
temporary variables.
b[2][12] += (float)(mode *val);
Here you could perhaps do something like

double tmp = b[2][12];
tmp += mode * val;
b[2][12] = static_cast<float>(tmp);

For some calculations there will be no difference, but if you perform
"compound" calculations (many operations) and/or multiplications/
divisions you probably will notice the difference. The same goes for
summations like

for (int i = 0; i < MAX; ++i)
    sum += arr[i];

where declaring sum as a double, even if arr holds floats, usually keeps
the accumulated error much smaller (see the sketch below).
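As an illustration of that last point (my own sketch, with arbitrary data, not
code from the thread):

#include <cstdio>

int main()
{
    const int MAX = 1000000;
    static float arr[MAX];
    for (int i = 0; i < MAX; ++i)
        arr[i] = 0.1f;                 // arbitrary test data

    float  sumF = 0.0f;                // float accumulator: error grows as MAX grows
    double sumD = 0.0;                 // double accumulator over the same float data
    for (int i = 0; i < MAX; ++i) {
        sumF += arr[i];
        sumD += arr[i];
    }
    std::printf("float  accumulator: %.9g\n", sumF);
    std::printf("double accumulator: %.9g\n", sumD);
    return 0;
}

The array stays in float, so no extra storage is needed; only the accumulator
is widened.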

--
Erik Wikström

Feb 8 '07 #6
"Shirsoft" <sh******@gmail.comschrieb im Newsbeitrag
news:11*********************@k78g2000cwa.googlegro ups.com...
>I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.
Just another idea: does your hardware use a numeric coprocessor? If so, the
calculation is partly done in the coprocessor, and it uses its mantissa size
regardless of your variable declarations. This could introduce a difference
in precision.

Franz
Feb 8 '07 #7
On Feb 8, 4:32 pm, Kai-Uwe Bux <jkherci...@gmx.net> wrote:
Shirsoft wrote [top-posting corrected]
On Feb 8, 3:42 pm, "Jim Langston" <tazmas...@rocketmail.com> wrote:
"Shirsoft" <shirs...@gmail.com> wrote in message
>news:11*********************@k78g2000cwa.googlegroups.com...
I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.
C++ code is
------------------
b[2][12] += (float)(mode *val);
On 32 bit(intel , vs 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591736 float <-------------- This is the different value
On 64 bit(amd , vs 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 7.5312500 float
mode*val -4.7720763683319092 double
(float)(mode*val) -4.7720766 float
(float)(b[2][12]+mode*val) 2.7591736 float
b[2][12]+(float)(mode*val) 2.7591733932495117 double
(float)(b[2][12]+(float)(mode*val)) 2.7591734 float
After increment it becomes
b[2][12] 2.7591734 float <-------------- This is the different value
This is a statement, not a question.
Yes, floating point numbers are not accurate to all decimal places. If
you need 100% accuracy you'll need to use something else.
I am sorry for the confusion, but my question is that why 32 bit
machines rounds it off to x.xxxxx36 instead of 34. The 64 bit machines
does it right. Is there some way to fix it.

a) Please don't top post.

b) As far as converting from a higher to a lower precision floating point
type is concerned, the standard says [4.8/1]:

An rvalue of floating point type can be converted to an rvalue of another
floating point type. If the source value can be exactly represented in the
destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination
values, the result of the conversion is an implementation-defined choice
of either of those values. Otherwise, the behavior is undefined.

Assuming that your compiler is compliant, that leaves the following possible
explanations for the behavior you observe:

1) The float types in both versions have different precision.
2) One version is rounding down the value, the other is rounding it up.

In either case, there is nothing you can do about it short off writing your
own bit-fiddling rounding function.

Why do you want to convert to float in the first place? Maybe, you should do
the whole computation in double.

Best

Kai-Uwe Bux
a) I am sorry about that; I am not very familiar with newsgroup
etiquette.

Feb 8 '07 #8
On Feb 8, 5:36 pm, "Erik Wikström" <eri...@student.chalmers.se> wrote:
On Feb 8, 9:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.

As others have already explained the whys I will not, instead I'll
give you some advice since I recently had the same problem. First of,
if possible try not to use float unless you really have to, double is
much better and on most computers just as fast. If you, like me, are
forced to use float (for space concerns perhaps) you can probably
still perform many of the calculations with doubles by introducing
temporary variables.
b[2][12] += (float)(mode *val);

Here you could perhaps do something like

double tmp = b[2][12];
tmp += mode * val;
b[2][12] = static_cast<float>(tmp);

For some calculations there will be no difference but if you perform
"compound" calculations (many operations) and/or multiplications/
division you probably will notice the difference. Also when performing
summations like

for (int i = 0; i < MAX; ++i)
sum += arr[i];

--
Erik Wikström
What you mention is exactly what I am facing. I have a very large
dataset, so I have to use float. I guess if I use a temporary double
variable, then the problem would be fixed.
Could you please tell me if you found some speedup by using doubles
instead of floats? I have also read that doubles are as fast as floats,
but I don't quite understand why.

Regards
Shireesh

Feb 8 '07 #9
On Feb 8, 2:50 pm, "Shirsoft" <shirs...@gmail.com> wrote:
On Feb 8, 5:36 pm, "Erik Wikström" <eri...@student.chalmers.se> wrote:
On Feb 8, 9:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.
As others have already explained the whys I will not, instead I'll
give you some advice since I recently had the same problem. First of,
if possible try not to use float unless you really have to, double is
much better and on most computers just as fast. If you, like me, are
forced to use float (for space concerns perhaps) you can probably
still perform many of the calculations with doubles by introducing
temporary variables.
b[2][12] += (float)(mode *val);
Here you could perhaps do something like
double tmp = b[2][12];
tmp += mode * val;
b[2][12] = static_cast<float>(tmp);
For some calculations there will be no difference but if you perform
"compound" calculations (many operations) and/or multiplications/
division you probably will notice the difference. Also when performing
summations like
for (int i = 0; i < MAX; ++i)
sum += arr[i];
--
Erik Wikström

What you mention is exactly what i am facing. I have a very huge
dataset so i have to use float. I guess if i use a temp double var,
then the problem would be fixed.
Could you please tell me if you found some speed up by using doubles
instead of floats?
No; though I didn't make any exact measurements, I haven't noticed any
difference in speed.
I have also read the double are as fast as float but i dont quite understand why.
With no guarantee of correctness: a long time ago (relatively speaking)
float was faster than double, since the processor could keep a whole float
in a register but not a whole double, so the double had to be split
across two registers and operated on in both of them. Processors are
designed to perform operations on registers and can perform most
simple operations on values stored in registers in one clock cycle.
This was in the age of 16-bit processors; floating point registers
are usually larger than the integer ones, so while integers were 16 bits,
floats were 32.

Fast forward to now: processors are at least 32-bit, meaning that
integer registers are 32 bits and floating point registers are larger
still, about 64 bits in fact, which just happens to be the size of a
double. So nowadays you can fit a double into a register and work on
it in just one clock cycle, just like a float. Should that not be
enough, there's long double, which is 128 bits, but then you have to
split it across more than one register again.

Keep in mind that all of this is very much platform specific and
probably not the whole truth, but it's probably close enough. As an
example, I seem to recall that x86 processors use 80 bits internally for
FP calculations.
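If you want to check the speed claim on your own machine, here is a rough
sketch of a timing loop (my own; the size and data are arbitrary, and
optimisation settings or memory bandwidth can easily dominate the result):

#include <cstddef>
#include <cstdio>
#include <ctime>
#include <vector>

// Sum a large vector of the given element type and report the elapsed CPU time.
template <typename T>
void timed_sum(const char* label)
{
    const std::size_t N = 10000000;        // arbitrary, just large enough to measure
    std::vector<T> data(N, T(0.1));

    std::clock_t start = std::clock();
    T sum = T(0);
    for (std::size_t i = 0; i < N; ++i)
        sum += data[i];
    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;

    std::printf("%s: sum = %g, time = %.3f s\n", label, double(sum), seconds);
}

int main()
{
    timed_sum<float>("float ");
    timed_sum<double>("double");
    return 0;
}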

--
Erik Wikström

Feb 8 '07 #10
Shirsoft wrote:
...
On 32 bit(intel , vs 2003, C++), some watch variables are
...
After increment it becomes
b[2][12] 2.7591736 float <-------------- This is the different value

On 64 bit(amd , vs 2005, C++), some watch variables are
...
After increment it becomes
b[2][12] 2.7591734 float <-------------- This is the different value
What you provide in your message are not the actual values. What you provide
are the decimal representations of the actual values, generated by the built-in
debugger of your development environment (at least that's how I understood your
"watch variables"). Since you are using different development environments in
each case (VS 2003 in the first, VS 2005 in the second), it is quite possible
that the difference in decimal representations is actually caused (at least
partially) by a difference in the binary-to-decimal conversion algorithms.

BTW, I'm not saying that what you are describing is not really happening. I just
think that it would be useful to find the exact step, the exact operation that
produces the difference in the result. From the history of watch variables you
provided, it appears that it happens at the last stage - the rounding. It is
quite possible that in reality it happens [much] earlier. You just can't see it
from the debugger output, because the binary-to-decimal conversion algorithm
hides the problem. Maybe you should try watching the actual binary
representations of the intermediate results.
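One way to do that (a minimal sketch of my own; it assumes a 32-bit float and
a 32-bit unsigned int, which holds on both of the platforms mentioned):

#include <cstdio>
#include <cstring>

// Print a float together with its raw IEEE-754 bit pattern in hex.
void show_bits(const char* label, float f)
{
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::printf("%s = %.9g  bits = 0x%08X\n", label, f, bits);
}

int main()
{
    double d = 2.7591733932495117;          // value from the watch window
    show_bits("float(d)", static_cast<float>(d));
    return 0;
}

If the hex patterns differ between the two builds, the values really are
different; if they match, the discrepancy is only in the debugger's decimal
formatting.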

--
Best regards,
Andrey Tarasevich
Feb 8 '07 #12
On Feb 8, 2:55 am, "Shirsoft" <shirs...@gmail.com> wrote:
I have a 32 bit intel and 64 bit AMD machine. There is a rounding
error in the 8th digit. Unfortunately because of the algorithm we use,
the errors percolate into higher digits.

C++ code is
------------------
b[2][12] += (float)(mode *val);

On 32 bit(intel , vs 2003, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 2.7591736 float <-------------- This is the different value

On 64 bit(amd , vs 2005, C++), some watch variables are
----------------------------------------------------------------------------------
b[2][12] 2.7591734 float <-------------- This is the different value

This is likely due to a difference in the way rounding happens on
intermediate results on the two platforms. The 32-bit code is
probably using the x87 instructions, which by default compute all
results to 64 bits of precision (extended real format), which means
that your intermediates have more precision than you'd expect (IOW,
the intermediate result for (float)(mode*val) will still be
-4.7720763683319092, and it will be added to 7.5312500 with 64-bit
precision). The 64-bit version is using SSE2 scalar instructions,
which do not extend intermediates like that, so operations on floats
get the 24-bit precision of a float's significand (23 stored bits), not 64.

In VS 2003 the "Improve Floating Point Consistency" option may improve
things in the 32-bit version. In VS 2005 the /fp:strict option is even
more restrictive.
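A hedged sketch of the effect (my own, using the values from the watch
window): forcing the rounded intermediate through memory with a volatile
should give the SSE2-style answer even on an x87 build, so comparing the two
variants on each machine can show whether extended-precision intermediates
are the culprit:

#include <cstdio>

int main()
{
    float  b    = 7.5312500f;
    double prod = -4.7720763683319092;       // mode * val from the watch window

    // Variant 1: as in the original code; on x87 the (float) intermediate
    // may in fact be kept in a register at extended precision.
    float direct = b + (float)(prod);

    // Variant 2: force the intermediate to be stored as a real 4-byte float first.
    volatile float rounded = (float)(prod);
    float forced = b + rounded;

    std::printf("direct: %.9g\n", direct);
    std::printf("forced: %.9g\n", forced);
    return 0;
}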
Feb 9 '07 #13
In article <11**********************@p10g2000cwp.googlegroups.com>,
sh******@gmail.com says...

[ ... ]
What you mention is exactly what i am facing. I have a very huge
dataset so i have to use float. I guess if i use a temp double var,
then the problem would be fixed.
Probably. If you could tell us a bit more about what you're doing, it
might be helpful. It sounds like you're writing floating point data out
to a file, then reading it back in and need maximum accuracy, but you're
concerned about the amount of space occupied when you store the data.

Unless you're already doing so, I'd advise storing the data in binary
format -- that'll typically let you store numbers with about 16 digits
of precision in only 8 bytes.
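For what it's worth, here is a minimal sketch of that kind of binary storage
(the file name, the values, and the lack of error handling are just
placeholders):

#include <cstdio>

int main()
{
    double out[3] = { 7.53125, -4.7720763683319092, 2.7591733932495117 };
    double in[3]  = { 0 };

    // Write the raw 8-byte representations, then read them back unchanged.
    std::FILE* f = std::fopen("values.bin", "wb");
    std::fwrite(out, sizeof(double), 3, f);
    std::fclose(f);

    f = std::fopen("values.bin", "rb");
    std::fread(in, sizeof(double), 3, f);
    std::fclose(f);

    std::printf("%.17g %.17g %.17g\n", in[0], in[1], in[2]);
    return 0;
}

No decimal conversion happens, so the values read back are bit-for-bit
identical to the ones written (as long as the file is read on a machine with
the same double format and byte order).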
Could you please tell me if you found some speed up by using doubles
instead of floats? I have also read the double are as fast as float
but i dont quite understand why.
That depends on the hardware, but the general idea is that many CPUs
work with double precision numbers internally anyway, and the format you
specify only controls how the data is stored, not how the math is done.

For example, the x86 and compatible CPUs have 80-bit floating point
registers. The precision of calculations is typically set in the startup
code and left fixed throughout a program. Changing the precision is
possible, but slow enough that it's a net loss unless you do quite a bit
of work at the lower precision (and it only affects the more complex
calculations like division, square root, and trig functions, not things
like addition and subtraction).

OTOH, depending on the sorts of things you're doing, the speed of the
calculation may not matter much. If you do a small amount of calculation
on each of a large number of data items, the bandwidth to storage may be
what really matters. An extreme case would be something like adding 1.0
to each number in a multi-gigabyte database. In this case, the real
bottleneck is likely to be reading/writing the data from/to disk.

In a case like that, using single-precision is likely to make a big
difference in overall speed -- simply because it only requires reading
and writing half as much data.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Feb 10 '07 #14
