
# double to float rounding error in 8th digit

I have a 32 bit Intel and a 64 bit AMD machine. There is a rounding error in the 8th digit. Unfortunately, because of the algorithm we use, the errors percolate into higher digits.

C++ code is

    b += (float)(mode * val);

On 32 bit (Intel, VS 2003, C++), some watch variables are

| Expression | Value | Type |
|---|---|---|
| `b` | 7.5312500 | float |
| `mode*val` | -4.7720763683319092 | double |
| `(float)(mode*val)` | -4.7720766 | float |
| `(float)(b+mode*val)` | 2.7591736 | float |
| `b+(float)(mode*val)` | 2.7591733932495117 | double |
| `(float)(b+(float)(mode*val))` | 2.7591734 | float |

After increment it becomes

| Expression | Value | Type |
|---|---|---|
| `b` | 2.7591736 | float |

<-------------- This is the different value

On 64 bit (AMD, VS 2005, C++), some watch variables are

| Expression | Value | Type |
|---|---|---|
| `b` | 7.5312500 | float |
| `mode*val` | -4.7720763683319092 | double |
| `(float)(mode*val)` | -4.7720766 | float |
| `(float)(b+mode*val)` | 2.7591736 | float |
| `b+(float)(mode*val)` | 2.7591733932495117 | double |
| `(float)(b+(float)(mode*val))` | 2.7591734 | float |

After increment it becomes

| Expression | Value | Type |
|---|---|---|
| `b` | 2.7591734 | float |

<-------------- This is the different value

Feb 8 '07 #1
13 Replies

"Shirsoft" wrote:

> I have a 32 bit intel and 64 bit AMD machine. There is a rounding error in the 8th digit. Unfortunately because of the algorithm we use, the errors percolate into higher digits.
>
> [...]

This is a statement, not a question. Yes, floating point numbers are not accurate to all decimal places. If you need 100% accuracy you'll need to use something else.

Feb 8 '07 #2

I am sorry for the confusion. My question is why the 32 bit machine rounds it off to x.xxxxx36 instead of x.xxxxx34. The 64 bit machine does it right. Is there some way to fix it?

On Feb 8, 3:42 pm, "Jim Langston"

On Feb 8, 3:56 am, "Shirsoft"

Shirsoft wrote [top-posting corrected]:

> On Feb 8, 3:42 pm, "Jim Langston" wrote:
>
>> "Shirsoft" wrote:
>>
>>> I have a 32 bit intel and 64 bit AMD machine. There is a rounding error in the 8th digit. Unfortunately because of the algorithm we use, the errors percolate into higher digits.
>>>
>>> [...]
>>
>> This is a statement, not a question. Yes, floating point numbers are not accurate to all decimal places. If you need 100% accuracy you'll need to use something else.
>
> I am sorry for the confusion, but my question is that why 32 bit machines rounds it off to x.xxxxx36 instead of 34. The 64 bit machines does it right. Is there some way to fix it.

a) Please don't top post.

b) As far as converting from a higher to a lower precision floating point type is concerned, the standard says [4.8/1]:

> An rvalue of floating point type can be converted to an rvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.

Assuming that your compiler is compliant, that leaves the following possible explanations for the behavior you observe:

1) The float types in both versions have different precision.

2) One version is rounding down the value, the other is rounding it up.

In either case, there is nothing you can do about it short of writing your own bit-fiddling rounding function.

Why do you want to convert to float in the first place? Maybe you should do the whole computation in double.

Best

Kai-Uwe Bux

Feb 8 '07 #5

On Feb 8, 9:55 am, "Shirsoft" tmp;

For some calculations there will be no difference, but if you perform "compound" calculations (many operations) and/or multiplication/division you will probably notice the difference. Also when performing summations like

    for (int i = 0; i < MAX; ++i)
        sum += arr[i];

-- Erik Wikström

Feb 8 '07 #6

"Shirsoft" wrote:

> I have a 32 bit intel and 64 bit AMD machine. There is a rounding error in the 8th digit. Unfortunately because of the algorithm we use, the errors percolate into higher digits.

Just another idea: does your hardware use a numeric coprocessor? If so, the calculation is partly done in the coprocessor, and it uses its mantissa size regardless of your variable declarations. This could introduce a difference in precision.

Franz

Feb 8 '07 #7

On Feb 8, 4:32 pm, Kai-Uwe Bux wrote:

> a) Please don't top post.
>
> [...]

a) I am sorry about that, I am not very familiar with newsgroup etiquette.

Feb 8 '07 #8

On Feb 8, 5:36 pm, "Erik Wikström" tmp;

> For some calculations there will be no difference but if you perform "compound" calculations (many operations) and/or multiplications/ division you probably will notice the difference. Also when performing summations like `for (int i = 0; i < MAX; ++i) sum += arr[i];`
>
> -- Erik Wikström

What you mention is exactly what I am facing. I have a very huge dataset, so I have to use float. I guess if I use a temp double var, then the problem would be fixed.

Could you please tell me if you found some speed up by using doubles instead of floats? I have also read that doubles are as fast as floats, but I don't quite understand why.

Regards

Shireesh

Feb 8 '07 #9

On Feb 8, 2:50 pm, "Shirsoft" tmp;

> What you mention is exactly what i am facing. I have a very huge dataset so i have to use float. I guess if i use a temp double var, then the problem would be fixed.
>
> Could you please tell me if you found some speed up by using doubles instead of floats?

No. I didn't make any exact measurements, but I haven't noticed any difference in speed.

> I have also read the double are as fast as float but i dont quite understand why.

With no guarantee of correctness:

A long time ago (relatively) float was faster than double, since the processor could keep a whole float in a register but not a double, so it had to divide the double between two registers and operate on both of them. Processors are designed to perform operations on registers and can perform most simple operations on values stored in registers in one clock cycle. This was in the age of 16 bit processors. The floating point registers are usually larger than the integer ones, so while integers were 16 bits, floats were 32.

Fast forward to now and processors are at least 32 bit, meaning that integer registers are 32 bits and floating point registers are larger still, about 64 bits in fact, which just happens to be the size of a double. So nowadays you can fit a double into a register and work on it in just one clock cycle, just like a float. Should that not be enough there's the long double, which is 128 bits, but then you have to split it into more than one register again.

Keep in mind that all of this is very much platform specific and probably not the whole truth, but it's probably close enough. As an example, I seem to recall that x86 computers use 80 bits internally for FP calculations.

-- Erik Wikström

Feb 8 '07 #10


Shirsoft wrote:

> ...
> On 32 bit(intel , vs 2003, C++), some watch variables are
> ...
> After increment it becomes b 2.7591736 float <-------------- This is the different value
>
> On 64 bit(amd , vs 2005, C++), some watch variables are
> ...
> After increment it becomes b 2.7591734 float <-------------- This is the different value

What you provide in your message is not the actual values. What you provide is the decimal representations of the actual values generated by the built-in debugger of your development environment (at least that's how I understood your "watch variables"). Since you are using different development environments in each case (VS 2003 in the first, VS 2005 in the second), it is quite possible that the difference in decimal representations is actually caused (at least partially) by the difference in the binary-to-decimal conversion algorithms.

BTW, I'm not saying that what you are describing is not really happening. I just think that it would be useful to find the exact step, the exact operation that produces the difference in the result. From the history of watch variables you provided, it appears that it happens at the last stage: the rounding. It is quite possible that in reality it happens [much] earlier. You just can't see it from the debugger, because the binary-to-decimal conversion algorithm hides the problem. Maybe you should try watching the actual binary representations of the intermediate results.

--
Best regards,
Andrey Tarasevich

Feb 8 '07 #12

On Feb 8, 2:55 am, "Shirsoft"

In article <11**********************@p10g2000cwp.googlegroups.com>, sh******@gmail.com says...

[ ... ]

> What you mention is exactly what i am facing. I have a very huge dataset so i have to use float. I guess if i use a temp double var, then the problem would be fixed.

Probably. If you could tell us a bit more about what you're doing, it might be helpful. It sounds like you're writing floating point data out to a file, then reading it back in and need maximum accuracy, but you're concerned about the amount of space occupied when you store the data. Unless you're already doing so, I'd advise storing the data in binary format -- that'll typically let you store numbers with about 16 digits of precision in only 8 bytes.

> Could you please tell me if you found some speed up by using doubles instead of floats? I have also read the double are as fast as float but i dont quite understand why.

That depends on the hardware, but the general idea is that many CPUs work with double precision numbers internally anyway, and the format you specify only controls how the data is stored, not how the math is done. For example, the x86 and compatible CPUs have 80-bit floating point registers. The precision of calculations is typically set in the startup code and left fixed throughout a program. Changing the precision is possible, but slow enough that it's a net loss unless you do quite a bit of work at the lower precision (and it only affects the more complex calculations like division, square root, and trig functions, not things like addition and subtraction).

OTOH, depending on the sorts of things you're doing, the speed of the calculation may not matter much. If you do a small amount of calculation on each of a large number of data items, the bandwidth to storage may be what really matters. An extreme case would be something like adding 1.0 to each number in a multi-gigabyte database. In this case, the real bottleneck is likely to be reading/writing the data from/to disk. In a case like that, using single-precision is likely to make a big difference in overall speed -- simply because it only requires reading and writing half as much data.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Feb 10 '07 #14
