Float precision (0.5678f becomes 0.56779998540878)

Mad Butch

void Test()
{
float fValue = 0.5678f; // Value is 0.567800
double dValue = (double)fValue; // Value is 0.56779998540878
}

Is there any way I can round a float to 6 places after the decimal point?
That way the value would be 0.567800 again?

Jul 19 '05 #1


Gianni Mariani

Mad Butch wrote:

void Test()
{
float fValue = 0.5678f; // Value is 0.567800
double dValue = (double)fValue; // Value is 0.56779998540878
}

Is there anyway I can round a float at 6 positions behind the decimal?
That way the value would be 0.567800 again?

try this:

#include <iostream>
#include <cmath>

int main()
{
    float fValue = 0.5678f;
    double dValue = (double)fValue; // widening keeps the float's exact binary value

    std::cout.precision(15);
    std::cout.width(20);

    std::cout << dValue << std::endl;

    // round to 6 decimal places: scale up, round to the nearest integer, scale down
    dValue = std::floor( 1e6l * fValue + 0.5l ) / 1e6l;

    std::cout.precision(15);
    std::cout.width(20);

    std::cout << dValue << std::endl;
}

Jul 19 '05 #2

Howard

"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:bk********@dispatch.concentric.net...

Mad Butch wrote:
void Test()
{
float fValue = 0.5678f; // Value is 0.567800
double dValue = (double)fValue; // Value is 0.56779998540878
}

Is there anyway I can round a float at 6 positions behind the decimal?
That way the value would be 0.567800 again?

try this:

#include <iostream>
#include <cmath>

int main()
{
float fValue = 0.5678f;
double dValue = (double)fValue;

std::cout.precision(15);
std::cout.width(20);

std::cout << dValue << std::endl;

dValue = std::floor( 1e6l * fValue + 0.5l ) / 1e6l;

std::cout.precision(15);
std::cout.width(20);

std::cout << dValue << std::endl;
}

Without trying that code out, it appears that it *writes out* a result that
looks like what the OP was asking for.

However, that is not what I got from the OP's question. There was no "cout"
in his code. I *think* what was being asked was how to actually get the
value stored in memory to have the same value as a double that it had as a
float. And the answer to that is, in general, you can't!

Floating-point values are stored in binary, not decimal, and there is (in most
cases) no way to get a specific floating-point value accurate to some number
of *decimal* places. It can only be accurate to some number of *binary*
digits, because it's binary, not decimal.

One should, in general, never rely on a floating-point number to be exactly
anything (except perhaps a whole number, such as zero).
-Howard

Jul 19 '05 #3

Kevin Goodsell

Howard wrote:

One should, in general, never rely on a floating-point number to be exactly
anything (except perhaps a whole number, such as zero).

There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

As a specific example, the IEEE format that is used on a number of
systems (including Intel x86 systems) has doubles that can represent
every integer from something like -9,007,199,254,740,992 to
+9,007,199,254,740,992 (I worked that out myself, so I may have made a
mistake). Outside that range, I don't think you can find two consecutive
integers that can be precisely represented.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.

Jul 19 '05 #4

Jack Klein

On Wed, 17 Sep 2003 18:57:29 GMT, Kevin Goodsell
<us*********************@neverbox.com> wrote in comp.lang.c++:

Howard wrote:

One should, in general, never rely on a floating-point number to be exactly
anything (except perhaps a whole number, such as zero).

There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

Sure there is. The C++ standard specifically inherits <float.h> from
ISO C, with either that name or <cfloat>, and makes the following
normative: ISO C subclause 7.1.5, 5.2.4.2.2, 5.2.4.2.1.

DBL_DIG, FLT_DIG, and LDBL_DIG are required to be available and have
the same meaning they have in C, namely the maximum number of decimal
digits a whole number can contain and be guaranteed to be converted to
the floating point type and back again with the original value.

These might be available in <limits> as well, I haven't looked it up.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ ftp://snurse-l.org/pub/acllc-c++/faq

Jul 19 '05 #5

Kevin Goodsell

Jack Klein wrote:

On Wed, 17 Sep 2003 18:57:29 GMT, Kevin Goodsell
<us*********************@neverbox.com> wrote in comp.lang.c++:

There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

Sure there is. The C++ standard specifically inherits <float.h> from
ISO C, with either that name or <cfloat>, and makes the following
normative: ISO C subclause 7.1.5, 5.2.4.2.2, 5.2.4.2.1.

DBL_DIG, FLT_DIG, and LDBL_DIG are required to be available and have
the same meaning they have in C, namely the maximum number of decimal
digits a whole number can contain and be guaranteed to be converted to
the floating point type and back again with the original value.

These might be available in <limits> as well, I haven't looked it up.

Oh, OK. Most of those macros relating to floating point types are a
mystery to me. Thanks for the correction.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.

Jul 19 '05 #6

Karl Heinz Buchegger

Kevin Goodsell wrote:

Howard wrote:

One should, in general, never rely on a floating-point number to be exactly
anything (except perhaps a whole number, such as zero).

There's usually a pretty broad range of contiguous integers that can be
exactly represented, but as far as I know there's no guarantee of this
in the standard.

As a specific example, the IEEE format that is used on a number of
systems (including Intel x86 systems) has doubles that can represent
every integer from something like -9,007,199,254,740,992 to
+9,007,199,254,740,992 (I worked that out myself, so I may have made a
mistake). Outside that range, I don't think you can find two consecutive
integers that can be precisely represented.

I doubt that.
The only difference between 0.3 and 30 as a floating point number
is the exponent.

0.3 is stored (of course in binary) as 0.3 E 0
30.0 is stored as 0.3 E 2

so if the same mantissa is used and we know that 0.3
cannot be represented exactly by a binary floating point
number, then 30 also cannot be represented exactly.
Of course the real thing is different, since the exponent
usually is not base 10, but base 2, but the principle is the
same.

--
Karl Heinz Buchegger
kb******@gascad.at

Jul 19 '05 #7

Rob Williscroft

Karl Heinz Buchegger wrote in news:3F***************@gascad.at:

As a specific example, the IEEE format that is used on a number of
systems (including Intel x86 systems) has doubles that can represent
every integer from something like -9,007,199,254,740,992 to
+9,007,199,254,740,992 (I worked that out myself, so I may have made
a mistake). Outside that range, I don't think you can find two
consecutive integers that can be precisely represented.
I doubt that.
The only difference between 0.3 and 30 as a floating point number
is the exponent.

0.3 is stored (of course in binary) as 0.3 E 0

Note that 0.3 can't be stored (exactly) in binary, 3 can and so
can 0.5, 0.25, 0.125 ... etc. I mean binary as base 2 not just
as a collection of bits (which is probably what you meant).

30.0 is stored as 0.3 E 2

so if the same mantissa is used and we know that 0.3
cannot be represented exactly by a binary floating point
number, then 30 also cannot be represented exactly.

0.3 can't be represented exactly as an integer multiplied by a
power of 2; 30 can. My point here is that your argument is
backwards: a floating point value is an integer
multiplied by some other integer (usually 2 but possibly 10)
raised to the power of yet another integer. I.e. you can
infer things about a non-integer by making reasoned arguments
about integers, not the other way around.

Of course the real thing is different, since the exponent
usually is not base 10, but base 2, but the principle is the
same.

I get your point but jack said: "Outside that range, I don't think you
can find two consecutive integers that can be precisely represented."

You missed "...consecutive ..." I think.

Rob.
--
http://www.victim-prime.dsl.pipex.com/

Jul 19 '05 #8

Karl Heinz Buchegger

Rob Williscroft wrote:

0.3 can't be represented exactly as an integer multiplied by a
power of 2; 30 can. My point here is that your argument is
backwards: a floating point value is an integer
multiplied by some other integer (usually 2 but possibly 10)
raised to the power of yet another integer. I.e. you can
infer things about a non-integer by making reasoned arguments
about integers, not the other way around.

I'm not sure I understand what you are trying to say to me.
My point is this (it's hard for me to express all of this, since
I am not a native English speaker, so be patient):

the mantissa can be seen as the sum of fractions.
Thus you need to find the coefficients for

$$\sum_{i=0}^{N} \mathrm{bit}_i \cdot \frac{1}{2^i}$$

with N = sizeof( double ) * 8 (assuming 8 bits per byte)

such that this sum equals the requested floating point number.

e.g.

$$0.3 = 0 \cdot \frac{1}{2} + 1 \cdot \frac{1}{4} + 0 \cdot \frac{1}{8} + 0 \cdot \frac{1}{16} + 1 \cdot \frac{1}{32} + \dots$$

Thus the bit sequence for 0.3 starts with 01001....
I don't know if this repeated summing finally ends up at 0.3
(within the limited bits of double), but if it does not
(replace 0.3 with some other number if necessary), then all
2-multiples (0.6, 1.2, 2.4, 4.8, 9.6, ... ) of that number will
also not be representable.

Ohh. Now I see where I have gone wrong and what you mean.
I started with some number and tried to end up at an
integer by doubling. I tried to base my doubt on the fact
that not all numbers in the range [0 .. 1) are representable by
a sum_of_fractions.
But in fact the opposite is happening.
I should have started with an integer; by halving that (and
incrementing the exponent) I will always end up with a number
in the range [0 .. 1) where such a sum_of_fractions exists.

mantissa   exponent
3.0        0
1.5        1
0.75       2

0.75 = 0.5 + 0.25

$$\left(\frac{1}{2} + \frac{1}{4}\right) \cdot 2^{2} = 3.0$$

Kevin was right and I was wrong.
Thanks for making me think, although
I should have done that before replying.

--
Karl Heinz Buchegger
kb******@gascad.at

Jul 19 '05 #9

Rob Williscroft

Rob Williscroft wrote in news:Xns93FA6F98C7E3EukcoREMOVEfreenetrtw@
195.129.110.130:

I get your point but jack said: "Outside that range, I don't think you
can find two consecutive integers that can be precisely represented."

My apologies to Kevin and Jack; for some reason I thought Karl was
responding to Jack and not Kevin, and attributed Kevin's statements
to Jack.

Rob.
--
http://www.victim-prime.dsl.pipex.com/

Jul 19 '05 #10

Kevin Goodsell

Karl Heinz Buchegger wrote:

I doubt that.
The only difference between 0.3 and 30 as a floating point number
is the exponent.

0.3 is stored (of course in binary) as 0.3 E 0
30.0 is stored as 0.3 E 2

so if the same mantissa is used and we know that 0.3
cannot be represented exactly by a binary floating point
number, then 30 also cannot be represented exactly.
Of course the real thing is different, since the exponent
usually is not base 10, but base 2, but the principle is the
same.

It looks like you've worked it out (I didn't totally follow the
discussion), but let me explain my logic, and show how integers are
represented in IEEE doubles.

First, the IEEE double is 64 bits: 1 sign bit, 11 exponent bits, and 52
mantissa bits. The mantissa actually has one more 'implied' bit - its
value is always 1, and it is logically placed to the left of the other
52 bits. Now, integer value 1 is represented like this:

1|.0000[...]0000

Where the 1 is not actually stored, but is implied (represented here by
using a vertical bar to separate it from the physical bits). All the
physical mantissa bits are zero ([...] is used so I don't have to type
all 52 bits), and the binary point (represented with a dot) is placed
between the implied 1 bit and the 'real' mantissa bits. The exponent
isn't shown, but it is implied by the position of the binary point (you
run out of mantissa long before you run out of exponent, so there's no
need to watch for exponent overflow in this example).

Moving on, the integer 2 is represented this way:

1|0.0000[...]0000

Only the exponent (position of the binary point) changes. 3 looks like this:

1|1.0000[...]0000

4-9:

1|00.000[...]0000
1|01.000[...]0000
1|10.000[...]0000
1|11.000[...]0000
1|000.00[...]0000
1|001.00[...]0000

If you continue this for a very, very long time, you arrive at this:

1|1111[...]1111.

All the mantissa bits are set to 1. This isn't quite the end, though.
You can add one more to get this:

1|0000[...]0000|0.

Now there's an implied 0 on the right side of the mantissa. Actually,
there's a lot of implied zeros over there, they just haven't come into
play until now. This should give the value I listed as the upper limit
(pow(2, 53)).

At this point, you can't add 1 and get anything different. It would
require flipping the implied 0 bit, which you can't do. You can add 2
and get this, though:

1|0000[...]0001|0.

There are obviously many more integers that can be precisely
represented, but none of them are contiguous - given two contiguous
integers, one must have its least significant bit set to 1 (in other
words, must be odd), but all integers above this point use an implied 0
bit for their least significant bit. As you go higher, more implied 0
bits are used, making the distance from one exact integer to the next
even greater. For a while, you have only even integers. Then, only
integers that are divisible by 4, then 8, 16, etc. (until you run out of
exponent, a very long time later).

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.

Jul 19 '05 #11
