Bytes | Software Development & Data Engineering Community
Floating Point and Wide Registers

9899:1999 5.1.2.3 Example 4 reads:
"EXAMPLE 4 Implementations employing wide registers have to take care
to honor appropriate semantics. Values are independent of whether they
are represented in a register or in memory. For example, an implicit
spilling of a register is not permitted to alter the value. Also, an
explicit store and load is required to round to the precision of the
storage type. In particular, casts and assignments are required to
perform their specified conversion. For the fragment

double d1, d2;
float f;
d1 = f = expression;
d2 = (float) expression;

the values assigned to d1 and d2 are required to have been converted to
float."

The output of the following program is:

d3 != d1 * d2
d3 != (double) (d1 * d2)
fdim == 0

I expected an output of

d3 != d1 * d2
d3 == (double) (d1 * d2)
fdim == 0

Here is the program:

#include <math.h>
#include <stdio.h>

int main(void) {
    double d1, d2, d3;
    d1 = 0.1;
    d2 = 10.0;
    d3 = d1 * d2;

    /* First part */
    if (d3 == d1 * d2)
        puts("d3 == d1 * d2");
    else
        puts("d3 != d1 * d2");

    /* Second part */
    if (d3 == (double) (d1 * d2))
        puts("d3 == (double) (d1 * d2)");
    else
        puts("d3 != (double) (d1 * d2)");

    /* Third part */
    if (fdim(d3, d1 * d2) == 0)
        puts("fdim == 0");
    else
        puts("fdim != 0");

    return 0;
}

It was compiled with gcc using -Wall -W -std=c99 -pedantic.

I understand the pitfalls of floating point arithmetic and I understand
what is going on here. On my machine (x86) floating point arithmetic
is performed in 80-bit registers and doubles are 64-bits. In the first
example the compiler is computing the result of the multiplication in
an 80-bit register and comparing the result to the double with less
precision. The result is not unexpected because d3 lost some precision
when it was stored into a 64-bit object but the result of the
multiplication did not undergo this loss. I don't have a problem with
this, it is AFAICT Standard conforming.
The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed. This does not appear to be happening. If I use
the gcc option -ffloat-store the result is as expected but this
shouldn't be required in a Standard-conforming mode.
The result of the last part of the program shows that when the result
of "d1 * d2" is actually converted to a double, it compares equal to
d3.

So my question is: Is my interpretation correct and are the results of
the second two parts guaranteed? If not, where did I go wrong?

Robert Gamble

Aug 21 '06

Dik T. Winter wrote:
In article <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:
"Dik T. Winter" <Di********@cwi.nl> writes:
In article <ln************@nuthaus.mib.org> Keith Thompson
<ks***@mib.org> writes: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.

Required by the C standard.
>
Where is that stated?

5.2.4.2.2 where the model is defined. 1.0 is a number in the model.
The actual representation may have numbers in addition to the model
numbers, but the model numbers are required. See also the definition
of FLT_EPSILON.
There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code necessarily involves converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 23 '06 #51

Douglas A. Gwyn wrote:
Robert Gamble wrote:
... Is the following guaranteed:
double d1 = 0.1;
double d2 = d1;
d1 == d2; /* always true? */

I don't think it's guaranteed, even if the declarations were
volatile-qualified (to prevent register caching). However,
it's hard to imagine code in that case that would fail the test.
Exactly what part of the standard leaves it not guaranteed? It is hard
for me to imagine a case where the equality comparison does not hold.
The given code should differ from

double d1 = 0.1;
double d2 = 0.1;
d1 == d2;
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 23 '06 #52
"Dik T. Winter" <Di********@cwi.nl> wrote:
In article <44***************@news.xs4all.nl> rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
"Robert Gamble" <rg*******@gmail.com> wrote:
...
Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.
>
Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.

FLT_DIG is required to be at least 6.
I know. Count the decimals :-)

Richard
Aug 23 '06 #53

Robert Gamble wrote:
>
I think this part still stands. What I really care about is that the
following never aborts for any ordered double values of d1 and d2:

if (d1 < d2) {
    if (d1 < d2)
        ;
    else
        abort();
}

This would allow for sorting arrays of floating point values; is this
guaranteed?
I think so; but I still wonder if somebody can imagine a conforming
implementation where this does not hold. As you know, if you replace
d1 and d2 with fp constants (even when they are the same constant)
the result can differ; this is not intuitive, but it is what the
standard says.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 23 '06 #54

Robert Gamble wrote:
Douglas A. Gwyn wrote:
Robert Gamble wrote:
... Is the following guaranteed:
double d1 = 0.1;
double d2 = d1;
d1 == d2; /* always true? */
I don't think it's guaranteed, even if the declarations were
volatile-qualified (to prevent register caching). However,
it's hard to imagine code in that case that would fail the test.

First off I'd like to thank you and everyone else who has contributed
to this thread, your patience and insights have been valuable and are
appreciated.
I accept the fact that, notwithstanding IEEE compliance, 0.1==0.1 is not
guaranteed and all of the related points that lead to such a
conclusion. What I don't understand though is how the above example
isn't guaranteed, even without the volatile qualifier. In my
understanding the value of 0.1 is stored, either exactly or rounded in
an implementation-defined way, as a double value in d1. How can
additional rounding occur when d2 is then assigned the value of d1? I
really can't think of an allowable scenario where this could be the
case. I understand that:
d1 = 0.1; d2 = 0.1;
may not result in d1 and d2 having values that compare equal but this
is because there is the potential for rounding to occur twice with the
results being different each time. Similar to my original example, I
would think that "d2 = d1 = 0.1;" would also result in values for d1
and d2 that must compare equal.
Right.
Not only that, but given the apparent
guarantees of 6.3.1.5, I would think that the following is also always
true:
float f1 = 0.1;
double d1 = f1;
f1 == d1;
The Standard seems pretty clear about this, or am I misinterpreting
something here?
Your analysis is correct. A careful reading of the relevant
sections shows that equality must hold.

Aug 24 '06 #55
Jun Woong wrote:
Dik T. Winter wrote:
In article <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:
"Dik T. Winter" <Di********@cwi.nl> writes:
In article <ln************@nuthaus.mib.org> Keith Thompson
<ks***@mib.org> writes: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.

Required by the C standard.
>
Where is that stated?
5.2.4.2.2 where the model is defined. 1.0 is a number in the model.
The actual representation may have numbers in addition to the model
numbers, but the model numbers are required. See also the definition
of FLT_EPSILON.

There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.
6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being converted can be represented exactly in the new type,
it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?

Robert Gamble

Aug 24 '06 #56
Robert Gamble wrote:
I understand the pitfalls of floating point arithmetic and I understand
what is going on here. On my machine (x86) floating point arithmetic
is performed in 80-bit registers and doubles are 64-bits. In the first
example the compiler is computing the result of the multiplication in
an 80-bit register and comparing the result to the double with less
precision. The result is not unexpected because d3 lost some precision
when it was stored into a 64-bit object but the result of the
multiplication did not undergo this loss. I don't have a problem with
this, it is AFAICT Standard conforming.
The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed.
The only difference between your two examples is that one has an
"implicit conversion" and the other an "explicit conversion".

Given the quoted clause below (emphasis mine), I don't believe you are
justified in believing either can give a result the other can't.

Neil.
6.3 Conversions
1 Several operators convert operand values from one type to another
automatically. This subclause specifies the result required from such an
implicit conversion, *as well as* those that result from a cast
operation (an explicit conversion). The list in 6.3.1.8 summarizes the
conversions performed by most ordinary operators; it is supplemented as
required by the discussion of each operator in 6.5.
2 *Conversion* of an operand value to a compatible type causes no change
to the value or the representation.
Aug 24 '06 #57

Robert Gamble wrote:
Jun Woong wrote:
[...]

There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.

6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being
converted can be represented exactly in the new type, it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?
Yes, I think so. My point was that there is no guarantee for an
implementation to represent 1.0 exactly even if the standard's fp number
model has 1.0 in it. I meant to exclude 1 (an integer constant) with
the phrase "with the fp number model."
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 24 '06 #58

Jun Woong wrote:
Robert Gamble wrote:
Jun Woong wrote:
[...]
>
There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.
6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being
converted can be represented exactly in the new type, it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?

Yes, I think so. My point was that there is no guarantee for an
implementation to represent 1.0 exactly even if the standard's fp number
model has 1.0 in it. I meant to exclude 1 (an integer constant) with
the phrase "with the fp number model."
I think we missed an important fact about the fp number model. My
answers, and possibly some of others', are true only when an
implementation strictly conforms to the fp number model the standard
provides. You might think that if it does not conform to the model
then it should be a non-conforming implementation, but that's not the
case. (The following is not the only evidence; I remember that a
committee member working on the fp area of the standard confirmed the
intent.)

DR025 for C90 says in part:

- Implementations are allowed considerable latitude in the way they
represent floating-point quantities; in particular, as noted in
Footnote 10 on page 14, the implementation need not exactly conform
to the model given in subclause 5.2.4.2.2 for ``normalized floating-
point numbers.''

and also from DR233 for C99:

- If there is no implementation representation of ZERO, but rather a
very small number. In this case, we generally thought that this was
a user problem, that they could not rely on a true ZERO having a
representation, in which case, they would need to place their own
checks for what approximations were acceptable as ZERO and print a
literal instead.

Of course, for zero the answers to DR025 and DR233 seem to conflict,

- (from DR025) There shall be at least one exact representation for
the value zero.

but the committee didn't forget to note in DR025 that:

- they[the principles some of which are quoted above] are not meant
to impose additional constraints on conforming implementations

So even if the fp number model should include 1.0, a conforming
implementation is allowed to have no way to represent it exactly,
so on such an implementation the above test for equality need not
hold.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 25 '06 #59

ku****@wizard.net wrote:
Robert Gamble wrote:
Richard Bos wrote:
"Robert Gamble" <rg*******@gmail.com> wrote:
...
The number 1 can be exactly represented according to the model
described in 5.2.4.2.2 using any radix (b) since an exponent (e) of
zero must be allowed and b^e is 1 when e is zero. Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.
>
Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.
How would you represent 2.0 with a radix of 3 in the floating point
model?

As indicated above, 0.200000e+1. In terms of 5.2.4.2.2:

s = +1
b = 3
e = 1
f[1] = 2, all other f[k] = 0

The value given by the formula in 5.2.4.2.2p2 is then

x = s * b^e * (f[1] * b^-1) = (+1) * 3^1 * (2 * 3^-1) = 2.0

For the model defined in 5.2.4.2.2, there do exist values of b, p,
e_min, and e_max such that 2.0 isn't exactly representable: if e_min is
high enough, 2.0 < DBL_MIN; if e_max were low enough, 2.0 >
DBL_EPSILON*DBL_MAX.
Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
general, since we have a greater e_max than p, the precision matters
when inspecting whether or not a positive integer can be represented
with the given radix.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 25 '06 #60
In article <11**********************@m73g2000cwd.googlegroups.com> "Jun Woong" <wo***@icu.ac.kr> writes:
....
Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
general since we have greater e-max than p, the precision matters when
inspecting whether or not a positive integer can be represented with
the given radix.
If e_max >= p, the largest positive integer that is representable is
DBL_MAX. If e_max < p, it is b ** e_max - 1.

To determine whether an arbitrary integer can be represented takes quite
a bit of work. But this is all moot. The largest positive integer
such that all integers from 0 to that integer can be represented is:
b ** p
or FLT_RADIX ** xxx_MANT_DIG when e_max >= p, else it is xxx_MAX.
If e_max >= p and assuming p > 0, the first formula gives at least 2.
So if 2.0 is not representable, we have e_max < p, and the additional
requirement that 2.0 > xxx_MAX.

I ignored e_min in the analysis. But if e_min > p you will find some
pretty strange things. I do not find it in the standard, but apparently
that is allowed. And if that is the case, even 1.0 is not representable.
In fact, if e_min > p, the smallest positive integer that is
representable is b ** (e_min - p).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 26 '06 #61

Robert Gamble wrote:
Jun Woong wrote:
Dik T. Winter wrote:
In article <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:
"Dik T. Winter" <Di********@cwi.nl> writes:
In article <ln************@nuthaus.mib.org> Keith Thompson
<ks***@mib.org> writes: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.

Required by the C standard.
>
Where is that stated?
>
5.2.4.2.2 where the model is defined. 1.0 is a number in the model.
The actual representation may have numbers in addition to the model
numbers, but the model numbers are required. See also the definition
of FLT_EPSILON.
There is a gap between "can be represented exactly in a program" and
"required to be represented exactly in the fp number model." Writing
1.0 in the source code is necessarily involved with converting it to
an internal representation. Since an implementation is free to choose
one of its nearby values, not the exact value even when it can be
represented exactly, no one can assume that 1.0 should be represented
exactly in his/her program in every conforming implementation. It is
true that the fp number model requires 1.0 to belong to the model and
it is very likely to be represented in a practical implementation,
but there is no way to make a fp variable have 1.0 only with the fp
number model the standard provides.

6.3.1.4p2 states in part:
"When a value of integer type is converted to a real floating type, if
the value being
converted can be represented exactly in the new type, it is unchanged."

double d = 1; /* d is now exactly 1.0 */
d == 1; /* Always true */
(double) 1 == (double) 1; /* Always true */
d == 1.0; /* Not guaranteed */

Correct?
Yes.

However, for the last one there is a weaker guarantee:

double a, b, c, d;
a = 1.0, b = 1.0, c = 1.0, d = 1.0;
a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */
if (a<b && b<c) {
b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
}

Aug 26 '06 #62

Dik T. Winter wrote:
In article <11**********************@m73g2000cwd.googlegroups.com> "Jun Woong" <wo***@icu.ac.kr> writes:
...
Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
general since we have greater e-max than p, the precision matters when
inspecting whether or not a positive integer can be represented with
the given radix.

If e_max >= p, the largest positive integer that is representable is
DBL_MAX. If e_max < p, it is b ** e_max - 1.

To determine whether an arbitrary integer can be represented takes quite
a bit of work. But this is all moot. The largest positive integer
such that all integers from 0 to that integer can be represented is:
b ** p
or FLT_RADIX ** xxx_MANT_DIG when e_max >= p, else it is xxx_MAX.
If e_max >= p and assuming p > 0, the first formula gives at least 2.
So if 2.0 is not representable, we have e_max < p, and the additional
requirement that 2.0 > xxx_MAX.

I ignored e_min in the analysis. But if e_min > p you will find some
pretty strange things. I do not find it in the standard, but apparently
that is allowed. And if that is the case, even 1.0 is not representable.
In fact, if e_min > p, the smallest positive integer that is
representable is b ** (e_min - p).
The restrictions in 5.2.4.2.2 #10 imply that e_min <= 0.

Aug 26 '06 #63
In article <11*********************@m79g2000cwm.googlegroups.com> en******@yahoo.com writes:
Dik T. Winter wrote:
....
I ignored e_min in the analysis. But if e_min > p you will find some
pretty strange things. I do not find it in the standard, but apparently
that is allowed. And if that is the case, even 1.0 is not representable.
In fact, if e_min > p, the smallest positive integer that is
representable is b ** (e_min - p).

The restrictions in 5.2.4.2.2 #10 imply that e_min <= 0.
Indeed, I did not look far enough.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 27 '06 #64

Dik T. Winter wrote:
In article <11*********************@m79g2000cwm.googlegroups.com> en******@yahoo.com writes:
Dik T. Winter wrote:
...
I ignored e_min in the analysis. But if e_min > p you will find some
pretty strange things. I do not find it in the standard, but apparently
that is allowed. And if that is the case, even 1.0 is not representable.
In fact, if e_min > p, the smallest positive integer that is
representable is b ** (e_min - p).
>
The restrictions in 5.2.4.2.2 #10 imply that e_min <= 0.

Indeed, I did not look far enough.
Well, it's easy to overlook. The first time I read
through it I reached the same conclusion you did.

Aug 27 '06 #65

Dik T. Winter wrote:
In article <11**********************@m73g2000cwd.googlegroups.com> "Jun Woong" <wo***@icu.ac.kr> writes:
...
Should it be 2.0 > DBL_MAX / (FLT_RADIX-DBL_EPSILON) to be precise? In
general since we have greater e-max than p, the precision matters when
inspecting whether or not a positive integer can be represented with
the given radix.

If e_max >= p, the largest positive integer that is representable is
DBL_MAX. If e_max < p, it is b ** e_max - 1.
I was talking about an integer such that all integers from 1 to that
integer (inclusive) can be represented exactly with the given b, e_max,
e_min and p, which the following deals with.
To determine whether an arbitrary integer can be represented takes quite
a bit of work. But this is all moot. The largest positive integer
such that all integers from 0 to that integer can be represented is:
b ** p
or FLT_RADIX ** xxx_MANT_DIG when e_max >= p,
n should be (b ** p) - 1 if considering integers from 1 to n
inclusive.
else it is xxx_MAX.
However, if e_max < p, then xxx_MAX == (1 - b**(-p)) * b**e_max is not
an integer. It should be (b ** e_max) - 1 or
xxx_MAX / (1 - xxx_EPSILON/FLT_RADIX) - 1

The factor (FLT_RADIX-DBL_EPSILON)**(-1) in my previous post came from
my mistake made in handling the floor function.

Thanks.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 28 '06 #66

en******@yahoo.com wrote:
[...]
However, for the last one there is a weaker guarantee:

double a, b, c, d;
a = 1.0, b = 1.0, c = 1.0, d = 1.0;
a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */
Correct whether or not the implementation's fp number model follows
the standard's.
if (a<b && b<c) {
b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
}
I think two more assumptions are necessary here, about:
- the accuracy of the subtraction operation
- the implementation's following the standard's fp number model

Allowing for implementations which do not follow the standard's fp
number model makes many things in this area vague. I doubt that
consideration for such poor implementations (if any) is still
necessary.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 28 '06 #67
In article <11**********************@74g2000cwt.googlegroups.com> "Jun Woong" <wo***@icu.ac.kr> writes:
Dik T. Winter wrote:
In article <11**********************@m73g2000cwd.googlegroups.com> "Jun Woong" <wo***@icu.ac.kr> writes:
....
To determine whether an arbitrary integer can be represented takes quite
a bit of work. But this is all moot. The largest positive integer
such that all integers from 0 to that integer can be represented is:
b ** p
or FLT_RADIX ** xxx_MANT_DIG when e_max >= p,

n should be (b ** p) - 1 if considering integers from 1 to n
inclusive.
Why? b ** p - 1 is representable, as is b ** p. So, I think that
b ** p should be included.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 28 '06 #68

Dik T. Winter wrote:
[...]
>
Why? b ** p - 1 is representable, as is b ** p. So, I think that
b ** p should be included.
Oops, you are right. I missed e_max >= p, so b ** p is also
representable.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Aug 29 '06 #69

Jun Woong wrote:
en******@yahoo.com wrote:
[...]
However, for the last one there is a weaker guarantee:

double a, b, c, d;
a = 1.0, b = 1.0, c = 1.0, d = 1.0;
a==b || a==c || a==d || b==c || b==d || c==d; /* guaranteed */

Correct whether or not the implementation's fp number model follows
the standard's.
if (a<b && b<c) {
b==1 && c-b==DBL_EPSILON; /* guaranteed when a<b && b<c */
}

I think two more assumptions are necessary here, about:
- the accuracy of the subtraction operation
- the implementation's following the standard's fp number model
1. The assumption that the implementation follows the standard's
fp number model isn't necessary. My comment does assume that 1
is exactly representable, but beyond that any number model will
work.

2. The accuracy of subtraction could indeed be arbitrarily
bad. However, DBL_EPSILON is defined as the difference
between 1 and the smallest double value greater than 1.
If that is taken to mean the subtraction as the implementation
performs it, then the second half of the guarantee is true by
definition.
Allowing for implementations which do not follow the standard's fp
number model makes many things in this area vague. I doubt that
consideration for such poor implementations (if any) is still
necessary.
And the assumptions here are weaker even than that, only that 1
is exactly representable, the difference between 1 and 1+DBL_EPSILON
is exactly representable, and if the result of a subtraction is
exactly representable then the subtraction yields that value.

Sep 3 '06 #70
en******@yahoo.com wrote:
Jun Woong wrote:
[...]

I think two more assumptions necessary here about:
- the accuracy of the subtraction operation
- the implementation's following the standard's fp number model
[...]
And the assumptions here are weaker even than that, only that 1
is exactly representable, the difference between 1 and 1+DBL_EPSILON
is exactly representable, and if the result of a subtraction is
exactly representable then the subtraction yields that value.
Yes.

I didn't mean to give the precise assumptions needed to make your
argument true; what I listed are sufficient conditions for it.

One thing to note is that the fp number model and the accuracy of the
subtraction operation are separate; that is, the definition of
*_EPSILON does not restrict the result of x - 1 to be *_EPSILON, where
x denotes the number succeeding 1 on the representable fp number line.
--
Jun, Woong (woong at icu.ac.kr)
Samsung Electronics Co., Ltd.

``All opinions are mine and do not represent any organization''

Sep 4 '06 #71
