Floating Point and Wide Registers

9899:1999 5.1.2.3 Example 4 reads:
"EXAMPLE 4 Implementations employing wide registers have to take care
to honor appropriate semantics. Values are independent of whether they
are represented in a register or in memory. For example, an implicit
spilling of a register is not permitted to alter the value. Also, an
explicit store and load is required to round to the precision of the
storage type. In particular, casts and assignments are required to
perform their specified conversion. For the fragment

double d1, d2;
float f;
d1 = f = expression;
d2 = (float) expression;

the values assigned to d1 and d2 are required to have been converted to
float."

The output of the following program is:

d3 != d1 * d2
d3 != (double) (d1 * d2)
fdim == 0

I expected an output of

d3 != d1 * d2
d3 == (double) (d1 * d2)
fdim == 0

Here is the program:

#include <math.h>
#include <stdio.h>

int main (void) {
    double d1, d2, d3;
    d1 = 0.1;
    d2 = 10.0;
    d3 = d1 * d2;

    /* First part */
    if (d3 == d1 * d2)
        puts("d3 == d1 * d2");
    else
        puts("d3 != d1 * d2");

    /* Second part */
    if (d3 == (double) (d1 * d2))
        puts("d3 == (double) (d1 * d2)");
    else
        puts("d3 != (double) (d1 * d2)");

    /* Third part */
    if (fdim(d3, d1 * d2) == 0)
        puts("fdim == 0");
    else
        puts("fdim != 0");

    return 0;
}

It was compiled with gcc using -Wall -W -std=c99 -pedantic

I understand the pitfalls of floating point arithmetic and I understand
what is going on here. On my machine (x86) floating point arithmetic
is performed in 80-bit registers and doubles are 64 bits. In the first
example the compiler is computing the result of the multiplication in
an 80-bit register and comparing the result to the double with less
precision. The result is not unexpected because d3 lost some precision
when it was stored into a 64-bit object but the result of the
multiplication did not undergo this loss. I don't have a problem with
this, it is AFAICT Standard conforming.
The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed. This does not appear to be happening. If I use
the gcc option -ffloat-store the result is as expected but this
shouldn't be required in a Standard-conforming mode.
The result of the last part of the program shows that when the result
of "d1 * d2" is actually converted to a double, it compares equal to
d3.

So my question is: Is my interpretation correct and are the results of
the second two parts guaranteed? If not, where did I go wrong?

Robert Gamble

Aug 21 '06 #1
Robert Gamble schrieb:
<snip: Cast to double should force the value to type double
and appropriate truncation but does not>

So my question is: Is my interpretation correct and are the results of
the second two parts guaranteed? If not, where did I go wrong?
Your interpretation is correct; in fact, even -ffloat-store is not
always enough to force gcc to conforming behaviour; see, for
example, a similar question by myself,
http://groups.google.de/group/comp.l...d30d2f1380f0f1

This is a truly annoying feature of gcc.
Cheers
Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.
Aug 21 '06 #2
Robert Gamble wrote:
<snip: original post quoted in full>
The output for the lcc-win32 compiler is:

d3 != d1 * d2 // OK
d3 != (double) (d1 * d2) // This is the problem
fdim == 0 // OK
Why?

Because the result of d1 * d2 is already of type double, and the
compiler (probably like gcc) thinks that any cast to the same type
is a NOP (no operation).

If you write:
int a,b;

b = (int)((int)a + (int)b);

the same code will be generated as if you had written:

b = a+b;

You are correct that this is not what the compiler should do, and it is
a bug; a cast to double should do a cast.

If you change the comparison to

if (d3 == (double) (float)(d1 * d2))

THEN the result is

d3 == (double) (d1 * d2);

because lcc-win32 forces the cast since float != double.
Aug 21 '06 #3
I've found it best to NEVER expect floating-point equality or
non-equality.

A goodly bit of the time whoever writes such a test isn't up on all the
little jiggles that can happen with fp arithmetic. And one is rarely
actually looking for equality, more likely would be satisfied with
being within some small delta.
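
A common shape for such a test, as a sketch (the scale factor here is
illustrative, not a universal constant; pick a tolerance suited to the
data):

#include <math.h>
#include <float.h>

/* Sketch: compare doubles within a relative tolerance instead of ==. */
static int nearly_equal(double a, double b)
{
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= 4 * DBL_EPSILON * scale;
}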

Aug 21 '06 #4
In article <11**********************@b28g2000cwb.googlegroups.com>,
Ancient_Hacker <gr**@comcast.net> wrote:
>I've found it best to NEVER expect floating-point equality or
non-equality.
>A goodly bit of the time whoever writes such a test isnt up on all the
little jiggles that can happen with fp arithmetic. And one is rarely
actually looking for equality, more likely would be satisfied with
being within some small delta.
I was thinking briefly about this matter the other day, and trying
to decide whether sort() could be counted upon. If floating point
equality tests are dubious, are relative relationship tests not
dubious as well?

--
Programming is what happens while you're busy making other plans.
Aug 21 '06 #5
ro******@ibd.nrc-cnrc.gc.ca (Walter Roberson) writes:
In article <11**********************@b28g2000cwb.googlegroups.com>,
Ancient_Hacker <gr**@comcast.net> wrote:
>>I've found it best to NEVER expect floating-point equality or
non-equality.
>>A goodly bit of the time whoever writes such a test isnt up on all the
little jiggles that can happen with fp arithmetic. And one is rarely
actually looking for equality, more likely would be satisfied with
being within some small delta.

I was thinking briefly about this matter the other day, and trying
to decide whether sort() could be counted upon. If floating point
equality tests are dubious, are relative relationship tests not
dubious as well?
It *should* be safe to assume that, for any two floating-point values
x and y, exactly one of the following is true:

x < y
x == y
x > y

.... as long as neither x nor y is a NaN or Infinity (or possibly some
other exotic pseudo-value).

For example, given:

double x = 1.0;
double y = 1.0;
x /= 7.0;
x *= 7.0;

I'd be comfortable assuming that exactly one of the above conditions
holds, but I wouldn't want to make any assumptions about which one.

On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly. I'd be very surprised
if, for example, (0.5 * 2.0 != 1.0). But I suspect most
floating-point calculations aren't limited to such well-behaved
values.

What the C standard actually guarantees is a different matter.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Aug 21 '06 #6
Robert Gamble wrote:
The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed. ...
When the "same" mathematical value is computed twice, the
results need not be the same, so long as each result is one
of the two nearest representable values to the "true" value.
Thus, not even 0.1==0.1 is guaranteed. Presumably your
compiler takes advantage of that to retain the "spurious"
precision represented in the guard bits in some situations
and not in others.

If you have set the compiler into an IEEE-conforming mode
(__STDC_IEC_559__) then the guard bits are not supposed to be
kept in the result of each operation (F.2: representation of
the type of each result is required to conform to the IEEE
spec, correctly rounded), and the default rounding mode is
round-to-nearest. (F.2 somewhat contradicts F.7.3 with
respect to dynamic rounding precision.)

As somebody else pointed out, testing for exact equality is
generally unwise.
Aug 21 '06 #7
On Mon, 21 Aug 2006 22:36:55 +0200, Michael Mair wrote:
Robert Gamble schrieb:
<snip: Cast to double should force the value to type double
and appropriate truncation but does not>

Your interpretation is correct; in fact, even -ffloat-store is not
always enough to force gcc to conforming behaviour; see, for example, a
similar question by myself,
http://groups.google.de/group/comp.l...d30d2f1380f0f1

This is a truly annoying feature of gcc.
Thanks for the response. It's unfortunate that gcc doesn't follow the
Standard in this regard, the -ffloat-store option inhibits storing any
floating point variables in registers which is overkill for those who just
want explicit narrowing conversions to be honored. Hopefully this will be
addressed before c99-compliance is deemed complete but I won't hold my
breath.
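
A narrower workaround is to force the particular stores yourself; a
minimal sketch (the volatile temporaries are illustrative, not from the
thread). A volatile double object forces a real store, which narrows
the 80-bit intermediate to a 64-bit double on x87:

#include <stdio.h>

int main(void)
{
    double d1 = 0.1, d2 = 10.0;
    volatile double d3 = d1 * d2;      /* real store: rounds to double */
    volatile double product = d1 * d2; /* ditto */

    if (d3 == product)
        puts("d3 == product");
    else
        puts("d3 != product");
    return 0;
}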

Robert Gamble
Aug 21 '06 #8
On Mon, 21 Aug 2006 22:50:07 +0200, jacob navia wrote:
Robert Gamble wrote:
<snip: original post quoted in full>
The output for the lcc-win32 compiler is:

d3 != d1 * d2 // OK
d3 != (double) (d1 * d2) // This is the problem
fdim == 0 // OK
Why?

Because the result of d1 * d2 is already of type double, and the
compiler (probably like gcc) thinks that any cast to the same type
is a NOP (no operation).
That's what I figured.
If you write:
int a,b;

b = (int)((int)a + (int)b);

the same code will be generated as if you had written:

b = a+b;
But in this case that's okay because the relevant issues involved with
floating point don't apply to integer arithmetic.
You are correct that this is not what the compiler should do, and it is
a bug; a cast to double should do a cast.
Either gcc doesn't properly identify the fact that not applying the cast
can result in different behavior or they just don't care. In any case it
would be nice if they either fixed this or at least better advertised the
fact that even in a conforming mode this isn't handled properly and
directed users to the -ffloat-store option.
If you change the comparison to

if (d3 == (double) (float)(d1 * d2))

THEN the result is

d3 == (double) (d1 * d2);

because lcc-win32 forces the cast since float != double.
Assuming that float has less precision than double, this shouldn't be the
case in general. The cast to float may cause precision to be permanently
lost making its value no longer equivalent to d3 which was not cast to
float.
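
For example (my illustration, not from the post), the permanent loss is
easy to observe on an implementation with IEEE single and double:

#include <stdio.h>

int main(void)
{
    double d = 0.1;      /* 0.1 rounded once, to double */
    float  f = (float)d; /* rounded again, to float: low bits are gone */

    if ((double)f == d)
        puts("no precision lost");
    else
        puts("precision permanently lost by the float cast");
    return 0;
}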

Robert Gamble
Aug 21 '06 #9
On Mon, 21 Aug 2006 21:55:16 +0000, Walter Roberson wrote:
In article <11**********************@b28g2000cwb.googlegroups.com>,
Ancient_Hacker <gr**@comcast.net> wrote:
>>I've found it best to NEVER expect floating-point equality or
non-equality.
>>A goodly bit of the time whoever writes such a test isnt up on all the
little jiggles that can happen with fp arithmetic. And one is rarely
actually looking for equality, more likely would be satisfied with
being within some small delta.

I was thinking briefly about this matter the other day, and trying
to decide whether sort() could be counted upon. If floating point
equality tests are dubious, are relative relationship tests not
dubious as well?
This is actually what precipitated my original post. I wrote a routine a
while ago to sort an array of doubles but performed arithmetic on the
values and used the result for the sort order. A simplified example would
look something like this:

double da1[10];
double da2[10];
int x, y;
....
if (da1[x]/da2[x] <= da1[y]/da2[y])
/* sorting code */
else
/* sorting code */

The problem was that given all the same values, the result of
evaluating the conditional expression could be different depending on
which values were kept in the wide register and which were narrowed in a
particular evaluation. This of course would cause the sort function to fail
miserably. This is not at all unexpected and given the range of the
possible values in the array was fixed by replacing the if statement with
something like:

if (da1[y]/da2[y] - da1[x]/da2[x] > 0.001)
...

In general though, this isn't a great solution and I think that properly
placed casts should work, provided the implementation conforms to C99
(which gcc apparently does not in this case).

To sort an array of doubles based solely on the value of each element
though, I think a simple comparison would work properly regardless of
what values are stored where during the evaluation of the expression given
that all the values are ordered. In other words, if "(da1[0] < da1[1])" is
true one time, it must be true every time it is evaluated if neither
element has been modified between evaluations. Using the fmin/fmax
functions provided in C99 should work for all values, ordered or not. I
can't think of any reason this wouldn't work.
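
A sketch of the precompute-once approach (the identifiers are mine, not
from the thread): computing each key into a double object before sorting
means every comparison sees the same value, rounded exactly once.

#include <stdlib.h>

static int cmp_keys(const void *pa, const void *pb)
{
    double a = *(const double *)pa;
    double b = *(const double *)pb;
    return (a > b) - (a < b);
}

void sort_ratios(const double *da1, const double *da2,
                 double *keys, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        keys[i] = da1[i] / da2[i]; /* the store rounds the key to double */
    qsort(keys, n, sizeof keys[0], cmp_keys);
}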

Robert Gamble
Aug 22 '06 #10
On Mon, 21 Aug 2006 23:09:15 +0000, Douglas A. Gwyn wrote:
Robert Gamble wrote:
>The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed. ...

When the "same" mathematical value is computed twice, the
results need not be the same, so long as each result is one
of the two nearest representable values to the "true" value.
Thus, not even 0.1==0.1 is guaranteed. Presumably your
compiler takes advantage of that to retain the "spurious"
precision represented in the guard bits in some situations
and not in others.
This is something I wasn't sure about. Is the following guaranteed:

double d1 = 0.1;
double d2 = d1;
d1 == d2; /* always true? */

Robert Gamble
Aug 22 '06 #11
In article <11**********************@b28g2000cwb.googlegroups.com> "Ancient_Hacker" <gr**@comcast.net> writes:
I've found it best to NEVER expect floating-point equality or
non-equality.
Depends.
A goodly bit of the time whoever writes such a test isnt up on all the
little jiggles that can happen with fp arithmetic.
I think that Robert Gamble knew what he was doing. And he encountered a
bug in gcc.

Sometimes it is critical to know what is happening precisely.

Note that problems like the ones Robert Gamble encountered are
long-standing in gcc. The standard tightened it, but the compilers did
not follow. Getting results in a precision larger than expected can
wreak havoc in numerical mathematics. When, a long, long time ago, I
wrote some functions to calculate some elementary functions (like the
arcsine), the code heavily depended on results rounded to the precision
used. But to get to the point: it is simple to double the precision you
are using when computing in floating point. The algorithms are due to
T. J. Dekker, and work when floating point satisfies some requirements
that IEEE arithmetic does satisfy. But they only work if indeed the
precision of intermediate expressions is the same as for the base type.
Consider:
void xpadd(double a, double b, double *c, double *cc) {
    *c = a + b;
    /* recover the rounding error of the addition; this only works if
       the cast really rounds the sum to double */
    *cc = (a > b ? a - (double)(a + b) + b : b - (double)(a + b) + a);
}
This will *not* work if a compiler does not honour the cast, but still
uses a wider precision. (And I wonder how Jacob Navia would handle this.)
(You may ponder what that routine means. Simply: after execution, c is
the sum of a and b properly rounded, and mathematically a + b = c + cc.)
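
For illustration (my example, not from the post), a pair where rounding
must occur, so cc is nonzero and a + b == c + cc exactly:

#include <stdio.h>

/* assumes xpadd() as above */
int main(void)
{
    double a = 9007199254740992.0; /* 2^53: adding 1.5 must round */
    double b = 1.5;
    double c, cc;
    xpadd(a, b, &c, &cc);
    printf("c = %.17g, cc = %.17g\n", c, cc); /* cc recovers the error */
    return 0;
}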

Yes, some people do know what they mean when they require equality of
floating point numbers.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #12
Ancient_Hacker wrote:
>
I've found it best to NEVER expect floating-point equality or
non-equality.

A goodly bit of the time whoever writes such a test isnt up on all
the little jiggles that can happen with fp arithmetic. And one
is rarely actually looking for equality, more likely would be
satisfied with being within some small delta.
With gcc one can even tell it to warn on any such usage.

--
Chuck F (cb********@yahoo.com) (cb********@maineline.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE maineline address!
Aug 22 '06 #13
In article <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:
....
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.
Required by the C standard.
I'd be very surprised
if, for example, (0.5 * 2.0 != 1.0).
Want to know about that ternary Russian machine? (Yes, there was a series
produced.)
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #14
On Mon, 21 Aug 2006 15:47:44 -0400, Robert Gamble wrote:
<snip: quoted text elided>
In general though, this isn't a great solution and I think that properly
placed casts should work, provided the implementation conforms to C99
(which gcc apparently does not in this case).
As Doug pointed out else-thread, this isn't even guaranteed if the
implementation isn't in IEEE-conforming mode (which it isn't required to
support). I consider this a QOI issue.
To sort an array of doubles based solely on the value of each element
though, I think a simple comparison would work properly regardless of
what values are stored where during the evaluation of the expression given
that all the values are ordered. In other words, if "(da1[0] < da1[1])" is
true one time, it must be true every time it is evaluated if neither
element has been modified between evaluations. Using the fmin/fmax
functions provided in C99 should work for all values, ordered or not. I
can't think of any reason this wouldn't work.
I think this part still stands. What I really care about is that the
following never aborts for any ordered double values of d1 and d2:

if (d1 < d2) {
    if (d1 < d2)
        ;
    else
        abort();
}

This would allow for sorting arrays of floating point values; is this
guaranteed?

Robert Gamble
Aug 22 '06 #15
"Dik T. Winter" <Di********@cwi.nlwrites:
In article <ln************@nuthaus.mib.orgKeith Thompson
<ks***@mib.orgwrites: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.

Required by the C standard.
Where is that stated?

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Aug 22 '06 #16
In article <pa****************************@gmail.com>, Robert Gamble
<rg*******@gmail.com> writes
>double da1[10];
double da2[10];
int x, y;
...
if (da1[x]/da2[x] <= da1[y]/da2[y])
/* sorting code */
else
/* sorting code */

The problem was that given all the same values, the result of
evaluating the conditional expression could be different depending on
which values were kept in the wide register and which were narrowed in a
particular evaluation. This of couse would cause the sort function to fail
miserably. This is not at all unexpected and given the range of the
possible values in the array was fixed by replacing the if statement with
something like:

if (da1[y]/da2[y] - da1[x]/da2[x] > 0.001)
...
I think there should be a call to fabs() in that expression.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' and "You Can Program in C++"
see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects
Aug 22 '06 #17
In article <ln************@nuthaus.mib.org>, Keith Thompson
<ks***@mib.org> writes
>On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly. I'd be very surprised
if, for example, (0.5 * 2.0 != 1.0). But I suspect most
floating-point calculations aren't limited to such well-behaved
values.
In practice I agree with you, but as a language lawyer I feel it
necessary to point out that a perverse implementation could be using
something such as a base 3 representation of floating point values and
the above would then be false.
--
Francis Glassborow ACCU
Author of 'You Can Do It!' and "You Can Program in C++"
see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects
Aug 22 '06 #18
Francis Glassborow <fr*****@robinton.demon.co.uk> writes:
In article <ln************@nuthaus.mib.org>, Keith Thompson
<ks***@mib.org> writes
>>On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly. I'd be very surprised
if, for example, (0.5 * 2.0 != 1.0). But I suspect most
floating-point calculations aren't limited to such well-behaved
values.

In practice I agree with you, but as a language lawyer I feel it
necessary to point out that a perverse implementation could be using
something such as a base 3 representation of floating point values and
the above would then be false.
Agreed.

As a fellow language lawyer, that's why I wrote that I'd be surprised
if they were unequal, not that they must be equal.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Aug 22 '06 #19
Keith Thompson wrote:
"Dik T. Winter" <Di********@cwi.nlwrites:
In article <ln************@nuthaus.mib.orgKeith Thompson
<ks***@mib.orgwrites: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.
Required by the C standard.

Where is that stated?
The number 1 can be exactly represented according to the model
described in 5.2.4.2.2 using any radix b: take s = +1, e = 1, and
f[1] = 1 (all other f[k] = 0), which gives x = b * (1/b) = 1. Since a
floating point number must be represented exactly if it can be exactly
represented, it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.
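
For reference, the model of 5.2.4.2.2 that this argument uses, restated
in the standard's notation:

x = s \times b^{e} \times \sum_{k=1}^{p} f_k \, b^{-k},
\qquad s = \pm 1,\ 0 \le f_k < b,\ e_{\min} \le e \le e_{\max}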

Robert Gamble

Aug 22 '06 #20
Douglas A. Gwyn wrote:
Robert Gamble wrote:
The part that is unexpected, to me, is the second part where the result
of the multiplication is explicitly cast to double which, according to
my interpretation of the above-quoted Standard verse, requires that the
result is converted to the narrower type of double before the test for
equality is performed. ...

When the "same" mathematical value is computed twice, the
results need not be the same, so long as each result is one
of the two nearest representable values to the "true" value.
Thus, not even 0.1==0.1 is guaranteed.
Where is this stated? 6.4.4.2p3 states in part:
"For decimal floating constants, and also for hexadecimal floating
constants when FLT_RADIX is not a power of 2, the result is either the
nearest representable value, or the larger or smaller representable
value immediately adjacent to the nearest representable value, chosen
in an implementation-defined manner."

There is similar language for converting floating point numbers from
other types.

Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?

Robert Gamble

Aug 22 '06 #21
Robert Gamble wrote:
Douglas A. Gwyn wrote:
<snip: quoted text elided>

When the "same" mathematical value is computed twice, the
results need not be the same, so long as each result is one
of the two nearest representable values to the "true" value.
Thus, not even 0.1==0.1 is guaranteed.

Where is this stated? 6.4.4.2p3 states in part:
"For decimal floating constants, and also for hexadecimal floating
constants when FLT_RADIX is not a power of 2, the result is either the
nearest representable value, or the larger or smaller representable
value immediately adjacent to the nearest representable value, chosen
in an implementation-defined manner."

There is similiar language for converting floating point numbers from
other types.

Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?
Could it not, for example, pick the lower value for left operands,
and the upper value for right operands?

That would be consistent. Stupid, perhaps, but consistent. Or
perhaps I mean "predictable".

--
Chris "seeking brushlike similarity" Dollin
"No-one here is exactly what he appears." G'kar, /Babylon 5/

Aug 22 '06 #22
"Robert Gamble" <rg*******@gmail.comwrote in message
news:11*********************@i42g2000cwa.googlegro ups.com...
Douglas A. Gwyn wrote:
....
>Thus, not even 0.1==0.1 is guaranteed.
....
Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?
Where does the standard say that it must be consistent, or even describe
what "consistent" means in this context? The behaviour is unspecified --
the implementation is free to decide whether to round up or down, separately
for each case. I don't see a requirement that the decision must always be
the same, or that it must be the same for any two constants that have
identical spelling, or identical mathematical value. As long as the
implementation documents how it makes the decision, the requirement of
"implementation-defined" is satisfied.
Aug 22 '06 #23
"Robert Gamble" <rg*******@gmail.comwrote:
Keith Thompson wrote:
"Dik T. Winter" <Di********@cwi.nlwrites:
In article <ln************@nuthaus.mib.orgKeith Thompson
<ks***@mib.orgwrites: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.
>
Required by the C standard.
Where is that stated?

The number 1 can be exactly represented according to the model
described in 5.2.4.2.2 using any radix (b) since an exponent (e) of
zero must be allowed and b^e is 1 when e is zero. Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.
Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.

Richard
Aug 22 '06 #24

Chris Dollin wrote:
Robert Gamble wrote:
Douglas A. Gwyn wrote:
<snip: quoted text elided>

When the "same" mathematical value is computed twice, the
results need not be the same, so long as each result is one
of the two nearest representable values to the "true" value.
Thus, not even 0.1==0.1 is guaranteed.
Where is this stated? 6.4.4.2p3 states in part:
"For decimal floating constants, and also for hexadecimal floating
constants when FLT_RADIX is not a power of 2, the result is either the
nearest representable value, or the larger or smaller representable
value immediately adjacent to the nearest representable value, chosen
in an implementation-defined manner."

There is similiar language for converting floating point numbers from
other types.

Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?

Could it not, for example, pick the lower value for left operands,
and the upper value for right operands?

That would be consistent. Stupid, perhaps, but consistent. Or
perhaps I mean "predictable".
No, the choices are "the larger representable value immediately
adjacent to the nearest representable value" and "the smaller
representable value immediately adjacent to the nearest representable
value"; "Larger for right operands" is not a valid choice. The
implementation needs to choose between which of these behaviors to use
and document it.

Robert Gamble

Aug 22 '06 #25
Robert Gamble wrote:
>
Chris Dollin wrote:
>Robert Gamble wrote:
Douglas A. Gwyn wrote:
Where is this stated? 6.4.4.2p3 states in part:
"For decimal floating constants, and also for hexadecimal floating
constants when FLT_RADIX is not a power of 2, the result is either the
nearest representable value, or the larger or smaller representable
value immediately adjacent to the nearest representable value, chosen
in an implementation-defined manner."

There is similiar language for converting floating point numbers from
other types.

Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?

Could it not, for example, pick the lower value for left operands,
and the upper value for right operands?

That would be consistent. Stupid, perhaps, but consistent. Or
perhaps I mean "predictable".

No, the choices are "the larger representable value immediately
adjacent to the nearest representable value" and "the smaller
representable value immediately adjacent to the nearest representable
value"; "Larger for right operands" is not a valid choice. The
implementation needs to choose between which of these behaviors to use
and document it.
I see your reading, but it isn't clear to me that the text quoted
above /excludes/ an instance-by-instance choice. But perchance the
way the Standard is written as a whole would make permitting
instance-by-instance readings generally unpalatable.

--
Chris "" Dollin
"No-one here is exactly what he appears." G'kar, /Babylon 5/

Aug 22 '06 #26
And just to muddy the waters a bit more, it will actually improve the
accuracy of calculations (in the long-term average) if 1/2 LSB of
random dither is thrown in.

Aug 22 '06 #27

Wojtek Lerch wrote:
"Robert Gamble" <rg*******@gmail.comwrote in message
news:11*********************@i42g2000cwa.googlegro ups.com...
Douglas A. Gwyn wrote:
...
Thus, not even 0.1==0.1 is guaranteed.
...
Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?

Where does the standard say that it must be consistent, or even describe
what "consistent" means in this context? The behaviour is unspecified --
the implementation is free to decide whether to round up or down, separately
for each case. I don't see a requirement that the decision must always be
the same, or that it must be the same for any two constants that have
identical spelling, or identical mathematical value. As long as the
implementation documents how it makes the decision, the requirement of
"implementation-defined" is satisfied.
The Standard says that the implementation must "document how the choice
is made" which implies that there is a systematic way to determine how
the choice will be made in each instance. If the choice doesn't have
to be consistent there is no point in having implementation defined
behavior at all.

Robert Gamble

Aug 22 '06 #28
In article <11*********************@i42g2000cwa.googlegroups.com>,
Robert Gamble <rg*******@gmail.com> wrote:
>Where is this stated? 6.4.4.2p3 states in part:
"For decimal floating constants, and also for hexadecimal floating
constants when FLT_RADIX is not a power of 2, the result is either the
nearest representable value, or the larger or smaller representable
value immediately adjacent to the nearest representable value, chosen
in an implementation-defined manner."
>Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?
Your wording in this thread implies that it must choose one
of the two behaviours and stick with it. That isn't necessarily
the case: it could, for example, choose "round to even". This
method is consistent, easily documentable, and yet is neither
round to larger nor round to smaller.

http://citeseer.ist.psu.edu/context/1001006/0
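
A quick way to see ties-to-even in action (my illustration; rint()
honours the current rounding mode, which defaults to
round-to-nearest-even):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* both halfway cases go to the nearest even integer */
    printf("%g %g\n", rint(0.5), rint(1.5)); /* prints: 0 2 */
    return 0;
}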
--
"law -- it's a commodity"
-- Andrew Ryan (The Globe and Mail, 2005/11/26)
Aug 22 '06 #29
Richard Bos wrote:
"Robert Gamble" <rg*******@gmail.comwrote:
Keith Thompson wrote:
"Dik T. Winter" <Di********@cwi.nlwrites:
In article <ln************@nuthaus.mib.orgKeith Thompson
<ks***@mib.orgwrites: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.

Required by the C standard.
>
Where is that stated?
The number 1 can be exactly represented according to the model
described in 5.2.4.2.2 using any radix (b) since an exponent (e) of
zero must be allowed and b^e is 1 when e is zero. Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.

Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.
How would you represent 2.0 with a radix of 3 in the floating point
model?

Robert Gamble

Aug 22 '06 #30


Robert Gamble wrote On 08/22/06 10:52,:
>
The Standard says that the implementation must "document how the choice
is made" which implies that there is a systematic way to determine how
the choice will be made in each instance. If the choice doesn't have
to be consistent there is no point in having implementation defined
behavior at all.
I don't think you can go from "document the choice" to
"document the choice in a systematic way." For one thing,
you've got to define what you mean by "systematic," which
probably means you've got to enumerate the conditions that
are and are not allowed to influence the choice.

DEATHSTATION 9000 OFFICIAL DOCUMENTATION
(quoted without permission)

When a floating-point result must be rounded, the
Rounding Mode Bit (RMB) is consulted. If the RMB
is clear, the floating-point result is rounded
toward zero. If the RMB is set, the result is
rounded toward infinity.

The RMB is set to zero when the system is shipped
from the factory, and thereafter inverts each time
the on-board detector observes the arrival of an
alpha particle.

All right, it's whimsical -- but doesn't it "document
how the choice is made?" There's a QoI issue, but I don't
think there's a conformance problem.

--
Er*********@sun.com

Aug 22 '06 #31
"Robert Gamble" <rg*******@gmail.comwrote in message
news:11**********************@p79g2000cwp.googlegr oups.com...
The Standard says that the implementation must "document how the choice
is made" which implies that there is a systematic way to determine how
the choice will be made in each instance. If the choice doesn't have
to be consistent there is no point in having implementation defined
behavior at all.
The point in having implementation-defined behaviour is to ensure that you
can determine, by reading an implementation's documentation, whether the
implementation satisfies your requirements.

Aug 22 '06 #32
Robert Gamble wrote:
Richard Bos wrote:
"Robert Gamble" <rg*******@gmail.comwrote:
....
The number 1 can be exactly represented according to the model
described in 5.2.4.2.2 using any radix (b) since an exponent (e) of
zero must be allowed and b^e is 1 when e is zero. Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.
Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.

How would you represent 2.0 with a radix of 3 in the floating point
model?
As indicated above, 0.200000e+1. In terms of 5.2.4.2.2:

s = +1
b = 3
e = 1
f[1] = 2, all other f[k] = 0

The value given by the formula in 5.2.4.2.2p2 is then

x = (+1) * 3^1 * (2 * 3^-1) = 2.0

For the model defined in 5.2.4.2.2, there do exist values of b, p,
emin, and emax such that 2.0 isn't exactly representable: if emin is
high enough, 2.0 < DBL_MIN; if emax were low enough, 2.0 >
DBL_EPSILON*DBL_MAX. But that would require DBL_MIN, and either
DBL_EPSILON or DBL_MAX, to have values inconsistent with
5.2.4.2.2p8-10.

Any implementation where 2.0 was either too large or too small to be
represented exactly would also be pretty unpopular, but that's a QoI
issue.

Aug 22 '06 #33
Robert Gamble wrote:
... Is the following guaranteed:
double d1 = 0.1;
double d2 = d1;
d1 == d2; /* always true? */
I don't think it's guaranteed, even if the declarations were
volatile-qualified (to prevent register caching). However,
it's hard to imagine code in that case that would fail the test.
Aug 22 '06 #34
Robert Gamble wrote:
Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?
The implementation definition could be arbitrarily complicated,
specifying variations based on context (for example).
Aug 22 '06 #35
"Robert Gamble" <rg*******@gmail.comwrote in message
news:11**********************@74g2000cwt.googlegro ups.com...
How would you represent 2.0 with a radix of 3 in the floating point
model?
Others have already explained how this is possible. Perhaps you're thinking
of 0.5? That is indeed inexact in radix 3 floating-point.

Philip

Aug 22 '06 #36
In article <4l************@individual.net>, Wojtek Lerch
<Wo******@yahoo.ca> writes
>Where does the standard say that it must be consistent, or even describe
what "consistent" means in this context? The behaviour is unspecified --
the implementation is free to decide whether to round up or down, separately
for each case. I don't see a requirement that the decision must always be
the same, or that it must be the same for any two constants that have
identical spelling, or identical mathematical value. As long as the
implementation documents how it makes the decision, the requirement of
"implementation-defined" is satisfied.
And in some circumstances it might be preferred to either strictly
alternate up and down or do so randomly.
--
Francis Glassborow ACCU
Author of 'You Can Do It!' and "You Can Program in C++"
see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects
Aug 22 '06 #37
In article <11**********************@p79g2000cwp.googlegroups.com>,
Robert Gamble <rg*******@gmail.com> writes
>The Standard says that the implementation must "document how the choice
is made" which implies that there is a systematic way to determine how
the choice will be made in each instance. If the choice doesn't have
to be consistent there is no point in having implementation defined
behavior at all.
But suppose the implementation states that the rounding will be up and
down dependent on the lsb in a hardware random number generator? That
documents the choice but does not allow the programmer to know what it
is.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' and "You Can Program in C++"
see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects
Aug 22 '06 #38
In comp.std.c Robert Gamble <rg*******@gmail.com> wrote:
>
Thanks for the response. It's unfortunate that gcc doesn't follow the
Standard in this regard
As I recall, *most* of GCC tries to do it right, but the x86 back end
lies about what it can do and thus defeats the best efforts of the rest
of the compiler. I believe GCC on other platforms gets it right.

-Larry Jones

I'm not a vegetarian! I'm a dessertarian. -- Calvin
Aug 22 '06 #39
"Robert Gamble" <rg*******@gmail.comwrites:
Chris Dollin wrote:
<snip: quoted text elided>

Could it not, for example, pick the lower value for left operands,
and the upper value for right operands?

That would be consistent. Stupid, perhaps, but consistent. Or
perhaps I mean "predictable".

No, the choices are "the larger representable value immediately
adjacent to the nearest representable value" and "the smaller
representable value immediately adjacent to the nearest representable
value"; "Larger for right operands" is not a valid choice. The
implementation needs to choose between which of these behaviors to use
and document it.
The implementation needs to document how the choice is made.

Your reading would forbid different rounding methods for float and
double, which I don't think is the intent.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Aug 22 '06 #40
In article <ln************@nuthaus.mib.org>,
Keith Thompson <ks***@mib.org> wrote:
>Your reading would forbid different rounding methods for float and
double, which I don't think is the intent.
On the contrary, I believe different rounding methods for float
and double is an intentional allowance.

"single precision" floating point formats and rounding properties
were invented multiple times, and it took years for the good
and the bad to get weeded out from the ugly. "double precision"
came along much further in the development of floating point,
when there were much firmer ideas of How It Ought To Work.
Double precision formats and properties were much more
standardized than single precision. It was not uncommon for
manufacturers to preserve their legacy single precision properties
for single precision, but to use different properties for
double precision.

If I recall correctly, IBM and DEC (especially VAX) had multiple
legacy single-precision float formats with properties that differed
noticeably from double-precision properties.
--
Okay, buzzwords only. Two syllables, tops. -- Laurie Anderson
Aug 22 '06 #41
Douglas A. Gwyn wrote:
Robert Gamble wrote:
Since when is an implementation allowed to manifest
implementation-defined behavior in a non-consistent fashion?

The implementation definition could be arbitrarily complicated,
specifying variations based on context (for example).
Okay, I am in agreement now.

Robert Gamble

Aug 22 '06 #42
ku****@wizard.net wrote:
Robert Gamble wrote:
Richard Bos wrote:
"Robert Gamble" <rg*******@gmail.comwrote:
...
The number 1 can be exactly represented according to the model
described in 5.2.4.2.2 using any radix (b) since an exponent (e) of
zero must be allowed and b^e is 1 when e is zero. Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.
>
Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.
How would you represent 2.0 with a radix of 3 in the floating point
model?

As indicated above, 0.200000e+1. In terms of 5.2.4.2.2:

s = +1
b = 3
e = 1
f[1] = 2, all other f[k] = 0

The value given by the formula in 5.2.4.2.2p2 is then

x = (+1) * 3^1 * (2 * 3^-1) = 2.0
You and Richard are, of course, correct. I don't know what I was
thinking, sorry and thanks for the clarification.

Robert Gamble

Aug 22 '06 #43
Douglas A. Gwyn wrote:
Robert Gamble wrote:
... Is the following guaranteed:
double d1 = 0.1;
double d2 = d1;
d1 == d2; /* always true? */

I don't think it's guaranteed, even if the declarations were
volatile-qualified (to prevent register caching). However,
it's hard to imagine code in that case that would fail the test.
First off I'd like to thank you and everyone else who has contributed
to this thread, your patience and insights have been valuable and are
appreciated.
I accept the fact that, notwithstanding IEEE-compliance, 0.1==0.1 is not
guaranteed and all of the related points that lead to such a
conclusion. What I don't understand though is how the above example
isn't guaranteed, even without the volatile qualifier. In my
understanding the value of 0.1 is stored, either exactly or rounded in
an implementation-defined way, as a double value in d1. How can
additional rounding occur when d2 is then assigned the value of d1? I
really can't think of an allowable scenario where this could be the
case. I understand that:
d1 = 0.1; d2 = 0.1;
may not result in d1 and d2 having values that compare equal but this
is because there is the potential for rounding to occur twice with the
results being different each time. Similar to my original example, I
would think that "d2 = d1 = 0.1;" would also result in values for d1
and d2 that must compare equal. Not only that, but given the apparent
guarantees of 6.3.1.5, I would think that the following is also always
true:
float f1 = 0.1;
double d1 = f1;
f1 == d1;
The Standard seems pretty clear about this, or am I misinterpreting
something here?

Robert Gamble

Aug 22 '06 #44
In article <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:
"Dik T. Winter" <Di********@cwi.nl> writes:
In article <ln************@nuthaus.mib.org> Keith Thompson
<ks***@mib.org> writes: ...
On the other hand, there are some values (such as 1.0) that you can
reasonably assume can be represented exactly.
Required by the C standard.

Where is that stated?
5.2.4.2.2 where the model is defined. 1.0 is a number in the model.
The actual representation may have numbers in addition to the model
numbers, but the model numbers are required. See also the definition
of FLT_EPSILON.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #45
In article <44***************@news.xs4all.nl> rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
"Robert Gamble" <rg*******@gmail.com> wrote:
....
Since a floating
point number must be represented exactly if it can be exactly
represented it is guaranteed that 1 will always be represented exactly
in a floating point number. The same cannot be said for 2.

Yes, it can. 2 is exactly 0.100000e+2 if the base is 2 (or, if you want
the exponent expressed in the base as well, 0.100000e+10), and exactly
0.200000e+1 if the base is anything larger.
FLT_DIG is required to be at least 6. It is further defined as
floor((p - 1) * log_10 b) (ignoring the case where b is a power of 10).
Which means that (p - 1) * log_10 b >= 6, or p >= 1 + 6 / log_10 b.
We need not consider the exponent; it can be large enough.
So the largest integer that is guaranteed to be represented
exactly is:
b ** (1 + 6 / log_10 b) - 1 = b * b ** (6 * log_b 10) - 1 =
b * (b ** log_b 10) ** 6 - 1 = b * 10 ** 6 - 1 = b * 1000000 - 1
(again, ** means exponentiation).
For b = 10 the (different) formula comes to the result that the
maximum is 999999.
So considering everything, all integers in the range -999999 .. +999999
can be represented exactly, regardless of the base used.
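
A brute-force check of that bound (a sketch; on an IEEE single-precision
float it reports zero failures, since every integer of magnitude below
2^24 is exact there):

#include <float.h>
#include <stdio.h>

int main(void) {
    /* every integer in -999999 .. +999999 should survive a round trip
       through float, whatever the radix of the implementation */
    long failures = 0;
    for (long i = -999999; i <= 999999; i++)
        if ((long)(float)i != i)
            failures++;
    printf("FLT_DIG = %d, failures = %ld\n", FLT_DIG, failures);
    return 0;
}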
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #46
In article <11**********************@m79g2000cwm.googlegroups.com> "Robert Gamble" <rg*******@gmail.com> writes:
....
No, the choices are "the larger representable value immediately
adjacent to the nearest representable value" and "the smaller
representable value immediately adjacent to the nearest representable
value"; "Larger for right operands" is not a valid choice.
There are actually *three* choices. The nearest representable value,
or one of the two values adjacent to it.
The
implementation needs to choose which of these behaviors to use
and document it.
But the choice need not be consistent with respect to the number or the
context.
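
A sketch that makes the chosen values visible (the digits in the
comments are what IEEE round-to-nearest happens to pick; the Standard
does not pin them down):

#include <stdio.h>

int main(void) {
    /* each constant becomes some representable value: the nearest one,
       or one of the two adjacent to it, at the implementation's option */
    printf("0.1f = %.9g\n", 0.1f);   /* e.g. 0.100000001 */
    printf("0.1  = %.17g\n", 0.1);   /* e.g. 0.10000000000000001 */
    return 0;
}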
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #47
In article <11*********************@h48g2000cwc.googlegroups.com> "Ancient_Hacker" <gr**@comcast.net> writes:
And just to muddy the waters a bit more, it will actually improve the
accuracy of calculations (in the long-term average) if 1/2 LSB of
random dither is thrown in.
Try to discuss that with Kahan.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #48
In article <11**********************@i42g2000cwa.googlegroups.com> ku****@wizard.net writes:
....
For the model defined in 5.2.4.2.2, there do exist values of b, p,
e_min, and e_max such that 2.0 isn't exactly representable: if e_min is
high enough, 2.0 < DBL_MIN; if e_max were low enough, 2.0 >
DBL_EPSILON*DBL_MAX. But that would require DBL_MIN, and either
DBL_EPSILON or DBL_MAX, to have values inconsistent with
5.2.4.2.2p8-10.

Any implementation where 2.0 was either too large or too small to be
represented exactly would also be pretty unpopular, but that's a QoI
issue.
In my opinion such an implementation would be incorrect. For float,
FLT_MIN_10_EXP is *defined* as ceil((e_min - 1) * log_10 b) and should
be at most -37. This gives an upper bound on e_min. In a similar way
we can find a lower bound on e_max. And both are good enough to
guarantee that 2.0 *must* be in the model. Note that the values in
this paragraph are *minimal* in magnitude.
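
The limits in question can simply be printed (a sketch; everything
beyond the required minimum magnitudes is implementation-specific):

#include <float.h>
#include <stdio.h>

int main(void) {
    /* the required magnitudes (at most -37, at least +37) bound the
       exponent range so that 2.0 must lie inside the model */
    printf("FLT_MIN_10_EXP = %d\n", FLT_MIN_10_EXP);
    printf("FLT_MAX_10_EXP = %d\n", FLT_MAX_10_EXP);
    printf("FLT_MIN = %g, FLT_MAX = %g\n", FLT_MIN, FLT_MAX);
    return 0;
}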
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 22 '06 #49
Dik T. Winter wrote:
In article <11**********************@i42g2000cwa.googlegroups.com> ku****@wizard.net writes:
...
For the model defined in 5.2.4.2.2, there do exist values of b, p,
e_min, and e_max such that 2.0 isn't exactly representable: if e_min is
high enough, 2.0 < DBL_MIN; if e_max were low enough, 2.0 >
DBL_EPSILON*DBL_MAX. But that would require DBL_MIN, and either
DBL_EPSILON or DBL_MAX, to have values inconsistent with
5.2.4.2.2p8-10.
Any implementation where 2.0 was either too large or too small to be
represented exactly would also be pretty unpopular, but that's a QoI
issue.

In my opinion such an implementation would be incorrect. For float,
FLT_MIN_10_EXP is *defined* as ceil((e_min - 1) * log_10 b) and should
be at most -37. This gives an upper bound on e_min. In a similar way
we can find a lower bound on e_max. And both are good enough to
guarantee that 2.0 *must* be in the model. Note that the values in
this paragraph are *minimal* in magnitude.
Of course - I already indicated that such an implementation was
non-conforming. My point was that, in addition to being non-conforming,
it would also be extremely unpopular. As a practical matter, that's far
more important.

Aug 23 '06 #50
