
Floating-point bit hacking: iterated nextafter() without loop?

I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0. Something
equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}

The problem with the above loop is that it may take too long because n
could be quite large. Moreover, very many such loops would be executed.

Disappointingly, gcc doesn't seem to be able to optimize iterated
nextafter calls at all.

However, with only some modest bit hacking and without assuming much
about the floating point implementation it is possible to calculate
the nth greater value fast. At the end of this post is a program
showing such a solution. Of course it is not guaranteed to work as
intended, but is there any non-exotic platform where it will fail? If
there are any places where the solution could fail, what would the
consequences be? Is there a better solution to the problem than the
one below?
Daniel Vallstrom

// Tests consecutive nextafter calls and a way to compute the same thing fast
// without iterated nextafter calls.
// Daniel Vallstrom, 041014.
// Compile with e.g: gcc -std=c99 -pedantic -Wall -O3 -lm nextafterNTest.c

/* The program should print something like:
n: 0x1000000 (16777216)
x0: 0x1.001p+8
xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
The xn loop took 0.8 cpu seconds.
m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022
e: exp2( floor( log2( x0 ) ) ): 0x1p+8
z: e | m: 0x1.0000001p+8
z1: z - e: 0x1p-20
z2: x0 + z1: 0x1.0010001p+8 (==xn)
u: n unrolled nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8 (==xn)
The u loops took 0.7 cpu seconds.
*/

#include <stdio.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <time.h>
#define nextafter2M( x, y ) (x) = nextafter((x),(y)); (x) = nextafter((x),(y))
#define nextafter4M( x, y ) nextafter2M( (x), (y) ); nextafter2M( (x), (y) )
#define nextafter8M( x, y ) nextafter4M( (x), (y) ); nextafter4M( (x), (y) )
#define nextafter16M( x, y ) nextafter8M( (x), (y) ); nextafter8M( (x), (y) )
#define nextafter32M( x, y ) nextafter16M( (x), (y) ); nextafter16M( (x), (y) )
#define nextafter64M( x, y ) nextafter32M( (x), (y) ); nextafter32M( (x), (y) )
int main( void )
{
clock_t clockAtLoopStart; // Used to time the start of loops.
clock_t clockAtLoopEnd; // Used to time the end of loops.

// The number of nextafter iterations to use.
unsigned int n = ( 1u << 24 );
printf( "\nn: %#x (%u)\n", n, n );

// The base double to apply nextafter to.
double x0 = 0x1.001p8;
printf( "x0: %a\n", x0 );

// The nth double greater than x0.
double xn = x0;
clockAtLoopStart = clock();
for ( unsigned int k = 0; k != n; k++ )
{
xn = nextafter( xn, DBL_MAX );
}
clockAtLoopEnd = clock();
printf( "xn: n nextafter( x0, DBL_MAX ) iterations: %a\n", xn );
// xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
printf( " The xn loop took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The xn loop took 0.8 cpu seconds.
// The goal is to compute xn fast. At least gcc -O3 does *not* optimize
// iterated nextafter calls (the above loop). If the loop (n) is long (or
// many loops have to be done) it takes too long to run.

// Calculate a useful mantissa:
double m = nextafter( 0.0, DBL_MAX );
m = n*m;
printf( "m: n * nextafter( 0.0, DBL_MAX ): %a\n", m );
// m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022

// Calculate a useful exponent part.
double e = exp2( floor( log2( x0 ) ) );
printf( "e: exp2( floor( log2( x0 ) ) ): %a\n", e );
// e: exp2( floor( log2( x0 ) ) ): 0x1p+8

// Combine e and m.
double z = e;
for ( unsigned char * zPtr = (unsigned char *) &z, * zEnd = zPtr + sizeof(z),
* mPtr = (unsigned char *) &m; zPtr != zEnd; zPtr++, mPtr++ )
{
*zPtr |= *mPtr;
}
printf( "z: e | m: %a\n", z );
// z: e | m: 0x1.0000001p+8

// Remove the faulty first mantissa bit in z, the one coming from e.
double z1 = z - e;
printf( "z1: z - e: %a\n", z1 );
// z1: z - e: 0x1p-20

// Finally we are ready to add the value to x0.
double z2 = x0 + z1;
printf( "z2: x0 + z1: %a (==xn)\n", z2 );
// z2: x0 + z1: 0x1.0010001p+8 (==xn)

assert( z2 == xn );
// Test if unrolling the nextafter calls helps. (It doesn't (besides
// the speed-up from the unrolling itself).)

double u = x0;

clockAtLoopStart = clock();
for ( unsigned int k = n/64; k != 0; k-- )
{
nextafter64M( u, DBL_MAX );
}

for ( unsigned int k = n%64; k != 0; k-- )
{
u = nextafter( u, DBL_MAX );
}
clockAtLoopEnd = clock();

printf( "u: n unrolled nextafter( x0, DBL_MAX ) iterations: %a (==xn)\n",
u );
// u: n unrolled nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8 (==xn)
assert( u == xn );
printf( " The u loops took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The u loops took 0.7 cpu seconds.
return 0;
}
Nov 14 '05 #1
7 Replies


Ack! No replies! Bump. Or should I take the fact that there have
been no replies to mean that the program supplied in the first
post does work in practice?

da**************@gmail.com (Daniel Vallstrom) wrote in message news:<62**************************@posting.google. com>...
I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0.
For the background to the problem, I forgot to say that f(n)>0 and
increases with n, i.e. n0<n1 -> f(n0)<f(n1). Also, the various
floating values involved aren't god-given but rather some heuristic
measures. Hence, when you want to add f(n) to x0, it sometimes makes
sense to increment x0 n times instead of doing nothing when the
accuracy of the floating type is too small (i.e. when x0 + f(n) == x0).
Something equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}

The problem with the above loop is that it may take too long because n
could be quite large. Moreover, very many such loops would be executed.


To see that the approach involving iterated nextafter() calls is
infeasible, consider that values of n could on average be around
2**24, like in the program in the first post. Each such n-iterated
nextafter() loop will take about a second to compute. Furthermore,
thousands or millions of such loops could very well have to be executed.

[snip suggested solution to the problem]

Any input is appreciated. Regards,
Daniel Vallstrom
Nov 14 '05 #2

In article <62**************************@posting.google.com > da**************@gmail.com (Daniel Vallstrom) writes:
Ack! No replies! Bump. Or should I take the fact that there have
been no replies to mean that the program supplied in the first
post does work in practice?
Perhaps everybody is stumped? You will not find much floating-point
expertise in this newsgroup.
da**************@gmail.com (Daniel Vallstrom) wrote in message news:<62**************************@posting.google. com>...
I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0.


For the background to the problem, I forgot to say that f(n)>0 and
increases with n, i.e. n0<n1 -> f(n0)<f(n1). Also, the various
floating values involved aren't god-given but rather some heuristic
measures. Hence, when you want to add f(n) to x0, it sometimes makes
sense to increment x0 n times instead of doing nothing when the
accuracy of the floating type is too small (i.e. when x0 + f(n) == x0).
Something equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}

The problem with the above loop is that it may take too long because n
could be quite large. Moreover, very many such loops would be executed.


To see that the approach involving iterated nextafter() calls is
infeasible, consider that values of n could on average be around
2**24, like in the program in the first post. Each such n-iterated
nextafter() loop will take about a second to compute. Furthermore,
thousands or millions of such loops could very well have to be executed.


My suggestion for the loop would be breaking up the floating-point number into
its constituent parts and working with them. In that case you add n (suitably
scaled) to the mantissa and recombine. Be careful with overflow in the
mantissa when you add... See frexp and ldexp (at least on my system).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nov 14 '05 #3

Accumulated ugly effects of top-posting fixed, I hope.

"Dik T. Winter" wrote:
(Daniel Vallstrom) writes:
da**************@gmail.com (Daniel Vallstrom) wrote in message

I am having trouble with floating point addition because of the
limited accuracy of floating types. Consider e.g. x0 += f(n) where
x0 and f are of some floating type. Sometimes x0 is much larger
than f(n) so that x0 + f(n) == x0. For example, x0 could be
2**300*(1.1234...) while f(n) is 2**100*(1.4256...). If x0 + f(n)
== x0 I still sometimes want the addition of f(n) to x0 to have
some effect on x0. A good solution seems to be to update x0 to the
nth value greater than x0.


For the background to the problem, I forgot to say that f(n)>0 and
increases with n, i.e. n0<n1 -> f(n0)<f(n1). Also, the various
floating values involved aren't god-given but rather some heuristic
measures. Hence, when you want to add f(n) to x0, it sometimes
makes sense to increment x0 n times instead of doing nothing when
the accuracy of the floating type is too small (i.e. when x0 + f(n)
== x0). .... snip ... Ack! No replies! Bump. Or should I take the fact that there have
been no replies to mean that the program supplied in the first
post does work in practice?


Perhaps everybody is stumped? You will not find much floating-point
expertise in this newsgroup.


I would attempt to keep the accumulated corrections separate.
Ideally such a value would be accumulated starting with the
smallest correction, but we can't necessarily have everything.

double x0, xnow, delta;
int n;

for (n = 1, x0 = XSTART, delta = 0; n <= NLAST; n++) {
    delta += f(n);
    xnow = x0 + delta;
}

which should at least mitigate the problem. FP arithmetic is
inherently an approximation, so we can never have everything. At
best we can define the error bounds. In this case the bounds will
be a function of the relative magnitudes of delta, f(n), and x0.

--
"I support the Red Sox and any team that beats the Yankees"

"Any baby snookums can be a Yankee fan, it takes real moral
fiber to be a Red Sox fan"
Nov 14 '05 #4

Daniel Vallstrom wrote:
Ack! No replies! Bump.
Bump? This is Usenet. Good newsreaders don't bump, they just add
responses to the original thread, which stays put, visually speaking.
Or should I take the fact that there have
been no replies to mean that the program supplied in the first
post does work in practice?


Take it to mean that the comp.lang.c folks don't know enough about
floating-point math at the bit level to help you.

I'd suggest some other newsgroups, but I don't know enough to help you
there, either. comp.programming is probably a good catchall for this
sort of thing. Or you could find some web discussion groups.
Nov 14 '05 #5

"Dik T. Winter" <Di********@cwi.nl> wrote in message news:<I5********@cwi.nl>...
In article <62**************************@posting.google.com > da**************@gmail.com (Daniel Vallstrom) writes:
> da**************@gmail.com (Daniel Vallstrom) wrote in message news:<62**************************@posting.google. com>...
> > I am having trouble with floating point addition because of the limited
> > accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
> > of some floating type. Sometimes x0 is much larger than f(n) so that
> > x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
> > is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
> > addition of f(n) to x0 to have some effect on x0. A good solution seems
> > to be to update x0 to the nth value greater than x0.

>
> For the background to the problem, I forgot to say that f(n)>0 and
> increases with n, i.e. n0<n1 -> f(n0)<f(n1). Also, the various
> floating values involved aren't god-given but rather some heuristic
> measures. Hence, when you want to add f(n) to x0, it sometimes makes
> sense to increment x0 n times instead of doing nothing when the
> accuracy of the floating type is too small (i.e. when x0 + f(n) == x0).
>
> > Something equivalent to this:
> >
> > for ( ; n != 0; n-- )
> > {
> > x0 = nextafter( x0, DBL_MAX );
> > }
> >
> > The problem with the above loop is that it may take too long because n
> > could be quite large. Moreover, very many such loops would be executed.

>
> To see that the approach involving iterated nextafter() calls is
> infeasible, consider that values of n could on average be around
> 2**24, like in the program in the first post. Each such n-iterated
> nextafter() loop will take about a second to compute. Furthermore,
> thousands or millions of such loops could very well have to be executed.


My suggestion for the loop would be breaking up the floating-point number in
its constituent parts, and work with them. In that case you add n (suitably
scaled) to the mantissa and recombine. Be careful with overflow in the
mantissa when you add... See frexp and ldexp (at least on my system).


Thanks! That got me thinking. I'll add an unpolished new
solution at the end of this post. (I ought to have found
it directly, but I never moved on after failing to find
standard library functions that could directly do the
bit hacking (mantissa-exponent combining) that I wanted.)

I guess that when the floating point exponent base (FLT_RADIX)
isn't 2, the code won't work, at least not without some fixing.

Daniel Vallstrom
// Tests consecutive nextafter calls and a way to compute the same thing fast
// without iterated nextafter calls.
// Daniel Vallstrom, 041018.
// Compile with e.g: gcc -std=c99 -pedantic -Wall -O3 -lm nextafterNTest.c

/* The program should print something like:
n: 0x1000000 (16777216)
x0: 0x1.001p+8
xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
The xn loop took 0.8 cpu seconds.
m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022
e: exp2( floor( log2( x0 ) ) ): 0x1p+8
z: e | m: 0x1.0000001p+8
z1: z - e: 0x1p-20
z2: x0 + z1: 0x1.0010001p+8 (==xn)
x0Exp: frexp( x0, &x0Exp ): 9
mnorm: DBL_MIN + m: 0x1.0000001p-1022
zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)
zb1: zb - e: 0x1p-20
zb2: x0 + zb1: 0x1.0010001p+8 (==xn)
*/

#include <stdio.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <time.h>
int main( void )
{
#if FLT_RADIX != 2
#error Case not supported.
#endif

clock_t clockAtLoopStart; // Used to time the start of loops.
clock_t clockAtLoopEnd; // Used to time the end of loops.

// The number of nextafter iterations to use. Can be anything.
unsigned int n = ( 1u << 24 );
printf( "\nn: %#x (%u)\n", n, n );

// The base double to apply nextafter to. Can be anything.
double x0 = 0x1.001p8;
printf( "x0: %a\n", x0 );

// The nth double greater than x0.
double xn = x0;
clockAtLoopStart = clock();
for ( unsigned int k = 0; k != n; k++ )
{
xn = nextafter( xn, DBL_MAX );
}
clockAtLoopEnd = clock();
printf( "xn: n nextafter( x0, DBL_MAX ) iterations: %a\n", xn );
// xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
printf( " The xn loop took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The xn loop took 0.8 cpu seconds.
// The goal is to compute xn fast. At least gcc -O3 does *not* optimize
// iterated nextafter calls (the above loop). If the loop (n) is long (or
// many loops have to be done) it takes too long to run.

// Calculate a useful mantissa:
double m = nextafter( 0.0, DBL_MAX );
m = n*m;
printf( "m: n * nextafter( 0.0, DBL_MAX ): %a\n", m );
// m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022

// Calculate a useful exponent part.
double e = exp2( floor( log2( x0 ) ) );
printf( "e: exp2( floor( log2( x0 ) ) ): %a\n", e );
// e: exp2( floor( log2( x0 ) ) ): 0x1p+8

// Combine e and m.
double z = e;
for ( unsigned char * zPtr = (unsigned char *) &z, * zEnd = zPtr + sizeof(z),
* mPtr = (unsigned char *) &m; zPtr != zEnd; zPtr++, mPtr++ )
{
*zPtr |= *mPtr;
}
printf( "z: e | m: %a\n", z );
// z: e | m: 0x1.0000001p+8

// Remove the faulty first mantissa bit in z, the one coming from e.
double z1 = z - e;
printf( "z1: z - e: %a\n", z1 );
// z1: z - e: 0x1p-20

// Finally we are ready to add the value to x0.
double z2 = x0 + z1;
printf( "z2: x0 + z1: %a (==xn)\n", z2 );
// z2: x0 + z1: 0x1.0010001p+8 (==xn)

assert( z2 == xn );
// Here is an approach using frexp and ldexp instead of bit-oring.

int x0Exp;
frexp( x0, &x0Exp );
printf( "x0Exp: frexp( x0, &x0Exp ): %d\n", x0Exp );
// x0Exp: frexp( x0, &x0Exp ): 9

// Normalize the mantissa.
double mnorm = DBL_MIN + m;
printf( "mnorm: DBL_MIN + m: %a\n", mnorm );
// mnorm: DBL_MIN + m: 0x1.0000001p-1022

double zb = ldexp( mnorm, x0Exp-DBL_MIN_EXP );
printf( "zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): %a (==z)\n", zb );
// zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)

assert( zb == z );
// The rest is just like above.

// Remove the faulty first mantissa bit in zb.
double zb1 = zb - e;
printf( "zb1: zb - e: %a\n", zb1 );
// zb1: zb - e: 0x1p-20

// We are ready to add the value to x0.
double zb2 = x0 + zb1;
printf( "zb2: x0 + zb1: %a (==xn)\n", zb2 );
// zb2: x0 + zb1: 0x1.0010001p+8 (==xn)

assert( zb2 == xn );
return 0;
}
Nov 14 '05 #6

Daniel Vallstrom wrote:
I am having trouble with floating point addition because of the limited
accuracy of floating types. Consider e.g. x0 += f(n) where x0 and f are
of some floating type. Sometimes x0 is much larger than f(n) so that
x0 + f(n) == x0. For example, x0 could be 2**300*(1.1234...) while f(n)
is 2**100*(1.4256...). If x0 + f(n) == x0 I still sometimes want the
addition of f(n) to x0 to have some effect on x0. A good solution seems
to be to update x0 to the nth value greater than x0. Something
equivalent to this:

for ( ; n != 0; n-- )
{
x0 = nextafter( x0, DBL_MAX );
}
[...]


Look up "Kahan's summation formula."

--
Er*********@sun.com

Nov 14 '05 #7

"Dik T. Winter" <Di********@cwi.nl> wrote in message news:<I5********@cwi.nl>...
In article <62**************************@posting.google.com > da**************@gmail.com (Daniel Vallstrom) writes:
> da**************@gmail.com (Daniel Vallstrom) wrote in message news:<62**************************@posting.google. com>...
> > [snip question asking how to compute the following loop in a fast manner]
> >
> > for ( ; n != 0; n-- )
> > {
> > x0 = nextafter( x0, DBL_MAX );
> > }
My suggestion for the loop would be breaking up the floating-point number in
its constituent parts, and work with them. In that case you add n (suitably
scaled) to the mantissa and recombine. Be careful with overflow in the
mantissa when you add... See frexp and ldexp (at least on my system).


Thanks again. Here is a simple solution:

int x0Exp;
frexp( x0, &x0Exp );
x0 += ldexp( n * nextafter( 0.0, DBL_MAX ), x0Exp - DBL_MIN_EXP );

I'm still miffed that I didn't find the above solution directly
(even though hindsight is always 20/20). I'll post a full program with
the solution at the end.

Daniel Vallstrom
// Tests consecutive nextafter calls and ways to compute the same thing fast
// without iterated nextafter calls.
// Daniel Vallstrom, 041019.
// Compile with e.g: gcc -std=c99 -pedantic -Wall -O3 -lm nextafterNTest.c

/* The program should print something like:
n: 0x1000000 (16777216)
x0: 0x1.001p+8
xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
The xn loop took 0.8 cpu seconds.
m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022
e: exp2( floor( log2( x0 ) ) ): 0x1p+8
z: e | m: 0x1.0000001p+8
z1: z - e: 0x1p-20
z2: x0 + z1: 0x1.0010001p+8 (==xn)
x0Exp: frexp( x0, &x0Exp ): 9
mnorm: DBL_MIN + m: 0x1.0000001p-1022
zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)
zb1: zb - e: 0x1p-20 (==z1)
zb2: x0 + zb1: 0x1.0010001p+8 (==xn)
zc1: ldexp( m, x0Exp-DBL_MIN_EXP ): 0x1p-20 (==z1)
zc2: x0 + zc1: 0x1.0010001p+8 (==xn)
*/

#include <stdio.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <time.h>
int main( void )
{
#if FLT_RADIX != 2
#error Case not supported.
#endif

clock_t clockAtLoopStart; // Used to time the start of loops.
clock_t clockAtLoopEnd; // Used to time the end of loops.

// The number of nextafter iterations to use. Can be anything.
unsigned int n = ( 1u << 24 );
printf( "\nn: %#x (%u)\n", n, n );

// The base double to apply nextafter to. Can be anything.
double x0 = 0x1.001p8;
printf( "x0: %a\n", x0 );

// The nth double greater than x0.
double xn = x0;
clockAtLoopStart = clock();
for ( unsigned int k = 0; k != n; k++ )
{
xn = nextafter( xn, DBL_MAX );
}
clockAtLoopEnd = clock();
printf( "xn: n nextafter( x0, DBL_MAX ) iterations: %a\n", xn );
// xn: n nextafter( x0, DBL_MAX ) iterations: 0x1.0010001p+8
printf( " The xn loop took %.1f cpu seconds.\n",
(double) ( clockAtLoopEnd - clockAtLoopStart ) / CLOCKS_PER_SEC );
// The xn loop took 0.8 cpu seconds.
// The goal is to compute xn fast. At least gcc -O3 does *not* optimize
// iterated nextafter calls (the above loop). If the loop (n) is long (or
// many loops have to be done) it takes too long to run.

// Calculate a useful mantissa:
double m = nextafter( 0.0, DBL_MAX );
m = n*m;
printf( "m: n * nextafter( 0.0, DBL_MAX ): %a\n", m );
// m: n * nextafter( 0.0, DBL_MAX ): 0x0.0000001p-1022

// Calculate a useful exponent part.
double e = exp2( floor( log2( x0 ) ) );
printf( "e: exp2( floor( log2( x0 ) ) ): %a\n", e );
// e: exp2( floor( log2( x0 ) ) ): 0x1p+8

// Combine e and m.
double z = e;
for ( unsigned char * zPtr = (unsigned char *) &z, * zEnd = zPtr + sizeof(z),
* mPtr = (unsigned char *) &m; zPtr != zEnd; zPtr++, mPtr++ )
{
*zPtr |= *mPtr;
}
printf( "z: e | m: %a\n", z );
// z: e | m: 0x1.0000001p+8

// Remove the faulty first mantissa bit in z, the one coming from e.
double z1 = z - e;
printf( "z1: z - e: %a\n", z1 );
// z1: z - e: 0x1p-20

// Finally we are ready to add the value to x0.
double z2 = x0 + z1;
printf( "z2: x0 + z1: %a (==xn)\n", z2 );
// z2: x0 + z1: 0x1.0010001p+8 (==xn)

assert( z2 == xn );
// Here is an approach using frexp and ldexp instead of bit-oring.

int x0Exp;
frexp( x0, &x0Exp );
printf( "x0Exp: frexp( x0, &x0Exp ): %d\n", x0Exp );
// x0Exp: frexp( x0, &x0Exp ): 9

// Normalize the mantissa.
double mnorm = DBL_MIN + m;
printf( "mnorm: DBL_MIN + m: %a\n", mnorm );
// mnorm: DBL_MIN + m: 0x1.0000001p-1022

// Increase the exponent of mnorm to match the exponent of x0.
double zb = ldexp( mnorm, x0Exp-DBL_MIN_EXP );
printf( "zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): %a (==z)\n", zb );
// zb: ldexp( mnorm, x0Exp-DBL_MIN_EXP ): 0x1.0000001p+8 (==z)

assert( zb == z );
// The rest is just like above.

// Remove the faulty first mantissa bit in zb.
assert( e == ldexp( 1.0, x0Exp-1 ) );
double zb1 = zb - e;
printf( "zb1: zb - e: %a (==z1)\n", zb1 );
// zb1: zb - e: 0x1p-20 (==z1)
assert( zb1 == z1 );

// We are ready to add the value to x0.
double zb2 = x0 + zb1;
printf( "zb2: x0 + zb1: %a (==xn)\n", zb2 );
// zb2: x0 + zb1: 0x1.0010001p+8 (==xn)
assert( zb2 == xn );
// Here is a better solution.

double zc1 = ldexp( m, x0Exp-DBL_MIN_EXP );
printf( "zc1: ldexp( m, x0Exp-DBL_MIN_EXP ): %a (==z1)\n", zc1 );
// zc1: ldexp( m, x0Exp-DBL_MIN_EXP ): 0x1p-20 (==z1)
assert( zc1 == z1 );

// Add the value to x0.
double zc2 = x0 + zc1;
printf( "zc2: x0 + zc1: %a (==xn)\n", zc2 );
// zc2: x0 + zc1: 0x1.0010001p+8 (==xn)
assert( zc2 == xn );
return 0;
}
Nov 14 '05 #8
