perhaps a stack problem? Long calculations - strange error?

Martin Joergensen

Hi,

I've encountered a really, *really*, REALLY strange error :-)

I have a for-loop and after 8 runs I get strange results...... I
mean: A really strange result....
I'm calculating temperatures. T[j][i] = 20 degrees at all
times.... The 2D T-array looks like this:

| 2: 3: ..... i-direction goes to the right ->
--+----------
2: | 20 20
3: | 20 20

j-direction goes down

(index 1 and 4 = 0, but they don't matter)

The code *that works* looks like this:
- - - - - - - - - - - -

{
/* copy 2D (old) temperatures to 1D-array */
count = 0;
solved_temp[0] = 0;

/* number_of_interior_cells = 4 */
for(j=2, i=1, no=1; no<=(int) number_of_interior_cells; no++)
{
/* update columns (x-value) */
i++;

solved_temp[no] = T[j][i];
solved_temp[no] += ( hx[j][i] * ( T[j][i-1]-T[j][i] ) )
/ h0[j][i];
solved_temp[no] += ( hx[j][i+1] * ( T[j][i+1]-T[j][i] ) )
/ h0[j][i];
solved_temp[no] += ( hy[j][i] * ( T[j-1][i]-T[j][i] ) )
/ h0[j][i];
solved_temp[no] += ( hy[j+1][i] * ( T[j+1][i]-T[j][i] ) )
/ h0[j][i];

/* below you see the code that *doesn't work* - isn't it
the same??? */

//solved_temp[no] = T[j][i] + (
// hx[j][i] * ( T[j][i-1]-T[j][i] ) + hx[j][i+1] * (
T[j][i+1]-T[j][i] ) +
// hy[j][i] * ( T[j-1][i]-T[j][i] ) + hy[j+1][i] * (
T[j+1][i]-T[j][i] )
// ) / h0[j][i];

if(i == nx-1) /* hits east boundary? */
{
j++; /* next row */
i=1; /* reset x-coordinate */
}
}
}

The expected result is that solved_temp[1] = 20, solved_temp[2] =
20, solved_temp[3] = 20, solved_temp[4] = 20 at all times. If I
swap the above outcommented/commented sections with each other, I
get what I expect the first 7 times. On the 8th time I get that
the next result for solved_temp[1] = -1.#IND000000000000, but the
other results are okay....

What exactly is that -1.#IND000000000000 means?

And isn't it exactly the same code??? I'm pretty confused...

In case you're wondering, hx or hy is either 0 or 1 as you can
see below and all the temperature differences is 0, so it
shouldn't matter anyway... h0 = 2 but also that shouldn't change
anything since everything gives 0 so I have something which is 0
divided by 2 and that ofcourse should give 0........ It all ends
up in (20 + 0) = 20...

Printing hx (step = 8):

i=1: i=2: i=3:
i=4:
j=1: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000
j=2: -> 0.00000 -> 0.00000 -> 1.00000 -> 0.00000
j=3: -> 0.00000 -> 0.00000 -> 1.00000 -> 0.00000
j=4: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000

Printing hy (step = 8):

i=1: i=2: i=3:
i=4:
j=1: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000
j=2: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000
j=3: -> 0.00000 -> 1.00000 -> 1.00000 -> 0.00000
j=4: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000

(only 5 decimal digits shown, but it shouldn't change anything in
the big picture I guess)....

Why does it work when I calculate solved_temp[no] in 5 individual
steps instead of in one long calculation? Even Microsoft Visual
Studio 2005 debugger agrees with me, but then I step over the
calculation line and find this stupid/strange result......!

I hope somebody can explain this really weird behaviour... I hope
there's a "natural" explanation like "stack overflow" or
whatever....?

I was a bit afraid of posting this if I didn't check my code
properly but now I've looked at it for about 1-2 hours and when
my debugger even agrees with me and I then choose F10 "step over"
and get another result, then I think I need professional help :-)

It works now (with 5 individual calculations), but I just want to
be sure that this error doesn't come back later....
Best regards / Med venlig hilsen
Martin Jørgensen

--
---------------------------------------------------------------------------
Home of Martin Jørgensen - http://www.martinjoergensen.dk

May 7 '06 #1

Subscribe Reply

2562

Vladimir Oka

Martin Joergensen opined:

Hi,

I've encountered a really, *really*, REALLY strange error :-)

I have a for-loop and after 8 runs I get strange results...... I
mean: A really strange result....
I'm calculating temperatures. T[j][i] = 20 degrees at all
times.... The 2D T-array looks like this:

| 2: 3: ..... i-direction goes to the right ->
--+----------
2: | 20 20
3: | 20 20

j-direction goes down

(index 1 and 4 = 0, but they don't matter)

The code *that works* looks like this:
- - - - - - - - - - - -

{
/* copy 2D (old) temperatures to 1D-array */
count = 0;
solved_temp[0] = 0;

/* number_of_interior_cells = 4 */
for(j=2, i=1, no=1; no<=(int) number_of_interior_cells; no++)
{
/* update columns (x-value) */
i++;

solved_temp[no] = T[j][i];
solved_temp[no] += ( hx[j][i] * ( T[j][i-1]-T[j][i] ) )
/ h0[j][i];
solved_temp[no] += ( hx[j][i+1] * ( T[j][i+1]-T[j][i] ) )
/ h0[j][i];
solved_temp[no] += ( hy[j][i] * ( T[j-1][i]-T[j][i] ) )
/ h0[j][i];
solved_temp[no] += ( hy[j+1][i] * ( T[j+1][i]-T[j][i] ) )
/ h0[j][i];

/* below you see the code that *doesn't work* - isn't it
the same??? */

//solved_temp[no] = T[j][i] + (
// hx[j][i] * ( T[j][i-1]-T[j][i] ) + hx[j][i+1] * (
T[j][i+1]-T[j][i] ) +
// hy[j][i] * ( T[j-1][i]-T[j][i] ) + hy[j+1][i] * (
T[j+1][i]-T[j][i] )
// ) / h0[j][i];

if(i == nx-1) /* hits east boundary? */
{
j++; /* next row */
i=1; /* reset x-coordinate */
}
}
}

The expected result is that solved_temp[1] = 20, solved_temp[2] =
20, solved_temp[3] = 20, solved_temp[4] = 20 at all times. If I
swap the above outcommented/commented sections with each other, I
get what I expect the first 7 times. On the 8th time I get that
the next result for solved_temp[1] = -1.#IND000000000000, but the
other results are okay....

What exactly is that -1.#IND000000000000 means?
My guess is that your commented out code overflows. The above looks
like some IEEE error/overflow/NaN thingy. It's still too much weekend
'round here for a closer look. Also, you should have really posted a
compilable example that shows the problem.
And isn't it exactly the same code??? I'm pretty confused...
Mathematically, it probably is. Computers, unfortunately don't have
infinite precision. Therein may lie your undoing.
In case you're wondering, hx or hy is either 0 or 1 as you can
see below and all the temperature differences is 0, so it
shouldn't matter anyway... h0 = 2 but also that shouldn't change
anything since everything gives 0 so I have something which is 0
divided by 2 and that ofcourse should give 0........ It all ends
up in (20 + 0) = 20...

Printing hx (step = 8):

i=1: i=2: i=3:
i=4:
j=1: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000
j=2: -> 0.00000 -> 0.00000 -> 1.00000 -> 0.00000
j=3: -> 0.00000 -> 0.00000 -> 1.00000 -> 0.00000
j=4: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000

Printing hy (step = 8):

i=1: i=2: i=3:
i=4:
j=1: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000
j=2: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000
j=3: -> 0.00000 -> 1.00000 -> 1.00000 -> 0.00000
j=4: -> 0.00000 -> 0.00000 -> 0.00000 -> 0.00000

(only 5 decimal digits shown, but it shouldn't change anything in
the big picture I guess)....

Why does it work when I calculate solved_temp[no] in 5 individual
steps instead of in one long calculation? Even Microsoft Visual
Studio 2005 debugger agrees with me, but then I step over the
calculation line and find this stupid/strange result......!

I hope somebody can explain this really weird behaviour... I hope
there's a "natural" explanation like "stack overflow" or
whatever....?

I was a bit afraid of posting this if I didn't check my code
properly but now I've looked at it for about 1-2 hours and when
my debugger even agrees with me and I then choose F10 "step over"
and get another result, then I think I need professional help :-)

It works now (with 5 individual calculations), but I just want to
be sure that this error doesn't come back later....

I'd also recommend you find a good text on the pitfalls of using finite
precision floating point arithmetic. One fairly good link has been
posted here a couple of months back. The only thing I remember about
it was that it was on the Sun site somewhere.

--
Why use Windows, since there is a door?
(By fa****@galileo.rhein-neckar.de, Andre Fachat)

<http://clc-wiki.net/wiki/Introduction_to_comp.lang.c>

May 7 '06 #2

Robert Gamble

Vladimir Oka wrote:

I'd also recommend you find a good text on the pitfalls of using finite
precision floating point arithmetic. One fairly good link has been
posted here a couple of months back. The only thing I remember about
it was that it was on the Sun site somewhere.

You are probably referring to the paper "What Every Computer Scientist
Should Know About Floating-Point Arithmetic" by David Goldberg which is
available on Sun's website at:
<http://docs.sun.com/source/806-3568/ncg_goldberg.html> among many
other places.

Robert Gamble

May 7 '06 #3

Martin Jørgensen

Vladimir Oka wrote:

Martin Joergensen opined:

-snip-

What exactly is that -1.#IND000000000000 means?

My guess is that your commented out code overflows. The above looks
like some IEEE error/overflow/NaN thingy. It's still too much weekend
'round here for a closer look. Also, you should have really posted a
compilable example that shows the problem.

You shouldn't try to code anything yourself... I just tried it but my
code is pretty big. I mean: hx, hy and h0 is calculated in each step and
dynamically allocated so the smallest possible code would probably still
be some hundred lines.

I just made a 100 line program that didn't calculate hx and hy and here
there was no error.... Hmm. So I guess I have an error somewhere in my
code...

Still I can't explain the difference between adding 5 numbers and then
get the right result vs. doing everything at once... So that's basically
why I asked. And when Microsoft Visual studio 2005 debugger agrees with
me, then I'm really confused because I couldn't imagine that it would
lie to me about anything.

-snip-

It works now (with 5 individual calculations), but I just want to
be sure that this error doesn't come back later....

I'd also recommend you find a good text on the pitfalls of using finite
precision floating point arithmetic. One fairly good link has been
posted here a couple of months back. The only thing I remember about
it was that it was on the Sun site somewhere.

I looked at it... It seems like it takes pretty long time to read it
all. Perhaps I'll look at it later. Thanks for the answer.
Best regards / Med venlig hilsen
Martin Jørgensen

--
---------------------------------------------------------------------------
Home of Martin Jørgensen - http://www.martinjoergensen.dk

May 7 '06 #4

Vladimir Oka

Robert Gamble opined:

Vladimir Oka wrote:
I'd also recommend you find a good text on the pitfalls of using
finite precision floating point arithmetic. One fairly good link has
been posted here a couple of months back. The only thing I remember
about it was that it was on the Sun site somewhere.

You are probably referring to the paper "What Every Computer
Scientist Should Know About Floating-Point Arithmetic" by David
Goldberg which is available on Sun's website at:
<http://docs.sun.com/source/806-3568/ncg_goldberg.html> among many
other places.

Yup, that's the one! Thanks for the link (although I do remember
printing it out at the time, and leaving it on the "to read" heap in
the office ;-) ).

--
Those who make peaceful revolution impossible will make violent
revolution inevitable.
-- John F. Kennedy

<http://clc-wiki.net/wiki/Introduction_to_comp.lang.c>

May 7 '06 #5

Vladimir Oka

Martin JÃ¸rgensen opined:

Vladimir Oka wrote:
Martin Joergensen opined:

-snip-
What exactly is that -1.#IND000000000000 means?

My guess is that your commented out code overflows. The above looks
like some IEEE error/overflow/NaN thingy. It's still too much
weekend 'round here for a closer look. Also, you should have really
posted a compilable example that shows the problem.

You shouldn't try to code anything yourself... I just tried it but my
code is pretty big. I mean: hx, hy and h0 is calculated in each step
and dynamically allocated so the smallest possible code would
probably still be some hundred lines.

I just made a 100 line program that didn't calculate hx and hy and
here there was no error.... Hmm. So I guess I have an error somewhere
in my code...

Still I can't explain the difference between adding 5 numbers and
then get the right result vs. doing everything at once... So that's
basically why I asked. And when Microsoft Visual studio 2005 debugger
agrees with me, then I'm really confused because I couldn't imagine
that it would lie to me about anything.

-snip-
It works now (with 5 individual calculations), but I just want to
be sure that this error doesn't come back later....

I'd also recommend you find a good text on the pitfalls of using
finite precision floating point arithmetic. One fairly good link has
been posted here a couple of months back. The only thing I remember
about it was that it was on the Sun site somewhere.

I looked at it... It seems like it takes pretty long time to read it
all. Perhaps I'll look at it later. Thanks for the answer.

I think once you read it, you may find it easier to imagine "the
difference between adding 5 numbers and then get the right result vs.
doing everything at once..." And I do mean it in the nicest possible
way.

I'm still prepared to be wrong, but I wouldn't be surprised that your
problem lies in one of the intermediate terms in "everything at once"
code overflowing, or misbehaving in a different way.

--
No problem is insoluble.
-- Dr. Janet Wallace, "The Deadly Years", stardate 3479.4

<http://clc-wiki.net/wiki/Introduction_to_comp.lang.c>

May 7 '06 #6

Old Wolf

Martin Joergensen wrote:

for(j=2, i=1, no=1; no<=(int) number_of_interior_cells; no++)
{
/* update columns (x-value) */
i++;

solved_temp[no] = T[j][i];

Check that T[] and solved_temp[] are allocated properly, remembering
that arrays in C are zero-based, ie. you need:

int solved_temp[number_of_interior_cells+1];

or

int *solved_temp = malloc( sizeof *solved_temp
* (number_of_interior_cells+1) );

and similarly for T.

Also, your two divisions are different; in general (a+b)/2
is not the same as (a/2) + (b/2). For example if a,b are
ints and a=3, b=5, then (a+b)/2 = 4 but a/2 + b/2 = 3.

If they are floating point then there may still be minor
errors due to rounding errors.

And in both cases, the (a+b) might overflow.

If you can provide a complete, compilable program,
and the input data yo uare using, then people here
will be able to reproduce and explain your results.

May 8 '06 #7

Martin JÃ¸rgensen

Vladimir Oka wrote:

Martin JÃ¸rgensen opined: -snip-
I think once you read it, you may find it easier to imagine "the
difference between adding 5 numbers and then get the right result vs.
doing everything at once..." And I do mean it in the nicest possible
way.

I'm still prepared to be wrong, but I wouldn't be surprised that your
problem lies in one of the intermediate terms in "everything at once"
code overflowing, or misbehaving in a different way.

That's also what I think.... Ok, thanks for your time.

Have anyone in here experienced anything like that, when there's a
difference between adding things together in multiple separate
calculations versus in a single (longer) calculation?

The worst thing is that by splitting the calculation up in 5 pieces, I
thought I could just step into each of them and find the error easily...

Since the "error" disappeared now, I can't find "the error".... And this
is what's bugging me.... If I could just find the exact place and say:
Ha! Got you... But I can't...
Best regards / Med venlig hilsen
Martin JÃ¸rgensen

--
---------------------------------------------------------------------------
Home of Martin JÃ¸rgensen - http://www.martinjoergensen.dk

May 8 '06 #8

Vladimir Oka

Martin JÃ¸rgensen opined:

Vladimir Oka wrote:
Martin JÃ¸rgensen opined: -snip-
I think once you read it, you may find it easier to imagine "the
difference between adding 5 numbers and then get the right result
vs. doing everything at once..." And I do mean it in the nicest
possible way.

I'm still prepared to be wrong, but I wouldn't be surprised that
your problem lies in one of the intermediate terms in "everything at
once" code overflowing, or misbehaving in a different way.

That's also what I think.... Ok, thanks for your time.

Not at all.
Have anyone in here experienced anything like that, when there's a
difference between adding things together in multiple separate
calculations versus in a single (longer) calculation?
Yes, in a different life, and a different language (FORTRAN).
The worst thing is that by splitting the calculation up in 5 pieces,
I thought I could just step into each of them and find the error
easily...

Since the "error" disappeared now, I can't find "the error".... And
this is what's bugging me.... If I could just find the exact place
and say: Ha! Got you... But I can't...

Sounds to me, you *did* find your error. It's just that you still can't
*explain* it properly. Did you try repeating the calculations by hand?

--
You're dead, Jim.
-- McCoy, "Amok Time", stardate 3372.7

<http://clc-wiki.net/wiki/Introduction_to_comp.lang.c>

May 8 '06 #9

Martin JÃ¸rgensen

Vladimir Oka wrote:

Martin JÃ¸rgensen opined:

-snip-

Have anyone in here experienced anything like that, when there's a
difference between adding things together in multiple separate
calculations versus in a single (longer) calculation?

Yes, in a different life, and a different language (FORTRAN).

The worst thing is that by splitting the calculation up in 5 pieces,
I thought I could just step into each of them and find the error
easily...

Since the "error" disappeared now, I can't find "the error".... And
this is what's bugging me.... If I could just find the exact place
and say: Ha! Got you... But I can't...

Sounds to me, you *did* find your error. It's just that you still can't
*explain* it properly. Did you try repeating the calculations by hand?

By hand? It should give 20 degrees and not -1.#IND000000000000. The
reason is that all other terms than the 20 gives 0... So there's nothing
to "do by hand" really... Visual studio debugger is right. But once I
step into the long expression it gives -1.#IND000000000000 instead of 20.

When I wrote "the error disappeared now" I meant that instead of using
the long expression it works by adding 5 individual smaller sums
(swapping commented vs. uncommented code)...

So I wouldn't exactly say I found the error - perhaps it comes back
another place... But I suspect there's an error somewhere, perhaps in MS
visual studio 2005... And perhaps also in my program??? It's totally
insane... :-(

Everything works like it should under linux.... The same code.... I'm
not so experienced that I can understand the output from Valgrind but it
seems to me like there's no big errors / memory leaks or the like... Now
I'm considering continuing doing my project under linux (it's my
bachelor project)...

Today I got some help from a more experienced programmer than me so we
tried www.ibm.com/software/awdtools/purify and updated it with the
latest patch - this gives really strange warnings (lots of
printf-errors??? and some "memory leaks") and it was clear that it
doesn't work really well with Visual Studio 2005 yet - according to IBM.
Best regards / Med venlig hilsen
Martin JÃ¸rgensen

--
---------------------------------------------------------------------------
Home of Martin JÃ¸rgensen - http://www.martinjoergensen.dk

May 8 '06 #10

Martin Jørgensen

Old Wolf wrote:

Martin Joergensen wrote:
for(j=2, i=1, no=1; no<=(int) number_of_interior_cells; no++)
{
/* update columns (x-value) */
i++;

solved_temp[no] = T[j][i];

Check that T[] and solved_temp[] are allocated properly, remembering
that arrays in C are zero-based, ie. you need:

int solved_temp[number_of_interior_cells+1];

or

int *solved_temp = malloc( sizeof *solved_temp
* (number_of_interior_cells+1) );

and similarly for T.

Yep. But they're both type double (not important though).
Also, your two divisions are different; in general (a+b)/2
is not the same as (a/2) + (b/2). For example if a,b are
ints and a=3, b=5, then (a+b)/2 = 4 but a/2 + b/2 = 3.
Yep, I know. But here's it all "double"-types.
If they are floating point then there may still be minor
errors due to rounding errors.
But not anything that could go from 20 to -1.#IND000000000000, I guess.
And in both cases, the (a+b) might overflow.
These numbers are far to small for overflow to occur.
If you can provide a complete, compilable program,
and the input data yo uare using, then people here
will be able to reproduce and explain your results.

I can't make that. I tried to make a 100 line program (no dynamic
allocation and fixed variables). Actually I think it's something that
doesn't have to do with the C language itself... Perhaps compiler
options/switches or linker options/switches or something... I'll try to
see if I can isolate the error or else I'll switch to Linux where it
works...
Best regards / Med venlig hilsen
Martin Jørgensen

--
---------------------------------------------------------------------------
Home of Martin Jørgensen - http://www.martinjoergensen.dk

May 8 '06 #11

Martin Jørgensen

Martin Joergensen wrote:

Hi,

I've encountered a really, *really*, REALLY strange error :-)

I have a for-loop and after 8 runs I get strange results...... I mean: A
really strange result....
I'm calculating temperatures. T[j][i] = 20 degrees at all times.... The
2D T-array looks like this:

| 2: 3: ..... i-direction goes to the right ->
--+----------
2: | 20 20
3: | 20 20

j-direction goes down

(index 1 and 4 = 0, but they don't matter)

The code *that works* looks like this:
- - - - - - - - - - - -

{
/* copy 2D (old) temperatures to 1D-array */
count = 0;
solved_temp[0] = 0;

/* number_of_interior_cells = 4 */
for(j=2, i=1, no=1; no<=(int) number_of_interior_cells; no++)
{
/* update columns (x-value) */
i++;

solved_temp[no] = T[j][i];
solved_temp[no] += ( hx[j][i] * ( T[j][i-1]-T[j][i] ) ) / h0[j][i];
solved_temp[no] += ( hx[j][i+1] * ( T[j][i+1]-T[j][i] ) ) /
h0[j][i];
solved_temp[no] += ( hy[j][i] * ( T[j-1][i]-T[j][i] ) ) / h0[j][i];
solved_temp[no] += ( hy[j+1][i] * ( T[j+1][i]-T[j][i] ) ) /
h0[j][i];

/* below you see the code that *doesn't work* - isn't it the
same??? */

//solved_temp[no] = T[j][i] + (
// hx[j][i] * ( T[j][i-1]-T[j][i] ) + hx[j][i+1] * (
T[j][i+1]-T[j][i] ) +
// hy[j][i] * ( T[j-1][i]-T[j][i] ) + hy[j+1][i] * (
T[j+1][i]-T[j][i] )
// ) / h0[j][i];

if(i == nx-1) /* hits east boundary? */
{
j++; /* next row */
i=1; /* reset x-coordinate */
}
}
}

The expected result is that solved_temp[1] = 20, solved_temp[2] = 20,
solved_temp[3] = 20, solved_temp[4] = 20 at all times. If I swap the
above outcommented/commented sections with each other, I get what I
expect the first 7 times. On the 8th time I get that the next result for
solved_temp[1] = -1.#IND000000000000, but the other results are okay....

-snip-

Update:

The program works in "RELEASE" mode but not under "DEBUG" mode under the
*exact* same conditions with the same source code, with the same input
file... And again: it also works under linux (gcc).

Does that ring a bell anywhere? I hope it does somewhere, because I'm
still lost........ Compiler is Visual studio 2005...

Since I guess it's not scrictly standard C anymore I've made xfut and
fut to: comp.os.ms-windows.programmer.win32 but I'll also continue to
read in comp.lang.c if anyone posts answers (here) there...

Best regards / Med venlig hilsen
Martin Jørgensen

--
---------------------------------------------------------------------------
Home of Martin Jørgensen - http://www.martinjoergensen.dk

May 10 '06 #12

perhaps a stack problem? Long calculations - strange error?

Similar topics