Most Interesting Bug Track Down

Frederick Gotham

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing code which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

I'm sure several regulars here have more interesting stories... :)

--

Frederick Gotham

Nov 24 '06 #1


Eric Sosman

Frederick Gotham wrote:

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing code which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

Looks like it may not have jumped high enough: The bug is there
even without embedded '\0' characters.
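The off-by-one being pointed out here can be shown directly: `sizeof` applied to a string literal counts the terminating '\0', while strlen does not. A minimal sketch, with helper names made up purely for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* sizeof on a string literal yields the size of the whole array,
   terminating '\0' included. */
static size_t literal_size(void)   { return sizeof "Hello"; }

/* strlen stops at (and does not count) the first '\0'. */
static size_t literal_length(void) { return strlen("Hello"); }

/* Hypothetical helper: for a literal with no embedded '\0' the two
   values differ by exactly one. */
static void check_off_by_one(void)
{
    assert(literal_size() == literal_length() + 1);
}
```

So a structure initialised with `sizeof "Hello"` records 6, one more than strlen would report.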

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 24 '06 #2

Jack Klein

On Fri, 24 Nov 2006 22:12:15 GMT, Frederick Gotham
<fg*******@SPAM.com> wrote in comp.lang.c:

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing code which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

Why "as usual"? Once you had the application working correctly, was
it too slow? Did you profile or otherwise test to prove that this
was a bottleneck?

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

So now, thanks to premature optimization, you have violated your
design. You are initializing with something that is not a string, or
more precisely a string plus something else.

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

How could that be a problem? You just said you eliminated all use of
strlen().

I'm sure several regulars here have more interesting stories... :)

Sounds like a poor design, aggravated by violating its constraints in
use.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html

Nov 24 '06 #3

Frederick Gotham

Eric Sosman:

Looks like it may not have jumped high enough: The bug is there
even without embedded '\0' characters.

Ah yes, I should have mentioned that I took the "sizeof "Hello" - 1" into
account.
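The macro itself was never posted, so the following is only a sketch of what a corrected `MAKE_STR_INFO` might look like, together with a check that exposes the original mismatch. The macro body, the `const` on the pointer and the helper `len_matches_strlen` are all assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct PtrAndLen {
    const char *p;   /* const added here: string literals must not be modified */
    size_t len;      /* number of chars, not counting the final '\0' */
};

/* Hypothetical definition; only valid for string literals, since
   sizeof applied to a pointer would give the pointer's size. */
#define MAKE_STR_INFO(lit) { (lit), sizeof (lit) - 1 }

static const struct PtrAndLen one = MAKE_STR_INFO("Hello");
static const struct PtrAndLen two = MAKE_STR_INFO("Hello\0Bonjour");

/* Returns nonzero when the stored length agrees with strlen. */
static int len_matches_strlen(const struct PtrAndLen *pal)
{
    assert(pal != NULL && pal->p != NULL);
    return strlen(pal->p) == pal->len;
}
```

With the embedded '\0', `two.len` covers both words and the separator, while `strlen(two.p)` stops after "Hello", so any code path that still calls strlen disagrees with the stored length.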

--

Frederick Gotham

Nov 24 '06 #4

Frederick Gotham

Jack Klein:

As per usual with my code, I made it efficient to the extreme. One thing
I did was replace, where possible, any usages of "strlen" with something
like:

Why "as usual"? Once you had the application working correctly, was
it too slow? Did you profile or otherwise test to prove that this
was a bottleneck?

I have never written a program for monetary gain. I program purely for the
enjoyment of programming. If I achieve a certain objective, I am not satisfied
-- I want to achieve the objective as efficiently as is possible.

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place
of strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null
separator, e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

So now, thanks to premature optimization, you have violated your
design. You are initializing with something that is not a string, or
more precisely a string plus something else.

I can live with the minor complication though, given that my code runs
several orders of magnitude faster than the "play it safe" equivalent.

--

Frederick Gotham

Nov 24 '06 #5

Gordon Burditt

I have never written a program for monetary gain. I program purely for the
enjoyment of programming. If I achieve a certain objective, I am not satisfied
-- I want to achieve the objective as efficiently as is possible.

If it doesn't have to work correctly, any program can run in 0 time
and 0 bytes.

Even programs like the OS/360's IEFBR14, or the C equivalent:

int main(void)
{
return 0;
}

can be made to work faster if they don't have to run correctly.

Nov 25 '06 #6

CBFalconer

Frederick Gotham wrote:

.... snip ...
I was writing code which dealt with strings. As per usual with my
code, I made it efficient to the extreme. One thing I did was
replace, where possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in
place of strlen.

The code grew though, and at one stage I needed to store info about
two strings in one of these structures. To do this, I used a null
separator, e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

Ugh. You deserved anything that happened to you.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

Nov 25 '06 #7

Mark F. Haigh

Frederick Gotham wrote:

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

<snip>

I'm sure several regulars here have more interesting stories... :)

Ok, I'll bite.

One of the most unusual things I've tracked down was a production-level
C program that suddenly failed with the addition of a single comment in
a single .c file. Mind you, this was an application deployed on custom
hardware and in active use at thousands of sites worldwide. It wasn't
even a new comment-- it was the addition of about 30 characters to an
existing comment.

The usual suspects were considered. Perhaps it was a build-order
issue, I thought. I've run into those before. If there are data
corruption problems in the code, changes in the order that files are
compiled and linked can cause the linker to allocate different storage
locations for different variables. This can cause a "more critical"
(ie pointer) variable to be clobbered rather than another (ie
statistics counter) when corruption occurs.

No dice. My original build, a clean rebuild, and a rebuild of a clean
build with several hundred random files touched all failed. Hmmm. At
best this was inconclusive.

Now this was an embedded product running the vxWorks RTOS. In the bad
old days, on the bad old MMU-less platforms, it could take a real
effort to get debugging information out of a production board if it
crashed hard. Unfortunately, I couldn't track down any boards with
debug facilities, so I had to disassemble the chassis and have a rework
tech solder a DB-9 on it so I could get some serial output from it (my
soldering isn't the greatest).

Much to my chagrin, I quickly realized that any substantial code
modifications other than that comment line caused the code to spring
back to life. "If you can't beat 'em, diff 'em", I thought. I
reverted the original change and rebuilt the file, except this time
saving the assembly and preprocessed output. I then re-added the
problematic comment and rebuilt, again saving the assembly and
preprocessed output. On each build I set the randomization seeds to
the same values, so any pseudo-random numbers used by the compiler
would have the same values on both builds.

I fired up the graphical diff program. No changes to the preprocessed
files except the line numbers, which differed by one. No changes to
the assembly files, except...

.... A single additional line of assembly code. Apparently a
preprocessor macro in a header file far, far away had decided to use
the value of __LINE__ for field debugging of production code. Since
the CPU was an ARM7, and the ARM instruction set can only load a 12 bit
immediate IIRC, and (shockingly) the __LINE__ value was exactly on the
boundary, the compiler had generated a "load immediate; increment by 1"
instruction sequence instead of just a "load immediate".
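For readers unfamiliar with the encoding: in the classic 32-bit ARM instruction set, the 12-bit operand field of a data-processing instruction holds an 8-bit constant plus a 4-bit rotation (applied in steps of two bits), so not every small number fits in a single instruction. The sketch below checks whether a value is encodable that way. The function names are illustrative, and the actual line number involved is not given in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 <= n < 32). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return n == 0 ? v : (v << n) | (v >> (32u - n));
}

/* Returns 1 if v fits the classic ARM immediate form:
   an 8-bit constant rotated right by an even number of bits. */
static int arm_imm_encodable(uint32_t v)
{
    unsigned rot;
    for (rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by 'rot' and see whether 8 bits remain. */
        if (rotl32(v, rot) <= 0xFFu)
            return 1;
    }
    return 0;
}
```

Under this rule 256 is encodable but 257 is not, so a `__LINE__` value that crosses such a boundary needs a second instruction, which is consistent with the "load immediate; increment by 1" sequence described above.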

But who cares!?! The instruction sequence was correct, anyways.
Perhaps an assembler bug? This did not seem to be likely, given "The
Two General Rules of Compiler and Assembler Bugs":

1. It's not a compiler or assembler bug.
2. See rule #1*.

* Unless you can prove it.

I verified with objdump and the ARM Architecture Reference Manual that
the opcodes being generated by the assembler were indeed correct.
Hmmm.

Every code modification I could think of caused things to start working
again. Everything seemed to point toward a hardware bug of
some kind. But given "The Two General Rules of Hardware Bugs": ...

1. It's not a hardware bug.
2. See rule #1*.

* Unless you can prove it.

.... I now had to prove it, especially since the units were already in
the field. On a hunch I looked at the assembly code directly after
this new additional instruction. I saw a load of a value from memory
that was added to another value and used as a pointer.

I inserted a jump to a routine I wrote in assembly that moved the
freshly-loaded value to a known memory location. It then jumped to
some C code in another translation unit that dumped the value, and then
reloaded and re-dumped it.

The value of the original read was 0, and the reload of the same
location generated a valid pointer value. Obviously they should have
been equal (I had already ruled out concurrent access problems
previously).

Turns out the memory subsystem was laid out by a junior designer at
another company who had incorrectly chained some of the RAM clock
traces (IIRC), where they should have all had equal lengths. This
particular instruction sequence, combined with the peculiar cache
behavior of the application had caused it to issue memory reads that
exposed the RAM subsystem timing problems. Rumor had it that somebody
at the other company had done a quickie board spin to increase the
amount of RAM without properly reviewing the change. Amazingly, the
rest of the half-million lines of code seemed to run just fine. How, I
will never know.

The entire thing was so unlikely that I couldn't help but think,
"Great. I'll probably be hit with a meteorite on the way to my car
tonight." Fortunately the bad luck stopped with a new spin of the
board, and I'm still here to tell the tale.

--
Mark F. Haigh
mf*****@sbcglobal.net

Nov 25 '06 #8

websnarf

Frederick Gotham wrote:

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing code which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

First of all (sizeof "inline string") is 1+strlen ("inline string").
So I assume you compensated for this in your macro.

Second of all, in "The Better String Library", which does the same
thing, this is not a bug but in fact the correct behavior. '\0' is a
legitimate character, not a string terminator. Where the semantics
coincide (which is most of the time, when dealing with pure text data)
you can assume strlen(b->data) is the same as b->slen. In
Bstrlib, you would never try to mash two strings together using some
kind of hacked representation such as "string1\0string2"; that would
make no sense. Because Bstrlib is more consistent in this respect,
these sorts of bugs are far less likely.

I'm sure several regulars here have more interesting stories... :)

Oh sure:

1) if (a < 0) a = -a; b = sqrt (a);

2) Anything involving a stack overrun with stack checking turned off.
You just have to be inspired to imagine that this is your problem. The
standard is worthless for helping you here.

3) Assuming that vararg parameters were passed by value and could be
"reset" by retrieving its original value. (No debugger or compiler
diagnostic can help you figure out what is going wrong here.)

4) Watching Microsoft Visual C++ barf on struct tagbstring b = {
sizeof("string")-1, -__LINE__, "string" }; because MS's preprocessor
emitted something like _line+425 for __LINE__, and it complained that
it was not a compile time constant.

5) Adventures with WATCOM C/C++ v11.x's optimizations with "-ol" turned
on. It just fails to build correct code for about 10% of the source
I've written. These are real fun to track down. Like the stack
checking thing, you just have to be inspired to try turning the flag
off to see if it fixes the problem.
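Item 3 does not say what the offending code looked like. One common reading is an attempt to walk a variable argument list twice by "saving" the va_list and restoring it later. A minimal sketch of the portable way to do that in C99 uses `va_copy`; the function `sum_twice` is hypothetical.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums 'count' int arguments, walking the list two times.
   Each pass gets its own va_list via va_copy (C99); plain assignment
   of a va_list is not a portable way to "rewind". */
static int sum_twice(int count, ...)
{
    va_list ap, again;
    int i, total = 0;

    assert(count >= 0);
    va_start(ap, count);
    va_copy(again, ap);          /* snapshot taken before any va_arg */

    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);

    for (i = 0; i < count; i++)
        total += va_arg(again, int);
    va_end(again);

    return total;
}
```

For example, `sum_twice(3, 1, 2, 3)` adds 1 + 2 + 3 on each pass.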

Then there's the standard "I forgot I made assumption X in function Y
then passed it parameters which technically violated X even though it
wasn't obvious that it was". Unfortunately, in the C language, these
assumptions often take the form of "allocated at least some certain
amount of space" or "the parameter is a well-formed non-empty linked
list" etc, and the error is usually undefined behavior.

I don't do a lot of heap or stack smashing anymore these days, as I
generally wrap things in rigorous enough abstractions, and I just
generally use debug heaps while developing. But there can still be
problems of convention. A hash table I implemented has an iterator
mechanism, and I made the termination condition when the index was
greater than the current hash table size -- the problem is that when I
came back to reuse this code after more than a year, I forgot my
convention for termination and thought it was when the index was < 0.
So I walked off the end of the hash array nicely because I did not
sufficiently document the convention. The problem is that I was using
-1 as the start-up index (since 0 may or may not be a valid entry, and
you *have* to perform an increment on every call to the iterator
incrementor) and so could not use < 0 as the terminator condition. But
it meant that my intuition conflicted with what was necessary. I fixed
this by creating an "isDone" macro for the iterator.
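A sketch of how such an "isDone" macro can keep the termination convention in one place. The structure layout, the names and the choice of `>=` as the exact comparison are assumptions; the post gives none of these details.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table and iterator; field names are illustrative. */
struct table { const int *used; long size; };   /* used[i] != 0: slot occupied */
struct iter  { const struct table *t; long idx; };

/* The termination convention lives in exactly one place. */
#define ITER_IS_DONE(it) ((it)->idx >= (it)->t->size)

static void iter_start(struct iter *it, const struct table *t)
{
    assert(it != NULL && t != NULL);
    it->t = t;
    it->idx = -1;                 /* start-up index: before slot 0 */
}

/* Always increments at least once, then skips empty slots. */
static void iter_next(struct iter *it)
{
    do {
        it->idx++;
    } while (!ITER_IS_DONE(it) && !it->t->used[it->idx]);
}
```

A caller then writes `for (iter_start(&it, &t), iter_next(&it); !ITER_IS_DONE(&it); iter_next(&it))` and never has to remember whether "done" means a negative index or one past the end.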

With multithreaded errors, I already know a priori that they are
difficult. When I can, and I detect such a bug, I will spend a short
amount of time trying to track it down. If I can't get it, I junk the
contentious code and start over. It's just a matter of productivity --
these bugs can be so hard, that it will take longer to track them down
than to rewrite the code. Sometimes I don't learn/figure out what I
did wrong, but life is too short.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Nov 25 '06 #9

Frederick Gotham

Gordon Burditt:

I have never written a program for monetary gain. I program purely for
the enjoyment of programming. If I achieve a certain objective, I am not
satisfied -- I want to achieve the objective as efficiently as is
possible.

If it doesn't have to work correctly, any program can run in 0 time
and 0 bytes.

Which is a tremendous argument, only that my programs do run right in the
end.

--

Frederick Gotham

Nov 25 '06 #10

Eric Sosman

Gordon Burditt wrote:

I have never written a program for monetary gain. I program purely for the
enjoyment of programming. If I achieve a certain objective, I am not satisfied
-- I want to achieve the objective as efficiently as is possible.

If it doesn't have to work correctly, any program can run in 0 time
and 0 bytes.

Even programs like the OS/360's IEFBR14, or the C equivalent:

int main(void)
{
return 0;
}

can be made to work faster if they don't have to run correctly.

<off-topic>

Wikipedia's article on IEFBR14 makes amusing reading. The
original one-instruction program had a bug, which must set some
kind of unenviable standard for "fault density:" one bug per
machine instruction! Not only that, but the fix had a bug, and
the fix to the fix had a bug, and it wasn't until the fourth
version of IEFBR14 that the program "did nothing" correctly.

The final version had three times as many lines of source as
the first, executed three times as many instructions, and occupied
eight times as much memory. Code bloat wasn't invented in Redmond.

</off-topic>

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 25 '06 #11

William Hughes

Frederick Gotham wrote:

I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing code which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

PtrAndLen pal = {"Hello\0Bonjour", sizeof "Hello\0Bonjour"};

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

I'm sure several regulars here have more interesting stories... :)

The ugliest bug (though by no means the hardest)
I ever had to fix was

strcmp(search_string, TARGET_STRING)

where TARGET_STRING was #defined
as a string literal. I had a problem
that even though search_string was the same as the
define, it was not matching.

Turns out the code had been changed
from using us_satellite1 and europe_satellite
to us_satellite1, us_satellite2 and europe_satellite.
Because the same thing was done for both
us_satellite1 and us_satellite2 it was only necessary
to check the us_satellite part. Enter string_strip, which
in addition to taking off whitespace, was modified to
remove trailing numbers. Now all that was needed
was to strip the string, then compare. To get
the comparison string you strip TARGET_STRING.
You only need to do this once as the name does
not change.
(The contract on the programmer responsible is still
open. It requires proof of a slow agonizing death.)
Of course, if the compiler had refused to modify
a string literal this would not have worked.
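Modifying a string literal is undefined behaviour in C, and whether other uses of the same literal text see the change depends on whether the compiler merges identical literals. A minimal sketch of a safer arrangement strips writable copies instead. `strip_trailing_digits` is a hypothetical stand-in for the digit-removing part of string_strip, and the value of TARGET_STRING is illustrative.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define TARGET_STRING "us_satellite1"   /* illustrative value */

/* Hypothetical stand-in for string_strip: removes trailing digits
   in place. The caller must pass writable storage. */
static void strip_trailing_digits(char *s)
{
    size_t n;
    assert(s != NULL);
    n = strlen(s);
    while (n > 0 && isdigit((unsigned char)s[n - 1]))
        s[--n] = '\0';
}

/* Strips private copies of both strings, then compares.
   Returns 1 on match, 0 on mismatch or if the input is too long. */
static int matches_target(const char *search_string)
{
    char target[sizeof TARGET_STRING];      /* writable copy of the literal */
    char search[sizeof TARGET_STRING + 8];  /* arbitrary headroom */

    if (strlen(search_string) >= sizeof search)
        return 0;
    strcpy(target, TARGET_STRING);
    strcpy(search, search_string);
    strip_trailing_digits(target);
    strip_trailing_digits(search);
    return strcmp(search, target) == 0;
}
```

The literal behind TARGET_STRING is never written to, so every other use of the macro still sees the original text.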

Three bugs that are also cautionary tales about
undefined behaviour.

One large program started failing at odd moments. We traced the
problem to some changes made by one of the programmers. However,
these changes were made to a completely different part of the program,
and they were only intended to add another input format. Furthermore,
the changes looked fine and worked correctly.

On looking at the code I noticed that a standard library function
(I don't recall exact details) was called in a non-standard way.
Knowing that this couldn't possibly be the problem, I changed the call
to conform to the standard. The bug disappeared.
Undefined behaviour includes appearing to do exactly what you
want, but causing subtle problems elsewhere.

Another bug had to do with the equivalent of

a[i] = i++

This was in a piece of code we had used for years, on many platforms
and many compilers. One day the code didn't work correctly on
the SGI (this is also an argument for test suites with known results;
the failure mode was a small loss of accuracy, not a complete
failure).
Yep, turned out that a compiler upgrade was the culprit. Undefined
behaviour can lie dormant for years.
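The expression `a[i] = i++` both modifies `i` and reads it to pick the array element, with no sequence point in between, so the standard places no requirement on the result. The post does not say which element was intended; the sketch below shows one reading, with the increment moved into its own statement so the order is defined. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Stores the index value into each slot: a[0]=0, a[1]=1, ...
   The increment is a separate step, so the order is defined. */
static void fill_with_index(int *a, size_t n)
{
    size_t i = 0;
    assert(a != NULL || n == 0);
    while (i < n) {
        a[i] = (int)i;   /* use i ... */
        i++;             /* ... then modify it, in a separate statement */
    }
}
```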

A third, fairly easy to find bug, had to do with the use of
the nonstandard strdup. The code refused to compile on one
platform, even though this platform had a strdup function. Turned
out that the culprit here was a #define POSIX statement. If
POSIX was defined, strdup was not available. So
the argument "feature X is available on almost all machines"
has a problem. Yes, feature X is available, but it may only
be available in a nonstandard mode.
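One way to avoid depending on which feature macros happen to be defined is to carry a small replacement written with nothing but standard library calls. A minimal sketch; the name `dup_string` is hypothetical, chosen so it cannot clash with a platform's own strdup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable replacement for strdup using only ISO C.
   Returns a malloc'ed copy the caller must free, or NULL on failure. */
static char *dup_string(const char *s)
{
    size_t n;
    char *copy;

    assert(s != NULL);
    n = strlen(s) + 1;            /* include the terminating '\0' */
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```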

- William Hughes

Nov 25 '06 #12

Ian Collins

Mark F. Haigh wrote:

Turns out the memory subsystem was laid out by a junior designer at
another company who had incorrectly chained some of the RAM clock
traces (IIRC), where they should have all had equal lengths. This
particular instruction sequence, combined with the peculiar cache
behavior of the application had caused it to issue memory reads that
exposed the RAM subsystem timing problems. Rumor had it that somebody
at the other company had done a quickie board spin to increase the
amount of RAM without properly reviewing the change. Amazingly, the
rest of the half-million lines of code seemed to run just fine. How, I
will never know.

That reminds me of one of mine - "your driver doesn't work in the afternoon".

About 2 o'clock each day, an HDLC driver I had written started to
experience random crashes; no amount of debug code or simulation could
catch the problem. As a contractor, I was under immense pressure from
the hardware people to fix "my broken software". In the end I came in
over the weekend so I could pinch a decent scope and logic analyser,
hooked up a couple of dozen probes and waited.

Sure enough, mid afternoon the board fell over. But this time I had the
evidence, a late data acknowledge strobe. Turns out the hardware
designer had used a 47K rather than a 4K7 pull-up resistor. Which just
about pulled the strobe high in time until the lab temperature got above
20C. It was lucky for us we were testing in the summer.

Needless to say, I enjoyed the team meeting on the following Monday morning!

--
Ian Collins.

Nov 25 '06 #13

Mark F. Haigh

Ian Collins wrote:

<snip>

That reminds me of one of mine - "your driver doesn't work in the afternoon".

About 2 o'clock each day, an HDLC driver I had written started to
experience random crashes; no amount of debug code or simulation could
catch the problem. As a contractor, I was under immense pressure from
the hardware people to fix "my broken software". In the end I came in
over the weekend so I could pinch a decent scope and logic analyser,
hooked up a couple of dozen probes and waited.

HDLC, hmm? Did you ever get stuck dealing with the Intel IXPs? They
might have looked great on paper, but those things were complete turds
to program. Stitching buggy software together with buggy microcode,
ugh. At least with a PCI device you get a ~50% correct datasheet and a
T-shirt from the vendor, but with the IXP that kind of thing seems like
a utopian dream.

I suppose a topical conclusion can be drawn-- always write bomb-proof C
so you can prove it's a hardware problem.

Sure enough, mid afternoon the board fell over. But this time I had the
evidence, a late data acknowledge strobe. Turns out the hardware
designer had used a 47K rather than a 4K7 pull-up resistor. Which just
about pulled the strobe high in time until the lab temperature got above
20C. It was lucky for us we were testing in the summer.

Here's a funny one. A prototype high-end system that I was working on
was slated to get some new high-CFM (cubic feet / minute) fans because
of a new processor's increased heat output. These particular fans had
tachometer / RPM-sensing support. The plan was to use it to detect
failed fans and alert the user to order a replacement. The system
could run properly at normal room temperature with 2 failed fans.

It was a prototype board, so I had no chassis to mount the fans to. I
just put them on my desk, plugged them in, and powered up. The fans
went airborne. I quickly powered down and duct-taped the fans to my
desk.

Unfortunately the wires to the fans were quite short and the fans were
uncomfortably close to the power supply switch. I would just have to
be careful, I thought.

After a couple of days, I had all of the fan driver functionality
implemented. It could give you instantaneous RPM, low / high RPM
watermarks, and it was tied in to the rest of the system.

Near the end of the second day, I was distracted and reached over
carelessly to power the system down. Big mistake. The highly curved
and very sharp blades of one of the fans reached out and cut the very
tip of my right thumb off.

As my thumb bled into a paper towel, I thought "OK, time for a beer",
and headed across the street to grab a couple of pints. I eventually
got it all clotted up and bandaged. I headed back to my desk and
powered up the system...

Sure enough, the system had logged an alert for a possible fan failure
due to low RPM for the period during which it was de-skinning the tip
of my thumb.

Needless to say, I enjoyed the team meeting on the following Monday morning!

Not me. Bandage was still on my thumb, and everybody still thought it
was hilarious (including me).

Ok, ok, sorry. Yes, it's off topic. Yes, I'll shut up now.

--
Mark F. Haigh
mf*****@sbcglobal.net

Nov 26 '06 #14

Ian Collins

Mark F. Haigh wrote:

HDLC, hmm? Did you ever get stuck dealing with the Intel IXPs? They
might have looked great on paper, but those things were complete turds
to program.

No, MC68360, nice to program, not too buggy microcode!

Not me. Bandage was still on my thumb, and everybody still thought it
was hilarious (including me).

Ok, ok, sorry. Yes, it's off topic. Yes, I'll shut up now.

Amusing none the less.

--
Ian Collins.

Nov 26 '06 #15

goose

Frederick Gotham wrote:

<snipped jumping bug>

I'm sure several regulars here have more interesting stories... :)

On the (seriously drain-bamaged) platform I am working on
right now, there are calls to open a file, read data from a file,
write data to a file and rename a file. Yup ... no close file function!

I ran across an interesting bug last week; when a file has been
opened for writing, but no data gets written to it and then the
file is renamed ... the file contains junk data of a fairly random
length.

Try explaining *that* bug to your boss. Luckily, it seems
that the newer models of this particular device will (when
we get them in 2007/2008) be running ARM linux, and not
this current POS-specially-written-OS...

goose,

Nov 26 '06 #16

Keith Thompson

"William Hughes" <wp*******@hotmail.com> writes:
[snip]

A third, fairly easy to find bug, had to do with the use of
the nonstandard strdup. The code refused to compile on one
platform, even though this platform had a strdup function. Turned
out that the culprit here was a #define POSIX statement. If
POSIX was defined, strdup was not available. So
the argument "feature X is available on almost all machines"
has a problem. Yes, feature X is available, but it may only
be available in a nonstandard mode.

Are you sure about that? I'd expect strdup() to be available only if
POSIX *is* defined.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 27 '06 #17

Ben Pfaff

Keith Thompson <ks***@mib.org> writes:

"William Hughes" <wp*******@hotmail.com> writes:
[snip]
A third, fairly easy to find bug, had to do with the use of
the nonstandard strdup. The code refused to compile on one
platform, even though this platform had a strdup function. Turned
out that the culprit here was a #define POSIX statement. If
POSIX was defined, strdup was not available. So
the argument "feature X is available on almost all machines"
has a problem. Yes, feature X is available, but it may only
be available in a nonstandard mode.

Are you sure about that? I'd expect strdup() to be available only if
POSIX *is* defined.

According to the SUSv3 "Change History" for strdup, it was
originally part of the X/OPEN UNIX extension, so it seems
believable that a strict POSIX mode might cause it to be omitted.

Perhaps it is worth commenting that #define is not a statement.
--
Bite me! said C.

Nov 27 '06 #18

William Hughes

Keith Thompson wrote:

"William Hughes" <wp*******@hotmail.com> writes:
[snip]
A third, fairly easy to find bug, had to do with the use of
the nonstandard strdup. The code refused to compile on one
platform, even though this platform had a strdup function. Turned
out that the culprit here was a #define POSIX statement. If
POSIX was defined, strdup was not available. So
the argument "feature X is available on almost all machines"
has a problem. Yes, feature X is available, but it may only
be available in a nonstandard mode.

Are you sure about that? I'd expect strdup() to be available only if
POSIX *is* defined.

No, this was from memory. However, a quick check suggests that strdup
is not a POSIX function. The problem was definitely that strdup was
available on the machine, but not if one of our standardization defines
was used. I seem to recall this was some version of #define POSIX.
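
For what it's worth, one way to avoid depending on which mode or macro
exposes strdup is to carry a small copy routine of your own. A minimal
sketch, assuming nothing beyond the standard library; the name
dup_string is made up for illustration (names beginning with str and a
lowercase letter are reserved, so it is best not to call it strdup):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for the nonstandard strdup.
   Returns a malloc'ed copy of s, or NULL if allocation fails.
   The caller is responsible for calling free() on the result. */
char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;      /* include the terminating '\0' */
    char *copy = malloc(n);

    if (copy != NULL)
        memcpy(copy, s, n);        /* copies the terminator as well */
    return copy;
}
```

Code that uses dup_string compiles the same way whether or not the
platform's headers happen to declare strdup in the chosen mode.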

- William Hughes

Nov 27 '06 #19

James Dow Allen

Mark F. Haigh wrote:

I'm sure several regulars here have more interesting stories... :)

Ok, I'll bite.
...

Great story! You should post this in alt.folklore.computers.
Most of my best bug stories also involve hardware, e.g.
http://james.fabpedigree.com/bug22.htm

I'll mention one C-language bug that cost me several
hours of debugging: it seemed like a problem with a controller
chip, but was actually my own violation of the C rules.

I had a function like
send_to_chip(char *dat, int cnt); /* send dat[0] ... dat[cnt-1] to chip */

and another function
whatever( ..., short cmd);
which needed, among other things, to send the 2-byte cmd
to the chip. Endianness was not an issue, as my program
was specific to MC680x0.

I tried to send the bytes with
send_to_chip(&cmd, 2);
but it didn't work. Eventually I found that "&cmd" was pointing
2 bytes *before* the 2-byte cmd on the stack.
I don't remember which compiler this was 20 years ago --
I think it was based on Plauger's and I deduced that the
peculiar "&cmd" value was a result of porting a little-endian
compiler to a big-endian machine.

I consider the compiler behavior clearly flawed, but concede
now that taking the address of a function argument was
a violation.
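
Whatever one makes of the compiler, the problem can be avoided by never
taking the address of the parameter at all and building the bytes
explicitly instead. A sketch under stated assumptions: send_cmd is a
made-up name, the send_to_chip below is only a stand-in that records
what it was given (the real one talked to hardware), and the
high-byte-first order is assumed to match what the 680x0 code wanted.

```c
#include <string.h>

/* Stand-in for the real send_to_chip: records the bytes it was handed. */
unsigned char sent[16];
int sent_len;

void send_to_chip(char *dat, int cnt)
{
    memcpy(sent, dat, (size_t)cnt);
    sent_len = cnt;
}

/* Hypothetical helper: sends the 2-byte cmd, high byte first,
   without depending on how the parameter is laid out in memory. */
void send_cmd(short cmd)
{
    unsigned int u = (unsigned short)cmd;   /* avoid shifting a negative value */
    unsigned char buf[2];

    buf[0] = (unsigned char)((u >> 8) & 0xFFu);  /* high byte */
    buf[1] = (unsigned char)(u & 0xFFu);         /* low byte  */
    send_to_chip((char *)buf, 2);
}
```

The byte order is now written down in the code, so the result does not
depend on the endianness of the machine or on where the compiler put
the argument.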

The entire thing was so unlikely that I couldn't help but think,
"Great. I'll probably be hit with a meteorite on the way to my car
tonight." Fortunately the bad luck stopped with a new spin of the
board, and I'm still here to tell the tale.

Glad you made it!

James Dow Allen

Nov 28 '06 #20

Chris Dollin

Mark F. Haigh wrote:

... A single additional line of assembly code. Apparently a
preprocessor macro in a header file far, far away had decided to use
the value of __LINE__ for field debugging of production code. Since
the CPU was an ARM7, and the ARM instruction set can only load a 12 bit
immediate IIRC,

(fx:OT
An eight-bit value on an even bit boundary in the 32-bit word: the
12-bit immediate value is the 8-bit field and a 4-bit /rotate/
count. So not only can the value be at the bottom, middle, or
top of the word, it can be split between the top and the bottom!
)
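
The rule above can be sketched as a small check: a 32-bit constant fits
if some even left rotation brings all of its set bits into the low
eight. The function name arm_imm_encodable is made up for illustration,
and this models only the classic ARM data-processing immediate, not
Thumb or later encodings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* Hypothetical helper: true if value can be written as an 8-bit
   field rotated right by an even amount (2 * a 4-bit count). */
bool arm_imm_encodable(uint32_t value)
{
    unsigned rot;

    for (rot = 0; rot < 16; ++rot) {
        /* Undo a rotate-right by 2*rot; the result must fit in 8 bits. */
        if (rotl32(value, 2u * rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

Under this rule 0xFF and 0x100 are encodable, 0x101 is not, and
0xF000000F is (the 8-bit field split between the top and the bottom of
the word).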

--
Chris "Magenta - the best colour of sound" Dollin
"Reaching out for mirrors hidden in the web." - Renaissance, /Running Hard/

Nov 28 '06 #21

lawrence.jones

James Dow Allen <jd*********@yahoo.com> wrote:

and another function
whatever( ..., short cmd);
which needed, among other things, to send the 2-byte cmd
to the chip. Endianness was not an issue, as my program
was specific to MC680x0.

I tried to send the bytes with
send_to_chip(&cmd, 2);
but it didn't work. Eventually I found that "&cmd" was pointing
2 bytes *before* the 2-byte cmd on the stack.

That was a common behavior in pre-ANSI compilers. Since char and short
arguments are promoted to int (and float arguments are promoted to
double), many compilers just rewrote the argument declarations as the
widened type (much as array argument declarations are rewritten as
pointer declarations). ANSI outlawed that practice, requiring the
arguments to have their explicitly declared type, not the widened type.
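
The promotions themselves did not go away with ANSI C; they still apply
where no prototype covers the argument, most visibly in the variable
part of a variadic call. A small sketch of that (sum_promoted is a
made-up name): the caller passes shorts, but the callee has to fetch
them as int.

```c
#include <stdarg.h>

/* Hypothetical helper: adds up `count` values that the caller
   passes as short. In the variable part of the argument list each
   short has been promoted to int, so va_arg must ask for int. */
long sum_promoted(int count, ...)
{
    va_list ap;
    long total = 0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; ++i)
        total += va_arg(ap, int);   /* not va_arg(ap, short) */
    va_end(ap);
    return total;
}
```

Called as sum_promoted(2, a, b) with short a and b, the function sees
two ints. A named parameter declared as short in a prototype, by
contrast, has type short inside the function, which is the change
described above.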

-Larry Jones

What better way to spend one's freedom than eating chocolate
cereal and watching cartoons! -- Calvin

Nov 28 '06 #22

Barry Schwarz

On Sat, 25 Nov 2006 09:28:21 -0500, Eric Sosman
<es*****@acm-dot-org.invalid> wrote:

snip

><off-topic>

Wikipedia's article on IEFBR14 makes amusing reading. The
original one-instruction program had a bug, which must set some
kind of unenviable standard for "fault density:" one bug per
machine instruction! Not only that, but the fix had a bug, and
the fix to the fix had a bug, and it wasn't until the fourth
version of IEFBR14 that the program "did nothing" correctly.

The final version had three times as many lines of source as
the first, executed three times as many instructions, and occupied
eight times as much memory. Code bloat wasn't invented in Redmond.

1 - It's not bloat if the additional code is necessary to make the
program run correctly. A square root function that works only for
perfect squares can be much smaller than one that works in general.
The additional code is not bloat but necessary to correct
deficiencies. A square root function that generates logs or primes
contains bloat.

The minimum size of a program on an IBM mainframe is eight bytes (due
to alignment restrictions). Since that is also the current size of
IEFBR14, it probably didn't grow much. Since the program only
contains two instructions and the mainframe does not have fractional
instructions, it didn't grow by a factor of three. The original
instruction is still in the code so it couldn't have been too
incorrect.

I think the Wikipedia article may suffer from a certain lack of
accuracy.

Remove del for email

Dec 1 '06 #23

Richard Heathfield

Barry Schwarz said:

<snip>

I think the Wikipedia article may suffer from a certain lack of
accuracy.

Don't bother fixing it. If you do, it'll probably be broken again by
tonight.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.

Dec 1 '06 #24

Gordon Burditt

>>Code bloat wasn't invented in Redmond.

But I bet they tried to patent it.

>The minimum size of a program on an IBM mainframe is eight bytes (due
to alignment restrictions). Since that is also the current size of
IEFBR14, it probably didn't grow much. Since the program only
contains two instructions and the mainframe does not have fractional
instructions, it didn't grow by a factor of three. The original
instruction is still in the code so it couldn't have been too
incorrect.

But under MVS on my college's 360/50, you couldn't run a program
in under 8k without running out of memory. Or perhaps it was IN
8k, and you had to use at least 12k. I tried it (due to the billing
including a function of memory requested) and got a lovely core
dump. I think that had something to do with a 4k page size and two
different subpools and needing space for an initial save area.

Dec 2 '06 #25

Barry Schwarz

On Sat, 02 Dec 2006 01:45:41 -0000, go***********@burditt.org (Gordon
Burditt) wrote:

>>>Code bloat wasn't invented in Redmond.

But I bet they tried to patent it.

>>The minimum size of a program on an IBM mainframe is eight bytes (due
to alignment restrictions). Since that is also the current size of
IEFBR14, it probably didn't grow much. Since the program only
contains two instructions and the mainframe does not have fractional
instructions, it didn't grow by a factor of three. The original
instruction is still in the code so it couldn't have been too
incorrect.

But under MVS on my college's 360/50, you couldn't run a program
in under 8k without running out of memory. Or perhaps it was IN
8k, and you had to use at least 12k. I tried it (due to the billing
including a function of memory requested) and got a lovely core
dump. I think that had something to do with a 4k page size and two
different subpools and needing space for an initial save area.

The amount of memory a program requires to execute is a different
metric than the size of the program itself. While normally the former
would be larger than the latter, MVS supported programs in overlay
format where it need not be true.

My college used a 360/40 so we are probably in the same generation.
While I can still remember what it looked like, I don't have a clue as
to the minimum region size.

Remove del for email

Dec 3 '06 #26

Spiros Bousbouras

Barry Schwarz wrote:

On Sat, 25 Nov 2006 09:28:21 -0500, Eric Sosman
<es*****@acm-dot-org.invalid> wrote:

snip

<off-topic>

Wikipedia's article on IEFBR14 makes amusing reading. The
original one-instruction program had a bug, which must set some
kind of unenviable standard for "fault density:" one bug per
machine instruction! Not only that, but the fix had a bug, and
the fix to the fix had a bug, and it wasn't until the fourth
version of IEFBR14 that the program "did nothing" correctly.

The final version had three times as many lines of source as
the first, executed three times as many instructions, and occupied
eight times as much memory. Code bloat wasn't invented in Redmond.

1 - It's not bloat if the additional code is necessary to make the
program run correctly. A square root function that works only for
perfect squares can be much smaller than one that works in general.
The additional code is not bloat but necessary to correct
deficiencies. A square root function that generates logs or primes
contains bloat.

The minimum size of a program on an IBM mainframe is eight bytes (due
to alignment restrictions). Since that is also the current size of
IEFBR14, it probably didn't grow much. Since the program only
contains two instructions and the mainframe does not have fractional
instructions, it didn't grow by a factor of three. The original
instruction is still in the code so it couldn't have been too
incorrect.

I think the Wikipedia article may suffer from a certain lack of
accuracy.

Most of what you're quoting is Eric Sosman's comments
rather than claims found in the Wikipedia article. The article
does say that the number of executed instructions tripled.
Are you sure the final version of the programme
appearing in the article (which by the way is not the final
version of the programme itself) only contains 2 instructions?

Dec 26 '06 #27

Barry Schwarz

On 25 Dec 2006 16:47:26 -0800, "Spiros Bousbouras" <sp****@gmail.com>
wrote:

>Barry Schwarz wrote:
>On Sat, 25 Nov 2006 09:28:21 -0500, Eric Sosman
<es*****@acm-dot-org.invalid> wrote:

snip

><off-topic>

Wikipedia's article on IEFBR14 makes amusing reading. The
original one-instruction program had a bug, which must set some
kind of unenviable standard for "fault density:" one bug per
machine instruction! Not only that, but the fix had a bug, and
the fix to the fix had a bug, and it wasn't until the fourth
version of IEFBR14 that the program "did nothing" correctly.

The final version had three times as many lines of source as
the first, executed three times as many instructions, and occupied
eight times as much memory. Code bloat wasn't invented in Redmond.

1 - It's not bloat if the additional code is necessary to make the
program run correctly. A square root function that works only for
perfect squares can be much smaller than one that works in general.
The additional code is not bloat but necessary to correct
deficiencies. A square root function that generates logs or primes
contains bloat.

The minimum size of a program on an IBM mainframe is eight bytes (due
to alignment restrictions). Since that is also the current size of
IEFBR14, it probably didn't grow much. Since the program only
contains two instructions and the mainframe does not have fractional
instructions, it didn't grow by a factor of three. The original
instruction is still in the code so it couldn't have been too
incorrect.

I think the Wikipedia article may suffer from a certain lack of
accuracy.

Most of what you're quoting is Eric Sosman's comments
rather than claims found in the Wikipedia article. The article
does say that the number of executed instructions tripled.
Are you sure the final version of the programme
appearing in the article (which by the way is not the final
version of the programme itself) only contains 2 instructions?

The "final" version at http://en.wikipedia.org/wiki/IEFBR14 shows the
program occupying 16 bytes, 12 of which do not appear in the actual
program. The program on my OS/390 2.10 system occupies 8 bytes, of
which 4 contain the code for the two instructions and the other 4 are
filler to meet the alignment requirements. The same is true for
several previous versions of the operating system.

Remove del for email

Dec 26 '06 #28
