Making Fatal Hidden Assumptions - Page 14

CBFalconer

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on foulling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent appearing code with these kinds of
assumptions built-in. However it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' linenumber references.

/* Making fatal hidden assumptions */
/* Paul Hsiehs version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
const char *p; /* 5 */
int d;

p = s - 1;
do {
p++; /* 10 */
if ((((int) p) & (SW - 1)) == 0) {
do {
d = *((int *) p);
p += SW;
} while (!hasNulByte (d)); /* 15 */
p -= SW;
}
} while (*p != 0);
return p - s;
} /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less that 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the systems strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-poweroftwo. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
todays systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type size_t or ssize_t,
but will always fit into a size_t). It then does a concealed cast
of this into an int, which could cause undefined or implementation
defined behaviour if the value exceeds what will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06

Subscribe Reply

351

13052

cs_posting

Randy Howard wrote:

Hypothetical hardware that traps on *speculative* loads isn't broken by
design? I'd love to see the initialization sequences, or the task
switching code that has to make sure that all pointer values are valid
before they're loaded. No, scratch that. I've got better things to do.

This is a lot of whining about a specific problem that can
easily be remedied just by changing the loop construction. The
whole debate is pretty pointless in that context, unless you
have some religious reason to insist upon the method in the
original.

Have to remember thought that the C program in question is really a
back translation of approximating an assembly language original. If
the compiler builds the undefined pointer operation in the logical way,
it will be essentially the same as the hand written assembly language
code.

To then claim that speculative execution may cause an exception on the
result is to imply that the assembly language author, who has a pretty
good idea what assumptions he is making, must now add "speculativ e
loading of something I wasn't going to fetch" to the list of concerns.

Or were you thinking it was the compiler rather than processor logic
which was going to do the speculating?

Some pipelining tricks like the MIPS branch delay slot, are explicitly
part of the programming model, and you do have to manually handle them
when working with low level assembly code. But for the x86,
speculation is not...

Mar 11 '06 #131

Al Balmer

On 10 Mar 2006 17:05:01 -0800, cs********@hotm ail.com wrote:

Have to remember thought that the C program in question is really a
back translation of approximating an assembly language original

Why? We've long since stopped discussing that program.

--
Al Balmer
Sun City, AZ

Mar 11 '06 #132

Albert van der Horst

In article <pa************ *************** @bsb.me.uk>,
Ben Bacarisse <be********@bsb .me.uk> wrote:
<SNIP>

I found Eric Sosman's "if (buffer + space_required > buffer_end) ..."
example more convincing, because I have seen that in programs that are
intended to be portable -- I am pretty sure I have written such things
myself in my younger days. Have you other more general examples of
dangerous assumptions that can sneak into code? A list of the "top 10
things you might be assuming" would be very interesting.
I find it not an example of implicit assumptions, just of bad coding.

There is no reason not to use the almost as readable, problemless

if ( buffer_end - buffer < space_required )

(It mentally reads as " if the number of elements that still
fit in the buffer is less then the amount of elements we require")
Ben.

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
al****@spenarnc .xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

Mar 11 '06 #133

Jordan Abel

On 2006-03-11, Mark L Pappin <ml*@acm.org> wrote:

Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote:
I spent 25 years writing assembler.

Yeah, me to. Still do, regularly, on processors that will never
have a C compiler.

It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?

(A few I can recall having been proposed are: tiny amounts of storage,
Harvard architecture, and lack of programmer-accessible stack.
Funnily enough, these are characteristics possessed by chips for which
I compile C code every day.)

"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and, of
course, a decent-sized library]

Mar 11 '06 #134

cs_posting

Albert van der Horst wrote:

I found Eric Sosman's "if (buffer + space_required > buffer_end) ..."
There is no reason not to use the almost as readable, problemless

if ( buffer_end - buffer < space_required )

If the computation in one version can be reduced to a constant by the
compiler, that would be a reason for using that version.

I can imainge a number of situations in which "bad coding" is the
result of a programmer with a mental idea of how to accomplish
something efficiently, trying to render that approach in C as if it
were assembly language. This is doubly likely on small systems...

The problem of course is that the compiler has it's own ideas about how
to be efficient.

And the standards committee may have very different ideas from the
would-be hand optimizing programmer about how you are supposed to
instruct the compiler in what you want!

Mar 12 '06 #135

CBFalconer

Jordan Abel wrote:

On 2006-03-11, Mark L Pappin <ml*@acm.org> wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote: I spent 25 years writing assembler.

Yeah, me to. Still do, regularly, on processors that will never
have a C compiler.

It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?

(A few I can recall having been proposed are: tiny amounts of
storage, Harvard architecture, and lack of programmer-accessible
stack. Funnily enough, these are characteristics possessed by
chips for which I compile C code every day.)

"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and,
of course, a decent-sized library]

These machines may well have C compilers, just not conforming
ones. The areas of non-conformance are likely to be:

object size available
floating point arithmetic
recursion depth (1 meaning no recursion)
availability of long and long-long.
availability of standard library

Once more, many programs can be written that are valid and portable
C, without requiring these abilities. The thing that needs to be
documented is use of possible non-standard substitutes for standard
features.

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 12 '06 #136

Jordan Abel

On 2006-03-11, CBFalconer <cb********@yah oo.com> wrote:

Jordan Abel wrote:
On 2006-03-11, Mark L Pappin <ml*@acm.org> wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote:
> I spent 25 years writing assembler.

Yeah, me to. Still do, regularly, on processors that will never
have a C compiler.

It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?

(A few I can recall having been proposed are: tiny amounts of
storage, Harvard architecture, and lack of programmer-accessible
stack. Funnily enough, these are characteristics possessed by
chips for which I compile C code every day.)

"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and,
of course, a decent-sized library]

These machines may well have C compilers, just not conforming
ones. The areas of non-conformance are likely to be:

object size available
floating point arithmetic
recursion depth (1 meaning no recursion)

availability of long and long-long. Or even the range of int itself. [People have claimed that "c"
implementations exist with 8-bit int]
availability of standard library An implementation can conform, as a freestanding implementation, with
VERY little of the standard library

I'd question how much of the other stuff can be gone and still
considered "c", though.

Once more, many programs can be written that are valid and portable
C, without requiring these abilities. The thing that needs to be
documented is use of possible non-standard substitutes for standard
features.

Mar 12 '06 #137

toby

cs********@hotm ail.com wrote:

Albert van der Horst wrote:
I found Eric Sosman's "if (buffer + space_required > buffer_end) ..."
There is no reason not to use the almost as readable, problemless

if ( buffer_end - buffer < space_required )

If the computation in one version can be reduced to a constant by the
compiler, that would be a reason for using that version.

I can imainge a number of situations in which "bad coding" is the
result of a programmer with a mental idea of how to accomplish
something efficiently, trying to render that approach in C as if it
were assembly language.

This reminds me of C--: http://www.cminusminus.org/
This is doubly likely on small systems...

The problem of course is that the compiler has it's own ideas about how
to be efficient.

And the standards committee may have very different ideas from the
would-be hand optimizing programmer about how you are supposed to
instruct the compiler in what you want!

Mar 12 '06 #138

Hans-Bernhard Broeker

In comp.arch.embed ded Jordan Abel <ra*******@gmai l.com> wrote:

On 2006-03-11, CBFalconer <cb********@yah oo.com> wrote:
Jordan Abel wrote:
"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and,
of course, a decent-sized library]

The "decent-sized" library for small embedded systems is easier to
meet than you may think. Be sure to look up "freestandi ng
implementation" in the library section of the C standard.
These machines may well have C compilers, just not conforming
ones. The areas of non-conformance are likely to be:
Following the only truly formal definition, "C compilers, just not
conforming ones" don't of course exist any more actually than there
are cars on the highway with "wheels, just not round ones."

[...] availability of long and long-long.

Long long is, for many such compilers, still a non-issue, because they
never claimed to have implemented C99. C89==C is still a widely
accepted assumption.
Or even the range of int itself. [People have claimed that "c"
implementations exist with 8-bit int]
Some people would probably also not shy away from claiming there are
18-wheeler trucks built with only two wheels. Even all things
considered, such people are blatantly wrong (because they've been lied
to by, or are, marketroids). There's no excuse for violating a strict
requirement of the standard just to match users' likely interpretation
of one of the helpful suggestions, like "int should be the natural
integer type of the target CPU".
I'd question how much of the other stuff can be gone and still
considered "c", though.

The rule of thumb should be one of practicality. Try hard to fit as
much of the language as you can on the CPU, but stay reasonable.
I.e. all features that lie in the intersection between the standard's
requirements and the platform's feature set, should be implemented
strictly by the C standard. For the rest, stay as close to the
standard as you can bear. And above all, *document* all such
deviations prominently.

In case of doubt, do what every clever politician would do: refuse to
decide, so the users can unload the decision and the responsibility on
their own shoulders. Implement both a "standard as standard can"
mode, and a "as much standard as we think makes sense" mode. E.g. on
8-bit or smaller CPUs C's default integer promotion rules can turn
into a serious liability; offering a flag to turn them on or off makes
sense.

--
Hans-Bernhard Broeker (br*****@physik .rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Mar 12 '06 #139

Allan Herriman

On 12 Mar 2006 12:43:56 +1000, Mark L Pappin <ml*@acm.org> wrote:

Jordan Abel <ra*******@gmai l.com> writes:
On 2006-03-11, Mark L Pappin <ml*@acm.org> wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote:
> I spent 25 years writing assembler.

Yeah, me to. Still do, regularly, on processors that will never
have a C compiler.

It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?

(A few I can recall having been proposed are: tiny amounts of storage,
Harvard architecture, and lack of programmer-accessible stack.
Funnily enough, these are characteristics possessed by chips for which
I compile C code every day.)

"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and, of
course, a decent-sized library]

Got me there. Freestanding only, with maybe 16 bytes of RAM and 256
words of ROM - no 'malloc()' here, and 'printf()' can be problematic.

A freestanding C implementation does, however, still have a C
compiler. I'm curious which processors Andrew Reilly claims "will
never have a C compiler", and why he makes that claim.

I assume Dr. Reilly's referring to various DSP devices. These often
have features such as saturating arithmetic and bit-reversed
addressing.

Regards,
Allan

Mar 12 '06 #140