Making Fatal Hidden Assumptions - Page 2

CBFalconer

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on foulling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent appearing code with these kinds of
assumptions built-in. However it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' linenumber references.

/* Making fatal hidden assumptions */
/* Paul Hsiehs version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
const char *p; /* 5 */
int d;

p = s - 1;
do {
p++; /* 10 */
if ((((int) p) & (SW - 1)) == 0) {
do {
d = *((int *) p);
p += SW;
} while (!hasNulByte (d)); /* 15 */
p -= SW;
}
} while (*p != 0);
return p - s;
} /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less that 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the systems strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-poweroftwo. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
todays systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type size_t or ssize_t,
but will always fit into a size_t). It then does a concealed cast
of this into an int, which could cause undefined or implementation
defined behaviour if the value exceeds what will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06

Subscribe Reply

351

12839

<
1
2
3
4
12
>
Last »

Paul Burke

Ben Bacarisse wrote:

p = s - 1 is fine unless s
points to the first element of an array (or worse)[1].

My simple mind must be missing something big here. If for pointer p,
(p-1) is deprecated because it's not guaranteed that it points to
anything sensible, why is p++ OK? There's no boundary checking in C
(unless you put it in).

Quote from p98 of K&R 2nd (ANSI) edition: "If pa points to a particular
element of an array, then by definition pa+1 points to the next element,
pa+i points to i elements after pa, and pa-i points to i elements before."

Paul Burke

Mar 7 '06 #11

Arthur J. O'Dwyer

On Tue, 7 Mar 2006, Paul Burke wrote:

Ben Bacarisse wrote:

p = s - 1 is fine unless s
points to the first element of an array (or worse)[1].
My simple mind must be missing something big here. If for pointer p, (p-1)
is deprecated because it's not guaranteed that it points to anything
sensible, why is p++ OK? There's no boundary checking in C (unless you
put it in).

(BTW, "deprecated ," in the context of programming-language standards,
means something that once was okay but is not recommended anymore. (p-1)
isn't like that.) Read on for your answer.
Quote from p98 of K&R 2nd (ANSI) edition: "If pa points to a particular
element of an array, then by definition pa+1 points to the next element,
pa+i points to i elements after pa, and pa-i points to i elements before."

K&R answers your question. If pa points to some element of an array,
then pa-1 points to the /previous element/. But what's the "previous
element" relative to the first element in the array? It doesn't exist.
So we have undefined behavior.
The expression pa+1 is similar, but with one special case. If pa points
to the last element in the array, you might expect that pa+1 would be
undefined; but actually the C standard specifically allows you to evaluate
pa+1 in that case. Dereferencing that pointer, or incrementing it /again/,
however, invoke undefined behavior.

Basically: A C pointer must always point to something. "The
negative-oneth element of array a" is not "something. "

int a[10];
int *pa = a+3;
pa-3; /* fine; points to a[0] */
pa+6; /* fine; points to a[9] */
pa+7; /* fine; points "after" a[9] */
pa+8; /* undefined behavior */
pa-4; /* undefined behavior */

HTH,
-Arthur

Mar 7 '06 #12

Ben Pfaff

Paul Burke <pa**@scazon.co m> writes:

My simple mind must be missing something big here. If for pointer p,
(p-1) is deprecated because it's not guaranteed that it points to
anything sensible, why is p++ OK? There's no boundary checking in C
(unless you put it in).

You're missing what the standard says about it:

8 When an expression that has integer type is added to or
subtracted from a pointer, the result has the type of the
pointer operand. If the pointer operand points to an element
of an array object, and the array is large enough, the
result points to an element offset from the original element
such that the difference of the subscripts of the resulting
and original array elements equals the integer expression.
In other words, if the expression P points to the i-th
element of an array object, the expressions (P)+N
(equivalently, N+(P)) and (P)-N (where N has the value n)
point to, respectively, the i+n-th and i-n-th elements of
the array object, provided they exist. Moreover, if the
expression P points to the last element of an array object,
the expression (P)+1 points one past the last element of the
array object, and if the expression Q points one past the
last element of an array object, the expression (Q)-1 points
to the last element of the array object. If both the pointer
operand and the result point to elements of the same array
object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. If the result points one past the
last element of the array object, it shall not be used as
the operand of a unary * operator that is evaluated.

(Wow, that's a mouthful.)
--
"To get the best out of this book, I strongly recommend that you read it."
--Richard Heathfield

Mar 7 '06 #13

Eric Sosman

Paul Burke wrote On 03/07/06 11:45,:

Ben Bacarisse wrote:

p = s - 1 is fine unless s
points to the first element of an array (or worse)[1].

My simple mind must be missing something big here. If for pointer p,
(p-1) is deprecated because it's not guaranteed that it points to
anything sensible, why is p++ OK? There's no boundary checking in C
(unless you put it in).

Quote from p98 of K&R 2nd (ANSI) edition: "If pa points to a particular
element of an array, then by definition pa+1 points to the next element,
pa+i points to i elements after pa, and pa-i points to i elements before."

There's a special dispensation that allows you to
calculate a pointer value that points to the imaginary
element "one position past the end" of an array. (You
cannot, of course, actually try to access the imaginary
element -- but you can form the pointer value, compare
it to other pointer values, and so on.)

There is no such special dispensation for an element
"one position before the beginning."

The Rationale says the Committee considered defining
the effects at both ends (bilateral dispensations?) , but
rejected it for efficiency reasons. Consider an array of
large elements -- structs of 32KB size, say. A system
that actually performed hardware checking of pointer values
could accommodate the one-past-the-end rule by allocating
just one extra byte after the end of the array, a byte that
the special pointer value could point at without setting off
the hardware's alarms. But one-before-the-beginning would
require an extra 32KB, just to hold data that could never
be used ...

--
Er*********@sun .com

Mar 7 '06 #14

Ben Bacarisse

On Tue, 07 Mar 2006 16:45:06 +0000, Paul Burke wrote:

Ben Bacarisse wrote:
p = s - 1 is fine unless s
points to the first element of an array (or worse)[1].

My simple mind must be missing something big here. If for pointer p, (p-1)
is deprecated because it's not guaranteed that it points to anything
sensible, why is p++ OK? There's no boundary checking in C (unless you put
it in).

As your (snipped) quote from K&R illustrates, pointer arithmetic is
defined only "within" arrays. "Within", includes a pointer that points
just after the last element. This is very handy, since pointer
expressions like end - start are a common idiom. When a pointer, p,
points to the first element of an array, p - 1 is not defined (indeed it
may not be representable given a devious enough, but conforming,
implementation) . p - 1 is not deprecated at all (so far as I know). It
is either perfectly valid or entirely undefined. p + 1 is well-defined if
p points to any array element (including the last). p + 2 is not defined
if p points to the last element of an array.

For the purpose of arithmetic, pointers to single data objects are treated
like pointers to arrays of size 1. So we have:

int x, a[2];

int *p = &x + 1; /* defined */
int *q = &x - 1; /* undefined */
int *r = &x + 2; /* undefined */

int *s = a; /* defined */
int *t = a - 1; /* undefined */
int *u = a + 1; /* defined */
int *v = a + 2; /* defined */
int *w = a + 3; /* undefined */

I hope I have not missed what you were talking about.

--
Ben.

Mar 7 '06 #15

Peter Harrison

"Ben Pfaff" <bl*@cs.stanfor d.edu> wrote in message
news:87******** ****@benpfaff.o rg...

Paul Burke <pa**@scazon.co m> writes:
My simple mind must be missing something big here. If for pointer p,
(p-1) is deprecated because it's not guaranteed that it points to
anything sensible, why is p++ OK? There's no boundary checking in C
(unless you put it in).

It seems I too have a simple mind. I read the recent replies to this and
found myself not sure I am better off.

This is what I think I understand:

int x;
int *p;
int *q;

p = &x; /* is OK */
p = &x + 1; /* is OK even though we have no idea what p points to */
p = &x + 6; /* is undefined - does this mean that p may not be the */
/* address six locations beyond x? */
/* or just that we don't know what is there? */
p = &x - 1; / as previous */

But the poster was comparing with p++, so...

p = &x; /* so far so good */
p++; /* still ok (?) but we dont know what is there */
p++; /* is this now undefined? */

I guess _my_ question is - in this context does 'undefined' mean just that
we cannot say anything about what the pointer points to or that we cannot
say anything about the value of the pointer. So for example:

p = &x;
q = &x;
p = p+8;
q = q+8;

should p and q have the same value or is that undefined.

Pete Harrison

Mar 7 '06 #16

Walter Roberson

In article <Pi************ *************** *******@unix43. andrew.cmu.edu> ,
Arthur J. O'Dwyer <aj*******@andr ew.cmu.edu> wrote:

K&R answers your question. If pa points to some element of an array,
then pa-1 points to the /previous element/. But what's the "previous
element" relative to the first element in the array? It doesn't exist.
So we have undefined behavior.
The expression pa+1 is similar, but with one special case. If pa points
to the last element in the array, you might expect that pa+1 would be
undefined; but actually the C standard specifically allows you to evaluate
pa+1 in that case. Dereferencing that pointer, or incrementing it /again/,
however, invoke undefined behavior.

Right.

These limitations may perhaps be most easily be understood with
reference to the VAX "descriptor " model of pointers, in which a pointer
did not refer -directly- to the target memory, but instead refered to
a a -description- of that memory, including sizes and basic types
and current offset. In that scheme, with pa a pointer, pa-1 involves
internal work to produce the descriptor with the appropriate sizes and
offsets -- and at the time the relative offset was calculated, it would
be compared to the known bounds, and an exception could occur *then*
[when the pointer was built] rather than at the time the pointer was used.
Other circumstances where it could make a difference include cases in
which pa points to a block of memory at the very beginning of a memory
segment (on a segmented architecture). The calculation of the value of
pa-1 could then proceed in several ways:

- by exception (because the system notes the attempt to point before
the segment beginning)

- by wrapping the relative segment offset field to its maximum value
(which might or might not trigger odd behaviours)

- by wrapping the relative segment offset field to its maximum value -and-
decrementing the field that holds the address register number that
holds the base virtual memory (this effectively pointing into
a completely -different- block of memory, which might or might not
trigger odd behaviours)

- by noticing that the segment descriptor is not suitable for the
pointer and building a new segment descriptor that covers the extended
range (which might use up scarce segment descriptors unnecessarily)

- by exception (because the system notes that the new memory
address is not one that the user has access rights to)
There are undoubtedly other situations, but the point remains that
creating a pointer to "before" an object is not certain to work,
even if that pointer is never dereferenced.
--
Okay, buzzwords only. Two syllables, tops. -- Laurie Anderson

Mar 7 '06 #17

Robin Haigh

"Peter Harrison" <pe***@cannock. ac.uk> wrote in message
news:du******** **@south.jnrs.j a.net...

"Ben Pfaff" <bl*@cs.stanfor d.edu> wrote in message
news:87******** ****@benpfaff.o rg...
Paul Burke <pa**@scazon.co m> writes:
My simple mind must be missing something big here. If for pointer p,
(p-1) is deprecated because it's not guaranteed that it points to
anything sensible, why is p++ OK? There's no boundary checking in C
(unless you put it in).

It seems I too have a simple mind. I read the recent replies to this and
found myself not sure I am better off.

This is what I think I understand:

int x;
int *p;
int *q;

p = &x; /* is OK */
p = &x + 1; /* is OK even though we have no idea what p points to */
p = &x + 6; /* is undefined - does this mean that p may not be the */
/* address six locations beyond x? */
/* or just that we don't know what is there? */
p = &x - 1; / as previous */

But the poster was comparing with p++, so...

p = &x; /* so far so good */
p++; /* still ok (?) but we dont know what is there */
p++; /* is this now undefined? */

I guess _my_ question is - in this context does 'undefined' mean just that
we cannot say anything about what the pointer points to or that we cannot
say anything about the value of the pointer. So for example:

p = &x;
q = &x;
p = p+8;
q = q+8;

should p and q have the same value or is that undefined.

Basically you're supposed to assume that the arithmetic can't be done if the
result (or an intermediate value) goes out of range. As if it were a
trapped overflow. The behaviour is undefined as soon as you evaluate the
r.h.s, so you don't even know that the assignment to p or q actually takes
place.

--
RSH

Mar 7 '06 #18

Keith Thompson

"Peter Harrison" <pe***@cannock. ac.uk> writes:
[...]

This is what I think I understand:

int x;
int *p;
int *q;

p = &x; /* is OK */
p = &x + 1; /* is OK even though we have no idea what p points to */
p = &x + 6; /* is undefined - does this mean that p may not be the */
/* address six locations beyond x? */
/* or just that we don't know what is there? */ [...] p = &x - 1; / as previous */

But the poster was comparing with p++, so...

p = &x; /* so far so good */
p++; /* still ok (?) but we dont know what is there */
p++; /* is this now undefined? */

I guess _my_ question is - in this context does 'undefined' mean just that
we cannot say anything about what the pointer points to or that we cannot
say anything about the value of the pointer. So for example:

p = &x;
q = &x;
p = p+8;
q = q+8;

should p and q have the same value or is that undefined.

"Undefined" means far more than that. It's not an undefined *value*,
it's undefined *behavior*, which the standard defines as:

behavior, upon use of a nonportable or erroneous program construct
or of erroneous data, for which this International Standard
imposes no requirements

NOTE Possible undefined behavior ranges from ignoring the
situation completely with unpredictable results, to behaving
during translation or program execution in a documented manner
characteristic of the environment (with or without the issuance of
a diagnostic message), to terminating a translation or execution
(with the issuance of a diagnostic message).

Once undefined behavior has occurred, it's meaningless to talk about
the value of p. If your program or your computer crashes, p doesn't
have a value; if demons start flying out of your nose (in accordance
with the standard joke around here), you'll have more important things
to worry about.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 7 '06 #19

Andrew Reilly

On Tue, 07 Mar 2006 10:21:21 -0800, Ben Pfaff wrote:

Paul Burke <pa**@scazon.co m> writes:
My simple mind must be missing something big here. If for pointer p,
(p-1) is deprecated because it's not guaranteed that it points to
anything sensible, why is p++ OK? There's no boundary checking in C
(unless you put it in).

You're missing what the standard says about it:

8 When an expression that has integer type is added to or
subtracted from a pointer, the result has the type of the pointer
operand. If the pointer operand points to an element of an array
object, and the array is large enough, the result points to an
element offset from the original element such that the difference of
the subscripts of the resulting and original array elements equals
the integer expression. In other words, if the expression P points to
the i-th element of an array object, the expressions (P)+N
(equivalently, N+(P)) and (P)-N (where N has the value n) point to,
respectively, the i+n-th and i-n-th elements of the array object,
provided they exist. Moreover, if the expression P points to the
last element of an array object, the expression (P)+1 points one past
the last element of the array object, and if the expression Q points
one past the last element of an array object, the expression (Q)-1
points to the last element of the array object. If both the pointer
operand and the result point to elements of the same array object, or
one past the last element of the array object, the evaluation shall
not produce an overflow; otherwise, the behavior is undefined. If the
result points one past the last element of the array object, it shall
not be used as the operand of a unary * operator that is evaluated.

(Wow, that's a mouthful.)

It's precicely this sort of tomfoolery on the part of the C standards
committee that has brought the language into such ill-repute in recent
years. It's practically unworkable now, compared to how it was in (say)
the immediately post-ANSI-fication years.

The code in question could trivially have p replaced by (p+p_index)
everywhere, where p_index is an int, and all of the arithmetic currently
effected on p is instead effected on p_index. I.e., if p was set to s,
and p_index set to -1, and p_index++ appeared as the first element inside
the outer do loop.

So having the standard make the semantic equivalent "undefined" only
serves to make the standard itself ever more pointless.

Bah, humbug. Think I'll go back to assembly language, where pointers do
what you tell them to, and don't complain about it to their lawyers.

--
Andrew

Mar 7 '06 #20