Making Fatal Hidden Assumptions - Page 28

CBFalconer

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on foulling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent appearing code with these kinds of
assumptions built-in. However it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' linenumber references.

/* Making fatal hidden assumptions */
/* Paul Hsiehs version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
const char *p; /* 5 */
int d;

p = s - 1;
do {
p++; /* 10 */
if ((((int) p) & (SW - 1)) == 0) {
do {
d = *((int *) p);
p += SW;
} while (!hasNulByte (d)); /* 15 */
p -= SW;
}
} while (*p != 0);
return p - s;
} /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less that 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the systems strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-poweroftwo. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
todays systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type size_t or ssize_t,
but will always fit into a size_t). It then does a concealed cast
of this into an int, which could cause undefined or implementation
defined behaviour if the value exceeds what will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06

Subscribe Reply

351

13143

Dik T. Winter

In article <pa************ *************** *@areilly.bpc-users.org> Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
....

Interesting features. I'm not sure how much different multiple value
returns would be from values returned via reference parameters
(pointers). it sounds like a good idea.

The significant difference is that reference parameters (pointers) can't
be in registers.

Why not? I have worked with a lot of implementations where the first few
parameters were passed through registers. (Depending on the processor,
from four to eight.) And i many cases no need at all to put those pointers
on the stack.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

Mar 22 '06 #271

Arthur J. O'Dwyer

On Wed, 22 Mar 2006, Dik T. Winter wrote:

Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
...
Interesting features. I'm not sure how much different multiple value
returns would be from values returned via reference parameters
(pointers). it sounds like a good idea.

The significant difference is that reference parameters (pointers) can't
be in registers.

Why not? I have worked with a lot of implementations where the first few
parameters were passed through registers. (Depending on the processor,
from four to eight.) And i many cases no need at all to put those pointers
on the stack.

I believe Andrew means

void foo(int & x)
{
use(x);
}

void bar()
{
register int a;
foo(a); /* Will C++ accept this? */
}

I don't know whether standard C++ would accept the above code, or whether
it would, like standard C, insist that the programmer can't take the
address of a 'register' variable, even implicitly. But in any case, it
would be hard for the compiler to put the variable 'a' into a machine
register when it compiles 'bar', because it needs to pass its address
to 'foo' later on.

HTH,
-Arthur

Mar 22 '06 #272

Jordan Abel

On 2006-03-22, Keith Thompson <ks***@mib.or g> wrote:

Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Tue, 21 Mar 2006 19:02:55 -0600, Stephen Sprunk wrote:
If a system traps on a prefetch, it's fundamentally broken. However, a
system that traps when an invalid pointer is loaded is not broken, and the
AS/400 is the usual example. Annoying, but not broken.
And I still say that constraining C for everyone so that it could fit the
AS/400, rather than making C-on-AS/400 jump through a few more hoops to
match traditional C behaviour, was the wrong trade-off. I accept that
this may well be a minority view.

It is. The C standard wouldn't just have to forbid an implementation
from trapping when it loads an invalid address; it would have to
define the behavior of any program that uses such an address.

Why? It's not that difficult to define the behavior of a program that
"uses" such an address other than by dereferencing, and no problem to
leave the behavior undefined for dereferencing
A number of examples have been posted here where that could cause
serious problems for some implementations other than the AS/400.

Mar 22 '06 #273

Jordan Abel

On 2006-03-22, Keith Thompson <ks***@mib.or g> wrote:

Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Tue, 21 Mar 2006 19:02:55 -0600, Stephen Sprunk wrote:
If a system traps on a prefetch, it's fundamentally broken. However, a
system that traps when an invalid pointer is loaded is not broken, and the
AS/400 is the usual example. Annoying, but not broken.
And I still say that constraining C for everyone so that it could fit the
AS/400, rather than making C-on-AS/400 jump through a few more hoops to
match traditional C behaviour, was the wrong trade-off. I accept that
this may well be a minority view.

It is. The C standard wouldn't just have to forbid an implementation
from trapping when it loads an invalid address; it would have to
define the behavior of any program that uses such an address.

Why? It's not that difficult to define the behavior of a program that
"uses" such an address other than by dereferencing, and no problem to
leave the behavior undefined for dereferencing
A number of examples have been posted here where that could cause
serious problems for some implementations other than the AS/400.

Mar 22 '06 #274

Dik T. Winter

In article <Pi************ *************** ********@unix42 .andrew.cmu.edu > "Arthur J. O'Dwyer" <aj*******@andr ew.cmu.edu> writes:

On Wed, 22 Mar 2006, Dik T. Winter wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
...
> Interesting features. I'm not sure how much different multiple value
> returns would be from values returned via reference parameters
> (pointers). it sounds like a good idea.

The significant difference is that reference parameters (pointers) can't
be in registers.

Why not? I have worked with a lot of implementations where the first few
parameters were passed through registers. (Depending on the processor,
from four to eight.) And i many cases no need at all to put those pointers
on the stack.

I believe Andrew means

void foo(int & x)

I thought we were talking about C.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

Mar 22 '06 #275

Dik T. Winter

In article <sl************ ***********@ran dom.yi.org> Jordan Abel <ra*******@gmai l.com> writes:

On 2006-03-22, Keith Thompson <ks***@mib.or g> wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes: ....
And I still say that constraining C for everyone so that it could fit the
AS/400, rather than making C-on-AS/400 jump through a few more hoops to
match traditional C behaviour, was the wrong trade-off. I accept that
this may well be a minority view.

It is. The C standard wouldn't just have to forbid an implementation
from trapping when it loads an invalid address; it would have to
define the behavior of any program that uses such an address.

Why? It's not that difficult to define the behavior of a program that
"uses" such an address other than by dereferencing, and no problem to
leave the behavior undefined for dereferencing

But that would have locked out machines that strictly separate pointers
and non-pointers, in the sense that you can not load a pointer in a
non-pointer register and the other way around. Note also that on the
AS/400 a pointer is longer than any integer, so doing arithmetic on them
in integer registers would require quite a lot.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

Mar 22 '06 #276

Jordan Abel

On 2006-03-22, Dik T. Winter <Di********@cwi .nl> wrote:

In article <sl************ ***********@ran dom.yi.org> Jordan Abel <ra*******@gmai l.com> writes:
> On 2006-03-22, Keith Thompson <ks***@mib.or g> wrote:
> > Andrew Reilly <an************ *@areilly.bpc-users.org> writes: ... > >> And I still say that constraining C for everyone so that it could fit the
> >> AS/400, rather than making C-on-AS/400 jump through a few more hoops to
> >> match traditional C behaviour, was the wrong trade-off. I accept that
> >> this may well be a minority view.
> >
> > It is. The C standard wouldn't just have to forbid an implementation
> > from trapping when it loads an invalid address; it would have to
> > define the behavior of any program that uses such an address.

>
> Why? It's not that difficult to define the behavior of a program that
> "uses" such an address other than by dereferencing, and no problem to
> leave the behavior undefined for dereferencing

But that would have locked out machines that strictly separate pointers
and non-pointers, in the sense that you can not load a pointer in a
non-pointer register and the other way around. Note also that on the
AS/400 a pointer is longer than any integer, so doing arithmetic on them
in integer registers would require quite a lot.

Surely there's some way to catch and ignore the trap from loading an
invalid pointer, though. I mean, it stops _somewhere_ even as it is now,
unless the register melts the silicon and drips through the floor, then
accelerates to the speed of light.

Mar 22 '06 #277

Andrew Reilly

On Wed, 22 Mar 2006 03:03:47 +0000, Dik T. Winter wrote:

In article <Pi************ *************** ********@unix42 .andrew.cmu.edu > "Arthur J. O'Dwyer" <aj*******@andr ew.cmu.edu> writes:
> On Wed, 22 Mar 2006, Dik T. Winter wrote:
> > Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
> > ...
> > > > Interesting features. I'm not sure how much different multiple value
> > > > returns would be from values returned via reference parameters
> > > > (pointers). it sounds like a good idea.
> > >
> > > The significant difference is that reference parameters (pointers) can't
> > > be in registers.
> >
> > Why not? I have worked with a lot of implementations where the first few
> > parameters were passed through registers. (Depending on the processor,
> > from four to eight.) And i many cases no need at all to put those pointers
> > on the stack.

>
> I believe Andrew means
>
> void foo(int & x)

I thought we were talking about C.

We were. Sure, you can put a pointer into a register argument.

The point is that you can't take the address of a register state variable,
so you can't pass that "by reference" as a way to get multiple return
values out of a function.

Specifically;

double foo(double **pp)
{
double *p = *pp;
result = *p++;
if (result > 0.0) result += *p++;
*pp = p;
return result;
}

void bar(double info[], int N)
{
register double state = 0.0;
register double *p = info;
while (--N >= 0)
state -= foo(&p);
}

Doesn't work, because &p doesn't work.

Another way to make this sort of thing "nice" (as far as code
factorization goes) is either dynamic scoping or nested function
definitions. Then you could have:

void bar(double info[], int N)
{
register double state = 0.0;
register double *p = info;
double foo() {
double result = *p++;
if (result > 0.0) result += *p++;
return result;
}
while (--N >= 0)
state -= foo();
}

That puts more limitations on code structure: you may well want to build
a library of foo-like mutators, and use them in functions other than
bar(). Multiple return values would help with that, by making the state
mutation explicit.

Cheers,

--
Andrew

Mar 22 '06 #278

Keith Thompson

Jordan Abel <ra*******@gmai l.com> writes:

On 2006-03-22, Keith Thompson <ks***@mib.or g> wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Tue, 21 Mar 2006 19:02:55 -0600, Stephen Sprunk wrote:
If a system traps on a prefetch, it's fundamentally broken.
However, a system that traps when an invalid pointer is loaded is
not broken, and the AS/400 is the usual example. Annoying, but
not broken.

And I still say that constraining C for everyone so that it could fit the
AS/400, rather than making C-on-AS/400 jump through a few more hoops to
match traditional C behaviour, was the wrong trade-off. I accept that
this may well be a minority view.

It is. The C standard wouldn't just have to forbid an implementation
from trapping when it loads an invalid address; it would have to
define the behavior of any program that uses such an address.

Why? It's not that difficult to define the behavior of a program that
"uses" such an address other than by dereferencing, and no problem to
leave the behavior undefined for dereferencing

The problem is pointer arithmetic. For example, given:

#define BUFFER_SIZE /* some big number */
int buf[BUFFER_SIZE];
int *ptr = buf + BUFFER_SIZE;
int offset = /* some number */

Requiring (ptr + offset - offset == ptr) probably wouldn't be too much
of a burden for most systems, but requiring (ptr + offset > ptr) could
cause problems. Given the current requirements, buf can be placed
anywhere in memory; there just (on some systems) needs to be a single
extra byte just past the end of the array. Requiring out-of-bounds
pointer arithmetic to work "correctly" in all cases could be much more
burdensome.

And avoiding creating invalid pointers really isn't all that
difficult.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 22 '06 #279

Keith Thompson

Andrew Reilly <an************ *@areilly.bpc-users.org> writes:

On Wed, 22 Mar 2006 03:03:47 +0000, Dik T. Winter wrote:
In article
<Pi************ *************** ********@unix42 .andrew.cmu.edu >
"Arthur J. O'Dwyer" <aj*******@andr ew.cmu.edu> writes: [...]
> I believe Andrew means
>
> void foo(int & x)

I thought we were talking about C.

We were. Sure, you can put a pointer into a register argument.

"void foo(int & x)" is a syntax error in C.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 22 '06 #280