
Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent-appearing code with these kinds of
assumptions built in. However, it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' line-number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
   const char *p;                      /* 5 */
   int d;

   p = s - 1;
   do {
      p++;                             /* 10 */
      if ((((int) p) & (SW - 1)) == 0) {
         do {
            d = *((int *) p);
            p += SW;
         } while (!hasNulByte (d));    /* 15 */
         p -= SW;
      }
   } while (*p != 0);
   return p - s;
}                                      /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less than 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.
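
To make the mechanism concrete, here is a small illustration (my own
sketch, not part of the downloaded code). The unsigned constants are
my choice, to keep the subtraction well defined; the values assume
4-byte ints and 8-bit chars, just as the macro itself does:

#include <stdio.h>

/* Same trick with unsigned arithmetic: (x - 0x01010101) sets the high
   bit of any byte that was 0x00, and the (~x & 0x80808080) mask throws
   away high bits that were already set in x, removing false positives. */
#define hasNulByteU(x) (((x) - 0x01010101u) & ~(x) & 0x80808080u)

int main (void)
{
   unsigned int no_nul   = 0x61626364u;  /* four non-zero bytes    */
   unsigned int with_nul = 0x61626300u;  /* low-order byte is zero */

   printf ("no_nul:   %s\n", hasNulByteU (no_nul)   ? "has a zero byte" : "no zero byte");
   printf ("with_nul: %s\n", hasNulByteU (with_nul) ? "has a zero byte" : "no zero byte");
   return 0;
}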

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.
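
For reference, the declaration the standard gives in <string.h> is
simply:

size_t strlen(const char *s);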

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.
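
One possible rearrangement is sketched below. It is only a sketch: it
keeps the word-at-a-time scan, and therefore every other assumption in
the list above (the pointer-to-integer conversion, the alignment test,
the type pun through an int pointer), but it never forms a pointer
before the start of the string. The uintptr_t cast is a C99 feature
and, like the unsigned constants, is my own substitution:

#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uintptr_t */

#define hasNulByteU(x) (((x) - 0x01010101u) & ~(x) & 0x80808080u)
#define SW (sizeof (unsigned int))

size_t xstrlen2 (const char *s) {
   const char *p = s;

   /* Walk bytes until p is int-aligned; no pointer below s is formed. */
   while (((uintptr_t) p & (SW - 1)) != 0) {
      if (*p == '\0')
         return (size_t) (p - s);
      p++;
   }

   /* Scan a word at a time.  The load still assumes the alignment is
      sufficient and that reading these bytes through an unsigned int
      is acceptable on the target system. */
   for (;;) {
      unsigned int d = *(const unsigned int *) p;
      if (hasNulByteU (d))
         break;
      p += SW;
   }

   /* Finish byte-by-byte inside the word that holds the terminator. */
   while (*p != '\0')
      p++;
   return (size_t) (p - s);
}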

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type ptrdiff_t, but here
will always fit into a size_t). It then does a concealed conversion
of this into an int, which could cause undefined or
implementation-defined behaviour if the value exceeds what will fit
into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06
On Thu, 09 Mar 2006 09:13:24 +1100, Andrew Reilly
<an*************@areilly.bpc-users.org> wrote:
C is not a higher-level language. It's a universal assembler. Pick
another one.


Nice parrot. I think the original author of that phrase meant it as a
joke.

I spent 25 years writing assembler. C is a higher-level language.

--
Al Balmer
Sun City, AZ
Mar 9 '06 #91
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
[...]
C is not a higher-level language.
It's higher-level than some, lower than others. I'd call it a
medium-level language.
It's a universal assembler.


Not in any meaningful sense of the word "assembler".

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 9 '06 #92
In article <12*************@news.supernews.com> Andrey Tarasevich <an**************@hotmail.com> writes:
....
This is the first time I have seen this code, but:
   const char *p;                      /* 5 */
....
      if ((((int) p) & (SW - 1)) == 0) {
....
This will not result in the desired answer on the Cray 1.
On the Cray 1 a byte pointer has the word address (64 bit words)
in the lower 48 bits and a byte offset in the upper 16 bits.
So this code actually tests whether the *word* address is even.
And so the code will fail to give the correct answer in the
following case:
   char f[] = "0123456789";
   int i;
   f[1] = 0;
   i = strlen(f + 2);
when f starts at an even word address it will give the answer 1
instead of the correct 8.
d = *((int *) p);


Note that here the byte offset in the pointer is ignored, so d is
loaded from the word that contains the character array:
"0\000234567".

Again a hidden assumption, I think. (It is exactly this hidden
assumption that made porting of a particular program extremely
difficult to the Cray 1. The assumption was that in a word
pointer the lowest bit was 0, and that bit was used for
administrative purposes.)
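
As a rough illustration (a model of the representation described
above, not actual Cray code), this shows why a low-bits test inspects
the word address rather than the byte offset:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical Cray-1-style byte pointer: word address in the low
   48 bits, byte offset within the word in the top 16 bits. */
static uint64_t make_byte_ptr (uint64_t word_addr, unsigned byte_off)
{
   return ((uint64_t) byte_off << 48) | (word_addr & 0xFFFFFFFFFFFFull);
}

int main (void)
{
   uint64_t p = make_byte_ptr (0x1000, 2);   /* byte 2 of word 0x1000 */

   /* The test from the original code looks only at the low bits,
      i.e. the word address, so it calls this pointer "aligned"
      even though the byte offset is 2. */
   printf ("low-bits test says: %s\n",
           (p & 7) == 0 ? "aligned" : "unaligned");
   return 0;
}
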
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 9 '06 #93
In article <12*************@corp.supernews.com> msg <msg@_cybertheque.org_> writes:
OK, I've made enough of a fool of myself already. I'll go and have that
second cup of coffee for the morning, before I start going on about having
the standard support non-2's complement integers, or machines that have no
arithmetic right shifts...


I get queasy reading the rants against 1's complement architectures; I
wish Seymour Cray were still around to address this.


There are quite a few niceties indeed. Negation of a number is really
simple, just a logical operation, and there are others. This means
simpler hardware for the basic operations on signed objects, except for
the carry. It was only when I encountered the PDP that I saw the first
2's complement machine.

On the other hand, when Seymour Cray started his own company, those
machines were 2's complement. And he shifted from 60-bit to 64-bit
words, but still retained octal notation (he did not like hexadecimal
at all).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 9 '06 #94
On Wed, 08 Mar 2006 18:26:35 -0500, CBFalconer wrote:
"Clark S. Cox III" wrote:
"Peter Harrison" <pe***@cannock. ac.uk> said:
Paul Burke <pa**@scazon.co m> writes:

> My simple mind must be missing something big here. If for
> pointer p, (p-1) is deprecated because it's not guaranteed that
> it points to anything sensible, why is p++ OK? There's no
> boundary checking in C (unless you put it in).


It's not deprecated, it's illegal. Once you have involved UB all
bets are off. Without the p-1 the p++ statements are fine, as long
as they don't advance the pointer more than one past the end of the
object.


It's no more "illegal" than any of the other undefined behaviour that you
pointed out in that code snippet. There aren't different classes of
undefined behaviour, are there?

I reckon I'll just go with the undefined flow, in the interests of
efficient, clean code on the architectures that I target. I'll make sure
that I supply a document specifying how the compilers must behave for all
of the undefined behaviours that I'm relying on, OK? I have no interest
in trying to make my code work on architectures for which they don't hold.

Of course, that list will pretty much just describe the usual flat-memory,
2's complement machine that is actually used in almost all circumstances
in the present day, anyway. Anyone using anything else already knows that
they're in a world of trouble and that all bets are off.

--
Andrew

Mar 9 '06 #95
Now responding to the basic article:

In article <44***************@yahoo.com> cb********@maineline.net writes:
....
#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT.
It does not allow for larger sizeof(int) (as it does not allow for
other values of CHAR_BIT). When sizeof(int) > 4 it will only show
whether there is a zero byte in the low order four bytes. When
sizeof(int) < 4 it will give false positives. Both constants have
to be changed when sizeof(int) != 4. Moreover, it will not work on
1's complement or sign-magnitude machines. Using unsigned here is
most appropriate.
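
For instance, with unsigned constants the 32-bit form and a
corresponding 64-bit form would look like this (a sketch of the
adjustment only, not tested on any particular machine):

#include <stdint.h>

/* 32-bit form, unsigned to sidestep signed overflow and the
   representation issues mentioned above: */
#define hasNulByte32(x) (((x) - UINT32_C(0x01010101)) & ~(x) & UINT32_C(0x80808080))

/* 64-bit form: every byte of both constants is widened with the word. */
#define hasNulByte64(x) (((x) - UINT64_C(0x0101010101010101)) & \
                         ~(x) & UINT64_C(0x8080808080808080))
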
      if ((((int) p) & (SW - 1)) == 0) {
Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.


It is false on the Cray 1 and its derivatives. See another article
by me where I show that it may give wrong answers.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 9 '06 #96
On Wed, 08 Mar 2006 23:53:23 +0000, Christian Bau wrote:
Only because the standard says so. Didn't have to be that way. There are
plenty of logically correct algorithms that could exist that involve
pointers that point somewhere outside of a[0..N]. As long as there's no
de-referencing, no harm, no foul. (Consider the simple case of iterating
through an array at a non-unit stride, using the normal p < s + N
termination condition. The loop finishes with p > s + N and the standard
says "pow, you're dead", when the semantically identical code written with
integer indexes has no legal problems.)
Consider a typical implementation with 32 bit pointers and objects that
can be close to 2 GByte in size.


Yeah, my world-view doesn't allow individual objects to occupy half the
address space or more. I'm comfortable with that restriction, but I can
accept that there may be others that aren't. They're wrong, of course :-)
typedef struct { char a [2000000000]; } giantobject;

giantobject anobject;

giantobject* p = &anobject;
giantobject* q = &anobject - 1;
giantobject* r = &anobject + 1;
giantobject* s = &anobject + 2;

It would be very hard to implement this in a way that both q and s would
be valid; for example, it would be very hard to achieve that q < p, p <
r and r < s are all true. If q and s cannot be both valid, and there
isn't much reason why one should be valid and the other shouldn't, then
neither can be used in a program with any useful guarantees by the
standard.


Yes, very hard indeed. Partition your object or use a machine with bigger
addresses. Doesn't seem like a good enough reason to me to break a very
useful abstraction.

Posit: you've got N bits to play with, both for addresses and integers.
You need to be able to form a ptrdiff_t, which is a signed quantity, to
compute d = &anobject.a[i] - &anobject.a[j], for any indices i,j within the
range of the array. The range of signed quantities is just less than half
that of unsigned. That range must therefore define how large any
individual object can be. I.e., half of your address space. Neat, huh?
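
A back-of-the-envelope version of that argument, with made-up 16-bit
numbers purely for illustration:

#include <stdio.h>

int main (void)
{
   /* Hypothetical machine: 16-bit addresses and a 16-bit signed
      ptrdiff_t.  An object bigger than 32767 bytes makes some
      in-object pointer differences unrepresentable. */
   long object_size = 40000L;            /* more than half the space    */
   long max_ptrdiff = 32767L;            /* largest 16-bit signed value */
   long worst_case  = object_size - 1;   /* &a[size-1] - &a[0]          */

   printf ("worst-case difference %ld %s fit (max %ld)\n",
           worst_case, worst_case <= max_ptrdiff ? "does" : "does not",
           max_ptrdiff);
   return 0;
}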

Yeah, yeah, for any complicated problem there's an answer that is simple,
neat and wrong.

--
Andrew

Mar 9 '06 #97
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 18:26:35 -0500, CBFalconer wrote:
"Clark S. Cox III" wrote:
"Peter Harrison" <pe***@cannock. ac.uk> said:
> Paul Burke <pa**@scazon.co m> writes:
>
>> My simple mind must be missing something big here. If for
>> pointer p, (p-1) is deprecated because it's not guaranteed that
>> it points to anything sensible, why is p++ OK? There's no
>> boundary checking in C (unless you put it in).
It's not deprecated, it's illegal. Once you have involved UB all
bets are off. Without the p-1 the p++ statements are fine, as long
as they don't advance the pointer more than one past the end of the
object.


It's no more "illegal" than any of the other undefined behaviour that you
pointed out in that code snippet. There aren't different classes of
undefined behaviour, are there?


Right, "illegal" probably isn't the best word to describe undefined
behavior. An implementation is required to diagnose syntax errors and
constraint violations; it's specifically *not* required to diagnose
undefined behavior (though it's allowed to do so).
I reckon I'll just go with the undefined flow, in the interests of
efficient, clean code on the architectures that I target. I'll make sure
that I supply a document specifying how the compilers must behave for all
of the undefined behaviours that I'm relying on, OK? I have no interest
in trying to make my code work on architectures for which they don't hold.
Ok, you can do that if you like. If you can manage to avoid undefined
behavior altogether, your code is likely to work on *any* system with
a conforming C implementation; if not, it may break when ported to
some exotic system.

For example, code that makes certain seemingly reasonable assumptions
about pointer representations will fail on Cray vector systems. I've
run into such code myself; the corrected code was actually simpler and
cleaner.

If you write code that depends on undefined behavior, *and* there's a
real advantage in doing so on some particular set of platforms, *and*
you don't mind that your code could fail on other platforms, then
that's a perfectly legitimate choice. (If you post such code here in
comp.lang.c, you can expect us to point out the undefined behavior;
some of us might possibly be overly enthusiastic in pointing it out.)
Of course, that list will pretty much just describe the usual flat-memory,
2's complement machine that is actually used in almost all circumstances
in the present day, anyway. Anyone using anything else already knows that
they're in a world of trouble and that all bets are off.


All bets don't *need* to be off if you're able to stick to what the C
standard actually guarantees.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 9 '06 #98
On Thu, 09 Mar 2006 00:00:34 +0000, Christian Bau wrote:
Question: If the C Standard guarantees that for any array a, &a [-1]
should be valid, should it also guarantee that &a [-1] != NULL
Probably, since NULL has been given the guarantee that it's unique in some
sense. In an embedded environment, or assembly language, the construct
could of course produce NULL (for whatever value you pick for NULL), and
NULL would not be special. I don't know that insisting on the existence of
a unique and special NULL pointer value is one of the standard's crowning
achievements, either. It's convenient for lots of things, but it's just
not the way simple hardware works, particularly at the limits.
and that
&a [-1] < &a [0]
Sure, in the ptrdiff sense that I mentioned before.
I.e., (a - 1) - (a + 0) < 0 (indeed, identically -1)
In that case, what happens when I create an array with a single element
that is an enormously large struct?


Go nuts. If your address space is larger than your integer range (as is
the case for I32LP64 machines), your compiler might have to make sure that
it performs the difference calculation to sufficient precision.

I still feel comfortable about this failing to work for objects larger
than half the address space, or even for objects larger than the range of
an int. That's IMO, a much less uncomfortable restriction than the one
that the standard seems to have stipulated, which is that the simple and
obvious pointer arithmetic that you've used in your examples works in some
situations and doesn't work in others. (Remember: it's all good if those
array references are in a function that was itself passed (&foo[n], for
n>=1) as the argument.)

Cheers,

--
Andrew

Mar 9 '06 #99
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 23:53:23 +0000, Christian Bau wrote: [...]
Consider a typical implementation with 32 bit pointers and objects that
can be close to 2 GByte in size.


Yeah, my world-view doesn't allow individual objects to occupy half the
address space or more. I'm comfortable with that restriction, but I can
accept that there may be others that aren't. They're wrong, of course :-)


I can easily imagine a program that needs to manipulate a very large
data set (for a scientific simulation, perhaps). For a data set that
won't fit into memory all at once, loading as much of it as possible
can significantly improve performance.
typedef struct { char a [2000000000]; } giantobject;

giantobject anobject;

giantobject* p = &anobject;
giantobject* q = &anobject - 1;
giantobject* r = &anobject + 1;
giantobject* s = &anobject + 2;

It would be very hard to implement this in a way that both q and s would
be valid; for example, it would be very hard to achieve that q < p, p <
r and r < s are all true. If q and s cannot be both valid, and there
isn't much reason why one should be valid and the other shouldn't, then
neither can be used in a program with any useful guarantees by the
standard.


Yes, very hard indeed. Partition your object or use a machine with bigger
addresses. Doesn't seem like a good enough reason to me to break a very
useful abstraction.


Your "very useful abstraction" is not something that has *ever* been
guaranteed by any C standard or reference manual.
Posit: you've got N bits to play with, both for addresses and integers.
You need to be able to form a ptrdiff_t, which is a signed quantity, to
compute d = &anobject.a[i] - &anobject.a[j], for any indices i,j within the
range of the array. The range of signed quantities is just less than half
that of unsigned. That range must therefore define how large any
individual object can be. I.e., half of your address space. Neat, huh?


The standard explicitly allows for the possibility that pointer
subtraction within a single object might overflow (if so, it invokes
undefined behavior). Or, given that C99 requires 64-bit integer
types, making ptrdiff_t larger should avoid the problem for any
current systems (I don't expect to see full 64-bit address spaces for
a long time).
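
(For what it's worth, an implementation's limit is easy to inspect;
this tiny check is only illustrative:)

#include <stdint.h>
#include <stdio.h>

int main (void)
{
   /* PTRDIFF_MAX, from <stdint.h>, is the largest value a pointer
      difference can represent on this implementation. */
   printf ("PTRDIFF_MAX = %jd\n", (intmax_t) PTRDIFF_MAX);
   return 0;
}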

The standard is full of compromises. Not everyone likes all of them.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 9 '06 #100
