
Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and PDF format by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up to date.

We will always have innocent-appearing code with these kinds of
assumptions built in. However, it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' line-number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
   const char *p;                        /* 5 */
   int d;

   p = s - 1;
   do {
      p++;                               /* 10 */
      if ((((int) p) & (SW - 1)) == 0) {
         do {
            d = *((int *) p);
            p += SW;
         } while (!hasNulByte (d));      /* 15 */
         p -= SW;
      }
   } while (*p != 0);
   return p - s;
} /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less than 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.
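
To see it in action, here is a little test of my own (the test values
are mine, not Hsieh's), assuming the preconditions of a 32 bit int,
CHAR_BIT == 8 and 2's complement arithmetic:

/* My test harness, not part of the original code. */
#include <stdio.h>

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)

int main(void) {
   int with_nul    = 0x41004243;   /* second byte is zero   */
   int without_nul = 0x41424344;   /* no zero byte anywhere */

   printf("with_nul:    %s\n", hasNulByte(with_nul)    ? "zero byte found" : "none");
   printf("without_nul: %s\n", hasNulByte(without_nul) ? "zero byte found" : "none");
   return 0;
}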

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is simply
sizeof(int), which matters later on. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.
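
For comparison, the declaration the standard requires in <string.h> is
simply:

   size_t strlen(const char *s);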

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.
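
One possible rearrangement (this sketch is mine, not taken from the
cited page) never forms a pointer before the start of the string, and
uses the size_t return type mentioned above. The other assumptions,
including the dubious cast in the alignment test, are left exactly as
questionable as before; it only removes item 1:

/* My sketch of a reorganized version; it reuses the hasNulByte and
   SW macros from above and fixes only the p = s - 1 problem. */
#include <stddef.h>

size_t xstrlen2 (const char *s) {
   const char *p = s;
   int d;

   while (*p != 0) {
      if ((((int) p) & (SW - 1)) == 0) {   /* still assumes the cast works */
         do {
            d = *((int *) p);
            p += SW;
         } while (!hasNulByte (d));
         p -= SW;                          /* back to start of the word    */
         while (*p != 0)                   /* finish it a byte at a time   */
            p++;
         break;
      }
      p++;
   }
   return (size_t) (p - s);
}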

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer-like information embedded, such as a segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.
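
As an aside (my own sketch, not part of the original code): where an
implementation provides C99's uintptr_t from <stdint.h>, the
pointer-to-int cast can at least be replaced by a conversion the
standard does describe, although the power-of-two and low-bits-zero
assumptions remain:

/* My sketch: the same alignment test, with the conversion done
   through uintptr_t rather than int.  This drops the "a pointer
   fits in an int" assumption; it still assumes sizeof(int) is a
   power of two and that an int-aligned address has its low-order
   bits zero. */
#include <stdint.h>

#define IS_INT_ALIGNED(p) \
   ((((uintptr_t) (const void *) (p)) & (sizeof (int) - 1)) == 0)

The test in line 11 would then read if (IS_INT_ALIGNED(p)).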

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type ptrdiff_t; here it is
non-negative, so it will fit into a size_t). It then does a concealed
conversion of this into an int, which gives an implementation-defined
result (or raises an implementation-defined signal) if the value
exceeds what will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.
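
For the record, the commonly published form of this trick works on
unsigned values, where wraparound is fully defined, which sidesteps
the signed overflow question. The restatement below is mine, not
Hsieh's, and the 4-byte-word and CHAR_BIT == 8 assumptions remain:

/* My restatement on unsigned arithmetic.  A byte of x that is 0x00
   borrows during the subtraction and so comes out with its high bit
   set; ~x then clears the result for any byte whose own high bit was
   already set; the final mask keeps just the per-byte high bits. */
#define hasNulByteU(x) \
   ((((x) - 0x01010101UL) & ~(x) & 0x80808080UL) != 0)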

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06
Keith Thompson wrote:
"S.Tobias" <si***@FamOuS.B edBuG.pAlS.INVA LID> writes:

.... snip ...

By the same logic, ptr to just past the end of an array (which,
of course, does not point to an object), _can_ compare equal to
null ptr? ;-) - Of course not, semantics for "==" guarantee that
both ptrs must be null (among others) to compare equal.


Interesting. I think you may have found a small flaw in the
standard.

I'm reasonably sure that it's *intended* that a one-past-the-end
pointer is unequal to NULL, but I don't think the standard quite
manages to say so.


I think there is something specific, but even if not, it is
certainly implied. For example, if p points to the last item in an
array, p++ is valid. p-- will then produce a valid dereferenceable
pointer, while p = NULL; p-- will not.

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Mar 15 '06 #181
CBFalconer <cb********@yahoo.com> writes:
Keith Thompson wrote:
"S.Tobias" <si***@FamOuS.BedBuG.pAlS.INVALID> writes:

... snip ...

By the same logic, ptr to just past the end of an array (which,
of course, does not point to an object), _can_ compare equal to
null ptr? ;-) - Of course not, semantics for "==" guarantee that
both ptrs must be null (among others) to compare equal.


Interesting. I think you may have found a small flaw in the
standard.

I'm reasonably sure that it's *intended* that a one-past-the-end
pointer is unequal to NULL, but I don't think the standard quite
manages to say so.


I think there is something specific, but even if not, it is
certainly implied. For example, if p points to the last item in an
array, p++ is valid. p-- will then produce a valid dereferenceable
pointer, while p = NULL; p-- will not.


No, that doesn't follow.

This:
p = NULL;
p --;
normally invokes undefined behavior, but only because the behavior
isn't defined. As far as I can tell, there's actually no explicit
statement in the standard that arithmetic on a null pointer invokes
undefined behavior.

But assume that there exists an array
char foo[15];
such that foo+15==NULL:
char *p = foo + 15; /* p == NULL */
p --;

In this case, the behavior of "p --;" is defined by C99 6.5.6p8:

... and if the expression Q points one past the last element of an
array object, the expression (Q)-1 points to the last element of
the array object.

Allowing a pointer just past the end of an array to be equal to NULL
would be inconvenient (which is why I think the standard should be
corrected), but as far as I can tell it wouldn't actually violate the
standard.
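
To make the inconvenience concrete, here is a small sketch (the
function name and the array size are purely illustrative):

#include <stddef.h>

/* Illustrative only.  The usual "loop until one past the end" idiom
   is fine either way, but code that also treats NULL as an "invalid
   pointer" sentinel relies on a one-past-the-end pointer never
   comparing equal to NULL. */
static size_t count_nonzero(const char foo[15])
{
   const char *end = foo + 15;   /* valid one-past-the-end pointer */
   const char *p;
   size_t n = 0;

   for (p = foo; p != end; p++)
      if (*p != 0)
         n++;

   /* If foo+15 could legitimately compare equal to NULL, a later
      test like "if (end != NULL)" would no longer mean what the
      programmer intended. */
   return n;
}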

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 15 '06 #182
Keith Thompson wrote:
"Rod Pemberton" <do*********@so rry.bitbuck.cmm > writes:
[...]
BTW, I've heard one of the Pascal standards added pointers...


<OT>Pascal has always had pointers.


But not uncontrolled pointers, except in abortive quasi-Pascals
such as Borland/Delphi.
</OT>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>
Mar 15 '06 #183
Chris Torek wrote:
.... snip ...
(I really do have a "strictcc" command, too. Some might regard
it as cheating, but it works.)


Not me (regard it as cheating). I also have the equivalent,
through an alias. I call it cc. The direct access is called gcc.
The alias causes "cc <nothing>" to be translated into gcc --help.

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>
Mar 15 '06 #184
On Tue, 14 Mar 2006 14:43:59 -0500
"Rod Pemberton" <do*********@so rry.bitbuck.cmm > wrote:

"Ed Prochak" <ed*******@gmai l.com> wrote in message
news:11******** **************@ i39g2000cwa.goo glegroups.com.. .

How about a bitsliced machine that uses only 6bit integers?


I thought those died out. Were any of those CPUs actually used in a
computer sufficiently advanced to compile C?


Texas Instruments had a range of machines based on a bit-slice
design that became the basis of the TMS9900 processor. I don't
recall if there was a C compiler for it, but it certainly could have
supported one easily.

--
C:>WIN | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects. | licences available see
| http://www.sohara.org/
Mar 15 '06 #185
In article <11*********************@i40g2000cwc.googlegroups.com> "Ed Prochak" <ed*******@gmail.com> writes:
Dik T. Winter wrote:
In article <11*********************@j33g2000cwa.googlegroups.com> "Ed Prochak" <ed*******@gmail.com> writes: ....
> -- datatype sizes are dependent on the underlying hardware. While a lot
> of modern hardware has formed around the common 8bit char, and
> multiples of 16 for int types (and recent C standards have started to
> impose these standards), C still supports machines that used 9bit char
> and 18bit and 36bit integers. This was the most frustrating thing for
> me when I first learned C. It forces precisely some of the hidden
> assumptions of this topic.
The same is true for Pascal.


guess I'm getting forgetful in my old age. (haven't touched PASCAL in
over 10 years). I thought PASCAL defined fixed ranges for the datatypes
like integers. I guess I didn't port enough PASCAL applications to see
the difference. (I could have sworn you'd get an error on X=32767+2;)


In the original version of Pascal that was certainly *not* an error.
The range of integers was from -576460752303423487 to +576460752303423487,
although values in absolute value greater than 281474976710655 were
unreliable in some calculations. Do you know where the original limit
of set sizes to 60 elements comes from?
What is valid for C is also valid for Pascal. But not all is valid.
Code generation, for instance, is *not* part of the compiler. Without
a back-end that translates the code generated by the compiler to actual
machine instructions, you are still nowhere. So your simple, easily
written initial compiler was not really an easily written initial
compiler: you have to count the assembler you use as a back-end. In
contrast, the very
first Pascal compiler was really single-pass and generated machine code
on-the-fly, without backend. Everything ready to run. (BTW, on that
machine the linking stage was almost never pre-performed; it took only
a few milliseconds.)


Yes PASCAL and P-code, you have a point there, but I'm not sure it is
in your favor.


*Not* P-code. The Pascal version I refer to predates P-code quite a bit.
P-code came into the picture when porting the language to other
processors became an issue. The original compiler generated direct machine
instructions (and that could still be done on other architectures).
C->native assembler->program on native hardware
Pascal->program in p-code-> runs in p-code interpreter


In the original version of Pascal it was:
Pascal->program on native hardware
without an intermediate assembler or an interpreter.
In C "register" is only a suggestion, it is not necessary to follow it.


The point is why even include this feature?


It was included at a point in time when optimisation by some compilers
was not as good as you would wish. In a similar way the original version
of Fortran had a statement with which you could tell the compiler what
the probability was that a branch would be taken or not.
On the other hand, the very first Pascal compiler already did optimisation,
not as a separate pass, but as part of the single pass it had. You
could tweak quite a bit with it when you had access to the source of the
compiler.


So the PASCAL compiler was more advanced than the C compilers of the
time. Do you think maybe PASCAL being a more abstract HLL than C
might have had an effect here? (More likely, though, it was that
PASCAL predated C, at least in widespread use.)


No, it was because the machine that compiler was running on was quite a
bit larger and faster than the machines C compilers tended to run on.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 15 '06 #186
In article <11**********************@j52g2000cwj.googlegroups.com> "Ed Prochak" <ed*******@gmail.com> writes:
Dik T. Winter wrote:

....
Indeed. But even when we look at the published instructions C falls
short of providing a construct for every one. Where is the C construct
to do a multiply step available in quite a few early RISC machines?
Note also that in assembler you can access the special bits indicating
overflow and whatever (if they are available on the machine). How to
do that in C?


Well, you cannot, but those processors did not even exist when C was
created. So those features didn't make it. To some degree, C is more of
a PDP assembler.


How do you get access to the condition bits?
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 15 '06 #187
Keith Thompson wrote:
"Ed Prochak" <ed*******@gmai l.com> writes:
Keith Thompson wrote: [...]

<snip> There's a continuum from raw machine language to very high-level
languages. Macro assembler is only a very small step up from
non-macro assembler. C is a *much* bigger step up from that. Some C
constructs may happen to map to single instructions for *some*
compiler/CPU combinations; they might map to multiple instructions, or
even none, for others. An assignment statement might copy a single
scalar value (integer, floating-point, or pointer) -- or it might copy
an entire structure; the C code looks the same, but the machine code
is radically different.

Using entirely arbitrary units of high-level-ness, I'd call machine
language close to 0, assembly language 10, macro assembler 15, and C
about 50. It might be useful to have something around 35 or so.
(This is, of course, mostly meaningless.)
That's really interesting, because I have pondered the question and
answered it in nearly the same way. However, the conclusion I came to
is that C occupies a relatively large range in the continuum, not a
single point.

C dialects that accept inline assembly push the lower bound much lower
than it would otherwise be. Likewise, the wealth of freely-available C
libraries push the upper bound much further up. You can run a
real-world system written in C all the way from the top to the bottom--
from the GUI (GNOME, for example), to the kernel; from the compiler to
the shells and interpreters.

As the continuum of C expands, the C standard (ideally) acts to reclaim
and make (semi-) portable chunks of this continuum. Look at
'restrict', for example. It enables portable annotation of aliasing
assumptions so that encroachment of the lower bounds is not necessary
for better efficiency. Is it perfect? No. Does it need to be? No.
Is it workable? Yes.
Assembly language is usually untyped; types are specified by which
instruction you use, not by the types of the operands. C, by
contrast, associates types with variables. It often figures out how
to implement an operation based on the types of its operands, and many
operations are disallowed (assigning a floating-point value to a
pointer, for example).

I know the old joke that C combines the power of assembly language
with the flexibility of assembly language. I even think it's funny.
But it's not realistic, at least for C programmers who care about
writing good portable code.


I've also pondered that joke on many occasions. Each time I see it, I
think it's more and more of a compliment. But if it added that C is
usually more efficient than non-deity-level assembly, and that
well-written C is nearly pathologically portable, it wouldn't really be
a joke, would it?

In some respects, C is like English: overly susceptible to insane
degrees of corruption, but all the same, nearly universally understood
regardless, for better or for worse.

Mark F. Haigh
mf*****@sbcglobal.net

Mar 15 '06 #188
msg
Dik T. Winter wrote:
the original version of Fortran had a statement with
which you could tell the compiler what the probability
was that a branch would be taken or not.


Sorry I don't have access to our archives at the moment -
we do have materials from 70x and 650 Fortran; to
which machine are you referring? 704, 709, 650?

Michael Grigoni
Cybertheque Museum
Mar 15 '06 #189
"Mark F. Haigh" <mf*****@sbcglo bal.net> writes:
Keith Thompson wrote:
"Ed Prochak" <ed*******@gmai l.com> writes:
> Keith Thompson wrote:

[...]

<snip>
There's a continuum from raw machine language to very high-level
languages. Macro assembler is only a very small step up from
non-macro assembler. C is a *much* bigger step up from that. Some C
constructs may happen to map to single instructions for *some*
compiler/CPU combinations; they might map to multiple instructions, or
even none, for others. An assignment statement might copy a single
scalar value (integer, floating-point, or pointer) -- or it might copy
an entire structure; the C code looks the same, but the machine code
is radically different.

Using entirely arbitrary units of high-level-ness, I'd call machine
language close to 0, assembly language 10, macro assembler 15, and C
about 50. It might be useful to have something around 35 or so.
(This is, of course, mostly meaningless.)


That's really interesting, because I have pondered the question and
answered it in nearly the same way. However, the conclusion I came to
is that C occupies a relatively large range in the continuum, not a
single point.

C dialects that accept inline assembly push the lower bound much lower
than it would otherwise be. Likewise, the wealth of freely-available C
libraries push the upper bound much further up. You can run a
real-world system written in C all the way from the top to the bottom--
from the GUI (GNOME, for example), to the kernel; from the compiler to
the shells and interpreters.


Standard C doesn't accept inline assembly. If some particular
compiler does so, it's an extension -- and you might as well think of
it as an assembler that also accepts C syntax.

As far as libraries are concerned, most of them aren't part of
standard C either -- and they needn't be either implemented in, or
called from, C.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 15 '06 #190
