
Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and PDF format by searching for N869
and N1124. [1] The latter does not have a text version, but is more
up to date.

We will always have innocent-appearing code with these kinds of
assumptions built in. However, it would be wise to annotate such code
to make the assumptions explicit, which can avoid a great deal of
agony when the code is reused on other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' line-number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
    const char *p;                        /* 5 */
    int d;

    p = s - 1;
    do {
        p++;                              /* 10 */
        if ((((int) p) & (SW - 1)) == 0) {
            do {
                d = *((int *) p);
                p += SW;
            } while (!hasNulByte (d));    /* 15 */
            p -= SW;
        }
    } while (*p != 0);
    return p - s;
}                                         /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less than 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.
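
To see the trick in action, here is a small test harness (mine, not
part of the original post), assuming exactly the preconditions above:
32-bit values, 8-bit bytes, no padding. Unsigned values are used so
the subtraction itself cannot overflow.

#include <stdio.h>

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)

int main (void) {
    unsigned int no_zero   = 0x41424344;  /* "ABCD": no zero byte       */
    unsigned int with_zero = 0x41420044;  /* one byte is zero           */

    /* Subtracting 0x01010101 borrows through a 0x00 byte, leaving its
       high bit set; & ~x discards bytes whose high bit was already set;
       & 0x80808080 keeps only those per-byte flags.                    */
    printf("%08x -> %s\n", no_zero,
           hasNulByte (no_zero)   ? "contains a zero byte" : "no zero byte");
    printf("%08x -> %s\n", with_zero,
           hasNulByte (with_zero) ? "contains a zero byte" : "no zero byte");
    return 0;
}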

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.
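
A hedged sketch of one such rearrangement (mine, not Hsieh's): it
starts at s rather than s - 1 and returns size_t, reusing the
hasNulByte macro defined above and keeping the other listed
assumptions (the hasNulByte preconditions, aligned word loads being
safe, the low bits of a converted pointer reflecting alignment), so
only the undefined p = s - 1 and the int return go away. uintptr_t
comes from C99's <stdint.h> and is itself optional.

#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uintptr_t (optional, but widely provided) */

size_t xstrlen2 (const char *s) {
    const char *p = s;          /* start at s, never before it */
    int d;

    /* advance byte by byte until p is int-aligned or the NUL is found */
    while (((uintptr_t) p & (sizeof (int) - 1)) != 0) {
        if (*p == 0)
            return (size_t) (p - s);
        p++;
    }
    /* scan a word at a time until some word contains a zero byte */
    for (;;) {
        d = *((const int *) p);
        if (hasNulByte (d))
            break;
        p += sizeof (int);
    }
    /* finish byte by byte inside that word */
    while (*p != 0)
        p++;
    return (size_t) (p - s);
}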

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.
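
For comparison, here is a hedged sketch (mine) of how such a test is
often written where C99's optional uintptr_t is available. It is
still only as portable as the assumption, which the standard never
makes, that the low-order bits of the converted pointer reflect
alignment.

#include <stdint.h>   /* uintptr_t is optional even in C99 */

/* Nonzero if p looks suitably aligned for an int on this
   implementation.  sizeof(int) stands in for int's alignment, as in
   the original code; the true alignment requirement may be smaller. */
static int int_aligned (const void *p) {
    return ((uintptr_t) p & (sizeof (int) - 1)) == 0;
}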

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type ptrdiff_t; since the
result here is non-negative it would also fit in a size_t). It then
does a concealed conversion of this into an int, which can cause
implementation-defined behaviour (or an implementation-defined
signal) if the value exceeds what will fit into an int. This
conversion is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06
On 10 Mar 2006 10:16:54 -0800, "Ed Prochak" <ed*******@gmail.com>
wrote:

Richard G. Riley wrote:
"Ed" posted the following on 2006-03-10: <snip: ++x versus x++>
How do you see them being different at the assembler level? They are
not, are they? It's just when you do the (pseudo ASM) INC_REGx or
ADDL 02,REGx or whatever that matters, isn't it?


Actually I still think in PDP assembler at times (my first assembler
programming), so y=x++; really does map to a single instruction which
both moves the value to y and increments x (which had to be held in a
register IIRC).


PDP was (almost) the entire product line of DEC for much of its life,
containing some similar architectures and some quite different ones
with (thus) different assembly languages.

You are probably thinking of PDP-11 autoincrement. This uses the
location _addressed_ by a register (only) which it increments (by
stride) at the same time:
register char * a = ?, b = *a++; // compiles to MOVB (R1)+, R2
register int * a = ?, b = *a++; // compiles to MOV (R1)+, R2
// note PDP-11 pointers are byte addresses & this adds _2_
// I write register explicitly for clarity, although a compiler could
// place variables not declared register in a register, and could
// choose not to use a register even though you specify it.
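
To make the heritage argument concrete, the example usually cited is
the classic string-copy idiom, where the dereferences and
post-increments can map onto autoincrement addressing. The function
below is an illustration added here, not part of the original post.

/* Copy the string at src to dst, including the terminating NUL.
   On a PDP-11-style machine each step of *dst++ = *src++ can become a
   single MOVB (R1)+, (R2)+; on other machines it may compile quite
   differently. */
void str_copy (char *dst, const char *src) {
    while ((*dst++ = *src++) != '\0')
        ;   /* empty body: the copy and both increments are in the test */
}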

<snip other points about assy, macro assy, and C I mostly agree with>

- David.Thompson1 at worldnet.att.net
Mar 20 '06 #261
On 14 Mar 2006 10:52:28 -0800, "Ed Prochak" <ed*******@gmail.com>
wrote:

Keith Thompson wrote:
"Ed Prochak" <ed*******@gmail.com> writes:
But this is a common feature of many high-level languages. Ada, for
example, has an implementation-defined set of integer types, similar to
what C provides; I've never heard anyone claim that Ada is an
assembler.

Forgive my memory, but is it PL/1 or Ada that lets the programmer define
what integer type he wants? Syntax was something like
INTEGER*12 X
which defined X as a 12-bit integer. (Note that such syntax is portable in
that on two different processors, you still know that the range of X is
+2048 to -2047.)


You mean -2048 to +2047 for two's complement, and -2047 to +2047 for
the extremely rare case of non-2sC or 2sC-with-trap-representation.

PL/1 is like DECLARE X FIXED BINARY (11); /* not counting sign */
Pascal uses the actual range like X: -2047 .. +2047 and Ada similarly
except, as arguably usual, more verbosely in most cases. In both cases
you are only guaranteed _at least_ 12 bits; in Ada you can
additionally specify a 'representation clause' that requires exactly
12 bits (or a diagnostic if that can't be implemented).

The syntax INTEGER*n is a common extension in Fortran (though not
standard) for an integer of n _bytes_ not bits.
The point is a 16-bit integer in Ada is always a 16-bit integer, and
writing
x = 32768 + 10
will always overflow in Ada, but it is dependent on the compiler and
processor in C. It can overflow, or it can succeed.

But my point on this was, you need to know your target processor in C
more than in a language like Ada. This puts a burden on the C
programmer that is closer to an assembler programmer's on the same
machine than to an Ada programmer's.
I don't think this is true. In both languages it is fairly easy to
write portable but perhaps overconservative code. In C it is easy to
get fairly well 'down to the metal' if you want; in Ada it is fairly
easy to get even further down if the compiler supports it. And at the
extreme, Ada standardly requires an interface to assembler; C does not
do this standardly, but practically all implementations have some way.
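
As a purely illustrative C sketch of the overflow point being debated
(the function and values are mine): whether a sum near 32768 fits
depends on the width the implementation gives int, and since signed
overflow is undefined behaviour in C, the test has to happen before
the addition. 32000 + 800 is used so both operands are valid ints
even on a 16-bit implementation.

#include <limits.h>
#include <stdio.h>

/* Report whether a + b fits in int, and compute it only if it does. */
static int checked_add (int a, int b, int *sum) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;             /* the mathematical result would overflow */
    *sum = a + b;
    return 1;
}

int main (void) {
    int sum;
    if (checked_add (32000, 800, &sum))
        printf("32000 + 800 fits in int here: %d\n", sum);
    else
        printf("32000 + 800 does not fit (INT_MAX = %d)\n", INT_MAX);
    return 0;
}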

<snip> No, I was talking about the original motivation for the design of
the language. It was designed to exploit the register increment on DEC
processors. In the right context (e.g. y=x++;) the increment doesn't
even become a separate instruction, as I mentioned in another post.
Neither motivation nor correct effect; see other posts.

<snip> Let's put it this way: there is a gradient scale, from pure digits
of machine language (e.g., programming opcodes in binary is closer to
the hardware than using octal or hex) at the lowest end, moving up past
assembler to higher and higher levels of abstraction away from the
hardware. On that scale, I put C much closer to assembler than any
other HLL I know. Here are some samples:
s/much/noticeably/ and I agree. (Except see below.)

PERL, BASH, SQL
Also awk. (Although you could consider it subsumed by perl.) And all
Unix shells more or less not just bash. Also the LISP tribe (Scheme,
etc.) and Prolog here or perhaps a smidgen higher.
C++, JAVA
PASCAL, FORTRAN, COBOL
I'd insert Ada here.
C
I'd insert FORTH here. And maybe pull some of the more powerful macro
assemblers above 'basic' assembler.
assembler
HEX opcodes
binary opcodes
I wouldn't distinguish hex from binary; that's trivial.

I'd insert microcode and then registers/gates here.
digital voltages in the real hardware.

And below that turtles. <G>

- David.Thompson1 at worldnet.att.net
Mar 20 '06 #262

Chris Dollin wrote:
Alf P. Steinbach wrote:
C was designed as a portable assembly language, the most successful so
far, so if the term has any practical meaning, then C is that meaning.


I would have said it was designed as a portable language to /replace/
[the use of] assembly language(s) [for a variety of "systems programming"
uses].

C looks nothing like any assembly language I've ever seen, which I
suppose may just indicate my limited knowledge of assemblers. Calling
it "portable assembly language" is an instructive metaphor, but it
is just that - a metaphor.


yes, he gets it!

Mar 21 '06 #263
"Ed Prochak" <ed*******@gmai l.com> wrote in message
news:11******** *************@i 39g2000cwa.goog legroups.com...
Some other languages do so much more for you that you might be scared
to look at the disassembly. e.g. languages that do array bounds
checking for you will generate much more code for a[y]=x; than does C.
You can picture the assembly code for C in your head without much
difficulty. The same doesn't hold true for some other languages.


There are C compilers that will add bounds-checking code for a[y]=x, and
it's quite ugly if disassembled. I think what you mean is that C doesn't
require implementations to do it (and most don't), but other languages do.
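
As a rough illustration (added here, not Sprunk's) of the extra work a
checked store implies, this is roughly what a hand-written equivalent
of a bounds-checked a[y] = x looks like in C; checking compilers
generate something of this shape behind the scenes, each in its own
way.

#include <stdio.h>
#include <stdlib.h>

/* A plain a[y] = x is little more than an indexed store.  The checked
   version pays for a comparison and a branch on every store.         */
static void checked_store (int *a, size_t n, size_t y, int x) {
    if (y >= n) {
        fprintf(stderr, "index %zu out of range [0, %zu)\n", y, n);
        abort();
    }
    a[y] = x;
}

int main (void) {
    int a[16] = {0};
    checked_store (a, 16, 3, 42);        /* stores normally            */
    /* checked_store (a, 16, 99, 0); */  /* would abort, not scribble  */
    printf("a[3] = %d\n", a[3]);
    return 0;
}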

I shudder to think of what the asm would look like for a[y]=x if a were a
C++ object with virtual methods for the [] and = operators, with the latter
having to call a copy constructor on x. For fun, compile as PIC too.

However, I agree with the general statement that when you write C, you have
a good shot at imagining what the corresponding asm would be. I'm not sure
that makes it a glorified assembler, however.

C's greatest feature, and its worst, is that you can do all sorts of ugly
unportable things that virtually no other HLL allows but also have portable
constructs available: it's your choice which to use. This means you can
write non-portable implementation code in the same language as portable user
code, and IMHO is the reason for C's enduring popularity.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Mar 21 '06 #264

Andrew Reilly wrote:
On Tue, 14 Mar 2006 03:13:08 +0000, Dik T. Winter wrote:
In article <pa*******************************@areilly.bpc-users.org> Andrew Reilly <an*************@areilly.bpc-users.org> writes:
> On Mon, 13 Mar 2006 15:31:35 +0000, Dik T. Winter wrote: ...
> > On the other hand for every machine instruction there should be an
> > construct in the assembler to get that instruction. With that in
> > mind C doesn't fit either.
>
> Well, since we're talking about a "universal assembler" (a reasonably
> commonly used term), that's obviously something different from the usual
> machine-specific assembler, which does indeed usually have that property.
> (Although I've met assemblers where the only way to get certain
> instructions was to insert the data for the op-code in-line. Instruction
> coverage sometimes lags behind features added to actual implementations.)


I have met one such, but not for the reason you think. In this case the
assembler knew the instruction, and translated it, but completely wrong.
Apparently an instruction never used, but nevertheless published. And I
needed it.

But what is (in your opinion) a universal assembler? What properties should
it have to contrast it with a HLL?


I posted a page-long description of what I conceive a universal assembler
to be in a previous message in the thread. Perhaps it didn't get to your
news server? Google has it here:
http://groups.google.com/group/comp....8457481?hl=en&

The main properties that it would have, compared to C (some other HLLs
do have some of these properties), are:

a) Rigidly defined functionality, without "optimization", except for
instruction scheduling, in support of VLIW or (some) superscalar cores.
(Different ways of expressing a particular algorithm, which perform more
or less efficiently on different architectures should be coded as such,
and selected at compile/configuration time, or built using
meta-programming techniques.) This is opposed to the HLL view which is
something like: express the algorithm in a sufficiently abstract way and
the compiler will figure out an efficient way to code it, perhaps. Yes,
compilers are really quite good at that, now, but that's not really the
point. This aspect is a bit like my suggestion in the linked post as
being something a bit like the Java spec, but without objects. Tao's
"Intent" VM is perhaps even closer. Not stack based. I would probably
still be happy if limited common-subexpression-elimination (factoring) was
allowed, to paper-over the array index vs pointer/cursor coding style vs
architecture differences.


If I can summarize this as:
-- the source code changes when the underlying processor architecture
changes
then I agree this is a key reason why I consider C a glorified
assembler.

b) Very little or no "language" level support for control structures or
calling conventions, but made-up-for with powerful compile-time
meta-programming facilities, and a standard "macro" library that provides
most of the expected facilities found in languages like C or Pascal. Much
of what are now thought of as compiler implementation features would wind
up in macro libraries. The advantage of this would be that code could be
written to *rely* on specific transformation performance and existence,
instead of just saying "hope that your compiler is clever enough to
recognize this idiom", in the documentation. It would also make possible
the sorts of small code factorizations that happen all the time in
assembly language, but which single-value-return, unnested function call
conventions in C make close to impossible. Or different coding styles,
like threaded interpreters, reasonable without language extensions.
Interesting features. I'm not sure how different multiple-value returns
would be from values returned via reference parameters (pointers). It
sounds like a good idea.
I imagine something like LLVM (http://llvm.cs.uiuc.edu/), but with a
powerful symbolic compile-time macro language on top (e.g. Scheme...), an
algebraic (infix) operator syntax, and an expression parser.

In the mean time, "C", not as defined by the standard, but as implemented
in the half dozen or so compilers that I regularly use, is not so far from
what I want, to make me put in the effort to build my universal assembler
myself.

Cheers,

--
Andrew


Thanks for the contribution to the discussion.
Ed

Mar 21 '06 #265
"Al Balmer" <al******@att.n et> wrote in message
news:sb******** *************** *********@4ax.c om...
On Wed, 08 Mar 2006 08:14:37 +1100, Andrew Reilly
<an************ *@areilly.bpc-users.org> wrote:
On Tue, 07 Mar 2006 13:29:11 -0500, Eric Sosman wrote:
The Rationale says the Committee considered defining
the effects at both ends (bilateral dispensations?), but rejected it for
efficiency reasons. Consider an array of large elements -- structs of
32KB size, say. A system that actually performed hardware checking of
pointer values could accommodate the one-past-the-end rule by allocating
just one extra byte after the end of the array, a byte that the special
pointer value could point at without setting off the hardware's alarms.
But one-before-the-beginning would require an extra 32KB, just to hold
data that could never be used ...


So the standards body broke decades of practice and perfectly safe and
reasonable code to support a *hypothetical* implementation that was so
stupid that it checked pointer values, rather than pointer *use*?
Amazing.


Decades of use? This isn't a new rule.

An implementation might choose, for valid reasons, to prefetch the
data that pointer is pointing to. If it's in a segment not allocated
...


If a system traps on a prefetch, it's fundamentally broken. However, a
system that traps when an invalid pointer is loaded is not broken, and the
AS/400 is the usual example. Annoying, but not broken.

Why IBM did it that way, I'm not sure, but my guess is they found it was
cheaper to do validity/permission checks when the address was loaded than
when it was used since the latter has a latency impact.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Mar 22 '06 #266
On Tue, 21 Mar 2006 14:51:21 -0800, Ed Prochak wrote:
If I can summarize this as:
-- the source code changes when the underlying processor architecture
changes
then I agree this is a key reason why i consider C a glorified
assembler.


That's pretty close. I think that the link to Dan Bernstein's page on the
topic said it better than me: you can use different code and different
approaches where it matters to both the program and to the target
processor, but you can also use a simpler, generic approach that will just
work anywhere, when absolute maximum performance isn't necessary.
b) Very little or no "language" level support for control structures or
calling conventions, but made-up-for with powerful compile-time
meta-programming facilities, and a standard "macro" library that
provides most of the expected facilities found in languages like C or
Pascal. Much of what are now thought of as compiler implementation
features would wind up in macro libraries. The advantage of this would
be that code could be written to *rely* on specific transformation
performance and existence, instead of just saying "hope that your
compiler is clever enough to recognize this idiom", in the
documentation. It would also make possible the sorts of small code
factorizations that happen all the time in assembly language, but which
single-value-return, unnested function call conventions in C make close
to impossible. Or different coding styles, like threaded interpreters,
reasonable without language extensions.


Interesting features. I'm not sure how different multiple-value returns
would be from values returned via reference parameters (pointers). It
sounds like a good idea.


The significant difference is that reference parameters (pointers) can't
be in registers. (Not to mention the inefficiency of repeatedly pushing
the reference onto the call stack...)

Say you have a few to half a dozen pieces of state in some algorithm, and
the algorithm operates through a pattern of "mutations" of that state,
such that some or all of the state changes as a result of each operation.
The only way to code that in C is either to write out the code that
comprises the element operations of each pattern long-hand, or use
preprocessor macros.

The most obvious concrete example of this sort of thing is the pattern
where you have one or more "cursors" into a data structure, and code that
walks through it, producing results at the same time. You want your
"codelets" to return both their result *and* change the cursor to point to
the next element in the list to be processed. In C, you can't have both
the result and the pointer in registers, but that's how you would code it
in assembly.
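
A small sketch (mine, for illustration) of the two C workarounds being
contrasted: the out-parameter style forces the cursor to be
addressable, while returning a small struct by value at least gives
the compiler the chance to keep both results in registers, though C
promises nothing of the sort.

/* Style 1: result in the return value, advanced cursor through a
   pointer parameter, so the cursor generally has to live in memory. */
static int next_via_pointer (const int **cursor) {
    int value = **cursor;
    (*cursor)++;
    return value;
}

/* Style 2: return the value and the advanced cursor together.  A
   two-word struct returned by value can often travel in a register
   pair on common ABIs, which is closer to the assembly-level idiom. */
struct step { int value; const int *next; };

static struct step next_via_struct (const int *cursor) {
    struct step s = { *cursor, cursor + 1 };
    return s;
}

A caller would then write something like: struct step r =
next_via_struct(p); use r.value; p = r.next;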

Cheers,

--
Andrew

Mar 22 '06 #267
"Rod Pemberton" <do*********@so rry.bitbucket.c mm> wrote in message
news:du******** ***@news3.infoa ve.net...
"Gerry Quinn" <ge****@DELETET HISindigo.ie> wrote in message
news:MP******** *************** @news1.eircom.n et...
In article <00************ *************** **@news.verizon .net>,
ra*********@FOO verizonBAR.net says...
> Does everything have to become a racism experiment?


Of course there are those who object to every figure in which the
adjective 'black' has negative connotations.


True. Mostly black people.


In my experience (which is more limited to recent years than many others'
here), it is typically do-gooder whites that are offended by words like
"black" or "oriental" or "Indian" (referring to the US domestic variety).

I recall an interview of Nelson Mandela by (I think) Dan Rather shortly
after the former's first election, and he was asked "How does it feel to be
the first African-American president of South Africa?" Mandela was
understandably confused, but the interviewer simply couldn't bring himself
to say the word "black". Mandela finally figured it out and answered, but
he had to come away from that thinking all Americans are complete dolts.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Mar 22 '06 #268
On Tue, 21 Mar 2006 19:02:55 -0600, Stephen Sprunk wrote:
If a system traps on a prefetch, it's fundamentally broken. However, a
system that traps when an invalid pointer is loaded is not broken, and the
AS/400 is the usual example. Annoying, but not broken.


And I still say that constraining C for everyone so that it could fit the
AS/400, rather than making C-on-AS/400 jump through a few more hoops to
match traditional C behaviour, was the wrong trade-off. I accept that
this may well be a minority view.

--
Andrew

Mar 22 '06 #269
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
On Tue, 21 Mar 2006 19:02:55 -0600, Stephen Sprunk wrote:
If a system traps on a prefetch, it's fundamentally broken. However, a
system that traps when an invalid pointer is loaded is not broken, and the
AS/400 is the usual example. Annoying, but not broken.


And I still say that constraining C for everyone so that it could fit the
AS/400, rather than making C-on-AS/400 jump through a few more hoops to
match traditional C behaviour, was the wrong trade-off. I accept that
this may well be a minority view.


It is. The C standard wouldn't just have to forbid an implementation
from trapping when it loads an invalid address; it would have to
define the behavior of any program that uses such an address. A
number of examples have been posted here where that could cause
serious problems for some implementations other than the AS/400.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 22 '06 #270
