
Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However, similar things can
(and do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and PDF format by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up to date.

We will always have innocent-appearing code with these kinds of
assumptions built in. However, it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' line-number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
   const char *p;                       /* 5 */
   int d;

   p = s - 1;
   do {
      p++;                              /* 10 */
      if ((((int) p) & (SW - 1)) == 0) {
         do {
            d = *((int *) p);
            p += SW;
         } while (!hasNulByte (d));     /* 15 */
         p -= SW;
      }
   } while (*p != 0);
   return p - s;
}                                       /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less than 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.
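
To make the trick concrete: subtracting 0x01010101 borrows through
bit 7 of any byte that was zero, & ~x discards bytes whose own top
bit was already set, and the final & 0x80808080 keeps just the
resulting flag bits. Here is a small test harness (mine, not from
the cited page), assuming the very preconditions under discussion:
a 32-bit int, CHAR_BIT == 8, and two's complement arithmetic.

#include <stdio.h>

/* assumes the hasNulByte macro from the listing above */
int main(void)
{
   printf("%d\n", hasNulByte(0x41424344) != 0); /* bytes 'ABCD': prints 0 */
   printf("%d\n", hasNulByte(0x41420044) != 0); /* a zero byte:  prints 1 */
   return 0;
}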

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int). No hidden assumptions have been made here,
but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.
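
For reference, the declaration in <string.h> is:

   size_t strlen(const char *s);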

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.
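
A minimal sketch of such a rearrangement (mine, not from the cited
page): start p at s, walk the unaligned prefix byte by byte, and
return a size_t. Assumptions 2 through 5, 7 and 8 remain; only the
invalid s - 1 and the int return type go away. uintptr_t (C99)
stands in for the bare (int) cast, though the meaning of the
converted value is still implementation-defined, and like the
original this may read past the terminator within an aligned word.

#include <stddef.h>
#include <stdint.h>

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080) /* as above */

size_t xstrlen2 (const char *s) {
   const char *p = s;

   /* advance byte by byte until p is int-aligned (still assumes
      alignment shows up as zeroed low bits) */
   while (((uintptr_t) p & (sizeof (int) - 1)) != 0) {
      if (*p == 0)
         return (size_t) (p - s);
      p++;
   }
   /* scan a word at a time until some byte in the word is zero */
   while (!hasNulByte (*(const int *) p))
      p += sizeof (int);
   /* locate the exact nul within that final word */
   while (*p != 0)
      p++;
   return (size_t) (p - s);
}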

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer-like information embedded, such as a
segment id. If sizeof(int) is less than 4, the validity of this is
even less likely.

Then we come to the purpose of the statement, which is to discover
whether the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of the signed type
ptrdiff_t; here it is never negative, so it will always fit into a
size_t). It then does a concealed cast of this into an int, which
could cause undefined or implementation-defined behaviour if the
value exceeds what will fit into an int. This one is also
unnecessary, since it is trivial to define the return type as
size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06
Mark L Pappin wrote:
A freestanding C implementation does, however, still have a C
compiler. I'm curious which processors Andrew Reilly claims "will
never have a C compiler", and why he makes that claim.


Motorola 4500 - though some bugger will probably write one just to
prove me wrong.

Paul Burke
Mar 13 '06 #151
On Mon, 13 Mar 2006 20:41:37 +1100, Allan Herriman wrote:
On 12 Mar 2006 07:14:52 GMT, Jordan Abel <ra*******@gmail.com> wrote:
On 2006-03-12, Allan Herriman <al***********@hotmail.com> wrote:

[snippage]
I assume Dr. Reilly's referring to various DSP devices. These often
have features such as saturating arithmetic


Overflow's undefined, so what's wrong here?
and bit-reversed addressing.


I don't know what that is, so I have no idea how it would affect
the ability for there to be a conforming implementation.


http://www.google.com.au/search?q=bi...sed+addressing
(First hit)

Bit reversed addressing is just another addressing mode that's simple
to access from assembly language. Do you think that this could be
generated by a C compiler?

Bit reversed addressing is used in the calculation of an FFT (and
almost nowhere else).


To be fair, though, I suspect that bit-reversed addressing is a bit
over-rated within the DSP community. If I were designing a new DSP
processor, I'd be very tempted to leave it out, unless the instruction
space, die space and cycle-time impact it introduced were completely
negligible. On many ordinary processors you can perform a bit-reverse
re-ordering with ordinary instructions (see below), recursive counting
code, or a lookup table (for smallish FFT sizes) at about 10% of the
cost of performing the FFT itself, probably less for larger sizes.
FFTW manages to hold many performance benchmark crowns while producing
in-order results on conventional processors. Besides which, not all
FFT algorithms produce results in bit-reversed order anyway.

As an example of the sort of algorithmic weirdness that is sometimes
put into hardware, with no good, let alone convenient, way to express
it in C, it's pretty good.

For the non-DSP-inclined, here's a simple expression of a bit-reverse
increment operation, in C:

unsigned int
bitrev_inc(unsigned int i, unsigned int N)
{
   return (N & i) ? bitrev_inc(i ^ N, N >> 1) : i ^ N;
}

That one needs to be called with N = bins/2 where bins is the size of the
FFT, a power of 2. It's not especially efficient, but it is at least a
single pure function, and GCC does a good job on the tail recursion. An
iteration over an array could use i = bitrev_inc(i, bins/2) as an index
increment operation. It could be coded iteratively as a loop around
while(N & i), but that seems to be even more of a stretch for a compiler
to recognize as simply the invocation of an addressing mode.
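
As a usage illustration (mine; 'out' and 'process' are hypothetical
stand-ins), visiting a bit-reverse-ordered FFT output array in
natural frequency order might look like this, with bins a power of 2:

void visit_in_order(const double *out, unsigned int bins,
                    void (*process)(double))
{
   unsigned int i = 0, n;

   for (n = 0; n < bins; n++) {
      process(out[i]);              /* consume bin n */
      i = bitrev_inc(i, bins / 2);  /* next natural-order index */
   }
}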

Cheers,

--
Andrew

Mar 13 '06 #152
In article <11**********************@z34g2000cwc.googlegroups.com> "Ed Prochak" <ed*******@gmail.com> writes:
....
I would agree that if an assembler must be a one-to-one mapping from
source line to opcode, then C doesn't fit. I just don't agree with that
definition of assembler.


On the other hand, for every machine instruction there should be a
construct in the assembler to get that instruction. With that in
mind, C doesn't fit either.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 13 '06 #153
[ F'ups set to c.l.c. - please reset if other groups are interested too. ]

In comp.lang.c Keith Thompson <ks***@mib.org> wrote:
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
On Thu, 09 Mar 2006 00:00:34 +0000, Christian Bau wrote:
Question: If the C Standard guarantees that for any array a, &a [-1]
should be valid, should it also guarantee that &a [-1] != NULL
Probably, since NULL has been given the guarantee that it's unique in some
sense. In an embedded environment, or assembly language, the construct

....
How exactly do you get from NULL (more precisely, a null pointer
value) being "unique in some sense" to a guarantee that &a[-1], which
doesn't point to any object, is unequal to NULL?

The standard guarantees that a null pointer "is guaranteed to compare
unequal to a pointer to any object or function". &a[-1] is not a
pointer to any object or function, so the standard doesn't guarantee
that &a[-1] != NULL.

....

By the same logic, ptr to just past the end of an array (which, of course,
does not point to an object), _can_ compare equal to null ptr? ;-)
- Of course not, semantics for "==" guarantee that both ptrs must be
null (among others) to compare equal.

AFAIK the one-past-the-end ptrs were introduced to enable certain
idioms, and as such are not a necessity in the language:
while (*dest++ = *src++) ; /* strcpy-like */
IMHO the language could also use the concept of one-before-the-start
pointers for similar purposes, but it doesn't for practical (and good)
reasons. Here's what C99 Rationale says (6.5.6):

: In the case of p-1, on the other hand, an entire object would have
: to be allocated prior to the array of objects that p traverses,
: so decrement loops that run off the bottom of an array can fail.
: This restriction allows segmented architectures, for instance, to
: place objects at the start of a range of addressable memory.
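
To illustrate the hazard the Rationale describes (my example, not
from the Rationale): the first loop below finishes by computing a
pointer one before the start of the array, which is undefined; the
second stays within [a, a+n], using the blessed one-past-the-end
pointer instead.

#include <stddef.h>

void clear_bad(int *a, size_t n)
{
   int *p;
   for (p = a + n - 1; p >= a; p--)  /* undefined: both the final p--
                                        and the comparison use a - 1 */
      *p = 0;
}

void clear_ok(int *a, size_t n)
{
   int *p = a + n;                   /* one-past-the-end is valid */
   while (p != a)
      *--p = 0;
}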

--
Stan Tobias
mailx `echo si***@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`
Mar 13 '06 #154
In article <sl**********************@random.yi.org>, ra*******@gmail.com
says...
On 2006-03-11, CBFalconer <cb********@yahoo.com> wrote:
Jordan Abel wrote:
On 2006-03-11, Mark L Pappin <ml*@acm.org> wrote:
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
> On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote:

>> I spent 25 years writing assembler.

> Yeah, me too. Still do, regularly, on processors that will never
> have a C compiler.

It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?

(A few I can recall having been proposed are: tiny amounts of
storage, Harvard architecture, and lack of programmer-accessible
stack. Funnily enough, these are characteristics possessed by
chips for which I compile C code every day.)

"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and,
of course, a decent-sized library]


These machines may well have C compilers, just not conforming
ones. The areas of non-conformance are likely to be:

object size available
floating point arithmetic
recursion depth (1 meaning no recursion)

availability of long and long-long.

Or even the range of int itself. [People have claimed that "c"
implementations exist with 8-bit int]


I don't fully understand what you are saying here.
Please elaborate.
At first I thought maybe you meant that you can find no documented
case of an implementation that has an 8-bit int; then I thought, well
MAYBE, he means that it ain't 'C' if it has an 8-bit int...

Jim
Mar 13 '06 #155
In article <Iw********@cwi.nl>, Dik T. Winter <Di********@cwi.nl> wrote:
In article <11**********************@z34g2000cwc.googlegroups.com> "Ed Prochak" <ed*******@gmail.com> writes:
I would agree that if an assembler must be a one-to-one mapping from
source line to opcode, then C doesn't fit. I just don't agree with that
definition of assembler.

On the other hand for every machine instruction there should be an
construct in the assembler to get that instruction. With that in
mind C doesn't fit either.


Over the years, there have been notable cases of "hidden" machine
instructions -- undocumented instructions, quite possibly with no
assembler construct (at least not in any publicly available
assembler).
--
Programming is what happens while you're busy making other plans.
Mar 13 '06 #156

Michael Wojcik wrote:
In article <11**********************@z34g2000cwc.googlegroups.com>, "Ed Prochak" <ed*******@gmail.com> writes:
Keith Thompson wrote:
Most assembly languages are, of
course, machine-specific, since they directly specify the actual
instructions.


x++; in most machines can map to a single instruction


Sure, if "most machines" excludes load/store architectures, and
machines which cannot operate directly on an object of the size of
whatever x happens to be, and all the cases where "x" is a pointer to
an object of a size other than the machine's addressing granularity...

I suppose you could argue that "can" in your claim is intended to be
weak - that, for "most machines" (with a conforming C implementation,
presumably), there exists at least one C program containing the
statement "x++;", and a conforming C implementation which will
translate that statement to a single machine instruction.

But that's a very small claim. All machines "can" map that statement
to multiple instructions as well; many "can" map it to zero
instructions in that sense (taking advantage of auto-increment modes
or the like). What can happen says very little about what will.

The presence in C of syntactic sugar for certain simple operations
like "x++" doesn't support the claim that C is somehow akin to
assembler in any case. One distinguishing feature of assembler is
a *lack* of syntactic sugar. (Macros aren't a counterexample
because they're purely lexical constructs; in principle they're
completely separate from code generation.)

C isn't assembler because:

- It doesn't impose a strict mapping between (preprocessed) source
and generated code. The "as if" clause allows the implementation
to have the generated code differ significantly from a strict
interpretation of the source acting on the virtual machine.

- It has generalized constructs (expressions) which can result in
the implementation generating arbitrarily complex code.

--
Michael Wojcik mi************@microfocus.com

Any average educated person can turn out competent verse. -- W. H. Auden


C is an assembler because

-- It doesn't impose strict data type checking, especially between
integers and pointers.
(While there has been some discussion about cases where conversions
back and forth between them can fail, for most machines it works. Good
thing, too, or some OSes would be written in some other language.)

-- datatype sizes are dependent on the underlying hardware. While a lot
of modern hardware has formed around the common 8-bit char, and
multiples of 16 bits for int types (and recent C standards have started
to impose these as requirements), C still supports machines that used
9-bit chars and 18-bit and 36-bit integers. This was the most
frustrating thing for me when I first learned C. It forces precisely
some of the hidden assumptions of this topic.

-- C allows for easy "compilation" in that you could do it in one pass
of the source code (well, two, counting the preprocessor run). The
original C compiler was written in C, so that bootstrapping onto a new
machine required only a simple, easily written initial compiler to
compile the real compiler.

-- original versions of the C compiler did not have passes like
data-flow optimizers, so optimization was left to the programmer. Hence
things like x++ and register storage became part of the language.
Perhaps they are not needed now, but dropping these features from the
language would nearly make it a different language. I do not know of
any other HLL that has register, but just about every assembler allows
access to the registers under programmer control.

So IMHO, C is a nice generic assembler. It fits nicely in the narrow
world between hardware and applications. The fact that it is a decent
application development language is a bonus. I like C, I use it often.
Just realize it is a HLL with an assembler side too.

Ed

Mar 13 '06 #157
"S.Tobias" <si***@FamOuS.B edBuG.pAlS.INVA LID> writes:
[ F'ups set to c.l.c. - please reset if other groups are interested too. ] [ Followup obeyed. ] In comp.lang.c Keith Thompson <ks***@mib.or g> wrote:
Andrew Reilly <an************ *@areilly.bpc-users.org> writes:
On Thu, 09 Mar 2006 00:00:34 +0000, Christian Bau wrote:
Question: If the C Standard guarantees that for any array a, &a [-1]
should be valid, should it also guarantee that &a [-1] != NULL

Probably, since NULL has been given the guarantee that it's unique in some
sense. In an embedded environment, or assembly language, the construct

...

How exactly do you get from NULL (more precisely, a null pointer
value) being "unique in some sense" to a guarantee that &a[-1], which
doesn't point to any object, is unequal to NULL?

The standard guarantees that a null pointer "is guaranteed to compare
unequal to a pointer to any object or function". &a[-1] is not a
pointer to any object or function, so the standard doesn't guarantee
that &a[-1] != NULL.

...

By the same logic, ptr to just past the end of an array (which, of course,
does not point to an object), _can_ compare equal to null ptr? ;-)
- Of course not, semantics for "==" guarantee that both ptrs must be
null (among others) to compare equal.


Interesting. I think you may have found a small flaw in the standard.

I'm reasonably sure that it's *intended* that a one-past-the-end
pointer is unequal to NULL, but I don't think the standard quite
manages to say so.

Consider an implementation in which pointers look like unsigned 32-bit
integers, and a null pointer is represented as 0xffffffff. Suppose a
15-byte object (char foo[15]) happens to be allocated at 0xfffffff0.
Then a pointer just past the end of foo (which is not a pointer to any
object) would have the value 0xffffffff, and would happen to be equal
to NULL.

I don't see anything in this scenario that violates the standard.
Probably the definition of "null pointer" should be tightened up a
little, so that a null pointer compares unequal to a pointer to any
object or function, or just past the end of any object.
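
As an illustration only (no real implementation is claimed to behave
this way), the comparison in question is perfectly legal C:

#include <stdio.h>

int main(void)
{
   char foo[15];
   char *end = foo + 15;   /* a valid one-past-the-end pointer */

   /* Under the hypothetical representation above, this could be
      true without any obvious violation of the standard's text. */
   if (end == NULL)
      puts("one-past-the-end compared equal to NULL");
   return 0;
}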

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 13 '06 #158
"Ed Prochak" <ed*******@gmai l.com> writes:
[...]
C is an assembler because

-- It doesn't impose strict data type checking, especially between
integers and pointers.
(While there has been some discussion about cases where conversions
back and forth between them can fail, for most machines it works. Good
thing, too, or some OSes would be written in some other language.)
Incorrect. Attempting to assign an integer value to a pointer object,
or vice versa, is a constraint violation, requiring a diagnostic.
Integers and pointers can be converted back and forth only using a cast
(an explicit conversion operator). The result of such a conversion is
implementation-defined.

Even if this were correct, it certainly wouldn't make C an assembler.
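
A minimal sketch of the distinction (mine):

void demo(void)
{
   int n = 42;
   int *p;

   /* p = n; */      /* constraint violation: a diagnostic is required */
   p = (int *) n;    /* legal with a cast; result implementation-defined */
   n = (int) p;      /* likewise, and the value may not round-trip */
   (void) n;
   (void) p;
}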
-- datatype sizes are dependent on the underlying hardware. While a lot
of modern hardware has formed around the common 8-bit char, and
multiples of 16 bits for int types (and recent C standards have started
to impose these as requirements), C still supports machines that used
9-bit chars and 18-bit and 36-bit integers. This was the most
frustrating thing for me when I first learned C. It forces precisely
some of the hidden assumptions of this topic.
I don't know what "recent C standards" you're referring to. C
requires CHAR_BIT to be at least 8; it can be larger. short and int
must be at least 16 bits, and long must be at least 32 bits. A
conforming implementation, even with the most current standard, could
have 9-bit char, 18-bit short, 36-bit int, and 72-bit long.

But this is a common feature of many high-level languages. Ada, for
example, has an implementation-defined set of integer types, similar to
what C provides; I've never heard anyone claim that Ada is an
assembler.
-- C allows for easy "compilation" in that you could do it in one pass
of the source code (well, two, counting the preprocessor run). The
original C compiler was written in C, so that bootstrapping onto a new
machine required only a simple, easily written initial compiler to
compile the real compiler.
You're talking about an implementation, not the language.
-- original versions of the C compiler did not have passes like
data-flow optimizers, so optimization was left to the programmer. Hence
things like x++ and register storage became part of the language.
Perhaps they are not needed now, but dropping these features from the
language would nearly make it a different language. I do not know of
any other HLL that has register, but just about every assembler allows
access to the registers under programmer control.
Again, you're talking about an implementation, not the language.

C doesn't allow access to specific registers (at least not portably).

Here's what the C standard says about the "register" specifier:

A declaration of an identifier for an object with storage-class
specifier register suggests that access to the object be as fast
as possible. The extent to which such suggestions are effective is
implementation-defined.

And there are a few restrictions; for example, you can't take the
address of a register-qualified object.
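
For instance, in a sketch of my own:

void f(void)
{
   register int r = 0;

   r++;               /* fine: ordinary use of the object */
   /* int *p = &r; */ /* constraint violation: & may not be applied
                         to an object declared register */
   (void) r;
}
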
So IMHO, C is a nice generic assembler. It fits nicely in the narrow
world between hardware and applications. The fact that it is a decent
application development language is a bonus. I like C, I use it often.
Just realize it is a HLL with an assembler side too.


You've given a few examples that purport to demonstrate that C is an
assembler.

Try giving a definition of the word "assembler". If the definition
applies to C (the language, not any particular implementation), I'd
say it's likely to be a poor definition, but I'm willing to be
surprised.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 13 '06 #159
On Mon, 13 Mar 2006 15:31:35 +0000, Dik T. Winter wrote:
In article <11**********************@z34g2000cwc.googlegroups.com> "Ed Prochak" <ed*******@gmail.com> writes:
...
> I would agree that if an assembler must be a one-to-one mapping from
> source line to opcode, then C doesn't fit. I just don't agree with that
> definition of assembler.


On the other hand, for every machine instruction there should be a
construct in the assembler to get that instruction. With that in
mind, C doesn't fit either.


Well, since we're talking about a "universal assembler" (a reasonably
commonly used term), that's obviously something different from the usual
machine-specific assembler, which does indeed usually have that property.
(Although I've met assemblers where the only way to get certain
instructions was to insert the data for the op-code in-line. Instruction
coverage sometimes lags behind features added to actual implementations.)

--
Andrew

Mar 13 '06 #160
