
Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and PDF format by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent-appearing code with these kinds of
assumptions built in. However, it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.
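One mechanical way to make such annotations enforceable is a
compile-time check. Here is a minimal sketch using the old
negative-array-size idiom (the particular assumptions checked are
merely examples of the kind discussed below):

   #include <limits.h>

   /* Sketch only: each typedef declares an array type whose size is
      negative when the stated assumption is false, so a port to a
      machine that violates it fails to compile instead of silently
      misbehaving.  (C11 would use _Static_assert instead.) */
   typedef char assume_int_is_4_bytes [sizeof(int) == 4 ? 1 : -1];
   typedef char assume_char_is_8_bits [CHAR_BIT == 8 ? 1 : -1];
   typedef char assume_int_size_pow2  [(sizeof(int) & (sizeof(int) - 1)) == 0 ? 1 : -1];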

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' line-number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
   const char *p;                          /* 5 */
   int d;

   p = s - 1;
   do {
      p++;                                 /* 10 */
      if ((((int) p) & (SW - 1)) == 0) {
         do {
            d = *((int *) p);
            p += SW;
         } while (!hasNulByte (d));        /* 15 */
         p -= SW;
      }
   } while (*p != 0);
   return p - s;
}                                          /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less than 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.
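As a quick illustration of why it works, here is my own small test
harness (not Hsieh's code), using a C99 fixed-width type so the
arithmetic is unambiguous:

   #include <stdio.h>
   #include <stdint.h>

   /* Same expression as hasNulByte, applied to an explicitly 32-bit
      unsigned value.  Subtracting 0x01010101 sets the top bit of every
      byte that was 0x00; "& ~x" discards bytes whose top bit was
      already set in x; and a borrow can only propagate upward past a
      zero byte, so the whole expression is nonzero exactly when x
      contains a zero byte. */
   static uint32_t has_nul_byte32 (uint32_t x)
   {
      return (x - 0x01010101u) & ~x & 0x80808080u;
   }

   int main (void)
   {
      printf("%lx\n", (unsigned long) has_nul_byte32(0x41424344u)); /* "ABCD": 0      */
      printf("%lx\n", (unsigned long) has_nul_byte32(0x41420044u)); /* nul byte: nonzero */
      return 0;
   }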

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
simply sizeof(int). No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
define the behaviour of forming a pointer below the start of an
object, even if it is never dereferenced. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.
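A minimal sketch of one such rearrangement (mine, not Hsieh's): start
p at s and move the terminator test ahead of the increment, so the
out-of-range pointer is never formed. Every other assumption in the
routine is left exactly as it was.

   /* Sketch: same SW and hasNulByte macros as above; only the
      p = s - 1 idiom is removed.  The pointer-to-int cast, the
      4-byte int, the 8-bit char and the trap-free ints are all
      still assumed. */
   int xstrlen2 (const char *s) {
      const char *p = s;            /* never points below s */
      int d;

      for (;;) {
         if ((((int) p) & (SW - 1)) == 0) {
            do {
               d = *((int *) p);
               p += SW;
            } while (!hasNulByte (d));
            p -= SW;
         }
         if (*p == 0)
            return p - s;
         p++;
      }
   }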

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer-like information embedded, such as a segment
id. If sizeof(int) is less than 4, the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely true on
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.
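If one is willing to assume C99, uintptr_t at least makes the
pointer-to-integer conversion well defined where the type exists;
the "low bits are zero when aligned" part, however, remains an
assumption about the implementation. A sketch (not part of the
original code):

   #include <stdint.h>   /* uintptr_t is optional in C99, but widespread */

   /* Assumption still present: an address suitably aligned for int has
      its low-order bits clear after conversion.  The standard does not
      promise this; virtually every flat-address-space machine does. */
   #define IS_INT_ALIGNED(p) \
       ((((uintptr_t)(const void *)(p)) & (sizeof(int) - 1)) == 0)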

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid here, and of type ptrdiff_t, whose
value will fit into a size_t). It then performs a concealed conversion
of this into an int, which causes implementation-defined behaviour (or
raises an implementation-defined signal) if the value exceeds what
will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.
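For comparison, a plain byte-at-a-time version with the standard
signature gives up the word trick but none of the portability (a
sketch; the name is mine):

   #include <stddef.h>

   size_t ystrlen (const char *s) {
      const char *p = s;

      while (*p != '\0')
         p++;
      return (size_t) (p - s);   /* the difference itself is a ptrdiff_t */
   }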

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06
On 2006-03-12, Allan Herriman <al***********@hotmail.com> wrote:
On 12 Mar 2006 12:43:56 +1000, Mark L Pappin <ml*@acm.org> wrote:
Jordan Abel <ra*******@gmail.com> writes:
On 2006-03-11, Mark L Pappin <ml*@acm.org> wrote:
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
> On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote:
>> I spent 25 years writing assembler.

> Yeah, me to. Still do, regularly, on processors that will never
> have a C compiler.

It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?

(A few I can recall having been proposed are: tiny amounts of storage,
Harvard architecture, and lack of programmer-accessible stack.
Funnily enough, these are characteristics possessed by chips for which
I compile C code every day.)

"tiny amounts of storage" may preclude a conforming hosted
implementation [which must support an object of 65535 bytes, and, of
course, a decent-sized library]
Got me there. Freestanding only, with maybe 16 bytes of RAM and 256
words of ROM - no 'malloc()' here, and 'printf()' can be problematic.

A freestanding C implementation does, however, still have a C
compiler. I'm curious which processors Andrew Reilly claims "will
never have a C compiler", and why he makes that claim.


I assume Dr. Reilly's referring to various DSP devices. These often
have features such as saturating arithmetic


Overflow's undefined, so what's wrong here?
and bit-reversed addressing.
I don't know what that is, so I have no idea how it would affect the
ability for there to be a conforming implementation.

Regards,
Allan

Mar 12 '06 #141

Mark L Pappin wrote:
A freestanding C implementation does, however, still have a C
compiler. I'm curious which processors Andrew Reilly claims "will
never have a C compiler", and why he makes that claim.


I don't think anyone will ever write a C Compiler for the
Atmel MARC4 [ http://www.atmel.com/products/MARC4/ ].

--
Guy Macon
<http://www.guymacon.com/>

Mar 12 '06 #142
On Fri, 10 Mar 2006 11:23:37 +0200, Paul Keinanen wrote:
On 9 Mar 2006 12:34:20 -0800, "Ed Prochak" <ed*******@gmai l.com>
wrote:
Andrew Reilly wrote:

Yeah, me too. Still do, regularly, on processors that will never have a C
compiler. C is as close to a universal assembler as we've got at the
moment. It doesn't stick its neck out too far, although a more
deliberately designed universal assembler would be a really good thing.
(It's on my list of things to do...)


Would a better universal assembler be more like assembler or more like
high level languages? I really think C hit very close to the optimal
balance.


I don't see much room for a "universal" assembler between C and a
traditional assembler, since the instruction sets can vary quite a
lot.


I think that a useful "universal assembler" would be something that had
the basic set of operators and types, all of which were well defined for a
particular machine model (flat data memory map, 2's complement arithmetic,
etc.) It could have expressions, as long as the operator precedence was
rigorous enough so that you could absolutely know what the order of
evaluation would be, at coding time.

The two or three most painful things about assembly language programming
are register allocation and making up control-flow symbol names (in
assemblers that don't already have nice structured control-flow
macros/pseudo-ops). Both of these can be included in a "universal
assembler", if you forgo some pure control for convenience: conventional
control structures, subroutine calls that follow common conventions. The
machine instruction sets of Java's JVM and C#'s CLR (?) avoid the register
name issue by being stack-based (and muck up the memory model by being
object-centric). Tao's VM is more nearly a plain 32-bit RISC model, but
with an infinite number of registers, which are managed by the "assembler".
(The third painful thing is instruction scheduling, in super-scalar or
VLIW machines of various sorts. That would probably want to be subsumed
by the language "compiler" too.)

A data model, a set of operators, control flow, a syntax for building
abstractions and domain-specific sub-languages. That could almost be C
right there, except that there are too many holes in the data model and
operator function, both to support old/strange hardware, and to
allow/support compiler optimization transformations. Java has tightened
up the model, but it's not a model of a "bare processor", it's a model of
an "object machine". I'd like the same kind of low-level language
definition, but with objects only built using the language's
meta-programming/macro features, rather than being the only way to do
things.

Just dreaming...

Cheers,

--
Andrew

Mar 12 '06 #143
On Sun, 12 Mar 2006 00:34:53 +1000, Mark L Pappin wrote:
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
On Wed, 08 Mar 2006 18:07:45 -0700, Al Balmer wrote:
I spent 25 years writing assembler.
Yeah, me too. Still do, regularly, on processors that will never
have a C compiler.


It's a little OT in c.l.c, but would you mind telling us just what
processors those are, that you can make such a guarantee? What
characteristics do they have that means they'll never have a C
compiler?


I'm thinking mainly of deeply embedded DSP processors, like those of the
TI TAS3000 family, or Analog Devices Sigma DSPs, or any of several
similar-scale engines from several Japanese manufacturers.

Small memory, sure. Strange word lengths (not really that much of a
problem for C, admittedly). Some of these things don't have pointers in
the usual sense, let alone subroutine call stacks. Their arithmetic
usually doesn't match C's (integer only, usually with saturation on
overflow, frequently with different word lengths for data, coefficient, and
result).
(A few I can recall having been proposed are: tiny amounts of storage,
Harvard architecture, and lack of programmer-accessible stack. Funnily
enough, these are characteristics possessed by chips for which I compile
C code every day.)


Apart from the reasons that I mentioned, the biggest one is simply utility
and man-power. No-one is building C compilers for these things because
no-one could or would use one if it existed: the hardware is tuned to do a
particular class of (fairly simple) thing, and that's easy enough to code
up in assembler. Easier than figuring out how to write a C compiler for
it, anyway.

Cheers,

--
Andrew

Mar 12 '06 #144
Jordan Abel wrote:
and, of course, a decent-sized library]


Off topic? Yes. But it bothers me when we confuse the
language with the supporting libraries.
--
Michael N. Moran (h) 770 516 7918
5009 Old Field Ct. (c) 678 521 5460
Kennesaw, GA, USA 30144 http://mnmoran.org

"So often times it happens, that we live our lives in chains
and we never even know we have the key."
The Eagles, "Already Gone"

The Beatles were wrong: 1 & 1 & 1 is 1

Mar 12 '06 #145

Dik T. Winter wrote:
In article <44***************@yahoo.com> cb********@maineline.net writes:
...
> The reason to use a subtractor is that that guarantees than -0
> never appears in the results. This allows using that value for
> such things as traps, uninitialized, etc.


This was, however, not done on any of the 1's complement machines I have
worked with. The +0-preferent machines (CDC) just did not generate it
in general...


The CDC 6400 and 6600 used "complement recomplement arithmetic."
The only way to get -0 as the result of integer arithmetic was to start
with -0, i.e. (-0)+(-0) = -0 and (-0)-(+0) = -0.

The normal way to copy a B-register was to write, e.g. SB6, B5
(IIRC) which generated the same machine opcode as SB6, B5+B0.
(B0 was an always-zero register.) Hence SB6, B5 was not guaranteed
to copy B5 exactly! Instead SB6, B5-B0 should be coded.

Since there were fast tests for negative and zero, using both +0 and -0
as flags for testing was a micro-optimization sometimes useful for
speed.

James Dow Allen

Mar 12 '06 #146
On 2006-03-09, James Dow Allen <jd*********@yahoo.com> wrote:
David Holland wrote:
On 2006-03-07, James Dow Allen <jd*********@yahoo.com> wrote:
[...] but I'm sincerely curious whether anyone knows of an *actual*
environment where p == s will ever be false after (p = s-1; p++).


The problem is that evaluating s-1 might cause an underflow and a
trap, and then you won't even reach the comparison. You don't
necessarily have to dereference an invalid pointer to get a trap.

You might hit this behavior on any segmented architecture (e.g.,
80286, or 80386+ with segments on) ...


I'm certainly no x86 expert. Can you show or point to the output
of any C compiler which causes an "underflow trap" in this case?


Have you tried bounds-checking gcc?

I don't think I've ever myself seen a compiler that targeted 286
protected mode. Maybe some of the early DOS-extender compilers did,
before everyone switched to 386+. If you can find one and set it to
generate code for some kind of "huge" memory model (supporting
individual objects more than 64K in size) I'd expect it to trap if you
picked a suitable location for `s' to point to.

That assumes you can find a 286 to run it on, too.

Otherwise, I don't know of any, but I'm hardly an expert on strange
platforms.

(Note: Coherent was a 286 protected mode platform, but it only
supported the "small" memory model... and it had a K&R-only compiler,
so it's not a viable example.)

--
- David A. Holland
(the above address works if unscrambled but isn't checked often)
Mar 13 '06 #147
"Michael N. Moran" <mi**@mnmoran.o rg> wrote:
Jordan Abel wrote:
and, of course, a decent-sized library]


Off topic? Yes. But it bothers me when we confuse the
language with the supporting libraries.


As long as we're talking about C, they are part of the same Standard.
You can get a freestanding implementation which is allowed not to
implement much of the Standard, but that doesn't make those parts any
less C.

Richard
Mar 13 '06 #148
Jordan Abel <ra*******@gmail.com> wrote:
On 2006-03-12, Allan Herriman <al***********@hotmail.com> wrote:
On 12 Mar 2006 12:43:56 +1000, Mark L Pappin <ml*@acm.org> wrote:
A freestanding C implementation does, however, still have a C
compiler. I'm curious which processors Andrew Reilly claims "will
never have a C compiler", and why he makes that claim.


I assume Dr. Reilly's referring to various DSP devices. These often
have features such as saturating arithmetic


Overflow's undefined, so what's wrong here?


If the saturation also occurs for unsigned integers, you're going to
have a pain of a time implementing C's wraparound-on-unsigned-overflow
behaviour.
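For reference, the behaviour being alluded to (a minimal illustration
of my own, not from the thread):

   #include <limits.h>
   #include <stdio.h>

   int main (void)
   {
      unsigned int u = UINT_MAX;

      /* Unsigned arithmetic is defined to wrap modulo UINT_MAX + 1,
         so this must print 4; hardware that saturates at UINT_MAX
         needs extra generated code to get that result. */
      u = u + 5u;
      printf("%u\n", u);
      return 0;
   }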

Richard
Mar 13 '06 #149
On 12 Mar 2006 07:14:52 GMT, Jordan Abel <ra*******@gmail.com> wrote:
On 2006-03-12, Allan Herriman <al***********@hotmail.com> wrote:

[snippage]
I assume Dr. Reilly's referring to various DSP devices. These often
have features such as saturating arithmetic


Overflow's undefined, so what's wrong here?
and bit-reversed addressing.


I don't know what that is, so I have no idea how it would affect the
ability for there to be a conforming implementation.


http://www.google.com.au/search?q=bi...sed+addressing
(First hit)

Bit reversed addressing is just another addressing mode that's simple
to access from assembly language. Do you think that this could be
generated by a C compiler?

Bit reversed addressing is used in the calculation of an FFT (and
almost nowhere else).
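For anyone who hasn't met it: the permutation the hardware provides is
the one a portable FFT has to compute by hand, something like this
sketch (mine, not tied to any particular DSP):

   /* Reverse the low 'bits' bits of index i -- the reordering an FFT
      applies to its input or output.  A DSP with bit-reversed
      addressing does this in the address generator as it steps
      through a buffer, which is why plain C has no natural way to
      express it. */
   static unsigned bit_reverse (unsigned i, unsigned bits)
   {
      unsigned r = 0;

      while (bits-- > 0) {
         r = (r << 1) | (i & 1u);
         i >>= 1;
      }
      return r;
   }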

Regards,
Allan
Mar 13 '06 #150
