Making Fatal Hidden Assumptions - Page 22

CBFalconer

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on foulling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent appearing code with these kinds of
assumptions built-in. However it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' linenumber references.

/* Making fatal hidden assumptions */
/* Paul Hsiehs version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
const char *p; /* 5 */
int d;

p = s - 1;
do {
p++; /* 10 */
if ((((int) p) & (SW - 1)) == 0) {
do {
d = *((int *) p);
p += SW;
} while (!hasNulByte (d)); /* 15 */
p -= SW;
}
} while (*p != 0);
return p - s;
} /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less that 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the systems strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-poweroftwo. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
todays systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type size_t or ssize_t,
but will always fit into a size_t). It then does a concealed cast
of this into an int, which could cause undefined or implementation
defined behaviour if the value exceeds what will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06

Subscribe Reply

351

13052

Michael Jørgensen

"Paul Keinanen" <ke******@sci.f i> wrote in message
news:8i******** *************** *********@4ax.c om...

If s is stored in 16 bit mode in ES:DX with DX=0, then p=s-1 would
need to decrement ES by one and store 000F in DX.

I don't think this is what happens. Instead I think DX will wrap to FFFF,
and ES will remain unchanged.

-Michael.

Mar 16 '06 #211

CBFalconer

Andrew Reilly wrote:

On Thu, 16 Mar 2006 05:33:30 -0500, CBFalconer wrote:
My mechanism is
hardly well thought out, it is only intended to trigger thoughts
from others. As I said, blue skying.

Thanks for the prompting.

Thinking about this made me realize that actually accessing condition
codes isn't very useful, as an end in itself. Sure, there can be
some (like half-carry, perhaps) that might be useful, but if you really
needed that, I think that some sort of in-line intrinsic could carry it
off. It's not as though it's portable, anyway.

No, the point about condition codes is that they're a means to an end, and
C already covers most of those ends with other constructs. Most of the
basic condition codes are just there to tell you if some result is zero,
or which way a comparison went (whether you consider the arguments signed
or unsigned). Ordinary control (if, while, etc) statements handle those
nicely.

I was thinking more about after the fact overflow detection, with
statements such as:

if ((a = b + c, ccd = _ccd) & OVFLOBIT) {
/* handle overflow, a value is meaningless */
}
else {
/* all is joy in Mudville */
}

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 16 '06 #212

CBFalconer

RSoIsCaIrLiIoA wrote:

.... snip ...
is there someone out there that read this?
am i plonked from all NGs?

You temporarily got off my list due to some changes in newsreader
software. At the first sign of more ridiculous obfuscating macros
you go right back on it. Meanwhile I am not likely to pay much
attention to anything you say.

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 16 '06 #213

CBFalconer

"Michael Jørgensen" wrote:

"Paul Keinanen" <ke******@sci.f i> wrote in message
If s is stored in 16 bit mode in ES:DX with DX=0, then p=s-1
would need to decrement ES by one and store 000F in DX.

I don't think this is what happens. Instead I think DX will wrap
to FFFF, and ES will remain unchanged.

More easily answered if the original statement had been preserved
in context. However in this case if the ES segment has range
limits, a segment fault will be triggered. Due to the happy
looseness of the term "undefined behavior" no specific action is
needed, which is what enables the loosing of nasal demons or
launching of missiles.

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 16 '06 #214

Keith Thompson

"Alf P. Steinbach" <al***@start.no > writes:

* Richard Bos -> someone:
*Shrug* If you go about redefining terms to support your delusion
that C
is a portable assembler, that's your problem, not that of C.
Happily I haven't followed this debate, so I don't know what you
folks, good or not, think "portable assembler" means, nor how you can
think a language is a compiler (or whatever).

C was designed as a portable assembly language, the most successful so
far, so if the term has any practical meaning, then C is that meaning.

No, it wasn't. Quite possibly the term has no practical meaning, but
even if it does, C isn't it. C is at a lower level than most
high-level languages, but there's a *much* wider semantic gap between
assembly language and C than between C and other HLLs.
C was designed as a portable assembly language for the purpose of
porting Unix (which then had some seven or ten installations, I
forget, but one rumor has it there was only one, namely the PDP-7)
more easily. That didn't fully remove the need for using actual
assembly. But only some small parts still needed to be
machine-specific assembly.

You can find more information about the C language at e.g. Wikipedia,
<url: http://en.wikipedia.or g/wiki/C_programming_l anguage>.

I skimmed that article; it doesn't support your argument. The nearest
thing in the article is:

It is quite possible to write C code at a low level of abstraction
analogous to assembly language; in fact C is sometimes referred to
(and not always pejoratively) as "high-level assembly" or
"portable assembly".

Calling it assembly doesn't make it assembly.

Show us a definition of the term "assembly language". By
"definition ", I mean something that unambiguously allows you to
determine whether something is an assembly language or not. If you do
so, I predict that either (1) your definition won't apply to C, or (2)
your definition won't match any common understanding of the term
"assembly language".

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 16 '06 #215

Al Balmer

On Thu, 16 Mar 2006 12:28:19 +0100, "Alf P. Steinbach"
<al***@start.no > wrote:

* Richard Bos -> someone:

*Shrug* If you go about redefining terms to support your delusion that C
is a portable assembler, that's your problem, not that of C.
Happily I haven't followed this debate, so I don't know what you folks,
good or not, think "portable assembler" means, nor how you can think a
language is a compiler (or whatever).

C was designed as a portable assembly language, the most successful so
far, so if the term has any practical meaning, then C is that meaning.

The reference you give doesn't say that. It says "in fact C is
sometimes referred to (and not always pejoratively) as "high-level
assembly" or "portable assembly"." When they say "sometimes" , they're
talking about you, not the rest of the world. Remember that the moon
is sometimes said to be made of green cheese.
C was designed as a portable assembly language for the purpose of
porting Unix (which then had some seven or ten installations, I forget,
but one rumor has it there was only one, namely the PDP-7) more easily.
In fact, from your reference, "but this was onerous since all the code
was in assembly language. They decided to use a higher-level portable
language so the OS could be ported easily from one computer to
another."

Note the use of the phrase "higher-level portable language."
That didn't fully remove the need for using actual assembly. But only
some small parts still needed to be machine-specific assembly.

You can find more information about the C language at e.g. Wikipedia,
<url: http://en.wikipedia.or g/wiki/C_programming_l anguage>.

There are better sources, including the original authors.

--
Al Balmer
Sun City, AZ

Mar 16 '06 #216

Alf P. Steinbach

* Keith Thompson:

"Alf P. Steinbach" <al***@start.no > writes:
C was designed as a portable assembly language, the most successful so
far, so if the term has any practical meaning, then C is that meaning.

Show us a definition of the term "assembly language". By
"definition ", I mean something that unambiguously allows you to
determine whether something is an assembly language or not. If you do
so, I predict that either (1) your definition won't apply to C, or (2)
your definition won't match any common understanding of the term
"assembly language".

Sorry, that's bullshit.

However, show me a definition of bullshit (by "definition ", I mean
something that unambiguously allows you to determine whether something
is bullshit or not) and I'll readily retract that statement. ;-)

Cheers,

- Alf

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Mar 16 '06 #217

Al Balmer

On Thu, 16 Mar 2006 12:34:01 +0100, RSoIsCaIrLiIoA <zz@zz.z> wrote:

is there someone out there that read this?
am i plonked from all NGs?

According to my newsreader, your first post was at 4:34 am, and your
second at 5:30 am. What were you expecting? Do you have this group
confused with a chat room?

--
Al Balmer
Sun City, AZ

Mar 16 '06 #218

Alf P. Steinbach

* Al Balmer:

On Thu, 16 Mar 2006 12:28:19 +0100, "Alf P. Steinbach"
<al***@start.no > wrote:
* Richard Bos -> someone:
*Shrug* If you go about redefining terms to support your delusion that C
is a portable assembler, that's your problem, not that of C.

Happily I haven't followed this debate, so I don't know what you folks,
good or not, think "portable assembler" means, nor how you can think a
language is a compiler (or whatever).

C was designed as a portable assembly language, the most successful so
far, so if the term has any practical meaning, then C is that meaning.

The reference you give doesn't say that.

I'm sorry for not being clear:

Wikipedia is not a reference to back up some statement, and in fact it's
not a reference of any sort (but let's not go off discussing /that/
again, just trust me: it isn't).

I gave the link as an opportunity for Richard, and now you, to learn
some very basic things, and that's why I wrote "e.g.", which means, "for
example". Also see the links provided elsethread.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Mar 16 '06 #219

RSoIsCaIrLiIoA

On Thu, 16 Mar 2006 10:04 -0700, Al Balmer <al******@att.n et> wrote:

On Thu, 16 Mar 2006 12:34:01 +0100, RSoIsCaIrLiIoA <zz@zz.z> wrote:

is there someone out there that read this?
am i plonked from all NGs?

According to my newsreader, your first post was at 4:34 am, and your
second at 5:30 am. What were you expecting? Do you have this group
confused with a chat room?

pheraps in your time, my second post seems to me 12:34am, here in
Italy; i think you are in US. near Boston?

Mar 16 '06 #220