Making Fatal Hidden Assumptions - Page 20

CBFalconer

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on foulling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent appearing code with these kinds of
assumptions built-in. However it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' linenumber references.

/* Making fatal hidden assumptions */
/* Paul Hsiehs version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
const char *p; /* 5 */
int d;

p = s - 1;
do {
p++; /* 10 */
if ((((int) p) & (SW - 1)) == 0) {
do {
d = *((int *) p);
p += SW;
} while (!hasNulByte (d)); /* 15 */
p -= SW;
}
} while (*p != 0);
return p - s;
} /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. Its functioning is
doubtful when sizeof(int) is less that 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the systems strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-poweroftwo. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
todays systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type size_t or ssize_t,
but will always fit into a size_t). It then does a concealed cast
of this into an int, which could cause undefined or implementation
defined behaviour if the value exceeds what will fit into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 6 '06

Subscribe Reply

351

13102

Mark F. Haigh

Keith Thompson wrote:

"Mark F. Haigh" <mf*****@sbcglo bal.net> writes:
Keith Thompson wrote:
"Ed Prochak" <ed*******@gmai l.com> writes:
> Keith Thompson wrote:
[...] <snip>
There's a continuum from raw machine language to very high-level
languages. Macro assembler is only a very small step up from
non-macro assembler. C is a *much* bigger step up from that. Some C
constructs may happen to map to single instructions for *some*
compiler/CPU combinations; they might map to multiple instructions, or
even none, for others. An assignment statement might copy a single
scalar value (integer, floating-point, or pointer) -- or it might copy
an entire structure; the C code looks the same, but the machine code
is radically different.

Using entirely arbitrary units of high-level-ness, I'd call machine
language close to 0, assembly language 10, macro assembler 15, and C
about 50. It might be useful to have something around 35 or so.
(This is, of course, mostly meaningless.)

That's really interesting, because I have pondered the question and
answered it in nearly the same way. However, the conclusion I came to
is that C occupies a relatively large range in the continuum, not a
single point.

C dialects that accept inline assembly push the lower bound much lower
than it would otherwise be. Likewise, the wealth of freely-available C
libraries push the upper bound much further up. You can run a
real-world system written in C all the way from the top to the bottom--
from the GUI (GNOME, for example), to the kernel; from the compiler to
the shells and interpreters.

Standard C doesn't accept inline assembly. If some particular
compiler does so, it's an extension -- and you might as well think of
it as an assembler that also accepts C syntax.

Thank you, Captain Pedantic. Is that response to me or the masses of
corruptable youths eagerly waiting to sprinkle their code with inline
assembly?

If the former, then my point was that in the realm of C dialects,
Standard C (ideally) occupies a slowly expanding middle ground. If the
latter, then kids, listen to Keith, he's right.
As far as libraries are concerned, most of them aren't part of
standard C either -- and they needn't be either implemented in, or
called from, C.

Again, true, but oblique to the point that it is possible to run a
fully functional _modern_ system using only code written in C dialects.
I don't believe any other language "family" can boast this (for
currently available hardware).
Mark F. Haigh
mf*****@sbcglob al.net

Mar 15 '06 #191

Keith Thompson

"Mark F. Haigh" <mf*****@sbcglo bal.net> writes:

Keith Thompson wrote:
"Mark F. Haigh" <mf*****@sbcglo bal.net> writes: [...]
> C dialects that accept inline assembly push the lower bound much lower
> than it would otherwise be. Likewise, the wealth of freely-available C
> libraries push the upper bound much further up. You can run a
> real-world system written in C all the way from the top to the bottom--
> from the GUI (GNOME, for example), to the kernel; from the compiler to
> the shells and interpreters.
Standard C doesn't accept inline assembly. If some particular
compiler does so, it's an extension -- and you might as well think
of it as an assembler that also accepts C syntax.

Thank you, Captain Pedantic. Is that response to me or the masses of
corruptable youths eagerly waiting to sprinkle their code with inline
assembly?

Was that intended as an insult?

This discussion is cross-posted to three different newsgroups (and
it's been interesting in spite of that). I'm reading and writing this
in comp.lang.c, where we discuss the C programming language as defined
by the standards. The standard specifically allows extensions, and we
sometimes discuss the permitted nature of those extensions, but
details about particular extensions (such as inline assembly) are
considered off-topic.

My comments were intended for anyone who cares to read them.
If the former, then my point was that in the realm of C dialects,
Standard C (ideally) occupies a slowly expanding middle ground. If the
latter, then kids, listen to Keith, he's right.
As far as libraries are concerned, most of them aren't part of
standard C either -- and they needn't be either implemented in, or
called from, C.

Again, true, but oblique to the point that it is possible to run a
fully functional _modern_ system using only code written in C dialects.
I don't believe any other language "family" can boast this (for
currently available hardware).

I have no argument with that point. I was merely making a different
point.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 15 '06 #192

S.Tobias

Keith Thompson <ks***@mib.or g> wrote:

"S.Tobias" <si***@FamOuS.B edBuG.pAlS.INVA LID> writes:

By the same logic, ptr to just past the end of an array (which, of course,
does not point to an object), _can_ compare equal to null ptr? ;-)
- Of course not, semantics for "==" guarantee that both ptrs must be
null (among others) to compare equal.

Seems my second "of course" in not quite so. I wrote that because
I had always imagined that after an array there *is* something (albeit
possibly inaccessible), that past-the-end pointer points to.
Nice pictures are sometimes dangerous.

Interesting. I think you may have found a small flaw in the standard.

I intended my first sentence as a joke, actually. I could only take the
credit for finding anything if saying something stupid could count as
a discovery, too.

Since you've started a new discussion in c.s.c, I left some of my
thoughts there, too.

--
Stan Tobias
mailx `echo si***@FamOuS.Be dBuG.pAlS.INVALID | sed s/[[:upper:]]//g`

Mar 16 '06 #193

Mark F. Haigh

Keith Thompson wrote:

"Mark F. Haigh" <mf*****@sbcglo bal.net> writes:
Keith Thompson wrote:
"Mark F. Haigh" <mf*****@sbcglo bal.net> writes: [...] > C dialects that accept inline assembly push the lower bound much lower
> than it would otherwise be. Likewise, the wealth of freely-available C
> libraries push the upper bound much further up. You can run a
> real-world system written in C all the way from the top to the bottom--
> from the GUI (GNOME, for example), to the kernel; from the compiler to
> the shells and interpreters.

Standard C doesn't accept inline assembly. If some particular
compiler does so, it's an extension -- and you might as well think
of it as an assembler that also accepts C syntax.
Thank you, Captain Pedantic. Is that response to me or the masses of
corruptable youths eagerly waiting to sprinkle their code with inline
assembly?

Was that intended as an insult?

Heh, it's just tongue-in-cheek; a little jab. All in good fun.
This discussion is cross-posted to three different newsgroups (and
it's been interesting in spite of that). I'm reading and writing this
in comp.lang.c, where we discuss the C programming language as defined
by the standards. The standard specifically allows extensions, and we
sometimes discuss the permitted nature of those extensions, but
details about particular extensions (such as inline assembly) are
considered off-topic.

True, but the topic was really a meta-discussion about where C dialects
fit in to the continuum of languages, and where Standard C fits into
the continuum of C dialects. It's topical enough for me, and I was
hoping to spur some genuine discussion.

<snip>

Mark F. Haigh
mf*****@sbcglob al.net

Mar 16 '06 #194

CBFalconer

"Dik T. Winter" wrote:

"Ed Prochak" <ed*******@gmai l.com> writes:
Dik T. Winter wrote:

...
Indeed. But even when we look at the published instructions C
falls short of providing a construct for every one. Where is
the C construct to do a multply step available in quite a few
early RISC machines? Note also that in assembler you can access
the special bits indicating overflow and whatever (if they are
available on the machine). How to do that in C?

Well you cannot, but those processors did not even exist when C
was created. So those features didn't make it. To some degree,
C is more of a PDP assembler.

How do you get access to the condition bits?

With the usual gay abandon about extensions, you might define a
variable in system space, say _ccd, to hold those bits. You
specify the conditions under which it is valid, such as immediately
after an expression with precisely two operands, preserved by use
of the comma operator. Then:

a = b + c, ccd = _ccd;

allows you to detect overflow and other evil things. A similar
thing such as _high could allow capturing all bits from a
multiplication. i.e.:

a = b * c, ccd = _ccd, ov = _high;

tells you all about the operation without data loss.

Just blue skying here.

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 16 '06 #195

Chris Torek

>"Dik T. Winter" wrote:

How do you get access [from C] to the [hardware's] condition bits?

In article <44************ ***@yahoo.com>
CBFalconer <cb********@mai neline.net> wrote:With the usual gay abandon about extensions, you might define a
variable in system space, say _ccd, to hold those bits. You
specify the conditions under which it is valid, such as immediately
after an expression with precisely two operands, preserved by use
of the comma operator. Then:

a = b + c, ccd = _ccd;

allows you to detect overflow and other evil things.

This turns out not to work very well in Real Compilers. The reasons
are outlined rather nicely in the GCC documentation:

It is a natural idea to look for a way to give access to the condition
code left by the assembler instruction. However, when we attempted to
implement this, we found no way to make it work reliably. The problem
is that output operands might need reloading, which would result in
additional following "store" instructions. On most machines, these
instructions would alter the condition code before there was time to
test it. This problem doesn't arise for ordinary "test" and "compare"
instructions because they don't have any output operands.

For reasons similar to those described above, it is not possible to
give an assembler instruction access to the condition code left by
previous instructions.

(from <http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Extended-Asm.html>).

(You *can* actually capture the condition codes, by doing the
operation itself *and* the condition-code-capture all in one
single inline __asm__, so that no reloading occurs between the
two parts. Getting it right is still fairly tricky. In many
cases you are better off just writing the desired routines in
assembly, and calling them as ordinary functions from the C code.)
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Mar 16 '06 #196

Paul Keinanen

On 16 Mar 2006 06:41:40 GMT, Chris Torek <no****@torek.n et> wrote:

"Dik T. Winter" wrote:
How do you get access [from C] to the [hardware's] condition bits?

In article <44************ ***@yahoo.com>
CBFalconer <cb********@mai neline.net> wrote:
With the usual gay abandon about extensions, you might define a
variable in system space, say _ccd, to hold those bits. You
specify the conditions under which it is valid, such as immediately
after an expression with precisely two operands, preserved by use
of the comma operator. Then:

a = b + c, ccd = _ccd;

allows you to detect overflow and other evil things.

This turns out not to work very well in Real Compilers. The reasons
are outlined rather nicely in the GCC documentation:

It is a natural idea to look for a way to give access to the condition
code left by the assembler instruction. However, when we attempted to
implement this, we found no way to make it work reliably. The problem
is that output operands might need reloading, which would result in
additional following "store" instructions. On most machines, these
instructions would alter the condition code before there was time to
test it. This problem doesn't arise for ordinary "test" and "compare"
instructions because they don't have any output operands.

In many architectures, there are at least one way around this problem.
When an interrupt occurs, at least the processor status word
containing the condition codes and the return address are
automatically pushed on the stack (and possible some other registers).
This usually applies also to software generated traps. The ISR can
then copy the condition codes from the stack frame to a safer place.

There are of cause some problems, this would consume one trap
instruction, perhaps shareable with some debugger traps. With separate
user and kernel spaces, a trap will usually switch to kernel mode,
which would require a trap handler to be installed into kernel space
and would generate quite a lot of instructions due to mode switching
and safe copying between modes.

However, in single address space systems, such as most embedded
systems, this should be quite usable.

Paul

Mar 16 '06 #197

RSoIsCaIrLiIoA

On 16 Mar 2006 06:41:40 GMT, Chris Torek <no****@torek.n et> wrote:

"Dik T. Winter" wrote:
How do you get access [from C] to the [hardware's] condition bits?

In article <44************ ***@yahoo.com>
CBFalconer <cb********@mai neline.net> wrote:
With the usual gay abandon about extensions, you might define a
variable in system space, say _ccd, to hold those bits. You
specify the conditions under which it is valid, such as immediately
after an expression with precisely two operands, preserved by use
of the comma operator. Then:

a = b + c, ccd = _ccd;

allows you to detect overflow and other evil things.

This turns out not to work very well in Real Compilers. The reasons
are outlined rather nicely in the GCC documentation:

It is a natural idea to look for a way to give access to the condition
code left by the assembler instruction. However, when we attempted to
implement this, we found no way to make it work reliably. The problem
is that output operands might need reloading, which would result in
additional following "store" instructions. On most machines, these
instructions would alter the condition code before there was time to
test it. This problem doesn't arise for ordinary "test" and "compare"
instructions because they don't have any output operands.

For reasons similar to those described above, it is not possible to
give an assembler instruction access to the condition code left by
previous instructions.

(from <http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Extended-Asm.html>).

(You *can* actually capture the condition codes, by doing the
operation itself *and* the condition-code-capture all in one
single inline __asm__, so that no reloading occurs between the
two parts. Getting it right is still fairly tricky. In many
cases you are better off just writing the desired routines in
assembly, and calling them as ordinary functions from the C code.)

in what i understand (if i understand) the solution for all your
problems is to have in the compiler implementation some debug mode for
its exe output: the executable when run opens many queues of 20 items:
one register all overflow
one all memory leak
one all memory wrote out of bound
....
then at end of program all those queue are print in a log file

if after 10 years of use that program has not problem in its
log then it could be compiled without debug queues

Mar 16 '06 #198

CBFalconer

Chris Torek wrote:

CBFalconer <cb********@mai neline.net> wrote:
"Dik T. Winter" wrote:
How do you get access [from C] to the [hardware's] condition bits?

With the usual gay abandon about extensions, you might define a
variable in system space, say _ccd, to hold those bits. You
specify the conditions under which it is valid, such as immediately
after an expression with precisely two operands, preserved by use
of the comma operator. Then:

a = b + c, ccd = _ccd;

allows you to detect overflow and other evil things.

This turns out not to work very well in Real Compilers. The reasons
are outlined rather nicely in the GCC documentation:

It is a natural idea to look for a way to give access to the condition
code left by the assembler instruction. However, when we attempted to
implement this, we found no way to make it work reliably. The problem
is that output operands might need reloading, which would result in
additional following "store" instructions. On most machines, these
instructions would alter the condition code before there was time to
test it. This problem doesn't arise for ordinary "test" and "compare"
instructions because they don't have any output operands.

For reasons similar to those described above, it is not possible to
give an assembler instruction access to the condition code left by
previous instructions.

(from <http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Extended-Asm.html>).

That's why I specified a limitation to a comma operator separated
statement, or some such. This would give an opportunity to use the
equivalent of "push psw" in the generated code, after which the
storage can be assigned and filled. By limiting to some specific
format we can avoid excessive overhead elsewhere. My mechanism is
hardly well thought out, it is only intended to trigger thoughts
from others. As I said, blue skying.

--
"If you want to post a followup via groups.google.c om, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell. org/google/>
Also see <http://www.safalra.com/special/googlegroupsrep ly/>

Mar 16 '06 #199

Richard Bos

"Ed Prochak" <ed*******@gmai l.com> wrote:

Richard Bos wrote:
One can make a case that in at least one aspect, C is _less_ like
assembler than many other high-level languages: for. Most languages only
support for loops looking like FOR n=1 TO 13 STEP 2: NEXT or for n:=10
downto 1 do. Such loops are often easily caught in one, or a few, simple
machine instructions. Now try the same with for (node=root; node;
node=node->next) or for (i=1, j=10; x[i] && y[j]; i++, j--).

True, but
for( <p1>; <p2>; p3> )
looks like a MACRO to me.

*Shrug* If you go about redefining terms to support your delusion that C
is a portable assembler, that's your problem, not that of C.

Richard

Mar 16 '06 #200