Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code. The following leans heavily on one particular
example, which happens to be in C. However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language. As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language. The term blackguards is not in favor these days, and for
good reason. However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject. The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code. As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1] The latter does not have a text version, but is
more up-to-date.

We will always have innocent-appearing code with these kinds of
assumptions built in. However, it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused under other systems.
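
One way to annotate such assumptions is to let the compiler check
them. The following is only a sketch: it assumes a C11 compiler
(static_assert is newer than the drafts mentioned above), and the
name WORD_CHARS is mine, chosen purely for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT, UINT_MAX */

/* Each hidden assumption becomes a visible, compile-time check. */
static_assert(CHAR_BIT == 8,
              "word-at-a-time scan assumes 8-bit chars");
static_assert(sizeof(unsigned int) == 4,
              "the 0x01010101 / 0x80808080 masks assume a 4-char word");
static_assert(UINT_MAX == 0xFFFFFFFFu,
              "unsigned int is assumed to have exactly 32 value bits");
static_assert((sizeof(unsigned int) & (sizeof(unsigned int) - 1)) == 0,
              "the alignment mask assumes a power-of-two word size");

/* Hypothetical name: chars examined per word once the checks pass. */
enum { WORD_CHARS = sizeof(unsigned int) };
```

On a system where any of these does not hold, the build stops with
the message instead of producing code that misbehaves later.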

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' line-number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
http://www.azillionmonkeys.com/qed/asmexample.html

Some sneaky hidden assumptions here:
1. p = s - 1 is valid. Not guaranteed. Careless coding.
2. cast (int) p is meaningful. Not guaranteed.
3. Use of 2's complement arithmetic.
4. ints have no trap representations or hidden bits.
5. 4 == sizeof(int) && 8 == CHAR_BIT.
6. size_t is actually int.
7. sizeof(int) is a power of 2.
8. int alignment depends on a zeroed bit field.

Since strlen is normally supplied by the system, the system
designer can guarantee all but item 1. Otherwise this is
not portable. Item 1 can probably be beaten by suitable
code reorganization to avoid the initial p = s - 1. This
is a serious bug which, for example, can cause segfaults
on many systems. It is most likely to foul when (int)s
has the value 0, and is meaningful.

He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
    const char *p;                          /* 5 */
    int d;

    p = s - 1;
    do {
        p++;                                /* 10 */
        if ((((int) p) & (SW - 1)) == 0) {
            do {
                d = *((int *) p);
                p += SW;
            } while (!hasNulByte (d));      /* 15 */
            p -= SW;
        }
    } while (*p != 0);
    return p - s;
}                                           /* 20 */

Let us start with line 1! The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT. A further hidden assumption is that there are no trap
values in the representation of an int. The macro's functioning is
doubtful when sizeof(int) is less than 4. At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met. There is nothing wrong
with it, PROVIDED we know when it is valid.
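
As a sketch of the same test with its preconditions stated rather
than hidden, assuming an implementation that provides the optional
type uint32_t (the function name has_nul_byte is mine):

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uint32_t, UINT32_C */

static_assert(CHAR_BIT == 8, "the byte masks below assume 8-bit chars");

/* Nonzero if and only if at least one of the four bytes of x is zero.
   Unsigned arithmetic is used so that the subtraction wraps in a
   defined way, whatever the representation of signed integers. */
static uint32_t has_nul_byte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}
```

A word such as 0x41004344 gives a nonzero result, while 0x41424344
and 0x80808080 both give zero.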

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition. This just serves to obscure the fact that SW is
actually sizeof(int) later. No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4. Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function. In particular, strlen returns a size_t,
not an int. size_t is always unsigned.

In line 8 we come to a biggie. The standard specifically does not
guarantee the action of a pointer below an object. The only real
purpose of this statement is to compensate for the initial
increment in line 10. This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid. This is the only real error in the code
that I see.
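
One possible rearrangement is sketched below. It is not Hsieh's
code: the name xstrlen_rearranged is mine, and I have swapped in
uint32_t, uintptr_t and memcpy for the int casts. It still assumes
that reading the whole aligned word containing the terminator is
harmless, which the standard does not promise, so it belongs in the
hands of a system designer and not in portable user code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_BYTES sizeof(uint32_t)

/* Same scan as xstrlen, but p starts at s and never points below it. */
size_t xstrlen_rearranged(const char *s)
{
    const char *p = s;
    uint32_t d;

    assert(s != NULL);
    for (;;) {
        /* Assumes an aligned pointer has its low address bits zero. */
        if (((uintptr_t)p & (WORD_BYTES - 1)) == 0) {
            for (;;) {
                /* Assumes the whole word at p may be read safely. */
                memcpy(&d, p, sizeof d);
                if ((d - UINT32_C(0x01010101)) & ~d & UINT32_C(0x80808080))
                    break;          /* this word holds a zero byte */
                p += WORD_BYTES;
            }
        }
        if (*p == '\0')
            break;
        p++;                        /* step by single chars */
    }
    return (size_t)(p - s);
}
```

The alignment test now comes before the increment, so the initial
decrement has nothing left to compensate for.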

In line 11 we have several hidden assumptions. The first is that
the cast of a pointer to an int is valid. This is never
guaranteed. A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id. If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int. It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1. This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two. In addition, it assumes that an aligned pointer
will have those bits zero. While this last is very likely in
today's systems, it is still an assumption. The system designer is
entitled to assume this, but user code is not.
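
A sketch of the same test with the assumptions spelled out follows.
It uses uintptr_t, which is an optional type, and the result of the
pointer conversion is implementation-defined, so "low bits zero
means aligned" remains an assumption here too. The name
looks_aligned is mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the integer image of p has its low bits clear.
   align must be a power of two, otherwise the mask is meaningless. */
static int looks_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

The assert documents, and in debug builds enforces, the
power-of-two requirement that the original mask silently relies on.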

Line 13 again uses an unwarranted cast, this time of the char
pointer to a pointer to int, which is then dereferenced. This
enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type ptrdiff_t; the
length of a string will always fit into a size_t). It then does a
concealed conversion of this into an int, which could cause
undefined or implementation-defined behaviour if the value exceeds
what will fit into an int. This one is also unnecessary, since it
is trivial to define the return type as size_t and guarantee
success.
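
For comparison, a minimal sketch of a version with the proper
return type and none of the hidden assumptions (the name
portable_strlen is mine; it makes no claim to speed):

```c
#include <assert.h>
#include <stddef.h>

/* Counts in size_t throughout, so no pointer subtraction and no
   narrowing conversion is needed. */
size_t portable_strlen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}
```

Anything faster than this has to buy its speed with assumptions of
the kind listed above.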

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro. I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code? The sneaky thing is that the code
appears trivially correct at first glance. This is the stuff that
Heisenbugs are made of. Yet use of such code is fairly safe if we
are aware of those hidden assumptions.

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Mar 6 '06
In article <pa****************************@areilly.bpc-users.org>
Andrew Reilly <an*************@areilly.bpc-users.org> wrote:
I don't understand this argument. The 286/386 doesn't even *have* pointer
registers, as such. It has segment descriptors, which can be used to make
things complicated, if you want to, but when you use a 286 as the 16-bit
machine that it is, then there is no issue here at all.
It has a 20-bit architecture, and people did (and still do) use it
that way.
Since that particular platform is (thankfully) falling into obsolescence,
can't we start to consider tidying up the standard, to allow more
traditional, idiomatic, symmetrical coding styles?


And now the x86-64 is coming, and everything old will be new again.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Mar 26 '06 #321
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
On Sun, 26 Mar 2006 02:53:13 +0000, Keith Thompson wrote:
I asked in comp.std.c whether the AS/400 actually influenced the C
standard. Here's a reply from P.J. Plauger:

] AS/400 might have been mentioned. Several of us had direct experience
] with the Intel 286/386 however, and its penchant for checking anything
] you loaded into a pointer register. IIRC, that was the major example
] put forth for disallowing the generation, or even the copying, of an
] invalid pointer.
I don't understand this argument. The 286/386 doesn't even *have* pointer
registers, as such. It has segment descriptors, which can be used to make
things complicated, if you want to, but when you use a 286 as the 16-bit
machine that it is, then there is no issue here at all. Similarly, the
386 can be used as a perfectly reasonable "C machine", and generally is,
these days. It only gets curly when you try to synthesize an extended
address range out of it. Unfortunately, the dominant compiler and platform
made a hash of that, rather than putting in the effort to make it work in
a (more) reasonable way.


I don't know enough about the 286/386 architecture(s) to offer any
meaningful commentary on this. Possibly some committee members
thought that future architectures might take some ideas from the
286/386 and extend them.
Since that particular platform is (thankfully) falling into obsolescence,
can't we start to consider tidying up the standard, to allow more
traditional, idiomatic, symmetrical coding styles? Restore
pointer-oriented algorithm expressions to their place of idempotic
symmetry with index-oriented expressions? Please?


The only way that's going to happen is if somebody (1) comes up with a
specification and (2) pushes it through the committee. Advocating it
in comp.lang.c won't get it done.

Step 1 means, for each pointer operation, either specifying its
semantics, or stating that the behavior is either
implementation-defined, unspecified, or undefined. Once you get into
the details, you can expect a lot of arguments, such as people
pointing out that the suggested required semantics won't necessarily
work on some real-world system(s).

Step 2 is left as an exercise.

Or you can create your own language, or you can limit your development
to implementations that you *know* meet your requirements (which go
beyond the requirements of the current standard).

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 26 '06 #322
Chris Torek <no****@torek.net> writes:
In article <pa****************************@areilly.bpc-users.org>
Andrew Reilly <an*************@areilly.bpc-users.org> wrote:
I don't understand this argument. The 286/386 doesn't even *have* pointer
registers, as such. It has segment descriptors, which can be used to make
things complicated, if you want to, but when you use a 286 as the 16-bit
machine that it is, then there is no issue here at all.


It has a 20-bit architecture, and people did (and still do) use it
that way.
Since that particular platform is (thankfully) falling into obsolescence,
can't we start to consider tidying up the standard, to allow more
traditional, idiomatic, symmetrical coding styles?


And now the x86-64 is coming, and everything old will be new again.


As far as I can tell, the x86-64 uses (or at least is capable of
using) a flat 64-bit address space.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 26 '06 #323
On Sun, 26 Mar 2006 04:56:00 +0000, Chris Torek wrote:
In article <pa****************************@areilly.bpc-users.org>
Andrew Reilly <an*************@areilly.bpc-users.org> wrote:
I don't understand this argument. The 286/386 doesn't even *have* pointer
registers, as such. It has segment descriptors, which can be used to make
things complicated, if you want to, but when you use a 286 as the 16-bit
machine that it is, then there is no issue here at all.


It has a 20-bit architecture, and people did (and still do) use it
that way.


It's vaguely plausible to call the VM86 (real-mode) x86 arch 20-bit, but
it's a stretch, as no processor-visible registers, and no ALU ops are
20-bits long. It's 16-bit in the same sense that the later PDP-11s with
various memory extension schemes were 16-bit. It still gets used, to some
extent, because it's the boot environment of PCs.

The 286 could plausibly be called a 24-bit segmented machine, and shares
much of the memory model of its IBM FS, OS/36 (which grew up to be
AS/400) and Intel 432 antecedents. A nice protected architecture for
Pascal, PL/1, COBOL, and other HLLs of the age. You certainly couldn't
call it a "C machine" other than when used within its 16-bit, flat memory
model (small) modes. Everything else required language extensions ("near"
and "far" pointers), and any pointer misbehaviour sanctioned by the
standard and by the implementations could reasonably be said to be limited
to those extensions, anyway. The fact that as much mileage was had out of
C in that environment is a testament to the industry's determination and
enthusiasm. When compilation was done so that non-standard pointer
extensions weren't required in the source, then it should have been the
system run-time that gave ground, rather than the standard. I doubt very
much that any new development work is being done in 286 protected mode,
anywhere.
Since that particular platform is (thankfully) falling into
obsolescence, can't we start to consider tidying up the standard, to
allow more traditional, idiomatic, symmetrical coding styles?


And now the x86-64 is coming, and everything old will be new again.


The x86-64 is a lovely architecture for a C machine. Specifically, it has
jettisoned most of the segmentation issues. All 64 bits' worth of address
space can be loaded into any "pointer" register, and manipulated with the
full complement of integer and logical operations (because the pointer
registers are also the integer ALU registers), and the only time you can
ever get a peep out of a trap handler is if you try to actually
access memory at an address not mapped into the process' address space.

--
Andrew

Mar 26 '06 #324
Keith Thompson wrote:
Andrew Reilly <an*************@areilly.bpc-users.org> writes:
.... snip ...
Since that particular platform is (thankfully) falling into
obsolescence, can't we start to consider tidying up the standard,
to allow more traditional, idiomatic, symmetrical coding styles?
Restore pointer-oriented algorithm expressions to their place of
idempotic symmetry with index-oriented expressions? Please?


The only way that's going to happen is if somebody (1) comes up
with a specification and (2) pushes it through the committee.
Advocating it in comp.lang.c won't get it done.

Step 1 means, for each pointer operation, either specifying its
semantics, or stating that the behavior is either
implementation-defined, unspecified, or undefined. Once you get
into the details, you can expect a lot of arguments, such as
people pointing out that the suggested required semantics won't
necessarily work on some real-world system(s).

Step 2 is left as an exercise.

Or you can create your own language, or you can limit your
development to implementations that you *know* meet your
requirements (which go beyond the requirements of the current
standard).


We already have an ugly example of this process, in C# and the
entire .NET hoax, from people with more influence (and money) than
Mr Reilly.

--
Some informative links:
news:news.announce.newusers
http://www.geocities.com/nnqweb/
http://www.catb.org/~esr/faqs/smart-questions.html
http://www.caliburn.nl/topposting.html
http://www.netmeister.org/news/learn2quote.html
Mar 26 '06 #325
On Sat, 25 Mar 2006 18:46:43 -0600, "Stephen Sprunk"
<st*****@sprunk.org> wrote:
"Paul Keinanen" <ke******@sci.fi> wrote in message
news:t8********************************@4ax.com...
With separate data and address registers p=s-1 and p++ could as well
be calculated in integer registers and the final result (==s) would be
transferred to the address registers for memory access.

Considering that s is probably already in an address register, doing the
manipulation your way would require transferring it to an integer register,
doing the decrement, then doing the increment, then transferring it back to
an address register when it's needed for dereferencing. Why do that when
you can adjust the address register directly?


We have just been discussing in dozens of messages :-) that this would
trap on AS/400 and that trap could not be ignored.

By doing the calculations in integer registers this problem can be
avoided. Going this route would only be necessary when such
problematic expressions exist in the source code, not always.

Paul

Mar 26 '06 #326
On 2006-03-26, Keith Thompson <ks***@mib.org> wrote:
Chris Torek <no****@torek.net> writes:
In article <pa****************************@areilly.bpc-users.org>
Andrew Reilly <an*************@areilly.bpc-users.org> wrote:
I don't understand this argument. The 286/386 doesn't even *have* pointer
registers, as such. It has segment descriptors, which can be used to make
things complicated, if you want to, but when you use a 286 as the 16-bit
machine that it is, then there is no issue here at all.


It has a 20-bit architecture, and people did (and still do) use it
that way.
Since that particular platform is (thankfully) falling into obsolescence,
can't we start to consider tidying up the standard, to allow more
traditional, idiomatic, symmetrical coding styles?


And now the x86-64 is coming, and everything old will be new again.


As far as I can tell, the x86-64 uses (or at least is capable of
using) a flat 64-bit address space.


Your caveat covers you. It can have a flat address space, but also has
its legacy "hw mode" allowing 16 & 32 bit stuff to see the relevant
addressing space.
Mar 26 '06 #327
"Jordan Abel" <ra*******@gmail.com> wrote in message
news:sl**********************@random.yi.org...
On 2006-03-26, Stephen Sprunk <st*****@sprunk.org> wrote:
It simply doesn't make sense to do things that way since the only
purpose is to allow violations of the processor's memory protection
model. Work with the model, not against it.


Because it's a stupid memory protection model.

Why can't the trap be caught and ignored?


It can't be ignored because (apparently) the AS/400 and similar machines
only do permission checks on pointer formation. Once the pointer is formed,
accesses do not need permission checks. If you were able to ignore the trap
on formation, that would mean all pointer accesses would be exempt from the
security model.

Personally, I'd rather have my processor trap when an invalid pointer is
formed, since in my code such an occurrence is _always_ a bug. Waiting
until the pointer is dereferenced makes it significantly harder to debug.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Mar 26 '06 #328
Keith Thompson <ks***@mib.org> writes:
Chris Torek <no****@torek.net> writes:
In article <pa****************************@areilly.bpc-users.org>
Andrew Reilly <an*************@areilly.bpc-users.org> wrote:
I don't understand this argument. The 286/386 doesn't even *have* pointer
registers, as such. It has segment descriptors, which can be used to make
things complicated, if you want to, but when you use a 286 as the 16-bit
machine that it is, then there is no issue here at all.


It has a 20-bit architecture, and people did (and still do) use it
that way.
Since that particular platform is (thankfully) falling into obsolescence,
can't we start to consider tidying up the standard, to allow more
traditional, idiomatic, symmetrical coding styles?


And now the x86-64 is coming, and everything old will be new again.


As far as I can tell, the x86-64 uses (or at least is capable of
using) a flat 64-bit address space.


The piece I missed is that an x86-64 system can run 32-bit code. If I
compile and run a program on an x86-64 system, it uses 64-bit
pointers. If I compile a program on an x86-32 system and copy the
executable to an x86-64 system, it runs properly and uses 32-bit
pointers. (At least on the systems I have access to.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 26 '06 #329
>> Chris Torek <no****@torek.net> writes:
And now the x86-64 is coming, and everything old will be new again.
Keith Thompson <ks***@mib.org> writes:
As far as I can tell, the x86-64 uses (or at least is capable of
using) a flat 64-bit address space.

In article <ln************@nuthaus.mib.org>
Keith Thompson <ks***@mib.org> wrote:
The piece I missed is that an x86-64 system can run 32-bit code. If I
compile and run a program on an x86-64 system, it uses 64-bit
pointers. If I compile a program on an x86-32 system and copy the
executable to an x86-64 system, it runs properly and uses 32-bit
pointers. (At least on the systems I have access to.)


Yes. I am not saying that x86-64 has re-created the old 80x86
segmentation model. No, this is merely the thin end of the wedge.
Segmentation will come back, sooner or later. :-)
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Mar 26 '06 #330
