clarification on character handling

aegis

7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

--
aegis

Nov 15 '05 #1


Peter Nilsson

aegis wrote:

7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

More to the point, what should it be if _not_ UB?

It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array. It's no different to tolower(32767) on an 8-bit
char system. Why would you _expect_ some defined behaviour?
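
A minimal sketch of the table lookup described above, under stated assumptions: EOF is -1, and the names ctype_flags, FLAG_LOWER, ctype_flags_init and sketch_islower are hypothetical, not taken from any real library. Since the lookup tests a flag bit, it behaves like an islower-style classifier.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

enum { FLAG_LOWER = 0x01 };              /* hypothetical flag bit */

/* One slot for EOF (index 0) plus one slot per unsigned char value. */
static unsigned char ctype_flags[UCHAR_MAX + 2];

/* Fill the table at run time; a real library would more likely use a
   static initializer. */
void ctype_flags_init(void)
{
    const char *p;
    for (p = "abcdefghijklmnopqrstuvwxyz"; *p != '\0'; p++)
        ctype_flags[(unsigned char)*p + 1] |= FLAG_LOWER;
}

/* Valid only for EOF or 0..UCHAR_MAX. Any other value would index
   outside the table, which is the out-of-bounds access discussed above. */
int sketch_islower(int c)
{
    assert(EOF == -1);                               /* assumption of this sketch */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));  /* the 7.4#1 precondition */
    return ctype_flags[c + 1] & FLAG_LOWER;
}
```

With the asserts compiled out (NDEBUG), sketch_islower(-10) would read ctype_flags[-9], which is exactly the kind of access the standard declines to define.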

--
Peter

Nov 15 '05 #2

RAJU

Hi aegis,

The expected argument to tolower(c) is stated in the specification.
What happens when an unexpected argument is passed is not specified.
It's left to the compiler writers to choose their own implementation,
so the result is compiler/system dependent.

It's the programmer's responsibility to avoid these kinds of scenarios.
These C functions return no error code. This is very common in the
C standard.

Regards,
Raju

aegis wrote:

7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

--
aegis

Nov 15 '05 #3

CBFalconer

aegis wrote:

7.4#1 states
The header <ctype.h> declares several functions useful for
classifying and mapping characters.166) In all cases the argument
is an int, the value of which shall be representable as an
unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

Many systems have an array of bits with masks, such that the array
can be indexed by the value of the character + 1. If the value of
EOF is -1 this maps into a normal 0 based array, if EOF is
something else appropriate code can correct. The bits have
significance as to whether the character is upper case, lower case,
printable, numeric, etc. A single index and mask can return the
appropriate characteristic.

Negative (-ve) input values other than EOF foul this up, and result
in illegal memory accesses.

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 15 '05 #4

Richard Kettlewell

"aegis" <ae***@mad.scientist.com> writes:

7.4#1 states
The header <ctype.h> declares several functions useful for
classifying and mapping characters.166) In all cases the argument is
an int, the value of which shall be representable as an unsigned
char or shall equal the value of the macro EOF. If the argument has
any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

I would say you have it backwards: the ways in which tolower can be
implemented are defined by the specification, and the specification
allows implementations to break on negative non-EOF input if that's
the most convenient thing for them.

--
http://www.greenend.org.uk/rjk/

Nov 15 '05 #5

Antoine Leca

En <news:11**********************@f14g2000cwb.googlegroups.com>,
aegis va escriure:

Why should something such as:
tolower(-10); invoke undefined behavior?

Because historically it does (out of bounds access), and it was not deemed
worthwhile to give it a reasonable behaviour (which one, by the way?)

Antoine

Nov 15 '05 #6

Antoine Leca

Sorry if I am being too picky. I do not know what the original poster's
point was, but since he posted to both comp.lang.c and comp.std.c, he perhaps
wants to make a point about toxxx() vs. isxxx().

En <news:11**********************@g14g2000cwa.googlegroups.com>,
Peter Nilsson va escriure:

The toxxxx() macros and functions are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

This is unlikely to work correctly on a large scale (and *_flags can't be
0); furthermore, your _flags[] array cannot be shared with toupper(), which
makes its name pretty misleading.

Also, implementations of tolower() and toupper() as macros using the
classification array lookup, like
#define tolower(x) ((x) ^ _flags[(x) + 1] & _upper_case_flag)
(with an adequately chosen _upper_case_flag, i.e. 0x20 for ASCII and 0x40
for EBCDIC) do not comply with the C standard, because the x argument is
evaluated twice.
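
To make the double evaluation concrete, here is a small sketch. It assumes ASCII (so the case bit is 0x20), and every name in it (demo_flags, DEMO_UPPER_BIT, DEMO_TOLOWER, demo_next, demo_run) is hypothetical. The argument expression has a visible side effect, so the number of evaluations can be counted.

```c
#include <assert.h>
#include <limits.h>

enum { DEMO_UPPER_BIT = 0x20 };          /* ASCII case bit, as in the text */

static unsigned char demo_flags[UCHAR_MAX + 2];
static int demo_calls;                   /* counts evaluations of the argument */

/* Macro in the style discussed above: note that x appears twice. */
#define DEMO_TOLOWER(x) ((x) ^ (demo_flags[(x) + 1] & DEMO_UPPER_BIT))

/* An argument expression with a side effect. */
int demo_next(void)
{
    demo_calls++;
    return 'A';
}

/* Expands the macro once with a side-effecting argument. */
int demo_run(void)
{
    int r;

    demo_flags['A' + 1] = DEMO_UPPER_BIT;
    demo_calls = 0;
    r = DEMO_TOLOWER(demo_next());
    assert(demo_calls == 2);             /* evaluated twice, not once */
    return r;
}
```

With something like DEMO_TOLOWER(*p++) the same expansion would modify p twice without an intervening sequence point, which is worse than merely doing the work twice.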

The other obvious "solution",
#define tolower(x) (_locale_dependent_array_for_tolower[(x) + 1])
is difficult to get working correctly according to the specifications,
because it must return an int, including for EOF (which is negative) and
UCHAR_MAX (which is positive), so the element type of the array
cannot in general be a character type; and the resulting increase in width
wastes memory. As a result, many implementations do not provide tolower()
and toupper() as macros, only as functions.
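
A rough sketch of that mapping-table variant, assuming EOF is -1 and the "C" locale. The names lower_map, lower_map_init and sketch_tolower are hypothetical. The element type is int precisely because one table has to hold both EOF and UCHAR_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* int elements: a character type could not hold both EOF and UCHAR_MAX. */
static int lower_map[UCHAR_MAX + 2];

/* Build the table at run time; identity everywhere except A..Z. */
void lower_map_init(void)
{
    const char *up = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *lo = "abcdefghijklmnopqrstuvwxyz";
    int c;

    lower_map[0] = EOF;                       /* slot for EOF, assuming EOF == -1 */
    for (c = 0; c <= UCHAR_MAX; c++)
        lower_map[c + 1] = c;
    for (; *up != '\0'; up++, lo++)
        lower_map[(unsigned char)*up + 1] = (unsigned char)*lo;
}

/* Single lookup, so the argument is evaluated once even as a macro body. */
int sketch_tolower(int c)
{
    assert(EOF == -1);                               /* assumption of this sketch */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));  /* the 7.4#1 precondition */
    return lower_map[c + 1];
}
```

On a typical system with 4-byte int this table takes 257 * 4 bytes instead of 257, which is the memory cost mentioned above.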
Antoine

Nov 15 '05 #7

Keith Thompson

"Peter Nilsson" <ai***@acay.com.au> writes:

aegis wrote:
7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

More to the point, what should it be if _not_ UB?

If plain char is signed, it would be sensible to define the various
functions to work properly with signed values, including negative
values. All the characters of the basic character set are required to
be positive, but it would be nice to be able to do something like:

char c = some_arbitrary_value;
if (isupper(c)) {
do_something();
}
else {
do_something_else();
}

The need to cast the argument to unsigned char is well documented, but
IMHO counterintuitive.
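
For reference, the documented idiom looks roughly like this. count_upper is a hypothetical helper written only to show where the cast goes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts the uppercase letters in s. */
size_t count_upper(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast first: plain char may be signed, and a negative value
           other than EOF is outside what isupper() accepts. */
        if (isupper((unsigned char)*s))
            n++;
    }
    return n;
}
```

Without the cast, a string containing bytes above 0x7F would pass negative values to isupper() on any implementation where plain char is signed.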

The restriction to non-negative values and EOF makes things slightly
easier for the implementation, and slightly more difficult for the
programmer. This may have been a good tradeoff when the functions
were first defined; I don't think it is now.

I've seen implementations of <ctype.h> that work properly for values
from -128 to +255, covering both signed and unsigned characters.
There is an overlap between EOF (typically -1) and whatever character
is encoded as -1 (lowercase-y-with-diaeresis in Latin-1, I think), but
that's not a problem in the default locale, since all the functions
happen to return the same value for EOF and that character.

It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array. It's no different to tolower(32767) on an 8-bit
char system. Why would you _expect_ some defined behaviour?

This approach can handle negative values sensibly by changing the
offset value and making the array bigger.

Of course, since the standard doesn't require implementations to do
this, portable code still needs to make sure the argument is either
EOF or a non-negative value.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 15 '05 #8

Johan Borkhuis

Peter Nilsson wrote:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

Kind regards,
Johan

--
o o o o o o o . . . _____J_o_h_a_n___B_o_r_k_h_u_i_s___
o _____ || http://www.borkhuis.com |
.][__n_n_|DD[ ====_____ | jo***@borkhuis.com |(________|__|_[_________]_|________________________________|

_/oo OOOOO oo` ooo ooo 'o!o!o o!o!o`
== VxWorks FAQ: http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html ==

Nov 15 '05 #9

Krishanu Debnath

Johan Borkhuis wrote:

Peter Nilsson wrote:
Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

Krishanu

Nov 15 '05 #10

Johan Borkhuis

Krishanu Debnath wrote:

Johan Borkhuis wrote:
Peter Nilsson wrote:
Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.
Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,
and this is closest to a definition I can get)

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

Kind regards,
Johan


Nov 15 '05 #11

Krishanu Debnath

Johan Borkhuis wrote:

Krishanu Debnath wrote:
Johan Borkhuis wrote:
Peter Nilsson wrote:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.
It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,
and this is closest to a definition I can get)

This is exactly what the standard says.

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

*Yes*. Then why do you need an unsigned char cast?

If you pass a value that toupper/tolower does not accept (e.g. a negative
integer), you will get undefined behavior with *that*
implementation.

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

Krishanu

Nov 15 '05 #12

Chris Croughton

On 9 Aug 2005 00:00:58 -0700, Krishanu Debnath
<kr**************@gmail.com> wrote:

Johan Borkhuis wrote:
Peter Nilsson wrote:
> Consider a simple look up table (and the fact that EOF is quite
> often and deliberately set at -1). The toxxxx() macros and functions
> are often implemented in this way...
>
> unsigned char _flags[257] = { 0, .... };
>
> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
>
> If you try tolower(-10), then the element referenced is not within
> the specified array.
Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

Incidentally, the #define you are all using is for islower(), not
tolower(): it looks the character up in a table and selects a bit.
But a similar thing can be done for tolower() etc. using a lookup table
so that it doesn't result in multiple evaluation of the argument
(although it isn't safe to assume that the argument is evaluated only
once).

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

That doesn't matter (the effect is "undefined" if the character is out
of range, so whether it crashes, returns an incorrect result or causes
demons to fly out of your nose is up to the implementation). More
importantly it fails on EOF (and of course the +1 in the index is now
not needed because (unsigned char)(x) can never be negative).

A better implementation, as someone else mentioned, is to map all of the
characters from CHAR_MIN to UCHAR_MAX into the array:

#define islower(x) (_flags[(x) - CHAR_MIN] & _lower_case_flag)

This still has the problem that EOF will typically map onto one of the
other characters with a negative representation in signed char, but
that's the risk you take. If you want to make sure that the character
(char)EOF is treated as a real character, you still need to cast it to
unsigned char first.
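
A sketch of the wider table, with hypothetical names (wide_flags, WIDE_LOWER, wide_flags_init, wide_islower). It uses SCHAR_MIN rather than CHAR_MIN as the lower bound so that the table has the same shape whether plain char is signed or unsigned; the index is formed by subtracting the most negative value so that it lands on slot 0. Only the basic lowercase letters are flagged here.

```c
#include <assert.h>
#include <limits.h>

enum { WIDE_LOWER = 0x01 };              /* hypothetical flag bit */

/* Covers SCHAR_MIN..UCHAR_MAX, so values of signed and unsigned char
   both land inside the table. */
static unsigned char wide_flags[UCHAR_MAX - SCHAR_MIN + 1];

void wide_flags_init(void)
{
    const char *p;
    for (p = "abcdefghijklmnopqrstuvwxyz"; *p != '\0'; p++)
        wide_flags[(unsigned char)*p - SCHAR_MIN] |= WIDE_LOWER;
}

/* Defined for every value a char or unsigned char can hold. */
int wide_islower(int c)
{
    assert(c >= SCHAR_MIN && c <= UCHAR_MAX);
    return wide_flags[c - SCHAR_MIN] & WIDE_LOWER;
}
```

Here wide_islower(-10) is an ordinary in-bounds lookup. EOF, if it is -1, shares a slot with whatever character has the value -1 as a signed char, as described above.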

(Or better still would be to change the standard and force plain char to
be unsigned, but I doubt that will happen...)

Chris C

Nov 15 '05 #13

Johan Borkhuis

Krishanu Debnath wrote:

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,

This is exactly what standard says.

and this is closest to a definition I can get)

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

*Yes*. Then why do you need a unsigned char cast?

The main reason for the cast is to avoid a negative index into an array.

You don't give a value that toupper/tolower accepts (e.g. a negative
integer), you will get an undefined behavior with *that*
implementation.

You can also consider a segmentation fault undefined behaviour.

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

What is the definition of undefined behaviour? In this case the return
of something (AKA Garbage in Garbage out) can be considered undefined
behaviour (unless you consider the fact that because I defined it, it is
no longer undefined, and thus not according to the standard.....).
But as the output is undefined I don't think you can say that any output
can be considered wrong.

Kind regards,
Johan


Nov 15 '05 #14

pete

Krishanu Debnath wrote:

Johan Borkhuis wrote:
Krishanu Debnath wrote:
Johan Borkhuis wrote:

>Peter Nilsson wrote:
>
>>Consider a simple look up table (and the fact that EOF is quite
>>often and deliberately set at -1). The toxxxx() macros and functions
>>are often implemented in this way...
>>
>> unsigned char _flags[257] = { 0, .... };
>>
>> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
>>
>>If you try tolower(-10), then the element referenced is not within
>>the specified array.
>
>Then why not change it to:
>#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
>This will make sure that you cannot get outside the boundaries of the
>lookup table.
>
It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,

This is exactly what standard says.
and this is closest to a definition I can get)

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

*Yes*. Then why do you need a unsigned char cast?

You don't give a value that toupper/tolower accepts (e.g. a negative
integer), you will get an undefined behavior with *that*
implementation.

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

The ctype function output with unsigned char cast arguments
is reasonable, especially if you consider that fputc and
functions described in terms of fputc, like putchar,
use the value of their arguments converted to unsigned char.

--
pete

Nov 15 '05 #15

Douglas A. Gwyn

aegis wrote:

Why should something such as:
tolower(-10); invoke undefined behavior?
It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

We discussed this not very long ago.

The obvious implementation is:
#define tolower(c) (__lowercase[(c)+1])
and if arbitrary integer values had to be accommodated
(large positive is also a problem), the table would be
far larger than necessary, for no benefit whatever for
correct programs. An alternative would be to use a
function call, with an explicit range check and then a
table look-up, which is much slower than the above.
That's the kind of trade-off that C is generally
unwilling to make, although it may be appropriate for
a more baby-proof PL.
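
For comparison, the range-checked alternative might look something like the sketch below. checked_tolower and checked_tolower_demo are hypothetical names, and passing out-of-range values through unchanged is only one possible policy, chosen here for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Returns c unchanged when it is outside the domain of tolower(),
   instead of invoking undefined behavior. */
int checked_tolower(int c)
{
    if (c != EOF && (c < 0 || c > UCHAR_MAX))
        return c;                /* out of range: pass through */
    return tolower(c);
}

/* Usage: values outside the domain come back untouched. */
void checked_tolower_demo(void)
{
    assert(checked_tolower('A') == 'a');
    assert(checked_tolower(-10) == -10);
    assert(checked_tolower(EOF) == EOF);
}
```

Every call pays for the comparisons, including the calls that were already correct, which is the trade-off described above.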

Nov 15 '05 #16

kuyper

Krishanu Debnath wrote:
....

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

There's no such thing as "wrong output" when the behavior is undefined.
In the C standard, "undefined behavior" means behavior for which the C
standard provides no definition. None. Not any. Whatsoever. Of any
kind. In particular, the C standard doesn't define the behavior in any
way which prohibits producing the result his unsigned char cast would
produce.

Nov 15 '05 #17

Douglas A. Gwyn

ku****@wizard.net wrote:

Krishanu Debnath wrote:
You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

There's no such thing as "wrong output" when the behavior is undefined.

I think he meant that the programmer is defining the behavior,
but that the defined behavior might not make sense. Note that
the original example (negative int values) didn't make sense
either.

I think the only valid concern is that tolower(char_type) might
be invoked mistakenly, for some negative (char) value. This
won't happen for the basic character set, nor for the most
common codesets for *defined* character codes, but could happen
on some platforms if random garbage values are passed to
tolower(). In practice this could occur when the character
codes come from a hostile user, for example. The most likely
actual risk is denial of service due to crashing the process
with an illegal memory reference.

The "more secure library" TR under current development by WG14
is meant to provide a "drop-in" (easy automated editing) way to
catch such abuses in existing, not-so-carefully-constructed
applications. The alternative is to do a better job in the
original design and coding.

Nov 15 '05 #18

pete

Douglas A. Gwyn wrote:

ku****@wizard.net wrote:
Krishanu Debnath wrote:
You are changing an undefined behavior
to a **wrong output** with the
unsigned char cast.

There's no such thing as "wrong output"
when the behavior is undefined.

I think he meant that the programmer is defining the behavior,
but that the defined behavior might not make sense.

But it does make sense.
If you have a negative integer value like:
('A' - 1 - (unsigned char)-1)

then
putchar('A' - 1 - (unsigned char)-1)
returns 'A'.

and
tolower((unsigned char)('A' - 1 - (unsigned char)-1))
returns 'a'
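
The arithmetic can be checked with a short sketch. as_written and wrap_demo are hypothetical names, and UCHAR_MAX stands in for (unsigned char)-1, which has the same value.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Converts any int to the unsigned char value that fputc()/putchar()
   would write for it. */
int as_written(int c)
{
    return (unsigned char)c;
}

void wrap_demo(void)
{
    int neg = 'A' - 1 - UCHAR_MAX;     /* the negative value from the example */

    assert(neg < 0);
    assert(as_written(neg) == 'A');    /* wraps modulo UCHAR_MAX + 1 */
    assert(tolower(as_written(neg)) == 'a');
}
```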

--
pete

Nov 15 '05 #19

Antoine Leca

En <news:42***************@null.net>, Douglas A. Gwyn va escriure:

I think the only valid concern is that tolower(char_type) might
be invoked mistakenly, for some negative (char) value. This
won't happen for the basic character set,

Agreed.

nor for the most common codesets for *defined* character codes,

Disagree.

One side of the problem is the definition of character set. Due to:
1) the overcrowded state of the 000-0177 range in ASCII
2) the wide use of 8-bit bytes
many if not all extended character sets these days (usable in char and
compatible with the basic character set of the architecture) define
characters in the 08/00-15/15 range, that is, with the 8th bit set.

On the other hand, for various reasons, not all compilers/implementations
that allow use of these extended character sets switch char to be an
unsigned type. Of course, when the basic character set is EBCDIC, this is
required. But the standard is written in a way that allows using e.g.
iso-8859-1 as the character set while having SCHAR_MAX==127 (and in fact
this is a very frequent setup in Western Europe.)
And in such a case, 'ä' is negative... (and is different from the result of
getc() if ä is in the stream :-( )
Which leads to a whole set of complications involving many uses of unsigned
char casts.
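
One of those casts, sketched with a hypothetical helper (matches_stream_byte): comparing a stored char against a value that came from getc().

```c
#include <assert.h>
#include <stdio.h>

/* Compares a stored char against a value as returned by getc(). */
int matches_stream_byte(char stored, int from_getc)
{
    assert(from_getc == EOF || from_getc >= 0);
    if (from_getc == EOF)
        return 0;
    /* getc() yields the byte as an unsigned char converted to int, so
       bring the stored char into the same range before comparing. */
    return (unsigned char)stored == from_getc;
}
```

A plain stored == from_getc would compare a negative value against 228 for ä in iso-8859-1 when plain char is signed, and so would never match.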

As a result, I agree that a correctly programmed application should not fall
into the trap (and a common test here in Europe is to input ÿ to see how the
tested app reacts... 'ÿ' is -1 in the iso-8859-1 codeset); but it is fairly
easy to be trapped, particularly when the application is ported.

but could happen on some platforms if random garbage values
are passed to tolower().

As I wrote, not only random garbage but also perfectly valid inputs to some
imperfect programs.

In practice this could occur when the character codes come
from a hostile user, for example.

Of course this leads to a risk, as you describe.
But I do not like the idea that what is genuinely a bug would be corrected
not because it harms anybody except the Americans/English-speaking people,
but only because some hostile hackers could turn it into a weapon...

;-) in case you missed it.

The "more secure library" TR under current development by WG14

Didn't it change its name to "safer"?
(http://www.open-std.org/JTC1/SC22/WG...docs/n1114.htm)

BTW, the "safer" library goes quite a bit further than tagging use of
negative value to tolower(). You can have some overview by reading
http://msdn.microsoft.com/library/8ef0s5kh.aspx or
http://msdn2.microsoft.com/library/wd3wzwts.aspx (MS is the sponsor of this
TR, so its implementation leads.)
In a nutshell, /many/ functions of the standard library are superseded, and
this may need a significant effort to bring an existing tree up to par.

Antoine

Nov 15 '05 #20

Chris Croughton

On Tue, 9 Aug 2005 18:40:07 GMT, Douglas A. Gwyn
<DA****@null.net> wrote:

I think the only valid concern is that tolower(char_type) might
be invoked mistakenly, for some negative (char) value. This
won't happen for the basic character set,

Correct.

nor for the most
common codesets for *defined* character codes,

Incorrect. The most common character sets in western Europe are the
ISO-8859-x ones (ISO-8859-1 is commonly known as Latin-1; Microsoft's
Windows character sets for English-speaking versions are largely based
on that). They define characters with the top bit of the char set.

but could happen on some platforms if random garbage values are passed
to tolower().

Or perfectly valid national characters, in many cases with a single
keystroke on a national keyboard.

In practice this could occur when the character
codes come from a hostile user, for example.

They don't have to be hostile -- nor non-English-speaking. Shift-3 on a
UK keyboard (we speak English in the UK, mostly) is the British pound
sign (looks like a stylised L with a line through it), and that is value
0xA3 (163 unsigned, -93 signed). It's very likely to be typed by a user
in a text field or document.

The most likely
actual risk is denial of service due to crashing the process
with an illegal memory reference.

With potential loss of data and revenue as high as you can imagine.

The "more secure library" TR under current development by WG14

Where can I find that? It's mentioned on the JTC1/SC22/WG14-C page[1]
as link "TR 24731: Programming language C - Specification for secure C
library functions", but going to that link[2] doesn't mention it (it
does mention and provide links to the other TRs in progress).

[1] http://www.open-std.org/jtc1/sc22/wg14/
[2] http://www.open-std.org/jtc1/sc22/wg...projects#24731

is meant to provide a "drop-in" (easy automated editing) way to
catch such abuses in existing, not-so-carefully-constructed
applications. The alternative is to do a better job in the
original design and coding.

A better alternative would be to (a) make plain char unsigned (some very
few non-conforming programs might have problems) or (b) extend the range
of the ctype.h functions and macros to include the range CHAR_MIN to -1
(which would waste all of 128 bytes on some systems and otherwise hurt
no one).

One could, of course, use unsigned char explicitly for all arrays -- and
then lose all of the functions in string.h (or have to cast for every
use) because they rightly cause diagnostics if called with a pointer to
unsigned char. Or use type punning or multiple pointers to the same
object, both of which are unsafe. All to get round a design flaw which
'saves' all of 128 bytes typically.

Chris C

Nov 15 '05 #21

Tim Rentsch

"aegis" <ae***@mad.scientist.com> writes:

7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something with how tolower can be implemented,
but I can't think of anything concrete.

Questions that ask "why" are often interesting questions.
Here there are several different answers, depending on what
kind of "why" is meant here.

First answer: to give freedom to implementations. Saying
that calling 'tolower' on arguments outside its range results
in undefined behavior gives implementations complete latitude
to do whatever they choose to in such situations. To say
this another way: to impose a specification that is minimal.

Second answer: because it's implementationally convenient.
Other people have commented on this aspect (with array
access, etc), so I don't think I need to say any more about
that.

Third answer: it's in keeping with "the spirit of C." As
the Rationale document says, C programmers expect things
to work when they do the right thing, but don't necessarily
expect any "safety net" when they do the wrong thing. The
definitions of tolower and the other <ctype.h> functions are
consistent with this philosophy.

Nov 15 '05 #22
