
Are there any GENERIC MACROS in C for INTEGERS, CHARACTERS?

I want to use the strrchr(source_string, last_char) function from the
string.h header file to find the last occurrence of a NON-SPACE
alphanumeric character.

Then I will put a null character ('\0') at the position one past the
one returned by strrchr().

This is my idea for trimming a string from the right.

According to the definition of strrchr, I am supposed to provide the
string and the specific character whose last occurrence is to be
found. But I want to devise a way to find the last occurrence of a
"GENERIC ALPHANUMERIC CHARACTER" rather than any specific character.

Do we have any GENERIC predefined macros in C for INTEGERS,
CHARACTERS, etc.?

Is there any better and faster solution for right-trimming a big
string?

Thanks in Advance.

Regards,
shamdurgs
Nov 14 '05 #1
35 Replies


You can use the functions in "ctype.h" for that:
There's a function called "isalnum" which takes a char as input and
returns an int. It returns non-zero if the char is alphanumeric and zero
otherwise.

Hope this helps..
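
As a rough sketch of how isalnum could be applied to the original
question, the hypothetical helper below (last_alnum is not a standard
function, just a name chosen for illustration) scans backwards for the
last alphanumeric character. The cast to unsigned char is a precaution
for platforms where plain char is signed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the last alphanumeric
   character in s, or NULL if s contains none. */
char *last_alnum(char *s)
{
    size_t i = strlen(s);

    while (i > 0) {
        --i;
        if (isalnum((unsigned char)s[i]))
            return &s[i];
    }
    return NULL;
}
```

To trim, one could then store '\0' one position past the returned
pointer, or at s[0] when NULL comes back.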

Nov 14 '05 #2

Durgesh Sharma wrote:
[ snip ]


Use mine ..

#include <ctype.h>

char * rtrim(char *str) {
    char *s, *p; int c;
    s = p = str;
    while ((c = *s++)) if (!isspace(c)) p = s;
    *p = '\0';
    return str;
}

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #3

Bas Wassink wrote:
You can use the functions in "ctype.h" for that:
There's a function called "isalnum" which takes a char as input and


Actually, like all the functions prototyped in <ctype.h>, isalnum
takes an int, not a char.

The Standard says:

4.3.1.1 The isalnum function

Synopsis

#include <ctype.h>
int isalnum(int c);

Description

The isalnum function tests for any character for which isalpha or
isdigit is true.
Nov 14 '05 #4

Joe Wright wrote:
Durgesh Sharma wrote:
[ snip ]

Use mine ..

char * rtrim(char *str) {
char *s, *p; int c;
s = p = str;
while ((c = *s++)) if (!isspace(c)) p = s;


while ((c = *s++)) if (!isspace((unsigned char)c)) p = s;

.... or else you can be in deep trouble if `char' is signed.
*p = '\0';
return str;
}
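
Combining the quoted function with the suggested cast gives roughly the
following. This is only a sketch, reorganized slightly for readability,
and is not claimed to be either poster's final code.

```c
#include <ctype.h>

/* Removes trailing white space from str in place and returns str. */
char *rtrim(char *str)
{
    char *s = str;
    char *p = str;   /* one past the last non-space character seen */

    while (*s != '\0') {
        /* Cast so that a negative char value is never passed on. */
        if (!isspace((unsigned char)*s))
            p = s + 1;
        s++;
    }
    *p = '\0';
    return str;
}
```

The string is walked once from the left, so the cost is proportional to
its length however much trailing white space there is.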


--
Eric Sosman
es*****@acm-dot-org.invalid
Nov 14 '05 #5


"Eric Sosman" <es*****@acm-dot-org.invalid> wrote
while ((c = *s++)) if (!isspace(c)) p = s;


while ((c = *s++)) if (!isspace((unsigned char)c)) p = s;

... or else you can be in deep trouble if `char' is signed.

Personally I regard this as a bug in the standard. It is unacceptable that
the former code is not conforming.
Nov 14 '05 #6

Eric Sosman wrote:
Joe Wright wrote:

[ snip ]

Use mine ..

char * rtrim(char *str) {
char *s, *p; int c;
s = p = str;
while ((c = *s++)) if (!isspace(c)) p = s;

while ((c = *s++)) if (!isspace((unsigned char)c)) p = s;

... or else you can be in deep trouble if `char' is signed.
*p = '\0';
return str;
}


From N869 ..

#include <ctype.h>
int isspace(int c);

Description

[#2] The isspace function tests for any character that is a
standard white-space character or is one of a locale-
specific set of characters for which isalnum is false. The
standard white-space characters are the following: space
(' '), form feed ('\f'), new-line ('\n'), carriage return
('\r'), horizontal tab ('\t'), and vertical tab ('\v'). In
the "C" locale, isspace returns true only for the standard
white-space characters.

The descriptions of the ctype functions all take int values. I know
that char is converted to int in this case and that if char is
signed and negative, the result is probably a negative int.

So what? Clearly -50 is not space or form feed, tab, etc. and the
expression (isspace(-50) == 0) is true.

What is the case for casting this otherwise negative int to unsigned
char? What 'deep trouble' could happen if I didn't? Why wouldn't the
function be written so as to take any int as advertised?

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #7

Joe Wright wrote:
Eric Sosman wrote:
Joe Wright wrote:
while ((c = *s++)) if (!isspace(c)) p = s;
while ((c = *s++)) if (!isspace((unsigned char)c)) p = s;

... or else you can be in deep trouble if `char' is signed.


From N869 ..

#include <ctype.h>
int isspace(int c);

Description

[#2] The isspace function tests for any character that is a
standard white-space character or is one of a locale-
specific set of characters for which isalnum is false. The
standard white-space characters are the following: space
(' '), form feed ('\f'), new-line ('\n'), carriage return
('\r'), horizontal tab ('\t'), and vertical tab ('\v'). In
the "C" locale, isspace returns true only for the standard
white-space characters.

The descriptions of the ctype functions all take int values. I know that
char is converted to int in this case and that if char is signed and
negative, the result is probably a negative int.


... but they don't take "just any" int values; the
argument must be in a restricted range. 7.4, paragraph 1
(I don't have N869 so this is from ISO/IEC 9899:1999,
which is very nearly as good):

"In all cases the argument is an int, the value of
which shall be representable as an unsigned char or
shall equal the value of the macro EOF. If the
argument has any other value, the behavior is
undefined."
So what? Clearly -50 is not space or form feed, tab, etc. and the
expression (isspace(-50) == 0) is true.
isspace(-50) produces undefined behavior unless EOF==-50.
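
One way to stay inside the permitted range is to convert at the call
site. The wrapper below is a minimal sketch; is_space_char is a
hypothetical name, not part of <ctype.h>.

```c
#include <ctype.h>

/* Hypothetical wrapper: converts a plain char to a value that is
   representable as unsigned char before calling isspace.
   Returns 1 for white space and 0 otherwise. */
int is_space_char(char ch)
{
    return isspace((unsigned char)ch) != 0;
}
```

With this, a char holding a value such as -50 reaches isspace as a
non-negative int, so the call is well defined.
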
What is the case for casting this otherwise negative int to unsigned
char? What 'deep trouble' could happen if I didn't? Why wouldn't the
function be written so as to take any int as advertised?


Well, "deep trouble" may have been an overstatement on my
part. Undefined behavior, by its very undefinedness, can be
beneficial rather than harmful. Who knows? The experience of
having demons fly out of your nose may be pleasant. ;-)

As to why the functions require a restricted range, I can
think of two likely reasons:

- For speed, the functions are frequently implemented as
macros that do simple array references. isspace() and its
kin just take the argument value, subtract EOF, and use the
difference as an index to an array containing the precomputed
answer. If the argument range were unrestricted, you'd need
an array with INT_MAX-INT_MIN+1 elements, which even with
today's enormous memories would be excessive. A range check
could be introduced, but this is difficult to do in a macro.

- Even with a different implementation strategy you face an
ambiguity when the argument equals EOF: Is it end-of-file or
a legitimate character (e.g., 0xFF on many systems)? Given
the value alone there is no way to tell. The Standard requires
that the legitimate characters be passed as non-negative values
so they can be distinguished from the negative value EOF.
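
The table-lookup strategy in the first point might look roughly like
this. It is a toy sketch, not any real library's implementation: the
names toy_space_table, toy_init and TOY_ISSPACE are made up, and it
assumes EOF == -1 so that the index is simply the argument plus one.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* One slot for EOF plus one per unsigned char value.
   Assumption: EOF == -1, so index (c + 1) is never negative
   for a permitted argument. */
static unsigned char toy_space_table[UCHAR_MAX + 2];

/* Mark the six standard white-space characters ("C" locale). */
static void toy_init(void)
{
    static const char spaces[] = " \f\n\r\t\v";
    const char *p;

    for (p = spaces; *p != '\0'; p++)
        toy_space_table[(unsigned char)*p + 1] = 1;
}

/* No range check: an argument such as -50 indexes outside the array. */
#define TOY_ISSPACE(c) (toy_space_table[(c) + 1])
```

The macro evaluates its argument exactly once and does no comparison,
which is where both the speed and the restricted range come from.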

IMHO this is one of those unpleasant little corners in the
language. It seems to me things would have been simpler had
`char' been synonymous with `unsigned char' right from the
start. However, machines disagree on just what should happen
when a byte is fetched from memory into a wider CPU register
for further manipulation: Some machines widen by sign-extending,
some by zero-extending, and some by leaving the pre-existing
high-order register contents unchanged. Requiring `unsigned char'
on all these types of machines (and on others I haven't thought
of) would have imposed a burden of extra instructions on at
least some of them.

And even a universal `unsigned char' would be no panacea.
I have heard tell of machines with 32-bit characters and 32-bit
integers, and I imagine the proper choice of an EOF value on such
machines must involve ugly compromises.

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 14 '05 #8

Eric Sosman wrote:
Joe Wright wrote:
Eric Sosman wrote:
Joe Wright wrote:

while ((c = *s++)) if (!isspace(c)) p = s;
while ((c = *s++)) if (!isspace((unsigned char)c)) p = s;

... or else you can be in deep trouble if `char' is signed.

From N869 ..

#include <ctype.h>
int isspace(int c);

Description

[#2] The isspace function tests for any character that is a
standard white-space character or is one of a locale-
specific set of characters for which isalnum is false. The
standard white-space characters are the following: space
(' '), form feed ('\f'), new-line ('\n'), carriage return
('\r'), horizontal tab ('\t'), and vertical tab ('\v'). In
the "C" locale, isspace returns true only for the standard
white-space characters.

The descriptions of the ctype functions all take int values. I know
that char is converted to int in this case and that if char is signed
and negative, the result is probably a negative int.

... but they don't take "just any" int values; the
argument must be in a restricted range. 7.4, paragraph 1
(I don't have N869 so this is from ISO/IEC 9899:1999,
which is very nearly as good):

"In all cases the argument is an int, the value of
which shall be representable as an unsigned char or
shall equal the value of the macro EOF. If the
argument has any other value, the behavior is
undefined."
So what? Clearly -50 is not space or form feed, tab, etc. and the
expression (isspace(-50) == 0) is true.

isspace(-50) produces undefined behavior unless EOF==-50.


What is EOF for in this context? I'm not overly afraid of 'Undefined
Behavior'. isspace(c) is required to return 0 if c (now converted to
int) is not among the 'space' characters. Clearly EOF is not among
the 'space' characters and so 0 must be the result. Right?
What is the case for casting this otherwise negative int to unsigned
char? What 'deep trouble' could happen if I didn't? Why wouldn't the
function be written so as to take any int as advertised?

Well, "deep trouble" may have been an overstatement on my
part. Undefined behavior, by its very undefinedness, can be
beneficial rather than harmful. Who knows? The experience of
having demons fly out of your nose may be pleasant. ;-)

As to why the functions require a restricted range, I can
think of two likely reasons:

- For speed, the functions are frequently implemented as
macros that do simple array references. isspace() and its
kin just take the argument value, subtract EOF, and use the
difference as an index to an array containing the precomputed
answer. If the argument range were unrestricted, you'd need
an array with INT_MAX-INT_MIN+1 elements, which even with
today's enormous memories would be excessive. A range check
could be introduced, but this is difficult to do in a macro.


No, you don't. EOF is a non-event (must return 0) and (c && 0xff)
will give you the index into a 256-byte array of answers to the
questions.
- Even with a different implementation strategy you face an
ambiguity when the argument equals EOF: Is it end-of-file or
a legitimate character (e.g., 0xFF on many systems)? Given
the value alone there is no way to tell. The Standard requires
that the legitimate characters be passed as non-negative values
so they can be distinguished from the negative value EOF.

The Standard requirements for non-negative notwithstanding, having
checked the value for EOF and finding that it is not, mask the value
with 0xff and carry on. Surely.
IMHO this is one of those unpleasant little corners in the
language. It seems to me things would have been simpler had
`char' been synonymous with `unsigned char' right from the
start. However, machines disagree on just what should happen
when a byte is fetched from memory into a wider CPU register
for further manipulation: Some machines widen by sign-extending,
some by zero-extending, and some by leaving the pre-existing
high-order register contents unchanged. Requiring `unsigned char'
on all these types of machines (and on others I haven't thought
of) would have imposed a burden of extra instructions on at
least some of them.

The Standard's mention of 'unsigned char' in this context is
unfortunate. We are talking about values of an int.
And even a universal `unsigned char' would be no panacea.
I have heard tell of machines with 32-bit characters and 32-bit
integers, and I imagine the proper choice of an EOF value on such
machines must involve ugly compromises.


I think it's a question of domains within a range. For 32-bit
unsigned integers, the range of values is 0..4,294,967,295. NULL
defined as 0 is within the domain of pointers and EOF as -1 is
outside the domain of characters. Good choices.

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #9

Joe Wright wrote:
Eric Sosman wrote:
Joe Wright wrote:
[...]
The descriptions of the ctype functions all take int values. I know
that char is converted to int in this case and that if char is signed
and negative, the result is probably a negative int.
... but they don't take "just any" int values; the
argument must be in a restricted range. 7.4, paragraph 1
(I don't have N869 so this is from ISO/IEC 9899:1999,
which is very nearly as good):

"In all cases the argument is an int, the value of
which shall be representable as an unsigned char or
shall equal the value of the macro EOF. If the
argument has any other value, the behavior is
undefined."
So what? Clearly -50 is not space or form feed, tab, etc. and the
expression (isspace(-50) == 0) is true.


isspace(-50) produces undefined behavior unless EOF==-50.


What is EOF for in this context?


EOF is a macro defined in <stdio.h>. Its expansion is
a negative integer constant (usually -1, although the Standard
does not require this). Various I/O functions return EOF to
indicate that something unusual (e.g., end-of-file or I/O
error) has happened.

The <ctype.h> functions accept EOF as an argument value
in addition to all the (non-negative) values of legitimate
characters, presumably because somebody once thought it would
be convenient to do things like

    int ch;
    /* skip leading spaces */
    while (isspace(ch = getchar()))
        ;
    if (ch == EOF)
        /* end-of-file or error */ ;
    else
        /* found a non-space character */ ;

If isspace() didn't accept EOF, you'd need to write

    int ch;
    /* skip leading spaces */
    while ((ch = getchar()) != EOF) {
        if (! isspace(ch))
            break;
    }
    if (ch == EOF)
        /* end-of-file or error */ ;
    else
        /* found a non-space character */ ;

Observe that this loop makes two tests per character instead
of the first form's single test. The original inventors of
<ctype.h> were, I guess, offended by the inefficiency of a
two-test loop and saw a way to define the functions so as to
eliminate half the testing. In hindsight, it looks like this
worship of The Little Tin God may have been misplaced -- but
the ANSI committee was asked to codify existing practice, and
they took the bitter with the sweet.
I'm not overly afraid of 'Undefined
Behavior'.
You need not be "overly afraid," just "afraid enough."
isspace(c) is required to return 0 if c (now converted to
int) is not among the 'space' characters.
... and if c is among the permitted values.
Clearly EOF is not among the
'space' characters and so 0 must be the result. Right?
Right.
- For speed, the functions are frequently implemented as
macros that do simple array references. isspace() and its
kin just take the argument value, subtract EOF, and use the
difference as an index to an array containing the precomputed
answer. If the argument range were unrestricted, you'd need
an array with INT_MAX-INT_MIN+1 elements, which even with
today's enormous memories would be excessive. A range check
could be introduced, but this is difficult to do in a macro.


No, you don't. EOF is a non-event (must return 0) and (c && 0xff) will
give you the index into a 256-byte array of answers to the questions.


I don't understand what you mean by "a non-event." You
are right that isspace(EOF) must return zero, but it does not
follow that isspace(negative_value_not_equal_to_EOF) must
return zero, or must even return at all.

Also, take another look at your `c && 0xff' (by which I
imagine you actually meant `c & 0xff'). Let's assume, as you
apparently have, a system with eight-bit characters and two's
complement arithmetic. Let's further assume EOF == -1, which
is the case for most implementations. Then `EOF & 0xff' gives
the value 255 -- but 255 is the code for some perfectly valid
character. If the current locale considers that character as
a space (or as an XXXX for the isXXXX() function), you have the
conflicting requirement that isXXXX(EOF) must return zero but
isXXXX(255) must return non-zero. If the function's first step
is to convert EOF to 255, the distinction can no longer be made.
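
The collision can be checked with a few lines. This is a sketch under
the same assumptions as the paragraph above (eight-bit characters,
two's complement, EOF == -1); masked_index and eof_collides_with_255
are hypothetical names.

```c
#include <stdio.h>  /* EOF */

/* Index computed by masking, as proposed in the quoted text. */
static int masked_index(int c)
{
    return c & 0xff;
}

/* Returns 1 if masking makes EOF indistinguishable from the
   character whose code is 255, and 0 otherwise. */
static int eof_collides_with_255(void)
{
    return masked_index(EOF) == masked_index(255);
}
```
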
The Standard requirements for non-negative notwithstanding, having
checked the value for EOF and finding that it is not, mask the value
with 0xff and carry on. Surely.
That would work (on a two's complement eight-bit system).
It is possible that `(unsigned char)c' does exactly this
masking. However, the cast will work on all systems while
your mask will work on only some. Also, on systems where
char is already unsigned, the cast presumably compiles to
a no-op while your mask generates unnecessary code. All
in all, the cast wins on both portability and efficiency.
IMHO this is one of those unpleasant little corners in the
language. It seems to me things would have been simpler had
`char' been synonymous with `unsigned char' right from the
start. However, machines disagree on just what should happen
when a byte is fetched from memory into a wider CPU register
for further manipulation: Some machines widen by sign-extending,
some by zero-extending, and some by leaving the pre-existing
high-order register contents unchanged. Requiring `unsigned char'
on all these types of machines (and on others I haven't thought
of) would have imposed a burden of extra instructions on at
least some of them.


The Standard's mention of 'unsigned char' in this context is
unfortunate. We are talking about values of an int.


Again, I'm not sure what you mean. By "unfortunate" do you
mean "The Standard is wrong," or do you mean "It's too bad the
pre-Standard <ctype.h> worked this way so the Standard had to
adopt it?"

Note, too, that the int values in question are, specifically,
the value of EOF and the values of unsigned char.
I think it's a question of domains within a range. For 32-bit unsigned
integers, the range of values is 0..4,294,967,295. NULL defined as 0 is
within the domain of pointers and EOF as -1 is outside the domain of
characters. Good choices.


For the third time, I fail to understand what you are trying
to say -- but this time, I can't even begin to puzzle it out.

--
Eric Sosman
es*****@acm-dot-org.invalid
Nov 14 '05 #10

Eric Sosman wrote:
Joe Wright wrote:
Eric Sosman wrote:
Joe Wright wrote:

[...]

OK, the function prototype looks like ..

int isspace(int c);

... and is described as returning non-zero if c is among the 'space'
characters, and 0 otherwise.

I can read. The Standard's requirement that c, if not EOF be
positive in the range of unsigned char is unnecessary. If the value
of c is not one of 'white-space' characters the function must return
0. It is onerous to require (unsigned char) cast to c. The Standard
should remove the requirement. P.J. and I can take care of it. :-)

About NULL and EOF.
It is interesting and useful to have a pointer value which can be
known to be 'invalid'. What would you pick that value to be? A
pointer value is usually a memory address and we can't know how much
memory the host machine will have. Zero is a good choice.

If we need an EOF value outside the domain of any character, -1 is
perfect.

I mean that within the range of 32-bits, NULL is in the pointer
domain, as it should be, and EOF is out of the character domain as
it should be.

Whether you agree or not, surely you understand. ?
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #11

Joe Wright <jo********@comcast.net> writes:
[...]
OK, the function prototype looks like ..

int isspace(int c);

.. and is described as returning non-zero if c is among the 'space'
characters, and 0 otherwise.

I can read. The Standard's requirement that c, if not EOF be positive
in the range of unsigned char is unnecessary. If the value of c is not
one of 'white-space' characters the function must return 0. It is
onerous to require (unsigned char) cast to c. The Standard should
remove the requirement. P.J. and I can take care of it. :-)


isspace() invokes undefined behavior for arguments other than EOF and
values within the range of unsigned char. This allows it to be
implemented as a simple array lookup (after adding 1 to the argument,
assuming EOF==-1). Take a look at your system's <ctype.h> header
(assuming it's implemented as a file).

Requiring isspace() to return 0 for all other arguments, rather than
invoking undefined behavior, would require an additional test before the
array indexing operation. This would hurt performance (marginally)
for all the existing programs that use isspace() properly. The only
benefit would be avoidance of undefined behavior for programs passing
nonsensical values to isspace().
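
For comparison, the extra test could be written as a wrapper in user
code. This is only a sketch of the idea; checked_isspace is a
hypothetical name, and nothing in the Standard provides it.

```c
#include <ctype.h>
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* Hypothetical wrapper: returns 0 for any argument outside the
   permitted range instead of invoking undefined behavior. */
int checked_isspace(int c)
{
    if (c != EOF && (c < 0 || c > UCHAR_MAX))
        return 0;          /* the additional test */
    return isspace(c);
}
```

Every call now pays for the comparison, including the calls that
already passed a valid argument.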

On the other hand, it would make more sense (IMHO) for isspace() to
take an argument of type char, and to drop the wording about EOF. But
isspace() was designed before the invention of prototypes, so a char
argument was promoted to int anyway. And making this kind of change
now would break existing code.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #12

Keith Thompson <ks***@mib.org> wrote:
On the other hand, it would make more sense (IMHO) for isspace() to
take an argument of type char, and to drop the wording about EOF. But
isspace() was designed before the invention of prototypes, so a char
argument was promoted to int anyway. And making this kind of change
now would break existing code.


There's another reason: isspace() and friends take the same kind of
argument (int, with a value of unsigned char or EOF) that getchar()
returns. I can't help but think that this is intentional. It can
certainly be very useful.
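
A small sketch of that pairing: the value returned by fgetc (or
getchar) is already an unsigned char value or EOF, so it can go to
isspace without a cast. The function name count_spaces is made up for
illustration.

```c
#include <ctype.h>
#include <stdio.h>

/* Counts the white-space characters remaining in the stream fp. */
long count_spaces(FILE *fp)
{
    long n = 0;
    int ch;   /* int, not char, so that EOF stays distinguishable */

    while ((ch = fgetc(fp)) != EOF)
        if (isspace(ch))
            n++;
    return n;
}
```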

Richard
Nov 14 '05 #13

rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
Keith Thompson <ks***@mib.org> wrote:
On the other hand, it would make more sense (IMHO) for isspace() to
take an argument of type char, and to drop the wording about EOF. But
isspace() was designed before the invention of prototypes, so a char
argument was promoted to int anyway. And making this kind of change
now would break existing code.


There's another reason: isspace() and friends take the same kind of
argument (int, with a value of unsigned char or EOF) that getchar()
returns. I can't help but think that this is intentional. It can
certainly be very useful.


The only case I can think of where it makes a real difference is
isspace(EOF), which I don't find particularly useful. (And of course
all this applies equally to the rest of the is*() functions.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #14

Keith Thompson wrote:

isspace() invokes undefined behavior for arguments other than EOF and
values within the range of unsigned char. This allows it to be
implemented as a simple array lookup (after adding 1 to the argument,
assuming EOF==-1). Take a look at your system's <ctype.h> header
(assuming it's implemented as a file).


A "tolerant" implementation on a system where `char'
is signed might define EOF as CHAR_MIN-1 instead of the
traditional -1. It could then duplicate half of each
array used by the <ctype.h> functions so as to deliver
the right answer even when handed an unconverted (and
possibly negative) `char' value as an argument. For
example, on a system with 8-bit `char' and two's
complement arithmetic,

array[0] : value for EOF (-129)
array[1-128] : values for 0x80,0x81,...,0xFF
array[129-256] : values for 0x00,0x01,...,0x7F
array[257-384] : values for 0x80,0x81,...,0xFF

The Standard does not require this, but it might be
a "friendly gesture" on machines with character sets that
are not too large. Is anyone aware of an implementation
that uses such a trick?

--
Er*********@sun.com

Nov 14 '05 #15

P: n/a
Eric Sosman <er*********@sun.com> writes:
Keith Thompson wrote:
isspace() invokes undefined behavior for arguments other than EOF and
values within the range of unsigned char. This allows it to be
implemented as a simple array lookup (after adding 1 to the argument,
assuming EOF==-1). Take a look at your system's <ctype.h> header
(assuming it's implemented as a file).


A "tolerant" implementation on a system where `char'
is signed might define EOF as CHAR_MIN-1 instead of the
traditional -1. It could then duplicate half of each
array used by the <ctype.h> functions so as to deliver
the right answer even when handed an unconverted (and
possibly negative) `char' value as an argument. For
example, on a system with 8-bit `char' and two's
complement arithmetic,

array[0] : value for EOF (-129)
array[1-128] : values for 0x80,0x81,...,0xFF
array[129-256] : values for 0x00,0x01,...,0x7F
array[257-384] : values for 0x80,0x81,...,0xFF

The Standard does not require this, but it might be
a "friendly gesture" on machines with character sets that
are not too large. Is anyone aware of an implementation
that uses such a trick?


I don't know, but I mistrust such "friendly" gestures. Generating
meaningful results for code that will break on other implementations
isn't what I call friendly. It makes it more difficult to detect
non-portable code.

If the <ctype.h> facility could be redesigned *in the standard* to be
less error-prone, that would be fine. The problem is when a single
implementation does this.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #16

P: n/a
Keith Thompson wrote:
Joe Wright <jo********@comcast.net> writes:
[...]
OK, the function prototype looks like ..

int isspace(int c);

.. and is described as returning 0 unless c is among the 'space'
characters, otherwise non-zero.

I can read. The Standard's requirement that c, if not EOF be positive
in the range of unsigned char is unnecessary. If the value of c is not
one of 'white-space' characters the function must return 0. It is
onerous to require (unsigned char) cast to c. The Standard should
remove the requirement. P.J. and I can take care of it. :-)

isspace() invokes undefined behavior for arguments other than EOF and
values within the range of unsigned char. This allows it to be
implemented as a simple array lookup (after adding 1 to the argument,
assuming EOF==-1). Take a look at your system's <ctype.h> header
(assuming it's implemented as a file).

Thank you for making me do just that.
Requiring isspace() to return 0 for all other arguments, rather than
invoking undefined behavior, would require an additional test before the
array indexing operation. This would hurt performance (marginally)
for all the existing programs that use isspace() properly. The only
benefit would be avoidance of undefined behavior for programs passing
nonsensical values to isspace().

On the other hand, it would make more sense (IMHO) for isspace() to
take an argument of type char, and to drop the wording about EOF. But
isspace() was designed before the invention of prototypes, so a char
argument was promoted to int anyway. And making this kind of change
now would break existing code.


Now that I think about it (thanks for telling me what to think) I
don't have a problem with the Standard's UB warning.

But, is*(int c) makes it possible to accept EOF outside the
character domain and characters as well. But the macro ..

#define is*(c) (ctype[((c)&255)+1] & *)

... limits the index 0..256 regardless the int value of c.

So, my original point, casting the argument to is*() to unsigned
char serves no purpose. You don't need to do it. Ever.

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #17

P: n/a
Joe Wright <jo********@comcast.net> writes:
[...]
But, is*(int c) makes it possible to accept EOF outside the character
domain and characters as well. But the macro ..

#define is*(c) (ctype[((c)&255)+1] & *)

.. limits the index 0..256 regardless the int value of c.
Sure, if that's the way your implementation defines it.
So, my original point, casting the argument to is*() to unsigned char
serves no purpose. You don't need to do it. Ever.


I think the example we were talking about upthread involved passing
characters extracted from a string to one of the is*() functions.
Here's a little program I just threw together. It's intended to print
the number of digits in each of its command-line arguments.

#include <stdio.h>
#include <ctype.h>

static int count_digits(char *s)
{
    int result = 0;
    int i;
    for (i = 0; s[i] != '\0'; i ++) {
        if (isdigit(s[i])) { /* PROBLEM HERE */
            result ++;
        }
    }
    return result;
}

int main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i ++) {
        printf("\"%s\" has %d digit(s)\n",
               argv[i],
               count_digits(argv[i]));
    }
    return 0;
}

Let's assume CHAR_BIT==8, and plain char is signed. Suppose one of
the arguments contains the character '\xe9' (233 decimal). As a
signed character, its value is -23. isdigit(-23) invokes undefined
behavior. (A given implementation may define isdigit() in such a way
that it doesn't cause any problems, but it's still undefined behavior.)

Changing the condition
isdigit(s[i])
to either
isdigit((unsigned char)s[i])
or
isdigit((unsigned)s[i])
avoids the undefined behavior.
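
As a minimal sketch of the unsigned char form (the name count_digits_safe is invented here for illustration and is not from the program above), the counting loop might be written like this:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count the decimal digits in s.  The cast keeps the argument passed to
   isdigit() within 0..UCHAR_MAX even where plain char is signed. */
int count_digits_safe(const char *s)
{
    int result = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s)) {
            result++;
        }
    }
    return result;
}
```

A string containing the byte 0xE9 can then be scanned without ever handing a negative value to isdigit().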

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #18

P: n/a
Joe Wright wrote:
So, my original point, casting the argument to is*() to unsigned
char serves no purpose. You don't need to do it. Ever.


If you have a negative integer value, like

#define NEG_5 ('5' - 1 - (unsigned char)-1)

then
isdigit((unsigned char)NEG_5)
will return true, as it should, since
putchar(NEG_5)
will return '5'.
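
A small sketch of that arithmetic, assuming UCHAR_MAX is 255 (the function name below is made up for this illustration):

```c
#include <assert.h>
#include <ctype.h>

/* The constant from the post: '5' minus (UCHAR_MAX + 1), a negative int. */
#define NEG_5 ('5' - 1 - (unsigned char)-1)

/* Returns 1 if NEG_5, converted the way putchar() converts its argument,
   is classified as a digit. */
int neg5_is_digit_after_cast(void)
{
    assert(NEG_5 < 0);                    /* the raw value is negative */
    assert((unsigned char)NEG_5 == '5');  /* the conversion wraps it back to '5' */
    return isdigit((unsigned char)NEG_5) != 0;
}
```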

--
pete
Nov 14 '05 #19

P: n/a
Eric Sosman wrote:

Well, "deep trouble" may have been an overstatement on my
part. Undefined behavior, by its very undefinedness, can be
beneficial rather than harmful. Who knows? The experience of
having demons fly out of your nose may be pleasant. ;-)


Well, a succubus is a type of demon. I am just speculating of
course. ;)

--
Thomas.
Nov 14 '05 #20

P: n/a
Keith Thompson <ks***@mib.org> wrote:
rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
Keith Thompson <ks***@mib.org> wrote:
On the other hand, it would make more sense (IMHO) for isspace() to
take an argument of type char, and to drop the wording about EOF. But
isspace() was designed before the invention of prototypes, so a char
argument was promoted to int anyway. And making this kind of change
now would break existing code.


There's another reason: isspace() and friends take the same kind of
argument (int, with a value of unsigned char or EOF) that getchar()
returns. I can't help but think that this is intentional. It can
certainly be very useful.


The only case I can think of where it makes a real difference is
isspace(EOF), which I don't find particularly useful. (And of course
all this applies equally to the rest of the is*() functions.)


It can be useful, for example, in situations like

while (isspace(getchar())) ;

Admittedly, it's rather more useful in things like isalnum(), and even
more so in tolower(fgetc()) (think case-insensitive indexing, for
example). It wouldn't be a good idea, in any case, to give isspace() a
different interface from the other <ctype.h> functions.

Richard
Nov 14 '05 #21

P: n/a
Joe Wright wrote:
[...]
But, is*(int c) makes it possible to accept EOF outside the
character domain and characters as well. But the macro ..

#define is*(c) (ctype[((c)&255)+1] & *)

.. limits the index 0..256 regardless the int value of c.
Actually, it limits the index to 1..255; ctype[0]
will never be used. Are you sure you've transliterated
the macro correctly?
So, my original point, casting the argument to is*() to unsigned
char serves no purpose. You don't need to do it. Ever.


Wrong. R-O-N-G, wrong. See up-thread for detailed
explanations that I don't see any use in repeating here;
if you didn't understand them the first time, you won't
understand them this time either. Just take it on faith:
You're Fire-- er, You're Wrong.

--
Er*********@sun.com

Nov 14 '05 #22

P: n/a
Eric Sosman wrote:
Joe Wright wrote:
[...]
But, is*(int c) makes it possible to accept EOF outside the
character domain and characters as well. But the macro ..

#define is*(c) (ctype[((c)&255)+1] & *)

.. limits the index 0..256 regardless the int value of c.

Actually, it limits the index to 1..255;


Hmmm. "Actually actually," that should be 1..256.
Sorry for the confusion.
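
For anyone who wants to check the 1..256 range, here is a rough sketch of the index expression from the quoted macro. SKETCH_EOF and masked_index are names invented for this illustration; the value -1 is only the traditional choice for EOF, and the results for negative arguments assume two's complement int.

```c
#include <assert.h>

#define SKETCH_EOF (-1)   /* assumed value; the Standard only says EOF is negative */

/* Index computed by the quoted macro: ((c) & 255) + 1. */
int masked_index(int c)
{
    int idx = (c & 255) + 1;

    assert(idx >= 1 && idx <= 256);   /* slot 0 can never be reached */
    return idx;
}
```

With these assumptions masked_index(SKETCH_EOF) and masked_index(255) both give 256, so the mask cannot tell EOF apart from character 255.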

--
Er*********@sun.com

Nov 14 '05 #23

P: n/a
Thomas Stegen wrote:
Eric Sosman wrote:

Well, "deep trouble" may have been an overstatement on my
part. Undefined behavior, by its very undefinedness, can be
beneficial rather than harmful. Who knows? The experience of
having demons fly out of your nose may be pleasant. ;-)


Well, a succubus is a type of demon. I am just speculating of
course. ;)


I always considered it to be a poorly implemented peripheral
communications scheme for micro-computers. :-)

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Nov 14 '05 #24

P: n/a
rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
Keith Thompson <ks***@mib.org> wrote: [...]
The only case I can think of where it makes a real difference is
isspace(EOF), which I don't find particularly useful. (And of course
all this applies equally to the rest of the is*() functions.)


It can be useful, for example, in situations like

while (isspace(getchar())) ;

Admittedly, it's rather more useful in things like isalnum(), and even
more so in tolower(fgetc()) (think case-insensitive indexing, for
example).


I'd prefer to check for EOF before passing the result to isspace(),
but I suppose you could squeeze out a few cycles by doing it in one
fell swoop. De gustibus et cetera.
It wouldn't be a good idea, in any case, to give isspace() a
different interface from the other <ctype.h> functions.


Agreed; I wouldn't suggest such a thing. (I meant isspace() as an
example covering all the is*() functions, a point I could have made
more clearly.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #25

P: n/a
Eric Sosman wrote:
Joe Wright wrote:
[...]
But, is*(int c) makes it possible to accept EOF outside the
character domain and characters as well. But the macro ..

#define is*(c) (ctype[((c)&255)+1] & *)

.. limits the index 0..256 regardless the int value of c.

Actually, it limits the index to 1..255; ctype[0]
will never be used. Are you sure you've transliterated
the macro correctly?

So, my original point, casting the argument to is*() to unsigned
char serves no purpose. You don't need to do it. Ever.

Wrong. R-O-N-G, wrong. See up-thread for detailed
explanations that I don't see any use in repeating here;
if you didn't understand them the first time, you won't
understand them this time either. Just take it on faith:
You're Fire-- er, You're Wrong.


Well yes, I am. I just wrote a little program to prove that I was
Right and the program says that I'm Wrong.

If any of you feel I've wasted your time, I apologize.

I'm going to study the 'problem' and its solution and report back
here, if anyone will still listen to me.

Sorry I was Wrong. It doesn't happen often (I hope).
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #26

P: n/a
Keith Thompson wrote:
rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:

.... snip ...

It can be useful, for example, in situations like

while (isspace(getchar())) ;

Admittedly, it's rather more useful in things like isalnum(),
and even more so in tolower(fgetc()) (think case-insensitive
indexing, for example).


I'd prefer to check for EOF before passing the result to
isspace(), but I suppose you could squeeze out a few cycles by
doing it in one fell swoop. De gustibus et cetera.


On the contrary, you may well want to use something like:

int getnonblank(FILE *f)
{
    int ch;

    while (isspace(ch = getc(f))) continue;
    return ch;
}

and let the caller worry about EOF. I can see this called by:

    int ch;

    ch = getnonblank(f);
    while (isdigit(ch)) {
        /* process ch */
        ch = getc(f);
    }
    return ch;

and the caller of that can still handle EOF conditions.
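
Filled out into something compilable, the idea might look like the sketch below. read_number is a hypothetical caller invented for this illustration; it is one possible reading of "process ch", not the only one.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Skip whitespace; return the first non-blank character, or EOF. */
int getnonblank(FILE *f)
{
    int ch;

    while (isspace(ch = getc(f)))
        continue;
    return ch;
}

/* Hypothetical caller: accumulate an unsigned decimal number and return
   the character that ended it (possibly EOF) for the caller to handle. */
int read_number(FILE *f, unsigned long *out)
{
    int ch;

    assert(f != NULL && out != NULL);
    *out = 0;
    ch = getnonblank(f);
    while (isdigit(ch)) {
        *out = *out * 10 + (unsigned long)(ch - '0');
        ch = getc(f);
    }
    return ch;
}
```

No cast is needed here because getc() already returns either EOF or a value in the range of unsigned char.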

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Nov 14 '05 #27

P: n/a
Joe Wright wrote:
Eric Sosman wrote:
Joe Wright wrote:
[...]
But, is*(int c) makes it possible to accept EOF outside the character
domain and characters as well. But the macro ..

#define is*(c) (ctype[((c)&255)+1] & *)

.. limits the index 0..256 regardless the int value of c.


Actually, it limits the index to 1..255; ctype[0]
will never be used. Are you sure you've transliterated
the macro correctly?

So, my original point, casting the argument to is*() to unsigned char
serves no purpose. You don't need to do it. Ever.


Wrong. R-O-N-G, wrong. See up-thread for detailed
explanations that I don't see any use in repeating here;
if you didn't understand them the first time, you won't
understand them this time either. Just take it on faith:
You're Fire-- er, You're Wrong.


Well yes, I am. I just wrote a little program to prove that I was Right
and the program says that I'm Wrong.

If any of you feel I've wasted your time, I apologize.

I'm going to study the 'problem' and its solution and report back here,
if anyone will still listen to me.

Sorry I was Wrong. It doesn't happen often (I hope).


But in my own defense, I wasn't sure why I was wrong. Here's what I
found..

1. I did transliterate the macro correctly from a 1995 version of
the code..

#define isspace(c) (ctype[((c)&255)+1] & ISSPACE)

...and I agree it is broken. In a 1998 (and current) version we have..

#define isspace(c) (ctype[(int)(c)+1] & ISSPACE)

...which will accept EOF correctly (the first one didn't) but can
cause all kinds of havoc if other values are not in the range of
0..255. I brought this up with the author of the macro. He said the
Standard required me to present a value 0..255 and if I didn't and
my code blew up, shame on me. So I fixed it..

#define isspace(c) \
    (ctype[(unsigned)(c) > 255 ? 0 : ((c)+1)] & ISSPACE)

...and presented it to him. He rejected it out of hand because the
conditional would be too much of a performance hit.

Now I understand better how and why I was wrong. Broken code is OK
if it's fast, even if you have to require your user to perform
unnatural acts.

Merry Christmas
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #28

P: n/a
Joe Wright wrote:
[...] So I fixed it..

#define isspace(c) \
    (ctype[(unsigned)(c) > 255 ? 0 : ((c)+1)] & ISSPACE)

..and presented it to him. He rejected it out of hand because the
conditional would be too much of a performance hit.


Performance aside, it suffers from a problem frequently
encountered with macros: it may evaluate its argument more
than once. This is Very Bad if the argument has side-effects:

if (isspace(*p++)) ...

if (isspace(ch = getchar())) ...
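
The double evaluation can be made visible with a small sketch. RANGE_CHECKED copies only the index expression of the proposed macro, and all names here are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Index expression in the style of the proposed macro: the argument
   appears twice, so a side effect in it may happen twice. */
#define RANGE_CHECKED(c) ((unsigned)(c) > 255u ? 0 : ((c) + 1))

/* The same computation as a function, which evaluates c exactly once. */
static int range_checked(int c)
{
    return (unsigned)c > 255u ? 0 : c + 1;
}

/* How many elements does p advance after one use of the macro?
   p must point to at least two readable bytes. */
int advance_with_macro(const unsigned char *p)
{
    const unsigned char *start = p;

    assert(p != NULL);
    (void)RANGE_CHECKED(*p++);   /* *p++ is expanded into two places */
    return (int)(p - start);
}

/* The same question for the function form. */
int advance_with_function(const unsigned char *p)
{
    const unsigned char *start = p;

    assert(p != NULL);
    (void)range_checked(*p++);
    return (int)(p - start);
}
```

For in-range bytes the macro form steps the pointer twice, while the function form steps it once.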

You might enjoy reading P.J. Plauger's "The Standard C
Library" for an exposition of the considerations that go into
implementing these and other (C90) Standard library functions.
I found it both entertaining and educational.

--
Eric Sosman
es*****@acm-dot-org.invalid
Nov 14 '05 #29

P: n/a
Eric Sosman wrote:
Joe Wright wrote:
[...] So I fixed it..

#define isspace(c) \
    (ctype[(unsigned)(c) > 255 ? 0 : ((c)+1)] & ISSPACE)

..and presented it to him. He rejected it out of hand because the
conditional would be too much of a performance hit.

Performance aside, it suffers from a problem frequently
encountered with macros: it may evaluate its argument more
than once. This is Very Bad if the argument has side-effects:

if (isspace(*p++)) ...

if (isspace(ch = getchar())) ...

You might enjoy reading P.J. Plauger's "The Standard C
Library" for an exposition of the considerations that go into
implementing these and other (C90) Standard library functions.
I found it both entertaining and educational.


I'm sure I would learn lots from it. P.J. is a hero. To the point,
what about..

((unsigned)(c)+1)&(UCHAR_MAX*2+1)

...as the index? No conditional, no multiple use. Still Wrong?
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #30

P: n/a
Joe Wright <jo********@comcast.net> writes:
[...]
In a 1998 (and current) version we have..

#define isspace(c) (ctype[(int)(c)+1] & ISSPACE)

..which will accept EOF correctly (the first one didn't) but can cause
all kinds of havoc if other values are not in the range of 0..255. I
brought this up with the author of the macro. He said the Standard
required me to present a value 0..255 and if I didn't and my code blew
up, shame on me.


He's right (assuming UCHAR_MAX==255).

But note that ISSPACE is in the user's namespace; if your
implementation uses this definition (and not something like _ISSPACE),
it's non-conforming.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #31

P: n/a
In article <0d********************@comcast.com>
Joe Wright <jo********@comcast.net> wrote:
... To the point, what about..

((unsigned)(c)+1)&(UCHAR_MAX*2+1)

..as the index? No conditional, no multiple use. Still Wrong?


Looks OK to me, on typical implementations (and if UCHAR_MAX is the
same as UINT_MAX we have problems implementing the entire library
anyway :-) ). It has two negative consequences, though:

- It doubles (well, almost) the size of the table, which used to
only be 257 entries (for the typical EOF=-1 through the typical
UCHAR_MAX=255).

- It only works if you *also* make sure that EOF is defined as,
e.g., -129 on machines where plain char is signed.
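
To see where the extra entries come from, here is a rough sketch of the proposed index expression. masked9_index is a name made up for this illustration, and the concrete numbers assume UCHAR_MAX is 255.

```c
#include <assert.h>
#include <limits.h>

/* The proposed index: ((unsigned)(c) + 1) & (UCHAR_MAX * 2 + 1). */
unsigned masked9_index(int c)
{
    const unsigned mask = UCHAR_MAX * 2u + 1u;   /* 511 when UCHAR_MAX is 255 */
    unsigned idx = ((unsigned)c + 1u) & mask;

    assert(idx <= mask);
    return idx;
}
```

Under that assumption the values -1 and 0..255 land on 0..256 as intended, but an unconverted negative char such as -23 lands on 490 and -128 lands on 385, so the table has to reach well past slot 256. A char holding -1 also shares slot 0 with an EOF of -1, which is why EOF has to move.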

If we assume that you are the (sole) implementor, *you* get to
define whether plain char is signed, and you get to #define EOF in
<stdio.h>. You also get to decide on the actual values of UCHAR_MAX,
SCHAR_MIN, and SCHAR_MAX; let us assume you go with the typical
255, -128, and 127.

If you then choose to make both:

char *p; ... isspace(*p++) ...
and
int c; ... isspace(c = getc(fp)) ...

work, you can do this more simply by:

a) in stdio.h, #define EOF -129
b) in ctype.h,
#define isspace(c) (__ctype_table[(c) + 129] & __CT_ISSPACE)

where __ctype_table is an array of size (255+129) or 384. (The
double underscore names are in your -- the implementor's -- reserved
namespace, so you can be sure no user has used them for anything.
No silly user would go and put "#ifndef __FOO_H / #define __FOO_H /
#endif" in a header file, would they? :-) )
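
A user-level sketch of this layout follows. Ordinary code cannot redefine EOF or isspace, so every SKETCH_/sketch_ name below is a hypothetical stand-in, and the table is filled by borrowing the host's own isspace(). By my count the indices run 0..384, so the sketch allocates 385 slots.

```c
#include <assert.h>
#include <ctype.h>

#define SKETCH_EOF      (-129)   /* stand-in for the implementor's EOF */
#define SKETCH_OFFSET   129
#define SKETCH_ISSPACE  0x01u

/* Slots 0..384: SKETCH_EOF, then signed chars -128..-1, then 0..255. */
static unsigned char sketch_table[255 + SKETCH_OFFSET + 1];

void sketch_table_init(void)
{
    int c;

    for (c = 0; c <= 255; c++) {
        unsigned char bits = isspace(c) ? SKETCH_ISSPACE : 0;

        sketch_table[c + SKETCH_OFFSET] = bits;
        if (c >= 128) {
            /* Same answer for the negative (unconverted) form of the byte. */
            sketch_table[c - 256 + SKETCH_OFFSET] = bits;
        }
    }
    sketch_table[0] = 0;   /* SKETCH_EOF belongs to no class */
    assert(sizeof sketch_table == 385);
}

#define sketch_isspace(c) (sketch_table[(c) + SKETCH_OFFSET] & SKETCH_ISSPACE)
```

After sketch_table_init() has run, sketch_isspace() gives the same answer for a byte whether it arrives as 233 or as the signed char value -23.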

Note that, for ctype.h macros, there are three cases:

- the user passes a plain (or explicitly signed) "char" value;
- the user passes a correctly-converted "unsigned char" value;
- the user passes a value obtained from the getc() family.

In the first case, the possible valid values are -128..127 (we know
this because we, the implementors, just *defined* CHAR_MIN and
CHAR_MAX, while writing the C compiler!). In the second case, the
possible valid values are 0..255 (again, *we* defined these when
we wrote the compiler). In the last case, the valid values are
{EOF = -129, 0..255} -- again, we defined EOF.

Note that if we choose to define EOF as -1, we will not be able
to tell, in our table lookup, an invocation of isspace(EOF) from
"char c = -1; isspace(c)". If character -1 is not a space, that
might be OK (because EOF is not a space either), but character -1
is often y-umlaut ("ÿ", if your Usenet client has not eaten it),
which should produce a true (nonzero) value for some of the is*
functions for which is*(EOF) must be false (zero).
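
The collision can be sketched in a few lines. SKETCH_EOF and classic_index are invented for this illustration, the value -1 is the assumed EOF under discussion, and UCHAR_MAX is taken to be 255.

```c
#include <assert.h>

#define SKETCH_EOF (-1)   /* assumed value; only negativity is guaranteed */

/* Index into the classic 257-entry table: slot 0 is for EOF,
   slots 1..256 are for the values 0..255. */
int classic_index(int c)
{
    assert(c >= SKETCH_EOF && c <= 255);
    return c + 1;
}
```

A signed char holding -1 gives the same index as SKETCH_EOF (slot 0), whereas the same byte converted to unsigned char gives 256, its own slot.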

Alternatively, instead of testing whether the user has written
correct EOF-handling code (i.e., has not assumed that EOF is defined
as -1), and allowing the user to get by with sloppy is*() calls,
we can write the ctype.h macros in the usual fashion, and test
whether the user has written correct is*() calls while letting the
user get by with sloppy EOF-handling code. This shrinks our table
back from 384 entries to 257, and makes incorrect C code break on
*our* machine (whatever it is) in the same cases where it breaks
on Intel machines running Microsoftware, instead of breaking
different incorrect C code. (And: guess which breakage people
accept more easily.... :-) )
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Nov 14 '05 #32

P: n/a
Chris Torek wrote:
In article <0d********************@comcast.com>
Joe Wright <jo********@comcast.net> wrote:
... To the point, what about..

((unsigned)(c)+1)&(UCHAR_MAX*2+1)

..as the index? No conditional, no multiple use. Still Wrong?

Looks OK to me, on typical implementations (and if UCHAR_MAX is the
same as UINT_MAX we have problems implementing the entire library
anyway :-) ). It has two negative consequences, though:

- It doubles (well, almost) the size of the table, which used to
only be 257 entries (for the typical EOF=-1 through the typical
UCHAR_MAX=255).

- It only works if you *also* make sure that EOF is defined as,
e.g., -129 on machines where plain char is signed.

If we assume that you are the (sole) implementor, *you* get to
define whether plain char is signed, and you get to #define EOF in
<stdio.h>. You also get to decide on the actual values of UCHAR_MAX,
SCHAR_MIN, and SCHAR_MAX; let us assume you go with the typical
255, -128, and 127.

If you then choose to make both:

char *p; ... isspace(*p++) ...
and
int c; ... isspace(c = getc(fp)) ...

work, you can do this more simply by:

a) in stdio.h, #define EOF -129
b) in ctype.h,
#define isspace(c) (__ctype_table[(c) + 129] & __CT_ISSPACE)

where __ctype_table is an array of size (255+129) or 384. (The
double underscore names are in your -- the implementor's -- reserved
namespace, so you can be sure no user has used them for anything.
No silly user would go and put "#ifndef __FOO_H / #define __FOO_H /
#endif" in a header file, would they? :-) )

Note that, for ctype.h macros, there are three cases:

- the user passes a plain (or explicitly signed) "char" value;
- the user passes a correctly-converted "unsigned char" value;
- the user passes a value obtained from the getc() family.

In the first case, the possible valid values are -128..127 (we know
this because we, the implementors, just *defined* CHAR_MIN and
CHAR_MAX, while writing the C compiler!). In the second case, the
possible valid values are 0..255 (again, *we* defined these when
we wrote the compiler). In the last case, the valid values are
{EOF = -129, 0..255} -- again, we defined EOF.

Note that if we choose to define EOF as -1, we will not be able
to tell, in our table lookup, an invocation of isspace(EOF) from
"char c = -1; isspace(c)". If character -1 is not a space, that
might be OK (because EOF is not a space either), but character -1
is often y-umlaut ("ÿ", if your Usenet client has not eaten it),
which should produce a true (nonzero) value for some of the is*
functions for which is*(EOF) must be false (zero).

Alternatively, instead of testing whether the user has written
correct EOF-handling code (i.e., has not assumed that EOF is defined
as -1), and allowing the user to get by with sloppy is*() calls,
we can write the ctype.h macros in the usual fashion, and test
whether the user has written correct is*() calls while letting the
user get by with sloppy EOF-handling code. This shrinks our table
back from 384 entries to 257, and makes incorrect C code break on
*our* machine (whatever it is) in the same cases where it breaks
on Intel machines running Microsoftware, instead of breaking
different incorrect C code. (And: guess which breakage people
accept more easily.... :-) )


The array remains at 257 of unsigned short. (UCHAR_MAX*2+1) is a
mask of 9 bits instead of 8 so as to accommodate 256. All the values
of interest are among -1..255 contiguous. Casting unsigned and
adding 1 we get 0..256 which is exactly what we want.

All valid values to is*() are EOF or 0..UCHAR_MAX. The comments
about -129 et al. make my head hurt. The characters of interest are
of type char and are positive. On all my systems I have ASCII
characters with value 0..127 (7 bits) so the fact that char is
signed is of no consequence. The 'other' character set is EBCDIC
(0..255 8 bits). Such a system usually has char unsigned. In any
case, EBCDIC is defined in 256 bytes and with EOF -1 fits our model.
--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #33

P: n/a
On Tue, 21 Dec 2004 07:00:50 GMT, Keith Thompson <ks***@mib.org>
wrote:
<snip>
Let's assume CHAR_BIT==8, and plain char is signed. Suppose one of
the arguments contains the character '\xe9' (233 decimal). As a
signed character, its value is -23. isdigit(-23) invokes undefined
behavior. (A given implementation may define isdigit() in such a way
that it doesn't cause any problems, but it's still undefined behavior.)

Changing the condition
isdigit(s[i])
to either
isdigit((unsigned char)s[i])
or
isdigit((unsigned)s[i])
avoids the undefined behavior.


The former does. The latter gives you isdigit( UINT_MAX+1 -23 ) where
UINT_MAX is at least 65535 and thus way out of the range of 8-bit
unsigned char which is 255.

The other safe form is isdigit( * (unsigned char*) &s[i] ) .
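
The three conversions can be compared in a short sketch. The function names are made up for this illustration; signed char is used explicitly so the result does not depend on whether plain char is signed, and the pointer form is assumed to run on a two's complement machine with 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>

/* (unsigned char)ch: reduced modulo UCHAR_MAX + 1, always 0..UCHAR_MAX. */
unsigned via_unsigned_char(signed char ch)
{
    unsigned r = (unsigned char)ch;

    assert(r <= UCHAR_MAX);
    return r;
}

/* (unsigned)ch: reduced modulo UINT_MAX + 1, huge for negative ch. */
unsigned via_unsigned_int(signed char ch)
{
    return (unsigned)ch;
}

/* *(unsigned char *)&ch: reads the stored byte as an unsigned char. */
unsigned via_pointer(const signed char *p)
{
    return *(const unsigned char *)p;
}
```

For a stored value of -23 the first and third forms yield 233, while the second yields UINT_MAX - 22.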

- David.Thompson1 at worldnet.att.net
Nov 14 '05 #34

P: n/a
Dave Thompson <da*************@worldnet.att.net> writes:
On Tue, 21 Dec 2004 07:00:50 GMT, Keith Thompson <ks***@mib.org>
wrote:

[...]
Changing the condition
isdigit(s[i])
to either
isdigit((unsigned char)s[i])
or
isdigit((unsigned)s[i])
avoids the undefined behavior.


The former does. The latter gives you isdigit( UINT_MAX+1 -23 ) where
UINT_MAX is at least 65535 and thus way out of the range of 8-bit
unsigned char which is 255.


Whoops, you're right.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #35

P: n/a
Chris Torek wrote:
Note that, for ctype.h macros, there are three cases:

- the user passes a plain (or explicitly signed) "char" value;
- the user passes a correctly-converted "unsigned char" value;
- the user passes a value obtained from the getc() family.

In article <Fv********************@comcast.com>
Joe Wright <jo********@comcast.net> wrote:
All valid values to is*() are EOF or 0..UCHAR_MAX. The comments
about -129 et al. make my head hurt.
I thought your entire goal here was to make incorrect C code of
the form:

    char *p;
    ...
    for (p = buf; *p != '\0'; p++)
        *p = toupper(*p); /* WRONG */

"work right".
On all my systems I have ASCII
characters with value 0..127 (7 bits) so the fact that char is
signed is of no consequence. The 'other' character set is EBCDIC
(0..255 8 bits).


On *my* systems I have ISO-Latin-1, and someone might type in his
name as "Pádraig" (P, a-with-accent-acute, d, r, a, i, g). The
second character, when inspected via *p, has value -31.

If all you want is to make *correct* C code work, just require the
user to write:

    for (p = buf; *p != '\0'; p++)
        *p = toupper((unsigned char)*p); /* RIGHT */

in the first place, and there is no need for any masking -- toupper()
can just use:

#define toupper(c) (__ctype_map_to_upper[(c) + 1])

(with similar mask-free code for the is* macros). This is the
original <ctype.h> code to which you apparently objected.
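
Put together as a user-level sketch: the sketch_ names are hypothetical stand-ins for the implementor's reserved names, EOF is assumed to be -1, UCHAR_MAX is assumed to be 255, and the map is filled by borrowing the host's toupper().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define SKETCH_EOF (-1)   /* assumed value of EOF for this sketch */

/* Classic layout: slot 0 is for EOF, slots 1..256 are for 0..255. */
static int sketch_upper_map[257];

void sketch_upper_map_init(void)
{
    int c;

    sketch_upper_map[0] = SKETCH_EOF;            /* EOF maps to itself */
    for (c = 0; c <= 255; c++)
        sketch_upper_map[c + 1] = toupper(c);    /* borrow the host's mapping */
}

/* Mask-free lookup; valid only for SKETCH_EOF and 0..255. */
#define sketch_toupper(c) (sketch_upper_map[(c) + 1])

/* Upper-case a string in place; the cast keeps every index within 1..256. */
void upcase(char *buf)
{
    char *p;

    assert(buf != NULL);
    for (p = buf; *p != '\0'; p++)
        *p = (char)sketch_toupper((unsigned char)*p);
}
```

In the default "C" locale the accented byte in "Pádraig" is passed through unchanged; the point is only that the lookup never sees a negative index.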
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Nov 14 '05 #36
