Plain Char

Peter Nilsson

In a post regarding toupper(), Richard Heathfield once asked me to think
about what the conversion of a char to unsigned char would mean, and whether
it was sensible to actually do so. And pete has raised a doubt in my mind
on the same issue.

Either through ignorance or incompetence, I've been unable to resolve some
issues.

6.4.4.4p6 states...

    The hexadecimal digits that follow the backslash and the letter
    x in a hexadecimal escape sequence are taken to be part of the
    construction of a single character for an integer character
    constant or of a single wide character for a wide character
    constant. The numerical value of the hexadecimal integer so
    formed specifies the value of the desired character or wide
    character.

6.4.4.4p9 states...

    An integer character constant has type int. The value of an integer
    character constant containing a single character that maps to a
    single-byte execution character is the numerical value of the
    representation of the mapped character interpreted as an integer.
    ...

What does this mean? Why does it use the phrase 'value of the
_representation_'?

It goes on to say...

    If an integer character constant contains a single character or
    escape sequence, its value is the one that results when an object
    with type char whose value is that of the single character or
    escape sequence is converted to type int.

What does this mean?

I'm thinking of when plain char is signed. For the character constants
obviously in the range 0..CHAR_MAX, e.g. '\x50', then I can expect the
value to be what the constant implies, namely 0x50 for the sample given.
But what happens when a character constant (using hex or octal escape)
is in the range CHAR_MAX+1..UCHAR_MAX?

What is the value of '\xe9' on an 8-bit char implementation? I would have
thought 233, but if plain char is signed, then it would seem that the value
is implementation-defined. But in what way? Is the value 233 _converted_ to
char (as in 6.3.1), or is the value _as if_ an unsigned char object was read
through a char lvalue? [In which case '\x80' could be problematic on 8-bit
ones' complement and sign-magnitude machines (and seemingly also on two's
complement machines under C99).]
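
One way to see what a particular implementation does with '\xe9' is to look at the constant directly. The following is only a minimal sketch; the helper names are made up for illustration, and whatever it reports describes that one implementation, not what the standard requires.

```c
#include <limits.h>

/* Hypothetical helper names, used only for illustration. */

/* Returns 1 if plain char is signed on this implementation, else 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The constant as the compiler evaluates it: an int whose value is that
   of a char holding the escape sequence, converted to int. */
int e9_as_int(void)
{
    return '\xe9';
}

/* The same constant pushed through unsigned char. Conversion to an
   unsigned type is modulo UCHAR_MAX + 1, so a negative value is mapped
   back into 0..UCHAR_MAX. */
int e9_as_uchar(void)
{
    return (unsigned char)'\xe9';
}
```

On a common 8-bit two's complement implementation with signed plain char I would expect e9_as_int() to give -23 and e9_as_uchar() to give 233; with unsigned plain char I would expect 233 from both. I make no claim about other representations.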

Moving onto other source methods for plain char values:

7.19.3p11 states...

    ...The byte input functions read characters from the stream as if by
    successive calls to the fgetc function.

7.19.7.1 states...

    ...the fgetc function obtains that character as an unsigned char
    converted to an int...

7.19.8.1 states...

    The fread function reads, into the array pointed to by ptr, up to
    nmemb elements whose size is specified by size, from the stream
    pointed to by stream. For each object, size calls are made to the
    fgetc function and the results stored, in the order read, in an
    array of unsigned char exactly overlaying the object.

Now the specs for fgets make no mention of unsigned char arrays, and fgetc
does not write the read character to a buffer; it simply returns an int.
This suggests that fgets (notionally using fgetc) stores its characters by
assignment, and thus conversion (6.3.1).
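
The two storage methods being contrasted here can be sketched as follows. The function names are hypothetical, and neither function is claimed to be what any real fgets or fread does internally; the sketch only shows the difference between storing by conversion and storing through an unsigned char overlay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only. */

/* Store by assignment: the int from fgetc (0..UCHAR_MAX) is converted
   to char. If char is signed and c > CHAR_MAX, the result of the
   conversion is implementation-defined (6.3.1.3). */
void store_by_conversion(char *buf, size_t i, int c)
{
    assert(buf != NULL);
    buf[i] = (char)c;
}

/* Store by overlay: write through an unsigned char lvalue, so the byte
   holds exactly the value fgetc returned, as described for fread. */
void store_by_overlay(char *buf, size_t i, int c)
{
    assert(buf != NULL);
    ((unsigned char *)buf)[i] = (unsigned char)c;
}

/* Read position i of the buffer back as an unsigned char value. */
int read_as_uchar(const char *buf, size_t i)
{
    assert(buf != NULL);
    return ((const unsigned char *)buf)[i];
}
```

After store_by_overlay, read_as_uchar returns the original value by construction. After store_by_conversion the same is only guaranteed for values up to CHAR_MAX; beyond that it depends on the implementation's conversion.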

So, there are at least two ways a plain char string can store the same source
'string' of characters. [Considering characters whose original unsigned char
values are outside the range of signed char.]

Writing a generic to_upper(char*) function would seem to be impossible. If
the function chooses to convert the original string pointer to an unsigned
char *, then it fails for strings read by fgets and possibly string
literals. If it converts single char values to unsigned then it possibly
fails for strings read via fread. [Unless the implementation-defined
conversion from unsigned char (or int) to (signed plain) char is a literal
reinterpretation of the low order bits.]

Implementations are allowed to support extended character sets, and those
character codings may not be representable as positive values in plain char.
So how can a strictly portable program deal with such characters? Do string
manipulation functions have to know the potential source?

Fundamentally, how does a program reliably convert negative plain char
values back to the original non-negative int (or unsigned char) values?
[e.g. for locale-dependent functions like some toxxxx() functions.]

--
Peter

Nov 14 '05 #1


pete

Peter Nilsson wrote:

> In a post regarding toupper(),
> Richard Heathfield once asked me to think
> about what the conversion of a char to unsigned char would mean,
> and whether it was sensible to actually do so.
> And pete has raised a doubt in my mind on the same issue.

The idea was that
    memcmp(s1, s2, shorter_string_length + 1)
should agree, at least in sign, with
    strcmp(s1, s2)
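
That relationship can be checked with a small sketch like the one below. The function names are invented for illustration. Only the signs are compared, since the standard specifies the sign of the result and not its magnitude.

```c
#include <assert.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Hypothetical check: does memcmp over the shorter length plus the
   terminator agree in sign with strcmp on the same two strings? */
int memcmp_agrees_with_strcmp(const char *s1, const char *s2)
{
    size_t n1, n2, shorter;

    assert(s1 != NULL && s2 != NULL);
    n1 = strlen(s1);
    n2 = strlen(s2);
    shorter = n1 < n2 ? n1 : n2;

    /* Both strings have at least shorter + 1 bytes, counting the
       terminator, so memcmp stays inside both objects. */
    return sign_of(memcmp(s1, s2, shorter + 1)) == sign_of(strcmp(s1, s2));
}
```

Because both functions compare the characters as unsigned char, the check should hold even when the strings contain values above CHAR_MAX.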
> Writing a generic to_upper(char*)
> function would seem to be impossible.

It's up to you to define what to_upper(char*)
is supposed to do with what, and when it is undefined.

This is what I have for int to_upper(int c):

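#include <limits.h>
#include <string.h>

#define UPPER "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define LOWER "abcdefghijklmnopqrstuvwxyz"

int to_upper(int c)
{
    const char *upper;
    const char *const lower = LOWER;

    /* Only search for values a plain char can hold, and exclude '\0',
       which strchr would otherwise match at the terminator. */
    upper = CHAR_MAX >= c && c > '\0' ? strchr(lower, c) : NULL;
    /* The offset of c within LOWER indexes the same letter in UPPER. */
    return upper != NULL ? *(upper - lower + UPPER) : c;
}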

> If the function chooses to convert the original
> string pointer to an unsigned char *,
> then it fails for strings read by fgets and possibly string
> literals.
> If it converts single char values to unsigned then it possibly
> fails for strings read via fread. [Unless the implementation-defined
> conversion from unsigned char (or int)
> to (signed plain) char is a literal
> reinterpretation of the low order bits.]

> Implementations are allowed to support extended character sets,
> and those character codings may not be representable as positive
> values in plain char.
> So how can a strictly portable program deal with such characters?

I don't think that strictly portable programs have to support
other locales besides the C locale.

> Do string manipulation functions have to know the potential source?
>
> Fundamentally, how does a program reliably convert negative plain char
> values back to the original non-negative int (or unsigned char)
> values?

If the argument to toupper isn't either representable
as unsigned char, or equal to EOF, then the behavior is undefined.
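
As a rough sketch of that rule, the check below tests whether a value is in the domain of the <ctype.h> functions before calling toupper. Both function names are hypothetical, and returning out-of-domain values unchanged is just one possible policy, not something the standard prescribes.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Returns 1 if c is a valid argument for the <ctype.h> functions:
   either EOF or a value representable as unsigned char. */
int in_ctype_domain(int c)
{
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

/* Hypothetical wrapper: call toupper only when its behavior is defined,
   and hand anything else back unchanged. */
int toupper_checked(int c)
{
    return in_ctype_domain(c) ? toupper(c) : c;
}
```

A negative plain char value other than EOF is therefore passed through untouched, which avoids the undefined behavior but does not upper-case the character it was meant to represent.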

--
pete

Nov 14 '05 #2

Dan Pop

In <40******@news.rivernet.com.au> "Peter Nilsson" <ai***@acay.com.au> writes:

> In a post regarding toupper(), Richard Heathfield once asked me to think
> about what the conversion of a char to unsigned char would mean, and whether
> it was sensible to actually do so.

It's not sensible to obtain negative char values in the first place,
in a string/text processing context. There is NO portable use for the
negative char values (portable code needing them for arithmetic purposes
should use the signed char type instead).

> Either through ignorance or incompetence, I've been unable to resolve some
> issues.

There is a third possibility: the standard is a complete mess in this
area. The origin of the mess is historical: the string handling functions
expect pointers to plain char, but the actual processing involves unsigned
char values:

    1 The sign of a nonzero value returned by the comparison functions
      memcmp, strcmp, and strncmp is determined by the sign of the
      difference between the values of the first pair of characters
      (both interpreted as unsigned char) that differ in the objects
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      being compared.

This is consistent with the way fgetc() works and avoids all kinds of
complications that I mention below. So, any time you need a character
value from a string, use a pointer to unsigned char to obtain it.
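
A minimal sketch of that advice, applied to the upper-casing problem from the original question, might look like this. The function name is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical in-place upper-casing routine. Each character is read
   and written through an unsigned char lvalue, so toupper never sees
   a negative value. */
void upcase_in_place(char *s)
{
    unsigned char *p;

    assert(s != NULL);
    for (p = (unsigned char *)s; *p != '\0'; ++p) {
        *p = (unsigned char)toupper(*p);
    }
}
```

The value returned by toupper is either EOF or representable as unsigned char, and EOF cannot occur here because the argument is never EOF, so the cast back to unsigned char does not lose anything.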

If you want some headaches, consider the case with CHAR_BIT > 8 and
CHAR_MAX == 127. Accessing strings via char pointers effectively
means losing information in the process, so even the positive char
values thus obtained are problematic. Allow trap representations into
the padding bits for extra fun...
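
Whether a given implementation is in that situation can be estimated from <limits.h>. This is only a sketch with an invented function name: it compares the number of values plain char can represent with the number unsigned char can.

```c
#include <limits.h>

/* Hypothetical check: can plain char represent as many distinct values
   as unsigned char? If not, reading bytes through a char lvalue cannot
   preserve every value that fgetc might return. */
int char_covers_uchar_range(void)
{
    /* Unsigned arithmetic wraps modulo ULONG_MAX + 1, which yields the
       true difference CHAR_MAX - CHAR_MIN as long as it fits. */
    unsigned long span = (unsigned long)CHAR_MAX - (unsigned long)CHAR_MIN;

    return span >= (unsigned long)UCHAR_MAX;
}
```

With CHAR_BIT == 9 and CHAR_MAX == 127, UCHAR_MAX would be 511 while the span of plain char is at most 255, so the function would return 0. On the usual 8-bit two's complement implementations it should return 1.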

> 6.4.4.4p6 states...
>
>     The hexadecimal digits that follow the backslash and the letter
>     x in a hexadecimal escape sequence are taken to be part of the
>     construction of a single character for an integer character
>     constant or of a single wide character for a wide character
>     constant. The numerical value of the hexadecimal integer so
>     formed specifies the value of the desired character or wide
>     character.
>
> 6.4.4.4p9 states...
>
>     An integer character constant has type int. The value of an integer
>     character constant containing a single character that maps to a
>     single-byte execution character is the numerical value of the
>     representation of the mapped character interpreted as an integer.
>     ...
>
> What does this mean? Why does it use the phrase 'value of the
> _representation_'?
>
> It goes on to say...
>
>     If an integer character constant contains a single character or
>     escape sequence, its value is the one that results when an object
>     with type char whose value is that of the single character or
>     escape sequence is converted to type int.
>
> What does this mean?
>
> I'm thinking of when plain char is signed. For the character constants
> obviously in the range 0..CHAR_MAX, e.g. '\x50', then I can expect the
> value to be what the constant implies, namely 0x50 for the sample given.
> But what happens when a character constant (using hex or octal escape)
> is in the range CHAR_MAX+1..UCHAR_MAX?
>
> What is the value of '\xe9' on an 8-bit char implementation? I would have
> thought 233, but if plain char is signed, then it would seem that the value
> is implementation-defined. But in what way? Is the value 233 _converted_ to
> char (as in 6.3.1), or is the value _as if_ an unsigned char object was read
> through a char lvalue?

The latter, apparently. Otherwise, the wording is unnecessarily
complicated (no need to introduce an *object* of type char, when all you
mean is a type conversion between unsigned char and char, followed by
promotion to int). The mention of the object means that a hypothetical
byte must be used in obtaining the value of the constant.
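
The two readings can be put side by side in code. This is only a sketch with hypothetical function names; it illustrates the distinction being drawn and does not settle which reading the standard intends.

```c
#include <string.h>

/* Reading 1: convert the value to char, then to int. When char is
   signed and the value exceeds CHAR_MAX, the result of the conversion
   is implementation-defined. */
int value_by_conversion(unsigned int v)
{
    return (char)v;
}

/* Reading 2: place the value in a byte, then read that byte through a
   char lvalue. This is the "hypothetical byte" interpretation. */
int value_by_object(unsigned int v)
{
    unsigned char byte = (unsigned char)v;
    char c;

    memcpy(&c, &byte, 1);
    return c;
}
```

On two's complement implementations whose conversion simply keeps the low order bits, both functions return the same result, which is presumably why the difference rarely shows up in practice.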

I'm not saying that this is necessarily a sensible way of getting the
value of the character constant, but this is the only sensible
interpretation of what the standard actually says.

> [In which case '\x80' could be problematic on 8-bit
> ones' complement and sign-magnitude machines (and seemingly also on
> two's complement machines under C99).]

What's wrong with '\x80' on one's complement machines? Looks like a legit
representation of -127, unless I'm missing something. It's '\xff' that is
problematic on one's complement machines (-0 or trap representation).

The existence of these bit patterns is the reason I expressed my own
doubts about the sanity of this method, above.

> Moving onto other source methods for plain char values:
>
> 7.19.3p11 states...
>
>     ...The byte input functions read characters from the stream as if by
>     successive calls to the fgetc function.
>
> 7.19.7.1 states...
>
>     ...the fgetc function obtains that character as an unsigned char
>     converted to an int...
>
> 7.19.8.1 states...
>
>     The fread function reads, into the array pointed to by ptr, up to
>     nmemb elements whose size is specified by size, from the stream
>     pointed to by stream. For each object, size calls are made to the
>     fgetc function and the results stored, in the order read, in an
>     array of unsigned char exactly overlaying the object.
>
> Now the specs for fgets make no mention of unsigned char arrays, and fgetc
> does not write the read character to a buffer; it simply returns an int.
> This suggests that fgets (notionally using fgetc) stores its characters by
> assignment, and thus conversion (6.3.1).

Not necessarily: the fgets specification simply doesn't say how the
function is doing its job. It could convert its s parameter to pointer
to unsigned char and simply store the unsigned char values returned by
fgetc as ints. If a conversion to plain char were involved in the
process, we'd have problems with those values that yield trap
representations and/or negative zeros when converted to plain char.
On a one's complement implementation, the character value 255 may end
up as a null character in fgets' buffer, which is certainly not what you
want.

> So, there are at least two ways a plain char string can store the same source
> 'string' of characters. [Considering characters whose original unsigned char
> values are outside the range of signed char.]

???

> Writing a generic to_upper(char*) function would seem to be impossible. If
> the function chooses to convert the original string pointer to an unsigned
> char *, then it fails for strings read by fgets and possibly string
> literals.

Why?

> If it converts single char values to unsigned then it possibly
> fails for strings read via fread.

The values have been written as unsigned char values and it is unsafe to
read them back as signed char values (all three types of supported
representations can have "forbidden" bit patterns in the signed types).

> [Unless the implementation-defined
> conversion from unsigned char (or int) to (signed plain) char is a literal
> reinterpretation of the low order bits.]

All three types of supported representations can have "forbidden" bit
patterns in the signed types. So, reinterpreting the low order bits is
not an option here.

> Implementations are allowed to support extended character sets, and those
> character codings may not be representable as positive values in plain char.
> So how can a strictly portable program deal with such characters? Do string
> manipulation functions have to know the potential source?

A portable program only uses unsigned char pointers to process data
obtained in string format from the <stdio.h> input functions, for the
reasons already explained above.
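
As an illustration of processing such data through an unsigned char pointer, the sketch below counts the characters in a string (for example, a buffer filled by fgets) that could not be non-negative values of a signed char. The function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: count the characters in s whose unsigned char
   value lies above SCHAR_MAX, i.e. those that would come out negative
   (or otherwise troublesome) if read as signed char. */
size_t count_above_schar_max(const char *s)
{
    const unsigned char *p;
    size_t n = 0;

    assert(s != NULL);
    for (p = (const unsigned char *)s; *p != '\0'; ++p) {
        if (*p > SCHAR_MAX) {
            ++n;
        }
    }
    return n;
}
```

Because every access goes through an unsigned char lvalue, the function never produces a negative character value, whatever the signedness of plain char.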

> Fundamentally, how does a program reliably convert negative plain char
> values back to the original non-negative int (or unsigned char) values?
> [e.g. for locale-dependent functions like some toxxxx() functions.]

The program avoids dealing with negative plain char values in the first
place.

Another way of dealing with the issue is by postulating that only
characters from the basic execution character set can be portably
processed in strings. This avoids all the problems discussed above,
because all the character values are positive and representable as both
char and unsigned char.
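
A program taking that second route could verify its input first. The following sketch assumes the membership list given in C99 5.2.1 for the basic execution character set; the names basic_chars and is_basic_only are invented for this example.

```c
#include <assert.h>
#include <string.h>

/* The basic execution character set (C99 5.2.1), minus the null
   character, which terminates the string being tested. */
static const char basic_chars[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
    " \t\v\f\a\b\r\n";

/* Hypothetical predicate: 1 if every character of s belongs to the
   basic execution character set, otherwise 0. */
int is_basic_only(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (strchr(basic_chars, *s) == NULL) {
            return 0;
        }
    }
    return 1;
}
```

Strings that pass this test contain only positive character values, so the char versus unsigned char question does not arise for them.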

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #3
