Bytes IT Community

I want unsigned char * string literals

P: n/a
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset, character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *. With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)(s))

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
    while (n-- && *src) {
        *dst++ = *src++;
        ...

And abolish the use of traditional string functions (at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable (but professional) let's
hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.
Jul 22 '07 #1
33 Replies



"Michael B Allen" <io****@gmail.com> wrote in message
news:20****************************@gmail.com...
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the
elements of these arrays are decidedly not signed. In fact, they may not
even represent complete characters. At this point I think of text as
simple binary blobs. What charset, character encoding and termination
they use should not be exposed in the interface used to operate on
them.
char * for a list of human-readable characters.
unsigned char * for a list of arbitrary bytes - almost always octets.
signed char * - very rare. Sometimes you might need a tiny integer. I will
resist mentioning my campaign for 64-bit ints.

unsigned char really ought to be "byte". Unfortunately a bad decision was
taken to treat characters and bytes the same way, and now we are stuck with
sizeof(char) == 1 byte.

If you start using unsigned char* for strings then, as you have found, you
will merrily break all the calls to string library functions. This can be
patched up by a cast, but the real answer is not to do that in the first
place.
Very rarely are you interested in the actual encoding of a character. A few
exceptions arise when you want to code lookup tables for speed, or write
low-level routines to convert from decimal to machine letter, or put text
into binary files in an agreed coding, but they are very few.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Jul 22 '07 #2

Michael B Allen <io****@gmail.com> writes:
Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset, character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *. With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.
[...]

No, C string literals have type 'array[N] of char'; in most, but not
all, contexts, this is implicitly converted to 'char *'. (Consider
'sizeof "hello, world"'.)

My main point isn't that they're arrays rather than pointers, but that
they're arrays of (plain) char, not of signed char. Plain char is
equivalent to *either* signed char or unsigned char, but is still a
distinct type from either of them. It appears that plain char is
signed in your implementation.

I know this doesn't answer your actual question; hopefully someone
else can help with that.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Jul 22 '07 #3

Michael B Allen wrote:
>
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *.
The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset,
character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.
They are arrays of plain char,
which may be either a signed or unsigned type.
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)(s))

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions
(at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better idea
I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.
The solution is obvious: use arrays of char to contain strings.

Using arrays of unsigned char to hold strings
creates a problem for you, but solves nothing.

If I have a problem
that is caused by using arrays of char to hold strings,
I'm unaware of what the problem is.

--
pete
Jul 22 '07 #4

On Sun, 22 Jul 2007 22:02:42 GMT
pete <pf*****@mindspring.com> wrote:
The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better idea
I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

The solution is obvious: use arrays of char to contain strings.

Using arrays of unsigned char to hold strings
creates a problem for you, but solves nothing.

If I have a problem
that is caused by using arrays of char to hold strings,
I'm unaware of what the problem is.
Hi pete,

I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.

If you read data from binary file would you read it into a char buffer
or unsigned char buffer?

Type char is not the correct type for text. It is merely adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.

The only problem with using unsigned char is string literals and that
seems like a weak reason to make all downstream functions use char.

Also, technically speaking, if I used char, all internationalized string
functions would eventually have to cast char to unsigned char so that they
could decode, encode and interpret whole characters.

If compilers allowed the user to specify what the type for string literals
was, that would basically solve this "problem".

Mike
Jul 22 '07 #5

Michael B Allen <io****@gmail.com> writes:
[...]
I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.
But that's exactly what it's *supposed* to be. If you're saying it
doesn't meet that requirement, I don't disagree. Personally, I think
it would make more sense in most environments for plain char to be
unsigned.
If you read data from binary file would you read it into a char buffer
or unsigned char buffer?
Probably an unsigned char buffer, but a binary file could be anything.
If it contained 8-bit signed data, I'd use signed char.
Type char is not the correct type for text. It is merely adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.

The only problem with using unsigned char is string literals and that
seems like a weak reason to make all downstream functions use char.

Also, technically speaking, if I used char, all internationalized string
functions would eventually have to cast char to unsigned char so that they
could decode, encode and interpret whole characters.

If compilers allowed the user to specify what the type for string literals
was, that would basically solve this "problem".
Not really; the standard functions that take strings would still
require pointers to plain char.

As I said, IMHO making plain char unsigned is the best solution in
most environments. I don't know why that hasn't caught on. Perhaps
there's too much badly written code that assumes plain char is signed.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Jul 23 '07 #6

Michael B Allen wrote:
>
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *.
The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset,
character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)(s))

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions
(at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.
I think it might be simpler to retain the char interface,
and then cast inside your functions:

int
text_copy(const char *src, char *dst, int n)
{
    unsigned char *s1 = (unsigned char *)dst;
    const unsigned char *s2 = (const unsigned char *)src;

    while (n != 0 && *s2 != '\0') {
        *s1++ = *s2++;
        --n;
    }
    while (n-- != 0) {
        *s1++ = '\0';
    }
    return 0;
}

--
pete
Jul 23 '07 #7

On Mon, 23 Jul 2007 01:31:22 GMT
pete <pf*****@mindspring.com> wrote:
Michael B Allen wrote:

Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *.
The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset,
character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)(s))

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions
(at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.

I think it might be simpler to retain the char interface,
and then cast inside your functions:

int
text_copy(const char *src, char *dst, int n)
{
unsigned char *s1 = ( unsigned char *)dst;
const unsigned char *s2 = (const unsigned char *)src;
Hi pete,

Ok, I'm giving in. I asked, I got an answer and you guys are right.

Even though char is wrong, it's just another little legacy wart with
no serious technical impact other than the fact that to inspect bytes
within the text one should cast to unsigned char first. So if casting
has to occur, doing it in the base functions is a lot more elegant than
casting every string literal throughout the entire codebase.

But in hope that someday compilers will provide an option for char to
be unsigned, I have started to replace all instances of the char type
with my own typedef so that when that day comes I can tweak one line of
code and have what I want.

Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.

Mike
Jul 23 '07 #8

Michael B Allen wrote:
>
Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.
Could it be that it simply makes char unsigned?

--
Ian Collins.
Jul 23 '07 #9

In article <20****************************@gmail.com>,
Michael B Allen <io****@gmail.com> wrote:
>
Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.
-funsigned-char affects the compiler's behavior, possibly causing your
program to behave differently, but it doesn't make your code correct. Correct
code works when compiled with either -fsigned-char or -funsigned-char.
The warning is designed to help you make your code correct, by alerting you
when you've done something which might not work the same if you changed from
-funsigned-char to -fsigned-char (or from gcc to some other compiler that
doesn't let you choose).

If you got different warnings depending on your -f[un]signed-char option,
you'd have to compile your code twice to see all the possible warnings. That
wouldn't be friendly.

--
Alan Curry
pa****@world.std.com
Jul 23 '07 #10

Michael B Allen said:

<snip>
Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.
Strange, that - I've used it with EBCDIC, with the high bit set, and it
worked just fine. I wonder what I'm doing wrong.
If we're ever going to create a new "standard" library for C the first
step is to admit that the one we have now is useless for anything but
hello world programs.
The standard C library could be a lot, lot better, it's true, but it's
surprising just how much can be done with it if you try.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Jul 23 '07 #11

Michael B Allen wrote On 07/23/07 12:53,:
On Mon, 23 Jul 2007 09:02:04 -0400
Eric Sosman <es*****@ieee-dot-org.invalid> wrote:
>[...]
Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.


Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.
First, C does not assume ASCII character encodings,
and runs happily on systems that do not use ASCII. The
only constraints on the encoding are (1) that the available
characters include a specified set of "basic" characters,
(2) that the codes for the basic characters be non-negative,
and (3) that the codes for the characters '0' through '9'
be consecutive and ascending. Any encoding that meets
these requirements -- ASCII or not -- is acceptable for C.

Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.
Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
If you think so, then why use C? You're planning on
throwing away the entire library and changing the handling
of text in fundamental ways (ways that go far beyond your
initial "I want unsigned text" plea). The result would be
a programming language in which existing C programs would
not run and perhaps would not compile; why are you so set
on calling this new and different language "C?" Call it
"D" or "Sanskrit" or "Baloney" if you like, but it ain't C.

--
Er*********@sun.com
Jul 23 '07 #12

On Mon, 23 Jul 2007 13:31:24 -0400
Eric Sosman <Er*********@sun.com> wrote:
Michael B Allen wrote On 07/23/07 12:53,:
On Mon, 23 Jul 2007 09:02:04 -0400
Eric Sosman <es*****@ieee-dot-org.invalid> wrote:
[...]
Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.

Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.
First, C does not assume ASCII character encodings,
and runs happily on systems that do not use ASCII. The
only constraints on the encoding are (1) that the available
characters include a specified set of "basic" characters,
(2) that the codes for the basic characters be non-negative,
and (3) that the codes for the characters '0' through '9'
be consecutive and ascending. Any encoding that meets
these requirements -- ASCII or not -- is acceptable for C.
True. I forgot about EBCDIC and such (thanks Richard).

But that is just a pedantic distraction from the real point, which is that
your code will not work with non-latin1 encodings, and that is going to
seriously impact its portability.
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.
#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main()
{
    printf("%c %d %x\n", CH, CH, CH);

    printf("isalnum=%d\n", isalnum(CH));
    printf("isalpha=%d\n", isalpha(CH));
    printf("iscntrl=%d\n", iscntrl(CH));
    printf("isdigit=%d\n", isdigit(CH));
    printf("isgraph=%d\n", isgraph(CH));
    printf("islower=%d\n", islower(CH));
    printf("isupper=%d\n", isupper(CH));
    printf("isprint=%d\n", isprint(CH));
    printf("ispunct=%d\n", ispunct(CH));
    printf("isspace=%d\n", isspace(CH));

    return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0

Again, even if these functions did work they *still* wouldn't handle
non-latin1 encodings (e.g. UTF-8).
Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
If you think so, then why use C? You're planning on
throwing away the entire library and changing the handling
of text in fundamental ways (ways that go far beyond your
initial "I want unsigned text" plea). The result would be
a programming language in which existing C programs would
not run and perhaps would not compile; why are you so set
on calling this new and different language "C?" Call it
"D" or "Sanskrit" or "Baloney" if you like, but it ain't C.
I think that you should consider the possibility that programming
requirements are changing and that discussing the history of C will have
no impact on that. Anyone who could move to Java or .NET already has. The
rest of us are doing systems programming that needs to be C (like me).

If standards mandate UTF-8 your techniques will have to change or you're
going to be doing a lot of painful character encoding conversions at
interface boundaries.

Mike
Jul 23 '07 #13

Michael B Allen wrote On 07/23/07 14:10,:
On Mon, 23 Jul 2007 13:31:24 -0400
Eric Sosman <Er*********@sun.com> wrote:
>[...]
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.

#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main()
{
printf("%c %d %x\n", CH, CH, CH);

printf("isalnum=%d\n", isalnum(CH));
printf("isalpha=%d\n", isalpha(CH));
printf("iscntrl=%d\n", iscntrl(CH));
printf("isdigit=%d\n", isdigit(CH));
printf("isgraph=%d\n", isgraph(CH));
printf("islower=%d\n", islower(CH));
printf("isupper=%d\n", isupper(CH));
printf("isprint=%d\n", isprint(CH));
printf("ispunct=%d\n", ispunct(CH));
printf("isspace=%d\n", isspace(CH));

return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0


You've heard of the <locale.h> mechanisms (you mentioned
them), but it doesn't seem that you know how or when to use
them. It's quite simple, really:

- Add #include <locale.h> near the top of the file

- Insert a setlocale() call somewhere before those
isxxx() queries. The names of locales are system-
dependent; on the machine in front of me right now
one appropriate call looks like

setlocale (LC_CTYPE, "iso_8859_1");

The, well, "characteristics" of character codes are a
function of the current locale, and change as the locale
changes. In the "C" locale, there are (for instance) only
52 alphabetic characters; the other 204 are non-alphabetic
regardless of what glyphs they may produce on an output
device. Change locale -- which is to say, change to a
different set of customs about the meanings of characters --
and you get (potentially) a different answer.
Again, even if these functions did work they *still* wouldn't handle
non-latin1 encodings (e.g. UTF-8).
C's one-char-is-one-character style does not fit well
with multi-byte encodings, and especially not with variable-
length encodings. Granted; no argument; you're in the right.
But making char unsigned will not cure this illness, nor
even cure the least troubling of its symptoms; it's simply
beside the point.
I think that you should consider the possibility that programming
requirements are changing and that discussing the history of C will have
no impact on that. Anyone who could move to Java or .NET already has. The
rest of us are doing systems programming that needs to be C (like me).
If it "needs to be C," then it "needs to be C," and
wishing that C were something radically different from what
it is isn't going to help you. Either learn to solve your
problems in C, or find another language -- which may mean
finding another "it."
If standards mandate UTF-8 your techniques will have to change or you're
going to be doing a lot of painful character encoding conversions at
interface boundaries.
... except that's exactly the place I'd *want* to do
them! If I want to add support for Korean or even Klingon
I'd much rather concentrate on half a dozen translation
modules than go grubbing through the entire system adding
KlingonKapability to every function that touches a character.

And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.

--
Er*********@sun.com
Jul 23 '07 #14

On Mon, 23 Jul 2007 14:54:07 -0400
Eric Sosman <Er*********@sun.com> wrote:
Michael B Allen wrote On 07/23/07 14:10,:
On Mon, 23 Jul 2007 13:31:24 -0400
Eric Sosman <Er*********@sun.com> wrote:
[...]
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.

#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main()
{
printf("%c %d %x\n", CH, CH, CH);

printf("isalnum=%d\n", isalnum(CH));
printf("isalpha=%d\n", isalpha(CH));
printf("iscntrl=%d\n", iscntrl(CH));
printf("isdigit=%d\n", isdigit(CH));
printf("isgraph=%d\n", isgraph(CH));
printf("islower=%d\n", islower(CH));
printf("isupper=%d\n", isupper(CH));
printf("isprint=%d\n", isprint(CH));
printf("ispunct=%d\n", ispunct(CH));
printf("isspace=%d\n", isspace(CH));

return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0


You've heard of the <locale.h> mechanisms (you mentioned
them), but it doesn't seem that you know how or when to use
them. It's quite simple, really:

- Add #include <locale.h> near the top of the file

- Insert a setlocale() call somewhere before those
isxxx() queries. The names of locales are system-
dependent; on the machine in front of me right now
one appropriate call looks like

setlocale (LC_CTYPE, "iso_8859_1");
Ahh, yes. I just wrote that code in haste. I spaced on setlocale.

Still, my understanding from being on the linux-utf8 mailing list (which
doesn't just discuss UTF-8) for a few years is that some of these
functions do not work correctly even with the latin1 codes.
The, well, "characteristics" of character codes are a
function of the current locale, and change as the locale
changes. In the "C" locale, there are (for instance) only
52 alphabetic characters; the other 204 are non-alphabetic
regardless of what glyphs they may produce on an output
device. Change locale -- which is to say, change to a
different set of customs about the meanings of characters --
and you get (potentially) a different answer.
Again, even if these functions did work they *still* wouldn't handle
non-latin1 encodings (e.g. UTF-8).
C's one-char-is-one-character style does not fit well
with multi-byte encodings, and especially not with variable-
length encodings. Granted; no argument; you're in the right.
But making char unsigned will not cure this illness, nor
even cure the least troubling of its symptoms; it's simply
beside the point.
I never said using unsigned char would "fix" anything. I don't know where
you got that in this conversation. I just want to use the right data
type for binary data and string literals were the only thing standing
in the way.
If standards mandate UTF-8 your techniques will have to change or you're
going to be doing a lot of painful character encoding conversions at
interface boundries.
... except that's exactly the place I'd *want* to do
them! If I want to add support for Korean or even Klingon
I'd much rather concentrate on half a dozen translation
modules than go grubbing through the entire system adding
KlingonKapability to every function that touches a character.
This is where we clearly disagree. Translating between character encodings
at interface boundaries is a hack.

And not every function that touches a
character would need "KlingonKapability" (whatever that is). Most of the
time you're just working on ASCII characters anyway so that code
is not a whole lot different from before. It's only when you need to
do things like case comparison of non-ASCII characters that you need to
do extra work. But even then you just put that work into a function like
text_casecmp and you don't have to think about it much again.

You make it sound like I'm rewriting everything and have my own
translation tables and so on. That's not the case. As useless as
the C standard library is, I still have no choice but to use it. The
implementations are just less efficient and inelegant. For example to do
a caseless comparison of two strings you have to use mbtowc to convert
each character to wchar_t, use towupper, then compare and repeat for
the next character. It's kinda ugly but it's a lot better than tripping
over the Unicode speed bump every time you want to call a function that
expects a different character encoding.
And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.
Yeah, yeah, yeah. I agree that even though signed char is the wrong type
for binary data, there's no technical problem with using it throughout
higher level code. I never claimed there was.

The problem (which isn't really much of a problem) is that any function
that actually inspects the binary data representing text will almost
certainly want to do it using unsigned char. For example, codebases that
support UTF-8 throughout usually have a fast code path for UTF-8. A
function to decode one UTF-8 character into its Unicode value might
start out something like the following:

int
utf8towc(const char *ssrc, const char *sslim, wchar_t *wc)
{
    const unsigned char *src = (const unsigned char *)ssrc;
    const unsigned char *slim = (const unsigned char *)sslim;

    if (*src < 0x80) {
        *wc = *src;
    } else if ((*src & 0xE0) == 0xC0) {
        ...

You can't do the above conditional comparisons on signed char so you
gotta cast.

Mike
Jul 23 '07 #15

On Mon, 23 Jul 2007 16:17:11 -0400
Michael B Allen <io****@gmail.com> wrote:
On Mon, 23 Jul 2007 14:54:07 -0400
Eric Sosman <Er*********@sun.com> wrote:
Michael B Allen wrote On 07/23/07 14:10,:
On Mon, 23 Jul 2007 13:31:24 -0400
Eric Sosman <Er*********@sun.com> wrote:

>[...]
> Second, the <ctype.h> functions are required to accept
>>arguments whose values cover the entire range of unsigned
>>char (plus EOF). Half those values have the high bit set,
>>and the <ctype.h> functions cannot ignore that half.


#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main()
{
printf("%c %d %x\n", CH, CH, CH);

printf("isalnum=%d\n", isalnum(CH));
printf("isalpha=%d\n", isalpha(CH));
printf("iscntrl=%d\n", iscntrl(CH));
printf("isdigit=%d\n", isdigit(CH));
printf("isgraph=%d\n", isgraph(CH));
printf("islower=%d\n", islower(CH));
printf("isupper=%d\n", isupper(CH));
printf("isprint=%d\n", isprint(CH));
printf("ispunct=%d\n", ispunct(CH));
printf("isspace=%d\n", isspace(CH));

return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0


You've heard of the <locale.h> mechanisms (you mentioned
them), but it doesn't seem that you know how or when to use
them. It's quite simple, really:

- Add #include <locale.h> near the top of the file

- Insert a setlocale() call somewhere before those
isxxx() queries. The names of locales are system-
dependent; on the machine in front of me right now
one appropriate call looks like

setlocale (LC_CTYPE, "iso_8859_1");
Ahh, yes. I just wrote that code in haste. I spaced on setlocale.

Still, my understanding from being on the linux-utf8 mailing list (which
doesn't just discuss UTF-8) for a few years is that some of these
functions do not work correctly even with the latin1 codes.
Actually I think maybe I'm wrong about this. I think these functions do
work with latin1.

But of course they definitely do not work with multi-byte encodings like
UTF-8 or Shift-JIS or something like that. For that you would have to
convert to a wchar_t and test with isw*.

Mike
Jul 23 '07 #16

P: n/a
Michael B Allen <io****@gmail.com> writes:
[...]
Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
[...]
setjmp - not portable
The restrictions are a bit severe, but how is it not portable? Any
conforming hosted implementation has to support it correctly.

[...]
stdlib - malloc has no context object
Can you expand on this? What's a "context object"?

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Jul 23 '07 #17

P: n/a
Eric Sosman <Er*********@sun.com> writes:
[...]
And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.
If you're just using ASCII (which is a 7-bit character set), it
doesn't matter whether plain char is signed or unsigned.

If you're using one of the 8-bit extended versions of ASCII, such as
ISO 8859-1, then making plain char unsigned can have some advantages.
For example, lower case 'e' with an acute accent is character 0xe9;
referring to that as -23 rather than 233 doesn't make a lot of sense.
(I understand the sign-extension performance issue; does that still
apply to modern systems?)

Once you go beyond that, I'm not sure whether it matters or not.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Jul 23 '07 #18

P: n/a
Michael B Allen <io****@gmail.com> writes:
locale - no context object so it can't be safely used in libraries
POSIX is standardizing an extended set of locale functions with
what you call "context objects". It appears to me to be
implementable as a set of wrappers around the existing functions
for systems that don't support it, so this may actually be a
viable interface fairly soon. (For folks who are familiar with
gnulib--which is not c.l.c compliant code by any means--I'm
thinking about adding a module to support these new functions.)
setjmp - not portable
How so? I've successfully used it in code that I believe to be
portable, and some fairly portable libraries, e.g. libpng, use it
also.
--
int main(void){char p[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.\
\n",*q="kl BIcNBFr.NKEzjwCIxNJC";int i=sizeof p/2;char *strchr();int putchar(\
);while(*q){i+=strchr(p,*q++)-p;if(i>=(int)sizeof p)i-=sizeof p-1;putchar(p[i]\
);}return 0;}
Jul 23 '07 #19

P: n/a
Keith Thompson <ks***@mib.org> writes:
(I understand the sign-extension performance issue; does that still
apply to modern systems?)
What performance issue is there with sign extension?
--
Ben Pfaff
http://benpfaff.org
Jul 23 '07 #20

P: n/a
Keith Thompson wrote:
Michael B Allen <io****@gmail.com> writes:
[...]
>Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.

[...]
>stdlib - malloc has no context object

Can you expand on this? What's a "context object"?
He probably means that the Standard library functions are not guaranteed to
be reentrant.

Jul 23 '07 #21

P: n/a
On Mon, 23 Jul 2007 14:01:18 -0700
Ben Pfaff <bl*@cs.stanford.edu> wrote:
Michael B Allen <io****@gmail.com> writes:
locale - no context object so it can't be safely used in libraries

POSIX is standardizing an extended set of locale functions with
what you call "context objects". It appears to me to be
implementable as a set of wrappers around the existing functions
for systems that don't support it, so this may actually be a
viable interface fairly soon. (For folks who are familiar with
gnulib--which is not c.l.c compliant code by any means--I'm
thinking about adding a module to support these new functions.)
Glad to hear it. Do you have pointers to info about what the API will
look like?
setjmp - not portable

How so? I've successfully used it in code that I believe to be
portable, and some fairly portable libraries, e.g. libpng, use it
also.
I don't know. I just remember suggesting using setjmp for some Samba thing
(the CIFS server for UNIX) and I was told setjmp was taboo because it
was not portable. Those guys have a pretty big build farm so I wasn't
about to ask for an explanation.

Mike
Jul 23 '07 #22

P: n/a
On Tue, 24 Jul 2007 02:40:05 +0530
santosh <sa*********@gmail.com> wrote:
Keith Thompson wrote:
Michael B Allen <io****@gmail.com> writes:
[...]
Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
[...]
stdlib - malloc has no context object
Can you expand on this? What's a "context object"?

He probably means that the Standard library functions are not guaranteed to
be reentrant.
Hi santosh (and Keith),

By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.

[Note that this should not be confused with Object Oriented Programming
so please hold the "you want C++" responses. For OOP you need polymorphism
which is not obtained by simply adding a "context object".]

In general, malloc (like many of the C APIs) was just poorly designed. At
the time it was invented it was fine but now it's a good example of a
poorly designed API.

Incidentally, the current malloc is reentrant and thread-safe because
it uses locks (although I don't recall off the top of my head if that
is a standards requirement). But with a context object you wouldn't
need the locks and the code would still be reentrant and it would be
thread-safe provided your threads used their own context objects (or
if they used locks). So reentrance is one benefit but it is by no means the
only one.

Mike
Jul 23 '07 #23

P: n/a
Michael B Allen <io****@gmail.com> writes:
By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.
Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.
--
Ben Pfaff
http://benpfaff.org
Jul 23 '07 #24

P: n/a
On Mon, 23 Jul 2007 15:26:06 -0700
Ben Pfaff <bl*@cs.stanford.edu> wrote:
Michael B Allen <io****@gmail.com> writes:
By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.

Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.
Hi Ben,

I think you just agreed with me (not sure) but I just want to add that
the "context object" focus comes from the fact that you can use the same
functions with different context objects. So the context object would
encapsulate everything about the behavior of the allocator. That allows
the user to swap out the allocator with a different implementation but
without changing all of the allocation calls in your code.

Mike
Jul 23 '07 #25

P: n/a
On Sun, 22 Jul 2007 19:54:26 -0400, in comp.lang.c, Michael B Allen
<io****@gmail.com> wrote:
>
I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.
Huh? A char array is perfect for text.

Do you perchance mean wide characters? Considered wchar_t?
>Type char is not the correct type for text. It is mearly adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.
Binary blobs are not text however. They're binary data. Unsigned char
arrays are good for that, but I suspect you want either wchar_t or
some specific binary representation of multibyte characters. If
/thats/ what you're after, unsigned char arrays are still good.

--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Jul 23 '07 #26

P: n/a
Michael B Allen <io****@gmail.com> writes:
On Mon, 23 Jul 2007 15:26:06 -0700
Ben Pfaff <bl*@cs.stanford.edu> wrote:
>Michael B Allen <io****@gmail.com> writes:
By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.

Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.

I think you just agreed with me (not sure)[...]
No, I was trying to say that you can layer all of the allocators
you want on top of malloc, or make them independent of it,
instead of needing to give malloc multiple independent arenas,
etc., based on a context object.
--
Ben Pfaff
http://benpfaff.org
Jul 24 '07 #27

P: n/a
On Mon, 23 Jul 2007 17:24:26 -0700
Ben Pfaff <bl*@cs.stanford.edu> wrote:
Michael B Allen <io****@gmail.com> writes:
On Mon, 23 Jul 2007 15:26:06 -0700
Ben Pfaff <bl*@cs.stanford.edu> wrote:
Michael B Allen <io****@gmail.com> writes:

By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.

Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.
I think you just agreed with me (not sure)[...]

No, I was trying to say that you can layer all of the allocators
you want on top of malloc, or make them independent of it,
instead of needing to give malloc multiple independent arenas,
etc., based on a context object.
Hi Ben,

I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless because it didn't have
one. And as such it shouldn't be used.

As for allocating backing memory from malloc(3) for one such
implementation of the API I'm describing that seems fine (if you're ok
with the locking overhead) but I'm not sure I understand the significance
of doing that wrt the topic of malloc being poorly designed.

Mike
Jul 24 '07 #28

P: n/a
Michael B Allen <io****@gmail.com> writes:
I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless because it didn't have
one. And as such it shouldn't be used.
malloc is "useless" because it doesn't have a context object?
Please don't exaggerate. This is refuted by the existence of
hundreds of millions of lines of code that make use of malloc.
--
Ben Pfaff
http://benpfaff.org
Jul 24 '07 #29

P: n/a
Michael B Allen wrote:
[...]
I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless [...]
For the Nth time: Forget about C and find a language
more suited to your tastes. If you truly believe C is
useless, you're just wasting your time and our patience.
Go away! Be happy! Be happy somewhere else, please!
We who are about to be obsoleted salute thee; just leave
us to our misery and begone!

--
Eric Sosman
es*****@ieee-dot-org.invalid
Jul 24 '07 #30

P: n/a
On Mon, 23 Jul 2007 22:55:24 -0400
Eric Sosman <es*****@ieee-dot-org.invalid> wrote:
Michael B Allen wrote:
[...]
I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless [...]

For the Nth time: Forget about C and find a language
more suited to your tastes. If you truly believe C is
useless, you're just wasting your time and our patience.
Go away! Be happy! Be happy somewhere else, please!
We who are about to be obsoleted salute thee; just leave
us to our misery and begone!
Oh please. I appreciate your input. It's usually good advice. But spare
me the drama. Just because I think The C Standard Library is useless
[1], that has little impact on using C The Language.

Mike

[1] Ok, yes, "useless" is an exaggeration simply because you *have*
to use the standard library to interface with the host. But otherwise
I don't use a lot of it (e.g. I literally don't use malloc *at all* -
I have my own allocators).
Jul 24 '07 #31

P: n/a

"Michael B Allen" <io****@gmail.com> wrote in message
news:20****************************@gmail.com...
Oh please. I appreciate your input. It's usually good advice. But spare
me the drama. Just because I think The C Standard Library is useless
[1], that has little impact on using C The Language.
I grew up on systems without a standard library.
Just about the first thing I always did when getting new hardware was to
implement a cut down stdlib for it.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Jul 24 '07 #32

P: n/a
On Mon, 23 Jul 2007 23:45:58 -0400, in comp.lang.c, Michael B Allen
<io****@gmail.com> wrote:
>On Mon, 23 Jul 2007 22:55:24 -0400
Eric Sosman <es*****@ieee-dot-org.invalid> wrote:
>Michael B Allen wrote:
[...]
I never said that the exiting malloc(3) function should be changed to
have a context object. I just said it was useless [...]

For the Nth time: Forget about C and find a language
more suited to your tastes.
>Just because I think The C Standard Library is useless
[1], that has little impact on using C The Language.
The point is, the Standard Library is part of the language definition.
You are also apparently taking exception to how C defines a string
variable. This is a fundamental part of C. And claiming that malloc is
useless is just plain stupid (tm).

--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Jul 24 '07 #33

P: n/a
Michael B Allen <io****@gmail.com> wrote:
Incidentally, the current malloc is reentrant and thread-safe because
it uses locks (although I don't recall off the top of my head if that
is a standards requirement).
_The_ current malloc()? I suspect you don't grasp the real situation,
here.

Richard
Jul 26 '07 #34
