Clever Monkey wrote:
> I was suggesting UTF-8 was the way to go. This means wide chars, correct?
No, it doesn't mean wide chars necessarily.
UTF-8 data is generally stored as strings in C (arrays of char
terminated by a null character).
A UTF-8 data stream may or may not have multi-byte characters. The size
of each character can vary. However, ASCII characters from 0 to 127
always occupy a single byte. Any byte in a UTF-8 data stream that has a
value from 0 to 127 must be a single character, not part of a multi-byte
character. Thus the null character ('\0') can still be used in the
normal way to terminate a string. The strlen() function is useful for
determining the number of bytes that a UTF-8 string takes, but not the
number of characters.
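To illustrate the difference between bytes and characters, here is a
minimal sketch that counts characters by skipping continuation bytes
(those of the form 10xxxxxx). The name utf8_count is made up for this
example, and the function assumes its input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the characters (code points) in a null-terminated UTF-8
   string. Every byte that is not a continuation byte (10xxxxxx)
   starts a new character. Assumes s is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "caf\xC3\xA9" ("café" with a two-byte e-acute), strlen()
gives 5 while utf8_count() gives 4.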
Functions like isalpha() or tolower() are no longer useful for UTF-8
because they operate on one byte at a time, whereas a UTF-8 character
may span several bytes. Converting a character from upper to lower case
or vice versa may even change the number of bytes that a particular
character takes up.
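A small sketch of both points, assuming the default "C" locale. The
helper lower_bytes and the two array names are invented for this
example. The byte-wise approach lowercases the ASCII letter but leaves
the two bytes of U+00C9 (capital E-acute) untouched, and the pair
U+023A / U+2C65 shows a case mapping that changes the encoded length.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Applies tolower() to each byte of s in turn. In the "C" locale this
   only affects the ASCII letters A-Z; the bytes of a multi-byte
   character are left as they are. */
void lower_bytes(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}

/* U+023A LATIN CAPITAL LETTER A WITH STROKE, 2 bytes in UTF-8 */
const char upper_a_stroke[] = "\xC8\xBA";
/* U+2C65, its lowercase counterpart, 3 bytes in UTF-8 */
const char lower_a_stroke[] = "\xE2\xB1\xA5";
```

Calling lower_bytes() on "A\xC3\x89" yields "a\xC3\x89": the "A" is
converted but the E-acute is not. A correct conversion of upper_a_stroke
would also need one more byte of storage than the original.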
Here's how I would go about converting the UTF-8 character "A" to
lowercase, on a system where there is a locale available such that
multibyte encoding is UTF-8.
/* The locale name in the line below must correspond
to a valid UTF-8 locale on your implementation */
setlocale(LC_ALL, "en_US.UTF-8");
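/* utf8 array contains the string "A" with enough space to
store any multibyte character plus the null character.
MB_LEN_MAX (from <limits.h>) is used for the size because
MB_CUR_MAX need not be a constant expression, and a
variable length array cannot have an initializer */
char utf8[MB_LEN_MAX + 1] = "A";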
/* the tmp variable will contain the wide character */
wchar_t tmp;
/* The first multibyte character found in utf8 is
converted to a wide character and stored in tmp */
mbtowc(&tmp, utf8, strlen(utf8));
/* tmp is replaced by a lowercase version of itself */
tmp = towlower(tmp);
/* tmp is converted to a multibyte character sequence
and stored in utf8, followed by a null character */
utf8[wctomb(utf8, tmp)] = 0;
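For reference, the same steps can be wrapped in a function that checks
the return values. This is only a sketch: lower_first_char is a made-up
name, the caller is assumed to have already selected a suitable locale
with setlocale(), and dst is assumed to have room for MB_LEN_MAX + 1
bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Converts the first multibyte character of src to lowercase using the
   current locale and stores it in dst, followed by a null character.
   dst must have room for at least MB_LEN_MAX + 1 bytes.
   Returns the number of bytes stored (not counting the null
   character), or -1 if src is empty or the conversion fails. */
int lower_first_char(char *dst, const char *src)
{
    wchar_t wc;
    int n;

    assert(dst != NULL && src != NULL);

    /* multibyte sequence -> wide character */
    n = mbtowc(&wc, src, strlen(src));
    if (n <= 0)
        return -1;

    /* lowercase it, then wide character -> multibyte sequence */
    n = wctomb(dst, (wchar_t)towlower((wint_t)wc));
    if (n < 0)
        return -1;

    dst[n] = '\0';
    return n;
}
```

With a buffer declared as char out[MB_LEN_MAX + 1], the call
lower_first_char(out, "A") stores "a" in out and returns 1.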
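I believe there is no issue with utf8 being written to twice in the
statement above: the subscript cannot be evaluated until wctomb() has
returned, and there is a sequence point just before the return of any
library function. Note that wctomb() returns -1 if tmp has no multibyte
representation, so robust code should check the result before using it
as an index.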
--
Simon.