Marco Iannaccone wrote:
I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to start.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing and reading
from file, from the console, processing Unicode strings and chars
inside the program, etc.)?
C provides a concept of wide characters (arrays of wchar_t) and
multibyte characters (arrays of char where each character may take up
more than one byte). The C standard defines functions for converting
between wide and multibyte representations. The standard does not
specify what encoding these two representational forms take.
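To make the multibyte side concrete, here is a minimal sketch that counts the characters (not bytes) in a multibyte string by stepping through it with the standard mblen function. The name count_chars is made up for this illustration, and the result depends entirely on the current locale's encoding.

```c
#include <stdlib.h>

/* Hypothetical helper: count the characters in a multibyte string
   under the current locale. Returns -1 if an invalid sequence is found. */
long count_chars(const char *s)
{
    long n = 0;

    mblen(NULL, 0);                 /* reset any shift state */
    while (*s)
    {
        int len = mblen(s, MB_CUR_MAX);
        if (len < 0)
            return -1;              /* not valid in this locale's encoding */
        s += len;                   /* skip all bytes of this character */
        n++;
    }
    return n;
}
```

For plain ASCII input the count equals strlen; under a UTF-8 locale, a character such as a Chinese ideograph would be expected to advance the pointer by three bytes while adding one to the count.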
On at least one platform, depending on the current locale setting, the
wide characters built in to C represent Unicode characters, and the
multibyte characters represent the UTF-8 form.
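One way to find out whether a given platform and locale behave like this is to probe at run time. The sketch below is an assumption-laden illustration, not a standard facility: probe_locale is a hypothetical name, and it simply decodes one known UTF-8 sequence with mbtowc and checks that the result is the expected Unicode code point. C99 also lets an implementation define the macro __STDC_ISO_10646__ when wchar_t values are ISO 10646 code points, which the second function reports.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical probe. Switches LC_CTYPE to the named locale (a side
   effect the caller must be happy with) and returns:
     -1 if the locale is not available,
      1 if multibyte strings are decoded as UTF-8 into Unicode values,
      0 otherwise. */
int probe_locale(const char *name)
{
    static const char shui[] = "\xE6\xB0\xB4";  /* U+6C34 encoded as UTF-8 */
    wchar_t wc = 0;
    int len;

    if (!setlocale(LC_CTYPE, name))
        return -1;

    mbtowc(NULL, NULL, 0);                      /* reset any shift state */
    len = mbtowc(&wc, shui, sizeof shui - 1);
    return len == 3 && wc == 0x6C34;
}

/* Reports whether the implementation promises ISO 10646 wide characters. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

Calling probe_locale("en_AU.UTF-8") before relying on the conversion would tell a program whether the approach applies on the machine it is running on.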
The following program attempts to set the locale to en_AU.UTF-8, which
means Australian English in UTF-8 encoding. The language portion doesn't
matter, just the encoding does. It then takes a UTF-8 string (which
happens to contain Simplified Chinese characters), and converts it to
the wide character representation, which on my platform is equivalent to
Unicode.
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    wchar_t ucs2[5];

    if (!setlocale(LC_ALL, "en_AU.UTF-8"))
    {
        printf("Unable to set locale to Australian English in UTF-8\n");
        return EXIT_FAILURE;
    }

    /* The UTF-8 representation of string "水调歌头"
       (four Chinese characters pronounced shui3 diao4 ge1 tou2) */
    const char *utf8 = "\xE6\xB0\xB4\xE8\xB0\x83\xE6\xAD\x8C\xE5\xA4\xB4";

    /* mbstowcs returns (size_t)-1 if the bytes are not valid in this locale */
    if (mbstowcs(ucs2, utf8, sizeof ucs2 / sizeof *ucs2) == (size_t)-1)
    {
        printf("Unable to convert the string to wide characters\n");
        return EXIT_FAILURE;
    }

    /* Dump the multibyte form one byte at a time */
    printf("UTF-8: ");
    for (const char *p = utf8; *p; p++)
        printf("%02X ", (unsigned)(unsigned char)*p);
    printf("\n");

    /* Dump the wide form one character at a time */
    printf("Unicode: ");
    for (wchar_t *p = ucs2; *p; p++)
        printf("U+%04lX ", (unsigned long) *p);
    printf("\n");

    return 0;
}
[sbiber@eagle c]$ c99 -Wall utf8ucs2.c -o utf8ucs2
[sbiber@eagle c]$ ./utf8ucs2
UTF-8: E6 B0 B4 E8 B0 83 E6 AD 8C E5 A4 B4
Unicode: U+6C34 U+8C03 U+6B4C U+5934
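Going the other way, from wide characters back to the multibyte form (for example before writing to a file), can be sketched with wcstombs. This is a minimal illustration under the same assumptions about the locale; wide_to_multibyte is a made-up helper name, and the caller is expected to free the result.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated
   multibyte string in the current locale's encoding.
   Returns NULL on allocation failure or if a character cannot be
   represented in that encoding. */
char *wide_to_multibyte(const wchar_t *ws)
{
    /* Each wide character needs at most MB_CUR_MAX bytes, plus the
       terminating null byte. */
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *out = malloc(cap);

    if (!out)
        return NULL;

    if (wcstombs(out, ws, cap) == (size_t)-1)
    {
        free(out);
        return NULL;
    }
    return out;
}
```

Under a UTF-8 locale on a platform like the one above, passing the ucs2 array from the program should give back the original twelve bytes; in other locales the call may fail and return NULL.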
I'd be interested to know how widely this technique works. Is it
portable?
--
Simon.