On Nov 15, 3:00 pm, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
TK <tok...@web.de> writes:
How can I handle multibyte characters like ä, ü (German umlauts)?
This doesn't work:
switch (c) {
case 'ä':
    // ... some action
    break;
case 'ü':
    // ... some action
    break;
// ...
}
Declare c as wchar_t and write the constants as L'ä' or
L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.
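A minimal sketch of that wide-character approach, using the UCN
forms so that the source file's encoding doesn't matter (the
function name classify and its return strings are illustrative,
not from the original post):

```cpp
#include <string>

// L'\u00E4' is a-umlaut, L'\u00FC' is u-umlaut. Writing them as
// UCNs avoids depending on the encoding the compiler assumes for
// the source file.
std::string classify(wchar_t c)
{
    switch (c) {
    case L'\u00E4':
        return "a-umlaut";
    case L'\u00FC':
        return "u-umlaut";
    default:
        return "other";
    }
}
```

This only helps, of course, if the input has already been decoded
into wchar_t values in a consistent encoding.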
It depends (and his question is opening a can of worms). If
he's not interested in internationalization (the program will
only be used in German-speaking areas), then using wide
characters is overkill. Maybe. Independently of the question
of wchar_t vs. char, the very first question is what encoding he is
using at execution time, and what encoding the compiler supposes
he is using. If, for example, he is using ISO 8859-1
everywhere, exactly what he has written might actually work---it
works with all the compilers I have here at work (where
everything is ISO 8859-1): g++, Sun CC and VC++. It probably
won't work on my Unix system at home, because there I use UTF-8.
If his environment uses UTF-8 anywhere, he'll have to find a
different solution: in UTF-8, 'ä' is a multi-byte character
(0xC3, 0xA4).
The solution he should probably adopt depends a lot on context.
If he can get away with only the characters in ISO 8859-1 (which
is sufficient for German---but he might have to handle proper
names with other characters in them), it's definitely easier to
code. If in addition he can configure his editor so that it
also writes all files in ISO 8859-1 (":set fileencoding=latin1"
in vim), and he is using one of the compilers I use (Sun CC, g++
or VC++), then he can even write the Umlauts in his source code
(but IMHO, that's pushing things a bit---I'd just use 0xE4,
etc.). If he has to deal with other characters, or with files
which might use other encodings, the problem becomes more
difficult. I usually use UTF-8, even internally, but which
encoding format to choose depends somewhat on what you are doing
with the text, and probably to some degree on the compiler as
well: for some jobs, you'll absolutely want UTF-32 (which means
using int32_t, and not wchar_t). Of course, if he's using
UTF-8, something like the above would have to be written using
an if/else chain, and not as a switch. If this only occurs
once, and there are only three or four cases in the switch, it's
no big deal; if it occurs in a lot of places, that's probably a
sign that UTF-8 is not the correct choice for the application.
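A sketch of what such an if/else chain might look like on UTF-8
input, comparing the two-byte sequences directly (the function
name and return values are illustrative):

```cpp
#include <cstring>

// In UTF-8, a-umlaut is 0xC3 0xA4 and u-umlaut is 0xC3 0xBC, so a
// single char can no longer be switched on; compare byte pairs
// at the current position instead.
const char* classifyUtf8(const char* p)
{
    if (std::memcmp(p, "\xC3\xA4", 2) == 0)
        return "a-umlaut";
    else if (std::memcmp(p, "\xC3\xBC", 2) == 0)
        return "u-umlaut";
    else
        return "other";
}
```

Every new character of interest adds another branch, which is
exactly the scaling problem described above.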
Regardless of the solution chosen, you have to consider four
encodings: that in the files you are reading and writing, that
which you use internally, that which the compiler assumes you
are using, and if you use the umlauted characters in your
source, that which the compiler uses to read your sources. Note
that L'\u00E4' isn't a panacea either. The compiler will
translate it into the a-Umlaut in whatever encoding it thinks
you are using internally. If the encoding it thinks you are
using is the one you are actually using, fine. If not,
however... If you know that you want to use Unicode, UTF-32
format, for example, your only portable solution is something
like:
typedef uint32_t UTF32Char ;
UTF32Char const aUmlaut = 0x00E4 ;
// ...
Of course, if you do this, you'll probably have to reimplement
large parts of iostream and locale as well.
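To illustrate how that portable constant would be used, here is a
sketch that decodes a one- or two-byte UTF-8 sequence into its
UTF-32 value and compares it against aUmlaut (decodeUtf8 is a
hypothetical helper, not part of any standard library; longer
sequences and validation are omitted):

```cpp
#include <cstdint>

typedef uint32_t UTF32Char;
UTF32Char const aUmlaut = 0x00E4;

// Decode a one- or two-byte UTF-8 sequence into a UTF-32 code
// point. Sequences of three or more bytes, and error checking,
// are left out for brevity.
UTF32Char decodeUtf8(unsigned char const* p)
{
    if (p[0] < 0x80)
        return p[0];                                  // ASCII byte
    return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);      // two-byte sequence
}
```

For example, the bytes 0xC3 0xA4 decode to 0x00E4, so they compare
equal to aUmlaut regardless of what the compiler thinks the source
or execution encoding is.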
--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34