Dietmar Kuehl wrote:
> I didn't dispute this. However, some Unicode sequences don't make any
> sense if you rip apart certain characters, notably the combination of
> a Unicode character and a following combining character (which are two
> Unicode characters if I got things right).
No, that makes perfect sense: it's two Unicode characters, the first
being, say, LATIN SMALL LETTER U (0x0075), and the second being
COMBINING DIAERESIS (0x0308). If you're concerned about keeping those
two Unicode characters together, replace them with the single character
LATIN SMALL LETTER U WITH DIAERESIS (0x00fc).
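As a rough sketch of that replacement, assuming text held as 32-bit code points in a std::u32string: the helper name compose_u_diaeresis and the three constants are made up for illustration, and the function handles only this one pair. Real code would use a full Unicode normalization (NFC) implementation rather than a hand-written special case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative constants for the three characters named above.
constexpr char32_t kSmallU = 0x0075;              // LATIN SMALL LETTER U
constexpr char32_t kCombiningDiaeresis = 0x0308;  // COMBINING DIAERESIS
constexpr char32_t kSmallUDiaeresis = 0x00fc;     // LATIN SMALL LETTER U WITH DIAERESIS

// Hypothetical helper: replaces each <u, combining diaeresis> pair with the
// single precomposed character. It knows about this one pair only.
std::u32string compose_u_diaeresis(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == kSmallU && i + 1 < in.size() &&
            in[i + 1] == kCombiningDiaeresis) {
            out.push_back(kSmallUDiaeresis);
            ++i;  // skip the combining character that was just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Small usage check: four code points go in, three come out.
void demo_compose() {
    const std::u32string text{U'a', kSmallU, kCombiningDiaeresis, U'b'};
    const std::u32string composed = compose_u_diaeresis(text);
    assert(text.size() == 4);
    assert(composed.size() == 3);
    assert(composed[1] == kSmallUDiaeresis);
}
```

Either way, each element of the string is a complete Unicode character on its own; the composition only changes how many of them there are.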
The point is that in Unicode every code point (i.e. valid numeric value
in a 32-bit representation) always means the same thing; you don't have
to look at context to figure out what it means. That's the basic
requirement for wchar_t, as well. It's not the case for char, though,
because the meaning of a single char value can depend on what comes
after it (the first byte of a multi-byte character) or on what came
before it (with shift encodings, and with the second or subsequent
bytes of a multi-byte character).
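To make the char case concrete, here is a small sketch that assumes UTF-8 as the multi-byte encoding (my choice for illustration; the argument applies to any multi-byte or shift encoding). The function name decode_short_utf8 is hypothetical, and it handles only one- and two-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the UTF-8 sequence that starts at s[pos].
// Sketch only: it handles one- and two-byte sequences and nothing longer.
char32_t decode_short_utf8(const std::string& s, std::size_t pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (lead < 0x80) {
        return lead;  // single byte: the value stands on its own
    }
    // Two-byte sequence: the lead byte is 110xxxxx, the trail byte is 10xxxxxx.
    assert((lead & 0xE0) == 0xC0 && pos + 1 < s.size());
    const unsigned char trail = static_cast<unsigned char>(s[pos + 1]);
    assert((trail & 0xC0) == 0x80);
    return static_cast<char32_t>(((lead & 0x1Fu) << 6) | (trail & 0x3Fu));
}
```

With this, the byte sequences "\xC3\xBC" and "\xC2\xBC" decode to 0x00fc and 0x00bc respectively, even though their second char has the same value, 0xBC. That char means nothing until you know what came before it, which is exactly what never happens with a 32-bit code point.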
As to glyphs, they involve a great deal more than what we might call a
"letter". From the Unicode standard:
    The difference between identifying a code value and rendering it
    on screen or paper is crucial to understanding the Unicode
    Standard's role in text processing. The character identified by
    a Unicode value is an abstract entity, such as "LATIN CAPITAL
    LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper,
    called a glyph, is a visual representation of the character.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)