std::string vs. Unicode UTF-8

I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.
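
A minimal sketch of that count, assuming well-formed UTF-8: continuation
bytes always have the bit pattern 10xxxxxx, so counting the bytes that are
not continuations yields the number of code points.

    #include <cstddef>
    #include <string>

    // Count code points in a well-formed UTF-8 string by skipping
    // continuation bytes (those matching 10xxxxxx).
    std::size_t utf8_length(const std::string& s)
    {
        std::size_t n = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i)
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                ++n;
        return n;
    }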

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class Glib::ustring
<http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger
--


Sep 13 '05 #1
> The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.


It is much easier to handle Unicode strings with wchar_t internally and
there is much less confusion about whether the string is ANSI or UTF-8
encoded. So I have started using wchar_t wherever I can and I only use UTF-8
for external communication.

Niels Dybdahl

Sep 14 '05 #2
Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class Glib::ustring
<http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger


UTF-8 is only an encoding; why do you think strings internal to the
program should be represented as UTF-8? It makes more sense to me to
translate to or from UTF-8 when you input or output strings from your
program. C++ already has the framework in place for that.

john


Sep 14 '05 #3

Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.

Correct. Also you can't print it or do much of anything else with it.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class Glib::ustring
<http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

Ok.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version?

It already is - using e.g. wchar_t.

I18N is an important topic nowadays and I simply see no logical
reason to keep std::string as limited as it is nowadays.

It is not limited.

Of course there is also the wchar_t variant, but actually I don't
like that.

So you'd like to have Unicode support. And you realize you already
have it. But you don't like it. Why?

/Peter


Sep 14 '05 #4
On Tue, 13 Sep 2005 04:20:30 GMT, wd********@darkstargames.de
(Wolfgang Draxinger) wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.

Not only that, but substr(), operator[] etc. pose equally
"interesting" problems.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class Glib::ustring
<http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger


People use std::string in many different ways. You can even store
binary data with embedded null characters in it. I don't know for
sure, but I believe there are already proposals in front of the C++
standards committee for what you suggest. In the meantime, it might
make more sense to use a third-party UTF-8 string class if that is
what you mainly use it for. IBM has released the ICU library as open
source, for example, and it is widely used these days.

--
Bob Hairgrove
No**********@Home.com


Sep 14 '05 #5

"Wolfgang Draxinger" <wd********@darkstargames.de> wrote in message
news:q2***********@darkstargames.dnsalias.net...
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class Glib::ustring
<http://tinyurl.com/bxpu4> which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger


That's why people have std::wstring :)

Ben

Sep 14 '05 #6
Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.

Yup. That's what happens when you use the wrong tool.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays.

There's much more to internationalization than Unicode. Requiring
std::string to be Unicode aware (presumably that means UTF-8 aware)
would impose implementation overhead that's not needed for the kinds of
things it was designed for, like the various ISO 8859 code sets. In
general, neither string nor wstring knows anything about multi-character
encodings. That's for efficiency. Do the translation on input and output.

Of course there is
also the wchar_t variant, but actually I don't like that.


That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)


Sep 14 '05 #7

Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.

Usually correct, but not always. A char is a byte in C++, but
a byte might not be an octet. UTF-8 is of course octet-based.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.


wchar_t isn't always Unicode, either. There's a proposal to add an
extra Unicode char type, and that probably will include std::ustring.

However, that is probably a 20+ bit type. Unicode itself assigns
numbers to characters, and the numbers have exceeded 65536.
UTF-x means Unicode Transformation Format - x. These formats
map each number to one or more x-bit values. E.g. UTF-8 maps
the number of each Unicode character to an octet sequence,
with the additional property that the 0 byte isn't used for
anything but number 0.
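
A sketch of that mapping (a hypothetical helper, covering the full range
up to 0x10FFFF; surrogate values are not checked here):

    #include <string>

    // Append the UTF-8 encoding of one Unicode code point to a byte string.
    void append_utf8(std::string& out, unsigned long cp)
    {
        if (cp < 0x80) {                                  // 1 octet
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                          // 2 octets
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                        // 3 octets
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                          // 4 octets
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

Note how a 0 byte can only come out of the first branch, for code point 0
itself.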

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.

HTH,
Michiel Salters


Sep 14 '05 #8
msalters wrote:
Wolfgang Draxinger wrote:
[...] However, that is probably a 20+bit type. Unicode itself
assigns numbers to characters, and the numbers have exceeded
65536. UTF-x means Unicode Transformation Format - x. These
formats map each number to one or more x-bit values.
E.g. UTF-8 maps the number of each unicode character to an
octet sequence, with the additional property that the 0 byte
isn't used for anything but number 0.

It has a lot more additional properties than that. Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.
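
A sketch of that classification, looking at nothing but the byte itself:

    // Classify one byte of a well-formed UTF-8 stream.
    enum Utf8ByteKind { Single, Lead, Continuation };

    Utf8ByteKind classify(unsigned char b)
    {
        if ((b & 0x80) == 0x00) return Single;        // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return Continuation;  // 10xxxxxx
        return Lead;                                  // 11xxxxxx
    }
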
Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.


I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Sep 14 '05 #9
On 14 Sep 2005 14:40:21 GMT, "kanze" <ka***@gabi-soft.fr> wrote:

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.


I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)


RFC 3629 says it this way:

"ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
encoding form, each character is represented as one or more encoding
units. All standard UCS encoding forms except UTF-8 have an encoding
unit larger than one octet, making them hard to use in many current
applications and protocols that assume 8 or even 7 bit characters."

Note that UTF-8 is intended to _encode_ a larger space, its primary purpose
being the compatibility of the encoded format with "applications and protocols"
that assume 8- or 7-bit characters. This suggests to me that UTF-8 was devised
so that Unicode text can be _passed through_ older protocols that only
understand 8- or 7-bit characters by encoding it at the input, and later
decoding it at the output to recover the original data.

If you want to _manipulate_ Unicode characters, however, why not deal with
them in their native, unencoded space? wchar_t is guaranteed to be wide enough
to contain all characters in all supported locales in the implementation, and
each character will have an equal size in memory.
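
A sketch of that decode-at-the-boundary approach using the standard C
facilities (the locale name "en_US.UTF-8" is an assumption and is
platform-dependent):

    #include <clocale>
    #include <cstdlib>

    int main()
    {
        std::setlocale(LC_ALL, "en_US.UTF-8");          // assumed available
        const char* utf8 = "Gr\xC3\xBC\xC3\x9F Gott";   // "Gruess Gott" in UTF-8
        wchar_t wide[64];
        // Decode the multibyte (UTF-8) string into fixed-width wchar_t units.
        std::size_t n = std::mbstowcs(wide, utf8, 64);
        return n == static_cast<std::size_t>(-1);       // nonzero exit on failure
    }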

-dr
Sep 15 '05 #10

kanze wrote:
msalters wrote:
Wolfgang Draxinger wrote:


[...]
However, that is probably a 20+bit type. Unicode itself
assigns numbers to characters, and the numbers have exceeded
65536. UTF-x means Unicode Transformation Format - x. These
formats map each number to one or more x-bit values.
E.g. UTF-8 maps the number of each unicode character to an
octet sequence, with the additional property that the 0 byte
isn't used for anything but number 0.


It has a lot more additional properties than that. Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.


Yep, that makes scanning through a byte sequence a lot easier.
However, that's not very important for std::string. .substr()
can't do anything useful with it. For .c_str(), the non-null
property is important.
Of course, for a utf8string type, these additional properties
make implementations a lot easier. UTF-8 is quite a good encoding,
actually.
Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.


I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)


Getting a substring, uppercasing, finding characters, replacing
characters: all common string operations, but non-trivial in UTF-8.
Saving to file, sending over TCP/IP, or to mobile devices: all
common I/O operations, and UTF-8 makes them easy.

Regards,
Michiel Salters


Sep 16 '05 #11
msalters wrote:
kanze wrote:
[...]
I don't know where you find that these formats are intended
just for data transfer. Depending on what the code is doing
(and the text it has to deal with), the ideal solution may
be UTF-8, UTF-16 or UTF-32. For most of what I do, UTF-8
would be more appropriate, including internally, than any of
the other formats. (It's also required in some cases.)

Getting a substring, uppercasing, finding characters,
replacing characters: all common string operations, but
non-trivial in UTF8.
I said "for most of what I do". Comparing for equality, using
as keys in std::set or an unordered_set, for example. UTF-8
works fine, and because it uses less memory, it will result in
better overall performance (less cache misses, less paging,
etc.).
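
For instance, byte-wise comparison is all a std::set key needs, so UTF-8
keys work unmodified (assuming all keys are produced in a single
normalization form, so equal text means equal bytes):

    #include <set>
    #include <string>

    int main()
    {
        std::set<std::string> names;             // ordered by byte-wise comparison
        names.insert("Wei\xC3\x9F");             // "Weiss" with sharp s, in UTF-8
        return names.count("Wei\xC3\x9F") != 1;  // exact byte sequence is found
    }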

In other cases, I've been dealing with binary input, with
embedded UTF-8 strings. Which means that I cannot translate
directly on input, only once I've parsed the binary structure
enough to know where the strings are located. In the last
application, the strings were just user names and passwords --
again, no processing which wouldn't work just fine in UTF-8.

Imagine a C++ compiler. The only place UTF-8 might cause some
added difficulty is when scanning a symbol -- and even there, I
can imagine some fairly simple solutions. For all of the
rest... the critical delimiters can all be easily recognized in
UTF-8, and of course, once past scanning, we're talking about
symbol table management, and perhaps concatenation (to generate
mangled names), but they're both easily done in UTF-8. All in
all, I think a C++ compiler would be a good example of an
application where using UTF-8 as the internal encoding would
make sense.
Saving to file, sending over TCP/IP, or to mobile devices: all
common I/O operations, and UTF8 makes it easy.


The external world is byte oriented. That's for sure. UTF-8
(or some other 8 bit format) is definitely required for external
use. But there are numerous cases where UTF-8 is also a good
choice for internal use as well; why bother with the conversions
and the added memory overhead if it doesn't buy you anything?

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Sep 17 '05 #12
UTF-8 is already in iostream. On just about any platform, when you set
your locale to something with "utf8" support, your library's codecvt
facet will likely convert the UTF-8 to whatever wide char type your
platform supports - which on some platforms is 16 bits, and on others 32.
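
A sketch of that implicit conversion on input (the locale name is an
assumption and varies by platform):

    #include <fstream>
    #include <locale>
    #include <string>

    int main()
    {
        std::wifstream in;
        in.imbue(std::locale("en_US.UTF-8"));  // codecvt facet chosen here
        in.open("input.txt");                  // file assumed to be UTF-8
        std::wstring line;
        while (std::getline(in, line)) {
            // 'line' now holds wide characters, not raw UTF-8 octets.
        }
        return 0;
    }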

But the main trouble that C++ programmers have with Unicode is that
they still want to use it just like arrays of ASCII encoded characters
that you can send to a console command line. That won't work. At the
very least, Unicode assumes that it will be displayed on a graphical
terminal. And there is certainly no "one to one correspondence"
between the characters rendered by the device and what you see encoded
in your Unicode string.
And don't even ask about Unicode regular expressions or "equality
comparison" -- consider JavaScript:

    var a = 'Hello';
    var b = ' World!';
    if ((a + b) == 'Hello world!') { /* ... */ }

The conditional expression really means "encode in UTF-16LE, normalize
each string using Unicode Normalization Form 3, and then do a byte by
byte comparison and return true if they match".

Just like ASCII is not a better way of doing Morse Code, Unicode is not
a better ASCII, but something way different.


Sep 17 '05 #13
Pete Becker wrote:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?


Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence


Sep 28 '05 #14
Dietmar Kuehl wrote:
Pete Becker wrote:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities

^^^^^^^^^^^^^^^^

16-bit?

Mirek
Sep 28 '05 #15
On Wed, 28 Sep 2005 08:28:13 +0200, Mirek Fidler <cx*@volny.cz> wrote:

Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities

^^^^^^^^^^^^^^^^

16-bit?


From the Unicode Technical Introduction:

"In all, the Unicode Standard, Version 4.0 provides codes for 96,447
characters from the world's alphabets, ideograph sets, and symbol
collections...The majority of common-use characters fit into the first 64K
code points, an area of the codespace that is called the basic multilingual
plane, or BMP for short. There are about 6,300 unused code points for future
expansion in the BMP, plus over 870,000 unused supplementary code points on
the other planes...The Unicode Standard also reserves code points for private
use. Vendors or end users can assign these internally for their own characters
and symbols, or use them with specialized fonts. There are 6,400 private use
code points on the BMP and another 131,068 supplementary private use code
points, should 6,400 be insufficient for particular applications."

Despite the indication that the code space for Unicode is potentially larger
than 16 bits, the following statement seems to suggest that a 32-bit integer
is more than enough to represent all Unicode characters:

"UTF-32 is popular where memory space is no concern, but fixed width, single
code unit access to characters is desired. Each Unicode character is encoded
in a single 32-bit code unit when using UTF-32."

http://www.unicode.org/standard/principles.html

-dr
Sep 28 '05 #16
Dietmar Kuehl wrote:
Pete Becker wrote:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.


Well, true, but wchar_t can certainly be large enough to hold 20 bits.
And the claim from the Unicode folks is that that's all you need.

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)


Sep 28 '05 #17
Pete Becker wrote:
Dietmar Kuehl wrote:
Pete Becker wrote:
That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?


Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
was still pretending that they use 16-bit characters and that each
Unicode character consists of a single 16-bit character. Neither of
these two properties holds: Unicode is [currently] a 20-bit encoding
and a Unicode character can consist of multiple such 20-bit entities
for combining characters.

Well, true, but wchar_t can certainly be large enough to hold 20 bits.
And the claim from the Unicode folks is that that's all you need.


Actually, you need 21 bits. There are 0x11 planes with 0x10000 characters in
each, so 0x110000 characters. This space is completely flat, though it has
holes. Or, you can use UTF-16, where a character is encoded as 1 or 2 16-bit
values, so in C counts as neither a wide-character encoding nor a multibyte
encoding. (It might be a "multishort" encoding, if such a thing existed.) Or you
can use UTF-8, which is a true multibyte encoding. The translation between these
representations is purely algorithmic.

Anyway, 20 bits: not enough.
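
A sketch of the UTF-16 half of that algorithmic mapping (a hypothetical
helper):

    // Encode one code point (0..0x10FFFF, not itself a surrogate) as UTF-16.
    // Returns the number of 16-bit units written: 1 for the BMP, else 2.
    int to_utf16(unsigned long cp, unsigned short out[2])
    {
        if (cp < 0x10000) {
            out[0] = static_cast<unsigned short>(cp);
            return 1;
        }
        cp -= 0x10000;                                                // 20 bits left
        out[0] = static_cast<unsigned short>(0xD800 | (cp >> 10));    // high surrogate
        out[1] = static_cast<unsigned short>(0xDC00 | (cp & 0x3FF));  // low surrogate
        return 2;
    }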


Oct 1 '05 #18
Pete Becker wrote:
Dietmar Kuehl wrote:
Pete Becker wrote:
That's unfortunate, since it's exactly what wchar_t and
wstring were designed for. What is your objection to them?
Well, 'wchar_t' and 'wstring' were designed at a time when
Unicode was still pretending that they use 16-bit characters
and that each Unicode character consists of a single 16-bit
character. Neither of these two properties holds: Unicode is
[currently] a 20-bit encoding and a Unicode character can
consist of multiple such 20-bit entities for combining
characters.

(If you have 20 or more bits, there's no need for the combining
characters; they're only present to allow representing character
codes larger than 0xFFFF as two 16 bit characters.)
Well, true, but wchar_t can certainly be large enough to hold
20 bits. And the claim from the Unicode folks is that that's
all you need.


I think the point is that when wchar_t was introduced, it wasn't
obvious that Unicode was the solution, and Unicode at the time
was only 16 bits anyway. Given that, vendors have defined
wchar_t in a variety of ways. And given that vendors want to
support their existing code bases, that really won't change,
regardless of what the standard says.

Given this, there is definite value in leaving wchar_t as it is
(which is pretty unusable in portable code), and defining a new
type which is guaranteed to be Unicode. (This is, I believe,
the route C is taking; there's probably some value in remaining
C compatible here as well.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Oct 1 '05 #19
On Fri, 30 Sep 2005 23:41:35 CST, "kanze" <ka***@gabi-soft.fr> wrote:
Well, true, but wchar_t can certainly be large enough to hold
20 bits. And the claim from the Unicode folks is that that's
all you need.


I think the point is that when wchar_t was introduced, it wasn't
obvious that Unicode was the solution, and Unicode at the time
was only 16 bits anyway. Given that, vendors have defined
wchar_t in a variety of ways. And given that vendors want to
support their existing code bases, that really won't change,
regardless of what the standard says.

Given this, there is definite value in leaving wchar_t as it is
(which is pretty unusable in portable code), and defining a new
type which is guaranteed to be Unicode. (This is, I believe,
the route C is taking; there's probably some value in remaining
C compatible here as well.)


I think wchar_t is fine the way it is defined:

(3.9.1.5)
Type wchar_t is a distinct type whose values can represent distinct codes for
all members of the largest extended character set specified among the
supported locales (22.1.1). Type wchar_t shall have the same size, signedness,
and alignment requirements (3.9) as one of the other integral types, called
its underlying type.

What we need is a Unicode locale! ;-)

-dr
Oct 1 '05 #20
"kanze" <ka***@gabi-soft.fr> writes:
(If you have 20 or more bits, there's no need for the combining
characters; they're only present to allow representing character codes
larger than 0xFFFF as two 16 bit characters.)


I believe you are thinking of surrogates, rather than combining
characters, here. The need (or otherwise) for the latter is
independent of representation.

--
http://www.greenend.org.uk/rjk/


Oct 1 '05 #21
"kanze" <ka***@gabi-soft.fr> wrote in message
news:11*********************@g49g2000cwa.googlegro ups.com...
I think the point is that when wchar_t was introduced, it wasn't
obvious that Unicode was the solution, and Unicode at the time
was only 16 bits anyway. Given that, vendors have defined
wchar_t in a variety of ways. And given that vendors want to
support their existing code bases, that really won't change,
regardless of what the standard says.

Given this, there is definite value in leaving wchar_t as it is
(which is pretty unusable in portable code), and defining a new
type which is guaranteed to be Unicode. (This is, I believe,
the route C is taking; there's probably some value in remaining
C compatible here as well.)


Right, there's a (non-normative) Technical Report that defines
16- and 32-bit character types independent of wchar_t. We'll
be shipping it as part of our next release, along with a slew
of code conversions you can use with these new types.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Oct 1 '05 #22
Richard Kettlewell wrote:
"kanze" <ka***@gabi-soft.fr> writes:
(If you have 20 or more bits, there's no need for the
combining characters; they're only present to allow
representing character codes larger than 0xFFFF as two 16
bit characters.)

I believe you are thinking of surrogates, rather than
combining characters, here. The need (or otherwise) for the
latter is independent of representation.


I was definitely talking about surrogates. And it is possible to
represent any Unicode character in UTF-32 without the use of
surrogates; they are only necessary in UTF-16.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Oct 4 '05 #23
A few comments on this thread -

Unicode has been 21 bits since its inception, at least it was 21 bits by
the time Unicode 1.0 came out - (I worked with Eric Mader, Dave Opstad, and
Mark Davis at Apple <http://www.unicode.org/history/>). Although I've heard
grumblings that people would like to extend it to include pages for more
dead languages.

UCS-2 is a subset of Unicode that fits in 16 bits without double word
encoding. It is part of ISO 10646, which also defines UCS-4, which for all
practical purposes is the same encoding as UTF-32 (there's a document on the
relationship on the unicode.org site). UTF-16 and UTF-32 both have endian
variants.

Operations such as "the number of characters in a string" has very little
meaning - there is no direct relationship between characters and glyphs,
there are combining characters (not the same as a multi-byte or word
encoding). Even if defined as the number of Unicode code points in a string,
it isn't particularly interesting.

Operations such as string catenation, sub-string searching, upper-case to
lower-case conversion, and collation are all non-trivial on a Unicode string
regardless of the encoding.

I think the current string classes and codecvt functionality in the language
is pretty decent (I would have preferred if wchar_t had been nailed to 32
bits, or even 16 bits... But that will be somewhat addressed). I'd like to
see the complexity of the current string classes specified - and I think a
lightweight copy (constant time) is needed - but I think move semantics will
address this. I also think it would be good to mark strings with their
encoding because it is too easy to end up with Mojibake
<http://en.wikipedia.org/wiki/Mojibake> but I don't think this requires a
whole new string class (I honestly don't think there is such a thing as a
one size fits all string class).

I'd love to see the functionality of the IBM ICU libraries
<http://www-306.ibm.com/software/globalization/icu/index.jsp> although I'm
not a fan of the ICU C++ interface (as I mentioned above - I don't see a
need for a new string class); I'd like ICU rethought as generic algorithms
that work regardless of the string representation.

Beyond that, I'd like to work towards a standard markup - strings require
more information than just their encoding to really be handled properly. You
need to know which sections of a string are in which language (which can't
be determined completely from the characters used) - items such as gender,
plurality, and formal forms all play a part in doing proper operations such
as replacements. The ASL xstring glossary library is a step in this
direction <http://opensource.adobe.com/group__asl__xstring.html>

--
Sean Parent
Sr. Engineering Manager
Software Technology Lab
Adobe Systems Incorporated
sp*****@adobe.com


Oct 5 '05 #24
On 2005-10-04 04:00, kanze wrote:
:
I was definitly talking about surrogates. And it is possible to
represent any Unicode character in UTF-32 without the use of
surrogates;


It's even necessary, because surrogate code points outside of UTF-16
are non-conformant and cause the corresponding byte or code point
sequences to be ill-formed.

-- Niklas Matthies


Oct 5 '05 #25
kanze wrote:
Richard Kettlewell wrote:
"kanze" <ka***@gabi-soft.fr> writes:
(If you have 20 or more bits, there's no need for the
combining characters; there only present to allow
representing character codes larger than 0xFFFF as two 16
bit characters.)

I believe you are thinking of surrogates, rather than
combining characters, here. The need (or otherwise) for the
latter is independent of representation.


I was definitly talking about surrogates. And it is possible to
represent any Unicode character in UTF-32 without the use of
surrogates; they are only necessary in UTF-16.


As the Unicode documents themselves point out, what a reader would
consider to be a single character is often represented in Unicode as
the combination of several Unicode characters. Can an implementation
use UTF-32 encoding for wchar_t, and meet all of the requirements of
the C standard with respect to wchar_t, when combined characters are
involved? I think you can meet those requirements only by interpreting
every reference in the C standard to a wide "character" as referring to
a "unicode character" rather than as referring to what end users would
consider a character.

If search_string ends with an uncombined character, and target_string
contains the exact same sequence of wchar_t values followed by one or
more combining characters, I believe that wcsstr(search_string,
target_string) is supposed to report a match. That strikes me as
problematic.
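
A concrete instance of that concern (assuming a platform where wchar_t
holds Unicode code points directly):

    #include <cwchar>

    int main()
    {
        // "cafe" followed by U+0301 COMBINING ACUTE ACCENT: a reader sees
        // the final character as an accented 'e'.
        const wchar_t* target = L"cafe\x0301";
        const wchar_t* search = L"cafe";
        // wcsstr compares raw wchar_t values, so it reports a match even
        // though, to the user, the target does not end in a plain 'e'.
        return std::wcsstr(target, search) == 0;  // exits 0: match found
    }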


Oct 5 '05 #26
Sean Parent wrote:
..
I think the current string classes and codecvt functionality in the language
is pretty decent (I would have preferred if wchar_t had been nailed to 32
bits, or even 16 bits... But that will be somewhat addressed). I'd like to


Requiring wchar_t to have more than 8 bits is pointless in itself. If
an implementor would have chosen to make wchar_t 8 bits without that
requirement, forcing the implementor to use 16 bits will merely
encourage definition of a 16-bit type that contains the same range of
values as his 8 bit type would have had. In the process, you'll be
making his implementation marginally more complicated and inefficient.

What might be worthwhile is to require some actual support for Unicode.
I'm not sure it's a good idea to impose such a requirement; there's a
real advantage to giving implementors the freedom to not support
Unicode if they know that their particular customer base has no need
for it. However, such a requirement would at least guarantee some
benefit to some users, which requiring wchar_t to be at least 16 bits
would NOT do.


Oct 6 '05 #27
This was a great overview. Thanks!

I think the current string classes and codecvt functionality in the language
is pretty decent (I would have preferred if wchar_t had been nailed to 32
bits, or even 16 bits...

Of the four platforms that I regularly code for, two define wchar_t as
32 bits and two as 16 bits. And of each variety, two are big endian
(AIX and Solaris), and two are little endian (Linux and Microsoft) (I haven't
researched Cygwin, which would be interesting to see). This is four
different encodings. Any comparisons involving literals are suspect,
not to mention "binary support."

Message catalogs help -- and the diversity there is off topic, but it is
far far more non-standard and uneven than wchar_t support.

Given that most localization is done in a GUI framework rather than
through IOstreams, it would help if automatic invocation of codecvt
were placed in something like stringstream. But as it is, codecvt is only
invoked automatically in things that don't write to memory. And except
perhaps for CGI calls, there is little demand for "console mode"
internationalized applications.

I'd love to see the functionality of the IBM ICU libraries
<http://www-306.ibm.com/software/globalization/icu/index.jsp> although I'm
not a fan of the ICU C++ interface (as I mentioned above - I don't see a
need for a new string class).

The ICU C++ string has -- and I'm not kidding -- "bogus semantics":
http://icu.sourceforge.net/apiref/ic...tring.html#a82 You
check the validity of your string by calling the isBogus method.
Additionally, every ICU class inherits from UMemory, and can
only change the heap manager by redefining this base class and
redeploying the library.
The ICU looks like a port from Java, and has a very Java feel to it. I
believe it is a great starting point though.

Other than string literals, and the lack of character iterators, the
main problem with the C++ string and Unicode is the compare function.
To get a true comparison one would really use the locale's compare
function, mapped to some normalization and collation algorithm, and not
string compare, which is more or less memcmp. The interface for string
compare can only compare using the number of bytes in the smaller of
the strings to be compared -- so even if you did manage somehow to cram
normalization into a char_traits class, the traits::compare interface
requires truncating the larger of the two strings.
This works great for backward compatibility, though.
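
A sketch of routing the comparison through the locale's collate facet
instead (the locale name is an assumption):

    #include <locale>
    #include <string>

    // Compare two wide strings with the locale's collation rules
    // rather than raw code-unit values.
    int collate_compare(const std::wstring& a, const std::wstring& b)
    {
        std::locale loc("en_US.UTF-8");  // platform-dependent name
        const std::collate<wchar_t>& col =
            std::use_facet<std::collate<wchar_t> >(loc);
        return col.compare(a.data(), a.data() + a.size(),
                           b.data(), b.data() + b.size());
    }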


Beyond that, I'd like to work towards a standard markup -

But wouldn't that depend on the renderer? Adoption of XSL-FO may be
a good start. However, RIM devices etc. would barely be able to fit such
a renderer.



Oct 6 '05 #28

in article 11*********************@g49g2000cwa.googlegroups.com, Lance
Diduck at la*********@nyc.rr.com wrote on 10/5/05 11:22 PM:

Beyond that, I'd like to work towards a standard markup -

But wouldn't that depend on the renderer? But adoption of XSL-FO may be
a goos start. However, RIM devices etc would barely be able to fit such
a renderer.


I should have clarified - I'm not looking at markup for rendering intents
(that's a separate but important issue) rather for semantic intents -
marking substrings with their language, gender, plurality, and locale as
well as alternates (alternate languages, alternate forms such as
formal/casual). These are important attributes for string processing. More
RDF than XSL-FO.

Sean


Oct 6 '05 #29
On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:

What might be worthwhile is to require some actual support for Unicode.
I'm not sure it's a good idea to impose such a requirement; there's a
real advantage to giving implementors the freedom to not support
Unicode if they know that their particular customer base has no need
for it. However, such a requirement would at least guarantee some
benefit to some users, which requiring wchar_t to be at least 16 bits
would NOT do.


Like the freedom not to implement export because no-one in their customer
base needs it? ;-)

I think standard Unicode support would be more widely appreciated than
export. If some vendors continue to decide not to quite finish their
implementations, so what? The world has not stopped turning while we wait
for more C++ 98 implementations to become strictly complete. I also expect
most C++ implementors would provide Unicode support following the
standard, if it was included.

Simon Bone


Oct 7 '05 #30
Simon Bone wrote:
On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:

What might be worthwhile is to require some actual support for Unicode.
I'm not sure it's a good idea to impose such a requirement; there's a
real advantage to giving implementors the freedom to not support
Unicode if they know that their particular customer base has no need
for it. However, such a requirement would at least guarantee some
benefit to some users, which requiring wchar_t to be at least 16 bits
would NOT do.

Like the freedom not to implement export because no-one in their customer
base needs it? ;-)


Not really. The freedom to not implement export exists because
customers don't insist that an implementation be fully conforming in
that regard. The freedom to provide a trivial implementation of wide
characters is available because the standard is quite deliberately
designed to allow even a fully conforming implementation to provide
such an implementation. Those freedoms seem quite different to me.

I think standard Unicode support would be more widely appreciated than
export. ...


Perhaps; I can't speak for anyone but myself. Personally, in my current
job I have absolutely no need for Unicode support, or even support for
any encoding other than US ASCII, nor for any locale other than the "C"
locale. On the other hand, I'd love to be able to use "export". I'm not
opposed to supporting other locales, it just isn't relevant on my
current job.


Oct 7 '05 #31
On Fri, 07 Oct 2005 06:20:01 +0000, kuyper wrote:
Simon Bone wrote:
On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:

> What might be worthwhile is to require some actual support for Unicode.
> I'm not sure it's a good idea to impose such a requirement; there's a
> real advantage to giving implementors the freedom to not support
> Unicode if they know that their particular customer base has no need
> for it. However, such a requirement would at least guarantee some
> benefit to some users, which requiring wchar_t to be at least 16 bits
> would NOT do.
>


Like the freedom not to implement export because no-one in their customer
base needs it? ;-)


Not really. The freedom to not implement export exists because
customers don't insist that an implementation be fully conforming in
that regard. The freedom to provide a trivial implementation of wide
characters is available because the standard is quite deliberately
designed to allow even a fully conforming implementation to provide
such an implementation. Those freedoms seem quite different to me.


I didn't intend to compare the C++98 export requirements to the C++98
wide character requirements, but rather to some hypothetical C++0x Unicode
requirements.

At the moment, implementors have a freedom in how they implement wide
character support that in practice seems to make writing portable programs
that handle plain text more difficult than it needs to be. Adding a
requirement to support Unicode directly would help IMO.
I think standard Unicode support would be more widely appreciated than export. ...


Perhaps; I can't speak for anyone but myself. Personally, in my current
job I have absolutely no need for Unicode support, or even support for
any encoding other than US ASCII, nor for any locale other than the "C"
locale. On the other hand, I'd love to be able to use "export". I'm not
opposed to supporting other locales, it just isn't relevant on my
current job.


I find support for extended characters in most of the software I use, even
if not in all I write. Unicode really has become very widespread - enough
to be considered as portable as US ASCII ever was. So I would like support
guaranteed by the standard.

And for what its worth, I think I'd like to be able to use export too. I'm
not trying to argue for losing that (or even the hope of that), but
rather for increased requirements in plain text handling facilities. I
think a standard Unicode library would be widely enough implemented to
displace most of the various libraries currently used. It is enough of a
hassle to pass Unicode data around between different codebases now to be
worth fixing this.

I feel a lot of C++ code right now is probably using one or another
library to solve the need to use Unicode. Moving to a future where most
code needing this support uses a single, well specified interface would be
a big improvement.

If a particular implementor sees their customers as not needing this, no
doubt they will ship without it, regardless of what the standard says.
This could well happen for some compilers targeting embedded systems; and
that is not a change. Lots of implementations have rough edges and when a
particular C++ codebase is ported, problems are often found. It doesn't
mean we should give up any hope of a useful standard. Rather, it helps us
work out who is to blame or at least where the extra work should be
targeted (at improving the compiler or changing the codebase).

Simon Bone


Oct 8 '05 #32
Hi,

I've been out of C++ now for around 9 years. (Doing Java since then.)
For an upcoming project I have to get back into it.

So I have been looking around at what changed in the last ten years in the
C++ world and found interesting things like the STL, Boost, ICU, etc.

Now I'm looking for a solution for the following problem:

I need some kind of ResourceManager, which is able to suspend
some threads and dump their memory to disk. It must maintain
dependencies between resources and should be as transparent
as possible to the rest of the source code.

I think a starting point would be to use an Allocator which is connected
to a ResourceManager.
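
A minimal sketch of that idea (all names here are hypothetical; this uses
the reduced allocator interface of modern C++, so a C++03 container would
additionally need the classic typedefs, rebind, construct and destroy):

    #include <cstddef>
    #include <new>

    struct ResourceManager {                       // hypothetical manager
        void* obtain(std::size_t n) { return ::operator new(n); }
        void  release(void* p)      { ::operator delete(p); }
    };

    template <class T>
    struct ManagedAllocator {
        typedef T value_type;
        ResourceManager* mgr;

        explicit ManagedAllocator(ResourceManager* m) : mgr(m) {}
        template <class U>
        ManagedAllocator(const ManagedAllocator<U>& o) : mgr(o.mgr) {}

        // Route all container memory traffic through the manager.
        T* allocate(std::size_t n)
        { return static_cast<T*>(mgr->obtain(n * sizeof(T))); }
        void deallocate(T* p, std::size_t) { mgr->release(p); }
    };

    template <class T, class U>
    bool operator==(const ManagedAllocator<T>& a, const ManagedAllocator<U>& b)
    { return a.mgr == b.mgr; }
    template <class T, class U>
    bool operator!=(const ManagedAllocator<T>& a, const ManagedAllocator<U>& b)
    { return !(a == b); }

With something like this, e.g. std::vector<int, ManagedAllocator<int> >
pulls its memory through the manager, which could then track it.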

Is there any standard way, which I haven't found, to do this?

Many thanks for any answer!
Markus
Sep 14 '06 #33
