
C++0x: two Unicode proposals, a correction and a different one

Ioannis Vranos
Based on a discussion about Unicode in clc++ in a thread titled "next
ISO C++ standard", the information provided at
http://en.wikipedia.org/wiki/C%2B%2B0x , and the design ideals:

1. To provide Unicode support in C++0x always and explicitly.
2. To provide support for all the Unicode encoding forms out there.

I think the implementation of these as:

a) char, char16_t and char32_t types.
b) built-in Unicode literals.

should become:

I) Library-level, implementation-defined types such as utf8_char,
utf16_char, and utf32_char, leaving the existing built-in types such as
char alone and unpolluted, now and in the future.

II) Leave b) as it is.

In this way the built-in types are not polluted with an ever-growing
list of UTFs, while in the future the obsolete ones can easily be
deprecated in the library. The pollution from an ever-growing list of
UTF character types and literals will be minimal.

Also, I think this change to the UTF implementation would require only
minimal changes to the existing C++0x draft.

---------------------------------------------------------------------------
My second thought is that Unicode support should also become optional.
This would further reduce the pollution of built-in types and string
literals. An implementation should be able to choose whether it will
support Unicode, and which encoding forms.
Jan 17 '08 #1


Phil Endecott
Ioannis Vranos wrote:
> Based on a discussion about Unicode in clc++ in a thread titled "next
> ISO C++ standard", the information provided at
> http://en.wikipedia.org/wiki/C%2B%2B0x , and the design ideals:
>
> 1. To provide Unicode support in C++0x always and explicitly.
> 2. To provide support for all the Unicode encoding forms out there.
>
> I think the implementation of these as:
>
> a) char, char16_t and char32_t types.
> b) built-in Unicode literals.
>
> should become:
>
> I) Library-level, implementation-defined types such as utf8_char,
> utf16_char, and utf32_char, leaving the existing built-in types such as
> char alone and unpolluted, now and in the future.

The problem is that if the library does something like this:

typedef uint32_t char32_t;

then when I write

char32_t c = L'a';
cout << c;

it will output c as "97" (the numeric value of 'a'), not 'a', because the
overloading of operator<< cannot detect the typedef: a typedef creates an
alias, not a new type.
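
To make the failure concrete, here is a minimal, self-contained version
(utf32_char is a made-up name standing in for the library typedef):

#include <cstdint>
#include <iostream>

// A typedef creates an alias, not a new type, so overload resolution
// for operator<< sees a plain uint32_t.
typedef std::uint32_t utf32_char;

int main()
{
    utf32_char c = L'a';      // the wide literal converts to an integer
    std::cout << c << '\n';   // prints "97" (the code of 'a'), not 'a'
}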

The library could implement a char32_t like

class char32_t {
    uint32_t impl;
    // ...
};

but that has its own problems. It all works OK if these are built-in types.

> II) Leave b) as it is.

So if I write a UTF-16 literal using the built-in literal syntax, what
is its type? It has to be a built-in type, not a library type.
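
For reference, the C++0x draft does give these literals built-in types,
for example:

// In the draft, a u"..." literal has the built-in type
// "array of const char16_t", which decays to a pointer:
const char16_t* s = u"text";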
Phil.
Jan 17 '08 #2

Ioannis Vranos
Phil Endecott wrote:
> Ioannis Vranos wrote:
>> Based on a discussion about Unicode in clc++ in a thread titled "next
>> ISO C++ standard", the information provided at
>> http://en.wikipedia.org/wiki/C%2B%2B0x , and the design ideals:
>>
>> 1. To provide Unicode support in C++0x always and explicitly.
>> 2. To provide support for all the Unicode encoding forms out there.
>>
>> I think the implementation of these as:
>>
>> a) char, char16_t and char32_t types.
>> b) built-in Unicode literals.
>>
>> should become:
>>
>> I) Library-level, implementation-defined types such as utf8_char,
>> utf16_char, and utf32_char, leaving the existing built-in types such as
>> char alone and unpolluted, now and in the future.

> The problem is that if the library does something like this:
>
> typedef uint32_t char32_t;
>
> then when I write
>
> char32_t c = L'a';
> cout << c;
>
> it will output c as "97" (the numeric value of 'a'), not 'a', because the
> overloading of operator<< cannot detect the typedef: a typedef creates an
> alias, not a new type.

Well, then the library should not do that typedef, and cout's operator<<
should be implemented to work with the provided character type.
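
As a rough illustration of what I mean (a sketch only; the utf32_char
name and the naive narrowing in operator<< are hypothetical):

#include <cstdint>
#include <iostream>

// A distinct class type, unlike a typedef, can be targeted by its own
// operator<< overload.
class utf32_char {
    std::uint32_t code;
public:
    utf32_char(std::uint32_t cp) : code(cp) {}
    std::uint32_t value() const { return code; }
};

std::ostream& operator<<(std::ostream& os, utf32_char c)
{
    // Naive narrowing for demonstration; a real library would transcode
    // to the stream's encoding.
    return os << static_cast<char>(c.value());
}

int main()
{
    utf32_char c = 'a';       // 'a' converts to uint32_t 97
    std::cout << c << '\n';   // prints 'a', not "97"
}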

> The library could implement a char32_t like
>
> class char32_t {
>     uint32_t impl;
>     // ...
> };
>
> but that has its own problems. It all works OK if these are built-in
> types.

If your type suggestion above cannot be implemented, why not focus on
providing language tools that make it possible instead?
>> II) Leave b) as it is.

> So if I write a UTF-16 literal using the built-in literal syntax, what
> is its type? It has to be a built-in type, not a library type.

It can be a library type. AFAIK a built-in type can also look like a
library type if it is hidden whenever the corresponding header is not
#included, much as wchar_t works in C, where it is a typedef exposed
through <stddef.h>.

In any case, the main point of my "correction" proposal is that the C++
built-in types should not be tied to a specific character encoding system.

Consider the possibility that, some years from now, a new character
system that does not exist today becomes the dominant one, while the
C++ built-in types remain tied to Unicode.

With any specific character system provided as a library extension (an
implementation-defined type), C++ would have the flexibility to adapt to
new character systems that emerge in the future, without messing with
its built-in types.

In the same way that math-specific types should not become built-in in
C++ but rather library extensions, I think the same should happen with
character systems, regular expressions, etc.

So, as another example, although probably not needed in standard C++,
let's consider adding EBCDIC support explicitly as a library extension.

Something like:

#include <whatever>

// ...

std::ebcdic_char *p = EB"This is a text";
std::ebcdic_char c = EB'c';

This style can work for whatever character type system: UTF-8, UTF-16,
UTF-32, whatever.
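
As a sketch of the kind of language tool that could make this possible,
user-defined literal suffixes (which were on the table for C++0x) could
stand in for a built-in EB prefix. Everything here is hypothetical: the
ebcdic_string type, the _eb suffix, and the pass-through "transcoding":

#include <cstddef>
#include <string>

// Hypothetical library string type holding EBCDIC-encoded bytes.
struct ebcdic_string {
    std::string bytes;
};

// A user-defined literal suffix in place of a built-in EB prefix.
// A real implementation would transcode from the source character set
// to EBCDIC; this sketch just copies the bytes.
ebcdic_string operator"" _eb(const char* s, std::size_t n)
{
    return ebcdic_string{ std::string(s, n) };
}

int main()
{
    ebcdic_string text = "This is a text"_eb;
    (void)text;   // suppress unused-variable warning
}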

I think tying any specific character system to the built-in types is a
Java-style approach (like C#/.NET etc.), which suits a whole framework
rather than a programming language alone, since a framework can be
changed at will.

Apart from this, I also think that wchar_t should be the largest
character type a specific compiler provides. So, for example, if a
compiler provides UTF-32 as its largest character type, then for this
compiler wchar_t should be equivalent to its UTF-32 character type.
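
Expressed as a compile-time check (a sketch, assuming UTF-32 is the
largest type the compiler provides, and using C++0x's static_assert):

#include <climits>

// The proposed invariant: wchar_t is wide enough for the largest
// character type the implementation supports (UTF-32 assumed here).
static_assert(sizeof(wchar_t) * CHAR_BIT >= 32,
              "wchar_t should cover the largest supported character type");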
Jan 17 '08 #3

This discussion thread is closed

Replies have been disabled for this discussion.