Bytes IT Community

unicode mess in c++

This may look like a silly question to some, but the more I try to understand Unicode, the more lost I feel. I am not a beginner C++ programmer; I simply never needed to delve into character-encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t type. Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w. Because this is all standard C, it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).

Now, the various UTF encodings (UTF-8, -16, or -32) define variable character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is that compatible with the C wchar_t? I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable. So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd and have not been able to produce such behaviour on Windows.

I would rest with that conviction if I didn't find web pages like http://evanjones.ca/unicode-in-c.html, where the obviously competent author asserts that wstring characters are 32-bit. The C++ STL book by Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character-size mess in as understandable a way as possible, or at least post a link to the right place. I've read Petzold and believe that C simply uses fixed 16-bit Unicode, but how does that combine with the Unicode encodings?

dj
May 11 '06 #1
12 Replies


* damjan:
This may look like a silly question to some, but the more I try to understand Unicode, the more lost I feel. I am not a beginner C++ programmer; I simply never needed to delve into character-encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t type. Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w.
No, that's not standard C.

Because this is all standard C
It isn't.

it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).
They're not.

Now, the various UTF encodings (UTF-8, -16, or -32) define variable character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is that compatible with the C wchar_t?
It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.

I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable.
No.

So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd
It is.

and have not been able to produce such behaviour on Windows.
Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author
Uh.
asserts that wstring characters are 32-bit.
They're not (necessarily).

The C++ STL book by Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character-size mess in as understandable a way as possible, or at least post a link to the right place. I've read Petzold and believe that C simply uses fixed 16-bit Unicode, but how does that combine with the Unicode encodings?


The simple explanation is that C and C++ don't support Unicode any more than these languages support, say, graphics. The basic operations needed to implement Unicode support are present, and what you do is either implement a library yourself or use one implemented by others.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
May 11 '06 #2

damjan wrote :
In C/C++, Unicode characters are introduced by means of the wchar_t type.
Wrong.
wchar_t has nothing to do with Unicode.

Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w.
That's the MS Windows way of doing things. And it's not a very good way IMHO.
Because this is all standard C, it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).
wchar_t can be any size (bigger than a byte obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.

Now, the various UTF encodings (UTF-8, -16, or -32) define variable character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is that compatible with the C wchar_t?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.

I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable. So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd and have not been able to produce such behaviour on Windows.
Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.

I would rest with that conviction if I didn't find web pages like http://evanjones.ca/unicode-in-c.html, where the obviously competent author asserts that wstring characters are 32-bit. The C++ STL book by Josuttis explains virtually nothing on the matter.
If you want to code in C++, you shouldn't even try to look for solutions in C. I mean, string handling in C is tedious and annoying; why bother with it when you have nicer alternatives in C++?

So, my question is whether anyone can explain this character-size mess in as understandable a way as possible, or at least post a link to the right place. I've read Petzold and believe that C simply uses fixed 16-bit Unicode, but how does that combine with the Unicode encodings?


In C you would usually use char* with UTF-8, or wchar_t with UCS-2 or UCS-4.

The solution I find the nicest for Unicode is Glib::ustring. Unfortunately it's part of glibmm, which is rather big, and some people just don't want to have such a dependency.
May 11 '06 #3

dj
loufoque wrote:
damjan wrote :
In c/c++, the unicode characters are introduced by the means of
wchar_t type.
Wrong.
wchar_t has nothing to do with Unicode.


Well, perhaps not philosophically, but it is the way 16-bit chars stepped into C. Perhaps I am too much into Windows, but in the Microsoft documentation wchar_t is almost a synonym for Unicode. How else would you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:
Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w.
That's MS Windows way of doing things.
And it's not a very good way IMHO.


True, perhaps I am just too used to the Microsoft way of "adapting" things.
Because this is all standard C, it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).


wchar_t can be any size (bigger than a byte obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.


Now, this is something that should probably bother me if I intend to program multi-platform. I hope Java is more consistent than that. So how do I declare a platform-independent wide character?
Now, the various UTF encodings (UTF-8, -16, or -32) define variable character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is that compatible with the C wchar_t?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.

I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable. So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd and have not been able to produce such behaviour on Windows.


Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.


P.S. Though MBCS works exactly that way, I think.
I would rest with that conviction if I didn't find web pages like http://evanjones.ca/unicode-in-c.html, where the obviously competent author asserts that wstring characters are 32-bit. The C++ STL book by Josuttis explains virtually nothing on the matter.
If you want to code in C++, you shouldn't even try to look for solutions in C. I mean, string handling in C is tedious and annoying; why bother with it when you have nicer alternatives in C++?


Don't be misled by the C in the title. wstring is an STL template class.

So, my question is whether anyone can explain this character-size mess in as understandable a way as possible, or at least post a link to the right place. I've read Petzold and believe that C simply uses fixed 16-bit Unicode, but how does that combine with the Unicode encodings?


In C you would usually use char* with UTF-8, or wchar_t with UCS-2 or UCS-4.


Now how could I use a 1-byte char for a Unicode character, even if it is encoded as UTF-8? According to Wikipedia, UCS-2 is a fixed 16-bit Unicode encoding. Well, that sounds to me like a perfect match for the C representation, but again, if wchar_t is not necessarily 16-bit ...
The solution I find the nicest for Unicode is Glib::ustring. Unfortunately it's part of glibmm, which is rather big, and some people just don't want to have such a dependency.


Thanks for the advice, but I am very reluctant to adopt ever more libraries. After all, there exist a zillion implementations of the string class, adding to the overall chaos.
May 11 '06 #4

dj wrote:
Well, perhaps not philosophically, but it is the way 16-bit chars stepped into C. Perhaps I am too much into Windows, but in the Microsoft documentation wchar_t is almost a synonym for Unicode. How else would you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:


I understand that Petzold assumes his readers know the context. After all, "...Programming Windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compilers peculiarities, better
do it in some windows programming group.

--
Salu2

May 11 '06 #5

dj
Alf P. Steinbach wrote:
* damjan:
This may look like a silly question to some, but the more I try to understand Unicode, the more lost I feel. I am not a beginner C++ programmer; I simply never needed to delve into character-encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t type. Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w.
No, that's not standard C.

Because this is all standard c


It isn't.


I agree, the macro expansion is a Microsoft idea. But the functions for handling wide (Unicode) chars are prefixed by w; that is standard, right?

it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).
They're not.

Now, the various UTF encodings (UTF-8, -16, or -32) define variable character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is that compatible with the C wchar_t?


It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.


So which encoding does wchar_t "use" (i.e. is compatible with)?
I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable.


No.

So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd


It is.

and have not been able to produce such behaviour on Windows.


Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author


Uh.
asserts that wstring characters are 32-bit.


They're not (necessarily).

The C++ STL book by Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character-size mess in as understandable a way as possible, or at least post a link to the right place. I've read Petzold and believe that C simply uses fixed 16-bit Unicode, but how does that combine with the Unicode encodings?


The simple explanation is that C and C++ don't support Unicode more than
these languages support, say, graphics. The basic operations needed to
implement Unicode support are present. And what you do is to either
implement a library yourself, or use one implemented by others.


OK, so my conclusion is that C's wchar_t and Unicode really have no logical connection. wchar_t is only a way to allow for characters of two or more bytes, and there is no way to handle 4-byte Unicode chars and the various Unicode encodings in Visual C++ other than to implement my own library or find one.
May 11 '06 #6

dj
Julián Albo wrote:
dj wrote:
Well, perhaps not philosophically, but it is the way 16-bit chars stepped into C. Perhaps I am too much into Windows, but in the Microsoft documentation wchar_t is almost a synonym for Unicode. How else would you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:


I understand that Petzold assumes his readers know the context. After all, "...Programming Windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compilers peculiarities, better
do it in some windows programming group.


The title of the section I quoted from is "Wide Characters and C"; "Wide Characters and Windows" is the next section. I am not a Windows freak, just used to it most. My original question was about Unicode and C (it just emerged as a Microsoft version of C). If you are bothered by that, skip this thread next time.
May 11 '06 #7

dj wrote:
I am not a Windows freak, just used to it most. My original question was about Unicode and C (it just emerged as a Microsoft version of C). If you are bothered by that, skip this thread next time.


Start by learning that C and C++ are different languages.

--
Salu2

May 11 '06 #8

dj wrote:
loufoque wrote:
damjan wrote :
In C/C++, Unicode characters are introduced by means of the wchar_t type.
Wrong.
wchar_t has nothing to do with Unicode.


Well, perhaps not philosophically, but it is the way 16-bit chars stepped into C.


wchar_t is a means, not a solution. It is a standard type of a size "large enough to hold the largest character set supported by the implementation's locale" [1]. However, how you use it, or whether you want to use it, is left to you.
Perhaps I am too much into Windows, but in the Microsoft documentation wchar_t is almost a synonym for Unicode.
That's how Microsoft Windows decided to work; some systems may do otherwise. However, standard C++ doesn't mandate a specific encoding for wchar_t.
Based on the presence of _UNICODE definition C functions are macro'd
to either the normal version or the one prefixed with w.


That's the MS Windows way of doing things. And it's not a very good way IMHO.


True, perhaps I am just too used to the Microsoft way of "adapting" things.


This is a dangerous thing: to expect platform-specific behavior to be
standard. This often happens when one learns about platform-specific
libraries and features before learning the language itself. You must be
aware of what's standard and what's not.
Because this is all standard C, it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).


wchar_t can be any size (bigger than a byte obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.


Now, this is something that should probably bother me if I intend to program multi-platform. I hope Java is more consistent than that. So how do I declare a platform-independent wide character?


That's impossible in C++. You must look in your compiler's
documentation and find an appropriate type (and check again if another
version comes out). If you expect to port your program, use the
preprocessor to typedef these types.
I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable. So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd and have not been able to produce such behaviour on Windows.


Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.


P.S. Though MBCS works exactly that way, I think.


Actually, all the string types I know work that way. Advancing an iterator goes to the next conceptual character, not necessarily sizeof(char) bytes forward.
Now how could I use a 1-byte char for a Unicode character, even if it is encoded as UTF-8?


You could use multiple chars to represent a character code.
The solution I find the nicest for Unicode is Glib::ustring. Unfortunately it's part of glibmm, which is rather big, and some people just don't want to have such a dependency.


Thanks for the advice, but I am very reluctant to adopt ever more libraries. After all, there exist a zillion implementations of the string class, adding to the overall chaos.


Character encoding is a tricky discussion in C++. However, there are
some good libraries that do the job well. Glib's ustring type is
excellent. Use it.

If you don't want it, you can either search for another portable string library, use platform-specific features (yuck), or roll your own (shame on you).
Jonathan

May 11 '06 #9

dj
Julián Albo wrote:
dj wrote:
I am not a Windows freak, just used to it most. My original question was about Unicode and C (it just emerged as a Microsoft version of C). If you are bothered by that, skip this thread next time.


Start by learning that C and C++ are different languages.


I know that, but the topic applies to both C and C++ (e.g. WCHAR and wstring), so I mixed them in the text. Actually, I consider C++ an evolution of C, but hey, that's just my naive interpretation. You got me on this one, though.
May 11 '06 #10

On Thu, 11 May 2006 17:00:57 +0200, dj <sm*******@lycos.com> wrote in
comp.lang.c++:
Alf P. Steinbach wrote:
* damjan:
This may look like a silly question to some, but the more I try to understand Unicode, the more lost I feel. I am not a beginner C++ programmer; I simply never needed to delve into character-encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t type. Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w.


No, that's not standard C.

Because this is all standard c


It isn't.


I agree, the macro expansion is a Microsoft idea. But the functions for handling wide (Unicode) chars are prefixed by w; that is standard, right?


[snip]

The functions for handling wide characters are indeed prefixed by 'w'.
But they do not necessarily have anything at all to do with UNICODE.
There is no guarantee that wchar_t is wide enough to hold even the old
UNICODE (16 bits), let alone the newest version (still 18 bits, or did
they increase it again?).

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html
May 12 '06 #11

Jack Klein wrote:
The functions for handling wide characters are indeed prefixed by 'w'.
But they do not necessarily have anything at all to do with UNICODE.
There is no guarantee that wchar_t is wide enough to hold even the old
UNICODE (16 bits), let alone the newest version (still 18 bits, or did
they increase it again?).


You need at least 21 bits to store a UTF-16 code point: a surrogate pair encodes 20 bits for the supplementary planes, plus plane 0's 64k values.
May 12 '06 #12

damjan wrote:
This may look like a silly question to some, but the more I try to understand Unicode, the more lost I feel. I am not a beginner C++ programmer; I simply never needed to delve into character-encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t type. Based on the presence of the _UNICODE definition, C functions are macro'd to either the normal version or the one prefixed with w. Because this is all standard C, it should be platform independent, and as far as I understand, all Unicode characters in C (and Windows) are 16-bit (because wchar_t is usually typedef'd as unsigned short).

Now, the various UTF encodings (UTF-8, -16, or -32) define variable character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is that compatible with the C wchar_t? I've even been convinced by others that the compiler calculates the necessary storage size of each Unicode character, which may thus be variable. So a pointer increment on a string would sometimes advance by 1, 2, or 4 bytes. I think this is absurd and have not been able to produce such behaviour on Windows.

I would rest with that conviction if I didn't find web pages like http://evanjones.ca/unicode-in-c.html, where the obviously competent author asserts that wstring characters are 32-bit. The C++ STL book by Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character-size mess in as understandable a way as possible, or at least post a link to the right place. I've read Petzold and believe that C simply uses fixed 16-bit Unicode, but how does that combine with the Unicode encodings?

dj

You don't specify what your application is, so it is hard to give recommendations. As others have mentioned, this is not really a C++ question but more of an application question.

Plenty of applications deal _internally_ with standard C++ characters
and std::string and then deal with the outside world via encoded
characters.

We do this, for instance, when our applications need to write XML for external consumption. For these applications, we use the iconv library to do character conversions. These calls are limited to the interface we use to our XML document creator. It all comes down, eventually, to converting standard C++ types (ints, doubles, std::strings) into their equivalent character encoding. Since the number of types is limited, the number of calls to the above-mentioned library is limited.

Depending on your application, this may or may not work for you.

May 13 '06 #13
