
unicode mess in c++

This may look like a silly question to some, but the more I try to
understand Unicode the more lost I feel. I should say that I am not a
beginner C++ programmer; I just never had any need to delve into
character encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w. Because
this is all standard C, it should be platform independent, and as far as
I understand, all Unicode characters in C (and Windows) are 16-bit
(because wchar_t is usually typedef'd as unsigned short).

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 uses 1 to 4 bytes). How is that compatible
with the C notion of wchar_t? I've even been convinced by others that the
compiler calculates the necessary storage size of the Unicode
character, which may thus be variable. So a pointer increment on a string
would sometimes advance by 1, 2 or 4 bytes. I think this is absurd and
have not been able to produce such behaviour on Windows.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The C++ STL book by
Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?

dj
May 11 '06 #1
* damjan:
This may look like a silly question to some, but the more I try to
understand Unicode the more lost I feel. I should say that I am not a
beginner C++ programmer; I just never had any need to delve into
character encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w.
No, that's not standard C.

Because this is all standard C
It isn't.

it should be platform independent, and as far as
I understand, all Unicode characters in C (and Windows) are 16-bit
(because wchar_t is usually typedef'd as unsigned short).
They're not.

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 uses 1 to 4 bytes). How is that compatible
with the C notion of wchar_t?
It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.
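
To illustrate (a minimal sketch, assuming your compiler passes the escaped
bytes through unchanged): the UTF-8 encoding of "héllo" occupies six char
units for five characters, because 'é' becomes the two bytes 0xC3 0xA9.

#include <cstdio>
#include <cstring>

int main()
{
    const char utf8[] = "h\xC3\xA9llo";   /* "héllo" spelled out byte by byte */
    /* strlen counts char units, not characters: prints 6, not 5 */
    std::printf("char units: %u\n", (unsigned)std::strlen(utf8));
    return 0;
}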

I've even been convinced by others that the
compiler calculates the necessary storage size of the Unicode
character, which may thus be variable.
No.

So pointer increment on a string
would sometimes progress by 1, 2 or 4 bytes. I think this is absurd
It is.

and
have not been able to produce such behaviour on windows os.
Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author
Uh.
asserts that wstring characters are 32-bit.
They're not (necessarily).

The C++ STL book by
Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?


The simple explanation is that C and C++ don't support Unicode any more than
these languages support, say, graphics. The basic operations needed to
implement Unicode support are present. What you do is either
implement a library yourself, or use one implemented by others.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
May 11 '06 #2
damjan wrote:
In C/C++, Unicode characters are introduced by means of the wchar_t
type.
Wrong.
wchar_t has nothing to do with Unicode.

Based on the presence of _UNICODE definition C functions are
macro'd to either the normal version or the one prefixed with w.
That's MS Windows way of doing things.
And it's not a very good way IMHO.
Because this is all standard C, it should be platform independent and,
as far as I understand, all Unicode characters in C (and Windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).
wchar_t can be any size (bigger than a byte, obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.
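
A quick way to see this for yourself (the printed value is
implementation-defined, which is exactly the point; typically 2 on MSVC
and 4 on GNU/Linux):

#include <iostream>

int main()
{
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    return 0;
}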

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 uses 1 to 4 bytes). How is that compatible
with the C notion of wchar_t?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.
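
One conversion route that exists in standard C/C++ is mbstowcs(), which
converts from the current locale's multibyte encoding to wchar_t. A rough
sketch, assuming a UTF-8 locale is actually installed (the locale name
"en_US.UTF-8" below is an assumption and varies by system):

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    if (!std::setlocale(LC_CTYPE, "en_US.UTF-8"))
        return 1;                                  /* locale not available */

    const char* utf8 = "h\xC3\xA9llo";             /* "héllo" in UTF-8: 6 bytes */
    wchar_t wide[32];
    std::size_t n = std::mbstowcs(wide, utf8, 32); /* wide character count on success */
    if (n != (std::size_t)-1)
        std::printf("wide characters: %u\n", (unsigned)n);  /* 5 */
    return 0;
}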

I've even been convinced by others that the
compiler calculates the necessary storage size of the Unicode
character, which may thus be variable. So a pointer increment on a string
would sometimes advance by 1, 2 or 4 bytes. I think this is absurd and
have not been able to produce such behaviour on Windows.
Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The C++ STL book by
Josuttis explains virtually nothing on the matter.
If you want to code in C++, you shouldn't even try to search for solutions in C.
I mean, string handling in C is tedious and annoying; why bother with
that when you have nicer alternatives in C++?

So, my question is whether anyone can explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?


In C you could usually use char* with UTF-8, or wchar_t with UCS-2 or UCS-4.

The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want such a dependency.
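
For what it's worth, a minimal sketch of what that looks like (assuming
glibmm is installed and the program is built with the glibmm compiler and
linker flags, e.g. via pkg-config; method names as I recall them from the
glibmm documentation):

#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    Glib::ustring s("h\xC3\xA9llo");   // UTF-8 encoded "héllo"
    std::cout << s.bytes() << '\n';    // 6: storage size in bytes
    std::cout << s.size()  << '\n';    // 5: length in Unicode characters
    return 0;
}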
May 11 '06 #3
dj
loufoque wrote:
damjan wrote:
In C/C++, Unicode characters are introduced by means of the wchar_t
type.
Wrong.
wchar_t has nothing to do with Unicode.


Well, perhaps not philosophically, but it is the way 16-bit chars step
into C. Perhaps I am too much into Windows, but in the Microsoft
documentation WCHAR is almost a synonym for Unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:
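
For context, a rough and much-simplified sketch of what that <tchar.h>
mechanism boils down to (the real Microsoft headers define many more such
mappings; this is just an illustration):

#include <string.h>
#include <wchar.h>
#include <stdio.h>

#ifdef _UNICODE
    typedef wchar_t TCHAR;
    #define _T(x)   L##x        /* wide string literal */
    #define _tcslen wcslen
#else
    typedef char TCHAR;
    #define _T(x)   x
    #define _tcslen strlen
#endif

int main()
{
    const TCHAR* msg = _T("hello");   /* char* or wchar_t* depending on _UNICODE */
    printf("length: %u\n", (unsigned)_tcslen(msg));
    return 0;
}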
Based on the presence of _UNICODE definition C functions are macro'd
to either the normal version or the one prefixed with w.
That's MS Windows way of doing things.
And it's not a very good way IMHO.


True, perhaps I am just too used to the Microsoft way of "adapting" things.
Because this is all standard C, it should be platform independent and,
as far as I understand, all Unicode characters in C (and Windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).


wchar_t can be any size (bigger than a byte, obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.


Now, this is something that should probably bother me if I intend to
program cross-platform. I hope Java is more consistent than that. So how
do I declare a platform-independent wide character?
Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 uses 1 to 4 bytes). How is that compatible
with the C notion of wchar_t?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.

I've even been convinced by others that the compiler calculates the
necessary storage size of the Unicode character, which may thus be
variable. So a pointer increment on a string would sometimes advance by
1, 2 or 4 bytes. I think this is absurd and have not been able to
produce such behaviour on Windows.


Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.


P.S. Though MBCS works exactly that way, I think.
I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The C++ STL book
by Josuttis explains virtually nothing on the matter.
If you want to code in C++, you shouldn't even try to search for solutions
in C.
I mean, string handling in C is tedious and annoying; why bother with
that when you have nicer alternatives in C++?


Don't be misled by the "c" in the title; wstring is an STL template class.

So, my question is whether anyone can explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that C simply uses
fixed 16-bit Unicode, but how does that combine with the Unicode
encodings?


In C you could usually use char* with UTF-8, or wchar_t with UCS-2 or UCS-4.


Now how could I use a 1-byte char for a Unicode character, even if it is
encoded as UTF-8? According to Wikipedia, UCS-2 is a fixed 16-bit Unicode
encoding. That sounds to me like a perfect match for the C
representation, but again, if wchar_t is not necessarily 16-bit ...
The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want such a dependency.


Thanks for the advice; however, I am very reluctant to adopt ever more
libraries. After all, there exist a zillion implementations of the
string class, adding to the overall chaos.
May 11 '06 #4
dj wrote:
Well, perhaps not philosophically, but it is the way 16-bit chars step
into C. Perhaps I am too much into Windows, but in the Microsoft
documentation WCHAR is almost a synonym for Unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:


I understand that Petzold assumes his readers know the context. After all,
"...Programming Windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compiler peculiarities, better
do it in some Windows programming group.

--
Salu2

Sent from X-Privat.Org - Free registration http://www.x-privat.org/join.php
May 11 '06 #5
dj
Alf P. Steinbach wrote:
* damjan:
This may look like a silly question to some, but the more I try to
understand Unicode the more lost I feel. I should say that I am not a
beginner C++ programmer; I just never had any need to delve into
character encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions
are macro'd to either the normal version or the one prefixed with w.
No, that's not standard C.

Because this is all standard C


It isn't.


I agree, the macro expansion is a Microsoft idea. But the functions for
handling wide (Unicode) chars are prefixed with w; that is standard, right?

it should be platform independent and, as far as I understand, all
Unicode characters in C (and Windows) are 16-bit (because wchar_t is
usually typedef'd as unsigned short).
They're not.

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 uses 1 to 4 bytes). How is that compatible
with the C notion of wchar_t?


It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.


So which encoding does wchar_t "use" (i.e., which one is it compatible with)?
I've even been convinced by others that the compiler calculates the
necessary storage size of the Unicode character, which may thus be
variable.


No.

So pointer increment on a string would sometimes progress by 1, 2 or 4
bytes. I think this is absurd


It is.

and have not been able to produce such behaviour on windows os.


Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author


Uh.
asserts that wstring characters are 32-bit.


They're not (necessarily).

The C++ STL book by Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that C simply uses
fixed 16-bit Unicode, but how does that combine with the Unicode
encodings?


The simple explanation is that C and C++ don't support Unicode any more than
these languages support, say, graphics. The basic operations needed to
implement Unicode support are present. What you do is either
implement a library yourself, or use one implemented by others.


OK, so my conclusion is that C's wchar_t and Unicode really have no
inherent connection. wchar_t is only a way to allow for 2-or-more-byte
character strings, and there is no other way to handle 4-byte Unicode
chars and the various Unicode encodings in Visual C++ but to implement my
own library or find an existing one.
May 11 '06 #6
dj
Julián Albo wrote:
dj wrote:
Well, perhaps not philosophically, but it is the way 16-bit chars step
into C. Perhaps I am too much into Windows, but in the Microsoft
documentation WCHAR is almost a synonym for Unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:


I understand that Petzold assumes his readers know the context. After all,
"...Programming Windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compiler peculiarities, better
do it in some Windows programming group.


The title of the section I quoted from is "Wide Characters and C";
"Wide Characters and Windows" is the next section. I am not a Windows
freak, just most used to it. My original question was about Unicode and
C (it so happened that it was a Microsoft version of C). If that bothers
you, skip this thread next time.
May 11 '06 #7
dj wrote:
I am not a Windows freak, just most used to it. My original question was
about Unicode and C (it so happened that it was a Microsoft version of C).
If that bothers you, skip this thread next time.


Start by learning that C and C++ are different languages.

--
Salu2

Sent from X-Privat.Org - Free registration http://www.x-privat.org/join.php
May 11 '06 #8
dj wrote:
loufoque wrote:
damjan wrote:
In C/C++, Unicode characters are introduced by means of the wchar_t
type.
Wrong.
wchar_t has nothing to do with Unicode.


Well, perhaps not philosophically, but it is the way 16-bit chars step
into C.


wchar_t is a means, not a solution. It is a standard type of a size
"large enough to hold the largest character set supported by the
implementation's locale" [1]. However, how you use it, or whether you
want to use it at all, is left to you.
Perhaps I am too much into Windows, but in the Microsoft
documentation WCHAR is almost a synonym for Unicode.
That's how Microsoft Windows decided to work; some systems may do
otherwise. However, standard C++ doesn't mandate a specific encoding
for wchar_t.
Based on the presence of _UNICODE definition C functions are macro'd
to either the normal version or the one prefixed with w.


That's MS Windows way of doing things.
And it's not a very good way IMHO.


True, perhaps i am just too used to microsoft way of "adapting" things.


This is a dangerous thing: to expect platform-specific behavior to be
standard. This often happens when one learns about platform-specific
libraries and features before learning the language itself. You must be
aware of what's standard and what's not.
Because this is all standard C, it should be platform independent and,
as far as I understand, all Unicode characters in C (and Windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).


wchar_t can be any size (bigger than a byte, obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.


Now, this is something that should probably bother me if I intend to
program cross-platform. I hope Java is more consistent than that. So how
do I declare a platform-independent wide character?


That's impossible in C++. You must look in your compiler's
documentation and find an appropriate type (and check again when another
version comes out). If you expect to port your program, use the
preprocessor to typedef these types.
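
For instance, something along these lines (the macro tests and type choices
below are assumptions for illustration, not a standard or universally
correct mapping; verify the sizes on each compiler you target):

#if defined(_MSC_VER)
    typedef wchar_t        char16;   /* wchar_t is 16 bits on MSVC */
    typedef unsigned int   char32;
#elif defined(__GNUC__)
    typedef unsigned short char16;
    typedef wchar_t        char32;   /* wchar_t is 32 bits on typical GNU/Linux targets */
#else
    typedef unsigned short char16;
    typedef unsigned int   char32;
#endif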
I've even been convinced by others that the compiler calculates the
necessary storage size of the Unicode character, which may thus be
variable. So a pointer increment on a string would sometimes advance by
1, 2 or 4 bytes. I think this is absurd and have not been able to
produce such behaviour on Windows.


Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.


P.S. Though MBCS works exactly that way, I think.


Actually, all the Unicode string types I know of work that way. Advancing
an iterator goes to the next conceptual character, not necessarily
sizeof(char) bytes forward.
Now how could I use a 1-byte char for a Unicode character, even if it is
encoded as UTF-8?


You could use multiple chars to represent a character code.
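
A minimal sketch of that idea: walk a UTF-8 string one character at a time
by reading the lead byte to find how many char units the character
occupies (no validation, purely for illustration):

#include <cstdio>

static int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)         return 1;   /* ASCII */
    if ((lead >> 5) == 0x06) return 2;   /* 110xxxxx */
    if ((lead >> 4) == 0x0E) return 3;   /* 1110xxxx */
    if ((lead >> 3) == 0x1E) return 4;   /* 11110xxx */
    return 1;                            /* invalid lead byte: skip one unit */
}

int main()
{
    const char* s = "h\xC3\xA9llo";      /* "héllo" in UTF-8: 6 char units */
    int characters = 0;
    while (*s) {
        s += utf8_sequence_length((unsigned char)*s);
        ++characters;
    }
    std::printf("characters: %d\n", characters);  /* prints 5 */
    return 0;
}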
The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want such a dependency.


Thanks for the advice; however, I am very reluctant to adopt ever more
libraries. After all, there exist a zillion implementations of the
string class, adding to the overall chaos.


Character encoding is a tricky subject in C++. However, there are
some good libraries that do the job well. Glib's ustring type is
excellent. Use it.

If you don't want it, you can either search for another portable string
library, use platform-specific features (yuck), or roll your own (shame
on you).
Jonathan

May 11 '06 #9
dj
Julián Albo wrote:
dj wrote:
I am not a Windows freak, just most used to it. My original question was
about Unicode and C (it so happened that it was a Microsoft version of C).
If that bothers you, skip this thread next time.


Start by learning that C an C++ are different languages.


I know that, but the topic applies to both C and C++ (e.g. WCHAR and
wstring), so I mixed them in the text. Actually, I consider C++ an
evolution of C, but hey, that's just my naive interpretation. You got me
on this one, though.
May 11 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

