
unicode mess in c++

This may look like a silly question to someone, but the more I try to
understand Unicode the more lost I feel. Let me say that I am not a
beginner C++ programmer; I just never had any need to delve into
character encoding intricacies before.

In c/c++, unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w. Because
this is all standard c, it should be platform independent, and as much as
I understand, all unicode characters in c (and windows) are 16-bit
(because wchar_t is usually typedef'd as unsigned short).

Now, the various UTF encodings (UTF-8, 16, or 32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t? I've even been convinced by others that the
compiler calculates the necessary storage size of the unicode
character, which may thus be variable. So pointer increment on a string
would sometimes progress by 1, 2 or 4 bytes. I think this is absurd and
have not been able to produce such behaviour on windows os.
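For what it's worth, this is the kind of minimal check I mean (just a
sketch; the string literal is an arbitrary placeholder):

    #include <iostream>

    int main()
    {
        const wchar_t* s = L"abc";   // wide string literal
        const wchar_t* p = s;
        ++p;                         // always advances by sizeof(wchar_t) bytes,
                                     // never by a character-dependent amount
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n'
                  << "pointer step    = "
                  << (reinterpret_cast<const char*>(p)
                      - reinterpret_cast<const char*>(s))
                  << " bytes\n";
    }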

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The c++ stl book from
josuttis explains virtually nothing on that matter.

So, my question is if anyone knows how to explain this character size
mess in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that c simply uses fixed
16-bit unicode, but how does that combine with the unicode encodings?

dj
May 11 '06 #1
* damjan:
This may look like a silly question to someone, but the more I try to
understand Unicode the more lost I feel. Let me say that I am not a
beginner C++ programmer; I just never had any need to delve into
character encoding intricacies before.

In c/c++, unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w.
No, that's not standard C.

Because
this is all standard c
It isn't.

it should be platform independent, and as much as
I understand, all unicode characters in c (and windows) are 16-bit
(because wchar_t is usually typedef'd as unsigned short).
They're not.

Now, the various UTF encodings (UTF-8, 16, or 32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t?
It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.
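For example (a small sketch; the escaped bytes are the two-byte UTF-8
encoding of 'é', U+00E9):

    #include <cstring>
    #include <iostream>

    int main()
    {
        const char utf8[] = "caf\xC3\xA9";      // "café" as UTF-8 bytes
        std::cout << std::strlen(utf8) << '\n'; // prints 5: four characters, five bytes
    }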

I've even been convinced by others that the
compiler calculates the necessary storage size of the unicode
character, which may thus be variable.
No.

So pointer increment on a string
would sometimes progress by 1, 2 or 4 bytes. I think this is absurd
It is.

and
have not been able to produce such behaviour on windows os.
Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author
Uh.
asserts that wstring characters are 32-bit.
They're not (necessarily).

The c++ stl book from
josuttis explains virtually nothing on that matter.

So, my question is if anyone knows how to explain this character size
mess in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that c simply uses fixed
16-bit unicode, but how does that combine with the unicode encodings?


The simple explanation is that C and C++ don't support Unicode any more
than these languages support, say, graphics. The basic operations needed
to implement Unicode support are present. And what you do is either
implement a library yourself, or use one implemented by others.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
May 11 '06 #2
damjan wrote:
In c/c++, unicode characters are introduced by means of the wchar_t
type.
Wrong.
wchar_t has nothing to do with Unicode.

Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w.
That's MS Windows way of doing things.
And it's not a very good way IMHO.
Because
this is all standard c, it should be platform independent, and as much as
I understand, all unicode characters in c (and windows) are 16-bit
(because wchar_t is usually typedef'd as unsigned short).
wchar_t can be any size (though at least as large as a char).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.
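A quick way to see what a given platform uses (a trivial sketch):

    #include <iostream>

    int main()
    {
        // Typically prints 2 on Windows (UTF-16 code units) and 4 on
        // GNU/Linux (UCS-4); the exact size is implementation-defined.
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    }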

Now, the various UTF encodings (UTF-8, 16, or 32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.
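With standard C facilities you can even let the locale do the decoding,
along these lines (a sketch only; it assumes a UTF-8 locale such as
"en_US.UTF-8" is actually installed on the system):

    #include <clocale>
    #include <cstdlib>
    #include <iostream>

    int main()
    {
        // Ask for a UTF-8 locale so that mbstowcs() decodes UTF-8 input.
        if (!std::setlocale(LC_ALL, "en_US.UTF-8"))
            return 1;                       // locale not available here

        const char utf8[] = "caf\xC3\xA9";  // "café" as UTF-8 bytes
        wchar_t wide[16];
        std::size_t n = std::mbstowcs(wide, utf8, 16);
        std::cout << n << '\n';             // 4 wide characters from 5 bytes
    }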

I've even been convinced by others that the
compiler calculates the necessary storage size of the unicode
character, which may thus be variable. So pointer increment on a string
would sometimes progress by 1, 2 or 4 bytes. I think this is absurd and
have not been able to produce such behaviour on windows os.
Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The c++ stl book from
josuttis explains virtually nothing on that matter.
If you want to code in C++, you shouldn't even try to search for solutions in C.
I mean, string handling in C is tedious and annoying; why bother with
that when you have nicer alternatives in C++?

So, my question is if anyone knows how to explain this character size
mess in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that c simply uses fixed
16-bit unicode, but how does that combine with the unicode encodings?


In C you could usually use char* with UTF-8 or wchar_t with UCS-2 or UCS-4.

The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want to have such a dependency.
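To give an idea, it is used roughly like this (a sketch from memory;
check the glibmm documentation for the exact API):

    #include <glibmm/ustring.h>
    #include <iostream>

    int main()
    {
        Glib::ustring s("caf\xC3\xA9");      // UTF-8 text stored as-is
        std::cout << s.length() << '\n';     // 4 -- counts characters, not bytes
        std::cout << s.bytes()  << '\n';     // 5 -- the underlying UTF-8 byte count
    }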
May 11 '06 #3
dj
loufoque wrote:
damjan wrote:
In c/c++, unicode characters are introduced by means of the
wchar_t type.
Wrong.
wchar_t has nothing to do with Unicode.


Well, perhaps not philosophically, but it is the way 16-bit chars step
into c. Perhaps i am too much into windows, but in the microsoft
documentation the wchar is almost a synonym for unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:
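In other words, the mechanism Petzold describes looks roughly like this
(a simplified, Windows-specific sketch of what <tchar.h> does -- not
standard C or C++, and the real headers differ in detail):

    #ifdef _UNICODE
        typedef wchar_t TCHAR;
        #define _T(x)   L ## x      // _T("hi") becomes L"hi"
        #define _tcslen wcslen
    #else
        typedef char TCHAR;
        #define _T(x)   x
        #define _tcslen strlen
    #endif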
Based on the presence of the _UNICODE definition, C functions are macro'd
to either the normal version or the one prefixed with w.
That's MS Windows way of doing things.
And it's not a very good way IMHO.


True, perhaps i am just too used to the microsoft way of "adapting" things.
Because this is all standard c, it should be platform independent, and
as much as I understand, all unicode characters in c (and windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).


wchar_t can be any size (though at least as large as a char).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.


Now, this is something that should probably bother me, if i intend to
program multiplatform. I hope java is more consistent than that. So how
do i declare a platform independent wide character?
Now, the various UTF encodings (UTF-8, 16, or 32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.

I've even been convinced by others that the compiler calculates the
necessary storage size of the unicode character, which may thus be
variable. So pointer increment on a string would sometimes progress by
1, 2 or 4 bytes. I think this is absurd and have not been able to
produce such behaviour on windows os.


Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.


P.S. Though MBCS works exactly that way, i think.
I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The c++ stl book
from josuttis explains virtually nothing on that matter.
If you want to code in C++, you shouldn't even try to search for solutions
in C.
I mean, string handling in C is tedious and annoying; why bother with
that when you have nicer alternatives in C++?


Don't be misled by the c in the title. wstring is a standard C++ library
string (a basic_string<wchar_t> typedef).

So, my question is if anyone knows how to explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that c simply uses
fixed 16-bit unicode, but how does that combine with the unicode
encodings?


In C you could usually use char* with UTF-8 or wchar_t with UCS-2 or UCS-4.


Now how could i use a 1-byte char for a unicode character, even if it is
encoded as utf-8? According to wikipedia, UCS-2 is a fixed 16-bit unicode
encoding. Well, this sounds to me like a perfect match for the c
representation, but again, if wchar_t is not necessarily 16-bit ...
The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want to have such a dependency.


Thanks for the advice, however i am very reluctant to adopt ever more
new libraries. After all, there exist a zillion implementations of the
string class, adding to the overall chaos.
May 11 '06 #4
dj wrote:
Well, perhaps not philosophically, but it is the way 16-bit chars step
into c. Perhaps i am too much into windows, but in the microsoft
documentation the wchar is almost a synonym for unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:


I understand that Petzold assumes his readers know the context. After all
"...programming windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compiler peculiarities, better
do it in some windows programming group.

--
Salu2

May 11 '06 #5
dj
Alf P. Steinbach wrote:
* damjan:
This may look like a silly question to someone, but the more I try to
understand Unicode the more lost I feel. Let me say that I am not a
beginner C++ programmer; I just never had any need to delve into
character encoding intricacies before.

In c/c++, unicode characters are introduced by means of the
wchar_t type. Based on the presence of the _UNICODE definition, C functions
are macro'd to either the normal version or the one prefixed with w.
No, that's not standard C.

Because this is all standard c


It isn't.


I agree, the macro expansion is a microsoft idea. But the functions for
handling wide (unicode) chars are prefixed with w, that is standard, right?

it should be platform independent, and as much as I understand, all
unicode characters in c (and windows) are 16-bit (because wchar_t is
usually typedef'd as unsigned short).
They're not.

Now, the various UTF encodings (UTF-8, 16, or 32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t?


It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.


So which encoding does wchar_t "use" (i.e. is compatible with)?
I've even been convinced by others that the compiler calculates the
necessary storage size of the unicode character, which may thus be
variable.


No.

So pointer increment on a string would sometimes progress by 1, 2 or 4
bytes. I think this is absurd


It is.

and have not been able to produce such behaviour on windows os.


Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author


Uh.
asserts that wstring characters are 32-bit.


They're not (necessarily).

The c++ stl book from josuttis explains virtually nothing on that matter.

So, my question is if anyone knows how to explain this character size
mess in as understandable a way as possible, or at least post a link to
the right place. I've read Petzold and believe that c simply uses
fixed 16-bit unicode, but how does that combine with the unicode
encodings?


The simple explanation is that C and C++ don't support Unicode any more
than these languages support, say, graphics. The basic operations needed
to implement Unicode support are present. And what you do is either
implement a library yourself, or use one implemented by others.


OK, so my conclusion is that C's wchar_t and unicode really have no
logical connection. wchar_t is only a way to allow for 2- or more-byte
character strings, and there is no other way to handle 4-byte unicode
chars and the various unicode encodings in visual c++ but to implement my
own library or search for one.
May 11 '06 #6
dj
Julián Albo wrote:
dj wrote:
Well, perhaps not philosophically, but it is the way 16-bit chars step
into c. Perhaps i am too much into windows, but in the microsoft
documentation the wchar is almost a synonym for unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:


I understand that Petzold assumes his readers know the context. After all
"...programming windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compiler peculiarities, better
do it in some windows programming group.


The title of the section i quoted from is "Wide Characters and C";
"Wide Characters and Windows" is the next section. I am not a Windows
freak, just most used to it. My original question was about unicode and
c (it so emerged that a microsoft version of c). If you are bothered by
that, skip this thread next time.
May 11 '06 #7
dj wrote:
I am not a Windows freak, just most used to it. My original question was
about unicode and c (it so emerged that a microsoft version of c). If you
are bothered by that, skip this thread next time.


Start by learning that C and C++ are different languages.

--
Salu2

May 11 '06 #8
dj wrote:
loufoque wrote:
damjan wrote:
In c/c++, unicode characters are introduced by means of the
wchar_t type.
Wrong.
wchar_t has nothing to do with Unicode.


Well, perhaps not philosophically, but it is the way 16-bit chars step
into c.


wchar_t is a means, not a solution. It is a standard type of a size
"large enough to hold the largest character set supported by the
implementation's locale" [1]. However, how you use it, or whether you
want to use it, is left to you.
Perhaps i am too much into windows, but in the microsoft
documentation the wchar is almost a synonym for unicode.
That's how Microsoft Windows decided to work, some systems may do
otherwise. However, standard C++ doesn't mandate a specific encoding
for wchar_t.
Based on the presence of the _UNICODE definition, C functions are macro'd
to either the normal version or the one prefixed with w.


That's MS Windows way of doing things.
And it's not a very good way IMHO.


True, perhaps i am just too used to the microsoft way of "adapting" things.


This is a dangerous thing: to expect platform-specific behavior to be
standard. This often happens when one learns about platform-specific
libraries and features before learning the language itself. You must be
aware of what's standard and what's not.
Because this is all standard c, it should be platform independent, and
as much as I understand, all unicode characters in c (and windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).


wchar_t can be any size (though at least as large as a char).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.


Now, this is something that should probably bother me, if i intend to
program multiplatform. I hope java is more consistent than that. So how
do i declare a platform independent wide character?


That's impossible in C++. You must look in your compiler's
documentation and find an appropriate type (and check again if another
version comes out). If you expect to port your program, use the
preprocessor to typedef these types.
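Something along these lines, for example (a sketch only; the type names
and the compiler checks are placeholders to adapt to your own targets):

    // Pick fixed-width character types per platform.
    #if defined(_MSC_VER)
        typedef wchar_t        char16;  // wchar_t is 16 bits with MSVC
        typedef unsigned int   char32;
    #elif defined(__GNUC__)
        typedef unsigned short char16;
        typedef wchar_t        char32;  // wchar_t is 32 bits on GNU/Linux
    #else
        typedef unsigned short char16;
        typedef unsigned int   char32;
    #endif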
I've even been convinced by others that the compiler calculates the
necessary storage size of the unicode character, which may thus be
variable. So pointer increment on a string would sometimes progress by
1, 2 or 4 bytes. I think this is absurd and have not been able to
produce such behaviour on windows os.


Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
unicode-aware.


P.S. Though MBCS works exactly that way, i think.


Actually, all the string types I know of work that way. Advancing an
iterator goes to the next conceptual character, not necessarily
sizeof(char) bytes forward.
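For UTF-8 held in plain chars, such an iterator boils down to reading
the lead byte of each sequence (a sketch that assumes well-formed UTF-8):

    #include <cstddef>
    #include <iostream>

    // Number of bytes in the UTF-8 sequence starting at *p (valid input assumed).
    std::size_t utf8_len(const char* p)
    {
        unsigned char b = static_cast<unsigned char>(*p);
        if (b < 0x80) return 1;   // plain ASCII
        if (b < 0xE0) return 2;   // 110xxxxx
        if (b < 0xF0) return 3;   // 1110xxxx
        return 4;                 // 11110xxx
    }

    int main()
    {
        const char* s = "caf\xC3\xA9";   // "café" in UTF-8
        std::size_t characters = 0;
        while (*s) {
            s += utf8_len(s);            // step over one conceptual character
            ++characters;
        }
        std::cout << characters << '\n'; // prints 4, although strlen() would say 5
    }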
Now how could i use a 1-byte char for a unicode character, even if it is
encoded as utf-8?


You could use multiple chars to represent a character code.
The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want to have such a dependency.


Thanks for the advice, however i am very reluctant to adopt ever more
new libraries. After all, there exist a zillion implementations of the
string class, adding to the overall chaos.


Character encoding is a tricky discussion in C++. However, there are
some good libraries that do the job well. Glib's ustring type is
excellent. Use it.

If you don't want it, you can either search for another portable string
library, use platform-specific features (yurk) or roll your own (shame
on you).
Jonathan

May 11 '06 #9
dj
Julián Albo wrote:
dj wrote:
I am not a Windows freak, just most used to it. My original question was
about unicode and c (it so emerged that a microsoft version of c). If you
are bothered by that, skip this thread next time.


Start by learning that C and C++ are different languages.


I know that, but the topic applies to both c and c++ (e.g. WCHAR and
wstring), so I mixed them in the text. Actually, I consider C++ as an
evolution of C, but hey, that's just my naive interpretation. You got me
on this one, though.
May 11 '06 #10
