By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
445,876 Members | 1,206 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 445,876 IT Pros & Developers. It's quick & easy.

Portable Code that supports Unicode

P: n/a

Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.

How does one go about using Unicode in portable code? Something like the
following?:

typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is 16-
bit).

Furthermore, how do you go about making your code in such a way that it can
use either normal characters or wide characters. Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

class Nation {
public:
virtual const Character* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?

-Tomás


Feb 28 '06 #1
Share this Question
Share on Google+
13 Replies


P: n/a
Tomás wrote:
Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Why are you using char* instead of std::basic_string<char_type>?
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
How does one go about using Unicode in portable code? Something like the
following?:
Unicode is still not part of the standard, so it is not portable.
typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is 16-
bit).
Not all Unicode is 16 bit, and not all 16 bit encodings are Unicode.
wchar_t is often not suitable for Unicode.

Until I was sure what I was doing, I would probably use:

class unicode_char {
/* wrap wchar_t */
}

typedef std::basic_string<unicode_char> ustring;
Furthermore, how do you go about making your code in such a way that it can
use either normal characters or wide characters. Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif
That's ugly and is not a modal to be copied. If you need Unicode
support, just support Unicode.

Anyway, this is merely a way of supporting wide and narrow characters,
not encodings.
class Nation {
public:
virtual const Character* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?


I think you need to decide what exactly it is you are doing, and read up
on Unicode.

So far you have only demonstrated wide and narrow character support, and
nothing to do with encodings.

You need to decide on an internal representation, and then you need to
provide mappings to your OS of choice, probably through stream operators
and facets. I don't know what your definition of portable is.

Ben Pope
--
I'm not just a number. To many, I'm known as a string...
Feb 28 '06 #2

P: n/a
Tomas wrote:
(Unicode is 16-
bit).


Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.

Anyway you shouldn't use pointers for strings, but strings objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware than in the standard, though, std::wstring wasn't made for
unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with unicode support incompatible with the ones which
aren't etc.
Just make your application unicode aware, compiling flags to mess
everything up are useless.

I would advise to use Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.
Feb 28 '06 #3

P: n/a
loufoque wrote:
Tomas wrote:
(Unicode is 16-
bit).

Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.


What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

Anyway you shouldn't use pointers for strings, but strings objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware than in the standard, though, std::wstring wasn't made for
unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with unicode support incompatible with the ones which
aren't etc.
Just make your application unicode aware, compiling flags to mess
everything up are useless.
I second that.

UTF-16 is also a big waste of time IMHO.

I would advise to use Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.

Feb 28 '06 #4

P: n/a
Ben Pope a écrit :
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.

It is actually rather confusing because "charset" is "character
encoding" because of its usage in various protocols for character encoding.

Unicode is still not part of the standard, so it is not portable.


Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.

Making the OS display the characters correctly is another thing.

It's not because something isn't part of the standard that it isn't
portable, one can write a portable std::string-like rather easily.
Feb 28 '06 #5

P: n/a
loufoque wrote:
Ben Pope a écrit :
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.

It is actually rather confusing because "charset" is "character
encoding" because of its usage in various protocols for character encoding.


Yeah, sorry. I'm not helping the confusion. I actually started with
"charset" and expanded it as a scanned through for mistakes. D'oh!
Unicode is still not part of the standard, so it is not portable.


Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.


Of course, but there is no native support. In order to get full Unicode
support, you need a rather large library, or at least a decent framework
in which to stick encodings.
Making the OS display the characters correctly is another thing.
....that was my point.
It's not because something isn't part of the standard that it isn't
portable, one can write a portable std::string-like rather easily.


Indeed, which is fine for internal use, it's the outside world which is
the problem. That's where standardisation (and support) needs to be.

Thanks for the clarifications.

Ben Pope
--
I'm not just a number. To many, I'm known as a string...
Feb 28 '06 #6

P: n/a
Tomás posted:

Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Let's say we want to give the name of the nation in the nation's
official language... and so we want to use the Unicode character set to
achieve this.

How does one go about using Unicode in portable code? Something like
the following?:

typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is
16- bit).

Furthermore, how do you go about making your code in such a way that it
can use either normal characters or wide characters. Microsoft do it
something like the following: (You define the UNICODE macro if you're
using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

class Nation {
public:
virtual const Character* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want
to support the normal character set and also Unicode... but I want to
keep it portable!

Any suggestions on how to go about this? Is the Microsoft way decent
enough?

-Tomás

I always try to keep my posts implementation independant... but anywho
here's what I'm doing:

(About to drift off-topic...)

I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8-Bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.

There's two flavours of each Windows function, the ASCII one and the Unicode
one, for instance:

SetWindowTextA ( ASCII version )
SetWindowTextW ( Unicode version )

A person can use my control by adding a header file and source file to their
project. Like this:

#inclue <control.hpp>
using namespace Control;

int main()
{
PlaceCtrlOnDialog();
}
Anyway, the whole point is that I while I want the control to support
Unicode, I also want it to support ASCII. I think the best way to do this is
to have a project-wide preprocessor directive such as UNICODE. Then, I could
have:

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

const Character* GetAuthorName()
{
return StringLiteral("Tomás");
}
You may not think it's the most beautiful code, but it achieves its
objective.

Any thoughts?
-Tomás
Feb 28 '06 #7

P: n/a
Gianni Mariani a écrit :
What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.


This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.
Feb 28 '06 #8

P: n/a
loufoque wrote:
Gianni Mariani a écrit :
What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.


Ah. I thought you were referring to Unicode terminology.

The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

Then again, when you look at the requirements for Unicode's composing
characters, it's a problem as well, for any encoding.

G
Feb 28 '06 #9

P: n/a
TB
Tomás skrev:
<snip>
#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx


#define StringLiteral(x) L##x

<snip>

--
TB @ SWEDEN
Feb 28 '06 #10

P: n/a
Gianni Mariani a écrit :
The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.


A bidirectionnal iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.
Feb 28 '06 #11

P: n/a
TomАs wrote:
Tomás posted:
Let's say we want to give the name of the nation in the nation's
official language... and so we want to use the Unicode character set to
achieve this.

What do you think of this? At the moment I'm writing code which I want
to support the normal character set and also Unicode... but I want to
keep it portable!

Any suggestions on how to go about this? Is the Microsoft way decent
enough?


I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8-Bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.


<OT>

If by "portable" you mean "running on any Windows system using its
native character encoding" and you are willing to have two binaries, one
for 9x/Me and one for 2k/XP, why don't you use the solution that
Microsoft has already created?

Use 'TCHAR' for characters, 'std::basic_string<TCHAR>' for strings, and
enclose string literals in '_T'.

I think you are unnecessarily trying to reinvent the wheel.

</OT>
Mar 1 '06 #12

P: n/a
loufoque wrote:
Gianni Mariani a écrit :
The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

A bidirectionnal iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.


I didn't say "hard", I said "non trivial", i.e. it's not a simple
increment or decrement of a pointer.

utf-16 is especially hard if the data is a mix of endianness since you
would need to check for embedded BOM's unless this string is normalized.
Mar 2 '06 #13

P: n/a

I've gone as far as to let both character sets be used at the same time.
Any opinions and suggestions welcome.

#include <iostream>
using std::cout;
using std::endl;

#define Literal(x) StringLiteral( x, L##x )
/*
The macro creates an anonymous object of type "StringLiteral".
It passes two arguments to its constructor: the char version
of the string, and the wchar_t version of the string.
*/

class StringLiteral
{
private:

const char* const p_c;
const wchar_t* const p_w;

public:

StringLiteral( const char* const c, const wchar_t* const w)
: p_c(c), p_w(w) {}

operator const char*() { return p_c; }

operator const wchar_t*() { return p_w; }

};

void GiveMeAnsiString(const char* p)
{
cout << "Ansi!" << endl;
}

void GiveMeUnicodeString(const wchar_t* p)
{

cout << "Unicode!" << endl;
}

int main()
{

GiveMeAnsiString( Literal("Amn't I a pretty string!") );

GiveMeUnicodeString( Literal("Amn't I a pretty string!") );
}
-Tomás
Mar 2 '06 #14

This discussion thread is closed

Replies have been disabled for this discussion.