Portable Code that supports Unicode

Tomás

Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.

How does one go about using Unicode in portable code? Something like the
following?:

typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is 16-bit).

Furthermore, how do you go about writing your code in such a way that it can
use either normal characters or wide characters? Microsoft do it something
like the following (you define the UNICODE macro if you're using Unicode):

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

class Nation {
public:
virtual const Character* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?

-Tomás

Feb 28 '06 #1


Ben Pope

Tomás wrote:

Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Why are you using char* instead of std::basic_string<char_type>?
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
How does one go about using Unicode in portable code? Something like the
following?:
Unicode is still not part of the standard, so it is not portable.
typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is 16-bit).
Not all Unicode is 16 bit, and not all 16 bit encodings are Unicode.
wchar_t is often not suitable for Unicode.

Until I was sure what I was doing, I would probably use:

class unicode_char {
/* wrap wchar_t */
};

typedef std::basic_string<unicode_char> ustring;
Furthermore, how do you go about making your code in such a way that it can
use either normal characters or wide characters. Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif
That's ugly and is not a model to be copied. If you need Unicode
support, just support Unicode.

Anyway, this is merely a way of supporting wide and narrow characters,
not encodings.
class Nation {
public:
virtual const Character* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?

I think you need to decide what exactly it is you are doing, and read up
on Unicode.

So far you have only demonstrated wide and narrow character support, and
nothing to do with encodings.

You need to decide on an internal representation, and then you need to
provide mappings to your OS of choice, probably through stream operators
and facets. I don't know what your definition of portable is.
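
As a rough illustration of settling on one internal representation, the sketch below stores names as UTF-8 bytes in a std::string. The choice of UTF-8, the Iceland class, and the virtual destructor are assumptions added for illustration and are not from the original post; converting to whatever the OS expects would still have to happen at the boundary.

```cpp
#include <string>

class Nation {
public:
    virtual ~Nation() {}
    // Name in the nation's official language.
    // Assumption of this sketch: the bytes are UTF-8.
    virtual std::string GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual std::string GetName() const
    {
        return "Norge";  // pure ASCII, so identical in UTF-8
    }
};

// Hypothetical class, added only to show a non-ASCII name.
class Iceland : public Nation {
public:
    virtual std::string GetName() const
    {
        // U+00CD (capital I with acute) is the two bytes C3 8D in UTF-8.
        return "\xC3\x8Dsland";
    }
};
```

Note that size() on such a string counts bytes, not characters, which is one reason a dedicated string class can be worth having.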

Ben Pope
--
I'm not just a number. To many, I'm known as a string...

Feb 28 '06 #2

loufoque

Tomas wrote:

(Unicode is 16-bit).

Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.
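
To make the relationship between the 21-bit code point range and a variable-length encoding concrete, here is a minimal sketch of a UTF-8 encoder. The function name append_utf8 is hypothetical and not taken from any library mentioned in this thread; it only shows how one code point becomes one to four bytes.

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
// Returns false (and appends nothing) for surrogates and for values above
// U+10FFFF, which are not valid Unicode scalar values.
bool append_utf8(std::string& out, unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range

    if (cp < 0x80) {                                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Because the byte count depends on the code point, the n-th character cannot be found by index arithmetic, which is why only forward and backward traversal is cheap.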

Anyway, you shouldn't use pointers for strings, but string objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware that in the standard, though, std::wstring wasn't made for
Unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with unicode support incompatible with the ones which
aren't etc.
Just make your application Unicode-aware; compile-time flags that mess
everything up are useless.

I would advise using Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.

Feb 28 '06 #3

Gianni Mariani

loufoque wrote:

Tomas wrote:
(Unicode is 16-bit).

Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.

What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

Anyway you shouldn't use pointers for strings, but strings objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware than in the standard, though, std::wstring wasn't made for
unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with unicode support incompatible with the ones which
aren't etc.
Just make your application unicode aware, compiling flags to mess
everything up are useless.
I second that.

UTF-16 is also a big waste of time IMHO.

I would advise to use Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.

Feb 28 '06 #4

loufoque

Ben Pope wrote:

WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.

It is actually rather confusing, because "charset" has come to mean "character
encoding" through its usage in various protocols.

Unicode is still not part of the standard, so it is not portable.

Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.

Making the OS display the characters correctly is another thing.

Just because something isn't part of the standard doesn't mean it isn't
portable; one can write a portable std::string-like class rather easily.

Feb 28 '06 #5

Ben Pope

loufoque wrote:

Ben Pope wrote:
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.

It is actually rather confusing because "charset" is "character
encoding" because of its usage in various protocols for character encoding.

Yeah, sorry. I'm not helping the confusion. I actually started with
"charset" and expanded it as I scanned through for mistakes. D'oh!

Unicode is still not part of the standard, so it is not portable.

Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.

Of course, but there is no native support. In order to get full Unicode
support, you need a rather large library, or at least a decent framework
in which to stick encodings.
Making the OS display the characters correctly is another thing.
...that was my point.
It's not because something isn't part of the standard that it isn't
portable, one can write a portable std::string-like rather easily.

Indeed, which is fine for internal use, it's the outside world which is
the problem. That's where standardisation (and support) needs to be.

Thanks for the clarifications.

Ben Pope
--
I'm not just a number. To many, I'm known as a string...

Feb 28 '06 #6

Tomás

Tomás posted:

Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Let's say we want to give the name of the nation in the nation's
official language... and so we want to use the Unicode character set to
achieve this.

How does one go about using Unicode in portable code? Something like
the following?:

typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is
16-bit).

Furthermore, how do you go about making your code in such a way that it
can use either normal characters or wide characters. Microsoft do it
something like the following: (You define the UNICODE macro if you're
using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

class Nation {
public:
virtual const Character* GetName() const = 0;
}

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want
to support the normal character set and also Unicode... but I want to
keep it portable!

Any suggestions on how to go about this? Is the Microsoft way decent
enough?

-Tomás

I always try to keep my posts implementation independent... but anywho
here's what I'm doing:

(About to drift off-topic...)

I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8 bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.

There are two flavours of each Windows function, the ASCII one and the Unicode
one, for instance:

SetWindowTextA ( ASCII version )
SetWindowTextW ( Unicode version )

A person can use my control by adding a header file and source file to their
project. Like this:

#include <control.hpp>
using namespace Control;

int main()
{
PlaceCtrlOnDialog();
}
Anyway, the whole point is that while I want the control to support
Unicode, I also want it to support ASCII. I think the best way to do this is
to have a project-wide preprocessor macro such as UNICODE. Then, I could
have:

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

const Character* GetAuthorName()
{
return StringLiteral("Tomás");
}
You may not think it's the most beautiful code, but it achieves its
objective.

Any thoughts?
-Tomás

Feb 28 '06 #7

loufoque

Gianni Mariani wrote:

What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.

Feb 28 '06 #8

Gianni Mariani

loufoque wrote:

Gianni Mariani wrote:
What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.

Ah. I thought you were referring to Unicode terminology.

The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

Then again, when you look at the requirements for Unicode's composing
characters, it's a problem as well, for any encoding.

G

Feb 28 '06 #9

Tomás wrote:
<snip>

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx

#define StringLiteral(x) L##x
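
The correction above relies on the preprocessor's token-pasting operator. As a small sketch of how that works in practice, the macros below (WIDEN, WIDEN_IMPL and GREETING are hypothetical names, not from Microsoft's headers) use a second level of expansion so that an argument which is itself a macro gets expanded before the L prefix is pasted on.

```cpp
#include <cwchar>

// L##x pastes the prefix L onto the token x, turning "abc" into L"abc".
#define WIDEN_IMPL(x) L##x
// The extra level lets a macro argument be expanded before the paste happens.
#define WIDEN(x) WIDEN_IMPL(x)

// Hypothetical narrow literal used only for illustration.
#define GREETING "Norge"

// Returns the wide-character form of GREETING, i.e. L"Norge".
const wchar_t* wide_greeting()
{
    return WIDEN(GREETING);
}
```

With the single-level form, passing GREETING would paste L onto the identifier itself and produce the unknown token LGREETING.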

<snip>

--
TB @ SWEDEN

Feb 28 '06 #10

loufoque

Gianni Mariani wrote:

The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

A bidirectional iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.
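
As a rough sketch of what such bidirectional stepping could look like for UTF-8, the helpers below move an index to the next or previous character boundary in a std::string. The function names are hypothetical, and the code assumes the string already holds well-formed UTF-8 and that the index starts on a character boundary.

```cpp
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Index of the character that starts after the one at i.
// Precondition: i < s.size() and i is a character boundary.
std::size_t next_char(const std::string& s, std::size_t i)
{
    ++i;
    while (i < s.size() && is_continuation(static_cast<unsigned char>(s[i])))
        ++i;
    return i;
}

// Index of the character that starts before position i.
// Precondition: 0 < i <= s.size() and i is a character boundary.
std::size_t prev_char(const std::string& s, std::size_t i)
{
    --i;
    while (i > 0 && is_continuation(static_cast<unsigned char>(s[i])))
        --i;
    return i;
}

// Counting characters requires a full walk: there is no random access.
std::size_t count_chars(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); i = next_char(s, i))
        ++n;
    return n;
}
```

Each step is a short loop instead of a single pointer increment, and reaching the n-th character costs time proportional to n.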

Feb 28 '06 #11

Martin Vejnar

Tomás wrote:

Tomás posted:
Let's say we want to give the name of the nation in the nation's
official language... and so we want to use the Unicode character set to
achieve this.

What do you think of this? At the moment I'm writing code which I want
to support the normal character set and also Unicode... but I want to
keep it portable!

Any suggestions on how to go about this? Is the Microsoft way decent
enough?

I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8-Bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.

<OT>

If by "portable" you mean "running on any Windows system using its
native character encoding" and you are willing to have two binaries, one
for 9x/Me and one for 2k/XP, why don't you use the solution that
Microsoft has already created?

Use 'TCHAR' for characters, 'std::basic_string<TCHAR>' for strings, and
enclose string literals in '_T'.

I think you are unnecessarily trying to reinvent the wheel.

</OT>

Mar 1 '06 #12

Gianni Mariani

loufoque wrote:

Gianni Mariani wrote:
The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

A bidirectionnal iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.

I didn't say "hard", I said "non trivial", i.e. it's not a simple
increment or decrement of a pointer.

utf-16 is especially hard if the data is a mix of endianness, since you
would need to check for embedded BOMs unless the string is normalized.
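
To show why walking UTF-16 is more than a pointer increment, here is a minimal sketch that decodes a byte stream into code points. The function names are hypothetical. It honours only a leading BOM (big-endian is assumed when there is none), does not deal with BOMs embedded mid-stream, and assumes well-formed input.

```cpp
#include <cstddef>
#include <vector>

// Read one 16-bit code unit starting at byte offset i.
static unsigned long read_unit(const std::vector<unsigned char>& b,
                               std::size_t i, bool big_endian)
{
    return big_endian ? (static_cast<unsigned long>(b[i]) << 8) | b[i + 1]
                      : (static_cast<unsigned long>(b[i + 1]) << 8) | b[i];
}

// Hypothetical helper: decode UTF-16 bytes into Unicode code points.
std::vector<unsigned long> decode_utf16(const std::vector<unsigned char>& bytes)
{
    std::vector<unsigned long> out;
    bool big_endian = true;   // assumption when no BOM is present
    std::size_t i = 0;

    // A leading BOM (U+FEFF) tells us the byte order.
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) { big_endian = true;  i = 2; }
        else if (bytes[0] == 0xFF && bytes[1] == 0xFE) { big_endian = false; i = 2; }
    }

    while (i + 1 < bytes.size()) {
        unsigned long unit = read_unit(bytes, i, big_endian);
        i += 2;
        // A high surrogate followed by a low surrogate encodes one code
        // point above U+FFFF.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < bytes.size()) {
            unsigned long low = read_unit(bytes, i, big_endian);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i += 2;
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        out.push_back(unit);
    }
    return out;
}
```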

Mar 2 '06 #13

Tomás

I've gone as far as to let both character sets be used at the same time.
Any opinions and suggestions welcome.

#include <iostream>
using std::cout;
using std::endl;

#define Literal(x) StringLiteral( x, L##x )
/*
The macro creates an anonymous object of type "StringLiteral".
It passes two arguments to its constructor: the char version
of the string, and the wchar_t version of the string.
*/

class StringLiteral
{
private:

const char* const p_c;
const wchar_t* const p_w;

public:

StringLiteral( const char* const c, const wchar_t* const w)
: p_c(c), p_w(w) {}

operator const char*() const { return p_c; }

operator const wchar_t*() const { return p_w; }

};

void GiveMeAnsiString(const char* p)
{
cout << "Ansi!" << endl;
}

void GiveMeUnicodeString(const wchar_t* p)
{

cout << "Unicode!" << endl;
}

int main()
{

GiveMeAnsiString( Literal("Amn't I a pretty string!") );

GiveMeUnicodeString( Literal("Amn't I a pretty string!") );
}
-Tomás

Mar 2 '06 #14
