Portable Code that supports Unicode


Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.

How does one go about using Unicode in portable code? Something like the
following?:

typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is 16-bit).

Furthermore, how do you go about making your code in such a way that it can
use either normal characters or wide characters. Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

class Nation {
public:
virtual const Character* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?

-Tomás


Feb 28 '06 #1
Tomás wrote:
Let's start off with:

class Nation {
public:
virtual const char* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const char* GetName() const
{
return "Norway";
}
};
Why are you using char* instead of std::basic_string<char_type>?
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
How does one go about using Unicode in portable code? Something like the
following?:
Unicode is still not part of the standard, so it is not portable.
typedef wchar_t UnicodeChar;

class Nation {
public:
virtual const UnicodeChar* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const UnicodeChar* GetName() const
{
return L"Norway"; //Note the preceding L
}
};
Would you use "wchar_t", or would you use "unsigned short"? (Unicode is 16-bit).
Not all Unicode is 16 bit, and not all 16 bit encodings are Unicode.
wchar_t is often not suitable for Unicode.
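
To make the point about wchar_t concrete, here is a minimal sketch that checks how wide wchar_t is on the compiler at hand. The helper names `wchar_bits` and `wchar_holds_any_code_point` are made up for illustration, and `constexpr` assumes a C++11 or later compiler, which postdates this thread.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in one wchar_t on this implementation.
// Commonly 16 on Windows and 32 on Linux, but the standard fixes neither.
constexpr std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Unicode code points run up to U+10FFFF, which needs 21 bits.
// (Hypothetical helper, not part of any library.)
constexpr bool wchar_holds_any_code_point()
{
    return wchar_bits() >= 21;
}
```

Where `wchar_holds_any_code_point()` is false, a single wchar_t cannot hold every code point, so code that treats one wchar_t as one character will mishandle part of the Unicode range.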

Until I was sure what I was doing, I would probably use:

class unicode_char {
/* wrap wchar_t */
};

typedef std::basic_string<unicode_char> ustring;
Furthermore, how do you go about making your code in such a way that it can
use either normal characters or wide characters. Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif
That's ugly and is not a model to be copied. If you need Unicode
support, just support Unicode.

Anyway, this is merely a way of supporting wide and narrow characters,
not encodings.
class Nation {
public:
virtual const Character* GetName() const = 0;
};

class Norway : public Nation {
public:
virtual const Character* GetName() const
{
return StringLiteral("Norway");
}
};
What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?


I think you need to decide what exactly it is you are doing, and read up
on Unicode.

So far you have only demonstrated wide and narrow character support, and
nothing to do with encodings.

You need to decide on an internal representation, and then you need to
provide mappings to your OS of choice, probably through stream operators
and facets. I don't know what your definition of portable is.

Ben Pope
--
I'm not just a number. To many, I'm known as a string...
Feb 28 '06 #2
Tomas wrote:
(Unicode is 16-bit).


Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.
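
As a rough illustration of why UCS-2 cannot cover the whole range while UTF-16 can, here is a sketch of a UTF-16 encoder for a single code point. The function name `encode_utf16` is hypothetical, and `<cstdint>` assumes a C++11 or later compiler.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take one unit; anything above takes a surrogate
// pair, which is exactly the part of the range UCS-2 cannot express.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);               // outside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogate values are not characters

    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 below
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Because the result is one or two units long depending on the code point, the n-th character of a UTF-16 string cannot be found by indexing alone, which is why only forward and backward traversal is on offer.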

Anyway, you shouldn't use pointers for strings, but string objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware that in the standard, though, std::wstring wasn't made for
Unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with Unicode support incompatible with the ones which
aren't, etc.
Just make your application Unicode-aware; compile-time flags that mess
everything up are useless.

I would advise to use Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.
Feb 28 '06 #3
loufoque wrote:
Tomas wrote:
(Unicode is 16-bit).

Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.


What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

Anyway, you shouldn't use pointers for strings, but string objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware that in the standard, though, std::wstring wasn't made for
Unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with Unicode support incompatible with the ones which
aren't, etc.
Just make your application Unicode-aware; compile-time flags that mess
everything up are useless.
I second that.

UTF-16 is also a big waste of time IMHO.

I would advise to use Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.

Feb 28 '06 #4
Ben Pope wrote:
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.
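
The split between character set and character encoding can be sketched in code: the input below is a code point (the number Unicode assigns to a character), and the output is one possible byte encoding of that number, here UTF-8. The function name `encode_utf8` is hypothetical, and the sketch does not reject surrogate values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point (the "character set" number) as UTF-8 bytes
// (one possible "character encoding" of that number).
std::string encode_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);  // outside the Unicode range
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, U+00E5 (the Norwegian letter å) is a single entry in the character set, but it becomes the two bytes C3 A5 in UTF-8 and the single 16-bit unit 00E5 in UTF-16.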

It is actually rather confusing, because "charset" has come to mean
"character encoding" through its usage in various protocols.

Unicode is still not part of the standard, so it is not portable.


Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.

Making the OS display the characters correctly is another thing.

Just because something isn't part of the standard doesn't mean it isn't
portable; one can write a portable std::string-like class rather easily.
Feb 28 '06 #5
loufoque wrote:
Ben Pope wrote:
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.

It is actually rather confusing, because "charset" has come to mean
"character encoding" through its usage in various protocols.


Yeah, sorry. I'm not helping the confusion. I actually started with
"charset" and expanded it as I scanned through for mistakes. D'oh!
Unicode is still not part of the standard, so it is not portable.


Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.


Of course, but there is no native support. In order to get full Unicode
support, you need a rather large library, or at least a decent framework
in which to stick encodings.
Making the OS display the characters correctly is another thing.
...that was my point.
Just because something isn't part of the standard doesn't mean it isn't
portable; one can write a portable std::string-like class rather easily.


Indeed, which is fine for internal use, it's the outside world which is
the problem. That's where standardisation (and support) needs to be.

Thanks for the clarifications.

Ben Pope
--
I'm not just a number. To many, I'm known as a string...
Feb 28 '06 #6
I always try to keep my posts implementation-independent... but anywho,
here's what I'm doing:

(About to drift off-topic...)

I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8-Bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.

There are two flavours of each Windows function, the ASCII one and the Unicode
one, for instance:

SetWindowTextA ( ASCII version )
SetWindowTextW ( Unicode version )

A person can use my control by adding a header file and source file to their
project. Like this:

#include <control.hpp>
using namespace Control;

int main()
{
PlaceCtrlOnDialog();
}
Anyway, the whole point is that while I want the control to support
Unicode, I also want it to support ASCII. I think the best way to do this is
to have a project-wide preprocessor macro such as UNICODE. Then, I could
have:

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx
#else
typedef char Character;
#define StringLiteral(x) x
#endif

const Character* GetAuthorName()
{
return StringLiteral("Tomás");
}
You may not think it's the most beautiful code, but it achieves its
objective.

Any thoughts?
-Tomás
Feb 28 '06 #7
Gianni Mariani wrote:
What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.


This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.
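
The container terminology above can be checked at compile time. This is a sketch under stated assumptions: `is_random_access` is a made-up helper, and `std::u32string` (C++11) merely stands in for a UCS-4 string with one fixed-width unit per code point.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <type_traits>

// True when container C hands out random access iterators.
// (Hypothetical helper, not a standard facility.)
template <typename C>
constexpr bool is_random_access()
{
    return std::is_same<
        typename std::iterator_traits<typename C::iterator>::iterator_category,
        std::random_access_iterator_tag>::value;
}

// std::list is a Reversible Container: its iterators are bidirectional only.
static_assert(!is_random_access<std::list<char> >(), "list is not random access");

// A string of fixed-width units is a Random Access Container.
static_assert(is_random_access<std::u32string>(), "u32string is random access");
```

A string class that exposes UTF-8 or UTF-16 data as a sequence of characters would sit in the first group, since reaching the n-th character means walking the units before it.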
Feb 28 '06 #8
loufoque wrote:
Gianni Mariani wrote:
What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.


Ah. I thought you were referring to Unicode terminology.

The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

Then again, when you look at the requirements for Unicode's composing
characters, it's a problem as well, for any encoding.

G
Feb 28 '06 #9
Tomás wrote:
<snip>
#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) Lx


#define StringLiteral(x) L##x
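
For anyone wondering why the `##` matters, here is a small sketch. The macro name `WIDEN` and the helper `norway_wide` are made up for illustration.

```cpp
#include <cwchar>

// L##x pastes the L prefix onto the literal token, so WIDEN("abc") becomes L"abc".
// Writing Lx instead would expand to the single identifier Lx, which is undefined.
#define WIDEN(x) L##x

// Hypothetical helper so the result can be inspected.
inline const wchar_t* norway_wide()
{
    return WIDEN("Norway");
}
```

One caveat: if the argument is itself a macro, `##` pastes the macro's name rather than its expansion, so a second level of indirection (one macro forwarding to another that does the pasting) is the usual workaround.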

<snip>

--
TB @ SWEDEN
Feb 28 '06 #10
Gianni Mariani wrote:
The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.


A bidirectional iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.
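
As a rough sketch of that claim for UTF-8, stepping in either direction only needs a test for continuation bytes. The function names are hypothetical, and the code assumes the string is valid UTF-8 and that `i` already sits on the first byte of a code point (or at the end, when stepping back).

```cpp
#include <cstddef>
#include <string>

// In UTF-8 every continuation byte has the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Index of the first byte of the code point that follows the one at i.
inline std::size_t next_code_point(const std::string& s, std::size_t i)
{
    ++i;
    while (i < s.size() && is_continuation(static_cast<unsigned char>(s[i]))) ++i;
    return i;
}

// Index of the first byte of the code point that precedes position i (requires i > 0).
inline std::size_t prev_code_point(const std::string& s, std::size_t i)
{
    --i;
    while (i > 0 && is_continuation(static_cast<unsigned char>(s[i]))) --i;
    return i;
}
```

Both steps are short loops rather than a single pointer increment or decrement, and there is still no way to jump straight to the n-th code point without walking the bytes in between.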
Feb 28 '06 #11
Tomás wrote:
Tomás posted:
Let's say we want to give the name of the nation in the nation's
official language... and so we want to use the Unicode character set to
achieve this.

What do you think of this? At the moment I'm writing code which I want
to support the normal character set and also Unicode... but I want to
keep it portable!

Any suggestions on how to go about this? Is the Microsoft way decent
enough?


I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8-Bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.


<OT>

If by "portable" you mean "running on any Windows system using its
native character encoding" and you are willing to have two binaries, one
for 9x/Me and one for 2k/XP, why don't you use the solution that
Microsoft has already created?

Use 'TCHAR' for characters, 'std::basic_string<TCHAR>' for strings, and
enclose string literals in '_T'.

I think you are unnecessarily trying to reinvent the wheel.

</OT>
Mar 1 '06 #12
loufoque wrote:
Gianni Mariani wrote:
The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

A bidirectional iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.


I didn't say "hard", I said "non trivial", i.e. it's not a simple
increment or decrement of a pointer.

utf-16 is especially hard if the data is a mix of endianness, since you
would need to check for embedded BOMs unless the string is normalized.
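
To illustrate the endianness issue, here is a sketch that looks for a byte order mark at the start of a UTF-16 byte buffer and then assembles code units accordingly. The names `Utf16Order`, `detect_bom` and `read_unit` are hypothetical, `enum class` assumes C++11, and treating a missing BOM as big-endian is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Utf16Order { Big, Little, Unknown };

// Inspect the first two bytes for a byte order mark (U+FEFF).
inline Utf16Order detect_bom(const std::vector<unsigned char>& bytes)
{
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Utf16Order::Big;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Utf16Order::Little;
    }
    return Utf16Order::Unknown;
}

// Assemble the 16-bit code unit starting at byte offset i (requires i + 1 < bytes.size()).
// Anything other than Little is read as big-endian in this sketch.
inline std::uint16_t read_unit(const std::vector<unsigned char>& bytes,
                               std::size_t i, Utf16Order order)
{
    return order == Utf16Order::Little
        ? static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8))
        : static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]);
}
```

A buffer that switches byte order partway through would need this check repeated wherever a BOM may appear, which is the extra work being described.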
Mar 2 '06 #13

I've gone as far as to let both character sets be used at the same time.
Any opinions and suggestions welcome.

#include <iostream>
using std::cout;
using std::endl;

#define Literal(x) StringLiteral( x, L##x )
/*
The macro creates an anonymous object of type "StringLiteral".
It passes two arguments to its constructor: the char version
of the string, and the wchar_t version of the string.
*/

class StringLiteral
{
private:

const char* const p_c;
const wchar_t* const p_w;

public:

StringLiteral( const char* const c, const wchar_t* const w)
: p_c(c), p_w(w) {}

operator const char*() const { return p_c; }

operator const wchar_t*() const { return p_w; }

};

void GiveMeAnsiString(const char* p)
{
cout << "Ansi!" << endl;
}

void GiveMeUnicodeString(const wchar_t* p)
{

cout << "Unicode!" << endl;
}

int main()
{

GiveMeAnsiString( Literal("Amn't I a pretty string!") );

GiveMeUnicodeString( Literal("Amn't I a pretty string!") );
}
-Tomás
Mar 2 '06 #14
