wide characters: "illusion of portability"?

Jonathan Mcdougall

I started using boost's filesystem library a
couple of days ago. In its FAQ, it states

"Wide-character names would provide an illusion of
portability where portability does not in fact
exist. Behavior would be completely different on
operating systems (Windows, for example) that
support wide-character names, than on systems
which don't (POSIX). Providing functionality that
appears to provide portability but in fact
delivers only implementation-defined behavior is
highly undesirable. Programs would not even be
portable between library implementations on the
same operating system, let alone portable to
different operating systems.

The C++ standards committee Library Working Group
discussed this in some detail both on the
committee's library reflector and at the Spring,
2002, meeting, and feels that (1) names based on
types other than char are extremely non-portable,
(2) there are no agreed upon semantics for
conversion between wide-character and
narrow-character names for file systems which do
not support wide-character name, and (3) even the
committee members most interested in
wide-character names are unsure that they are a
good idea in the context of a portable library.
(boost/libs/filesystem/doc/faq.htm)"

This surprised me, since I thought wide characters
were mandatory in production code.

In fact, of all the libraries I am using right
now, very few of them are Unicode compatible.
Therefore, I am stuck with two choices: abandon
Unicode in favor of plain chars, or do time-consuming
and not necessarily valid conversions
between wide-character and narrow-character strings.

I have several questions about this.

1. Are C++ wide-character strings used in real
life, in production code?
2. Is it a good idea to let the user choose
between Unicode and ASCII in a library in a
transparent way (such as Microsoft's -A and -W
versions of all functions)?
3. What is the best way to convert wide strings to
and from narrow strings? System-dependent
functions? A simple loop converting char's to
wchar_t's?
4. Will C++0x provide more means for using wide
and narrow strings, such as conversions and
transparency (converting "strings" into L"strings"
automatically, for example, or providing standard
macros such as UNICODE)?
5. Are wide characters meant to be used with
Unicode or are they provided for an
implementation-defined use?

Thank you,

Jonathan

Jul 23 '05 #1

Niels Dybdahl

> 1. Are C++ wide-character strings used in real
> life, in production code?

Yes, I use them in my applications. I started using wide chars when I had to
handle text from a newspaper that contained characters outside the ISO 8859-1
range. Now I more and more often have to handle XML sources that are UTF-8
encoded. The easiest way to handle this is to use wide chars.

> 2. Is it a good idea to let the user choose
> between Unicode and ASCII in a library in a
> transparent way (such as Microsoft's -A and -W
> versions of all functions)?

I think that the programmer has to be aware of whether he is handling
wide chars, UTF-8, ISO 8859-1 or ASCII. So there is no need to hide from
the programmer whether he uses the A or the W version of a
function.

> 3. What is the best way to convert wide strings to
> and from narrow strings? System-dependent
> functions? A simple loop converting char's to
> wchar_t's?

A simple loop probably does not handle the special cases. The Windows code page is not
ISO 8859-1, and some of the characters need special handling.
So I would use a specially written function for that purpose.
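
To make the special-case point concrete, here is a minimal sketch of the "simple loop" (not code from this thread; the names widen_latin1 and widen_latin1_example are made up for illustration). It assumes the narrow text is ISO 8859-1, where every byte value equals its Unicode code point. For Windows-1252 input the bytes 0x80 to 0x9F come out wrong, which is the kind of special case a dedicated conversion function has to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive widening: copies each byte value into a wchar_t unchanged.
// The result is correct Unicode only if the input is ISO 8859-1 (or ASCII).
std::wstring widen_latin1(const std::string& narrow)
{
    std::wstring wide;
    wide.reserve(narrow.size());
    for (std::size_t i = 0; i < narrow.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char byte = static_cast<unsigned char>(narrow[i]);
        wide.push_back(static_cast<wchar_t>(byte));
    }
    return wide;
}

// Illustration of where the loop is right and where it is not.
void widen_latin1_example()
{
    // 0xE9 is e-acute (U+00E9) in both ISO 8859-1 and Windows-1252.
    assert(widen_latin1("\xE9") == std::wstring(1, wchar_t(0x00E9)));

    // 0x80 is the euro sign (U+20AC) in Windows-1252, but the loop
    // produces U+0080, which is a control character.
    assert(widen_latin1("\x80")[0] == wchar_t(0x0080));
}
```
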

Best regards
Niels Dybdahl

Jul 23 '05 #2

Samuel Krempp

jo***************@DELyahoo.ca (02 May 2005 06:56,
news:<uq*******************@wagner.videotron.net>) wrote:

> I started using boost's filesystem library a
> couple of days ago. In its FAQ, it states
>
> "Wide-character names would provide an illusion of
> portability where portability does not in fact
> exist. Behavior would be completely different on
> operating systems (Windows, for example) that
> support wide-character names, than on systems

I think you overlooked a detail here: what made this portability an
"illusion" is how differently file systems define their file-name rules.

It doesn't mean the use of wide-character strings in C++ is not portable, only
that using Unicode filenames with various native filesystems is not.
(But boost::filesystem started an "i18n" branch recently, and accessing
files by wide-char names seems to be on the menu, so someone probably took
some time to make that work for the most common platforms.)

> 2. Is it a good idea to let the user choose
> between Unicode and ASCII in a library in a
> transparent way (such as Microsoft's -A and -W
> versions of all functions)?

I think it is.
POSIX systems let the user choose a locale (by setting $LANG, or various
sub-variables like LC_CTYPE).

You can (hope to) get the user's environment locale with portable C++:
std::locale userLocale("");
but then there's no easy portable way to know whether this locale uses
UTF-8 for charset encoding or something else. (On POSIX systems you can try to detect
whether "UTF-8" occurs in the userLocale.name() string.)

In basic situations, you should *not* need to know, but just:
.. use wide chars in your program, and wide streams
.. imbue the user's locale on all the wide streams you use and let them
handle conversions.
[In fact, wcout might not work as well as other wide streams. I found
that imbuing a locale on wcout was ignored, and that setting the global locale:
std::locale::global(userLocale);
prior to using wcout was the only way to get the locale to have any effect on
wcout.]
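
A minimal sketch of the setup described above, for what it's worth. The helper names (user_locale, looks_like_utf8, setup_wide_output) are mine and purely illustrative, and the fallback to the classic locale is an assumption about what you would want when the environment names a locale the system does not have. It does both things mentioned: sets the global locale and imbues wcout.

```cpp
#include <cassert>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Returns the user's environment locale. std::locale("") throws
// std::runtime_error if the environment names an unknown locale,
// so fall back to the classic "C" locale in that case.
std::locale user_locale()
{
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Non-portable heuristic: on POSIX systems the locale name often
// contains the charset, e.g. "en_US.UTF-8".
bool looks_like_utf8(const std::locale& loc)
{
    const std::string name = loc.name();
    return name.find("UTF-8") != std::string::npos
        || name.find("utf8") != std::string::npos;
}

// Makes the user's locale the global one and imbues it on wcout.
void setup_wide_output()
{
    const std::locale loc = user_locale();
    std::locale::global(loc);  // picked up by streams constructed later
    std::wcout.imbue(loc);     // wcout already exists, so imbue it too
}

void wide_output_example()
{
    setup_wide_output();
    std::wcout << L"wide output is ready" << std::endl;
    assert(std::wcout.good());
}
```
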


> 3. What is the best way to convert wide strings to
> and from narrow strings? System-dependent
> functions? A simple loop converting char's to
> wchar_t's?

I think the expected way is to let the wide streams handle the conversions.
They use their locale's codecvt facet to convert the internal char_type
sequences to the external char encoding.

if you have to widen/narrow stuff yourself, you can use a locale's widen and
narrow function.
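
For instance, something along these lines (a sketch only; the wrapper names widen_with_locale and narrow_with_locale are made up here). It uses the ctype<wchar_t> facet of whatever locale you pass in. Note that widen and narrow map one char to one wchar_t, so they do not decode multibyte sequences such as UTF-8.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Widen each char of s with the ctype<wchar_t> facet of loc.
std::wstring widen_with_locale(const std::string& s, const std::locale& loc)
{
    if (s.empty()) return std::wstring();
    std::vector<wchar_t> buf(s.size());
    std::use_facet<std::ctype<wchar_t> >(loc)
        .widen(s.data(), s.data() + s.size(), &buf[0]);
    return std::wstring(buf.begin(), buf.end());
}

// Narrow each wchar_t of w; characters with no narrow equivalent in
// loc are replaced by fallback.
std::string narrow_with_locale(const std::wstring& w, const std::locale& loc,
                               char fallback = '?')
{
    if (w.empty()) return std::string();
    std::vector<char> buf(w.size());
    std::use_facet<std::ctype<wchar_t> >(loc)
        .narrow(w.data(), w.data() + w.size(), fallback, &buf[0]);
    return std::string(buf.begin(), buf.end());
}

void widen_narrow_example()
{
    const std::locale& c_locale = std::locale::classic();
    assert(widen_with_locale("hello", c_locale) == L"hello");
    assert(narrow_with_locale(L"hello", c_locale) == "hello");
}
```
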

everything boils down to using the "right" locale for your situation.
(note boost - and other portable libraries - provide UTF-8 locales that
provide conversion from wchars holding unicode code-points to UTF-8 encoded
char sequences)

> 4. Will C++0x provide more means for using wide
> and narrow strings, such as conversions and
> transparency (converting "strings" into L"strings"
> automatically, for example, providing standard
> macros such as UNICODE)

That's already handled by the current standard, but this conversion is not
canonical: different locales can mean different conversions, so the result
depends on the locale.
A locale's codecvt<wchar_t, char, mbstate_t> facet serves that purpose.

For more details on locales, check Stroustrup's Appendix D :
http://www.research.att.com/~bs/3rd_loc0.html
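
As a rough illustration of using that facet directly (a sketch; decode_with_codecvt is a made-up helper name, and what counts as valid input depends entirely on the locale you pass in):

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Decode a narrow string into wide characters with the codecvt facet of
// loc. Returns false if the input is not a complete, valid sequence in
// that locale's encoding.
bool decode_with_codecvt(const std::string& in, std::wstring& out,
                         const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(loc);

    out.clear();
    if (in.empty()) return true;

    // Each wide character consumes at least one input byte, so a buffer
    // of in.size() wide characters is always large enough.
    std::vector<wchar_t> buf(in.size());
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &buf[0], &buf[0] + buf.size(), to_next);

    // partial, error and (unexpected for this facet) noconv are all
    // treated as failure in this sketch.
    if (r != std::codecvt_base::ok) return false;

    out.assign(&buf[0], to_next);
    return true;
}

void decode_example()
{
    std::wstring wide;
    assert(decode_with_codecvt("hello", wide, std::locale::classic()));
    assert(wide == L"hello");
}
```

With the classic locale only plain ASCII is safe to rely on; to decode UTF-8 you would pass a locale whose codecvt facet does UTF-8.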

> 5. Are wide characters meant to be used with
> Unicode or are they provided for an
> implementation-defined use?

Almost everything is implementation-defined when it comes to locales and
wide chars.
The values in wchar_t are most of the time Unicode (UTF-32) code points,
but check your compiler's documentation if you have to rely on it. For
instance, gcc-3.4 lets you modify that with the command-line option
-fwide-exec-charset, and uses UTF-32 by default (or UTF-16 where wchar_t
is only 16 bits wide).

The way I see it, you can either :
1. use the compiler's native encoding of wide characters, along with the
native locales, and let your compiler's library do its work. In this case,
you don't care what the values are in those wchar_t, as long as it matches
what the locales expect. (and it should !).

2. enforce your own wide-char encoding (on a 4+ bytes type), and your own
conversions (with a 3rd party facet, or set of functions), without ever
using the compiler's native locale and wide IO features.

3. if you want to mix native stuff with third-party tools : set-up the
proper native-to-UTF-32 conversion system (e.g. make a header which tests
compiler-specific and std::library-specific macros, and does the proper
conversion, or aborts, or whatever. In most of the cases, the proper
conversion is keeping the wchar_t values untouched) and apply that
conversion between native calls and third-party UTF-32 calls.
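
Option 3 could be sketched like this. It is only an illustration: wide_to_utf32 is a made-up name, it needs char32_t and std::u32string (so a C++11 compiler), and it assumes wchar_t holds UTF-32 when it is 4 bytes wide and UTF-16 when it is 2 bytes wide. Compilation is refused for any other width, which is the "or aborts" part.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
              "unexpected wchar_t width: no native-to-UTF-32 conversion known");

// Convert a native wide string to UTF-32 code points.
// 4-byte wchar_t: values are passed through untouched.
// 2-byte wchar_t: surrogate pairs are combined into one code point.
std::u32string wide_to_utf32(const std::wstring& w)
{
    std::u32string out;
    out.reserve(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        char32_t c = static_cast<char32_t>(w[i]);
        if (sizeof(wchar_t) == 2 && c >= 0xD800 && c <= 0xDBFF
            && i + 1 < w.size()) {
            const char32_t low = static_cast<char32_t>(w[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}

void wide_to_utf32_example()
{
    assert(wide_to_utf32(L"hi") == U"hi");
    // U+20AC (euro sign) fits in a single wchar_t on both widths.
    assert(wide_to_utf32(std::wstring(1, wchar_t(0x20AC)))
           == std::u32string(1, char32_t(0x20AC)));
}
```
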

--
Samuel

Jul 23 '05 #3

Jonathan Mcdougall

Samuel Krempp wrote:

> jo***************@DELyahoo.ca (02 May 2005 06:56,
> news:<uq*******************@wagner.videotron.net>) wrote:
>
>> I started using boost's filesystem library a
>> couple of days ago. In its FAQ, it states
>>
>> "Wide-character names would provide an illusion of
>> portability where portability does not in fact
>> exist. Behavior would be completely different on
>> operating systems (Windows, for example) that
>> support wide-character names, than on systems
>
> I think you overlooked a detail here: what made this portability an
> "illusion" is how differently file systems define their file-name rules.
>
> It doesn't mean the use of wide-character strings in C++ is not portable, only
> that using Unicode filenames with various native filesystems is not.
> (But boost::filesystem started an "i18n" branch recently, and accessing
> files by wide-char names seems to be on the menu, so someone probably took
> some time to make that work for the most common platforms.)

Oh. I thought it was a general statement.

>> 2. Is it a good idea to let the user choose
>> between Unicode and ASCII in a library in a
>> transparent way (such as Microsoft's -A and -W
>> versions of all functions)?
>
> I think it is.

Good.

>> 3. What is the best way to convert wide strings to
>> and from narrow strings? System-dependent
>> functions? A simple loop converting char's to
>> wchar_t's?
>
> I think the expected way is to let the wide streams handle the conversions.
> They use their locale's codecvt facet to convert the internal char_type
> sequences to the external char encoding.
>
> if you have to widen/narrow stuff yourself, you can use a locale's widen and
> narrow function.

I wasn't aware of these functions. Would
something like

char s[6] = {"hello"};
wchar_t w[6] = {0};

std::locale locale("");
std::use_facet< std::ctype<wchar_t> >(locale)
.widen(s, s+6, w);

work?

> The way I see it, you can either :
> 1. use the compiler's native encoding of wide characters, along with the
> native locales, and let your compiler's library do its work. In this case,
> you don't care what the values are in those wchar_t, as long as it matches
> what the locales expect. (and it should !).

I know we're getting off-topic, so you may prefer not
to answer.

I am working on a kind of ID3 editor. ID3
information may be encoded in ISO 8859-1
or in Unicode (UTF-8 or UTF-16). On a compiler
using, for example, MBCS for wide characters, what
could standard C++ do for me?

> 2. enforce your own wide-char encoding (on a 4+ bytes type), and your own
> conversions (with a 3rd party facet, or set of functions), without ever
> using the compiler's native locale and wide IO features.

That would definitely be bad for code reuse :)

> 3. if you want to mix native stuff with third-party tools : set-up the
> proper native-to-UTF-32 conversion system (e.g. make a header which tests
> compiler-specific and std::library-specific macros, and does the proper
> conversion, or aborts, or whatever. In most of the cases, the proper
> conversion is keeping the wchar_t values untouched) and apply that
> conversion between native calls and third-party UTF-32 calls.

So adopt a single encoding in my code and provide
conversions (adapters) for other libraries?

Thank you,

Jonathan

Jul 23 '05 #4
