Converting between Unicode and default locale

Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".

Thanks,
Keith MacDonald
Jul 19 '05 #1

"Keith MacDonald" <ke***@text-snip-pad.com> wrote in message
news:bl*******************@news.demon.co.uk...
Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".


I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.

Also note that depending upon your platform's byte size,
not all Unicode values will necessarily fit into type
'char'.

-Mike
Jul 19 '05 #2

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:ok*****************@newsread3.news.pas.earthlink.net...
I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.


Multibyte is using more than one char to encode a character.
wchar_t is fixed size wide characters. But I knew what you
meant.

Yes, it's a major defect in the internationalization support.
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from the rest of the
standard community who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.
Jul 19 '05 #3
On Fri, 26 Sep 2003 21:21:38 +0100, Keith MacDonald wrote:
Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".


Try mbstowcs/wcstombs.
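
For anyone wanting to try this, here is a minimal sketch of the two calls wrapped for std::string and std::wstring. The names to_wide and to_multibyte are made up for illustration. Both functions convert according to the current C locale (the default "C" locale unless the program has called std::setlocale), and both stop at the first embedded NUL, so treat this as a starting point rather than a complete answer.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a multibyte string (in the current C locale's encoding) to wide
// characters. Returns an empty string if the input is not a valid sequence.
std::wstring to_wide(const std::string& mb)
{
    // A wide string never needs more elements than the input has bytes.
    std::vector<wchar_t> buffer(mb.size() + 1);
    const std::size_t count =
        std::mbstowcs(buffer.data(), mb.c_str(), buffer.size());
    if (count == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(buffer.data(), count);
}

// Convert wide characters to the current C locale's multibyte encoding.
// Returns an empty string if some character cannot be represented.
std::string to_multibyte(const std::wstring& wide)
{
    // MB_CUR_MAX is the largest number of bytes one character can need
    // in the current locale.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);
    const std::size_t count =
        std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (count == static_cast<std::size_t>(-1))
        return std::string();
    return std::string(buffer.data(), count);
}
```

Calling std::setlocale(LC_ALL, "") first switches the conversion to the user's default locale; which encodings that covers depends on the platform.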
--
Aaron Isotton
http://www.isotton.com/

Jul 19 '05 #4
Ron Natalie wrote:
"Mike Wahler" <mk******@mkwahler.net> wrote in message news:ok*****************@newsread3.news.pas.earthlink.net...

I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.

Multibyte is using more than one char to encode a character.
wchar_t is fixed size wide characters. But I knew what you
meant.

Yes, it's a major defect in the internationalization support.
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from the rest of the
standard community who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.


Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
type. UTF-16 usually breaks a whole bunch of assumptions on what a
wchar_t type is supposed to be.

On platforms that use utf-16, the complexity of processing ucs-4 or
utf-16 characters is equivalent so it makes sense to only support utf-8.

If you know your code is ONLY dealing with utf-8 characters, you can
make processing utf-8 characters very efficient by inlining some of the
code that deals with utf-8.
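
As a rough illustration of the kind of inlined UTF-8 code meant here, the sketch below decodes one code point at a time. The name decode_utf8 is hypothetical, char32_t assumes a C++11 compiler, and the function does not reject overlong forms or encoded surrogates, so it is not a full validator.

```cpp
#include <cstddef>
#include <string>

// Decode the code point that starts at s[pos] and advance pos past it.
// Precondition: pos < s.size().
// Returns U+FFFD (the replacement character) for malformed or truncated input.
inline char32_t decode_utf8(const std::string& s, std::size_t& pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80)
        return lead;                      // one-byte (ASCII) fast path

    int extra;                            // number of continuation bytes
    char32_t cp;                          // payload bits collected so far
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else
        return 0xFFFD;                    // stray continuation or invalid lead

    for (int i = 0; i < extra; ++i) {
        if (pos >= s.size())
            return 0xFFFD;                // sequence cut off by end of input
        const unsigned char c = static_cast<unsigned char>(s[pos]);
        if ((c & 0xC0) != 0x80)
            return 0xFFFD;                // not a continuation byte; leave it
        cp = (cp << 6) | (c & 0x3F);
        ++pos;
    }
    return cp;
}
```

The ASCII test comes first so that the common case costs a single comparison once the call is inlined.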
Jul 19 '05 #5
"Ron Natalie" <ro*@sensor.com> wrote in message
news:3f*********************@news.newshosting.com...

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:ok*****************@newsread3.news.pas.earthlink.net...
I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.
Multibyte is using more than one char to encode a character.


Right.
wchar_t is fixed size wide characters.
Right.
But I knew what you
meant.
I meant what I said. (Actually I suppose L&K meant it,
I'm only repeating it).

What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding internally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.

Ref: Langer & Kreft 2.3, p 113
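
The conversion facet described above can also be driven directly, without a stream, which is one standard-library route between char and wchar_t. A minimal sketch, assuming the locale passed in has a std::codecvt<wchar_t, char, std::mbstate_t> facet that understands the input encoding; the name widen_with_facet is made up for illustration.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;

// Widen `bytes` using the code conversion facet of `loc`.
// Returns false unless the facet converts the whole input successfully.
bool widen_with_facet(const std::string& bytes, const std::locale& loc,
                      std::wstring& out)
{
    const Converter& cvt = std::use_facet<Converter>(loc);
    std::mbstate_t state = std::mbstate_t();   // initial conversion state

    // One wide character per input byte is always enough room.
    std::vector<wchar_t> buffer(bytes.size() + 1);

    const char* from = bytes.data();
    const char* from_end = from + bytes.size();
    const char* from_next = from;
    wchar_t* to = buffer.data();
    wchar_t* to_next = to;

    const std::codecvt_base::result r =
        cvt.in(state, from, from_end, from_next,
               to, to + buffer.size(), to_next);

    // Treat error, partial and noconv alike: this sketch only accepts "ok".
    if (r != std::codecvt_base::ok)
        return false;

    out.assign(to, to_next);
    return true;
}
```

The opposite direction uses the facet's out() member with the same argument pattern. Which encodings a given locale's facet handles is up to the implementation, so results outside plain ASCII will vary between VC and g++.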

If you feel I'm misunderstanding, please do clarify.

Yes, it's a major defect in the internationalization support.
Yes, I agree. Didn't folks work hard to create a
standard character set which could accommodate virtually
all written languages?
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from the rest of the
standard community who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.


I haven't had to deal with international issues yet, but I
know that it's only a matter of time, and I'd sure like
some Unicode support so I can practice ahead of time.

Any time I spend more than a few minutes with my nose
inside the L&K book, I come away with my head swimming. :-)

-Mike
Jul 19 '05 #6

"Aaron Isotton" <aa***@isotton.com> wrote in message news:pa****************************@isotton.com...
Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".


Try mbstowcs/wcstombs.
--

Unfortunately that is not adequate for the Windows environment.
In actuality, it is impossible to properly use UNICODE filenames with
the standard C++ library on Windows.

I have not been able to make any inroads with the standardization people
about doing something about this.
Jul 19 '05 #7

"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...

Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
type. UTF-16 usually breaks a whole bunch of assumptions on what a
wchar_t type is supposed to be.
Immaterial to the problem. The standard library is broken even if your
wchar_t is 32 bits.
On platforms that use utf-16, the complexity of processing ucs-4 or
utf-16 characters is equivalent so it makes sense to only support utf-8.
I do not agree. And Windows doesn't provide an implicit char to wchar_t
translation in the system interfaces (utf-8 or otherwise). It's immaterial
to the fact that wchar_t might become a multi-wide-byte encoding. The
standard library does not provide the hooks necessary to fully support
wchar_t such as you might have.
If you know your code is ONLY dealing with utf-8 characters, you can
make processing utf-8 characters very efficient by inlining some of the
code that deals with utf-8.


The WIN32 interfaces do not support utf-8. You have to feed them the
16 bit values if you want to use other than the base codetable. We've
had to write our own bloody fstreams that does a UTF-8 to wchar_t
conversion (essentially reimplementing fstream to work properly)
but that ought not to be necessary. It's a defect in the language.
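
To give an idea of what producing those 16 bit values involves, here is a small sketch that turns one Unicode code point into UTF-16 code units. It is not the fstream replacement described above, only the encoding step. The name append_utf16 is hypothetical, and char16_t (C++11) is used instead of wchar_t so the code does not depend on how wide wchar_t is on a given compiler.

```cpp
#include <string>

// Append code point `cp` to `out` as one or two UTF-16 code units.
// Returns false, leaving `out` unchanged, if `cp` is not a valid code point.
bool append_utf16(char32_t cp, std::u16string& out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                     // surrogates are not characters
    if (cp > 0x10FFFF)
        return false;                     // beyond the Unicode range

    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));   // fits in a single unit
        return true;
    }

    // Split the 20 bits above U+FFFF into a high and a low surrogate.
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    return true;
}
```

Where wchar_t is 16 bits wide, as it is on Windows, the resulting units can be copied element by element into a std::wstring before calling the wide system interfaces.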
Jul 19 '05 #8

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:oX*****************@newsread3.news.pas.earthlink.net...
What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding internally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.
I'm not understanding what you are saying. There's no reason
why a multibyte (in char) encoding of a wchar_t loses any information.
UTF-8 will encode 32 bit UNICODE in some number between 1 and
6 char's.
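
A short sketch of that encoding, to show that nothing is lost on the way from a code point to a run of chars. One caveat: the 1 to 6 range comes from the original UTF-8 definition (RFC 2279), which covered 31-bit values; the later RFC 3629 limits UTF-8 to U+10FFFF and therefore to 4 bytes, and that is the range assumed here. The name encode_utf8 is made up for illustration and char32_t assumes C++11.

```cpp
#include <string>

// Encode one code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                                       // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every payload bit of the code point ends up in exactly one output byte, which is why the encoding can be reversed without loss.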

Ref: Langer & Kreft 2.3, p 113


I don't have the book.

Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage" really should be
distinct types and not overloaded on char. This is the
price we pay for working in an American-centric industry
I guess.
Jul 19 '05 #9
"Ron Natalie" <ro*@sensor.com> wrote in message
news:3f*********************@news.newshosting.com...

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:oX*****************@newsread3.news.pas.earthlink.net...
What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding internally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.
I'm not understanding what you are saying.


I'm not sure I'm conveying the info correctly.
I've included a quote from L&K below.
There's no reason
why a multibyte (in char) encoding of a wchar_t loses any information.
UTF-8 will encode 32 bit UNICODE in some number between 1 and
6 char's.

Ref: Langer & Kreft 2.3, p 113
I don't have the book.


Angelika Langer & Klaus Kreft,
"Standard C++ IOStreams and Locales,"
Chapter 2, "The Architecture of IOStreams"
Section 2.3, "Character Types and Character Traits",
page 113:

<quote>

MULTIBYTE FILES

CHARACTER TYPE. Multibyte files contain characters in a
multibyte encoding. Different from one-byte or wide-character
encodings, multibyte characters do not have the same size.
A single multibyte character can have a length of 1, 2, 3, or
more bytes. Obviously, none of the built-in character types,
char or wchar_t, is large enough to hold any character of a
given multibyte encoding. For this reason, multibyte characters
contained in a multibyte file are chopped into units of one
byte each. The wide-character file stream extracts data from
the multibyte file byte by byte, interprets the byte sequence,
finds out which and how many bytes form a multibyte character,
identifies the character, and translates it to a wide-character <<===
encoding.

Due to the decomposition of the multibytes into one-byte
units, the type of characters exchanged between the transport
layer and a multibyte file is char.

CHARACTER ENCODING. The encoding of characters exchanged
between the transport layer and a multibyte file can be any
multibyte encoding. It depends wholly on the content of the
multibyte file. As wide-character file streams internally
represent characters as units of type wchar_t encoded in the
programming environment's wide-character encoding, a code
conversion is always necessary. The code conversion is
performed by the stream buffer's code conversion facet. There
is no default conversion defined. It all depends on the code
conversion facet contained in the stream buffer's locale object,
which initially is the current global locale.

In sum, the external character representation of wide-
character file streams is that of the units transferred to and
from a multibyte file. Its character type is char, and the
encoding depends on the stream's code conversion facet.
</quote>
The above implies to me that in order to access a multibyte
file, one needs to use a basic_(i/o)stream<wchar_t>. Am I
missing something or assuming too much?
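
For reference, reading a file this way might look like the sketch below, assuming the file's encoding is one that the chosen locale's code conversion facet understands. The helper name read_first_line is hypothetical.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Read the first line of `path` through a wide-character file stream.
// The bytes in the file are converted to wchar_t by the codecvt facet of
// `loc`. Returns an empty string if the file cannot be opened or is empty.
std::wstring read_first_line(const char* path, const std::locale& loc)
{
    std::wifstream in;
    in.imbue(loc);      // choose the conversion facet before opening the file
    in.open(path);

    std::wstring line;
    std::getline(in, line);
    return line;
}
```

Passing std::locale::classic() gives the "C" locale's conversion; passing std::locale("") asks for the user's default locale, and what that converts is platform-specific.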
Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage"
I don't think that's part of this issue. They describe
abstract 'character types', about which a stream obtains
pertinent information via 'character traits' types.
really should be
distinct types and not overloaded on char.
I don't know what you mean here. I don't see L&K
mention either "basic character type" or "smallest
addressable unit of storage," or "overloading on char."
They talk about how iostreams is templatized on a
'character type', which can be either of the built-in
types char or wchar_t, or some other invented character
type which meets the requirements imposed by iostreams
(defines EOF value, etc).
This is the
price we pay for working in an American-centric industry
I guess.


What about this do you feel is "American-centric"?

Thanks for your input.

-Mike
Jul 19 '05 #10
