
Converting between Unicode and default locale

Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".

Thanks,
Keith MacDonald
[snip, before replying directly]
Jul 19 '05 #1

"Keith MacDonald" <ke***@text-snip-pad.com> wrote in message
news:bl*******************@news.demon.co.uk...
Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".


I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.

Also note that depending upon your platform's byte size,
not all Unicode values will necessarily fit into type
'char'.

-Mike
Jul 19 '05 #2

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:ok*****************@newsread3.news.pas.earthl ink.net...
I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.


Multibyte means using more than one char to encode a character;
wchar_t is fixed-size wide characters. But I knew what you
meant.

Yes, it's a major defect in the internationalization support.
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from the rest of the
standards community, who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.
Jul 19 '05 #3
On Fri, 26 Sep 2003 21:21:38 +0100, Keith MacDonald wrote:
Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".


Try mbstowcs/wcstombs.
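
A rough sketch of that approach (untested; it relies on the program
installing the user's default locale first, and the helper names are
just illustrative):

#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Sketch only: locale-dependent conversion via the C library. Call
// std::setlocale(LC_ALL, "") once at startup so the environment's
// default encoding is actually in effect.

std::wstring to_wide(const std::string& mb)
{
    std::vector<wchar_t> buf(mb.size() + 1);   // >= one wchar_t per byte
    std::size_t n = std::mbstowcs(&buf[0], mb.c_str(), buf.size());
    if (n == (std::size_t)-1)
        return std::wstring();                 // invalid multibyte sequence
    return std::wstring(&buf[0], n);
}

std::string to_narrow(const std::wstring& ws)
{
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], ws.c_str(), buf.size());
    if (n == (std::size_t)-1)
        return std::string();                  // not representable
    return std::string(&buf[0], n);
}

Which multibyte encoding these use depends entirely on the locale the
program has set, which is exactly the portability catch.
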
--
Aaron Isotton
http://www.isotton.com/

Jul 19 '05 #4
Ron Natalie wrote:
"Mike Wahler" <mk******@mkwahler.net> wrote in message news:ok*****************@newsread3.news.pas.earthl ink.net...

I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.

Multibyte means using more than one char to encode a character;
wchar_t is fixed-size wide characters. But I knew what you
meant.

Yes, it's a major defect in the internationalization support.
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from rest of the
standard community who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.


Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
type. UTF-16 usually breaks a whole bunch of assumptions on what a
wchar_t type is supposed to be.

On platforms that use utf-16, the complexity of processing ucs-4 or
utf-16 characters is equivalent, so it makes sense to only support utf-8.

If you know your code is ONLY dealing with utf-8 characters, you can
make processing utf-8 characters very efficient by inlining some of the
code that deals with utf-8.
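
For instance, the sequence-length test is a natural inline candidate -
a sketch, assuming well-formed input:

// Sketch: sequence length from the lead byte, assuming well-formed
// input; real code must still validate the continuation bytes.
inline int u8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;   // ASCII fast path
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return -1;                             // invalid lead byte
}
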
Jul 19 '05 #5
"Ron Natalie" <ro*@sensor.com> wrote in message
news:3f*********************@news.newshosting.com...

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:ok*****************@newsread3.news.pas.earthl ink.net...
I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.
Multibyte means using more than one char to encode a character.


Right.
wchar_t is fixed size wide characters.
Right.
But I knew what you
meant.
I meant what I said. (Actually I suppose L&K meant it,
I'm only repeating it).

What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding internally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.

Ref: Langer & Kreft 2.3, p 113

If you feel I'm misunderstanding, please do clarify.

Yes, it's a major defect in the internationalization support.
Yes, I agree. Didn't folks work hard to create a
standard character set which could accommodate virtually
all written languages?
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from rest of the
standard community who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.


I haven't had to deal with international issues yet, but I
know that it's only a matter of time, and I'd sure like
some Unicode support so I can practice ahead of time.

Any time I spend more than a few minutes with my nose
inside the L&K book, I come away with my head swimming. :-)

-Mike
Jul 19 '05 #6

"Aaron Isotton" <aa***@isotton.com> wrote in message news:pa****************************@isotton.com...
Is there a portable (at least for VC.Net and g++) method to convert text
between
wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".


Try mbstowcs/wcstombs.
--

Unfortunately that is not adequate for the windows environment.
In actuality, it is impossible to properly use UNICODE filenames with
the standard C++ library on windows.

I have not been able to make any inroads with the standardization people
about doing something about this.
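
(The usual escape hatch is the non-standard CRT call _wfopen and then
working with the FILE* directly - a Windows-only sketch, not a fix:)

#include <cstddef>
#include <cstdio>

// Windows/MSVC-only sketch: _wfopen is a non-standard CRT extension
// that takes a wide filename; standard fstream has no such hook.
bool copy_to_stdout(const wchar_t* wide_name)
{
    FILE* f = _wfopen(wide_name, L"rb");
    if (!f)
        return false;
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        std::fwrite(buf, 1, n, stdout);
    std::fclose(f);
    return true;
}
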
Jul 19 '05 #7

"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...

Except that some vendors use utf-16 and some use ucs-4 as their what_t
type. UTF-16 usually breaks a whole bunch of assumptions on what a
whar_t type is supposed to be.
Immaterial to the problem. The standard library is broken even if your
wchar_t is 32 bits.
On platforms that use utf-16, the complexity of processing ucs-4 or
utf-16 characters is equivalent so it makes sense to only support utf-8.
I do not agree. And windows doesn't provide an implicit char to wchar_t
translation in the system interfaces (utf-8 or otherwise). It's immaterial
to the fact that wchar_t might become a multi-wide-byte encoding. The
standard library does not provide the hooks necessary to fully support
wchar_t such as you might have.
If you know your code is ONLY dealing with utf-8 characters, you can
make processing utf-8 characters very efficient by inlining some of the
code thats deals with utf-8.


The WIN32 interfaces do not support utf-8. You have to feed them the
16 bit values if you want to use other than the base codetable. We've
had to write our own bloody fstreams that does a UTF-8 to wchar_t
conversion (essentially reimplementing fstream to work properly)
but that ought not to be necessary. It's a defect in the language.
Jul 19 '05 #8

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:oX*****************@newsread3.news.pas.earthl ink.net...
What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding internally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.
I'm not understanding what you are saying. There's no reason
why a multibyte (in char) encoding of a wchar_t loses any information.
UTF-8 will encode 32 bit UNICODE in some number between 1 and
6 chars.
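
The encoder side is mechanical - a sketch covering up to the 4-byte
form (the original RFC 2279 scheme continues with 5- and 6-byte forms
for 31-bit values):

// Sketch: encode one 32-bit value as utf-8, 1- to 4-byte forms only.
int utf8_encode(unsigned long cp, char* out)
{
    if (cp < 0x80) {
        out[0] = char(cp);
        return 1;
    }
    if (cp < 0x800) {
        out[0] = char(0xC0 | (cp >> 6));
        out[1] = char(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = char(0xE0 | (cp >> 12));
        out[1] = char(0x80 | ((cp >> 6) & 0x3F));
        out[2] = char(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x200000) {
        out[0] = char(0xF0 | (cp >> 18));
        out[1] = char(0x80 | ((cp >> 12) & 0x3F));
        out[2] = char(0x80 | ((cp >> 6) & 0x3F));
        out[3] = char(0x80 | (cp & 0x3F));
        return 4;
    }
    return -1;  // 5- and 6-byte forms omitted for brevity
}
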

Ref: Langer & Kreft 2.3, p 113


I don't have the book.

Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage" really should be
distinct types and not overloaded on char. This is the
price we pay for working in an American-centric industry
I guess.
Jul 19 '05 #9
"Ron Natalie" <ro*@sensor.com> wrote in message
news:3f*********************@news.newshosting.com...

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:oX*****************@newsread3.news.pas.earthl ink.net...
What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding interally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.
I'm not understanding what you are saying.


I'm not sure I'm conveying the info correctly.
I've included a quote from L&K below.
There's no reason
why a multibyte (in char) encoding of a wchar_t loses any information.
UTF-8 will encode 32 bit UNICODE in some number between 1 and
6 char's.

Ref: Langer & Kreft 2.3, p 113
I don't have the book.


Angelika Langer & Klaus Kreft,
"Standard C++ IOStreams and Locales,"
Chapter 2, "The Architecture of IOStreams"
Section 2.3, "Character Types and Character Traits",
page 113:

<quote>

MULTIBYTE FILES

CHARACTER TYPE. Multibyte files contain characters in a
multibyte encoding. Different from one-byte or wide-character
encodings, multibyte characters do not have the same size.
A single multibyte character can have a length of 1, 2, 3, or
more bytes. Obviously, none of the built-in character types,
char or wchar_t, is large enough to hold any character of a
given multibyte encoding. For this reason, multibyte characters
contained in a multibyte file are chopped into units of one
byte each. The wide-character file stream extracts data from
the multibyte file byte by byte, interprets the byte sequence,
finds out which and how many bytes form a multibyte character,
identifies the character, and translates it to a wide-character <<===
encoding.

Due to the decomposition of the multibytes into one-byte
units, the type of characters exchanged between the transport
layer and a multibyte file is char.

CHARACTER ENCODING. The encoding of characters exchanged
between the transport layer and a multibyte file can be any
multibyte encoding. It depends wholly on the content of the
multibyte file. As wide-character file streams internally
represent characters as units of type wchar_t encoded in the
programming environment's wide-character encoding, a code
conversion is always necessary. The code conversion is
performed by the stream buffer's code conversion facet. There
is no default conversion defined. It all depends on the code
conversion facet contained in the stream buffer's locale object,
which initially is the current global locale.

In sum, the external character representation of wide-
character file streams is that of the units transferred to and
from a multibyte file. Its character type is char, and the
encoding depends on the stream's code conversion facet.
</quote>
The above implies to me that in order to access a multibyte
file, one needs to use a basic(i/o)stream<wchar_t>. Am I
missing something or assuming too much?
Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage"
I don't think that's part of this issue. They describe
abstract 'character types', about which a stream obtains
pertinent information via 'character traits' types.
really should be
distinct types and not overloaded on char.
I don't know what you mean here. I don't see L&K
mention either "basic character type" or "smallest
addressible unit of storage," or "overloading on char."
They talk about how iostreams is templatized on a
'character type', which can be either of the built-in
types char or wchar_t, or some other invented character
type which meets the requirements imposed by iostreams
(defines EOF value, etc).
This is the
price we pay for working in an American-centric industry
I guess.


What about this do you feel is "American-centric"?

Thanks for your input.

-Mike
Jul 19 '05 #10
Ron Natalie wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...
.... The WIN32 interfaces do not support utf-8. You have to feed them the
16 bit values if you want to use other than the base codetable. We've
had to write our own bloody fstreams that does a UTF-8 to wchar_t
conversion (essentially reimplementing fstream to work properly)
but that ought not to be necessary. It's a defect in the language.

Did you consider just implementing a utf-8 specific string library as an
alternative ?


Jul 19 '05 #11

"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:bl********@dispatch.concentric.net...
Ron Natalie wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...

...
The WIN32 interfaces do not support utf-8. You have to feed them the
16 bit values if you want to use other than the base codetable. We've
had to write our own bloody fstreams that does a UTF-8 to wchar_t
conversion (essentially reimplementing fstream to work properly)
but that ought not to be necessary. It's a defect in the language.

Did you consider just implementing a utf-8 specific string library as an
alternative ?


How would that enable streaming of the characters?

-Mike
Jul 19 '05 #12
Mike Wahler wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message ....
How would that enable streaming of the characters?


Explain how it does not.

Jul 19 '05 #13

"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:bl********@dispatch.concentric.net...
Mike Wahler wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message

...

How would that enable streaming of the characters?


Explain how it does not.


I asked you first. :-)

Internationalization and character sets are not
something I claim expertise in. I'm in this
thread to try to learn a thing or two myself.
Toward that end, I offered a quote from L&K
with my interpretation, and asked that any
misconceptions be pointed out.

-Mike
Jul 19 '05 #14
Mike Wahler wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:bl********@dispatch.concentric.net...
Mike Wahler wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message


...
How would that enable streaming of the characters?


Explain how it does not.

I asked you first. :-)

Internationalization and character sets are not
something I claim expertise in. I'm in this
thread to try to learn a thing or two myself.
Toward that end, I offered a quote from L&K
with my interpretation, and asked that any
misconceptions be pointed out.


Well, I don't really know what you're trying to do.

However, I can say that Unicode is a very complex beast.

You have issues like:

Composed characters

Bidirectional strings

Multiple representations of the same characters

Language tags

Ligatures

Private use characters

++ more

Let's compare:

issue           | utf-8 | utf-16  | utf-32
------------------------------------------
endian          | no    | yes     | yes
ascii-is-ascii  | yes   | no      | no
is-multi-*unit* | yes   | yes     | mostly no
is compact      | yes   | kind of | no
is stateful     | no    | yes     | yes

The problem is that a true internationalization library is far more
complex than what is provided by the C++ standard and on top of that,
you don't really need to deal with all these issues all the time.

For most uninternationalized applications, not even touching the legacy
single byte code and pushing multibyte data through it will render
exactly the right results.
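
A concrete example: byte-oriented code that only searches for ASCII
delimiters stays correct on utf-8 input, because utf-8 continuation
bytes never collide with ASCII values. A sketch:

#include <string>

// Sketch: splitting a utf-8 path on '/' works byte-wise, because no
// byte of a multibyte utf-8 sequence can equal an ASCII code point.
std::string basename_of(const std::string& utf8_path)
{
    std::string::size_type pos = utf8_path.rfind('/');
    return pos == std::string::npos ? utf8_path : utf8_path.substr(pos + 1);
}
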

So, the real question is: what kind of issues are you dealing with that
require internationalization ?


Jul 19 '05 #15

"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:bl********@dispatch.concentric.net...
Mike Wahler wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message
news:bl********@dispatch.concentric.net...
Mike Wahler wrote:

"Gianni Mariani" <gi*******@mariani.ws> wrote in message

...

How would that enable streaming of the characters?

Explain how it does not.

I asked you first. :-)

Internationalization and character sets are not
something I claim expertise in. I'm in this
thread to try to learn a thing or two myself.
Toward that end, I offered a quote from L&K
with my interpretation, and asked that any
misconceptions be pointed out.


Well, I don't really know what you're trying to do.


I'm trying to understand. I'm not the OP, I'm not
trying to solve any particular problem concerning
this issue, but anticipating that eventually I will
need to.
However, I can say that Unicode is a very complex beast.
I believe you. :-)
You have issues like:

Composed characters

Bidirectional strings

Multiple representations of the same characters

Language tags

Ligatures

Private use characters

++ more

Let's compare:

issue           | utf-8 | utf-16  | utf-32
------------------------------------------
endian          | no    | yes     | yes
ascii-is-ascii  | yes   | no      | no
is-multi-*unit* | yes   | yes     | mostly no
is compact      | yes   | kind of | no
is stateful     | no    | yes     | yes

The problem is that a true internationalization library is far more
complex than what is provided by the C++ standard
This is essentially the assertion (although it was specifically
about Unicode/wchar_t) that I was passing along from my reading of
L&K.
and on top of that,
you don't really need to deal with all these issues all the time.
Well, no, I wouldn't think so.
For most uninternationalized applications, not even touching the legacy
single byte code and pushing multibyte data through it will render
exactly the right results.
If I understand what you're saying, I agree, of course not.
A conversion is needed.
So, the real question is: what kind of issues are you dealing with that
require internationalization ?

None at the moment. But I want to prepare myself for
when I do need to deal with internationalized software.
You know, planning ahead.

Hopefully, Ron will read the L&K quote I posted
and perhaps further enlighten us.

I sure wish Angelika and/or Klaus would participate here,
but I'm sure they're rather busy people. :-)

-Mike
Jul 19 '05 #16
Mike Wahler wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message ....
So, the real question is: what kind of issues are you dealing with that
require internationalization ?

None at the moment. But I want to prepare myself for
when I do need to deal with internationalized software.
You know, planning ahead.


So if you're interested in some general internationalization (i18n from
now on) knowledge, this is a summary for you. I have only looked
briefly at the C++ i18n support a long time ago (pre standard) and it
was woefully short - I have not looked at it as it stands now. But then
again, it is tailored to issues specific to command line applications.
Products that require i18n support are usually far more complex.

Here is a list of i18n issues/no particular order:

a) Character sets.
- ascii compatible codepages (EUC*, utf-8 etc)
- ascii INcompatible codepages (SJIS, BIG5, ISO2022)
- conversion between codepages
- incompatible codepages
- code-set detection
- normalized forms (most composed form, most decomposed form)

b) User messages (message catalogues)
- language format issues
- pluralization of a message
e.g. "1 packet received" vs "2 packets received";

c) Locale specific processing
- time zones
- time display format
- dates of significance (New Year, Chinese New Year)
- numerical display format
- string collation
- spell checking
- tax rules
- accounting rules
- legal issues (privacy - limitations on DB)
- encryption limitations
- language and location
- Spanish as spoken in the USA
- French as spoken in Canada
- Multiple versions of "Spanish" in Spain
- Basque, Catalan or Galician
- telephone formats
- address formats
- building codes

d) Multi-lingual issues
- collation ?
- spell checking

e) Bidirectional issues

f) Composed characters (Thai, Vietnamese)

g) Keyboards
- keyboard layouts
- input methods (phonetic to symbolic transformation)
- multi-lingual input

h) Displaying characters
- fonts
- mixing fonts to display text
- bidirectional text
- vertical text
- select a region of text
- display a region of selected text
- select a word

i) National borders
- recognition of state (e.g. PRC,ROC)

j) Graphics
- icons
- culturally sensitive issues
- images with embedded text
- images with faces showing - Women with hair showing.
- culturally unacceptable images
- Product names that have culturally offensive meaning

k) Colors
- colors that are "offensive"
- colors that show alarm - RED indicates failure

+++ lots more.

Most applications never deal with some of the issues above.

It is interesting to note that many of these issues mesh with other
features of the application. For example, product description and product
location and cell phones - " press the 'Send' key ". It is critical to
NEVER EVER write code that goes like

if ( product.location == SPAIN )
{
    if ( product.language == BASQUE )
    {
        // Basque menu unavailable; fall back to English
        Menu = menu.features("SPAIN").language("EN");
        ...

Optimal/best-in-class practice for I18N is something like:

CONTEXT.location = SPAIN;
CONTEXT.language = BASQUE;
CONTEXT.product = MODEL755_CELL_PHONE;
CONTEXT.display = NTSC_SIZE;
... // whatever attributes change the product behaviour

...

Menu = FetchFromDB( MENU, CONTEXT );

...

Response = FetchFromDB( RESPONSE, CONTEXT );

// Menu suitable for this context is now in Menu.

The routine FetchFromDB performs a search in a database/dictionary of
menus for this product in the current context. The rules for choosing
the correct item are *also* stored in the database. The object returned
may be more complex, for example, it may be a set of TAX rules for a
financial application.

In general, anyone doing anything regarding serious i18n support does
not use the ones provided by the OS because it is usually inadequate and
the standards process is basically too slow to change. Not only that,
the "standard" model used to "i18nize" does not lend itself well to very
complex applications which is more and more the case today.

A case in point: Internet Explorer does not use the OS-provided language
support; the developers needed something far richer and developed MLang:
http://msdn.microsoft.com/workshop/misc/mlang/mlang.asp

I have found that it is far easier to i18nize an application that has no
support for i18n than to unwind some broken i18n support.

I can rant forever on this topic. As you can see, it's not something you
can learn from a few posts on comp.lang.c++.

Jul 19 '05 #17
Well, my question has certainly generated a lot of responses, but not the
kind I was hoping for. Clearly, I was being completely naive to expect the
standard library to include this facility, but I am completely disheartened
not to have found a single working example of how to do code conversion in
streams using 3rd party libraries, such as iconv. Presumably this is
because nobody does it that way.

[RANT] It seems crazy that after a decade of Unicode use, C++ still requires
everyone to reinvent the wheel and do it their own way. I think that the
standards committee is being too precious about this. I know that Unicode
is a moving target, but UCS-2 would suffice for 95% of my requirements - and
100% for those who don't know the difference between it and UTF-16. After
all, the C++ char type doesn't even support the full British English
character set (never mind those of the rest of Europe), without using
non-standard compiler options to make char unsigned. Please, anything is
better than nothing! [/RANT]

- Keith MacDonald

"Mike Wahler" <mk******@mkwahler.net> wrote in message
news:Nz*****************@newsread4.news.pas.earthl ink.net...
"Ron Natalie" <ro*@sensor.com> wrote in message
news:3f*********************@news.newshosting.com. ..

"Mike Wahler" <mk******@mkwahler.net> wrote in message

news:oX*****************@newsread3.news.pas.earthl ink.net...
What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding interally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.


I'm not understanding what you are saying.


I'm not sure I'm conveying the info correctly.
I've include a quote from L&K below.
There's no reason
why a multibyte (in char) encoding of a wchar_t loses any information.
UTF-8 will encode 32 bit UNICODE in some number between 1 and
6 char's.

Ref: Langer & Kreft 2.3, p 113


I don't have the book.


Angelika Langer & Klaus Kreft,
"Standard C++ IOStreams and Locales,"
Chapter 2, "The Architecture of IOStreams"
Section 2.3, "Character Types and Character Traits",
page 113:

<quote>

MULTIBYTE FILES

CHARACTER TYPE. Multibye files contain characters in a
multibyte encoding. Different from one-byte or wide-character
encodings, multibyte characters do not have the same size.
A single multibyte character can have a length of 1, 2, 3, or
more bytes. Obviously, none of the built-in character types,
char or wchar_t, is large enough to hold any character of a
given multibyte encoding. For this reason, multibyte characters
contained in a multibyte file are chopped into units of one
byte each. The wide-character file stream extracts data from
the multibyte file byte by byte, interprets the byte sequence,
finds out which and how many bytes form a multibyte character,
identifies the character, and translates it to a wide-character <<===
encoding.

Due to the decomposition of the multibytes into one- byte
units, the type of characters exchanged between the transport
layer and a multibyte file is char.

CHARACTER ENCODING. The encoding of characters exchanged
between the transport layer and a multibyte file can be any
multibyte encoding. Ite depends wholly on the content of the
multibyte file. As wide-character file streams internally
represent characters as units of type wchar_t encoded in the
programming environment's wide-character encoding, a code
conversion is always necessary. The code conversion is per-
formed by the stream buffer's code conversion facet. There
is no default conversion defined. It all depends on the code
conversion facet contained in the stream buffer's locale object,
which initially is the current global locale.

In sum, the external character representation of wide-
character file streams is that of the units transferred to and
from a multibyte file. Its character type is char, and the
encoding depends on the stream's code conversion facet.
</quote>
The above implies to me that in order to access a multibyte
file, one needs to use a basic(i/o)stream<wchar_t>. Am I
missing something or assuming too much?
Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage"


I don't think that's part of this issue. They describe
abstract 'character types', about which a stream obtains
pertinent information via 'character traits' types.
really should be
distinct types and not overloaded on char.


I don't know what you mean here. I don't see L&K
mention either "basic character type" or "smallest
addressible unit of storage," or "overloading on char."
They talk about how iostreams is templatized on a
'character type', which can be either of the built-in
types char or wchar_t, or some other invented character
type which meets the requirements imposed by iostreams
(defines EOF value, etc).
This is the
price we pay for working in an American-centric industry
I guess.


What about this do you feel is "American-centric"?

Thanks for your input.

-Mike

Jul 19 '05 #18

"Mike Wahler" <mk******@mkwahler.net> wrote in message news:Nz*****************@newsread4.news.pas.earthl ink.net...
. The wide-character file stream extracts data from
the multibyte file byte by byte, interprets the byte sequence,
finds out which and how many bytes form a multibyte character,
identifies the character, and translates it to a wide-character <<===
encoding.
This describes converting a multibyte file to a wide char stream. Of
course, if your system is natively wchar_t-based NO SUCH TRANSLATION
is required. This only happens when the underlying file conventions are
multibyte.
The above implies to me that in order to access a multibyte
file, one needs to use a basic(i/o)stream<wchar_t>. Am I
missing something or assuming too much?
That would be the convenient way of doing it. You can also
open it with a non-wide stream and manage the multibyte sequences.
This is why the file positions in the stream class aren't necessarily
just character offsets, they may also encode some multibyte state
information. It is perfectly acceptable to work with the data in multibyte
state.
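
A minimal sketch of the convenient way (imbue before open, so the
locale's codecvt facet drives the byte-to-wchar_t translation; the
filename is just an example):

#include <fstream>
#include <locale>

// Sketch: let the imbued locale's codecvt facet decode the multibyte
// file into wchar_t. Imbue before open, or it may not take effect.
int main()
{
    std::wifstream in;
    in.imbue(std::locale(""));   // the environment's native locale
    in.open("input.txt");        // hypothetical filename
    wchar_t wc;
    while (in.get(wc)) {
        // process one wide character at a time
    }
    return 0;
}
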
Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage"
I don't think that's part of this issue. They describe
abstract 'character types', about which a stream obtains
pertinent information via 'character traits' types.


No, just a related personal gripe. It would solve some problems if you could
just widen char. For example, on Windows NT (and its descendants), the
innate character size is really 16. It would make more sense if ALL the interfaces:
filenames, argv[], the what() strings in exceptions, etc... were 16 bits. You'd
just make char 16 bits. However, if you did this, you'd lose the ability to
address 8 bit things because char plays double duty.
I don't know what you mean here. I don't see L&K
mention either "basic character type" or "smallest
addressible unit of storage," or "overloading on char."
That's the C++ standard.
What about this do you feel is "American-centric"?


Because C++ makes two horrendous assumptions:

That you can fit whatever native strings you want in the "char" type
and that is also the smallest memory unit you want to address.

That you can mollify those who have need for larger characters by
telling them they can just convert their large characters to a multibyte
sequence. Unfortunately this conversion might be absent (as it is in
the current NT interfaces) or non-unique (as it is in the WIN32
interfaces on 98/Me).
Jul 19 '05 #19

"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...
Ron Natalie wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...

...
The WIN32 interfaces do not support utf-8. You have to feed them the
16 bit values if you want to use other than the base codetable. We've
had to write our own bloody fstreams that does a UTF-8 to wchar_t
conversion (essentially reimplementing fstream to work properly)
but that ought not to be necessary. It's a defect in the language.

Did you consider just implementing a utf-8 specific string library as an
alternative ?


Not just STRINGS. I had to implement fstreams, the exception interfaces,
and the arguments to main, to name a few. Frankly, if I had it to do again,
these things would take some variant of basic_string rather than char* as well.
Jul 19 '05 #20
Keith MacDonald wrote:
Well, my question has certainly generated a lot of responses, but not the
kind I was hoping for. Clearly, I was being completely naive to expect the
standard library to include this facility, but I am completely disheartened
not to have found a single working example of how to do code conversion in
streams using 3rd party libraries, such as iconv. Presumably this is
because nobody does it that way.

[RANT] It seems crazy that after a decade of Unicode use, C++ still requires
everyone to reinvent the wheel and do it their own way. I think that the
standards committee is being too precious about this. I know that Unicode
is a moving target, but UCS-2 would suffice for 95% of my requirements - and
100% for those who don't know the difference between it and UTF-16. After
all, the C++ char type doesn't even support the full British English
character set (never mind those of the rest of Europe), without using
non-standard compiler options to make char unsigned. Please, anything is
better than nothing! [/RANT]


Your RANT is mostly justified.

However, there are a number of libraries that provide the support you
are asking for.

If you have the energy to propose a revision to the C++ standard then do
so, but it's a very complex problem to get right. In regards to just
UCS-2 support, you would probably not have anyone on the standards
committee agree on that.
Jul 19 '05 #21

"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...
However, there are a number of libraries that provide the support you
asking for.

The problem is that you can't even implement this without redefining/extending
the C++ standard library classes. The root issue is that wchar_t is incompletely
supported in the C++ library, so even if you were to fix up everything in your
implementation, you'd still have to add non-conforming extensions.
Jul 19 '05 #22
Ron Natalie wrote:
"Gianni Mariani" <gi*******@mariani.ws> wrote in message news:bl********@dispatch.concentric.net...

However, there are a number of libraries that provide the support you
asking for.


The problem is that you can't even implement this without redefining/extending
the C++ standard library classes. The problem is that wchar_t is incompletely
supported in the C++ library, so even if you were to fix up everything in your
implementation, you'd still have to add non-conforming extensions.


An option is not to use wchar_t at all. Stick to multibyte. Perform all
the processing in utf-8 multibyte. (You need to make sure you provide
support to convert any incoming strings to utf-8.)

Even for UTF-32 you need to deal with multi-"unit" issues because of
composing characters. I don't remember specifically what the 10646
standard says but processing text with composed characters has many of
the same restrictions as multibyte characters (keeping them together).

Processing utf-16 or utf-32, you have issues with endianness or managing
the byte-order-mark which makes it a stateful encoding. This breaks a
whole bunch of subtle assumptions about the indexability of files. No
such problem exists with utf-8.

It just makes a whole lot of sense to use utf-8 everywhere when possible.
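
For example, counting characters in a well-formed utf-8 string needs
no decoder state at all - a sketch:

#include <cstddef>
#include <string>

// Sketch: character count of a well-formed utf-8 string. Continuation
// bytes are always 10xxxxxx, so a simple mask test suffices.
std::size_t u8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}
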

Jul 19 '05 #23
