
Wide characters and streams

From thread
http://groups.google.com/group/comp....d767efa42df516

"P.J. Plauger" <p...@dinkumware.comwrites:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.
I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

#include "stdafx.h" // This header is empty
#include <iostream>
#include <conio.h>
#include <fstream>

int wmain(int /*argc*/, wchar_t* /*argv*/[])
{
std::wcout << L"Hello world!" << std::endl;
// Surname with AE ligature
std::wcout << L"Hello Kirit S\x00e6lensminde" << std::endl;
// Kirit transliterated (probably badly) into Greek
std::wcout << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
// Kirit transliterated into Thai
std::wcout << L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17" << std::endl;

//if ( std::wcout )
// std::cout << "\nstd::wcout still good" << std::endl;
//else
// std::cout << "\nstd::wcout gone bad" << std::endl;

_cputws( L"\n\n\n" );
_cputws( L"Hello Kirit S\x00e6lensminde\n" ); // AE ligature
_cputws( L"Hello \x039a\x03b9\x03c1\x03b9\x03c4\n" ); // Greek
_cputws( L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17\n" ); // Thai

std::wofstream wout1( "test1.txt" );
wout1 << L"12345" << std::endl;

//if ( wout1 )
// std::cout << "\nwout1 still good" << std::endl;
//else
// std::cout << "\nwout1 gone bad" << std::endl;

std::wofstream wout2( "test2.txt" );
wout2 << L"Hello world!" << std::endl;
wout2 << L"Hello Kirit S\x00e6lensminde" << std::endl;
wout2 << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
wout2 << L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17" << std::endl;

//if ( wout2 )
// std::cout << "\nwout2 still good" << std::endl;
//else
// std::cout << "\nwout2 gone bad" << std::endl;

return 0;
}
I've compiled this on MSVC Studio 2003 and it reports the following
command line switches on a debug build (i.e. Unicode defined as the
character set and wchar_t as a built-in type):

/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm
/EHsc /RTC1 /MLd /Zc:wchar_t /Zc:forScope /Yu"stdafx.h"
/Fp"Debug/wcout.pch" /Fo"Debug/" /Fd"Debug/vc70.pdb" /W3 /nologo /c
/Wp64 /ZI /TP

If I run this directly from the IDE then it clearly does some odd
narrowing of the output as the Greek cputws() line displays:

Hello ????t

Which to me looks like a failure in the character substitution from
Unicode to what I presume is some OEM encoding. Now don't get me wrong, I
think this is a poor default situation for running something on a
Unicode platform (this is on Windows 2003 Server), but it does seem to
be beside the point for this discussion.

If I run it from a command prompt with Unicode I/O turned on (cmd.exe
/u) then the output is somewhat more encouraging, but not a lot:

Hello world!
Hello Kirit Sµlensminde
Hello
Hello Kirit Sælensminde
Hello Κιριτ
Hello คีริท

The _cputws calls all work as I would expect, but std::wcout doesn't
work at all. Worse, uncommenting the stream tests shows that there is an
error on std::wcout, rendering it unusable from then on. Note also that
it has translated the AE ligature into what looks to me like a Greek
lower case mu. The Greek capital kappa has wedged the stream.
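
(For later readers: on newer Microsoft CRTs -- this is not available in
the VC7.1 runtime used here -- switching stdout into UTF-16 mode makes
std::wcout usable on the console. A minimal sketch:)

#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#include <stdio.h>   // _fileno
#include <iostream>

int wmain()
{
    // Put stdout into UTF-16 text mode so wide output is passed
    // through to the console without narrowing.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
    return 0;
}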

The two txt files are interesting. test1.txt is seven bytes long,
exactly half the size I would naively expect, and test2.txt is 45
bytes long, exactly the length I'd expect from a char stream that only
went up to, but didn't include, the Greek capital kappa.

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

What we've done is to use our own implementation of a UTF-16 to UTF-8
converter (that we know works properly as it drives our web interfaces)
and just send that sequence to a std::ofstream. We've had to more or
less give up on meaningful and pipeable console output.
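
Roughly, the shape of that workaround (a sketch only; utf8_encode stands
in for our converter, and a real one must also handle surrogate pairs
and malformed input):

#include <fstream>
#include <string>

// Encode a BMP-only UTF-16 string as UTF-8. Surrogate pairs are
// deliberately ignored in this sketch.
std::string utf8_encode(const std::wstring &utf16)
{
    std::string utf8;
    for (std::wstring::const_iterator it = utf16.begin();
         it != utf16.end(); ++it)
    {
        unsigned cp = static_cast<unsigned>(*it);
        if (cp < 0x80) {                      // one byte
            utf8 += static_cast<char>(cp);
        } else if (cp < 0x800) {              // two bytes
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // three bytes
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
}

int main()
{
    std::ofstream out("test3.txt", std::ios::binary);
    out << utf8_encode(L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17\n");
    return 0;
}
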
K

Sep 30 '06 #1
"Kirit Slensminde" <ki****************@gmail.comwrote in message
news:11**********************@h48g2000cwc.googlegr oups.com...
From thread
http://groups.google.com/group/comp....d767efa42df516

"P.J. Plauger" <p...@dinkumware.comwrites:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.
I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

[pjp] It's not exactly right. When you write to a wofstream, the wchar_t
sequence you write gets converted to a byte sequence written to the
file. How that conversion occurs depends on the codecvt facet you
choose. Choose none and you get some default. In the case of VC++ the
default is pretty stupid -- the first 256 codes get written as single
bytes and all other wide-character codes fail to write.

I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

[pjp] Again, that depends on the codecvt facet you use. With our add-on
library (available at our web site) we offer a host of codecvt facets.
One of them converts UTF-16 wide characters to UTF-8 files. Another
writes UTF-16 to UTF-16 files, with choice of endianness and an
optional BOM that tells what kind of file it is.
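
(For later readers: C++11 eventually standardised a facet that does the
first of these, std::codecvt_utf8_utf16 in <codecvt>, since deprecated
in C++17. A minimal sketch of imbuing it:)

#include <fstream>
#include <locale>
#include <codecvt> // C++11; deprecated in C++17

int main()
{
    std::wofstream out;
    // Imbue before open: some implementations latch the codecvt
    // facet when the file is opened. The locale takes ownership of
    // the facet and converts the UTF-16 wchar_t sequence to UTF-8
    // bytes on the way out.
    out.imbue(std::locale(out.getloc(),
                          new std::codecvt_utf8_utf16<wchar_t>));
    out.open("utf8.txt");
    out << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
    return 0;
}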

The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

[pjp] <Lengthy code omitted, which reaffirms the above.>

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

[pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
behavior sensible -- for your needs. I suspect that you're in the majority
these days, which is why we've made this the default for our Standard
C library. But the Standard C++ library was designed to be way more
flexible. Hence, it is in effect mandated to be hard, and it is indeed a
QOI issue what to provide. But writing your own codecvt facets is way
harder than it appears, so be careful.

What we've done is to use our own implementation of a UTF-16 to UTF-8
converter (that we know works properly as it drives our web interfaces)
and just send that sequence to a std::ofstream. We've had to more or
less give up on meaningful and pipeable console output.

[pjp] That's one way out, yes.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Oct 2 '06 #2

P.J. Plauger wrote:
"Kirit Slensminde" <ki****************@gmail.comwrote in message
news:11**********************@h48g2000cwc.googlegr oups.com...
From thread
http://groups.google.com/group/comp....d767efa42df516

"P.J. Plauger" <p...@dinkumware.comwrites:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.

I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

[pjp] It's not exactly right. When you write to a wofstream, the wchar_t
sequence you write gets converted to a byte sequence written to the
file. How that conversion occurs depends on the codecvt facet you
choose. Choose none and you get some default. In the case of VC++ the
default is pretty stupid -- the first 256 codes get written as single
bytes and all other wide-character codes fail to write.
Indeed that is pretty stupid. I don't mind stupid defaults so long as
they are described in the documentation, but the documentation of
std::wofstream or std::wcout makes no mention of this. I notice though
that std::wstringstream doesn't seem to suffer this problem.

As far as std::wcout goes though there must be something else going on
as well or the AE ligature would not have been mangled to a Greek mu.
This would seem to imply that using a codecvt that passed UTF-16
straight through would not work -- or is it the existing codecvt that is
performing the mis-transliteration?

I can't help but think that a lot of the frustration could be very
simply resolved by just properly documenting what the libraries do and
putting that documentation where people will see it.
I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

[pjp] Again, that depends on the codecvt facet you use. With our add-on
library (available at our web site) we offer a host of codecvt facets.
One of them converts UTF-16 wide characters to UTF-8 files. Another
writes UTF-16 to UTF-16 files, with choice of endianness and an
optional BOM that tells what kind of file it is.

As a practical matter I don't understand how wchar_t streams can be
seen as anything but broken (in the 'not working' sense) on this
platform if I have to write my own codecvt implementation or buy one in
so that I can write UTF-16 files.

It seems bizarre that an assertion that the streams aren't broken is
compatible with the fact that they cannot be used in what must be a
very common (if not the most common) use case. An inability to write
UTF-16 to the console sure seems broken to me and an implementation
that writes UTF-16 streams as you describe surely can't be described as
'working' for any practical purpose.
The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

[pjp] <Lengthy code omitted, which reaffirms the above.>

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

[pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
behavior sensible -- for your needs. I suspect that you're in the majority
these days, which is why we've made this the default for our Standard
C library. But the Standard C++ library was designed to be way more
flexible. Hence, it is in effect mandated to be hard, and it is indeed a
QOI issue what to provide. But writing your own codecvt facets is way
harder than it appears, so be careful.
Actually, if the default codecvt were simply a null, do-nothing UTF-16
to UTF-16 conversion, that would be fine too.
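
Something close to that null conversion can be had on common
implementations by telling the filebuf that no conversion is needed, so
it writes the wchar_t buffer out raw (a sketch only: native-endian
UTF-16, no BOM, and it leans on implementation behaviour for wide
filebufs):

#include <fstream>
#include <locale>

// Report "no conversion needed"; common filebuf implementations
// then copy the wchar_t buffer to the file byte for byte.
class raw_utf16 : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    virtual bool do_always_noconv() const throw() { return true; }
};

int main()
{
    std::wofstream out;
    out.imbue(std::locale(out.getloc(), new raw_utf16));
    out.open("test_utf16.txt", std::ios::binary);
    out << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4";
    return 0;
}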

We did notice that writing a codecvt implementation is no trivial task.
We tried to write a UTF-16 to UTF-8 codecvt, but haven't managed to get
it to work.

Looking at the comments in our source it seems that there was some
confusion about what do_length should return. I think the standard says
it should return the number of bytes, but the documentation we were
using at the time seemed to imply that it should return the number of
wchar_t. The documentation we're now using looks to have been changed,
but I'm not sure I can work out from the wording what it is saying
should be returned.

This is something that we may revisit.
On your web site, is "compleat" some joke that I'm not getting?

And thanks for taking the time to answer. It's certainly cleared up a
lot about what is going on.
K

Oct 3 '06 #3
"Kirit Slensminde" <ki****************@gmail.comwrote in message
news:11**********************@i3g2000cwc.googlegro ups.com...

P.J. Plauger wrote:
"Kirit Slensminde" <ki****************@gmail.comwrote in message
news:11**********************@h48g2000cwc.googlegr oups.com...
From thread
http://groups.google.com/group/comp....d767efa42df516

"P.J. Plauger" <p...@dinkumware.comwrites:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.

I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

[pjp] It's not exactly right. When you write to a wofstream, the wchar_t
sequence you write gets converted to a byte sequence written to the
file. How that conversion occurs depends on the codecvt facet you
choose. Choose none and you get some default. In the case of VC++ the
default is pretty stupid -- the first 256 codes get written as single
bytes and all other wide-character codes fail to write.
Indeed that is pretty stupid. I don't mind stupid defaults so long as
they are described in the documentation, but the documentation of
std::wofstream or std::wcout makes no mention of this. I notice though
that std::wstringstream doesn't seem to suffer this problem.

As far as std::wcout goes though there must be something else going on
as well or the AE ligature would not have been mangled to a Greek mu.
This would seem to imply that using a codecvt that passed UTF-16
straight through would not work -- or is it the existing codecvt that is
performing the mis-transliteration?

[pjp] The whole problem is the stupid default conversion. Our C++
library has always used the fgetwc/fputwc machinery from the C
library for the default wchar_t codecvt facet. Thus, we more or less
inherit whatever decision a compiler vendor has chosen for C.
(Unless, of course, that vendor has also licensed our C library,
in which case you get UTF-16/UTF-8 by default.)

But remember that what you see is also determined by the display
software, which is outside the purview of C and C++. Sometimes
that's not what you expect, so extended character sets get curdled
in surprising ways on their way to your eyeballs.
---

I can't help but think that a lot of the frustration could be very
simply resolved by just properly documenting what the libraries do and
putting that documentation where people will see it.

[pjp] I agree that these decisions could be better highlighted.
---
I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

[pjp] Again, that depends on the codecvt facet you use. With our add-on
library (available at our web site) we offer a host of codecvt facets.
One of them converts UTF-16 wide characters to UTF-8 files. Another
writes UTF-16 to UTF-16 files, with choice of endianness and an
optional BOM that tells what kind of file it is.
As a practical matter I don't understand how wchar_t streams can be
seen as anything but broken (in the 'not working' sense) on this
platform if I have to write my own codecvt implementation or buy one in
so that I can write UTF-16 files.

[pjp] If they don't do what you want, then they are broken to you.
---

It seems bizarre that an assertion that the streams aren't broken is
compatible with the fact that they cannot be used in what must be a
very common (if not the most common) use case. An inability to write
UTF-16 to the console sure seems broken to me and an implementation
that writes UTF-16 streams as you describe surely can't be described as
'working' for any practical purpose.

[pjp] The common use case of today is not the one that was common
a decade or more ago, when some of these decisions were made. The
default conversion is doubtless overdue for revision.
---
The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

[pjp] <Lengthy code omitted, which reaffirms the above.>

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

[pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
behavior sensible -- for your needs. I suspect that you're in the majority
these days, which is why we've made this the default for our Standard
C library. But the Standard C++ library was designed to be way more
flexible. Hence, it is in effect mandated to be hard, and it is indeed a
QOI issue what to provide. But writing your own codecvt facets is way
harder than it appears, so be careful.
Actually, if the default codecvt were simply a null, do-nothing UTF-16
to UTF-16 conversion, that would be fine too.

[pjp] For some people.
---

We did notice that writing a codecvt implementation is no trivial task.
We tried to write a UTF-16 to UTF-8 codecvt, but haven't managed to get
it to work.

[pjp] It's the hardest codecvt facet of all to write. In fact, it's
officially impossible, since codecvt was "designed" to do 1-N code
conversions, and UTF-16/UTF-8 is M-N. No Standard C++ library except
ours will even give you a fighting chance, and it's a fiendishly
difficult coding problem even then.
---

Looking at the comments in our source it seems that there was some
confusion about what do_length should return. I think the standard says
it should return the number of bytes, but the documentation we were
using at the time seemed to imply that it should return the number of
wchar_t. The documentation we're now using looks to have been changed,
but I'm not sure I can work out from the wording what it is saying
should be returned.

[pjp] The description of codecvt in the C++ Standard is murky, to
put it politely.
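
(For reference, the member under discussion, as declared for
std::codecvt<wchar_t, char, std::mbstate_t>, with one reading of the
standard's intent:)

//   int do_length(std::mbstate_t& state,
//                 const char* from, const char* from_end,
//                 std::size_t max) const;
//
// It returns how many *external* chars -- bytes -- in
// [from, from_end) do_in() would consume while producing at most
// `max` internal (wchar_t) characters. A byte count, then, not a
// wchar_t count.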
---

This is something that we may revisit.

On your web site, is "compleat" some joke that I'm not getting?

[pjp] "Compleat" is an older spelling of "complete". See, for
example, the noted 17th century book, "The Compleat Angler or
the Contemplative man's Recreation."
---

And thanks for taking the time to answer. It's certainly cleared up a
lot about what is going on.

[pjp] Welcome.
---

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Oct 3 '06 #4
