imbue(locale) and file encoding

P: n/a
Hi,

since my previous post
<45***********************@newsspool1.arcor-online.net> is still
unanswered I'd like to rephrase my question. In order to read/write a
wstring in UTF-8 encoding it is *not* sufficient to imbue the stream
with a locale like "de_DE.UTF-8". Doing so only takes care of facets of
decimal numbers and the like. Rather, one has to call
locale::global(locale("de_DE.UTF-8")). Does this behaviour conform to the
standard? And if so, why? I mean, why wouldn't
wcin.imbue(locale("de_DE.UTF-8")) make wcin accept UTF-8 multibyte
characters while still allowing 5,7 to be parsed as 5.7?

file wcintest.cc:
-------------
#include <iostream>
#include <string>
#include <locale>
using namespace std;

float f;
wstring euro;

int main(){
    locale l("de_DE.UTF-8");
    wcin.imbue(l);
    locale::global(l); // (*)
    wcin>>f>>euro;
    wcout.imbue(locale("en_US.UTF-8"));
    wcout<<f<<L" "<<euro<<endl;
}
-------------

Calling

$ echo "5,70 €" |./wcintest

in a UTF-8 environment gives

5.70 €

but only if the line marked (*) is present. Otherwise you only get

5.70

It seems as if the encoding part of the locale is ignored by the imbue
calls but I don't see why this should be the case.

I use g++ (GCC) 4.1.0 under linux (i386).

Ralf
Nov 15 '06 #1
7 Replies


P: n/a
I do not have Linux here (at work) at the moment, so I am only guessing.
Did you try changing the locale of the output to the German locale?

wcout.imbue(l);

Maybe the euro sign is not accepted by the US locale.

Nov 15 '06 #2

P: n/a
ondra.holub wrote:
> I do not have Linux here (at work) at the moment, so I am only guessing.
> Did you try changing the locale of the output to the German locale?
>
> wcout.imbue(l);
>
> Maybe the euro sign is not accepted by the US locale.

The problem occurs earlier. The euro sign cannot be read from wcin
without the locale::global(l). Like I said, wcin.imbue(l) does not seem
to honour the encoding part of the locale string. Probably the encoding
can only be changed globally, whereas the facets are specific to the
stream. But that's what puzzles me, because I see no reason for this kind
of behaviour.

Ralf
Nov 15 '06 #3

P: n/a
Hi. I tried it on openSUSE 10.1 and the behaviour is exactly the same
as you described. There is no problem when using cin, cout and string,
but it does not work with the wide-character versions :-(

With wide strings it also works when you set the global locale to
locale(""), the current user's system locale. Maybe the standard library
expects Latin-1 encoding by default, which is not correct for UTF-8
systems. But I am only guessing. Anyway, I think it is no problem to
start the main function with locale::global(locale("")); and it should
work everywhere (hopefully).

Nov 15 '06 #4

P: n/a
ondra.holub wrote:
> Hi. I tried it on openSUSE 10.1 and the behaviour is exactly the same
> as you described. There is no problem when using cin, cout and string,
> but it does not work with the wide-character versions :-(

I would use cin, cout and string, but then there is the problem that
string.size() and string.substr() count bytes rather than characters, so
they do not work as expected on UTF-8 text.

> With wide strings it also works when you set the global locale to
> locale(""), the current user's system locale. Maybe the standard library
> expects Latin-1 encoding by default, which is not correct for UTF-8
> systems. But I am only guessing. Anyway, I think it is no problem to
> start the main function with locale::global(locale("")); and it should
> work everywhere (hopefully).

Yeah, it works, but I don't see the logic. Suppose you want to convert a
German UTF-8-encoded text file with floats and euro signs into a
Latin-1-encoded file with the en_US locale. Then you always have to change
the global locale before switching from reading from wcin to writing to
wcout or vice versa. If source and destination had the same encoding,
then one imbue call for each stream would be sufficient. As I have found
nothing on the net that says "imbue calls do not care about encoding", I
suspect it might be a bug in my libstdc++ implementation of the
standard. It would be nice to know how other compilers/libraries deal
with that situation.

Another problem I encountered is that tolower() does not work on wchar_t
Umlauts although I use the correct global locale.

Ralf
Nov 16 '06 #5

P: n/a
I wrote:
> Yeah, it works, but I don't see the logic. Suppose you want to convert
> a German UTF-8-encoded text file with floats and euro signs into a
> Latin-1-encoded file with the en_US locale. Then you always have to
> change the global locale before switching from reading from wcin to
> writing to wcout or vice versa.

I just found the following in Stroustrup (retranslated from German):

"Setting the global locale does not affect existing input/output
streams. The streams continue to use those locales that were assigned to
them using imbue() during their creation."

Ralf
Nov 16 '06 #6

P: n/a
I think that the tolower function is not designed for C++. Facets should
be used instead, but the code looks a bit complicated:

#include <iostream>
#include <locale>

int main()
{
    std::locale loc("german");
    char s[] = "";

    // Convert the buffer in place through the locale's ctype facet.
    std::use_facet< std::ctype<char> >(loc).tolower(s, s + sizeof(s));
    std::cout << s << std::endl;

    std::use_facet< std::ctype<char> >(loc).toupper(s, s + sizeof(s));
    std::cout << s << std::endl;

    return 0;
}

Nov 16 '06 #7

P: n/a
ondra.holub wrote:
> I think that the tolower function is not designed for C++. Facets should
> be used instead, but the code looks a bit complicated:

Okay, this works (after modification), also with wchar_t. I also found a
solution in the C++ Cookbook, templated functions

to[Upper|Lower](basic_string<C>,const locale & loc=locale())

which use use_facet. Interestingly, they also have the problem that the
encoding part of the locale is not used unless the global locale
explicitly states that we use UTF-8. I'd really like to know whether this
conforms to the standard or is a bug.

Ralf
Nov 16 '06 #8
