
imbue(locale) and file encoding

 
  #1  
Old November 15th, 2006, 08:05 AM
Ralf Goertz

Hi,

Since my previous post
<455440ad$0$30326$9b4e6d93@newsspool1.arcor-online.net> is still
unanswered, I'd like to rephrase my question. In order to read/write a
wstring in UTF-8 encoding it is *not* sufficient to imbue the stream
with a locale like "de_DE.UTF-8". Doing so only takes care of the facets
for decimal numbers and the like. Rather, one has to call
locale::global(locale("de_DE.UTF-8")). Is this behaviour conforming to
the standard? And if so, why? I mean, why wouldn't
wcin.imbue(locale("de_DE.UTF-8")) make wcin accept UTF-8 multibyte
characters while still allowing 5,7 to be parsed as 5.7?

file wcintest.cc:
-------------
#include <iostream>
#include <string>
#include <locale>
using namespace std;

float f;
wstring euro;

int main(){
    locale l("de_DE.UTF-8");
    wcin.imbue(l);                       // per-stream locale for input
    locale::global(l); // (*)
    wcin>>f>>euro;                       // "5,70" is parsed with a decimal comma
    wcout.imbue(locale("en_US.UTF-8"));  // per-stream locale for output
    wcout<<f<<L" "<<euro<<endl;
}
-------------

Calling

$ echo "5,70 €" |./wcintest

in a UTF-8 environment gives

5.70 €

but only if the line marked (*) is present. Otherwise you only get

5.70

It seems as if the encoding part of the locale is ignored by the imbue
calls but I don't see why this should be the case.

I use g++ (GCC) 4.1.0 under linux (i386).

Ralf
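
As a point of comparison, one way to check whether a stream-level imbue() can handle the encoding on its own is to try it on a file stream instead of wcin. The sketch below assumes a C++11 library that still provides std::codecvt_utf8 (deprecated since C++17), so it does not depend on "de_DE.UTF-8" being installed. The helper name read_first_utf8_token is made up for illustration. It does not by itself explain the wcin behaviour reported above; the standard streams are synchronized with C stdio by default, and that may be where the difference comes from.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: returns the first whitespace-delimited token of a
// UTF-8 encoded file, decoded to wide characters.
std::wstring read_first_utf8_token(const std::string& path) {
    std::wifstream in;
    // Attach a UTF-8 conversion facet to this stream only. The global
    // locale is not touched. Imbue before opening so that no bytes are
    // read with the default conversion.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path.c_str());

    std::wstring token;
    in >> token;   // stays empty if the file could not be opened
    return token;
}
```

If the file starts with the three UTF-8 bytes of the euro sign, the returned string should hold a single wide character, which suggests that for file streams the conversion facet of the imbued locale is the one that counts.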

  #2  
Old November 15th, 2006, 08:45 AM
ondra.holub

Currently I do not have linux here (at work) so I am only guessing. Did
you try to change locale of output to German locale?

wcout.imbue(l);

Maybe the euro sign is not accepted by US locale.

  #3  
Old November 15th, 2006, 09:25 AM
Ralf Goertz

ondra.holub wrote:
> Currently I do not have linux here (at work) so I am only guessing.
> Did you try to change locale of output to German locale?
>
> wcout.imbue(l);
>
> Maybe the euro sign is not accepted by US locale.

The problem occurs earlier. The euro sign cannot be read from wcin
without the locale::global(l). Like I said wcin.imbue(l) does not seem
to honour the encoding part of the locale string. Probably, the encoding
can only be changed globally whereas the facets are specific to the
stream. But that's what puzzles me because I see no reason for this kind
of behaviour.

Ralf
  #4  
Old November 15th, 2006, 08:25 PM
ondra.holub

Hi. I tried it on openSUSE 10.1 and the behaviour is exactly the same
as you described. There is no problem when using cin, cout and string,
but it does not work with the wide-character versions :-(

With wide strings it also works when you set the global locale to
locale(""), the current user's system locale. Maybe the standard library
expects Latin-1 encoding by default, which is not correct for UTF-8
systems. But I am only guessing. Anyway, I think it is not a problem to
start the main function with locale::global(locale("")); and it should
work everywhere (hopefully).
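
A minimal sketch of that start-up step, under one assumption worth stating: constructing locale("") can throw std::runtime_error when the environment names a locale that is not installed, so "works everywhere" probably needs a fallback. The helper name install_user_locale is hypothetical.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: makes the user's environment locale the global
// locale and falls back to the classic "C" locale if that fails.
// Returns the locale that ended up being installed.
std::locale install_user_locale() {
    try {
        std::locale user("");            // may throw std::runtime_error
        std::locale::global(user);
        return user;
    } catch (const std::runtime_error&) {
        std::locale::global(std::locale::classic());
        return std::locale::classic();
    }
}
```

The call would go at the very top of main(), before the first read from wcin or write to wcout.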

  #5  
Old November 16th, 2006, 06:35 AM
Ralf Goertz

ondra.holub wrote:
> Hi. I tried it on Open SUSE 10.1 and the behaviour is exactly the same
> as you described. There is no problem when using cin, cout and string,
> but it does not work with wide-character versions :-(

I would use cin, cout and string, but then there is the problem that
string.size() and string.substr() do not work as expected, since they
count bytes rather than characters.

> With wide strings it works also when you set global locale to
> locale("") - the current user's system locale. Maybe standard library
> expects latin-1 encoding as default and it is not correct for utf-8
> systems. But I am only guessing. Anyway, I think it is not a problem to
> start the main function with locale::global(locale("")); and it should
> work everywhere (hopefully).

Yeah, it works, but I don't see the logic. Suppose you want to convert a
German UTF-8 encoded text file with floats and euro signs into a Latin-1
encoded file with the en_US locale. Then you always have to change the
global locale before switching from reading from wcin to writing to
wcout or vice versa. If source and destination had the same encoding,
then one imbue call for each stream would be sufficient. As I have found
nothing on the net that says "imbue calls do not care about encoding", I
suspect it might be a bug in my libstdc++ implementation of the
standard. It would be nice to know how other compilers/libraries deal
with that situation.

Another problem I encountered is that tolower() does not work on wchar_t
umlauts although I use the correct global locale.

Ralf
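
The conversion scenario above can be sketched with file streams, where each stream carries its own locale and the global locale is never changed. Everything named here is an assumption made for illustration: comma_decimal stands in for the numeric part of a German locale, std::codecvt_utf8 (C++11, deprecated since C++17) stands in for its encoding part, and to_latin1 is a hand-rolled ISO-8859-1 encoder rather than a library facility. Note that the euro sign has no code point in ISO-8859-1, so this encoder would replace it with '?'.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Decimal comma for wide-character number parsing (illustrative stand-in
// for the numeric facet of a German locale).
struct comma_decimal : std::numpunct<wchar_t> {
    wchar_t do_decimal_point() const override { return L','; }
};

// Hand-rolled ISO-8859-1 encoder: code points above 0xFF become '?'.
std::string to_latin1(const std::wstring& text) {
    std::string bytes;
    for (wchar_t c : text) {
        unsigned long cp = static_cast<unsigned long>(c);
        bytes += cp <= 0xFF ? static_cast<char>(static_cast<unsigned char>(cp))
                            : '?';
    }
    return bytes;
}

// Reads "<number> <word>" from a UTF-8 file and writes the same record as
// Latin-1 with a decimal point. Returns false if reading or writing fails.
bool convert_record(const std::string& utf8_path,
                    const std::string& latin1_path) {
    // Input locale: UTF-8 decoding plus decimal comma, for this stream only.
    std::locale in_loc(
        std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>),
        new comma_decimal);
    std::wifstream in;
    in.imbue(in_loc);
    in.open(utf8_path.c_str());

    double value = 0.0;
    std::wstring word;
    if (!(in >> value >> word)) return false;

    // Output stream: classic locale, so numbers use a decimal point.
    std::ofstream out(latin1_path.c_str(), std::ios::binary);
    out.imbue(std::locale::classic());
    out << value << ' ' << to_latin1(word);
    out.flush();
    return static_cast<bool>(out);
}
```

Whether the same per-stream setup works for wcin and wcout is exactly the open question of this thread; the sketch only covers file streams.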
  #6  
Old November 16th, 2006, 06:55 AM
Ralf Goertz

I wrote:
> Yeah it works, but I don't see the logic. Suppose you want to convert
> a German UTF-8 encoded text file with floats and euro signs into a
> Latin-1 encoded file with en_US locale. Then you always have to change
> the global locale before switching from reading from wcin to writing
> to wcout or vice versa.

I just found the following in Stroustrup (retranslated from German)

"Setting the global locale does not affect existing input/output
streams. The streams continue to use those locales that were assigned to
them using imbue() during their creation."

Ralf
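
The quoted rule can be checked with string streams. The sketch below is an illustration, not taken from Stroustrup: comma_point is a made-up facet used only to make the two locales distinguishable, and run_global_locale_demo is a hypothetical helper that restores the previous global locale before returning.

```cpp
#include <locale>
#include <sstream>

// Illustrative facet: decimal comma for narrow-character number formatting.
struct comma_point : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Returns the decimal point character of the locale a stream carries.
char decimal_point_of(const std::ios_base& stream) {
    return std::use_facet<std::numpunct<char> >(stream.getloc())
        .decimal_point();
}

struct global_locale_demo {
    char before;  // decimal point of a stream created before the change
    char after;   // decimal point of a stream created after the change
};

global_locale_demo run_global_locale_demo() {
    std::ostringstream early;  // copies the global locale as of now

    // Replace the global locale; global() returns the previous one.
    std::locale previous = std::locale::global(
        std::locale(std::locale(), new comma_point));

    std::ostringstream late;   // copies the new global locale

    global_locale_demo result = { decimal_point_of(early),
                                  decimal_point_of(late) };
    std::locale::global(previous);  // undo the change
    return result;
}
```

Starting from the classic locale, the early stream should keep '.' while the late one reports ','. The same reasoning would apply to wcin and wcout, which exist before main() runs and therefore never see a later locale::global() call through their own locale.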
  #7  
Old November 16th, 2006, 07:35 AM
ondra.holub

I think that the tolower function is not designed for C++. Facets should
be used instead, but the code looks a bit complicated:

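#include <iostream>
#include <locale>

int main()
{
    // Locale names are platform-specific; "german" must exist on the system.
    std::locale loc("german");
    // ctype<char> works byte by byte, so this assumes the source file is
    // saved in a single-byte encoding such as Latin-1.
    char s[] = "äÖü";

    // sizeof(s) - 1 excludes the terminating '\0'.
    std::use_facet< std::ctype<char> >(loc).tolower(s, s + sizeof(s) - 1);
    std::cout << s << std::endl;

    std::use_facet< std::ctype<char> >(loc).toupper(s, s + sizeof(s) - 1);
    std::cout << s << std::endl;

    return 0;
}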

  #8  
Old November 16th, 2006, 09:55 AM
Ralf Goertz

ondra.holub wrote:
> I think that the tolower function is not designed for C++. Facets should
> be used instead, but the code looks a bit complicated:

Okay, this works (after modification), also with wchar_t. I also found a
solution in the C++ Cookbook, templated functions

to[Upper|Lower](basic_string<C>,const locale & loc=locale())

which use use_facet. Interestingly, they also have the problem that the
encoding part of the locale is not used unless the global locale
explicitly states that we use UTF-8. I'd really like to know whether
this conforms to the standard or is a bug.

Ralf
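
For readers without the book at hand, a function with the signature shape quoted above might look like the following. This is a sketch of the idea only, not the Cookbook's code, and the name to_lower is chosen here for illustration. It works per character unit, so with narrow strings holding UTF-8 it cannot convert multibyte characters such as umlauts.

```cpp
#include <locale>
#include <string>

// Returns a lower-cased copy of text, using the ctype facet of loc.
// The default argument is a copy of the current global locale.
template <typename C>
std::basic_string<C> to_lower(std::basic_string<C> text,
                              const std::locale& loc = std::locale()) {
    if (!text.empty()) {
        const std::ctype<C>& facet = std::use_facet<std::ctype<C> >(loc);
        // ctype::tolower converts the range [first, last) in place.
        facet.tolower(&text[0], &text[0] + text.size());
    }
    return text;
}
```

It can be called as to_lower(std::wstring(L"ABC")) or with an explicit locale as the second argument.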
 
