da******@gmail.com wrote:
> I am using std::ofstream (as well as ifstream), I hope that when i
> wrote in some std::string(...) with locale, ofstream can convert to
> UTF-8 encoding and save file to disk. So does ifstream.

'std::ofstream' and 'std::ifstream' operate on 'char' objects. Assuming
an appropriate configuration is set up, these can indeed be converted
to UTF-8 but this is hardly exciting: they would map to a total of at
most 256 characters.

It is important to understand that, at least conceptually, the standard
C++ library internally operates on characters where each character is
represented by one character object. Internally, no multi-byte
representation is supported. To cope with more than 256 characters,
you would use a different character type, e.g. 'wchar_t'. Effectively,
the idea was to use 'wchar_t' objects to represent Unicode characters
which, at the time when the standard C++ library was designed, were
units of 16 bits, with each unit representing an individual character.

Unfortunately, the Unicode people decided at some point that it would
be a really brilliant idea to throw all their fundamental assumptions
overboard and have combining characters (i.e. suddenly some characters
were represented by more than one unit) and characters which no longer
fit into 16 bits (code points now range up to U+10FFFF). This
does not mix too well with C++, though: some compiler vendors had
decided that their 'wchar_t' shall be 16 bits wide and they are
essentially bound to this choice due to existing binary interfaces
already using 16 bit 'wchar_t's. To cope with this, the standard
library typically supports an internal UTF-16 representation although
most code actually treats 'wchar_t's as UCS-2 entities, i.e. it does
not care about UTF-16 surrogate pairs, nor about combining characters.
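
To make the difference between UCS-2 and UTF-16 concrete, here is a minimal sketch of how a code point is turned into UTF-16 units. The function names 'to_utf16' and 'count_code_points' are made up for this illustration, and I use 'char16_t' rather than 'wchar_t' so the unit width is the same on every platform:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF fit in one unit; larger ones need a surrogate pair.
std::u16string to_utf16(char32_t cp) {
    // Surrogate values and anything above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string units;
    if (cp < 0x10000) {
        units.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;                               // 20 bits remain
        units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}

// Count code points in a UTF-16 sequence by not counting low surrogates.
// Assumes the input is well formed.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s)
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    return n;
}
```

Code which treats each unit as one character gets the length of such a string wrong as soon as a surrogate pair shows up, which is exactly the UCS-2 assumption mentioned above.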

Although the UTF-16 support somewhat muddies the water, in the context
of the standard C++ library it is best to think in terms of
"characters", i.e. the entities used within a program which are stored
e.g. in 'std::basic_string<cT>' (each 'cT' representing one character),
and their "encoding", i.e. the entities ending up as bytes in a file.
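
A small sketch of this distinction, using a sample word I picked for the purpose ("naïve") and hypothetical helper names: the internal form holds one 'char32_t' per character, while the UTF-8 encoding of the same text needs one byte more.

```cpp
#include <cstddef>
#include <string>

// Internal representation: one char32_t object per character.
// The sample text is "naïve", with the ï written as the escape for U+00EF.
std::u32string sample_characters() {
    return U"na\u00EFve";
}

// External encoding: the same text as the UTF-8 bytes which would end up in a file.
// U+00EF is encoded as the two bytes 0xC3 0xAF.
std::string sample_utf8_bytes() {
    return "na\xC3\xAFve";
}

// Count the characters in a UTF-8 byte sequence by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx. Assumes well-formed input.
std::size_t count_utf8_characters(const std::string& bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

Here the character sequence has 5 elements and the byte sequence has 6, so 'size()' on the encoded string does not tell you the number of characters.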

You seem to have some internal multi-byte encoding which you
apparently want to convert to some other multi-byte encoding, the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.
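
The ASCII property can be checked with a small well-formedness test. This is only a sketch with names of my own choosing ('is_ascii', 'is_valid_utf8'), not something taken from any library:

```cpp
#include <cstddef>
#include <string>

// True if every byte is in the 7-bit ASCII range.
bool is_ascii(const std::string& bytes) {
    for (unsigned char b : bytes)
        if (b >= 0x80) return false;
    return true;
}

// True if `bytes` is well-formed UTF-8.
bool is_valid_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;   // number of continuation bytes expected
        char32_t cp;         // code point being assembled
        char32_t lowest;     // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        lowest = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; lowest = 0x10000; }
        else return false;   // stray continuation byte or invalid lead byte
        if (bytes.size() - i - 1 < extra) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;         // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate values
        i += extra + 1;
    }
    return true;
}
```

Every string accepted by 'is_ascii' is also accepted by 'is_valid_utf8', because each ASCII byte takes the first branch. The reverse does not hold for other 8-bit encodings: an ISO-8859-1 "é" (the single byte 0xE9) is rejected.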

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations, but I would use the following approach anyway:
- Convert your multi-byte encoded text into a sequence of characters
  using the normal internal representation, probably using an
  appropriate code conversion facet.
- Use the characters in your internal processing, probably taking
  care to rip apart neither combined characters nor UTF-16 surrogate
  pairs.
- Have a suitable code conversion facet convert the internal
  representation into whatever encoding you want to use, e.g.
  UTF-16.
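
The three steps above can be sketched without any facet at all. Everything in this sketch is an assumption made for illustration: ISO-8859-1 stands in for your actual source encoding (I don't know which one you use), UTF-8 is the target, 'char32_t' is the internal character type, and the function names are hypothetical.

```cpp
#include <string>

// Step 1: decode the external bytes into one char32_t per character.
// For ISO-8859-1 the byte value is the code point, so decoding is trivial.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string chars;
    for (unsigned char b : bytes) chars.push_back(b);
    return chars;
}

// Step 2: process characters, not bytes. As a stand-in for real processing,
// replace each no-break space (U+00A0) by an ordinary space.
std::u32string normalize_spaces(std::u32string chars) {
    for (char32_t& c : chars)
        if (c == 0x00A0) c = U' ';
    return chars;
}

// Step 3: encode the characters as UTF-8.
// Assumes valid code points (no surrogate values, nothing above U+10FFFF).
std::string encode_utf8(const std::u32string& chars) {
    std::string out;
    for (char32_t c : chars) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// The whole pipeline: external bytes -> characters -> external bytes.
std::string latin1_to_utf8(const std::string& bytes) {
    return encode_utf8(normalize_spaces(decode_latin1(bytes)));
}
```

With code conversion facets, steps 1 and 3 would happen inside the file buffer while reading and writing; only step 2 would remain in your own code.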

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings. I think there are also free alternatives but I
don't know any of them off-hand, although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
(although, from a purist view, this is not a suitable character
representation at all) and converts between this representation and
the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).
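
For illustration, this is roughly how a code conversion facet is hooked into a wide file stream. The sketch does not use the Boost or Dinkumware facets but 'std::codecvt_utf8' from the standard <codecvt> header, which was added in C++11, deprecated in C++17 and removed in C++26, so treat it as an assumption about your toolchain. The function names 'write_utf8_file' and 'facet_to_utf8' are mine.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <cstddef>
#include <cwchar>    // std::mbstate_t
#include <fstream>
#include <locale>
#include <string>

// Write wide characters to a file and let the stream's code conversion
// facet produce the UTF-8 bytes.
bool write_utf8_file(const char* path, const std::wstring& text) {
    std::wofstream out;
    // Imbue before opening so the file buffer uses the facet from the start.
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::binary);
    out << text;
    return static_cast<bool>(out);
}

// Call the same kind of facet directly, which is roughly what the file
// buffer does internally. Returns an empty string if conversion fails.
std::string facet_to_utf8(const std::wstring& text) {
    std::codecvt_utf8<wchar_t> facet;
    std::mbstate_t state = std::mbstate_t();
    // max_length() bounds the number of bytes produced per character.
    std::string bytes(text.size() * facet.max_length(), '\0');
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r = facet.out(
        state, text.data(), text.data() + text.size(), from_next,
        &bytes[0], &bytes[0] + bytes.size(), to_next);
    if (r != std::codecvt_base::ok) return std::string();
    bytes.resize(static_cast<std::size_t>(to_next - &bytes[0]));
    return bytes;
}
```

A facet for some other external encoding would be plugged into the stream the same way; only the type passed to the 'std::locale' constructor changes.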
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence