How to use codecvt to convert ASCII <--> UTF-8 within std::ofstream

My Friends:

I am using std::ofstream (as well as ifstream). I hope that when I
write a std::string(...) to a stream with a suitable locale, ofstream
can convert it to UTF-8 and save the file to disk, and that ifstream
can do the reverse.

From what I have found, I need to set a proper codecvt for this. I
would like more information, ideally a small code sample. Thank
you!

Regards,
David Xiao

Dec 19 '05 #1
More information:
My environment is MS VC2003 with its built-in STL library.

I am confused by codecvt<>... I would very much appreciate it if someone
could provide a codecvt<> that does "multibyte <--> UTF-8" conversion.

-David Xiao

Dec 19 '05 #2
<da******@gmail.com> wrote in message
news:11**********************@g14g2000cwa.googlegroups.com...
More information:
My environment is MS VC2003 with its built-in STL library.

I am confused by codecvt<>... I would very much appreciate it if someone
could provide a codecvt<> that does "multibyte <--> UTF-8" conversion.


You get one with our CoreX package, available at our web site.
You might also find a free one at Boost, if you can afford the
time to locate it and make it work in your environment.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Dec 19 '05 #3

da******@gmail.com wrote:
My Friends:

I am using std::ofstream (as well as ifstream). I hope that when I
write a std::string(...) to a stream with a suitable locale, ofstream
can convert it to UTF-8 and save the file to disk, and that ifstream
can do the reverse.


Probably no need to, if your Subject: line is correct. UTF-8 is an
8-bit superset of ASCII that also covers non-ASCII characters. If you
have an ASCII text, each 7-bit ASCII character is represented by a
single char in C++, and this value coincides with the UTF-8 value.

Of course, C++ compilers may use EBCDIC both internally for char
and/or externally for text files, which breaks the nice model - but in
that case an ASCII -> UTF-8 codecvt obviously won't work either.
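The ASCII case can be checked mechanically. As a minimal sketch (the helper name is_seven_bit_ascii is made up for illustration and is not part of any library), a string whose bytes are all below 0x80 is pure ASCII and can be written to a UTF-8 file unchanged:

```cpp
#include <string>

// Hypothetical helper: returns true if every byte is in the 7-bit
// ASCII range (0x00-0x7F). Such a string is already valid UTF-8
// with the same meaning, so no code conversion is required.
bool is_seven_bit_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;  // a lead or continuation byte, or a legacy code page byte
        }
    }
    return true;
}
```

If this returns false, the text is in some other encoding and the question of which one (UTF-8, CP936, CP949, ...) has to be answered before any conversion makes sense.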

HTH,
Michiel Salters

Dec 19 '05 #4
I'm sorry for the incorrect wording in the subject line...
I believe Michiel Salters is right that UTF-8 defines its
"characters" as variable length, and that for 7-bit characters UTF-8
is compatible with ASCII.

Actually I am dealing with Asian languages, for example CP936 (Chinese
GBK) or CP949 (Korean). I am looking for a way to handle this:

CONVERT locale(......) <-->UTF-8.

It had better be graceful and work with std::ofstream (I guess that will
need a codecvt<>). Of course, the ugly way is to use some code to convert
the whole file.

Thanks, P.J. Plauger, for the suggestion. I found one codecvt<> in Boost,
but it seems to work on UTF-8 <--> UTF-16.
Anyway, I am following this thread with attention...

Regards, David Xiao

Dec 19 '05 #5
da******@gmail.com wrote:
I am using std::ofstream (as well as ifstream). I hope that when I
write a std::string(...) to a stream with a suitable locale, ofstream
can convert it to UTF-8 and save the file to disk, and that ifstream
can do the reverse.


'std::ofstream' and 'std::ifstream' operate on 'char' objects. Assuming
an appropriate configuration is set up, these can be indeed converted
to UTF-8 but this is hardly really exciting: they would map to the
total of 256 characters.

It is important to understand that, at least conceptually, the standard
C++ library internally operates on characters where each character is
represented by one character object. Internally, no multi-byte
representation is supported. To cope with more than 256 characters,
you would use a different character type, e.g. 'wchar_t'. Effectively,
the idea was to use 'wchar_t' objects to represent Unicode characters
which, at the time when the standard C++ library was designed, were
units of 16 bits, with each unit representing an individual character.

Unfortunately, the Unicode people decided at some point that it would
be a really brilliant idea to throw all their fundamental assumptions
overboard and have combining characters (i.e. suddenly some characters
were represented by more than one unit) and 20 bit characters. This
does not mix too well with C++, though: some compiler vendors had
decided that their 'wchar_t' shall be 16 bits wide and they are
essentially bound to this choice due to existing binary interfaces
already using 16 bit 'wchar_t's. To cope with these, the standard
library typically supports an internal UTF-16 representation although
most code actually uses the 'wchar_t' as UCS-2 entities, i.e. it does
not care about UTF-16, nor about combining characters.
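To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal sketch of the UTF-16 encoding rule. The function name encode_utf16 is made up for illustration, and the code assumes a C++11 compiler (for char16_t and char32_t) and a valid input code point:

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (at most 0x10FFFF and not
// itself in the surrogate range 0xD800-0xDFFF).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one unit, identical to UCS-2.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Beyond the BMP: split the remaining 20 bits into a
        // high and a low surrogate.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

Code that treats each 16-bit unit as one character works for the first branch only; anything that takes the second branch occupies two units, which is where UCS-2 style processing can split a character.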

Although the UTF-16 support somewhat muddies the water, in the context
of the standard C++ library you should best think in terms of
"characters", i.e. the entities used within a program which are stored
e.g. in 'std::basic_string<cT>' (each 'cT' representing one character),
and their "encoding", i.e. the entities ending up as bytes in a file.

You seem to have some internal multi-byte encoding which you
apparently want to write to some other multi-byte encoding with the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations but I would use the following approach anyway:
- Convert your multi-byte encoded text into a sequence of characters
using the normal internal representation, probably using an
appropriate code conversion facet.
- Use the characters internally in your internal processing, probably
taking care to rip neither combined characters nor UTF-16 characters
apart.
- Have a suitable code conversion facet convert the internal
representation into whatever suitable encoding you want to use, e.g.
UTF-16.
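The last step of this approach, going from the internal characters to an external encoding, can be sketched for UTF-8, the target named in the original question. This is only an illustration under stated assumptions: the function name encode_utf8 is hypothetical, the internal representation is taken to be one char32_t per code point, and the input is assumed to contain only valid code points. The first step (decoding CP936 or CP949) needs a conversion table and is not shown.

```cpp
#include <string>

// Encodes a sequence of Unicode code points as UTF-8 bytes.
// Assumes every element is a valid scalar value (at most 0x10FFFF,
// no surrogates); no error handling is attempted in this sketch.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            // 7-bit ASCII is stored unchanged.
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A code conversion facet does the same work incrementally inside the stream buffer, so the program only ever sees the internal characters.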

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings. I think there are also free alternatives but I
don't know any of them off-hand although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
and converts between this character representation (although, from
a purist view this is not a suitable character representation at all)
and the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).
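For the stream side, a sketch of how a code conversion facet is attached to a file stream may help. It assumes a C++11 or later standard library, which provides std::codecvt_utf8 in <codecvt>; that header did not exist in VC2003, was deprecated in C++17 and is removed in C++26, so treat this as an illustration of the mechanism and not as a recommendation. The function names write_utf8_file and read_utf8_file are made up for the example.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Writes wide text to a file as UTF-8. The facet is imbued before the
// file is opened so that every character goes through the conversion.
bool write_utf8_file(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(), std::ios::binary);
    if (!out) {
        return false;
    }
    out << text;
    out.close();
    return !out.fail();
}

// Reads a UTF-8 file back into wide characters using the same facet.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path.c_str(), std::ios::binary);
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

Where 'wchar_t' is 16 bits wide, std::codecvt_utf8<wchar_t> handles UCS-2 only, and std::codecvt_utf8_utf16<wchar_t> is the variant that understands surrogate pairs. Neither covers the CP936 or CP949 side; that still needs a separate facet or a platform conversion routine to get the legacy text into wide characters first.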
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence
Dec 19 '05 #6
<da******@gmail.com> wrote in message
news:11**********************@g43g2000cwa.googlegroups.com...
I'm sorry for the incorrect wording in the subject line...
I believe Michiel Salters is right that UTF-8 defines its
"characters" as variable length, and that for 7-bit characters UTF-8
is compatible with ASCII.

Actually I am dealing with Asian languages, for example CP936 (Chinese
GBK) or CP949 (Korean). I am looking for a way to handle this:

CONVERT locale(......) <-->UTF-8.

It had better be graceful and work with std::ofstream (I guess that will
need a codecvt<>). Of course, the ugly way is to use some code to convert
the whole file.

Our CoreX library has 80-odd codecvt facets, among them are ones
that convert:

-- between CP936 and UCS-2

-- between CP949 and UCS-2

-- between UTF-8 and UCS-2

You can use them with istream/ostream for file I/O or with an
in-memory string-to-string converter that we also supply.
Sounds like exactly what you need.

Thanks, P.J. Plauger, for the suggestion. I found one codecvt<> in Boost,
but it seems to work on UTF-8 <--> UTF-16.
Anyway, I am following this thread with attention...


Welcome.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Dec 19 '05 #7
"Dietmar Kuehl" <di***********@yahoo.com> wrote in message
news:40*************@individual.net...
da******@gmail.com wrote:

You seem to have some internal multi-byte encoding which you
apparently want to write to some other multi-byte encoding with the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations

We have a rich set of pairwise transformations that can be chained
together in several useful ways.

but I would use the following approach anyway:
- Convert your multi-byte encoded text into a sequence of characters
using the normal internal representation, probably using an
appropriate code conversion facet.
- Use the characters internally in your internal processing, probably
taking care neither to rip combined characters nor UTF-16 character
apart.
- Have a suitable code conversion facet convert the internal
representation into whatever suitable encoding you want to use, e.g.
UTF-16.

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings.

Yep.

I think there are also free alternatives but I
don't know any of them off-hand although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
and converts between this character representation (although, from
a purist view this is not a suitable character representation at all)
and the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).


Luckily, the OP clarified that he has no need for UTF-16. That
simplifies matters a bit.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Dec 19 '05 #8
Thanks, gentlemen.

I believe I got the idea from all your posts. I also believe that the
i18n support of the STL has a "graceful gap" as well.

Thanks again for your information!

Rgrds, David Xiao

Dec 20 '05 #9
