How to use codecvt to convert ASCII <--> UTF-8 within std::ofstream

My Friends:

I am using std::ofstream (as well as std::ifstream). I hope that when I
write a std::string(...) to a stream with a locale set, ofstream can
convert it to UTF-8 and save the file to disk, and that ifstream can do
the reverse when reading.

From what I have found, I need to set up a proper codecvt for this. I
need more information, maybe a small code sample. Thank
you!

Regards,
David Xiao

Dec 19 '05 #1
More information:
My environment is MS VC2003 with its built-in STL library.

I am confused by codecvt<>. I would much appreciate it if someone could
provide a codecvt<> that does "multibyte <--> UTF-8" conversion.

-David Xiao

Dec 19 '05 #2
<da******@gmail.com> wrote in message
news:11**********************@g14g2000cwa.googlegroups.com...
More information:
My environment is MS VC2003 with its built-in STL library.

I am confused by codecvt<>. I would much appreciate it if someone could
provide a codecvt<> that does "multibyte <--> UTF-8" conversion.


You get one with our CoreX package, available at our web site.
You might also find a free one at Boost, if you can afford the
time to locate it and make it work in your environment.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Dec 19 '05 #3

da******@gmail.com wrote:
My Friends:

I am using std::ofstream (as well as std::ifstream). I hope that when I
write a std::string(...) to a stream with a locale set, ofstream can
convert it to UTF-8 and save the file to disk, and that ifstream can do
the reverse when reading.


Probably no need to, if your Subject: line is correct. UTF-8 is an
8-bit superset of ASCII that also covers non-ASCII characters. If you
have ASCII text, each 7-bit ASCII character is represented by a
single char in C++, and this value coincides with the UTF-8 value.

Of course, C++ compilers may use EBCDIC both internally for char
and/or externally for text files, which breaks the nice model - but in
that case an ASCII -> UTF-8 codecvt obviously won't work either.
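
To make the point above concrete, here is a minimal sketch of a check that text is pure 7-bit ASCII and can therefore be written to a UTF-8 file unchanged. The function name is_pure_ascii is made up for illustration, and the sketch assumes the text is held in an ASCII-based encoding rather than EBCDIC.

```cpp
#include <string>

// Hypothetical helper: returns true when every byte is in 0x00..0x7F.
// Such text is already valid UTF-8, so no codecvt is needed to write it.
bool is_pure_ascii(const std::string& text) {
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (static_cast<unsigned char>(text[i]) > 0x7F) {
            return false;  // a high bit is set: not ASCII, needs real conversion
        }
    }
    return true;
}
```

A GBK string such as "\xD6\xD0" fails this check, which is the case where a real conversion step is required.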

HTH,
Michiel Salters

Dec 19 '05 #4
I'm sorry for the incorrect wording in the subject line.
I believe Michiel Salters is right that UTF-8 defines its
"character" as variable length, and that for 7-bit chars UTF-8 is
compatible with ASCII.

Actually I am dealing with Asian languages, for example CP936 (Chinese
GBK) or CP949 (Korean). I am looking for a way to do this:

CONVERT locale(......) <--> UTF-8.

It should be graceful and work with std::ofstream (I guess that will
need a codecvt<>). Of course, the ugly way is to use some code to convert
the whole file.

Thanks to P.J. Plauger for the suggestion. I found one codecvt<> in Boost,
but it seems to work on UTF-8 <--> UTF-16.
Anyway, I am following this thread with attention.

Regards, David Xiao

Dec 19 '05 #5
da******@gmail.com wrote:
I am using std::ofstream (as well as std::ifstream). I hope that when I
write a std::string(...) to a stream with a locale set, ofstream can
convert it to UTF-8 and save the file to disk, and that ifstream can do
the reverse when reading.


'std::ofstream' and 'std::ifstream' operate on 'char' objects. Assuming
an appropriate configuration is set up, these can indeed be converted
to UTF-8, but this is hardly exciting: they would map to a
total of only 256 characters.

It is important to understand that, at least conceptually, the standard
C++ library internally operates on characters where each character is
represented by one character object. Internally, no multi-byte
representation is supported. To cope with more than 256 characters,
you would use a different character type, e.g. 'wchar_t'. Effectively,
the idea was to use 'wchar_t' objects to represent Unicode characters,
which at the time the standard C++ library was designed were
units of 16 bits, each unit representing an individual character.

Unfortunately, the Unicode people decided at some point that it would
be a really brilliant idea to throw all their fundamental assumptions
overboard and have combining characters (i.e. suddenly some characters
were represented by more than one unit) and characters wider than 16
bits (code points up to U+10FFFF). This
does not mix too well with C++, though: some compiler vendors had
decided that their 'wchar_t' shall be 16 bits wide and they are
essentially bound to this choice due to existing binary interfaces
already using 16 bit 'wchar_t's. To cope with these, the standard
library typically supports an internal UTF-16 representation although
most code actually uses the 'wchar_t' as UCS-2 entities, i.e. it does
not care about UTF-16, nor about combining characters.
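
As an illustration of why a 16-bit 'wchar_t' cannot hold every character in one unit, here is a minimal sketch of UTF-16 encoding for a single code point. It assumes a C++11 compiler (so not VC2003), the function name to_utf16 is made up, and the input is assumed to be a valid code point that is not itself a surrogate.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Code points above U+FFFF do not fit in 16 bits and become a surrogate pair.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // one unit is enough
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 across the pair
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Code that treats each 16-bit unit as a whole character (the UCS-2 view) will see the two halves of such a pair as two separate, meaningless characters.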

Although the UTF-16 support somewhat muddies the water, in the context
of the standard C++ library you should best think in terms of
"characters", i.e. the entities used within a program which are stored
e.g. in 'std::basic_string<cT>' (each 'cT' representing one character),
and their "encoding", i.e. the entities ending up as bytes in a file.

You seem to have some internal multi-byte encoding which you
apparently want to write to some other multi-byte encoding with the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations but I would use the following approach anyway:
- Convert your multi-byte encoded text into a sequence of characters
using the normal internal representation, probably using an
appropriate code conversion facet.
- Use the characters internally in your internal processing, probably
taking care not to rip apart combined characters or UTF-16 surrogate
pairs.
- Have a suitable code conversion facet convert the internal
representation into whatever suitable encoding you want to use, e.g.
UTF-16.
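
The three steps above can be sketched in memory as follows. This is only an illustration under loud assumptions: it needs a C++11 compiler, the function names are made up, and ISO-8859-1 stands in for the legacy encoding because each of its bytes equals its code point. Decoding CP936 or CP949 would need lookup tables that are not shown here. The external encoding chosen is UTF-8, since that is what the original question asks for.

```cpp
#include <string>

// Step 1 (stand-in): decode legacy bytes into one code point per element.
// ISO-8859-1 is used only because byte value == code point; a real CP936
// or CP949 decoder needs conversion tables.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Step 2 would be the program's own processing on the u32string.

// Step 3: encode the internal code points as UTF-8 bytes.
// No validation is done: surrogates and values above U+10FFFF are the
// caller's responsibility in this sketch.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Using a 32-bit internal character keeps the "one object per character" model intact, at the cost of memory; a 16-bit internal type would need the surrogate handling discussed earlier.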

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings. I think there are also free alternatives but I
don't know any of them off-hand although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
and converts between this character representation (although, from
a purist view this is not a suitable character representation at all)
and the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence
Dec 19 '05 #6
<da******@gmail.com> wrote in message
news:11**********************@g43g2000cwa.googlegroups.com...
I'm sorry for the incorrect wording in the subject line.
I believe Michiel Salters is right that UTF-8 defines its
"character" as variable length, and that for 7-bit chars UTF-8 is
compatible with ASCII.

Actually I am dealing with Asian languages, for example CP936 (Chinese
GBK) or CP949 (Korean). I am looking for a way to do this:

CONVERT locale(......) <--> UTF-8.

It should be graceful and work with std::ofstream (I guess that will
need a codecvt<>). Of course, the ugly way is to use some code to convert
the whole file.

Our CoreX library has 80-odd codecvt facets, among them ones
that convert:

-- between CP936 and UCS-2

-- between CP949 and UCS-2

-- between UTF-8 and UCS-2

You can use them with istream/ostream for file I/O or with an
in-memory string-to-string converter that we also supply.
Sounds like exactly what you need.
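
For readers without such a library, here is a rough sketch of how a codecvt facet plugs into a file stream. It is not the CoreX or Boost facet: Utf8OutFacet and write_utf8_file are hypothetical names, the facet only handles output, it assumes a C++11 compiler, and it assumes each wchar_t holds one whole code point (surrogates are rejected rather than combined).

```cpp
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical write-only facet: encodes each wchar_t as UTF-8 bytes.
class Utf8OutFacet : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        static const unsigned char lead[] = {0x00, 0x00, 0xC0, 0xE0, 0xF0};
        from_next = from;
        to_next = to;
        for (; from_next != from_end; ++from_next) {
            const unsigned long cp = static_cast<unsigned long>(*from_next);
            if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return error;
            const int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (to_end - to_next < len) return partial;  // output buffer is full
            if (len == 1) {
                *to_next++ = static_cast<char>(cp);
                continue;
            }
            // Lead byte carries the top bits, continuation bytes 6 bits each.
            *to_next++ = static_cast<char>(lead[len] | (cp >> (6 * (len - 1))));
            for (int i = len - 2; i >= 0; --i)
                *to_next++ = static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
        }
        return ok;
    }
    result do_unshift(std::mbstate_t&, char* to, char*, char*& to_next) const override {
        to_next = to;
        return noconv;  // UTF-8 is stateless, nothing to flush
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }  // variable width
    int do_max_length() const noexcept override { return 4; }
};

// Writes text to path as UTF-8; returns false if the stream failed.
bool write_utf8_file(const char* path, const std::wstring& text) {
    std::wofstream out;
    // The locale takes ownership of the facet; imbue before opening the file.
    out.imbue(std::locale(std::locale::classic(), new Utf8OutFacet));
    out.open(path, std::ios::out | std::ios::binary);
    out << text;
    out.flush();
    return out.good();
}
```

Reading back would need a matching do_in, and going from CP936 or CP949 would need a second facet (or a table-driven decoder) to produce the wide characters in the first place.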
Thanks to P.J. Plauger for the suggestion. I found one codecvt<> in Boost,
but it seems to work on UTF-8 <--> UTF-16.
Anyway, I am following this thread with attention.


Welcome.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Dec 19 '05 #7
"Dietmar Kuehl" <di***********@ yahoo.com> wrote in message
news:40*************@individual.net...
da******@gmail.com wrote:

You seem to have some internal multi-byte encoding which you
apparently want to write to some other multi-byte encoding with the
latter being UTF-8. At least, this is what I gather from your articles
and the subject of your articles: as Michiel correctly noted, you can
dump ASCII using the C locale (i.e. no conversion at all) into a file
and you would have a valid UTF-8 representation of your ASCII text.
One of the fundamental design decisions of Unicode which they haven't
thrown overboard (at least not when I last looked; I wouldn't put it
beyond them to do otherwise, though) is that each valid ASCII text is
a valid UTF-8 text with exactly the same interpretation.

I don't know whether Dinkumware's library really supports conversion
of arbitrary internal representations into (more or less arbitrary)
external representations

We have a rich set of pairwise transformations that can be chained
together in several useful ways.

but I would use the following approach anyway:
- Convert your multi-byte encoded text into a sequence of characters
using the normal internal representation, probably using an
appropriate code conversion facet.
- Use the characters internally in your internal processing, probably
taking care not to rip apart combined characters or UTF-16 surrogate
pairs.
- Have a suitable code conversion facet convert the internal
representation into whatever suitable encoding you want to use, e.g.
UTF-16.

I'm pretty sure that Dinkumware's library does the appropriate
conversions between an internal character representation and various
external encodings.

Yep.

I think there are also free alternatives but I
don't know any of them off-hand although I guess that the code
conversion facet you found at Boost does just the right thing: it
probably uses UTF-16 as the internal representation for characters
and converts between this character representation (although, from
a purist view this is not a suitable character representation at all)
and the UTF-8 encoding. You might need to find a code conversion
facet from whatever other encoding you are using to the internal
encoding (probably UTF-16 on Windows machines and UCS-4 on many other
systems).


Luckily, the OP clarified that he has no need for UTF-16. That
simplifies matters a bit.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Dec 19 '05 #8
Thanks, Gentlemen.

I believe I got the idea from all your posts. I also believe that the i18n
support in STL has a "graceful gap" as well.

Thanks again for your information!

Rgrds, David Xiao

Dec 20 '05 #9
