c++ support for unicode, utf-8, encode/decode, ifstream, wstream?


Hi,
I have a Unicode text file encoded in UTF-8.

I should store the UNICODE strings in my program for example in
std::wstring right? To be able to work on them normally, so that
std::wstring foo; foo[5] would mean 5-th _character_, and not 5-th
byte of UNICODE encoded string.

How do I read a text from UTF-8 file into std::wstring? I need to do
some conversion right? from utf-8 to internal format used by
std::wstring (probably UCS-2 or -4 right?)

Also, how to save back the string, and how to manipulate it (like,
replace 4-th character, just str[4]=(wchar)'x' ?)

Thanks



Jan 20 '06 #1
TB
Rafał Maj Raf256 said:
Hi,
I have a Unicode text file encoded in UTF-8.

I should store the UNICODE strings in my program for example in
std::wstring right? To be able to work on them normally, so that
std::wstring foo; foo[5] would mean 5-th _character_, and not 5-th
byte of UNICODE encoded string.

How do I read a text from UTF-8 file into std::wstring? I need to do
some conversion right? from utf-8 to internal format used by
std::wstring (probably UCS-2 or -4 right?)

Also, how to save back the string, and how to manipulate it (like,
replace 4-th character, just str[4]=(wchar)'x' ?)


Upon reading the UTF-8 data convert it internally to UTF-32 for
easier parsing. The conversion process is quite easy to write.
The problem with std::wstring is that it's templatized with
wchar_t, and that primitive is at least on my machine only 2 bytes,
and therefore not practical to use with unicode (unless you actually
wish to use the abnormal UTF-16 variant in such a case).
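
A minimal sketch of the UTF-8 to UTF-32 conversion described above. It assumes a C++11 compiler, so it uses `char32_t` and `std::u32string`, which postdate this thread; the function name `decode_utf8` is made up for illustration and is not part of any library mentioned here. It also rejects overlong forms and surrogates, which is where hand-written converters most often go wrong.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points (one char32_t per code point).
// Throws std::runtime_error on malformed input.
// Hypothetical helper, written for illustration only.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence has.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With the file contents read into a `std::string` as raw bytes, `decode_utf8(bytes)[5]` is then the code point at index 5 rather than the byte at index 5.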

--
TB @ SWEDEN
Jan 20 '06 #2
TB wrote:
Upon reading the UTF-8 data convert it internally to UTF-32 for
easier parsing.
How? Aren't there ready to use functions/classes doing that? In std,
perhaps in boost?
The conversion process is quite easy to write.
The problem with std::wstring is that it's templatized with
wchar_t, and that primitive is at least on my machine only 2 bytes,
and therefore not practical to use with unicode (unless you actually
wish to use the abnormal UTF-16 variant in such a case).


Hm.. so which class is best to store any-language text string then?

Jan 20 '06 #3
"Rafal Maj Raf256" <us*******************@raf256.com.invalid> wrote in
message news:dq**********@inews.gazeta.pl...
TB wrote:
Upon reading the UTF-8 data convert it internally to UTF-32 for
easier parsing.


How? Aren't there ready to use functions/classes doing that? In std,
perhaps in boost?


You'll find a few codecvt facets (the critters you need) in various places,
but for a complete set of all that you're likely to need -- ready made,
tested, and supported -- see our CoreX library.
The conversion process is quite easy to write.
No it isn't. At least not correctly and robustly.
The problem with std::wstring is that it's templatized with
wchar_t, and that primitive is at least on my machine only 2 bytes,
and therefore not practical to use with unicode (unless you actually
wish to use the abnormal UTF-16 variant in such a case).


Hm.. so which class is best to store any-language text string then?


Depends on your goals. In truth and reality, you can still get away quite
nicely with UCS-2. Effectively, you ignore the exotic characters with
code values above 0xffff more recently added. Your input converter
then treats as erroneous any UTF-8 sequence that specifies a code
value that's too big. But if you feel the need to support the complete
Unicode set in its current form, you need to convert UTF-8 to UTF-16
internally, and accept the fact that characters can occupy either one or
two storage elements. Whatever your choice, CoreX has the
conversion tools you need to carry it out.
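
To make the trade-off above concrete, here is a rough sketch of both choices, assuming C++11 (`char16_t`, `std::u16string`). The names `append_utf16` and `append_ucs2` are invented for this example and have nothing to do with CoreX. The first shows why a UTF-16 character can occupy one or two storage elements; the second shows the UCS-2 approach of treating anything above 0xffff as an error.

```cpp
#include <stdexcept>
#include <string>

// Append one code point as UTF-16: one unit for U+0000..U+FFFF,
// a surrogate pair (two units) for U+10000..U+10FFFF.
// Hypothetical helper, written for illustration only.
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        char32_t v = cp - 0x10000;  // 20 bits, split into two 10-bit halves
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}

// UCS-2 alternative: refuse anything outside the Basic Multilingual Plane,
// so every character is exactly one storage element.
void append_ucs2(std::u16string& out, char32_t cp) {
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::out_of_range("code point not representable in UCS-2");
    out.push_back(static_cast<char16_t>(cp));
}
```

With UTF-16, `str[n]` is only the n-th character as long as no surrogate pairs precede it; with UCS-2 it always is, at the cost of rejecting some input.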

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Jan 20 '06 #4
TB
Rafał Maj Raf256 said:
TB wrote:
Upon reading the UTF-8 data convert it internally to UTF-32 for
easier parsing.


How? Aren't there ready to use functions/classes doing that? In std,
perhaps in boost?
The conversion process is quite easy to write.

The problem with std::wstring is that it's templatized with
wchar_t, and that primitive is at least on my machine only 2 bytes,
and therefore not practical to use with unicode (unless you actually
wish to use the abnormal UTF-16 variant in such a case).


Hm.. so which class is best to store any-language text string then?


If 'unsigned int' is 4 bytes on your machine, write a unicode
implementation based on that primitive, or use an already available
framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.
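
As a sketch of what a string built on a 4-byte primitive can look like: on a C++11 compiler (an assumption, since the type postdates this thread) `std::u32string` is already a `std::basic_string` of `char32_t`, so indexing works per code point. What is still needed is a way to save the text back as UTF-8. The function `encode_utf8` below is a made-up name for illustration, and it assumes every element is a valid code point (no surrogates, nothing above U+10FFFF).

```cpp
#include <string>

// Encode UTF-32 code points back to UTF-8, e.g. before writing to a file.
// Assumes every element is a valid Unicode scalar value.
// Hypothetical helper, written for illustration only.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Replacing a character is then plain indexing, for example `std::u32string s = U"za\u017C\u00F3\u0142\u0107"; s[4] = U'x';`, followed by `encode_utf8(s)` when saving. Note that one code point is still not always one user-perceived character (combining marks are separate code points).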

--
TB @ SWEDEN
Jan 20 '06 #5
"TB" <TB@void.com> wrote in message
news:43**********************@taz.nntpserver.com...
Rafal Maj Raf256 said:
TB wrote:
Upon reading the UTF-8 data convert it internally to UTF-32 for
easier parsing.


How? Aren't there ready to use functions/classes doing that? In std,
perhaps in boost?
The conversion process is quite easy to write.

The problem with std::wstring is that it's templatized with
wchar_t, and that primitive is at least on my machine only 2 bytes,
and therefore not practical to use with unicode (unless you actually
wish to use the abnormal UTF-16 variant in such a case).


Hm.. so which class is best to store any-language text string then?


If 'unsigned int' is 4 bytes on your machine, write a unicode
implementation based on that primitive, or use an already available
framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


Yep. It includes UTF-8 to UCS-4 too. And it's templatized on the
internal character type. Forgot to mention that.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Jan 20 '06 #6
On Fri, 20 Jan 2006 17:13:37 +0100, TB <TB@void.com> wrote:
If 'unsigned int' is 4 bytes on your machine, write a unicode
implementation based on that primitive, or use an already available
framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
Plauger. Bit of a difference I think.

"If you have ten thousand regulations you destroy
all respect for the law." - Winston Churchill
Jan 21 '06 #7
"JustBoo" <Ju*****@BooWho.com> wrote in message
news:e5********************************@4ax.com...
On Fri, 20 Jan 2006 17:13:37 +0100, TB <TB@void.com> wrote:
If 'unsigned int' is 4 bytes on your machine, write a unicode
implementation based on that primitive, or use an already available
framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
Plauger. Bit of a difference I think.


Really? In what way? I certainly *advocate* using an already
available framework, as did TB. If you can get a free one that
does the job (and it's still sufficiently "free" after you locate
it, download it, figure out how to build it, integrate it into
your product, deal with the surprises, and test it to your
satisfaction) by all means do so. I also *advocate* using CoreX,
if you're sufficiently professional that USD 90 is cheaper than
the above parenthetical exercise costs you in your time and
peace of mind.

But if you think I *advocate* something just because I make
ninety bucks off it, then by all means avoid anything that's
$OLD and stick with open source. Just don't measure me by
your standards.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Jan 21 '06 #8
On Sat, 21 Jan 2006 18:32:22 -0500, "P.J. Plauger"
<pj*@dinkumware.com> wrote:
"JustBoo" <Ju*****@BooWho.com> wrote in message
news:e5********************************@4ax.com...
Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
Plauger. Bit of a difference I think.
Really? In what way? I certainly *advocate* using an already
available framework, as did TB.
In what way? Well, from the simple *fact* you make money selling
your products here. I make that the point of this paragraph. Can you
deny that? It is a fact. Leave emotion out of it. Leave capitalism out
of it. Leave your perception that this is an insult out of it, and all
the rest. You sell your products here. I could truly *care less*
whether you do or not. But you do. Please do not enumerate all the
good you do for the free and not-so-free world by doing this. You sell
them here on a consistent basis. Period. As the sun comes up every
morning, it's just the obvious truth. Now please pay attention; in the
*context* of this thread, I thought it important to point this out to
the poster. That's it.

Before getting your hackles up please read further.
If you can get a free one that
does the job (and it's still sufficiently "free" after you locate
it, download it, figure out how to build it, integrate it into
your product, deal with the surprises, and test it to your
satisfaction) by all means do so. I also *advocate* using CoreX,
if you're sufficiently professional that USD 90 is cheaper than
the above parenthetical exercise costs you in your time and
peace of mind.
Fairly ironic that. I'm certain you don't remember, but *I have
recommended* people look at your products on a regular basis, in
this ng and others. Even in the real world. And ready for this, I have
used precisely the exact same logic to justify the recommendation
when *attacked* for doing so. That is the very definition of irony.

[Note: I usually leave out the snarky remark about "sufficiently
professional" though.] :-)
But if you think I *advocate* something just because I make
ninety bucks off it, then by all means avoid anything that's
$OLD and stick with open source.
Wow, an ocean's worth of assumption and presumptions to boot. You
think me a socialist? Bwha. I'm a stone-cold capitalist. You've
assumed far too much. <chuckling> You've read far too much into my
simple statement of fact.

And yes, I do think you advocate it because you make money from it.
Welcome to the commerce of the human race. It's just human nature.
I'll leave it up to you to decide if that is an insult or not.

Trend your own posts, seriously. Look at what you respond to and what
you always recommend. I believe you to be of a scientific mentality
and if you are honest with yourself you will see truth. Nothing more,
nothing less.

And in the end, so what. As I'm sure one of your arguments would/will
be: people are free to buy it or not, and you're making them aware of
its existence. And there you have it. Try to read this post without
emotion and perhaps you'll see my intent.
Just don't measure me by
your standards.

[Insult acknowledged but not accepted; like a refused package]

Once again, you assume far too much. Especially given that I simply
pointed out that you sell products, which is true. Does being a
capitalist bother you? Guilt perhaps? Note those are questions, not
assumptions.

"I didn't fight my way to the top of the food chain to be a
vegetarian."

Have a *prosperous* week. :-)
Jan 22 '06 #9
"JustBoo" <Ju*****@BooWho.com> wrote in message
news:h6********************************@4ax.com...
On Sat, 21 Jan 2006 18:32:22 -0500, "P.J. Plauger"
<pj*@dinkumware.com> wrote:
"JustBoo" <Ju*****@BooWho.com> wrote in message
news:e5********************************@4ax.com...
Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
Plauger. Bit of a difference I think.
Really? In what way? I certainly *advocate* using an already
available framework, as did TB.


In what way? Well, from the simple *fact* you make money selling
your products here. I make that the point of this paragraph. Can you
deny that?


Uh, no.
[extensive rant elided]


Got it. Now chill out.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Jan 22 '06 #10
