std::string vs. Unicode UTF-8

I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so can cause some complications.
E.g. you can't count the number of characters with length() or
size(). Instead one has to iterate through the string, parse all
UTF-8 multibyte sequences and count each sequence as one character.
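
As a minimal sketch of the counting described above, assuming the string holds well-formed UTF-8 and a C++11 compiler is available (the name `utf8_code_points` is made up for illustration and is not part of any library):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a string assumed to hold well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte starts
// a new code point, so counting the non-continuation bytes is enough.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the three-byte euro sign stored as "\xE2\x82\xAC", size() reports 3 while this function reports 1. Note that this counts code points, which is still not the same as user-perceived characters when combining marks are involved.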

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger
--

---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:st*****@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]

Sep 13 '05 #1
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.


It is much easier to handle Unicode strings with wchar_t internally, and
there is much less confusion about whether a string is ANSI or UTF-8
encoded. So I have started using wchar_t wherever I can, and I use UTF-8
only for external communication.

Niels Dybdahl

Sep 14 '05 #2
Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so can cause some complications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.
>
> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class, Glib::ustring
> <http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.
>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.
>
> Wolfgang Draxinger


UTF-8 is only an encoding. Why do you think that strings internal to the
program should be represented as UTF-8? It makes more sense to me to
translate to or from UTF-8 when you input or output strings from your
program. C++ already has the framework in place for that.

john


Sep 14 '05 #3

Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so can cause some complications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Correct. Also you can't print it or anything else.

> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class, Glib::ustring
> <http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

Ok.

> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version?

It already is - using e.g. wchar_t.

> I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays.

It is not limited.

> Of course there is
> also the wchar_t variant, but actually I don't like that.

So you'd like to have Unicode support. And you realize you already have
it. But you don't like it. Why?

> Wolfgang Draxinger

/Peter


Sep 14 '05 #4
On Tue, 13 Sep 2005 04:20:30 GMT, wd********@darkstargames.de
(Wolfgang Draxinger) wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so can cause some complications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Not only that, but substr(), operator[] etc. pose equally
"interesting" problems.
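
To illustrate the substr() problem: cutting at an arbitrary byte offset can land inside a multibyte sequence and leave an invalid fragment behind. A minimal sketch of a boundary-aware prefix follows, assuming well-formed UTF-8; `utf8_prefix` is a hypothetical helper, not part of std::string or Glib::ustring:

```cpp
#include <cstddef>
#include <string>

// Return a prefix of `s` holding at most `max_bytes` bytes, backing up so the
// cut never lands inside a multibyte sequence. Assumes well-formed UTF-8.
std::string utf8_prefix(const std::string& s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) {
        return s;
    }
    std::size_t cut = max_bytes;
    // s[cut] is the first byte that would be dropped. If it is a continuation
    // byte (10xxxxxx), the cut is mid-sequence, so move left to the lead byte.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return s.substr(0, cut);
}
```

With s = "a\xE2\x82\xAC" (the letter a followed by the euro sign), s.substr(0, 2) keeps a dangling lead byte, whereas utf8_prefix(s, 2) backs up and returns just "a".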

> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class, Glib::ustring
> <http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.
>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.
>
> Wolfgang Draxinger


People use std::string in many different ways. You can even store
binary data with embedded null characters in it. I don't know for
sure, but I believe there are already proposals in front of the C++
standards committee for what you suggest. In the meantime, it might
make more sense to use a third-party UTF-8 string class if that is
what you mainly use it for. IBM has released the ICU library as open
source, for example, and it is widely used these days.

--
Bob Hairgrove
No**********@Home.com


Sep 14 '05 #5

"Wolfgang Draxinger" <wd********@darkstargames.de> wrote in message
news:q2***********@darkstargames.dnsalias.net...
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so can cause some complications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.
>
> To address this problem the GTKmm bindings for the GTK+ toolkit
> have implemented their own string class, Glib::ustring
> <http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.
>
> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.
>
> Wolfgang Draxinger


That's why people have std::wstring :)

Ben

Sep 14 '05 #6
Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so can cause some complications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Yup. That's what happens when you use the wrong tool.

> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays.

There's much more to internationalization than Unicode. Requiring
std::string to be Unicode aware (presumably that means UTF-8 aware)
would impose implementation overhead that's not needed for the kinds of
things it was designed for, like the various ISO 8859 code sets. In
general, neither string nor wstring knows anything about multi-character
encodings. That's for efficiency. Do the translation on input and output.
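
A rough sketch of what translating on input might look like. It assumes C++11's char32_t and std::u32string (which postdate this thread) as a stand-in for a wide type known to hold any code point, and `utf8_to_utf32` is an illustrative name rather than a standard function:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into fixed-width code points (one char32_t each).
// Returns false on malformed input; `out` then holds what was decoded so far.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        char32_t lowest = 0;    // smallest code point allowed for this length
        if (b < 0x80)                { cp = b;        extra = 0; lowest = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; lowest = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (in.size() - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, values past U+10FFFF, and surrogates.
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

After decoding, size(), operator[] and substr() on the std::u32string all work in units of code points, which is the property the fixed-width approach is after.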

> Of course there is
> also the wchar_t variant, but actually I don't like that.


That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)


Sep 14 '05 #7

Wolfgang Draxinger wrote:
> I understand that it is perfectly possible to store UTF-8 strings
> in a std::string, however doing so can cause some complications.
> E.g. you can't count the number of characters with length() or
> size(). Instead one has to iterate through the string, parse all
> UTF-8 multibyte sequences and count each sequence as one character.

Usually correct, but not always. A char is a byte in C++, but
a byte might not be an octet. UTF-8 is of course octet-based.

> The question is, wouldn't it be logical to make std::string
> Unicode aware in the next STL version? I18N is an important
> topic nowadays and I simply see no logical reason to keep
> std::string as limited as it is nowadays. Of course there is
> also the wchar_t variant, but actually I don't like that.


wchar_t isn't always Unicode, either. There's a proposal to add an
extra Unicode character type, and that will probably include a
std::ustring.

However, that is probably a 20+ bit type. Unicode itself assigns
numbers to characters, and the numbers have exceeded 65536.
UTF-x means Unicode Transformation Format - x. These formats
map each number to one or more x-bit values. E.g. UTF-8 maps
the number of each Unicode character to an octet sequence,
with the additional property that the 0 byte isn't used for
anything but number 0.
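
The mapping from a character number to an octet sequence can be sketched as below. This is an illustration under the rules of RFC 3629 (one to four octets, surrogates excluded, nothing above U+10FFFF); `append_utf8` is a hypothetical name, and char32_t assumes C++11:

```cpp
#include <string>

// Append the UTF-8 encoding of code point `cp` to `out`.
// Returns false (leaving `out` untouched) for surrogates and for values
// above U+10FFFF, which UTF-8 does not encode.
bool append_utf8(char32_t cp, std::string& out) {
    if (cp <= 0x7F) {
        // 0xxxxxxx: plain ASCII, including the 0 byte for number 0.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

Every octet of a multi-octet sequence has its high bit set, so a 0 byte can only come from the first branch, which matches the property mentioned above.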

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.

HTH,
Michiel Salters


Sep 14 '05 #8
msalters wrote:
> [...] However, that is probably a 20+bit type. Unicode itself
> assigns numbers to characters, and the numbers have exceeded
> 65536. UTF-x means Unicode Transformation Format - x. These
> formats map each number to one or more x-bit values.
> E.g. UTF-8 maps the number of each unicode character to an
> octet sequence, with the additional property that the 0 byte
> isn't used for anything but number 0.

It has a lot more additional properties than that. Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.
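
That self-describing property can be sketched as a classifier that inspects one byte and nothing else. The enum and function names here are invented for illustration, and the lead-byte range follows RFC 3629:

```cpp
// What role a single byte plays in a UTF-8 stream.
enum class Utf8ByteKind { Single, Lead, Continuation, Invalid };

// Classify one byte without looking at its neighbours.
Utf8ByteKind classify_utf8_byte(unsigned char b) {
    if (b < 0x80) {
        return Utf8ByteKind::Single;        // 0xxxxxxx: single-byte character
    }
    if ((b & 0xC0) == 0x80) {
        return Utf8ByteKind::Continuation;  // 10xxxxxx: following byte
    }
    if (b >= 0xC2 && b <= 0xF4) {
        return Utf8ByteKind::Lead;          // first byte of a 2- to 4-byte sequence
    }
    // 0xC0, 0xC1 and 0xF5..0xFF never appear in well-formed UTF-8.
    return Utf8ByteKind::Invalid;
}
```

This is what lets code resynchronise after landing at an arbitrary offset: skip backwards or forwards over Continuation bytes until a Single or Lead byte turns up.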

> Now, these formats are intended for data transfer and not data
> processing. That in turn means UTF-8 should go somewhere in
> <iostream>, if it's added.


I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Sep 14 '05 #9
On 14 Sep 2005 14:40:21 GMT, "kanze" <ka***@gabi-soft.fr> wrote:

>> Now, these formats are intended for data transfer and not data
>> processing. That in turn means UTF-8 should go somewhere in
>> <iostream>, if it's added.
>
> I don't know where you find that these formats are intended just
> for data transfer. Depending on what the code is doing (and the
> text it has to deal with), the ideal solution may be UTF-8,
> UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
> appropriate, including internally, than any of the other
> formats. (It's also required in some cases.)


RFC 3629 says it this way:

"ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
encoding form, each character is represented as one or more encoding
units. All standard UCS encoding forms except UTF-8 have an encoding
unit larger than one octet, making them hard to use in many current
applications and protocols that assume 8 or even 7 bit characters."

Note that UTF-8 is intended to _encode_ a larger space, its primary purpose
being the compatibility of the encoded format with "applications and protocols"
that assume 8- or 7-bit characters. This suggests to me that UTF-8 was devised
so that Unicode text can be _passed through_ older protocols that only
understand 8- or 7-bit characters by encoding it at the input, and later
decoding it at the output to recover the original data.

If you want to _manipulate_ Unicode characters, however, why not deal with
them in their native, unencoded space? wchar_t is guaranteed to be wide enough
to contain all characters in all supported locales in the implementation, and
each character will have an equal size in memory.

-dr
Sep 15 '05 #10
