
std::string vs. Unicode UTF-8

I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.
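
A minimal sketch of that counting loop, assuming the string holds valid
UTF-8 (the helper name is mine, not from any library): every byte that is
not a continuation byte (10xxxxxx) starts a new code point, so counting
those bytes gives the character count.

#include <cstddef>
#include <string>

// Count code points in a UTF-8 encoded std::string by skipping
// continuation bytes (bytes of the form 10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}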

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger
--


Sep 13 '05 #1
> The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.


It is much easier to handle Unicode strings with wchar_t internally, and
there is much less confusion about whether the string is ANSI or UTF-8
encoded. So I have started using wchar_t wherever I can and I only use
UTF-8 for external communication.
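
A rough sketch of that workflow (the function name and the locale string
are my own assumptions): decode incoming UTF-8 into a wstring at the
program boundary using the C library conversion, which only works if a
UTF-8 locale is actually installed on the system.

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Decode an externally received UTF-8 string into wchar_t for internal use.
std::wstring from_utf8(const std::string& utf8)
{
    std::setlocale(LC_CTYPE, "en_US.UTF-8");   // assumed to be available
    std::vector<wchar_t> buf(utf8.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], utf8.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();                 // invalid multibyte sequence
    return std::wstring(&buf[0], n);
}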

Niels Dybdahl

Sep 14 '05 #2
Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger


UTF-8 is only an encoding, so why do you think strings internal to the
program should be represented as UTF-8? It makes more sense to me to
translate to or from UTF-8 when you input or output strings from your
program. C++ already has the framework in place for that.
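
For example, a sketch of converting at the input boundary, assuming the
platform provides a locale named "en_US.UTF-8" whose codecvt facet decodes
UTF-8 to wchar_t (common on Unix-like systems, but not guaranteed by the
standard):

#include <fstream>
#include <locale>
#include <string>

int main()
{
    // The stream decodes UTF-8 bytes into wchar_t while reading, so the
    // rest of the program works with wide characters.
    std::wifstream in("input.txt");
    in.imbue(std::locale("en_US.UTF-8"));

    std::wstring line;
    while (std::getline(in, line)) {
        // line.length() counts characters here, not UTF-8 bytes
        // (for code points representable in a single wchar_t).
    }
    return 0;
}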

john


Sep 14 '05 #3

Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.

Correct. Also you can't print it or anything else.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

Ok.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version?

It already is - using e.g. wchar_t.

I18N is an important topic nowadays and I simply see no logical
reason to keep std::string as limited as it is nowadays.

It is not limited.

Of course there is also the wchar_t variant, but actually I don't
like that.

So you'd like to have Unicode support. And you realize you already
have it. But you don't like it. Why?
--

/Peter


Sep 14 '05 #4
On Tue, 13 Sep 2005 04:20:30 GMT, wd********@darkstargames.de
(Wolfgang Draxinger) wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.

Not only that, but substr(), operator[] etc. pose equally
"interesting" problems.
To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger


People use std::string in many different ways. You can even store
binary data with embedded null characters in it. I don't know for
sure, but I believe there are already proposals in front of the C++
standards committee for what you suggest. In the meantime, it might
make more sense to use a third-party UTF-8 string class if that is
what you mainly use it for. IBM has released the ICU library as open
source, for example, and it is widely used these days.

--
Bob Hairgrove
No**********@Ho me.com


Sep 14 '05 #5

"Wolfgang Draxinger" <wd********@dar kstargames.de> wrote in message
news:q2******** ***@darkstargam es.dnsalias.net ...
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.

To address this problem the GTKmm bindings for the GTK+ toolkit
have implemented their own string class, Glib::ustring
<http://tinyurl.com/bxpu4>, which takes care of UTF-8 in strings.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.

Wolfgang Draxinger


That's why people have std::wstring :)

Ben

Sep 14 '05 #6
Wolfgang Draxinger wrote:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.

Yup. That's what happens when you use the wrong tool.

The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays.
There's much more to internationalization than Unicode. Requiring
std::string to be Unicode aware (presumably that means UTF-8 aware)
would impose implementation overhead that's not needed for the kinds of
things it was designed for, like the various ISO 8859 code sets. In
general, neither string nor wstring knows anything about multi-character
encodings. That's for efficiency. Do the translation on input and output.

Of course there is
also the wchar_t variant, but actually I don't like that.


That's unfortunate, since it's exactly what wchar_t and wstring were
designed for. What is your objection to them?

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)


Sep 14 '05 #7

Wolfgang Draxinger schreef:
I understand that it is perfectly possible to store UTF-8 strings
in a std::string, however doing so has some implications.
E.g. you can't count the number of characters with length() or
size(); instead one has to iterate through the string, parse all
UTF-8 multibyte sequences, and count each sequence as one character.

Usually correct, but not always. A char is a byte in C++, but
a byte might not be an octet. UTF-8 is of course octet-based.
The question is, wouldn't it be logical to make std::string
Unicode aware in the next STL version? I18N is an important
topic nowadays and I simply see no logical reason to keep
std::string as limited as it is nowadays. Of course there is
also the wchar_t variant, but actually I don't like that.


wchar_t isn't always Unicode, either. There's a proposal to add an
extra Unicode char type, and that probably will include std::ustring.

However, that is probably a 20+ bit type. Unicode itself assigns
numbers to characters, and the numbers have exceeded 65536.
UTF-x means Unicode Transformation Format - x. These formats
map each number to one or more x-bit values. E.g. UTF-8 maps
the number of each Unicode character to an octet sequence,
with the additional property that the 0 byte isn't used for
anything but number 0.
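
A sketch of that mapping (the function name is mine; surrogate and range
checks are omitted for brevity):

#include <string>

// Encode one Unicode scalar value (up to U+10FFFF) as one to four octets.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                        // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));          // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));        // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));         // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));         // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}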

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.

HTH,
Michiel Salters


Sep 14 '05 #8
msalters wrote:
Wolfgang Draxinger schreef:
[...] However, that is probably a 20+bit type. Unicode itself
assigns numbers to characters, and the numbers have exceeded
65536. UTF-x means Unicode Transformation Format - x. These
formats map each number to one or more x-bit values.
E.g. UTF-8 maps the number of each unicode character to an
octet sequence, with the additional property that the 0 byte
isn't used for anything but number 0.
It has a lot more additional properties than that. Like the
fact that you can immediately tell whether a byte is a single
byte character, the first byte of a multibyte sequence, or a
following byte in a multibyte sequence, without looking beyond
just that byte.
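
A sketch of that per-byte test (the enumerator names are mine):

enum Utf8ByteKind { SingleByte, LeadByte, ContinuationByte, InvalidByte };

// Determine a byte's role in a UTF-8 stream from its high bits alone.
Utf8ByteKind classify(unsigned char b)
{
    if (b < 0x80)            return SingleByte;        // 0xxxxxxx
    if ((b & 0xC0) == 0x80)  return ContinuationByte;  // 10xxxxxx
    if ((b & 0xE0) == 0xC0 ||                          // 110xxxxx (2 bytes)
        (b & 0xF0) == 0xE0 ||                          // 1110xxxx (3 bytes)
        (b & 0xF8) == 0xF0)  return LeadByte;          // 11110xxx (4 bytes)
    return InvalidByte;                                // 0xF8..0xFF never occur
}
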
Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.


I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Sep 14 '05 #9
On 14 Sep 2005 14:40:21 GMT, "kanze" <ka***@gabi-soft.fr> wrote:

Now, these formats are intended for data transfer and not data
processing. That in turn means UTF-8 should go somewhere in
<iostream>, if it's added.


I don't know where you find that these formats are intended just
for data transfer. Depending on what the code is doing (and the
text it has to deal with), the ideal solution may be UTF-8,
UTF-16 or UTF-32. For most of what I do, UTF-8 would be more
appropriate, including internally, than any of the other
formats. (It's also required in some cases.)


RFC 3629 says it this way:

"ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
encoding form, each character is represented as one or more encoding
units. All standard UCS encoding forms except UTF-8 have an encoding
unit larger than one octet, making them hard to use in many current
applications and protocols that assume 8 or even 7 bit characters."

Note that UTF-8 is intended to _encode_ a larger space, its primary purpose
being the compatibility of the encoded format with "applications and protocols"
that assume 8- or 7-bit characters. This suggests to me that UTF-8 was devised
so that Unicode text can be _passed through_ older protocols that only
understand 8- or 7-bit characters by encoding it at the input, and later
decoding it at the output to recover the original data.

If you want to _manipulate_ Unicode characters, however, why not deal with
them in their native, unencoded space? wchar_t is guaranteed to be wide enough
to contain all characters in all supported locales in the implementation, and
each character will have an equal size in memory.
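
For example (a sketch; note that on platforms where wchar_t is only 16 bits,
characters outside the BMP still need surrogate pairs):

#include <string>

std::wstring w = L"\u00e9tude";   // "étude" as wide characters
// w.size() == 5            -- one element per character
// w[0]     == L'\u00e9'    -- the whole character, not a fragment of it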

-dr
Sep 15 '05 #10

