multibyte characters

TK
Hi,

How can I handle multibyte characters like ä, ü (German umlauts)?

This doesn't work:

switch(c)

case 'ä':
... some action
break;
case 'ü':
... some action
break;
....
....

Thanks for help.

o-o

Thomas
Nov 15 '07 #1
TK <to****@web.de> writes:
> How can I handle multibyte characters like ä, ü (German umlauts)?
>
> This doesn't work:
>
> switch(c)
>
> case 'ä':
> ... some action
> break;
> case 'ü':
> ... some action
> break;
> ...
> ...

wchar_t c;

with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
more portable.

--
Ben.
Nov 15 '07 #2
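
A minimal sketch of the wide-character approach suggested in the reply above. The helper name `describe` and its return strings are hypothetical, chosen only for illustration, and the sketch assumes the compiler's wide execution character set contains both characters (which is the caveat raised in the following replies).

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not from the thread: names the umlaut held in a
// wide character. The case labels use universal character names (UCNs),
// so this source file stays plain ASCII.
const char* describe(wchar_t c)
{
    switch (c) {
    case L'\u00e4':              // a-umlaut
        return "a-umlaut";
    case L'\u00fc':              // u-umlaut
        return "u-umlaut";
    default:
        return "other";
    }
}

// Usage sketch: the value being switched on must itself be a wide character.
void describe_examples()
{
    assert(std::strcmp(describe(L'\u00e4'), "a-umlaut") == 0);
    assert(std::strcmp(describe(L'\u00fc'), "u-umlaut") == 0);
    assert(std::strcmp(describe(L'x'), "other") == 0);
}
```

Note that this only helps if the input has already been converted to wide characters; bytes read from a file into a `char` buffer still have to be decoded first.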
Ben Bacarisse wrote:
> TK <to****@web.de> writes:
>> How can I handle multibyte characters like ä, ü (German umlauts)?
>>
>> This doesn't work:
>>
>> switch(c)
>>
>> case 'ä':
>> ... some action
>> break;
>> case 'ü':
>> ... some action
>> break;
>> ...
>> ...
>
> wchar_t c;
>
> with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
> more portable.

This all depends on which character encoding is being used. wchar_t is
not necessarily a Unicode character.
Nov 15 '07 #3
Gianni Mariani wrote:
>> with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
>> more portable.
>
> This all depends on which character encoding is being used. wchar_t is
> not necessarily a Unicode character.

And L'...' doesn't generate a MULTIBYTE character. It makes a wide
character.
Nov 15 '07 #4
On Nov 15, 3:00 pm, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
> TK <tok...@web.de> writes:
>> How can I handle multibyte characters like ä, ü (German umlauts)?
>> This doesn't work:
>> switch(c)
>> case 'ä':
>> ... some action
>> break;
>> case 'ü':
>> ... some action
>> break;
>> ...
>> ...
> wchar_t c;
> with L'ä' or L'\u00e4'. Using UCNs (the \uxxxx syntax) is probably
> more portable.

It depends (and his question is opening a can of worms). If
he's not interested in internationalization (the program will
only be used in German-speaking areas), then using wide
characters is overkill. Maybe. Independently of the question
wchar_t vs. char, the very first question is what encoding he is
using at execution time, and what encoding the compiler supposes
he is using. If, for example, he is using ISO 8859-1
everywhere, exactly what he has written might actually work---it
works with all the compilers I have here at work (where
everything is ISO 8859-1): g++, Sun CC and VC++. It probably
won't work on my Unix system at home, because there I use UTF-8.
If his environment uses UTF-8 anywhere, he'll have to find a
different solution: in UTF-8, 'ä' is a multi-byte character
(0xC3, 0xA4).
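
To make the byte counts concrete, here is a small sketch that spells the a-umlaut out in both encodings. The bytes are written as hex escapes so the result does not depend on how the source file is saved; the names `a_umlaut_latin1`, `a_umlaut_utf8` and `utf8_length` are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The same character, a-umlaut (U+00E4), in two encodings.
const std::string a_umlaut_latin1 = "\xE4";      // ISO 8859-1: one byte
const std::string a_umlaut_utf8   = "\xC3\xA4";  // UTF-8: two bytes

// Counts the characters in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Usage sketch: size() counts bytes, not characters.
void encoding_size_examples()
{
    assert(a_umlaut_latin1.size() == 1);
    assert(a_umlaut_utf8.size() == 2);
    assert(utf8_length(a_umlaut_utf8) == 1);
}
```

A `case 'ä':` label can only ever match one `char`, which is why it has a chance of working in the ISO 8859-1 case and not in the UTF-8 one.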

The solution he should probably adopt depends a lot on context.
If he can get away with only the characters in ISO 8859-1 (which
is sufficient for German---but he might have to handle proper
names with other characters in them), it's definitely easier to
code. If in addition he can configure his editor so that it
also writes all files in ISO 8859-1 (":set fileencoding=latin1"
in vim), and he is using one of the compilers I use (Sun CC, g++
or VC++), then he can even write the Umlauts in his source code
(but IMHO, that's pushing things a bit---I'd just use 0xE4,
etc.). If he has to deal with other characters, or with files
which might use other encodings, the problem becomes more
difficult. I usually use UTF-8, even internally, but which
encoding format to choose depends somewhat on what you are doing
with the text, and probably to some degree on the compiler as
well: for some jobs, you'll absolutely want UTF-32 (which means
using int32_t, and not wchar_t). Of course, if he's using
UTF-8, something like the above would have to be written using
an if/else chain, and not as a switch. If this only occurs
once, and there are only three or four cases in the switch, it's
no big deal; if it occurs in a lot of places, that's probably a
sign that UTF-8 is not the correct choice for your application.
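
The two styles described above might look as follows. This is only a sketch: the function names and return strings are hypothetical, the ISO 8859-1 version uses numeric values (0xE4, 0xFC) instead of literal umlauts, and the UTF-8 version assumes `p` points at the first byte of a character in a NUL-terminated string.

```cpp
#include <cassert>
#include <cstring>

// ISO 8859-1: every umlaut is a single byte, so a switch works.
const char* handle_latin1(unsigned char c)
{
    switch (c) {
    case 0xE4: return "a-umlaut";
    case 0xFC: return "u-umlaut";
    default:   return "other";
    }
}

// UTF-8: every umlaut is a two-byte sequence, so the comparison has to be
// an if/else chain over byte sequences rather than a switch.
const char* handle_utf8(const char* p)
{
    if (std::strncmp(p, "\xC3\xA4", 2) == 0) {
        return "a-umlaut";
    } else if (std::strncmp(p, "\xC3\xBC", 2) == 0) {
        return "u-umlaut";
    } else {
        return "other";
    }
}

// Usage sketch.
void handler_examples()
{
    assert(std::strcmp(handle_latin1(0xE4), "a-umlaut") == 0);
    assert(std::strcmp(handle_utf8("\xC3\xBC" "ber"), "u-umlaut") == 0);
    assert(std::strcmp(handle_utf8("abc"), "other") == 0);
}
```

With only a few cases the if/else chain is tolerable; an alternative for larger sets is to decode to code points first and switch on those.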

Regardless of the solution chosen, you have to consider four
encodings: that in the files you are reading and writing, that
which you use internally, that which the compiler assumes you
are using, and if you use the umlauted characters in your
source, that which the compiler uses to read your sources. Note
that L'\u00E4' isn't a panacea either. The compiler will
translate it into the a-Umlaut in whatever encoding it thinks
you are using internally. If the encoding it thinks you are
using is the one you are actually using, fine. If not,
however... If you know that you want to use Unicode, UTF-32
format, for example, your only portable solution is something
like:

#include <stdint.h>   // uint32_t (C99 header; <cstdint> in later C++)

typedef uint32_t UTF32Char ;
UTF32Char const aUmlaut = 0x00E4 ;
// ...

Of course, if you do this, you'll probably have to reimplement
large parts of iostream and locale as well.
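
As a sketch of how such constants can be used: because they are integral constants initialised with constant expressions, they are valid case labels. The helper `is_lower_umlaut` and the two extra constants are hypothetical additions, and the sketch assumes the input has already been decoded to UTF-32 code points by some other means.

```cpp
#include <cassert>
#include <stdint.h>

typedef uint32_t UTF32Char;

// Code points written as numbers, so their values do not depend on what
// the compiler believes the execution character set to be.
UTF32Char const aUmlaut = 0x00E4;
UTF32Char const oUmlaut = 0x00F6;
UTF32Char const uUmlaut = 0x00FC;

// Hypothetical helper: true for the three lower-case German umlauts.
bool is_lower_umlaut(UTF32Char c)
{
    switch (c) {
    case aUmlaut:
    case oUmlaut:
    case uUmlaut:
        return true;
    default:
        return false;
    }
}

// Usage sketch.
void utf32_examples()
{
    assert(is_lower_umlaut(0x00E4));
    assert(!is_lower_umlaut('a'));
}
```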

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Nov 16 '07 #5
TK
Thanks for help.

o-o

Thomas
Nov 16 '07 #6
On 2007-11-16 05:50:47 -0500, James Kanze <ja*********@gmail.com> said:
> typedef uint32_t UTF32Char ;

In the future this won't be necessary. The next C++ standard will
provide char16_t and char32_t, and appropriate specializations of
std::string, for UTF-16- and UTF-32-encoded characters.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Nov 16 '07 #7
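
For readers with a compiler that implements the types announced above (they were standardised in C++11), a sketch of the switch from the original question might look like this. `count_umlauts` is a hypothetical helper; `std::u32string` is the `basic_string` instantiation for `char32_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts lower-case German umlauts in a UTF-32 string.
// Each element of a std::u32string is one code point, so a plain switch
// on the element works.
std::size_t count_umlauts(const std::u32string& text)
{
    std::size_t n = 0;
    for (char32_t c : text) {
        switch (c) {
        case U'\u00E4':   // a-umlaut
        case U'\u00F6':   // o-umlaut
        case U'\u00FC':   // u-umlaut
            ++n;
            break;
        default:
            break;
        }
    }
    return n;
}

// Usage sketch: the literals use UCNs so the source stays plain ASCII.
void count_umlauts_examples()
{
    assert(count_umlauts(U"M\u00FCller f\u00E4hrt") == 2);
    assert(count_umlauts(U"plain ASCII") == 0);
}
```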
On Fri, 16 Nov 2007 08:24:37 -0500, Pete Becker wrote:
> On 2007-11-16 05:50:47 -0500, James Kanze <ja*********@gmail.com> said:
>> typedef uint32_t UTF32Char ;
>
> In the future this won't be necessary. The next C++ standard will
> provide char16_t and char32_t, and appropriate specializations of
> std::string, for UTF-16- and UTF-32-encoded characters.

Let's hope the next standard will also provide comprehensive transcoding
functionality between arbitrary encodings -- as part of the standard (lib)
I mean, not as part of another library -- because without that any string
types/classes it defines will be almost completely useless. And let's
also hope the new file-IO interface will understand these classes as well,
otherwise ditto.

Andreas
--
Dr. Andreas Dehmel Ceterum censeo
FLIPME(ed.enilno-t@nouqraz) Microsoft esse delendam
http://www.zarquon.homepage.t-online.de (Cato the Much Younger)
Nov 16 '07 #8
On 2007-11-16 13:48:08 -0500, Andreas Dehmel
<bl*******************@spamgourmet.com> said:
> On Fri, 16 Nov 2007 08:24:37 -0500, Pete Becker wrote:
>> On 2007-11-16 05:50:47 -0500, James Kanze <ja*********@gmail.com> said:
>>> typedef uint32_t UTF32Char ;
>>
>> In the future this won't be necessary. The next C++ standard will
>> provide char16_t and char32_t, and appropriate specializations of
>> std::string, for UTF-16- and UTF-32-encoded characters.
>
> Let's hope the next standard will also provide comprehensive transcoding
> functionality between arbitrary encodings -- as part of the standard (lib)
> I mean, not as part of another library -- because without that any string
> types/classes it defines will be almost completely useless. And let's
> also hope the new file-IO interface will understand these classes as well,
> otherwise ditto.

There's no new file-IO interface under discussion. As for the current
one, basic_fstream, it already deals with codecvt facets, and that's
the mechanism for translating between character encodings. There are
some new convenience classes for common conversions (see
www.versatilecoding.com for a quick overview), and there will be
builtin codecvt facets for a few common conversions in support of
Unicode.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Nov 16 '07 #9
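
The codecvt mechanism mentioned above can also be called directly, which shows what a file stream does internally when it reads through a locale. The following is a sketch only: `widen_with` is a hypothetical helper, and the usage example sticks to ASCII in the classic "C" locale so that it does not depend on which locales are installed. Converting umlauts this way would require a locale whose codecvt facet matches the encoding of the data, and locale names are platform-specific.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: converts a narrow string to a wide string using the
// codecvt facet of the given locale. Returns an empty string if the facet
// reports anything other than a complete, successful conversion.
std::wstring widen_with(const std::string& narrow, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(loc);

    if (narrow.empty()) {
        return std::wstring();
    }

    // One wide character per input byte is always enough room.
    std::wstring wide(narrow.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    Facet::result r = facet.in(state,
                               narrow.data(), narrow.data() + narrow.size(),
                               from_next,
                               &wide[0], &wide[0] + wide.size(),
                               to_next);
    if (r != Facet::ok) {
        return std::wstring();
    }
    wide.resize(static_cast<std::wstring::size_type>(to_next - &wide[0]));
    return wide;
}

// Usage sketch with the classic locale and ASCII input.
void widen_examples()
{
    assert(widen_with("abc", std::locale::classic()) == L"abc");
}
```

In normal use the same facet is reached indirectly, by imbuing a wide file stream with a suitable locale before opening the file.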
On Fri, 16 Nov 2007 14:08:04 -0500, Pete Becker wrote:
> On 2007-11-16 13:48:08 -0500, Andreas Dehmel
> <bl*******************@spamgourmet.com> said:
>> On Fri, 16 Nov 2007 08:24:37 -0500, Pete Becker wrote:
>>> On 2007-11-16 05:50:47 -0500, James Kanze <ja*********@gmail.com> said:
>>>> typedef uint32_t UTF32Char ;
>>> In the future this won't be necessary. The next C++ standard will
>>> provide char16_t and char32_t, and appropriate specializations of
>>> std::string, for UTF-16- and UTF-32-encoded characters.
>> Let's hope the next standard will also provide comprehensive transcoding
>> functionality between arbitrary encodings -- as part of the standard (lib)
>> I mean, not as part of another library -- because without that any string
>> types/classes it defines will be almost completely useless. And let's
>> also hope the new file-IO interface will understand these classes as well,
>> otherwise ditto.
>
> There's no new file-IO interface under discussion. As for the current
> one, basic_fstream, it already deals with codecvt facets, and that's
> the mechanism for translating between character encodings. There are
> some new convenience classes for common conversions (see
> www.versatilecoding.com for a quick overview), and there will be
> builtin codecvt facets for a few common conversions in support of
> Unicode.

You appear to be talking about the contents of files. As far as file-IO
is concerned I was talking about the names of files. The days when
filenames could be assumed to be US-ASCII or at least an 8-bit encoding
like the ISO-8859-* family are ancient history and ATM support for this
sort of thing is practically non-existent as far as the C/C++ standard
libs are concerned.

Andreas
--
Dr. Andreas Dehmel Ceterum censeo
FLIPME(ed.enilno-t@nouqraz) Microsoft esse delendam
http://www.zarquon.homepage.t-online.de (Cato the Much Younger)
Nov 16 '07 #10
On 2007-11-16 16:32:04 -0500, Andreas Dehmel
<bl*******************@spamgourmet.com> said:
> You appear to be talking about the contents of files. As far as file-IO
> is concerned I was talking about the names of files. The days when
> filenames could be assumed to be US-ASCII or at least an 8-bit encoding
> like the ISO-8859-* family are ancient history and ATM support for this
> sort of thing is practically non-existent as far as the C/C++ standard
> libs are concerned.

Wide-character file names were added to the specification for C++0x a
year or so ago. Might I suggest that you look at the draft standard
before complaining about what's not in it? The current draft is
available at
http://www.open-std.org/jtc1/sc22/wg...2007/n2461.pdf.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Nov 16 '07 #11
On Nov 16, 2:24 pm, Pete Becker <p...@versatilecoding.com> wrote:
> On 2007-11-16 05:50:47 -0500, James Kanze <james.ka...@gmail.com> said:
>> typedef uint32_t UTF32Char ;
> In the future this won't be necessary. The next C++ standard will
> provide char16_t and char32_t, and appropriate specializations of
> std::string, for UTF-16- and UTF-32-encoded characters.

So I've heard.

I've not been following too closely: will the implementation
also be required to provide the appropriate specializations for
basic_iostream et al, the facets, and basic_string?

And will these types be "conditional" (only available if the
implementation decides to support the encoding), or required on
all implementations?

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Nov 17 '07 #12
On 2007-11-17 04:51:48 -0500, James Kanze <ja*********@gmail.com> said:
> On Nov 16, 2:24 pm, Pete Becker <p...@versatilecoding.com> wrote:
>> On 2007-11-16 05:50:47 -0500, James Kanze <james.ka...@gmail.com> said:
>>> typedef uint32_t UTF32Char ;
>> In the future this won't be necessary. The next C++ standard will
>> provide char16_t and char32_t, and appropriate specializations of
>> std::string, for UTF-16- and UTF-32-encoded characters.
>
> So I've heard.
>
> I've not been following too closely: will the implementation
> also be required to provide the appropriate specializations for
> basic_iostream et al, the facets, and basic_string?
>
> And will these types be "conditional" (only available if the
> implementation decides to support the encoding), or required on
> all implementations?

Support for UTF-8, UTF-16, and UTF-32 is required, at the level of
having those typedefs, having the appropriate specializations for
basic_string, and a handful of other things to support conversions.
There's a brief sketch at www.versatilecoding.com.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Nov 17 '07 #13
On 2007-11-17 08:29:38 -0500, Pete Becker <pe**@versatilecoding.com> said:
> On 2007-11-17 04:51:48 -0500, James Kanze <ja*********@gmail.com> said:
>> On Nov 16, 2:24 pm, Pete Becker <p...@versatilecoding.com> wrote:
>>> On 2007-11-16 05:50:47 -0500, James Kanze <james.ka...@gmail.com> said:
>>>> typedef uint32_t UTF32Char ;
>>> In the future this won't be necessary. The next C++ standard will
>>> provide char16_t and char32_t, and appropriate specializations of
>>> std::string, for UTF-16- and UTF-32-encoded characters.
>>
>> So I've heard.
>>
>> I've not been following too closely: will the implementation
>> also be required to provide the appropriate specializations for
>> basic_iostream et al, the facets, and basic_string?
>>
>> And will these types be "conditional" (only available if the
>> implementation decides to support the encoding), or required on
>> all implementations?
>
> Support for UTF-8, UTF-16, and UTF-32 is required, at the level of
> having those typedefs, having the appropriate specializations for
> basic_string, and a handful of other things to support conversions.
> There's a brief sketch at www.versatilecoding.com.

Whoops, char16_t and char32_t will be builtin types, not typedefs.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference
(www.petebecker.com/tr1book)

Nov 17 '07 #14
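
The correction above (built-in types rather than typedefs) can be checked on a C++11 or later compiler. The sketch below also shows why the choice between UTF-16 and UTF-32 matters for per-character code such as a switch: a character outside the Basic Multilingual Plane takes two UTF-16 code units but one UTF-32 code unit. The function name `unit_count_examples` is illustrative, and the comparison against `std::uint16_t` and `std::uint32_t` assumes those optional exact-width types exist on the platform.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// char16_t and char32_t are distinct built-in types, not typedefs for
// integer types of the same size.
static_assert(!std::is_same<char16_t, std::uint16_t>::value,
              "char16_t is a type of its own");
static_assert(!std::is_same<char32_t, std::uint32_t>::value,
              "char32_t is a type of its own");

// The matching string types are ordinary basic_string instantiations.
static_assert(std::is_same<std::u16string,
                           std::basic_string<char16_t> >::value,
              "u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string,
                           std::basic_string<char32_t> >::value,
              "u32string is basic_string<char32_t>");

// Usage sketch: size() counts code units, not characters.
void unit_count_examples()
{
    // a-umlaut (U+00E4) fits in one code unit in both encodings.
    assert(std::u16string(u"\u00E4").size() == 1);
    assert(std::u32string(U"\u00E4").size() == 1);

    // U+1D11E (musical G clef) needs a surrogate pair in UTF-16.
    assert(std::u16string(u"\U0001D11E").size() == 2);
    assert(std::u32string(U"\U0001D11E").size() == 1);
}
```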
