Using wchar_t instead of char

I guess this question only applies to programming applications for UNIX,
Windows and similar. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible.
Is there any reason I should not use wchar_t for all my future programs?

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well. But if you use wchar_t you don't need to rely on UTF-8, which
makes it more portable, correct?

(I of course do not mean just the type wchar_t, but all of the things
in wide character land)

Thanks

--
Michael Brennan

Jul 8 '08 #1
16 Replies


Michael Brennan wrote:
>
I guess this question only applies to programming applications for
UNIX, Windows and similar. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used in as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8,
which makes it more portable, correct?
I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
Jul 8 '08 #2

On Tue, 08 Jul 2008 21:12:54 +0000, Michael Brennan wrote:
I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible. Is
there any reason I should not use wchar_t for all my future programs?

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well. But if you use wchar_t you don't need to rely on UTF-8, which
makes it more portable, correct?
wchar_t is 32 bits on my system. That's a lot of space to use when I
only need 7. Also, there aren't many widely distributed apps using
wchar_t; editors, for one example.

More fundamentally all sorts of I/O is done specifically in 8 bit bytes.
IP is 8 bit based, as are files under Linux and most other operating
systems. The problem is that it is very difficult to do a partial
changeover. Every application would spend half of its time and code
converting back and forth, and then what do you do when the conversion
fails? How long in wchar_t is a seven-byte file? One, perhaps, but then you
have to add a whole load of error handling code to every part of the
program that interfaces with the char based world.

In C, memory is always dealt with in sizeof(char) units. Life might be
made easier for the C programmer in a UTF16/24/32 world by increasing
CHAR_BIT, but you still have the problems when you interface with the
rest of the world.
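
A minimal sketch of that interfacing problem (illustrative only; it
assumes a locale has been selected with setlocale(LC_ALL, "") and uses
nothing beyond C90 <stdlib.h>):

    #include <stdlib.h>

    /* How long, in wchar_t, is a buffer of bytes from the char-based
       world?  You cannot know without converting, and the conversion
       itself can fail on an invalid or truncated sequence. */
    long wide_length(const char *buf, size_t len)
    {
        wchar_t wc;
        long count = 0;

        mbtowc(NULL, NULL, 0);      /* reset any conversion state */
        while (len > 0) {
            int used = mbtowc(&wc, buf, len);
            if (used < 0)
                return -1;          /* bad sequence: now what? */
            if (used == 0)          /* embedded '\0' ends the text */
                break;
            buf += used;
            len -= used;
            count++;
        }
        return count;
    }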
Jul 8 '08 #3

CBFalconer <cb********@yahoo.com> writes:
Michael Brennan wrote:
>>
I guess this question only applies to programming applications for
UNIX, Windows and similar. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used in as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8,
which makes it more portable, correct?

I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.
I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

http://www.lysator.liu.se/c/rat/title.html

As soon as anyone with a copy to hand tells me otherwise, I will
withdraw, but then again maybe someone will back me up.

--
Ben.
Jul 9 '08 #4

Michael Brennan <br************@gmail.com> writes:
I guess this question only applies to programming applications for UNIX,
Windows and similar. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.
I'd be very surprised if this were true, but I do not know much about
embedded systems. My audio player seems to support all sorts of
characters.
I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible.
Is there any reason I should not use wchar_t for all my future
programs?
It is not a simple "use one or the other".
I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.
Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.
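
To make the counting point concrete, a sketch (illustrative only;
longest_run is not a standard function): once the text is an array of
wchar_t, finding character positions is plain indexing, which it is
not in UTF-8.

    #include <stddef.h>

    /* Longest run of the character c in ws[0..n-1].  The same loop
       over raw UTF-8 bytes would miscount any character encoded in
       more than one byte. */
    size_t longest_run(const wchar_t *ws, size_t n, wchar_t c)
    {
        size_t best = 0, run = 0, i;

        for (i = 0; i < n; i++) {
            run = (ws[i] == c) ? run + 1 : 0;
            if (run > best)
                best = run;
        }
        return best;
    }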
But if you use wchar_t you don't need to rely on UTF-8, which
makes it more portable, correct?
It is one of the components you need. Another is to use C's locale
support. How portable you can be depends on what systems you are
targeting since not all of the features of C99's wide character
support are available on all compiler/library combinations. In fact,
the maximally portable set of things you can do with a wchar_t (or an
array of them) is very small. Here I hope an expert steps in and gives
you real experience-based wisdom about portable use of wide-character
support.
(I of course do not mean just the type wchar_t, but all of the things
in wide character land)
--
Ben.
Jul 9 '08 #5

Ben Bacarisse wrote:
CBFalconer <cb********@yahoo.com> writes:
>Michael Brennan wrote:
>>>
I guess this question only applies to programming applications for
UNIX, Windows and similar. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used in as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8,
which makes it more portable, correct?

I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:
I am basing it on this excerpt from the C99 standard (N869):

[#5] This edition replaces the previous edition, ISO/IEC
9899:1990, as amended and corrected by ISO/IEC
9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
9899/AMD1:1995. Major changes from the previous edition
include:

-- restricted character set support in <iso646.h>
(originally specified in AMD1)

-- wide-character library support in <wchar.h> and
<wctype.h> (originally specified in AMD1)

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
Jul 9 '08 #6

On Tue, 08 Jul 2008 21:02:34 -0400, CBFalconer wrote:
Ben Bacarisse wrote:
>CBFalconer <cb********@yahoo.com> writes:
>>Michael Brennan wrote:
I believe that wchar etc. are only available in C99. Using them may
seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

I am basing it on this excerpt from the C99 standard (N869):

[#5] This edition replaces the previous edition, ISO/IEC
9899:1990, as amended and corrected by ISO/IEC
9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
9899/AMD1:1995. Major changes from the previous edition include:

-- restricted character set support in <iso646.h>
(originally specified in AMD1)

-- wide-character library support in <wchar.h> and
<wctype.h> (originally specified in AMD1)
The headers specified in that excerpt and all functions declared within
are indeed new in AMD1/C99.

The type wchar_t (from <stddef.h>) was present in C90. Additionally, the
library functions mblen, mbtowc, wctomb, mbstowcs and wcstombs are
available from <stdlib.h>.

AMD1 is fairly widely implemented, anyway.
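
For illustration, a complete round trip using only those C90
facilities might look like this (a sketch; the buffer sizes are
arbitrary):

    #include <stdlib.h>
    #include <locale.h>
    #include <stdio.h>

    /* Multibyte -> wide -> multibyte with C90 <stdlib.h> alone;
       no <wchar.h> required. */
    int main(void)
    {
        wchar_t wide[32];
        char back[64];

        setlocale(LC_ALL, "");
        if (mbstowcs(wide, "hello", 32) == (size_t)-1)
            return EXIT_FAILURE;
        if (wcstombs(back, wide, 64) == (size_t)-1)
            return EXIT_FAILURE;
        puts(back);                 /* prints "hello" */
        return 0;
    }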
Jul 9 '08 #7

On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
Michael Brennan <br************@gmail.com> writes:
>I guess this question only applies to programming applications for UNIX,
Windows and similar. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.

I'd be very surprised if this were true, but I do not know much about
embedded systems. My audio player seems to support all sorts of
characters.
My mistake, please ignore what I said about that.
>I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible.
Is there any reason I should not use wchar_t for all my future
programs?

It is not a simple "use one or the other".
No, I understand now that it's more complicated, unfortunately.
>I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.

Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.
>But if you use wchar_t you don't need to rely on UTF-8, which
makes it more portable, correct?

It is one of the components you need. Another is to use C's locale
support. How portable you can be depends on what systems you are
targeting since not all of the features of C99's wide character
support are available on all compiler/library combinations. In fact,
the maximally portable set of things you can do with a wchar_t (or an
array of them) is very small. Here I hope an expert steps in and gives
you real experience-based wisdom about portable use of wide-character
support.
This isn't easy: I need to rely on C99 stuff, and according to viza
the programs will be inefficient. I always aim for writing portable
programs, but I also need to be able to use CJK characters, so I'm not
really sure what to do here.

I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?

--
Michael Brennan

Jul 9 '08 #8

On Wed, 09 Jul 2008 11:19:57 +0000, Michael Brennan wrote:
On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
>Michael Brennan <br************@gmail.com> writes:
I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find text
editors that can read and write the file more easily.

Just a thought. As you've realised there isn't a perfect solution.

viza
Jul 9 '08 #9

On Wed, 09 Jul 2008 11:39:08 +0000, viza wrote:
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find
text editors that can read and write the file more easily.
Isn't UTF16 a variable-length format?
Rui Maciel
Jul 9 '08 #10

Michael Brennan <br************@gmail.com> writes:

<snip>
I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
First, C does not assume UTF-8 though it is clearly the most likely
multi-byte string encoding you will come across. When talking about
standard, portable C, the choice is about if, and when, to convert
between wide and multi-byte sequences.

Secondly, do you have a choice about the input? You suggest that it
is in a file, so you may have no choice about the input, but the
problem sounds like an assignment so maybe you get to choose the input
encoding.

Either way, it does not sound as if either the wasted space of always
using wide characters or the extra complexity of having multi-byte
strings really matters for your application. If you get to choose,
pick one and be happy. If you don't get to choose, go with what is
mandated and don't convert.

When I say "pick one" I don't mean at random. Different environments
will favour different encodings. If your input will be prepared by an
editor that makes entering Japanese as wide characters easy, then that
would be a reason to choose wide character input.

In general, if your input arrives as multi-byte strings, keep it that way.
A typical reason to convert to wchar_t would be if you need to match it
against other data that is already wchar_t or if your processing
requires frequent access to single characters.

It is much rarer to convert data that is already wide to
multi-byte strings. You may save some space, you might not. You will
end up with slightly more complex character processing.
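
A sketch of that shape of program (illustrative only; the buffer
sizes and the choice to skip undecodable lines are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>

    /* Keep the file multi-byte; convert a line to wide characters
       only when per-character access is needed. */
    int main(void)
    {
        char line[256];
        wchar_t wline[256];
        size_t n;

        setlocale(LC_ALL, "");
        while (fgets(line, sizeof line, stdin) != NULL) {
            n = mbstowcs(wline, line, 256);
            if (n == (size_t)-1)
                continue;           /* undecodable line: skip it */
            /* wline[0] .. wline[n-1] may now be indexed directly */
        }
        return 0;
    }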

--
Ben.
Jul 9 '08 #11

In article <48***********************@news.telepac.pt>,
Rui Maciel <ru********@gmail.com> wrote:
>Isn't UTF16 a variable-length format?
Yes, though if you don't need to interpret the characters above 0xFFFF
you can pretend it isn't.
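
For the curious, the variable-length part is just the surrogate-pair
rule. A sketch (it assumes well-formed input and, as suggested
upthread, unsigned short units):

    #include <stddef.h>

    /* Decode one code point from UTF-16: anything above 0xFFFF takes
       two units (a surrogate pair), everything else takes one. */
    unsigned long utf16_decode(const unsigned short *u, size_t *used)
    {
        if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {   /* high surrogate */
            *used = 2;
            return 0x10000UL
                 + ((unsigned long)(u[0] - 0xD800) << 10)
                 + (unsigned long)(u[1] - 0xDC00);
        }
        *used = 1;
        return u[0];
    }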

-- Richard

--
Please remember to mention me / in tapes you leave behind.
Jul 9 '08 #12

On Jul 9, 1:13 am, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
Michael Brennan <brennan.bri...@gmail.com> writes:
It is not a simple "use one or the other".
I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.

Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.
Indeed. I've worked, a while ago, on code for index creation and
scanning, porting it from an 8-bit character set to Unicode. In that
case, the context required the storage to be in UTF-8. In memory we
would do on-the-fly conversion to UTF-32 to do pattern matching,
counting, normalization (that's a veritable Pandora's box) and
whatever else was required. For this we used IBM ICU (International
Components for Unicode), an IBM-developed library with a very
permissive license that still seems to be actively maintained.

Developing for Unicode does seem to require putting a lot of thought
into how the application interacts with the environment, and the
fewer assumptions you make about the environment, the hairier it gets.
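
ICU hides these details, but the on-the-fly UTF-8 to UTF-32 step
itself is small. A hand-rolled sketch (this is not the ICU interface,
and it assumes well-formed input):

    #include <stddef.h>

    /* Decode one code point of well-formed UTF-8 into UTF-32.  Real
       code must also reject overlong, truncated and invalid forms. */
    unsigned long utf8_decode(const unsigned char *s, size_t *used)
    {
        if (s[0] < 0x80) {                        /* 1 byte: ASCII */
            *used = 1;
            return s[0];
        }
        if ((s[0] & 0xE0) == 0xC0) {              /* 2 bytes */
            *used = 2;
            return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        }
        if ((s[0] & 0xF0) == 0xE0) {              /* 3 bytes */
            *used = 3;
            return ((unsigned long)(s[0] & 0x0F) << 12)
                 | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        }
        *used = 4;                                /* 4 bytes */
        return ((unsigned long)(s[0] & 0x07) << 18)
             | ((unsigned long)(s[1] & 0x3F) << 12)
             | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    }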

Stijn
Jul 9 '08 #13

viza <to******@gm-il.com.obviouschange.invalid> wrote:
On Wed, 09 Jul 2008 11:19:57 +0000, Michael Brennan wrote:
On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
Michael Brennan <br************@gmail.com> writes:
I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find text
editors that can read and write the file more easily.
There's no such thing as fixed-width Unicode characters. Neither 8,
16, 32, 128, nor 1024 bits is enough. I can encode many Latin-1
characters using two or more wchar_t objects. Even after normalization
you still can't fit _all_ such characters into a single wchar_t
(whether 16 bits, 32 bits or otherwise). Read about how Chinese,
Japanese and Thai scripts are encoded, and you'll begin to see the
issues. You can get deceptively close, but not all the way. It's
simply impossible.
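
A concrete sketch of that claim (it assumes wchar_t holds Unicode
code points, which the C standard does not guarantee):

    #include <stdio.h>
    #include <wchar.h>

    /* The same user-perceived character, e-acute, stored as one code
       point or as 'e' plus a combining acute accent. */
    int main(void)
    {
        wchar_t precomposed[] = { 0x00E9, 0 };
        wchar_t decomposed[]  = { 0x0065, 0x0301, 0 };

        printf("%lu vs %lu\n",
               (unsigned long)wcslen(precomposed),
               (unsigned long)wcslen(decomposed)); /* 1 vs 2 */
        return 0;
    }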

In other words, there's no easy way out. To do things properly, you need to
separate your applications into two distinct components. One which operates
on opaque byte streams, and another which employs a comprehensive multi-byte
string handling interface, like ICU, which provides logical operations on
string objects or streams. It's a pipe dream to think you'll ever be able to
use pointer arithmetic to calculate "string length", or parse "words" with
iswspace() in a robust internationalized application.

Now, if you have a constrained environment that can make certain other
guarantees about data input, then use whatever. But wchar_t is not a generic
solution, not by a long shot. You can pretend it is. Lots of people do.
That's because they never get to hear the endless complaints from VARs in
Asia, and enjoy blissful ignorance.
Jul 9 '08 #14

Keith Thompson <ks***@mib.org> wrote:
Rui Maciel <ru********@gmail.com> writes:
On Wed, 09 Jul 2008 11:39:08 +0000, viza wrote:
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find
text editors that can read and write the file more easily.
Isn't UTF16 a variable-length format?
Yes, but it's effectively fixed-length if you only use characters
within the "Basic Multilingual Plane".
Even if you can get away with counting "characters" in the BMP, how do you
parse words? It's a nonsense question, because you have to dispense with
such simplistic textual constructs. The point is that unless you use UTF-16
as a glorified ASCII, you _should_ immediately start to think clearly about
what you're doing exactly with your strings. There's no simple solution.
Unless you go whole-hog with proper Unicode string handling, any answer must
be carefully tailored to the specific context of a project. Suggesting that
the BMP provides equivalence guarantees to traditional string handling is,
well, just plain wrong.

You still need to normalize input even to count characters in the
traditional fashion, at which point you've already linked in something other
than even the most bloated of libc's.

Jul 9 '08 #15

In article <fm************@wilbur.25thandClement.com>,
William Ahern <wi*****@wilbur.25thandClement.com> wrote:
>There's no such thing as fixed-width Unicode characters. Neither 8,
16, 32, 128, nor 1024 bits is enough.
32 bits is plenty for Unicode.

A more accurate claim would be about the sufficiency of Unicode.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
Jul 9 '08 #16

On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
Michael Brennan <br************@gmail.com> writes:

<snip>
>I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?

First, C does not assume UTF-8 though it is clearly the most likely
multi-byte string encoding you will come across. When talking about
standard, portable C, the choice is about if, and when, to convert
between wide and multi-byte sequences.

Secondly, do you have a choice about the input? You suggest that it
is in a file, so you may have no choice about the input, but the
problem sounds like an assignment so maybe you get to choose the input
encoding.

Either way, it does not sound as if either the wasted space of always
using wide characters or the extra complexity of having multi-byte
strings really matters for your application. If you get to choose,
pick one and be happy. If you don't get to choose, go with what is
mandated and don't convert.

When I say "pick one" I don't mean at random. Different environments
will favour different encodings. If your input will be prepared by an
editor that makes entering Japanese as wide characters easy, then that
would be a reason to choose wide character input.

In general, if your input arrives as multi-byte strings, keep it that way.
A typical reason to convert to wchar_t would be if you need to match it
against other data that is already wchar_t or if your processing
requires frequent access to single characters.

It is much rarer to convert data that is already wide to
multi-byte strings. You may save some space, you might not. You will
end up with slightly more complex character processing.
Thank you, and everyone else!

--
Michael Brennan

Jul 10 '08 #17
