Bytes | Software Development & Data Engineering Community
Using wchar_t instead of char

I guess this question only applies to programming applications for UNIX,
Windows and similar. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used in as many places as possible.
Is there any reason I should not use wchar_t for all my future programs?

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well. But if you use wchar_t you don't need to rely on UTF-8, and that
makes it more portable, correct?

(I of course do not mean just the type wchar_t, but all of the things
in wide character land)

Thanks

--
Michael Brennan

Jul 8 '08 #1
Michael Brennan wrote:
>
I guess this question only applies to programming applications for
UNIX, Windows and similiar. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used at as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8
and thus makes it more portable, correct?
I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
Jul 8 '08 #2
On Tue, 08 Jul 2008 21:12:54 +0000, Michael Brennan wrote:
I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used at as many places as possible. Is
there any reason I should not use wchar_t for all my future programs?

I am aware that on UNIX at least, if you use UTF-8, char works pretty
well. But if you use wchar_t you don't need to rely on UTF-8 and thus
makes it more portable, correct?
wchar_t is 32 bits on my system. That's a lot of space to use when I
only need 7. Also, there aren't many widely distributed apps that use
wchar_t; editors, to take just one example.

More fundamentally, all sorts of I/O is done specifically in 8-bit bytes.
IP is 8-bit based, as are files under Linux and most other operating
systems. The problem is that it is very difficult to do a partial
changeover. Every application would spend half of its time and code
converting back and forth, and then what do you do when the conversion
fails? How long, in wchar_t units, is a seven-byte file? One, perhaps, but
then you have to add a whole load of error-handling code to every part of
the program that interfaces with the char-based world.

In C, memory is always dealt with in sizeof(char) units. Life might be
made easier for the C programmer in a UTF16/24/32 world by increasing
CHAR_BIT, but you still have the problems when you interface with the
rest of the world.
Jul 8 '08 #3
CBFalconer <cb********@yahoo.com> writes:
Michael Brennan wrote:
>>
I guess this question only applies to programming applications for
UNIX, Windows and similiar. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used at as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8
and thus makes it more portable, correct?

I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.
I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed[1]. Do you have a reference to C90
without wchar_t? All I can cite is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

http://www.lysator.liu.se/c/rat/title.html

As soon as anyone with a copy to hand tells me otherwise, I will
withdraw, but then again maybe someone will back me up.

--
Ben.
Jul 9 '08 #4
Michael Brennan <br************@gmail.com> writes:
I guess this question only applies to programming applications for UNIX,
Windows and similiar. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.
I'd be very surprised if this were true, but I do not know much about
embedded systems. My audio player seems to support all sorts of
characters.
I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used at as many places as possible.
Is there any reason I should not use wchar_t for all my future
programs?
It is not a simple "use one or the other".
I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.
Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.
But if you use wchar_t you don't need to rely on UTF-8 and thus
makes it more portable, correct?
It is one of the components you need. Another is to use C's locale
support. How portable you can be depends on what systems you are
targeting since not all of the features of C99's wide character
support are available on all compiler/library combinations. In fact,
the maximally portable set of things you can do with a wchar_t (or an
array of them) is very small. Here I hope an expert steps in and gives
you real experience-based wisdom about portable use of wide-character
support.
(I of course do not mean just the type wchar_t, but all of the things
in wide character land)
--
Ben.
Jul 9 '08 #5
Ben Bacarisse wrote:
CBFalconer <cb********@yahoo.com> writes:
>Michael Brennan wrote:
>>>
I guess this question only applies to programming applications for
UNIX, Windows and similiar. If one develops something for an
embedded system I can understand that wchar_t would be unnecessary.

I wonder if there is any point in using char over wchar_t? I don't
see much code using wchar_t when reading other people's code (but
then I haven't really looked much) or when following this newsgroup.
To me it sounds reasonable to make sure your program can handle
multibyte characters so that it can be used at as many places as
possible. Is there any reason I should not use wchar_t for all my
future programs?

I am aware that on UNIX at least, if you use UTF-8, char works
pretty well. But if you use wchar_t you don't need to rely on UTF-8
and thus makes it more portable, correct?

I believe that wchar etc. are only available in C99. Using them
may seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed[1]. Do you have a reference to C90
without wchar_t? All I can site is online versions of the ANSI
standard as a .txt file and the C90 rationale at:
I am basing it on this excerpt from the C99 standard (N869):

[#5] This edition replaces the previous edition, ISO/IEC
9899:1990, as amended and corrected by ISO/IEC
9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
9899/AMD1:1995. Major changes from the previous edition
include:

-- restricted character set support in <iso646.h>
(originally specified in AMD1)

-- wide-character library support in <wchar.h> and
<wctype.h> (originally specified in AMD1)

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.
Jul 9 '08 #6
On Tue, 08 Jul 2008 21:02:34 -0400, CBFalconer wrote:
Ben Bacarisse wrote:
>CBFalconer <cb********@yahoo.com> writes:
>>Michael Brennan wrote:
I believe that wchar etc. are only available in C99. Using them may
seriously reduce your code portability.

I don't have a real copy of ISO C90 (ANSI C 89) so I am winging it a
bit, but I am pretty sure that wchar_t was in there. C95 added some
more related things (all of which ended up in C99) but using wchar_t
should be very portable indeed[1]. Do you have a reference to C90
without wchar_t? All I can site is online versions of the ANSI
standard as a .txt file and the C90 rationale at:

I am basing it on this excerpt from the C99 standard (N869):

[#5] This edition replaces the previous edition, ISO/IEC
9899:1990, as amended and corrected by ISO/IEC
9899/COR1:1994, ISO/IEC 9899/COR2:1995, and ISO/IEC
9899/AMD1:1995. Major changes from the previous edition include:

-- restricted character set support in <iso646.h>
(originally specified in AMD1)

-- wide-character library support in <wchar.h> and
<wctype.h> (originally specified in AMD1)
The headers specified in that excerpt and all functions declared within
are indeed new in AMD1/C99.

The type wchar_t (from <stddef.h>) was present in C90. Additionally, the
library functions mblen, mbtowc, wctomb, mbstowcs and wcstombs are
available from <stdlib.h>.

AMD1 is fairly widely implemented, anyway.
Jul 9 '08 #7
On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
Michael Brennan <br************@gmail.com> writes:
>I guess this question only applies to programming applications for UNIX,
Windows and similiar. If one develops something for an embedded system
I can understand that wchar_t would be unnecessary.

I'd be very surprised if this were true, but I do not know much about
embedded systems. My audio player seems to support all sorts of
characters.
My mistake, please ignore what I said about that.
>I wonder if there is any point in using char over wchar_t? I don't see
much code using wchar_t when reading other people's code (but then I
haven't really looked much) or when following this newsgroup. To me it
sounds reasonable to make sure your program can handle multibyte
characters so that it can be used at as many places as possible.
Is there any reason I should not use wchar_t for all my future
programs?

It is not a simple "use one or the other".
No, I understand now that it's more complicated, unfortunately.
>I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.

Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.
>But if you use wchar_t you don't need to rely on UTF-8 and thus
makes it more portable, correct?

It is one of the components you need. Another is to use C's locale
support. How portable you can be depends on what systems you are
targeting since not all of the features of C99's wide character
support are available on all compiler/library combinations. In fact,
the maximally portable set of things you can do with a wchar_t (or and
array of them) is very small. Here I hope an expert steps in a gives
you real experience-based wisdom about portable use of wide-character
support.
This isn't easy: I would need to rely on C99 features, and according to
viza the programs will be inefficient. I always aim to write portable
programs, but I also need to be able to use CJK characters, so I'm not
really sure what to do here.

I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?

--
Michael Brennan

Jul 9 '08 #8
On Wed, 09 Jul 2008 11:19:57 +0000, Michael Brennan wrote:
On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
>Michael Brennan <br************@gmail.com> writes:
I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find text
editors that can read and write the file more easily.

Just a thought. As you've realised there isn't a perfect solution.

viza
Jul 9 '08 #9
On Wed, 09 Jul 2008 11:39:08 +0000, viza wrote:
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find
text editors that can read and write the file more easily.
Isn't UTF16 a variable-length format?
Rui Maciel
Jul 9 '08 #10
Michael Brennan <br************@gmail.com> writes:

<snip>
I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
First, C does not assume UTF-8, though it is clearly the most likely
multi-byte string encoding you will come across. When talking about
standard, portable C, the choice is about whether, and when, to convert
between wide and multi-byte sequences.

Secondly, do you have a choice about the input? You suggest that it
is in a file, so you may have no choice about the input, but the
problem sounds like an assignment so maybe you get to choose the input
encoding.

Either way, it does not sound as if either the wasted space of always
using wide characters or the extra complexity of having multi-byte
strings really matters for your application. If you get to choose,
pick one and be happy. If you don't get to choose, go with what is
mandated and don't convert.

When I say "pick one" I don't mean at random. Different environments
will favour different encodings. If your input will be prepared by an
editor that makes entering Japanese as wide characters easy, then that
would be a reason to choose wide character input.

In general, if your input is as multi-byte strings, keep it that way.
A typical reason to convert to wchar_t would be if you need to match it
against other data that is already wchar_t or if your processing
requires frequent access to single characters.

It is much rarer to convert data that is already wide to
multi-byte strings. You may save some space, you might not. You will
end up with slightly more complex character processing.

--
Ben.
Jul 9 '08 #11
In article <48***********************@news.telepac.pt>,
Rui Maciel <ru********@gmail.com> wrote:
>Isn't UTF16 a variable-length format?
Yes, though if you don't need to interpret the characters above 0xFFFF
you can pretend it isn't.

-- Richard

--
Please remember to mention me / in tapes you leave behind.
Jul 9 '08 #12
On Jul 9, 1:13 am, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
Michael Brennan <brennan.bri...@gmail.com> writes:
It is not a simple "use one or the other".
I am aware that on UNIX at least, if you use UTF-8, char works pretty
well.

Yes, but a truly portable program won't assume UTF-8. Even if you can
assume it, converting to wide characters helps when you are doing lots
of character counting operations. For example, finding the longest
match of a pattern is complex if you keep everything in a multi-byte
encoding like UTF-8.
Indeed. I worked, a while ago, on code for index creation and scanning,
porting it from an 8-bit character set to Unicode. In that case, the
context required the storage to be in UTF-8. In memory we would do
on-the-fly conversion to UTF-32 to do pattern matching, counting,
normalization (that's a veritable Pandora's box) and whatever else was
required. For this we used IBM ICU (International Components for
Unicode), an IBM-developed library with a very permissive license that
still seems to be actively maintained.

Developing for Unicode does seem to require putting a lot of thought
into how the application interacts with the environment, and the fewer
assumptions you can make about the environment, the hairier it gets.

Stijn
Jul 9 '08 #13
viza <to******@gm-il.com.obviouschange.invalid> wrote:
On Wed, 09 Jul 2008 11:19:57 +0000, Michael Brennan wrote:
On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
Michael Brennan <br************@gmail.com> writes:
I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find text
editors that can read and write the file more easily.
There's no such thing as fixed-width Unicode characters: 8, 16, 32, 128
or even 1024 bits is insufficient. I can encode many Latin-1
characters using two or more wchar_t objects. Even after normalization you
still can't fit _all_ such characters into a single wchar_t (whether 16
bits, 32 bits or otherwise). Read about how Chinese, Japanese and Thai
scripts are encoded, and you'll begin to see the issues. You can get
deceptively close, but not all the way. It's simply impossible.

In other words, there's no easy way out. To do things properly, you need to
separate your applications into two distinct components. One which operates
on opaque byte streams, and another which employs a comprehensive multi-byte
string handling interface, like ICU, which provides logical operations on
string objects or streams. It's a pipe dream to think you'll ever be able to
use pointer-arithmetic to calculate "string length", or parse "words" with
iswspace() in a robust internationalized application.

Now, if you have a constrained environment that can make other certain
guarantees about data input, then use whatever. But wchar_t is not a generic
solution, not by a long shot. You can pretend it is. Lots of people do.
That's because they never get to hear the endless complaints from VARs in
Asia, and enjoy blissful ignorance.
Jul 9 '08 #14
Keith Thompson <ks***@mib.org> wrote:
Rui Maciel <ru********@gmail.com> writes:
On Wed, 09 Jul 2008 11:39:08 +0000, viza wrote:
What about UTF16 (probably as unsigned short)? It has the simplicity of
programming with fixed width characters and you will be able to find
text editors that can read and write the file more easily.
Isn't UTF16 a variable-length format?
Yes, but it's effectively fixed-length if you only use characters
within the "Basic Multilingual Plane".
Even if you can get away with counting "characters" in the BMP, how do you
parse words? It's a nonsense question, because you have to dispense with
such simplistic textual constructs. The point is that unless you use UTF-16
as a glorified ASCII, you _should_ immediately start to think clearly about
what you're doing exactly with your strings. There's no simple solution.
Unless you go whole-hog with proper Unicode string handling, any answer must
be carefully tailored to the specific context of a project. Suggesting that
the BMP provides equivalence guarantees to traditional string handling is,
well, just plain wrong.

You still need to normalize input even to count characters in the
traditional fashion, at which point you've already linked in something
beyond even the most bloated of libcs.

Jul 9 '08 #15
In article <fm************@wilbur.25thandClement.com>,
William Ahern <wi*****@wilbur.25thandClement.com> wrote:
>There's no such thing as fixed-width Unicode characters. 8-bit, 16-bits,
32-bits, 128-bits or 1024-bits is insufficient.
32 bits is plenty for Unicode.

A more accurate claim would be about the sufficiency of Unicode.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
Jul 9 '08 #16
On 2008-07-09, Ben Bacarisse <be********@bsb.me.uk> wrote:
Michael Brennan <br************@gmail.com> writes:

<snip>
>I currently have a program that reads names and birthdates from a file
and then does some calculations to show how many days left until their
birthday and so on. It works well, but I also need to have names in
Japanese in the file. My options are UTF-8 or wchar_t. I have to give up
a lot of portability by choosing either of them. Any recommendation on
which to choose?

First, C does not assume UTF-8 though it is clearly the most likely
multi-byte string encoding you will come across. When talking about
standard, portable, C the choice is about if, and when, to convert
between wide and multi-byte sequences.

Secondly, do you have a choice about the input? You suggest that it
is in a file, so you may have no choice about the input, but the
problem sounds like an assignment so maybe you get to choose the input
encoding.

Either way, it does not sound as if either the wasted space of always
using wide characters nor the extra complexity of having multi-byte
strings really matters for your application. If you get to choose,
pick one and be happy. If you don't get to choose, go with what is
mandated and don't convert.

When I say "pick one" I don't mean at random. Different environments
will favour different encodings. If your input will be prepared by an
editor that makes entering Japanese as wide characters easy, then that
would be a reason to choose wide character input.

In general, if your input is as muti-byte strings, keep it that way.
A typical reason to convert to wchar_t would be if you need to match it
against other data that is already wchar_t or if your processing
requires frequent access to single characters.

It is much more rare to convert data that is already wide to
multi-byte strings. You may save some space, you might not. You will
end up with slightly more complex character processing.
Thank you, and everyone else!

--
Michael Brennan

Jul 10 '08 #17
