Multibyte VS. Wide

Is it true that

Multibyte characters are: char arrays (which represent a string from
the basic character set). In this case wide characters are the way
of encoding characters from the extended character set.

or

Multibyte characters are: characters from the extended character set
which need more than one byte to encode. And in this case wide
characters are a subset of the multibyte character encoding.

Both ISO/IEC 9899:1999 and the libc info page (the GNU C library
documentation) are a little bit vague in this area.

I tend to believe the second explanation but want to make sure.

Yazan jaber
Nov 13 '05 #1
In <a3**************************@posting.google.com > ya*********@yahoo.com (yazan jab) writes:
> Is it true that
>
> Multibyte characters are: char arrays (which represent a string from
> the basic character set). In this case wide characters are the way
> of encoding characters from the extended character set.
>
> or
>
> Multibyte characters are: characters from the extended character set
> which need more than one byte to encode. And in this case wide
> characters are a subset of the multibyte character encoding.

Neither is true, but the latter is closer to the truth. The definition
of the multibyte character is correct, but wide characters are not a
subset of the multibyte character encoding. They are wide enough to
represent *every* character from the extended character set.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #2
ya*********@yahoo.com (yazan jab) wrote:
# Is it true that
#
# Multibyte characters are: char arrays (which represent a string from
# the basic character set). In this case wide characters are the way
# of encoding characters from the extended character set.

For something like Unicode, the character codes range from 0 to 65535 for the
Basic Multilingual Plane, or 0 to 0x10FFFF (about 1.1 million) for the full
code space. A wide character is an integer sufficient to hold the character
code as a fixed size unit, either a 16 or 32 bit integer (wchar_t is typically
16 bits on Windows and 32 bits on most Unix systems). When you use wchars for
these codes, you have the same advantage that you have for ASCII and char: an
n-character string will require exactly n+1 storage units to store, provided
every character fits in one wchar.
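To make the n+1 claim concrete, here is a minimal sketch. The helper name wide_storage_units is made up for this illustration, and it assumes every character in the string fits in a single wchar_t (on a 16-bit wchar_t platform, codes above 65535 would need two units each).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of wchar_t storage units a wide string
   occupies, counting the terminating L'\0'. */
size_t wide_storage_units(const wchar_t *ws)
{
    assert(ws != NULL);
    /* wcslen counts wchar_t units before the terminator. */
    return wcslen(ws) + 1;
}
```

With this, L"hello" and L"h\u00e9llo" both report 6 units, even though the second contains a non-ASCII character.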

However there are still many old and useful programs designed only for char
width characters that would not be able to cope with wchar characters. Instead
of recoding and recompiling all that software, some clever and not so clever
ways have been invented to represent one large 16 or 32 bit character as a
sequence of one or more 8-bit chars. UTF-8 for example represents 16-bit
Unicode as 1 to 3 bytes per character (4 bytes for codes above 65535). UTF-8
has the additional property that the ASCII subset of Unicode is encoded as the
exact same bytes as the ASCII codes, and that a multibyte UTF-8 character does
not include any bytes in the 0-127 range.
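As a rough sketch of how such an encoding works, the function below packs one code point into UTF-8 bytes. The name utf8_encode and its signature are made up for this illustration, not a standard library call; it follows the usual UTF-8 bit layout and rejects surrogates and values above 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is not
   a valid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* ASCII: byte is the code itself */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every byte of a 2, 3 or 4 byte sequence has its top bit set, which is why none of them can be mistaken for an ASCII character.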

This means when old ASCII software is given a multibyte encoding like UTF-8,
if it simply passes through bytes 128-255 unchanged, it becomes usable as
Unicode software without coding changes.
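A small sketch of that pass-through behaviour: an ASCII-only uppercasing routine (the name ascii_upper is made up here) that touches only the bytes 'a' to 'z' and leaves everything else alone, so UTF-8 multibyte sequences survive intact.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "old ASCII software": uppercase ASCII letters in place.
   Bytes 128-255 are never modified, so UTF-8 sequences pass through. */
void ascii_upper(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (c >= 'a' && c <= 'z')
            *s = (char)(c - 'a' + 'A');
    }
}
```

Given the UTF-8 bytes for "café" (63 61 66 C3 A9), only the first three bytes change; the accented letter is not uppercased, but it is not corrupted either.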

The disadvantage of multibyte characters is that an n character Unicode string
can take anywhere from n+1 through 3n+1 char storage units (4n+1 if codes
above 65535 are included); you won't know without examining the actual
characters.
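Here is one way that examination might look, as a sketch only. The name utf8_char_count is made up, and the function assumes its input is already valid UTF-8: it counts every byte that is not a continuation byte (10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count characters in a NUL-terminated string,
   assuming it holds valid UTF-8. Continuation bytes (10xxxxxx)
   belong to the previous character and are skipped. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

The string "h\xc3\xa9llo" has 5 characters but needs 7 chars of storage including the terminator, so the byte count and the character count have to be tracked separately.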

--
Derk Gwen http://derkgwen.250free.com/html/index.html
Where do you get those wonderful toys?
Nov 13 '05 #3
On Thu, 06 Nov 2003 11:55:13 -0500, yazan jab wrote:
> Is it true that
>
> Multibyte characters are: char arrays (which represent a string from
> the basic character set). In this case wide characters are the way of
> encoding characters from the extended character set.
>
> or
>
> Multibyte characters are: characters from the extended character set
> which need more than one byte to encode. And in this case wide


It's important to distinguish between characters (or charsets) and
character encodings. They are two different things. A charset is a map
that defines which numeric value represents a particular character. A
character encoding defines how numeric values are serialized into a
stream of bytes. For example Unicode can be encoded as UTF-8, which
is space efficient and provides byte-level compatibility with ASCII (but
not with ISO-8859-1, whose characters above 127 become two bytes). Or it
could be encoded as UCS-4LE, which is not space efficient but can be
easier to do heavy text processing with.
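To illustrate the difference, here is a minimal sketch of the fixed-width serialization. The name ucs4le_encode is made up for this example; it writes a code point as four bytes, least significant first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: serialize one code point as UCS-4 little-endian.
   Always writes exactly 4 bytes, whatever the value of cp. */
size_t ucs4le_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(cp & 0xFF);          /* lowest byte first */
    out[1] = (unsigned char)((cp >> 8) & 0xFF);
    out[2] = (unsigned char)((cp >> 16) & 0xFF);
    out[3] = (unsigned char)((cp >> 24) & 0xFF);
    return 4;
}
```

The charset assigns the euro sign the value 0x20AC either way. UCS-4LE serializes it as AC 20 00 00, while UTF-8 serializes the same value as E2 82 AC. Likewise 'A' (0x41) costs four bytes here but one byte in UTF-8.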

Here's a nice link about programming with extended charsets although it
is a little UTF-8/*nix centric:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Mike
Nov 13 '05 #4
