Bytes IT Community

Byte size of characters when encoding


Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Look:

/*
Each Unicode character in a string is defined by a Unicode scalar value,
also called ...

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.
*/

With UTF-8 encoding, one instance of struct Char can only occupy 1/2, 1,
1 1/2, or 2 bytes?
Isn't that so?
Therefore UTF8Encoding.GetMaxByteCount(charCount) must return
charCount * 2,
because charCount means the count of instances of struct Char.
Or not? Maybe it means the count of Unicode characters?
If so, then UnicodeEncoding.GetMaxByteCount(charCount) must return
charCount * 4.

These methods are not consistent with each other.
Jul 21 '05 #1
43 Replies


Vladimir wrote:
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why that?
Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, with UTF-8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount * 4 is just the worst-case
scenario, as if the string happened to contain only characters that
required a 4-byte encoding.


--
mikeb
Jul 21 '05 #2

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why that?


Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.


Are you saying that two instances of struct Char in UTF-8 can occupy 8
bytes?
Jul 21 '05 #3

Vladimir wrote:
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why that?


Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.

Are you saying that two instances of struct Char in UTF-8 can occupy 8
bytes?


It turns out that while a UTF-8 character can take up to 4 bytes to be
encoded, for the Framework a struct Char can always be encoded in at
most 3 bytes. That's because a struct Char holds a 16-bit Unicode
value, and that can always be encoded in 3 or fewer bytes.

A 4-byte UTF8 encoding is only needed for Unicode code points that
require 'surrogates' - or a pair of 16-bit values to represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

char c1 = '\uFFFF';
char c2 = '\u1000';

// GetBytes is an instance method, so call it through the Encoding.UTF8 instance:
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF-8 bytes.

Jon Skeet has written an excellent article on this type of issue:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb
Jul 21 '05 #4

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.

Are you saying that two instances of struct Char in UTF-8 can occupy 8 bytes?


It turns out that while a UTF8 character can take up to 4 bytes to be
encoded, for the Framework, a struct Char can always be encoded in at
most 3 bytes. That's because the struct char holds a 16-bit Unicode
value, and that can always be encoded in 3 or fewer bytes.

A 4-byte UTF8 encoding is only needed for Unicode code points that
require 'surrogates' - or a pair of 16-bit values to represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

char c1 = '\uFFFF';
char c2 = '\u1000';

// GetBytes is an instance method, so call it through the Encoding.UTF8 instance:
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF-8 bytes.


It's driving me crazy.
I don't understand.

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

If charCount means 32-bit Unicode characters:
UnicodeEncoding.GetMaxByteCount(charCount) must return charCount * 4.
UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 4.

If charCount means 16-bit Unicode characters (the Char structure):
UnicodeEncoding.GetMaxByteCount(charCount) must return charCount * 2.
UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 3.

Suppose we have a string of length 5 (a string's length means the count
of instances of struct Char).
UTF8Encoding.GetMaxByteCount(stringInstance.Length) should then return 15.
But that's not what it returns.

And:
maybe each surrogate pair (two 16-bit characters) in a string occupies
only 4 bytes in UTF-8?
Yes or no?

Look:

/*
UTF-16 encodes each 16-bit character as 2 bytes. It doesn't affect the
characters at all, and no compression occurs; its performance is
excellent. UTF-16 encoding is also referred to as Unicode encoding.

UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some
characters as 3 bytes, and some characters as 4 bytes. Characters with a
value below 0x0080 are compressed to 1 byte, which works very well for
characters used in the United States. Characters between 0x0080 and 0x07FF
are converted to 2 bytes, which works well for European and Middle Eastern
languages. Characters of 0x0800 and above are converted to 3 bytes, which
works well for East Asian languages. Finally, surrogate character pairs
are written out as 4 bytes. UTF-8 is an extremely popular encoding, but
it's less useful than UTF-16 if you encode many characters with values of
0x0800 or above.
*/

Does it mean that each pair of characters in UTF-16 can't occupy more
than 4 bytes in UTF-8?

Wait a minute.
It seems I'm starting to understand.

Characters below 0x0800 occupy at most 2 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
Characters at or above 0x0800 occupy 3 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
A UTF-16 surrogate pair occupies 4 bytes in UTF-8
(in UTF-16 it always occupies 4 bytes).

Right?

But then I think UTF8Encoding.GetMaxByteCount(charCount) must
return charCount * 3.
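The per-range byte counts quoted above can be checked directly. A short sketch; the sample characters are just illustrative picks from each range:

```csharp
using System;
using System.Text;

class Utf8ByteCounts
{
    static void Main()
    {
        // One sample character from each range described in the quoted text:
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));            // 1 byte: below 0x0080
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9"));       // 2 bytes: 0x0080..0x07FF
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u4E2D"));       // 3 bytes: 0x0800 and above
        Console.WriteLine(Encoding.UTF8.GetByteCount("\ud800\udf30")); // 4 bytes: a surrogate pair
    }
}
```

So the worst case for any single 16-bit Char really is 3 bytes; a 4-byte sequence always corresponds to two Chars forming a surrogate pair.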
Jul 21 '05 #5

Vladimir <xo***@tut.by> wrote:
It's driving me crazy.
I don't understand.


I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #6

> I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.


How can we send the bug report?
And...

I've found that new BitArray(length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.
Jul 21 '05 #7

You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes; see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the framework's
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes, with the exception of characters larger than 0xFFFF, which are
expressed as a sequence of two characters, called a surrogate pair. So
each character in UCS-2 takes up two bytes, but some Unicode characters
have to be expressed in pairs.

Jerry


Jul 21 '05 #8

Jerry Pisk <je******@hotmail.com> wrote:
You're right, it is a bug, but the correct answer is not what you think it
is.
I think that depends on how you read the documentation.
In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the frameworks
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes with the exception of characters larger than 0xFFFF which are
expressed as a sequence of two characters, called surrogate pair. So each
character in UCS-2 takes up two bytes but some Unicode characters have to be
expressed in pairs.


That's exactly what I thought. I believe GetMaxByteCount is meant to
return the maximum number of bytes for a sequence of 16-bit characters
though, where 2 characters forming a surrogate pair counts as 2
characters in the input. That way the maximum number of bytes required
to encode a string, for instance, is GetMaxByteCount(theString.Length).
Given that pretty much the whole of the framework works on the
assumption that a character is 16 bits and that surrogate pairs *are*
two characters, this seems more useful. It would be better if it were
more explicitly documented either way, however.
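The buffer-allocation pattern described above can be sketched as follows (the sample string is illustrative):

```csharp
using System;
using System.Text;

class MaxByteCountDemo
{
    static void Main()
    {
        // Mixed content: ASCII, an accented char, and a surrogate pair (U+10330).
        string s = "caf\u00E9 \ud800\udf30";
        Encoding utf8 = Encoding.UTF8;

        // Worst-case buffer, sized from the UTF-16 code-unit count (s.Length).
        byte[] buffer = new byte[utf8.GetMaxByteCount(s.Length)];
        int written = utf8.GetBytes(s, 0, s.Length, buffer, 0);

        // The actual byte count never exceeds the worst-case bound.
        Console.WriteLine(written + " <= " + buffer.Length);
    }
}
```

Whatever multiplier GetMaxByteCount uses internally, the guarantee callers rely on is exactly that written never exceeds buffer.Length.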

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #9

Vladimir <xo***@tut.by> wrote:
I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.
How can we send the bug report?


I don't know the best way of submitting bugs for 1.1. I'll try to
remember to submit it as a Whidbey bug if I get the time to test it.
(Unfortunately time is something I'm short of at the moment.)
And...

I've found that new BitArray(length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.


I'm not entirely surprised, but it should at least be documented I
guess.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #10

I've found that new BitArray(length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.


I'm not entirely surprised, but it should at least be documented I
guess.


I think it should throw ArgumentOutOfRangeException, or (best of all)
handle the whole range from 0 to int.MaxValue. That can be done easily.

Just replace (length + 31) / 32 with ((length % 32 == 0) ? (length / 32) :
(length / 32 + 1)).
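A minimal sketch of the arithmetic in question (the variable names naive and safe are illustrative):

```csharp
using System;

class BitArrayWords
{
    static void Main()
    {
        int length = int.MaxValue;

        // The problematic expression: length + 31 overflows for length near
        // int.MaxValue. Unchecked it wraps negative; in a checked context it
        // throws OverflowException, matching the behaviour reported above.
        int naive = unchecked((length + 31) / 32);

        // The suggested replacement never overflows:
        int safe = (length % 32 == 0) ? (length / 32) : (length / 32 + 1);

        Console.WriteLine(naive);  // negative: the wrapped result
        Console.WriteLine(safe);   // 67108864: the correct word count
    }
}
```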

And it seems there are problems in other constructors of BitArray as well:
everywhere (length + 31) / 32 or length * 8 is used.

For example, BitArray(int[]).
Obviously it can only handle arrays with length up to 67,108,864, so it
should throw ArgumentOutOfRangeException.
But it doesn't at all.
I'm not sure, but I think it will throw an overflow exception.
Jul 21 '05 #11

> You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2.


I think 4 IS the right answer.
Reading the RFC tells you that up to 4 bytes are used to represent the
range between 00010000-001FFFFF.
Well, Unicode stops at 10FFFF.
Anything longer than 4 bytes is incorrect Unicode.
And UTF-8 encoders/decoders should be aware of this; otherwise it can
even lead to security vulnerabilities (like buffer overruns).
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #12

Mihai N. <nm**************@yahoo.com> wrote:
You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2.


I think 4 IS the right answer.
Reading the RFC tells you that up to 4 bytes are used to represent the
range between 00010000-001FFFFF.


But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #13

I don't think so; just because .NET internally uses UCS-2 doesn't mean two
surrogate characters are two characters. They're a single character as far
as Unicode is concerned.

The whole issue comes down to the documentation not being very clear. It
says GetMaxByteCount takes the number of characters to encode, but it
doesn't say in what encoding. If it's the number of characters in UCS-2,
then you're right, 4 is the worst case; if it's Unicode characters, then 6
is the correct value. I'm not really sure what the CLR says, whether it
treats character data as Unicode or as UCS-2-encoded Unicode (and I'm not
talking about the internal representation here, I'm talking about what the
character data type actually stands for).

Jerry


Jul 21 '05 #14

The whole issue comes down to the documentation not being very clear. It
says GetMaxByteCount takes the number of characters to encode but it
doesn't say in what encoding. If it's number of characters in UCS-2 then
you're right, 4 is the worst case, if it's Unicode characters then 6 is
the correct value. I'm not really sure what CLR says, if it treats
character data as Unicode or as UCS-2 encoded Unicode (and I'm not talking
about the internal representation here, I'm talking about what character
data type actually stands for).


/*
Encoding Class

Remarks
Methods are provided to convert arrays and strings !of Unicode characters!
to and from arrays of bytes encoded for a target code page.
*/

Therefore the maximal character count means Unicode (UTF-16) characters.

And it seems the implementation of ASCIIEncoding.GetBytes() doesn't know
anything about surrogate pairs: for a surrogate pair it returns two bytes.
Therefore the maximal character count doesn't mean ... you know.
Jul 21 '05 #15

Vladimir, Unicode is NOT UTF-16. And .NET doesn't use UTF-16 internally; it
uses UCS-2, which is a different encoding.

Jerry


Jul 21 '05 #16

Jerry Pisk <je******@hotmail.com> wrote:
I don't think so, just because .Net internally uses UCS-2 doesn't mean two
surrogate characters are two characters. They're a single character as far
as Unicode is concerned.
But they're two characters as far as almost the whole of the rest of
the .NET API is concerned. String.Length will give you two characters,
and obviously if you've got a char array the surrogate will take up two
positions.
The whole issue comes down to the documentation not being very clear.
Agreed.
It says GetMaxByteCount takes the number of characters to encode but it doesn't
say in what encoding. If it's number of characters in UCS-2 then you're
right, 4 is the worst case
No, 3 is the worst case, isn't it?
if it's Unicode characters then 6 is the correct value.
Yes.
I'm not really sure what CLR says, if it treats character data as
Unicode or as UCS-2 encoded Unicode (and I'm not talking about the internal
representation here, I'm talking about what character data type actually
stands for).


Well, the System.Char data type is for a "Unicode 16-bit char" which
isn't terribly helpful, unfortunately. From the MSDN docs for
System.Char:

<quote>
The Char value type represents a Unicode character, also called a
Unicode code point, and is implemented as a 16-bit number ranging in
value from hexadecimal 0x0000 to 0xFFFF. A single Char cannot represent
a Unicode character that is encoded as a surrogate pair. However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.
</quote>

So the docs for GetMaxByteCount ought to be clear as to whether it's a
count of System.Chars or a count of full Unicode characters. I suspect
it's *meant* to be the former, but it should definitely be clearer.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #17

Vladimir <xo***@tut.by> wrote:
/*
Encoding Class

Remarks
Methods are provided to convert arrays and strings !of Unicode characters!
to
and from arrays of bytes encoded for a target code page.
*/

Therefore maximal characters count means Unicode (Utf-16) characters.
I don't think that's clear at all.
And it seems the implementation of ASCIIEncoding.GetBytes() doesn't know
anything about surrogate pairs.


I think in general the Encoding implementations don't guarantee to give
good results when they're passed characters which aren't in their
character set. Certainly ASCIIEncoding doesn't perform optimally in
such a situation.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #18


>> if it's Unicode characters then 6 is the correct value.

Yes. This is what I tried to combat earlier.
The Unicode range is 0-10FFFF
This means max 4 bytes. Anything above 4 is possible but incorrect, and
can be produced only by broken encoders.
However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.

This tells me it is aware of surrogates, and in fact uses UTF-16.
The Windows API for NT/2000/XP/2003 is UCS-2, but .NET might be UTF-16.

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #20

> But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.

True. But what is the main reason to use GetMaxByteCount?
Well, if I have a Unicode string and want to allocate a buffer for the
result of the conversion, then the typical use is
length_of_the_string * GetMaxByteCount().
Dealing with a string means I can get surrogates.

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #21

Mihai N. <nm**************@yahoo.com> wrote:
if it's Unicode characters then 6 is the correct value.


Yes.

This is what I tried to combat earlier.
The Unicode range is 0-10FFFF
This means max 4 bytes. Anything above 4 is possible but incorrect, and
can be produced only by broken encoders.
However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.

This tells me it is aware of surrogates, and in fact uses utf16.
The Windows API for NT/2000/XP/2003 is UCS2, but .NET might be UTF16.


The string class itself isn't aware of surrogates, as far as I know.
The encoder needs to be aware in order to know how to encode them, but
the question is whether the count parameter should treat a surrogate
pair as two characters or one - given the rest of the API which
strongly leans towards them being two characters, that's what I think
should happen here. In either case, the documentation should be very
clear about this.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #22

Mihai N. <nm**************@yahoo.com> wrote:
But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.

True. But what is the main reason to use GetMaxByteCount?
Well, if I have a unicode string and want to alloc a buffer for the
result of the conversion. Then the typical use is
length_of_the_string * GetMaxByteCount()
Dealing with a string means I can get surrogates.


Yes - and if GetMaxByteCount assumes that the surrogates will count as
two characters, you can just use:

int maxSize = encoding.GetMaxByteCount(myString.Length);
byte[] buffer = new byte[maxSize];
...

String.Length reports surrogates as two characters. For instance:

using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Gothic letter AHSA, UTF-32 value of U+10330
        string x = "\ud800\udf30";

        Console.WriteLine (x.Length);                           // 2
        Console.WriteLine (Encoding.UTF8.GetBytes(x).Length);   // 4
    }
}

Making UTF8Encoding.GetMaxByteCount(count) return count*3 will always
work with the type of code given earlier for creating a new buffer, and
will lead to less wastage than returning count*4.

It's only if String.Length counted surrogates as single characters that
you'd need to return count*4.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #23

> Vladimir, Unicode is NOT UTF-16. And .NET doesn't use UTF-16 internally;
> it uses UCS-2, which is a different encoding.


Different in what way?

UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from the documentation).
And it works straightforwardly with the Char structure.
Jul 21 '05 #24

Vladimir <xo***@tut.by> wrote:
Vladimir, Unicode is NOT UTF-16. And .Net doesn't use UTF-16 internally,
it uses UCS-2, which is different encoding.


Different in what?

UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from documentation).
And it works straight forward with Char structure.


The difference is that UCS-2 can only encode Unicode characters 0-
0xffff. UTF-16 can encode the whole of Unicode.

I'm not *entirely* clear, but I believe that the difference is fairly
minimal in .NET itself, unless you view the characters which form
surrogate pairs as invalid UCS-2 characters (pass on that one, I'm
afraid). If you had a 32-bit character data type to start with,
however, a correct UCS-2 encoding would reject characters above 0xffff,
whereas a correct UTF-16 encoding would cope.

I guess another way of looking at it (please someone, correct me if I'm
wrong!) is that although each character in .NET is only UCS-2, strings
are sometimes regarded as UTF-16. (It's the "sometimes" which is the
problem, here.)
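The relationship can be sketched by encoding a supplementary code point into UTF-16 by hand; the arithmetic below follows the UTF-16 definition, using the same U+10330 character that appears elsewhere in this thread:

```csharp
using System;

class SurrogatePairByHand
{
    static void Main()
    {
        // U+10330 is above 0xFFFF, so UTF-16 needs a surrogate pair for it.
        int codePoint = 0x10330;

        // Manual UTF-16 encoding of a supplementary code point:
        int v = codePoint - 0x10000;
        char high = (char)(0xD800 + (v >> 10));   // high (lead) surrogate
        char low  = (char)(0xDC00 + (v & 0x3FF)); // low (trail) surrogate
        string s = new string(new char[] { high, low });

        Console.WriteLine(((int)high).ToString("X4")); // D800
        Console.WriteLine(((int)low).ToString("X4"));  // DF30
        Console.WriteLine(s.Length);                   // 2: UCS-2 sees two code units
    }
}
```

A UCS-2 view stops at the two 16-bit units; a UTF-16 view recombines them into the single code point U+10330.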

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #25



P: n/a
By the way.
Is there a way to compress Unicode strings by SCSU in .Net?

Jul 21 '05 #28


Vladimir <xo***@tut.by> wrote:
By the way.
Is there a way to compress Unicode strings by SCSU in .Net?


I don't know of any way built into the framework. I'd be happy to
collaborate with someone on an open source solution, if people think it
would be useful.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #30


> It's only if String.Length counted surrogates as single characters that
> you'd need to return count*4.

This is why I used length_of_the_string (something generic) and not
String.Length (the real API).
My guess is that .NET and .NET strings are not yet fully aware of
surrogates. Some parts have to be (the converter), some parts not.
String.Length returns the number of Chars, but these are .NET chars,
not Unicode chars.
At some point you may need some API to tell you the length of the string
in Unicode chars. Imagine someone typing 5 Unicode characters, all of them
in the surrogate area. String.Length returns 10, the application complains
that the user name (for instance) should be at most 8 characters, and the
user is puzzled, because he typed only 5.
But the IME is not there for this, and many things are not in place yet.

We can assume this will be cleaned up at some point. All we can do is
understand the differences between Unicode (the standard) and the
real-life use of Unicode (.NET, NT, XP, Unix, etc.): know what the
standard states and what the implementations do differently.
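The 5-versus-10 mismatch described above can be sketched like this; note that StringInfo.LengthInTextElements is an assumption here, since it appeared only in framework versions later than the one being discussed:

```csharp
using System;
using System.Globalization;

class CharsVsCharacters
{
    static void Main()
    {
        // Five supplementary-plane characters, each stored as a surrogate pair.
        string name = "\ud800\udf30\ud800\udf31\ud800\udf32\ud800\udf33\ud800\udf34";

        Console.WriteLine(name.Length);  // 10: counts .NET Chars (UTF-16 code units)

        // Counts Unicode text elements, treating each surrogate pair as one.
        StringInfo info = new StringInfo(name);
        Console.WriteLine(info.LengthInTextElements);  // 5
    }
}
```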
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #32


P: n/a
Mihai N. <nm**************@yahoo.com> wrote:
> > It's only if String.Length counted surrogates as single characters that
> > you'd need to return count*4.
>
> This is why is used length_of_the_string (something generic) and not
> String.Length (real API).


What do you mean by "This is why is used"? Who are you saying is using
this code?
> My guess is that .NET and the strings in .NET are not yet fully
> aware of surrogates. Some parts have to be (convertor), some parts not.
> String.Length returns the number of Chars, but these are .NET chars,
> not Unicode chars.

Yup.

> At some point you may need some api to tell you the length of the string
> in Unicode chars.

Indeed. I wrote a Utf32String class a while ago which does all this,
and can convert to and from "normal" strings.

> Imagine someone typing 5 unicode characters, all of them
> in the surrogate area. String.Length returns 10, the application complains
> that the user name (for instance) should be max 8 characters, and the user
> is puzzled, because he did type only 5.

Blech - yes, that's horrible.

> But the IME is not there for this, and many things are not in place, yet.
>
> We can assume this will be cleaned out at some point. All we can do is
> understand the differences between Unicode (the standard) and the real
> life use of Unicode (.NET, NT, XP, Unix, etc). Know what the standard
> states and what the implementations do different.


Yup. To be honest, I can't see it being *cleanly* sorted without taking
the hit of going for full UTF-32 (or UCS-4 - I don't know if there's
any difference) characters. Doing that would be a nasty memory hit, but
it may be what's required.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #34


P: n/a
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use characters
0 - 0xFFFF. So I was wrong, .Net uses UTF-16. UCS-2 is simply using 16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).
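[The ranges Jerry gives can be checked concretely; a hedged Python sketch illustrating the encoding forms themselves, not any .NET API:]

```python
# U+10FFFF is the highest Unicode code point. UTF-16 reaches it with a
# surrogate pair; strict UCS-2 cannot, since UCS-2 stops at U+FFFF.
ch = "\U0010FFFF"
data = ch.encode("utf-16-le")

print(len(data))   # 4 bytes: one surrogate pair (two 16-bit units)
print(data.hex())  # 'ffdbffdf' -> units 0xDBFF (high), 0xDFFF (low)

# UCS-2 simply has no representation for this code point:
print(ord(ch) > 0xFFFF)  # True
```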

Jerry

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...
Vladimir <xo***@tut.by> wrote:
> Vladimir, Unicode is NOT UTF-16. And .Net doesn't use UTF-16
> internally,
> it uses UCS-2, which is different encoding.


Different in what?

UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from documentation).
And it works straight forward with Char structure.


The difference is that UCS-2 can only encode Unicode characters 0-
0xffff. UTF-16 can encode the whole of Unicode.

I'm not *entirely* clear, but I believe that the difference is fairly
minimal in .NET itself, unless you view the characters which form
surrogate pairs as invalid UCS-2 characters (pass on that one, I'm
afraid). If you had a 32-bit character data type to start with,
however, a correct UCS-2 encoding would reject characters above 0xffff,
whereas a correct UTF-16 encoding would cope.

I guess another way of looking at it (please someone, correct me if I'm
wrong!) is that although each character in .NET is only UCS-2, strings
are sometimes regarded as UTF-16. (It's the "sometimes" which is the
problem, here.)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #36


P: n/a
Jerry Pisk <je******@hotmail.com> wrote:
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use characters
0 - 0xFFFF. So I was wrong, .Net uses UTF-16. UCS-2 is simply using 16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).


But each character itself in .NET is only 16 bits. It's only strings
which have the concept of surrogate pairs, surely. The .NET concept of
a character is limited to UCS-2, but other things can interpret
sequences of those characters as UTF-16 sequences.
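[The pairing arithmetic behind that UTF-16 interpretation is fixed by the Unicode standard; a small Python sketch for illustration only -- the function name here is the editor's, not a .NET or Unicode API:]

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into the two
    16-bit units a UTF-16 string stores for it."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000                  # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # lead surrogate, D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)        # trail surrogate, DC00..DFFF
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```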

If you could state *exactly* which part of my post was "not true" it
would make it easier to either defend my position or retract it though.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #38


P: n/a
Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).

Jerry

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...
Jerry Pisk <je******@hotmail.com> wrote:
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use
characters
0 - 0xFFFF. So I was wrong, .Net uses UTF-16. UCS-2 is simply using
16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).


But each character itself in .NET is only 16 bits. It's only strings
which have the concept of surrogate pairs, surely. The .NET concept of
a character is limited to UCS-2, but other things can interpret
sequences of those characters as UTF-16 sequences.

If you could state *exactly* which part of my post was "not true" it
would make it easier to either defend my position or retract it though.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #40

P: n/a
Jerry Pisk <je******@hotmail.com> wrote:
Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).


I don't believe that's true. While ISO/IEC 10646 contains 2^31 code
positions, I believe the Unicode standard itself limits characters to
the BMP or the first supplementary 14 planes of ISO/IEC 10646. From the
Unicode standard:

<quote>
The Principles and Procedures document of JTC1/SC2/WG2 states that all
future assignments of characters to 10646 will be constrained to the
BMP or the first 14 supplementary planes. This is to ensure
interoperability between the 10646 transformation formats (see below).
It also guarantees interoperability with implementations of the Unicode
Standard, for which only code positions 0..10FFFF₁₆ are meaningful.
</quote>

From elsewhere in the standard
(http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf):

<quote>
In the Unicode Standard, the codespace consists of the integers from 0
to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
the repertoire of abstract characters. Of course, there are constraints
on how the codespace is organized, and particular areas of the
codespace have been set aside for encoding of certain kinds of abstract
characters or for other uses in the standard. For more on the
allocation of the Unicode codespace, see Section 2.8, Unicode
Allocation.
</quote>

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #41

P: n/a
>> This is why is used length_of_the_string (something generic) and not
>> String.Length (real API).
>
> What do you mean by "This is why is used"? Who are you saying is using
> this code?

My mistake. "This is why I used". Was kind of pseudo-code to avoid any
specific API.

Otherwise, we seem to agree on all :-)
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #42

P: n/a
> Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).

0 - 0x10FFFF IS the whole Unicode range.
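[A quick arithmetic cross-check of Mihai's point, in plain Python, nothing .NET-specific:]

```python
# 17 planes (the BMP plus 16 supplementary planes) of 0x10000 code points
# each give exactly the codespace 0..0x10FFFF quoted from the standard.
print(17 * 0x10000)   # 1114112
print(0x10FFFF + 1)   # 1114112 -- so 0..0x10FFFF is indeed the whole range
```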

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #43

P: n/a
Mihai N. <nm**************@yahoo.com> wrote:
> > > This is why is used length_of_the_string (something generic) and not
> > > String.Length (real API).
> >
> > What do you mean by "This is why is used"? Who are you saying is using
> > this code?
>
> My mistake. "This is why I used". Was kind of pseudo-code to avoid any
> specific API.
>
> Otherwise, we seem to agree on all :-)


Yup. I've just submitted a comment to the MSDN team to make the docs
more explicit.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #44
