Bytes IT Community

Byte size of characters when encoding


Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Look:

/*
Each Unicode character in a string is defined by a Unicode scalar value,
also called ...

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.
*/

With UTF-8 encoding, one instance of struct Char can only occupy 1/2, 1,
1 1/2, or 2 bytes?
Isn't that so?
Therefore UTF8Encoding.GetMaxByteCount(charCount) must return
charCount * 2,
because charCount means the count of instances of struct Char.
Or not? Maybe it means the count of Unicode characters?
If so, then UnicodeEncoding.GetMaxByteCount(charCount) must return
charCount * 4.

These methods are not consistent with each other.
Jul 21 '05 #1
43 Replies


Vladimir wrote:
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why that?
Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, with UTF-8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount * 4 is just the worst-case
scenario, as if the string happened to contain only characters that
required a 4-byte encoding.


--
mikeb
Jul 21 '05 #2

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why that?


Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.


Are you saying that two instances of struct Char in UTF-8 can occupy 8
bytes?
Jul 21 '05 #3

Vladimir wrote:
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why that?


Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.

Are you saying that two instances of struct Char in UTF-8 can occupy 8
bytes?


It turns out that while a UTF-8 character can take up to 4 bytes to be
encoded, for the Framework a struct Char can always be encoded in at
most 3 bytes. That's because a struct Char holds a 16-bit Unicode
value, and that can always be encoded in 3 or fewer bytes.

A 4-byte UTF8 encoding is only needed for Unicode code points that
require 'surrogates' - or a pair of 16-bit values to represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

char c1 = '\uFFFF';
char c2 = '\u1000';

// GetBytes is an instance method, so call it through the Encoding.UTF8 instance:
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF-8 bytes.

Jon Skeet has written an excellent article on this type of issue:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb
Jul 21 '05 #4

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Strings in .NET are already Unicode encoded. So if you encode the
string to an array of bytes, you get two bytes per character.

However, for UTF8 encoding a single Unicode character can be encoded
using up to 4 bytes in the worst case. charCount*4 is just a worst case
scenario if the string happened to contain only characters that required
4 byte encoding.

Are you saying that two instances of struct Char in UTF-8 can occupy 8 bytes?


It turns out that while a UTF8 character can take up to 4 bytes to be
encoded, for the Framework, a struct Char can always be encoded in at
most 3 bytes. That's because the struct char holds a 16-bit Unicode
value, and that can always be encoded in 3 or fewer bytes.

A 4-byte UTF8 encoding is only needed for Unicode code points that
require 'surrogates' - or a pair of 16-bit values to represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

char c1 = '\uFFFF';
char c2 = '\u1000';

// GetBytes is an instance method, so call it through the Encoding.UTF8 instance:
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF-8 bytes.


It's driving me crazy.
I don't understand.

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

If charCount means 32-bit Unicode characters:
UnicodeEncoding.GetMaxByteCount(charCount) must return charCount * 4.
UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 4.

If charCount means 16-bit Unicode characters (the Char structure):
UnicodeEncoding.GetMaxByteCount(charCount) must return charCount * 2.
UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 3.

Suppose we have a string of length 5 (a string's length means the count
of instances of struct Char).
UTF8Encoding.GetMaxByteCount(stringInstance.Length) should then return 15.
But that's not what it returns.

And:
maybe each surrogate pair (two 16-bit characters) in a string occupies
only 4 bytes in UTF-8?
Yes or no?

Look:

/*
UTF-16 encodes each 16-bit character as 2 bytes. It doesn't affect the
characters at all, and no compression occurs; its performance is
excellent. UTF-16 encoding is also referred to as Unicode encoding.

UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some
characters as 3 bytes, and some characters as 4 bytes. Characters with a
value below 0x0080 are compressed to 1 byte, which works very well for
characters used in the United States. Characters between 0x0080 and 0x07FF
are converted to 2 bytes, which works well for European and Middle Eastern
languages. Characters of 0x0800 and above are converted to 3 bytes, which
works well for East Asian languages. Finally, surrogate character pairs
are written out as 4 bytes. UTF-8 is an extremely popular encoding, but
it's less useful than UTF-16 if you encode many characters with values of
0x0800 or above.
*/

Does it mean that each pair of characters in UTF-16 can't occupy more
than 4 bytes in UTF-8?

Wait a minute.
It seems I'm starting to understand.

Characters below 0x0800 occupy at most 2 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
Characters at or above 0x0800 occupy 3 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
A UTF-16 surrogate pair occupies 4 bytes in UTF-8
(in UTF-16 it always occupies 4 bytes).

Right?

But then I think UTF8Encoding.GetMaxByteCount(charCount) must
return charCount * 3.
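The per-range byte counts quoted above can be checked directly. A short sketch; the sample characters are just illustrative picks from each range:

```csharp
using System;
using System.Text;

class Utf8ByteCounts
{
    static void Main()
    {
        // One sample character from each range described in the quoted text:
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));            // 1 byte: below 0x0080
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9"));       // 2 bytes: 0x0080..0x07FF
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u4E2D"));       // 3 bytes: 0x0800 and above
        Console.WriteLine(Encoding.UTF8.GetByteCount("\ud800\udf30")); // 4 bytes: a surrogate pair
    }
}
```

So the worst case for any single 16-bit Char really is 3 bytes; a 4-byte sequence always corresponds to two Chars forming a surrogate pair.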
Jul 21 '05 #5

Vladimir <xo***@tut.by> wrote:
It's driving me crazy.
I don't understand.


I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #6

> I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.


How can we send the bug report?
And...

I've found that new BitArray(length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.
Jul 21 '05 #7

You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes; see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the framework's
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes, with the exception of characters larger than 0xFFFF, which are
expressed as a sequence of two characters, called a surrogate pair. So
each character in UCS-2 takes up two bytes, but some Unicode characters
have to be expressed in pairs.

Jerry


Jul 21 '05 #8

Jerry Pisk <je******@hotmail.com> wrote:
You're right, it is a bug, but the correct answer is not what you think it
is.
I think that depends on how you read the documentation.
In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2. As for the frameworks
internal representation - it uses UCS-2, where each character is expressed
as 2 bytes with the exception of characters larger than 0xFFFF which are
expressed as a sequence of two characters, called surrogate pair. So each
character in UCS-2 takes up two bytes but some Unicode characters have to be
expressed in pairs.


That's exactly what I thought. I believe GetMaxByteCount is meant to
return the maximum number of bytes for a sequence of 16-bit characters
though, where 2 characters forming a surrogate pair counts as 2
characters in the input. That way the maximum number of bytes required
to encode a string, for instance, is GetMaxByteCount(theString.Length).
Given that pretty much the whole of the framework works on the
assumption that a character is 16 bits and that surrogate pairs *are*
two characters, this seems more useful. It would be better if it were
more explicitly documented either way, however.
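The buffer-allocation pattern described above can be sketched as follows (the sample string is illustrative):

```csharp
using System;
using System.Text;

class MaxByteCountDemo
{
    static void Main()
    {
        // Mixed content: ASCII, an accented char, and a surrogate pair (U+10330).
        string s = "caf\u00E9 \ud800\udf30";
        Encoding utf8 = Encoding.UTF8;

        // Worst-case buffer, sized from the UTF-16 code-unit count (s.Length).
        byte[] buffer = new byte[utf8.GetMaxByteCount(s.Length)];
        int written = utf8.GetBytes(s, 0, s.Length, buffer, 0);

        // The actual byte count never exceeds the worst-case bound.
        Console.WriteLine(written + " <= " + buffer.Length);
    }
}
```

Whatever multiplier GetMaxByteCount uses internally, the guarantee callers rely on is exactly that written never exceeds buffer.Length.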

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #9

Vladimir <xo***@tut.by> wrote:
I think it's just a bug. UnicodeEncoding is doing the right thing, but
UTF8Encoding should return charCount*3, not charCount*4.
How can we send the bug report?


I don't know the best way of submitting bugs for 1.1. I'll try to
remember to submit it as a Whidbey bug if I get the time to test it.
(Unfortunately time is something I'm short of at the moment.)
And...

I've found that new BitArray(length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.


I'm not entirely surprised, but it should at least be documented I
guess.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #10

I've found that new BitArray(length) throws an overflow exception when
length is in the range from int.MaxValue - 30 to int.MaxValue.


I'm not entirely surprised, but it should at least be documented I
guess.


I think it should throw ArgumentOutOfRangeException, or (best of all)
handle the whole range from 0 to int.MaxValue. That can be done easily.

Just replace (length + 31) / 32 with ((length % 32 == 0) ? (length / 32) :
(length / 32 + 1)).
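A minimal sketch of the arithmetic in question (the variable names naive and safe are illustrative):

```csharp
using System;

class BitArrayWords
{
    static void Main()
    {
        int length = int.MaxValue;

        // The problematic expression: length + 31 overflows for length near
        // int.MaxValue. Unchecked it wraps negative; in a checked context it
        // throws OverflowException, matching the behaviour reported above.
        int naive = unchecked((length + 31) / 32);

        // The suggested replacement never overflows:
        int safe = (length % 32 == 0) ? (length / 32) : (length / 32 + 1);

        Console.WriteLine(naive);  // negative: the wrapped result
        Console.WriteLine(safe);   // 67108864: the correct word count
    }
}
```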

And it seems there are problems in other constructors of BitArray as well:
everywhere (length + 31) / 32 or length * 8 is used.

For example, BitArray(int[]).
Obviously it can only handle arrays with length up to 67,108,864, so it
should throw ArgumentOutOfRangeException.
But it doesn't at all.
I'm not sure, but I think it will throw an overflow exception.
Jul 21 '05 #11

> You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2.


I think 4 IS the right answer.
Reading the RFC tells you that up to 4 bytes are used to represent the
range between 00010000-001FFFFF.
Well, Unicode stops at 10FFFF.
Anything longer than 4 bytes is incorrect Unicode.
And UTF-8 encoders/decoders should be aware of this; otherwise it can
even lead to security vulnerabilities (like buffer overruns).
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #12

Mihai N. <nm**************@yahoo.com> wrote:
You're right, it is a bug, but the correct answer is not what you think it
is. In UTF-8 a character can be up to 6 bytes, see
http://www.ietf.org/rfc/rfc2279.txt, chapter 2.


I think 4 IS the right answer.
Reading the RFC tells you that up to 4 bytes are used to represent the
range between 00010000-001FFFFF.


But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #13

I don't think so; just because .NET internally uses UCS-2 doesn't mean two
surrogate characters are two characters. They're a single character as far
as Unicode is concerned.

The whole issue comes down to the documentation not being very clear. It
says GetMaxByteCount takes the number of characters to encode, but it
doesn't say in what encoding. If it's the number of characters in UCS-2,
then you're right, 4 is the worst case; if it's Unicode characters, then 6
is the correct value. I'm not really sure what the CLR says, whether it
treats character data as Unicode or as UCS-2-encoded Unicode (and I'm not
talking about the internal representation here, I'm talking about what the
character data type actually stands for).

Jerry


Jul 21 '05 #14

The whole issue comes down to the documentation not being very clear. It
says GetMaxByteCount takes the number of characters to encode but it
doesn't say in what encoding. If it's number of characters in UCS-2 then
you're right, 4 is the worst case, if it's Unicode characters then 6 is
the correct value. I'm not really sure what CLR says, if it treats
character data as Unicode or as UCS-2 encoded Unicode (and I'm not talking
about the internal representation here, I'm talking about what character
data type actually stands for).


/*
Encoding Class

Remarks
Methods are provided to convert arrays and strings !of Unicode characters!
to and from arrays of bytes encoded for a target code page.
*/

Therefore the maximal character count means Unicode (UTF-16) characters.

And it seems the implementation of ASCIIEncoding.GetBytes() doesn't know
anything about surrogate pairs: for a surrogate pair it returns two bytes.
Therefore the maximal character count doesn't mean ... you know.
Jul 21 '05 #15

Vladimir, Unicode is NOT UTF-16. And .NET doesn't use UTF-16 internally; it
uses UCS-2, which is a different encoding.

Jerry


Jul 21 '05 #16

Jerry Pisk <je******@hotmail.com> wrote:
I don't think so, just because .Net internally uses UCS-2 doesn't mean two
surrogate characters are two characters. They're a single character as far
as Unicode is concerned.
But they're two characters as far as almost the whole of the rest of
the .NET API is concerned. String.Length will give you two characters,
and obviously if you've got a char array the surrogate will take up two
positions.
The whole issue comes down to the documentation not being very clear.
Agreed.
It says GetMaxByteCount takes the number of characters to encode but it doesn't
say in what encoding. If it's number of characters in UCS-2 then you're
right, 4 is the worst case
No, 3 is the worst case, isn't it?
if it's Unicode characters then 6 is the correct value.
Yes.
I'm not really sure what CLR says, if it treats character data as
Unicode or as UCS-2 encoded Unicode (and I'm not talking about the internal
representation here, I'm talking about what character data type actually
stands for).


Well, the System.Char data type is for a "Unicode 16-bit char" which
isn't terribly helpful, unfortunately. From the MSDN docs for
System.Char:

<quote>
The Char value type represents a Unicode character, also called a
Unicode code point, and is implemented as a 16-bit number ranging in
value from hexadecimal 0x0000 to 0xFFFF. A single Char cannot represent
a Unicode character that is encoded as a surrogate pair. However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.
</quote>

So the docs for GetMaxByteCount ought to be clear as to whether it's a
count of System.Chars or a count of full Unicode characters. I suspect
it's *meant* to be the former, but it should definitely be clearer.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #17

Vladimir <xo***@tut.by> wrote:
/*
Encoding Class

Remarks
Methods are provided to convert arrays and strings !of Unicode characters!
to
and from arrays of bytes encoded for a target code page.
*/

Therefore maximal characters count means Unicode (Utf-16) characters.
I don't think that's clear at all.
And it seems the implementation of ASCIIEncoding.GetBytes() doesn't know
anything about surrogate pairs.


I think in general the Encoding implementations don't guarantee to give
good results when they're passed characters which aren't in their
character set. Certainly ASCIIEncoding doesn't perform optimally in
such a situation.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #18


>> if it's Unicode characters then 6 is the correct value.

Yes. This is what I tried to combat earlier.
The Unicode range is 0-10FFFF
This means max 4 bytes. Anything above 4 is possible but incorrect, and
can be produced only by broken encoders.
However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.

This tells me it is aware of surrogates, and in fact uses UTF-16.
The Windows API for NT/2000/XP/2003 is UCS-2, but .NET might be UTF-16.

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #20

> But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.

True. But what is the main reason to use GetMaxByteCount?
Well, if I have a Unicode string and want to allocate a buffer for the
result of the conversion, then the typical use is
length_of_the_string * GetMaxByteCount().
Dealing with a string means I can get surrogates.

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #21

Mihai N. <nm**************@yahoo.com> wrote:
if it's Unicode characters then 6 is the correct value.


Yes.

This is what I tried to combat earlier.
The Unicode range is 0-10FFFF
This means max 4 bytes. Anything above 4 is possible but incorrect, and
can be produced only by broken encoders.
However, a
String, which is a collection of Char objects, can represent a Unicode
character encoded as a surrogate pair.

This tells me it is aware of surrogates, and in fact uses utf16.
The Windows API for NT/2000/XP/2003 is UCS2, but .NET might be UTF16.


The string class itself isn't aware of surrogates, as far as I know.
The encoder needs to be aware in order to know how to encode them, but
the question is whether the count parameter should treat a surrogate
pair as two characters or one - given the rest of the API which
strongly leans towards them being two characters, that's what I think
should happen here. In either case, the documentation should be very
clear about this.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #22

Mihai N. <nm**************@yahoo.com> wrote:
But that can't be represented by a single .NET character. It requires a
surrogate pair, which I'd expect to count as two characters as far as
the input of GetMaxByteCount is concerned - after all, the Encoding
will see two distinct characters making up the surrogate pair when it's
asked to encode a string or char array.

True. But what is the main reason to use GetMaxByteCount?
Well, if I have a unicode string and want to alloc a buffer for the
result of the conversion. Then the typical use is
length_of_the_string * GetMaxByteCount()
Dealing with a string means I can get surrogates.


Yes - and if GetMaxByteCount assumes that the surrogates will count as
two characters, you can just use:

int maxSize = encoding.GetMaxByteCount(myString.Length);
byte[] buffer = new byte[maxSize];
...

String.Length reports surrogates as two characters. For instance:

using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Gothic letter AHSA, UTF-32 value of U+10330
        string x = "\ud800\udf30";

        Console.WriteLine (x.Length);                           // 2
        Console.WriteLine (Encoding.UTF8.GetBytes(x).Length);   // 4
    }
}

Making UTF8Encoding.GetMaxByteCount(count) return count*3 will always
work with the type of code given earlier for creating a new buffer, and
will lead to less wastage than returning count*4.

It's only if String.Length counted surrogates as single characters that
you'd need to return count*4.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #23

> Vladimir, Unicode is NOT UTF-16. And .NET doesn't use UTF-16 internally;
> it uses UCS-2, which is a different encoding.


Different in what way?

UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from the documentation).
And it works straightforwardly with the Char structure.
Jul 21 '05 #24

Vladimir <xo***@tut.by> wrote:
Vladimir, Unicode is NOT UTF-16. And .Net doesn't use UTF-16 internally,
it uses UCS-2, which is different encoding.


Different in what?

UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from documentation).
And it works straight forward with Char structure.


The difference is that UCS-2 can only encode Unicode characters 0-
0xffff. UTF-16 can encode the whole of Unicode.

I'm not *entirely* clear, but I believe that the difference is fairly
minimal in .NET itself, unless you view the characters which form
surrogate pairs as invalid UCS-2 characters (pass on that one, I'm
afraid). If you had a 32-bit character data type to start with,
however, a correct UCS-2 encoding would reject characters above 0xffff,
whereas a correct UTF-16 encoding would cope.

I guess another way of looking at it (please someone, correct me if I'm
wrong!) is that although each character in .NET is only UCS-2, strings
are sometimes regarded as UTF-16. (It's the "sometimes" which is the
problem, here.)
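The relationship can be sketched by encoding a supplementary code point into UTF-16 by hand; the arithmetic below follows the UTF-16 definition, using the same U+10330 character that appears elsewhere in this thread:

```csharp
using System;

class SurrogatePairByHand
{
    static void Main()
    {
        // U+10330 is above 0xFFFF, so UTF-16 needs a surrogate pair for it.
        int codePoint = 0x10330;

        // Manual UTF-16 encoding of a supplementary code point:
        int v = codePoint - 0x10000;
        char high = (char)(0xD800 + (v >> 10));   // high (lead) surrogate
        char low  = (char)(0xDC00 + (v & 0x3FF)); // low (trail) surrogate
        string s = new string(new char[] { high, low });

        Console.WriteLine(((int)high).ToString("X4")); // D800
        Console.WriteLine(((int)low).ToString("X4"));  // DF30
        Console.WriteLine(s.Length);                   // 2: UCS-2 sees two code units
    }
}
```

A UCS-2 view stops at the two 16-bit units; a UTF-16 view recombines them into the single code point U+10330.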

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #25



P: n/a
By the way.
Is there a way to compress Unicode strings by SCSU in .Net?

Jul 21 '05 #28


Vladimir <xo***@tut.by> wrote:
By the way.
Is there a way to compress Unicode strings by SCSU in .Net?


I don't know of any way built into the framework. I'd be happy to
collaborate with someone on an open source solution, if people think it
would be useful.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #30


> It's only if String.Length counted surrogates as single characters that
> you'd need to return count*4.

This is why I used length_of_the_string (something generic) and not
String.Length (the real API).
My guess is that .NET and .NET strings are not yet fully aware of
surrogates. Some parts have to be (the converter), some parts not.
String.Length returns the number of Chars, but these are .NET chars,
not Unicode chars.
At some point you may need some API to tell you the length of the string
in Unicode chars. Imagine someone typing 5 Unicode characters, all of them
in the surrogate area. String.Length returns 10, the application complains
that the user name (for instance) should be at most 8 characters, and the
user is puzzled, because he typed only 5.
But the IME is not there for this, and many things are not in place yet.

We can assume this will be cleaned up at some point. All we can do is
understand the differences between Unicode (the standard) and the
real-life use of Unicode (.NET, NT, XP, Unix, etc.): know what the
standard states and what the implementations do differently.
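The 5-versus-10 mismatch described above can be sketched like this; note that StringInfo.LengthInTextElements is an assumption here, since it appeared only in framework versions later than the one being discussed:

```csharp
using System;
using System.Globalization;

class CharsVsCharacters
{
    static void Main()
    {
        // Five supplementary-plane characters, each stored as a surrogate pair.
        string name = "\ud800\udf30\ud800\udf31\ud800\udf32\ud800\udf33\ud800\udf34";

        Console.WriteLine(name.Length);  // 10: counts .NET Chars (UTF-16 code units)

        // Counts Unicode text elements, treating each surrogate pair as one.
        StringInfo info = new StringInfo(name);
        Console.WriteLine(info.LengthInTextElements);  // 5
    }
}
```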
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #32


P: n/a
Mihai N. <nm**************@yahoo.com> wrote:
> > It's only if String.Length counted surrogates as single characters that
> > you'd need to return count*4.
>
> This is why is used length_of_the_string (something generic) and not
> String.Length (real API).


What do you mean by "This is why is used"? Who are you saying is using
this code?
> My guess is that .NET and the strings in .NET are not yet fully
> aware of surrogates. Some parts have to be (convertor), some parts not.
> String.Length returns the number of Chars, but these are .NET chars,
> not Unicode chars.

Yup.

> At some point you may need some api to tell you the length of the string
> in Unicode chars.

Indeed. I wrote a Utf32String class a while ago which does all this,
and can convert to and from "normal" strings.

> Imagine someone typing 5 unicode characters, all of them
> in the surrogate area. String.Length returns 10, the application complains
> that the user name (for instance) should be max 8 characters, and the user
> is puzzled, because he did type only 5.

Blech - yes, that's horrible.

> But the IME is not there for this, and many things are not in place, yet.
>
> We can assume this will be cleaned out at some point. All we can do is
> understand the differences between Unicode (the standard) and the real
> life use of Unicode (.NET, NT, XP, Unix, etc). Know what the standard
> states and what the implementations do different.


Yup. To be honest, I can't see it being *cleanly* sorted without taking
the hit of going for full UTF-32 (or UCS-4 - I don't know if there's
any difference) characters. Doing that would be a nasty memory hit, but
it may be what's required.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #34


P: n/a
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use characters
0 - 0xFFFF. So I was wrong, .Net uses UTF-16. UCS-2 is simply using 16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).
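[The ranges Jerry gives can be checked concretely; a hedged Python sketch illustrating the encoding forms themselves, not any .NET API:]

```python
# U+10FFFF is the highest Unicode code point. UTF-16 reaches it with a
# surrogate pair; strict UCS-2 cannot, since UCS-2 stops at U+FFFF.
ch = "\U0010FFFF"
data = ch.encode("utf-16-le")

print(len(data))   # 4 bytes: one surrogate pair (two 16-bit units)
print(data.hex())  # 'ffdbffdf' -> units 0xDBFF (high), 0xDFFF (low)

# UCS-2 simply has no representation for this code point:
print(ord(ch) > 0xFFFF)  # True
```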

Jerry

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...
Vladimir <xo***@tut.by> wrote:
> Vladimir, Unicode is NOT UTF-16. And .Net doesn't use UTF-16
> internally,
> it uses UCS-2, which is different encoding.


Different in what?

UnicodeEncoding class represents a UTF-16 encoding of Unicode characters
(this is from documentation).
And it works straight forward with Char structure.


The difference is that UCS-2 can only encode Unicode characters 0-
0xffff. UTF-16 can encode the whole of Unicode.

I'm not *entirely* clear, but I believe that the difference is fairly
minimal in .NET itself, unless you view the characters which form
surrogate pairs as invalid UCS-2 characters (pass on that one, I'm
afraid). If you had a 32-bit character data type to start with,
however, a correct UCS-2 encoding would reject characters above 0xffff,
whereas a correct UTF-16 encoding would cope.

I guess another way of looking at it (please someone, correct me if I'm
wrong!) is that although each character in .NET is only UCS-2, strings
are sometimes regarded as UTF-16. (It's the "sometimes" which is the
problem, here.)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #36


P: n/a
Jerry Pisk <je******@hotmail.com> wrote:
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use characters
0 - 0xFFFF. So I was wrong, .Net uses UTF-16. UCS-2 is simply using 16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).


But each character itself in .NET is only 16 bits. It's only strings
which have the concept of surrogate pairs, surely. The .NET concept of
a character is limited to UCS-2, but other things can interpret
sequences of those characters as UTF-16 sequences.
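[The pairing arithmetic behind that UTF-16 interpretation is fixed by the Unicode standard; a small Python sketch for illustration only -- the function name here is the editor's, not a .NET or Unicode API:]

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into the two
    16-bit units a UTF-16 string stores for it."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000                  # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # lead surrogate, D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)        # trail surrogate, DC00..DFFF
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```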

If you could state *exactly* which part of my post was "not true" it
would make it easier to either defend my position or retract it though.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #38


P: n/a
Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).

Jerry

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...
Jerry Pisk <je******@hotmail.com> wrote:
Not true again, UTF-16 can only encode 0 - 0x10FFFF
(http://www.ietf.org/rfc/rfc2781.txt), while UCS-2 can only use
characters
0 - 0xFFFF. So I was wrong, .Net uses UTF-16. UCS-2 is simply using
16-bit
Unicode characters, without surrogate pairs (those are created by UTF-16
encoding).


But each character itself in .NET is only 16 bits. It's only strings
which have the concept of surrogate pairs, surely. The .NET concept of
a character is limited to UCS-2, but other things can interpret
sequences of those characters as UTF-16 sequences.

If you could state *exactly* which part of my post was "not true" it
would make it easier to either defend my position or retract it though.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #40

P: n/a
Jerry Pisk <je******@hotmail.com> wrote:
Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).


I don't believe that's true. While ISO/IEC 10646 contains 2^31 code
positions, I believe the Unicode standard itself limits characters to
the BMP or the first supplementary 14 planes of ISO/IEC 10646. From the
Unicode standard:

<quote>
The Principles and Procedures document of JTC1/SC2/WG2 states that all
future assignments of characters to 10646 will be constrained to the
BMP or the first 14 supplementary planes. This is to ensure
interoperability between the 10646 transformation formats (see below).
It also guarantees interoperability with implementations of the Unicode
Standard, for which only code positions 0..10FFFF₁₆ are meaningful.
</quote>

From elsewhere in the standard
(http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf):

<quote>
In the Unicode Standard, the codespace consists of the integers from 0
to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
the repertoire of abstract characters. Of course, there are constraints
on how the codespace is organized, and particular areas of the
codespace have been set aside for encoding of certain kinds of abstract
characters or for other uses in the standard. For more on the
allocation of the Unicode codespace, see Section 2.8, Unicode
Allocation.
</quote>

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #41

P: n/a
>> This is why is used length_of_the_string (something generic) and not
>> String.Length (real API).
>
> What do you mean by "This is why is used"? Who are you saying is using
> this code?

My mistake. "This is why I used". Was kind of pseudo-code to avoid any
specific API.

Otherwise, we seem to agree on all :-)
--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #42

P: n/a
> Not true was referring to you saying that UTF-16 can encode the whole
Unicode range. It can't, the maximum value it can encode is 0x10FFFF, not
the whole Unicode range (which doesn't currently use higher values but it
also doesn't say those are invalid).

0 - 0x10FFFF IS the whole Unicode range.
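[A quick arithmetic cross-check of Mihai's point, in plain Python, nothing .NET-specific:]

```python
# 17 planes (the BMP plus 16 supplementary planes) of 0x10000 code points
# each give exactly the codespace 0..0x10FFFF quoted from the standard.
print(17 * 0x10000)   # 1114112
print(0x10FFFF + 1)   # 1114112 -- so 0..0x10FFFF is indeed the whole range
```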

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
Jul 21 '05 #43

P: n/a
Mihai N. <nm**************@yahoo.com> wrote:
> > > This is why is used length_of_the_string (something generic) and not
> > > String.Length (real API).
> >
> > What do you mean by "This is why is used"? Who are you saying is using
> > this code?
>
> My mistake. "This is why I used". Was kind of pseudo-code to avoid any
> specific API.
>
> Otherwise, we seem to agree on all :-)


Yup. I've just submitted a comment to the MSDN team to make the docs
more explicit.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #44
