Unclear about string class

Hello,

Here's an excerpt from the MSDN online documentation:

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.

I did some testing:

string test = "%BC";
Console.WriteLine((long) test[0]);
Console.WriteLine((long) test[1]);
Console.WriteLine((long) test[2]);
Console.WriteLine(test.IndexOf("%"));
Console.WriteLine(test.IndexOf("B"));
Console.WriteLine(test.IndexOf("C"));
// Where "%" is actually a Japanese character (code = 38283, i.e. U+958B);
// the .cs file is saved with UTF-8 encoding.

So, from the doc, I should get something like 0, 3, 4. But I actually get
the normal 0, 1, 2.

So, where's the explanation? Why is the documentation warning about a difference
between indexes and Unicode characters when I can't detect one?

-- Pavils
Nov 16 '05 #1
Strings in .NET are sequences of UTF-16 code units. For the
vast majority of characters, one character is encoded in 16 bits
(a single .NET Char). Some characters are encoded as two 16-bit
values (a surrogate pair) - similar to the way MBCS works on
Win32 systems. These are pretty rare, and in my experience, I have not
seen any .NET code that even attempts to deal with them - most
.NET code I've seen treats a System.Char as a character.

So I guess what they're saying is that an index into a .NET String
points to a System.Char, but not necessarily to a whole Unicode
character, since some characters are encoded in UTF-16 using more
than one code unit.
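
To see the effect the docs describe, you need a character outside the Basic Multilingual Plane. Here's a minimal sketch, assuming U+1D11E (MUSICAL SYMBOL G CLEF) as the test character; that character and the SurrogateDemo class are my own picks for illustration, not something from the original test:

```csharp
using System;
using System.Globalization;

public static class SurrogateDemo
{
    // U+1D11E lies outside the BMP, so UTF-16 needs two Chars
    // (a surrogate pair) to represent it. Illustrative choice only.
    public static readonly string Clef = char.ConvertFromUtf32(0x1D11E);

    // Number of UTF-16 code units (System.Char values) in the string.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // Number of text elements as reported by StringInfo.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string test = Clef + "BC";

        Console.WriteLine(CharCount(test));          // 4 Chars
        Console.WriteLine(TextElementCount(test));   // 3 text elements
        // "B" sits at Char index 2, not 1, because the clef takes two Chars.
        Console.WriteLine(test.IndexOf("B", StringComparison.Ordinal));
        Console.WriteLine(char.IsHighSurrogate(test[0])); // True
        Console.WriteLine(char.IsLowSurrogate(test[1]));  // True
    }
}
```

With a BMP character such as the one in the original test (code 38283), both counts would be 3 and the indexes would be 0, 1, 2, which matches what was observed.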

See Jon Skeet's excellent FAQ on Unicode/Character Encoding issues:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb
Nov 16 '05 #2
Are you sure that your Japanese character consists of more than one Unicode
character?

--
cody

Freeware Tools, Games and Humour
http://www.deutronium.de.vu || http://www.deutronium.tk
Nov 16 '05 #3
> Are you sure that your japanese character consists of more than one
> unicode character?


No, but I know that it is converted to a number of characters when encoded to
UTF-8, and the doc claims that when I look up an index, I will get the one
according to the UTF-8 encoded string, not the Unicode character
position.

-- Pavils
Nov 16 '05 #4
Pavils Jurjans wrote:
>> Are you sure that your japanese character consists of more than one
>> unicode character?
>
> No, but I know that it is converted to number of characters, when encoded to
> UTF-8, and the doc claims that when I look for index, I will get the one
> according to the UTF-8 encoded character string, not the unicode character
> position.


Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.

--
mikeb
Nov 16 '05 #5
mikeb <ma************@nospam.mailnull.com> wrote:
> Pavils Jurjans wrote:
>>> Are you sure that your japanese character consists of more than one
>>> unicode character?
>>
>> No, but I know that it is converted to number of characters, when encoded to
>> UTF-8

A character isn't converted into *characters* when it's encoded - it's
converted into *bytes*. There's a big difference.

>> and the doc claims that when I look for index, I will get the one
>> according to the UTF-8 encoded character string, not the unicode character
>> position.
>
> Could you clarify where it says you'll get the index according to a
> UTF-8 encoding? My reading of the String class docs indicate that it
> will provide an index according to UTF-16 encoding.

Yup, that's absolutely correct. I'm mystified as to where this doc is
too...
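
To make the bytes-versus-Chars distinction concrete, here's a rough sketch. It assumes the character in the original test is U+958B (that is, code 38283); the EncodingDemo class and its helper are hypothetical names made up for this example:

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Number of bytes the string occupies once encoded as UTF-8.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static void Main()
    {
        // U+958B is the character with code 38283 from the original test.
        string test = "\u958B" + "BC";

        Console.WriteLine((int) test[0]);        // 38283
        Console.WriteLine(test.Length);          // 3 Chars in memory (UTF-16)
        Console.WriteLine(Utf8ByteCount(test));  // 5 bytes in UTF-8: 3 + 1 + 1
        // String indexes count Chars, so "B" is at 1 regardless of how
        // the source file or any output stream is encoded.
        Console.WriteLine(test.IndexOf("B", StringComparison.Ordinal));
    }
}
```

The 0, 3, 4 that was expected corresponds to byte offsets in the UTF-8 encoded form, which only exist once the string has been converted to a byte array. String indexes never refer to them.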

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #6
