Unclear about string class

Hello,

Here's an excerpt from the MSDN online documentation:

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.

I did some testing:

string test = "%BC";
Console.WriteLine((long) test[0]);
Console.WriteLine((long) test[1]);
Console.WriteLine((long) test[2]);
Console.WriteLine(test.IndexOf("%"));
Console.WriteLine(test.IndexOf("B"));
Console.WriteLine(test.IndexOf("C"));
// Where "%" is actually a Japanese character (code 38283, i.e. U+958B); the
// .cs file is saved with UTF-8 encoding.

So, from the doc, I expected to get something like 0, 3, 4. But I actually get
the normal 0, 1, 2.

So, where's the explanation? Why does the documentation warn about a difference
between indexes and Unicode characters when I can't detect one?

-- Pavils
Nov 16 '05 #1
5 Replies


Pavils Jurjans wrote:
> So, from the doc, I expected to get something like 0, 3, 4. But I actually get
> the normal 0, 1, 2.
>
> So, where's the explanation? Why does the documentation warn about a difference
> between indexes and Unicode characters when I can't detect one?

Strings in .NET consist of UTF-16 code units. For the vast majority of
characters, one character is encoded into 16 bits (a single .NET Char).
Some characters are encoded as two 16-bit values (a surrogate pair),
similar to the way MBCS works on Win32 systems. These are pretty rare,
and in my experience, I have not seen any .NET code that even attempts
to deal with them; most .NET code I've seen treats a System.Char as a
character.

So I guess what they're saying is that an index into a .NET String will
point to a System.Char, but that it is not necessarily pointing to a
whole Unicode character, since some characters are encoded in UTF-16
using more than one code unit.
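
As a rough illustration of the point above (a sketch, not taken from the MSDN docs): U+958B, which is code 38283 from the original post, fits in a single Char, so the indexes come out as 0, 1, 2. A character outside the Basic Multilingual Plane takes two Chars and shifts every index after it. U+1D11E is picked here only as an example of such a character, and the class and method names are made up for the demo.

```csharp
using System;
using System.Globalization;

// Hypothetical demo class; the names are illustrative only.
public static class CharIndexDemo
{
    // Number of UTF-16 code units (System.Char values) in the string.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // Number of text elements, as counted by StringInfo.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // Char index at which each text element starts.
    public static int[] TextElementStarts(string s)
    {
        return StringInfo.ParseCombiningCharacters(s);
    }

    public static void Main()
    {
        // U+958B is in the Basic Multilingual Plane: one Char.
        string bmp = "\u958B" + "BC";
        Console.WriteLine(CharCount(bmp));                                // 3
        Console.WriteLine(bmp.IndexOf("B", StringComparison.Ordinal));    // 1

        // U+1D11E is outside the BMP: a surrogate pair, two Chars.
        string astral = "\U0001D11E" + "BC";
        Console.WriteLine(CharCount(astral));                             // 4
        Console.WriteLine(astral.IndexOf("B", StringComparison.Ordinal)); // 2
        Console.WriteLine(char.IsHighSurrogate(astral[0]));               // True
        Console.WriteLine(char.IsLowSurrogate(astral[1]));                // True

        // StringInfo works per text element rather than per Char.
        Console.WriteLine(TextElementCount(astral));                      // 3
        Console.WriteLine(string.Join(", ", TextElementStarts(astral)));  // 0, 2, 3
    }
}
```

So the test string in the question never hits the case the documentation warns about; it would take a character beyond U+FFFF (or a combining sequence) to see the indexes diverge.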

See Jon Skeet's excellent FAQ on Unicode/Character Encoding issues:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb
Nov 16 '05 #2

> So, from the doc, I expected to get something like 0, 3, 4. But I actually get
> the normal 0, 1, 2.
>
> So, where's the explanation? Why does the documentation warn about a difference
> between indexes and Unicode characters when I can't detect one?

Are you sure that your Japanese character consists of more than one Unicode
character?

--
cody

Freeware Tools, Games and Humour
http://www.deutronium.de.vu || http://www.deutronium.tk
Nov 16 '05 #3

> Are you sure that your Japanese character consists of more than one
> Unicode character?

No, but I know that it is converted to a number of characters when encoded to
UTF-8, and the doc claims that when I look for an index, I will get one
according to the UTF-8 encoded character string, not the Unicode character
position.

-- Pavils
Nov 16 '05 #4

Pavils Jurjans wrote:
> > Are you sure that your Japanese character consists of more than one
> > Unicode character?
>
> No, but I know that it is converted to a number of characters when encoded to
> UTF-8, and the doc claims that when I look for an index, I will get one
> according to the UTF-8 encoded character string, not the Unicode character
> position.

Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.

--
mikeb
Nov 16 '05 #5

mikeb <ma************@nospam.mailnull.com> wrote:
> Pavils Jurjans wrote:
> > > Are you sure that your Japanese character consists of more than one
> > > Unicode character?
> >
> > No, but I know that it is converted to a number of characters when encoded to
> > UTF-8

A character isn't converted into *characters* when it's encoded - it's
converted into *bytes*. There's a big difference.

> > and the doc claims that when I look for an index, I will get one
> > according to the UTF-8 encoded character string, not the Unicode character
> > position.
>
> Could you clarify where it says you'll get the index according to a
> UTF-8 encoding? My reading of the String class docs indicates that it
> will provide an index according to UTF-16 encoding.

Yup, that's absolutely correct. I'm mystified as to where this doc is
too...
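
For what it's worth, here is a small sketch of the characters-versus-bytes distinction, assuming the test string from the original post with U+958B (code 38283) as its first character. The class and method names are made up for the demo. The 0, 3, 4 the original poster expected are offsets into the UTF-8 bytes, which only exist once the string is explicitly encoded; the String itself is indexed by Char.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class Utf8VersusCharsDemo
{
    // Number of bytes the string occupies once encoded as UTF-8.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Offset of an ASCII character within the UTF-8 encoded bytes.
    public static int Utf8ByteOffset(string s, char ascii)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Array.IndexOf(bytes, (byte)ascii);
    }

    public static void Main()
    {
        string test = "\u958B" + "BC";

        // Indexing the String: one Char per character here.
        Console.WriteLine(test.Length);                                 // 3
        Console.WriteLine(test.IndexOf("B", StringComparison.Ordinal)); // 1
        Console.WriteLine(test.IndexOf("C", StringComparison.Ordinal)); // 2

        // Indexing the UTF-8 bytes: U+958B takes three bytes.
        Console.WriteLine(Utf8ByteCount(test));                         // 5
        Console.WriteLine(Utf8ByteOffset(test, 'B'));                   // 3
        Console.WriteLine(Utf8ByteOffset(test, 'C'));                   // 4
    }
}
```

The encoding of the .cs file only matters to the compiler when it reads the source; it has no effect on how the resulting String is indexed at run time.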

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #6
