Unclear about string class

Hello,

Here's an excerpt from msdn online documentation:

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.

I did some testing:

string test = "%BC";
Console.WriteLine((long) test[0]);
Console.WriteLine((long) test[1]);
Console.WriteLine((long) test[2]);
Console.WriteLine(test.IndexOf("%"));
Console.WriteLine(test.IndexOf("B"));
Console.WriteLine(test.IndexOf("C"));
// "%" stands in for a Japanese character (code 38283); the .cs file is
// saved with UTF-8 encoding.

So, from the doc, I should get something like 0, 3, 4. But I actually get
the normal 0, 1, 2.

So, what is the explanation? Why does the documentation warn about a difference
between indexes and Unicode characters when I can't detect one?

-- Pavils
Nov 16 '05 #1
Pavils Jurjans wrote:
> So, what is the explanation? Why does the documentation warn about a difference
> between indexes and Unicode characters when I can't detect one?

Strings in .NET are sequences of UTF-16 code units. For the
vast majority of characters, one character is encoded in 16 bits
(a single .NET Char). Some characters, those outside the Basic
Multilingual Plane, are encoded as a pair of 16-bit values (a surrogate
pair), similar to the way MBCS works on Win32 systems. These are pretty
rare, and in my experience, I have not seen any .NET code that even
attempts to deal with them; most .NET code I've seen treats a
System.Char as a character.

So I guess what they're saying is that an index into a .NET String
points to a System.Char, but not necessarily to a whole Unicode
character, since some Unicode characters are encoded using more than
one UTF-16 code unit.
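
To make the surrogate-pair case concrete, here is a minimal sketch. The class name, the helper `TextElementCount`, and the choice of U+1D11E (a character outside the Basic Multilingual Plane) are illustrative assumptions, not something from the original posts.

```csharp
using System;
using System.Globalization;

static class SurrogatePairDemo
{
    // Hypothetical helper: counts text elements rather than UTF-16 Chars.
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;

    static void Main()
    {
        // U+1D11E lies above U+FFFF, so UTF-16 needs two Chars
        // (a high and a low surrogate) to encode it.
        string clef = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(clef.Length);                     // 2 Chars
        Console.WriteLine(char.IsHighSurrogate(clef[0]));   // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));    // True
        Console.WriteLine(TextElementCount(clef));          // 1 text element
    }
}
```

Here `Length` counts Chars while `StringInfo` counts text elements, which is the distinction the documentation excerpt is drawing.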

See Jon Skeet's excellent FAQ on Unicode/Character Encoding issues:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb
Nov 16 '05 #2
> So, from the doc, I should get something like 0, 3, 4. But I actually get
> the normal 0, 1, 2.

Are you sure that your japanese character consists of more than one unicode
character?
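
One way to check this directly is sketched below. The helper `FitsInOneChar` is a hypothetical name introduced for illustration; it relies on `char.ConvertFromUtf32`, which throws for values that are not valid Unicode scalar values, so it is only meant for valid code points such as the 38283 quoted above.

```csharp
using System;

static class SingleCharCheck
{
    // Hypothetical helper: true when the code point is encoded as one
    // UTF-16 Char, i.e. it is below U+10000.
    public static bool FitsInOneChar(int codePoint) =>
        char.ConvertFromUtf32(codePoint).Length == 1;

    static void Main()
    {
        Console.WriteLine(FitsInOneChar(38283));    // True: 38283 is U+958B
        Console.WriteLine(FitsInOneChar(0x1D11E));  // False: needs a surrogate pair
    }
}
```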

--
cody

Freeware Tools, Games and Humour
http://www.deutronium.de.vu || http://www.deutronium.tk
Nov 16 '05 #3
> Are you sure that your japanese character consists of more than one
> unicode character?


No, but I know that it is converted to number of characters, when encoded to
UTF-8, and the doc claims that when I look for index, I will get the one
according to the UTF-8 encoded character string, not the unicode character
position.

-- Pavils
Nov 16 '05 #4
Pavils Jurjans wrote:
>> Are you sure that your japanese character consists of more than one
>> unicode character?
>
> No, but I know that it is converted to number of characters, when encoded to
> UTF-8, and the doc claims that when I look for index, I will get the one
> according to the UTF-8 encoded character string, not the unicode character
> position.

Could you clarify where it says you'll get the index according to a
UTF-8 encoding? My reading of the String class docs indicates that it
will provide an index according to UTF-16 encoding.
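
A small sketch of what indexing by UTF-16 Chars means for the test in the original post. The `Build` helper and the U+1D11E test character are assumptions added for illustration; ordinal comparison is used so that culture rules do not affect the result.

```csharp
using System;

static class IndexOfDemo
{
    // Hypothetical helper: one Unicode character followed by "BC".
    public static string Build(int firstCodePoint) =>
        char.ConvertFromUtf32(firstCodePoint) + "BC";

    static void Main()
    {
        string bmp = Build(38283);             // first character is one Char
        string supplementary = Build(0x1D11E); // first character is two Chars

        Console.WriteLine(bmp.IndexOf("B", StringComparison.Ordinal));           // 1
        Console.WriteLine(bmp.IndexOf("C", StringComparison.Ordinal));           // 2
        Console.WriteLine(supplementary.IndexOf("B", StringComparison.Ordinal)); // 2
        Console.WriteLine(supplementary.IndexOf("C", StringComparison.Ordinal)); // 3
    }
}
```

With a character that fits in one Char the indexes are 0, 1, 2, as observed; the gap only appears when the leading character needs a surrogate pair.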

--
mikeb
Nov 16 '05 #5
mikeb <ma************@nospam.mailnull.com> wrote:
> Pavils Jurjans wrote:
>>> Are you sure that your japanese character consists of more than one
>>> unicode character?
>>
>> No, but I know that it is converted to number of characters, when encoded to
>> UTF-8

A character isn't converted into *characters* when it's encoded - it's
converted into *bytes*. There's a big difference.

>> and the doc claims that when I look for index, I will get the one
>> according to the UTF-8 encoded character string, not the unicode character
>> position.
>
> Could you clarify where it says you'll get the index according to a
> UTF-8 encoding? My reading of the String class docs indicates that it
> will provide an index according to UTF-16 encoding.

Yup, that's absolutely correct. I'm mystified as to where this doc is
too...
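
The bytes-versus-Chars distinction can be seen with a short sketch. The class and the `Utf8ByteCount` helper are illustrative names, and the string is built from the code 38283 given in the original post.

```csharp
using System;
using System.Text;

static class Utf8BytesDemo
{
    // Hypothetical helper: number of bytes the string occupies in UTF-8.
    public static int Utf8ByteCount(string s) => Encoding.UTF8.GetByteCount(s);

    static void Main()
    {
        string test = char.ConvertFromUtf32(38283) + "BC";

        Console.WriteLine(test.Length);          // 3 Chars in the string
        Console.WriteLine(Utf8ByteCount(test));  // 5 bytes: 3 for U+958B, 1 each for B and C
    }
}
```

The offsets 0, 3, 4 expected in the original post are positions in the UTF-8 byte sequence of the source file, not indexes into the String.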

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #6
