Bytes IT Community

Unicode, encodings, and Asian languages: need some help.

Hello,

I'm writing a class library that I imagine people from different countries
might be interested in using, so I'm considering what needs to be provided
to support foreign languages, including Asian languages (Chinese, Japanese,
Korean, etc.).

First of all, strings will be passed to my class methods, and some of them,
based on the language (and on the encoding), might contain characters that
require more than a single byte.
Having to cycle through each byte composing each char of an input string,
how does .NET guarantee that the string is broken up correctly into its
composing chars based on the string's language? In other words, how does
.NET identify the correct "boundary" for each char (which bytes are part of
each char) based on the string's language? Also, what is the encoding with
which strings are initially taken into memory? Does this encoding depend
on the culture set for the current thread, or does it maybe depend on the
encoding for the system's current ANSI code page? Is there a way to set
the encoding that .NET should be using for strings so that when cycling
through the characters in the string, bytes are correctly assigned to each
char based on the string's language?
Regards,
Bob Rock


Mar 30 '06 #1
40 Replies


All chars and strings are Unicode 16 (or UTF-16): each char requires two
bytes. More specifically, this is not full UTF-16 but a subset called UCS-2,
because it excludes extended characters (those that require three or four
bytes). Of course, you can store and display extended characters in a UCS-2
string, but comparisons (lexically lesser or greater) won't work
correctly (i.e., these extended characters won't be taken into account
correctly).

The notions of culture, localization and code page mainly come into play
when .NET must communicate (read or write something) with the
external world.
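To see this two-byte storage for yourself, here is a minimal console sketch (the musical G clef, U+1D11E, is just an arbitrary example of an extended character outside the BMP):

```csharp
using System;

class Utf16Demo
{
    static void Main()
    {
        // U+1D11E lies above U+FFFF, so UTF-16 stores it
        // as a surrogate pair: two char values, four bytes.
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length);                   // 2 (UTF-16 code units, not "characters")
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));  // True
    }
}
```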

--
Sylvain Lafontaine, ing.
MVP - Technologies Virtual-PC
E-mail: http://cerbermail.com/?QugbLEWINF

Mar 31 '06 #2

> All chars and strings are Unicode 16 (or UTF-16): each char requires two
> bytes. More specifically, this is not full UTF-16 but a subset called UCS-2,
> because it excludes extended characters (those that require three or four
> bytes). Of course, you can store and display extended characters in a UCS-2
> string, but comparisons (lexically lesser or greater) won't work
> correctly (i.e., these extended characters won't be taken into account
> correctly).

Sorry to contradict, but it really is UTF-16.
I think the confusion comes from the old NT, which was indeed UCS-2.
And in fact Windows 2000 was already UTF-16.
As with most things, Windows improved. NT had no clue of UTF-16
(normal, since UTF-16 did not exist at the time :-).
W2K was better, WXP is even better, but still not perfect.

So, .NET is UTF-16. Maybe not perfect, but those are bugs :-)
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Mar 31 '06 #3

Yet another sorry to contradict, but the "Unicode" used in .NET v1.1 is UTF-8
(not sure about .NET v2.0).

See the "Unicode" page on Wikipedia to get a clear idea of the consequences
involved, but the following quote gives you the basic differences:
Unicode defines two mapping methods:

- the UTF (Unicode Transformation Format) encodings
- the UCS (Universal Character Set) encodings

The encodings include:

- UTF-7 — a relatively unpopular 7-bit encoding, often considered obsolete
- UTF-8 — an 8-bit, variable-width encoding
- UCS-2 — a 16-bit, fixed-width encoding that only supports the BMP
- UTF-16 — a 16-bit, variable-width encoding
- UCS-4 and UTF-32 — functionally identical 32-bit fixed-width encodings
- UTF-EBCDIC — an unpopular encoding intended for EBCDIC-based mainframe systems

Mar 31 '06 #4

I like it when people use posts to provide any answer and carefully avoid
considering the questions asked ... and often even end up diverting the
entire thread. I wonder, is it because to become or remain an MCP you must
guarantee a certain number of posts per month ... and any post will
contribute to that number?

Anyway, here I go again.
I'll have strings in various languages (including East Asian languages)
passed to my class methods, and I need to run an algorithm on the bytes that
compose each char in the strings (char based on the string's language).

Here are my questions:

1) How does the .NET Framework know how to appropriately assign bytes to
chars? How does the Framework identify the correct "boundary" for each
char (which bytes are part of each char) based on the string's language?
2) Is there a way to set the encoding that .NET should be using for strings
so that when cycling through the characters in the string (look at the code
below), bytes are correctly assigned to each char based on the string's
language?
3) In what encoding are strings kept in memory? Does this encoding depend
on the culture set for the current thread, or does it maybe depend on the
encoding for the system's current ANSI code page?
4) Having a string in Chinese (simplified or traditional), in Japanese or in
Korean passed into my methods, would the following code be enough to
guarantee that ch always corresponds to a *full* char in the specified
language:

foreach (char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[] { ch });
    foreach (byte b in bytes)
    {
        // execute algorithm on byte b
    }
}

Bob Rock


Mar 31 '06 #5

Lau Lei Cheong <le****@yehoo.com.hk> wrote:
Yet another sorry to contradict, the "Unicode" used in .NET v1.1 is UTF-8
(not sure in .NET v2.0).


No, that's just not true - and nothing that you posted gave any
evidence for it.

From the docs (admittedly for 2.0, but this hasn't changed) for String:

<quote>
Each Unicode character in a string is defined by a Unicode scalar
value, also called a Unicode code point or the ordinal (numeric) value
of the Unicode character. Each code point is encoded using UTF-16
encoding, and the numeric value of each element of the encoding is
represented by a Char object.
</quote>

Similarly from the docs for System.Char:

<quote>
The .NET Framework uses the Char structure to represent Unicode
characters. The Unicode Standard identifies each Unicode character with
a unique 21-bit scalar number called a code point, and defines the UTF-
16 encoding form that specifies how a code point is encoded into a
sequence of one or more 16-bit values. Each 16-bit value ranges from
hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure.
The value of a Char object is its 16-bit numeric (ordinal) value.
</quote>
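As a small sketch of what those quotes mean in practice (the Korean syllable used here is just an arbitrary BMP example):

```csharp
using System;

class CharOrdinalDemo
{
    static void Main()
    {
        // Each char holds one 16-bit UTF-16 code unit; its numeric
        // (ordinal) value is the code point for BMP characters.
        char hangul = '\uD55C';             // 한 (U+D55C)
        Console.WriteLine((int)hangul);     // 54620, i.e. 0xD55C
        Console.WriteLine("\uD55C".Length); // 1: a BMP character is a single code unit
    }
}
```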

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #6

apprentice <ye********************@hotmail.nospam.com> wrote:

<snip>
> Anyway, here I go again.
> I'll have strings in various languages (including East Asian languages)
> passed to my class methods and I need to run an algorithm on the bytes
> that compose each char in the strings (char based on the string's language).
>
> Here are my questions:
>
> 1) How does the .NET Framework know how to appropriately assign bytes to
> chars? How does the Framework identify the correct "boundary" for each
> char (which bytes are part of each char) based on the string's language?

Strings don't have languages. All strings are stored in UTF-16.

> 2) Is there a way to set the encoding that .NET should be using for strings
> so that when cycling through the characters in the string (look at the code
> below), bytes are correctly assigned to each char based on the string's
> language?

The conversion between bytes and strings is performed by the Encoding
classes.

> 3) In what encoding are strings kept in memory? Does this encoding depend
> on the culture set for the current thread, or does it maybe depend on the
> encoding for the system's current ANSI code page?

As has been specified, UTF-16.

> 4) Having a string in Chinese (simplified or traditional), in Japanese or in
> Korean passed into my methods, would the following code be enough to
> guarantee that ch always corresponds to a *full* char in the specified
> language:
>
> foreach (char ch in text.ToCharArray())
> {
>     byte[] bytes = this._encoding.GetBytes(new char[] { ch });
>     foreach (byte b in bytes)
>     {
>         // execute algorithm on byte b
>     }
> }


That wouldn't do anything about surrogate characters. If you really
care about those (and I didn't *think* that any natural language
characters were in the surrogate range, although I could be wrong) you
might be interested in my Utf32String class:

http://www.pobox.com/~skeet/csharp/miscutil
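For completeness, a sketch (an illustrative loop, not code from that library) of walking whole Unicode code points rather than raw chars, so surrogate pairs stay together:

```csharp
using System;

class CodePointDemo
{
    static void Main()
    {
        string text = "a\U0001D11Eb"; // 'a', a supplementary character, 'b'

        // char.ConvertToUtf32 combines a surrogate pair into one code point;
        // advance an extra char whenever a pair was consumed.
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            Console.WriteLine("U+{0:X4}", codePoint);
            if (char.IsHighSurrogate(text[i]))
                i++; // skip the low surrogate already consumed
        }
        // Prints U+0061, U+1D11E, U+0062
    }
}
```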

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #7

Hello Jon,

let my try to clarify some of my statements.

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
apprentice <ye********************@hotmail.nospam.com> wrote:

<snip>
> Anyway, here I go again.
> I'll have strings in various languages (including East Asian languages)
> passed to my class methods and I need to run an algorithm on the bytes
> that compose each char in the strings (char based on the string's language).
>
> Here are my questions:
>
> 1) How does the .NET Framework know how to appropriately assign bytes to
> chars? How does the Framework identify the correct "boundary" for each
> char (which bytes are part of each char) based on the string's language?

Strings don't have languages. All strings are stored in UTF-16.


You are right. What I was trying to say is whether the .NET Framework is able
to somehow guess (e.g. through statistical analysis) the natural language of
a string, thus getting an appropriate Encoding, or if, more simply, it gets the
appropriate Encoding instance by doing something like
Encoding.GetEncoding(system_ansi_code_page)? In fact, if you take a look
at the Encoding class, the Default property does exactly that. I guess that
strings are encoded using that default Encoding instance.
> 2) Is there a way to set the encoding that .NET should be using for strings
> so that when cycling through the characters in the string (look at the code
> below), bytes are correctly assigned to each char based on the string's
> language?

The conversion between bytes and strings is performed by the Encoding
classes.


Yes, that I already knew. However, in a cycle such as the following, am I
guaranteed that each char handed to me is exactly a char in the string's
natural language? I wonder, how can .NET break up the string correctly into
its natural-language chars?

foreach (char ch in text.ToCharArray())
{
    // break up ch in bytes
}
> 3) In what encoding are strings kept in memory? Does this encoding depend
> on the culture set for the current thread, or does it maybe depend on the
> encoding for the system's current ANSI code page?

As has been specified, UTF-16.


I was hoping that the Framework would somehow allow specifying the code page
to use to get the correct Encoding instance. That would probably guarantee
that the cycle above behaves as I need (correctly breaking up the string into
its natural-language chars).
> 4) Having a string in Chinese (simplified or traditional), in Japanese or in
> Korean passed into my methods, would the following code be enough to
> guarantee that ch always corresponds to a *full* char in the specified
> language:
>
> foreach (char ch in text.ToCharArray())
> {
>     byte[] bytes = this._encoding.GetBytes(new char[] { ch });
>     foreach (byte b in bytes)
>     {
>         // execute algorithm on byte b
>     }
> }
That wouldn't do anything about surrogate characters. If you really
care about those (and I didn't *think* that any natural language
characters were in the surrogate range, although I could be wrong) you
might be interested in my Utf32String class:


That is exactly the knowledge I'm after. Does any natural language
(especially Asian languages such as Chinese, Japanese, Korean or Vietnamese)
require more than the 2 bytes provided by .NET?
http://www.pobox.com/~skeet/csharp/miscutil
Thanks. I'll take a look at it.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Mar 31 '06 #8

apprentice <ye********************@hotmail.nospam.com> wrote:
> Here are my questions:
>
> 1) How does the .NET Framework know how to appropriately assign bytes to
> chars? How does the Framework identify the correct "boundary" for each
> char (which bytes are part of each char) based on the string's language?


Strings don't have languages. All strings are stored in UTF-16.


> You are right. What I was trying to say is whether the .NET Framework is able
> to somehow guess (e.g. through statistical analysis) the natural language of
> a string, thus getting an appropriate Encoding, or if, more simply, it gets the
> appropriate Encoding instance by doing something like
> Encoding.GetEncoding(system_ansi_code_page)? In fact, if you take a look
> at the Encoding class, the Default property does exactly that. I guess that
> strings are encoded using that default Encoding instance.


No, strings are always stored internally as UTF-16. All Unicode
characters can be represented in UTF-16, using surrogate pairs for
Unicode characters above U+FFFF. There's no "natural language" of a
string - it's always stored in UTF-16.

Now, if you're talking about converting to and from bytes when (say)
reading from a file, that's a different matter - and it depends on what
API you're using. Most default to a UTF-8 encoding (making
Encoding.Default a really bad name) but allow you to specify an
encoding.

Once the string has been read in, however, there is no trace of which
encoding was used to convert the bytes to chars.
> 2) Is there a way to set the encoding that .NET should be using for strings
> so that when cycling through the characters in the string (look at the code
> below), bytes are correctly assigned to each char based on the string's
> language?


The conversion between bytes and strings is performed by the Encoding
classes.


> Yes, that I already knew. However, in a cycle such as the following, am I
> guaranteed that each char handed to me is exactly a char in the string's
> natural language? I wonder, how can .NET break up the string correctly into
> its natural-language chars?
>
> foreach (char ch in text.ToCharArray())
> {
>     // break up ch in bytes
> }


Again, there is no concept of "natural language char". In your code
snippet (which creates a char array unnecessarily, btw - you can just
use foreach (char ch in text)) each char is a UTF-16 code point. If you
want to convert that text data into bytes, you need to explicitly use
an encoding.
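To make "explicitly use an encoding" concrete, a minimal sketch (the Korean sample string is an arbitrary example):

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // In memory the string is always UTF-16; an Encoding only matters
        // when converting to or from bytes (files, sockets, etc.).
        string korean = "\uD55C\uAD6D\uC5B4"; // 한국어
        byte[] utf8 = Encoding.UTF8.GetBytes(korean);
        Console.WriteLine(utf8.Length);       // 9: three chars, three UTF-8 bytes each
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == korean); // True: lossless round trip
    }
}
```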
> 3) In what encoding are strings kept in memory? Does this encoding depend
> on the culture set for the current thread, or does it maybe depend on the
> encoding for the system's current ANSI code page?


As has been specified, UTF-16.


> I was hoping that the Framework would somehow allow specifying the code page
> to use to get the correct Encoding instance. That would probably guarantee
> that the cycle above behaves as I need (correctly breaking up the string into
> its natural-language chars).


It's not at all clear what the ultimate goal is. What is the larger
picture here?
That wouldn't do anything about surrogate characters. If you really
care about those (and I didn't *think* that any natural language
characters were in the surrogate range, although I could be wrong) you
might be interested in my Utf32String class:


> That is exactly the knowledge I'm after. Does any natural language
> (especially Asian languages such as Chinese, Japanese, Korean or Vietnamese)
> require more than the 2 bytes provided by .NET?


http://www.jbrowse.com/text/ suggests that it should be okay:

<quote>
There are enough code points (without using surrogates, see below) to
represent all the characters commonly in use in Japan, China and Korea
</quote>

http://www.unicode.org/roadmaps/index.html gives a pretty good
indication of what's likely to be in each of the "planes" (BMP, or
plane 0, is what can be handled without surrogates).

In general, http://www.unicode.org is the authority on all these
matters - if you want to know whether a given character is covered,
look there.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #9

BTW, the algo I posted was missing the line of code where I get the correct
encoding instance based on the natural language.
Here is the full snippet:

this._encoding = Encoding.GetEncoding(naturalLanguageCodePage);

foreach (char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[] { ch });
    foreach (byte b in bytes)
    {
        // algorithm that works on byte b
    }
}

Bob Rock
Mar 31 '06 #10

apprentice <ye********************@hotmail.nospam.com> wrote:
BTW, the algo I posted was missing the line of code where I get the correct
encoding instance based on the natural language.
Here is the full snippet:

this._encoding = Encoding.GetEncoding(naturalLanguageCodePage);

foreach (char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[] { ch });
    foreach (byte b in bytes)
    {
        // algorithm that works on byte b
    }
}


That's unlikely to be useful - dealing with individual bytes doesn't
make nearly as much sense as dealing with a character at a time. What
is your algorithm meant to do?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #11

OK, let's start over. I'll try my best to be clear about my intent.

1) I will have strings in different languages passed to my classes (examples
of such languages might be Chinese, Japanese, Korean or Vietnamese).

2) I need to operate on the bytes that compose each char. For my algo to
work correctly, the string must be broken up correctly into chars ... and I
mean chars as they would be understood in the string's language (e.g.
Chinese, Japanese, Korean or Vietnamese).

3) I imagine that when a string (let's suppose it is in Korean) is passed in
as a parameter to one of my methods, it will be taken into memory encoded in
an encoding based on the system's current ANSI code page (that which is
returned by the Encoding.Default property). I also imagine that when I run a
piece of code such as the following, the string will be broken into chars
based on that encoding ... which might not be the correct one for the
string's language (Korean).

foreach (char ch in myString)
{
    // code that operates on ch
}

4) I thought that if I could specify the correct encoding (as I do below),
the cycle would work correctly:

Encoding koreanEncoding = Encoding.GetEncoding(codePageForKorean);

5) Not being able to do so, what do I have to expect from the following
code? Will it correctly break up the string into chars whatever language it
might be in (again Chinese, Japanese, Korean or Vietnamese)?

this._encoding = Encoding.GetEncoding(languageCodePage);
foreach (char ch in myString)
{
    byte[] bytes = this._encoding.GetBytes(new char[] { ch });
    foreach (byte b in bytes)
    {
        // algorithm that works on byte b
    }
}

All of the above is really to answer a single question: how many bytes do I
have to expect to be used by .NET Framework strings to represent a single
character of an Asian language such as Chinese, Japanese or Korean? I thought
that 2 bytes would do, but then how could multiple languages be encoded in the
same stream?

I read the section "Overview and Description" of this article
https://www.microsoft.com/globaldev/..._codepage.mspx and I'm
still confused. If 2 bytes are not enough to encode multiple Asian-language
characters in the same stream, how many bytes are used to represent a single
character in a .NET string?
Bob Rock



Mar 31 '06 #12

apprentice <ye********************@hotmail.nospam.com> wrote:
Ok, lets start over. I'll try my best to be clear in my intent.

> 1) I will have strings in different languages passed to my classes (examples
> of such languages might be Chinese, Japanese, Korean or Vietnamese).

Right. All of these will be UTF-16 encoded, as that's what .NET uses
for strings.

> 2) I need to operate on the bytes that compose each char. For my algo to
> work correctly, the string must be broken up correctly into chars ... and I
> mean chars as they would be understood in the string's language (e.g.
> Chinese, Japanese, Korean or Vietnamese).
There are any number of encodings which *can* encode the string,
but there's no such thing as "the string's language". There's no such
thing as a "Japanese encoding" or a "Korean encoding".

What is this algorithm meant to do, anyway?
> 3) I imagine that when a string (let's suppose it is in Korean) is passed in
> as a parameter to one of my methods, it will be taken into memory encoded in
> an encoding based on the system's current ANSI code page (that which is
> returned by the Encoding.Default property).

No, that's not true. It will be encoded in UTF-16, but that's mostly
transparent.
> I also imagine that when I run a
> piece of code such as the following, the string will be broken into chars
> based on that encoding ... which might not be the correct one for the
> string's language (Korean).

Again, that's not true. You will be given the sequence of UTF-16 code
units which make up the string.
4) I thought that if I could specify the correct encoding (as I do below)
the cycle would however work correctly:

Encoding koreanEncoding = Encoding.GetEncoding(codePageForKorean);
Well, that will allow you to get the string encoded in that particular
encoding, but whether or not that's what you really need, I don't know
- I'd have to know more about what your algorithm is really meant to
do.
5) Not being able to do so, what do I have to expect from the following
code? Will it correctly break up the string in chars whatever language it
might be in (again chinese, japanese, korean or vietnamese)?

> this._encoding = Encoding.GetEncoding(languageCodePage);
> foreach (char ch in myString)
> {
>     byte[] bytes = this._encoding.GetBytes(new char[] { ch });
>     foreach (byte b in bytes)
>     {
>         // algorithm that works on byte b
>     }
> }

> All of the above is really to answer a single question: how many bytes do I
> have to expect to be used by .NET Framework strings to represent a single
> character of an Asian language such as Chinese, Japanese or Korean? I thought
> that 2 bytes would do, but then how could multiple languages be encoded in
> the same stream?
It depends on what encoding you use. A single .NET framework char can
always be represented in 2 bytes, but some Unicode characters are
composed of a surrogate pair - two characters together. Note that with
your code above, you'd get each half of the surrogate pair separately.

However, with UTF-8 not all .NET chars are represented in 2 bytes -
anything over U+07FF is represented as 3 bytes.
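A quick sketch of those widths (the 3-byte range starts at U+0800; the G clef character is used here only as a convenient non-BMP example):

```csharp
using System;
using System.Text;

class Utf8Widths
{
    static void Main()
    {
        // UTF-8: 1 byte up to U+007F, 2 bytes up to U+07FF,
        // 3 bytes for the rest of the BMP.
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));      // 1
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9")); // 2 (e-acute)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u65E5")); // 3 (a kanji)

        // Outside the BMP a character is a surrogate pair: two .NET chars,
        // four UTF-8 bytes - and a char-by-char loop sees each half separately.
        string gClef = "\uD834\uDD1E"; // U+1D11E
        Console.WriteLine(gClef.Length);                         // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount(gClef));    // 4
    }
}
```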
I read the section "Overview and Description" of this article
https://www.microsoft.com/globaldev/..._codepage.mspx and I'm
still confused. If 2 bytes are not enough to encode more asian language
characters in the same stream, how many bytes are used to represent a single
character in a .NET string???


I couldn't see anything there saying that 2 bytes aren't enough to
encode Asian language characters. Could you quote the section that
worries you? I suspect you'll find all the natural language characters
are encoded in the BMP so you don't need to worry about surrogates -
but I wouldn't like to swear to it.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #13

P: n/a
>
What is this algorithm meant to do, anyway?

Well, it is a simple RTF library. For asian languages the RTF specification
seems to expect 2-byte encoded characters and requires each byte to be
escaped if it falls below character code 0x20 or above 0x80. That is why I
initially thought of breaking up any Asian-language
string into its composing characters, get the bytes and do the escaping if
required. But this is really not necessary. Having received the string, I
will get its bytes (based on the correct encoding) and will escape them as
required. In fact I can probably handle these asian characters using the \u
control word without even having to get to the character bytes.
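In code, that escaping rule would look something like this sketch (the `EscapeBytes` helper is hypothetical, and real RTF output would also need \, { and } themselves escaped):

```csharp
using System;
using System.Text;

class RtfByteEscape
{
    // Hypothetical helper: encode `text` with `enc` and escape every byte
    // below 0x20 or above 0x80 as \'hh, per the rule described above.
    static string EscapeBytes(string text, Encoding enc)
    {
        var sb = new StringBuilder();
        foreach (byte b in enc.GetBytes(text))
        {
            if (b < 0x20 || b > 0x80)
                sb.AppendFormat("\\'{0:x2}", b);
            else
                sb.Append((char)b);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Shift-JIS (code page 932): U+65E5 encodes as bytes 93 FA
        Console.WriteLine(EscapeBytes("\u65E5", Encoding.GetEncoding(932)));
        // prints: \'93\'fa
    }
}
```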

I couldn't see anything there saying that 2 bytes aren't enough to
encode Asian language characters. Could you quote the section that
worries you? I suspect you'll find all the natural language characters
are encoded in the BMP so you don't need to worry about surrogates -
but I wouldn't like to swear to it.


This is it:

......
Each Asian character is represented by a pair of code points (thus
double-byte). For programming awareness, a set of points are set aside to
represent the first byte of the set and are not valued unless they are
immediately followed by a defined second byte. DBCS meant that you had to
write code that would treat these pair of code points as one,and this still
disallowed the combining of say Japanese and Chinese in the same data
stream, because depending on the codepage the same double-byte code points
represent different characters for the different languages.

In order to allow for the storage of different languages in the same data
stream, Unicode was created. This one "codepage" can represent 64000+
characters and now with the introduction of surrogates it can represent
1,000,000,000+ characters. The use of Unicode in Windows 2000 allows for
easier creation of World-Ready code, because you no longer have to worry
about which codepage you are addressing, nor whether you had to group
character points to represent one character.
......

It looks as if, to handle characters coming from different (Asian) languages
in the same stream, 2 bytes are not enough. So, I imagine that there might be
situations where surrogate pairs are indeed necessary.
Bob Rock

Mar 31 '06 #14

P: n/a
>> 2) I need to operate on the bytes that compose each char. For my algo to
work correctly the string must be broken up correctly in chars ... and I
mean chars as they would be understood in the string's language (e.g.
chinese, japanese, korean or vietnamese).


There are any number of encodings which can *can* encode the string,
but there's no such thing as "the string's language". There's no such
thing as "Japanese encoding" or "Korean encoding".


I never meant to say that there are japanese or korean encodings. But still,
the Encoding instance you get from the following 2 statements is not the
same one so there are indeed *japanese* and *korean* specific encodings:

Encoding jEnc = Encoding.GetEncoding(932); // 932 = japanese code page
Encoding kEnc = Encoding.GetEncoding(949); // 949 = korean code page
Bob Rock


Mar 31 '06 #15

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
What is this algorithm meant to do, anyway?
Well, it is a simple RTF library. For asian languages the RTF specification
seems to expect 2 bytes encoded characters and requires each byte to be
escaped depending on the fact of being below character code 0x20 and above
0x80. That is why I initially thought of breaking up any asian language
string into its composing characters, get the bytes and do the escaping if
required. But this is really not necessary. Having received the string, I
will get its bytes (based on the correct encoding) and will escape them as
required. In fact I can probably handle these asian characters using the \u
control word without even having to get to the character bytes.


I'm not entirely sure what you mean by "character bytes" but that
sounds broadly correct - but you need to make sure that whatever
encoding you use is the one the RTF reader is going to use too. That's
the crucial bit of information.
I couldn't see anything there saying that 2 bytes aren't enough to
encode Asian language characters. Could you quote the section that
worries you? I suspect you'll find all the natural language characters
are encoded in the BMP so you don't need to worry about surrogates -
but I wouldn't like to swear to it.


This is it:

.....
Each Asian character is represented by a pair of code points (thus
double-byte). For programming awareness, a set of points are set aside to
represent the first byte of the set and are not valued unless they are
immediately followed by a defined second byte. DBCS meant that you had to
write code that would treat these pair of code points as one,and this still
disallowed the combining of say Japanese and Chinese in the same data
stream, because depending on the codepage the same double-byte code points
represent different characters for the different languages.

In order to allow for the storage of different languages in the same data
stream, Unicode was created. This one "codepage" can represent 64000+
characters and now with the introduction of surrogates it can represent
1,000,000,000+ characters. The use of Unicode in Windows 2000 allows for
easier creation of World-Ready code, because you no longer have to worry
about which codepage you are addressing, nor whether you had to group
character points to represent one character.
.....

It looks as if to handle for example characters coming from different
(asian) languages, 2 bytes are not enough.


I don't see how you infer that from the above. There are certainly
characters for which 2 bytes aren't enough, but I don't see any
indication above that Asian languages fall into that category.
So, I imagine that there might be
situations when surrogate pairs are indeed necessary.


Don't imagine - look at the code charts on http://www.unicode.org

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #16

P: n/a

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
apprentice <ye********************@hotmail.nospam.com> wrote:
> What is this algorithm meant to do, anyway?
Well, it is a simple RTF library. For asian languages the RTF
specification
seems to expect 2 bytes encoded characters and requires each byte to be
escaped depending on the fact of being below character code 0x20 and
above
0x80. That is why I initially thought of breaking up any asian language
string into its composing characters, get the bytes and do the escaping
if
required. But this is really not necessary. Having received the string, I
will get its bytes (based on the correct encoding) and will escape them
as
required. In fact I can probably handle these asian characters using the
\u
control word without even having to get to the character bytes.


I'm not entirely sure what you mean by "character bytes" but that
sounds broadly correct - but you need to make sure that whatever
encoding you use is the one the RTF reader is going to use too. That's
the crucial bit of information.


A single Asian character is represented using more than one byte ... those
bytes are what I called "character bytes".
> I couldn't see anything there saying that 2 bytes aren't enough to
> encode Asian language characters. Could you quote the section that
> worries you? I suspect you'll find all the natural language characters
> are encoded in the BMP so you don't need to worry about surrogates -
> but I wouldn't like to swear to it.


This is it:

.....
Each Asian character is represented by a pair of code points (thus
double-byte). For programming awareness, a set of points are set aside to
represent the first byte of the set and are not valued unless they are
immediately followed by a defined second byte. DBCS meant that you had to
write code that would treat these pair of code points as one,and this
still
disallowed the combining of say Japanese and Chinese in the same data
stream, because depending on the codepage the same double-byte code
points
represent different characters for the different languages.

In order to allow for the storage of different languages in the same data
stream, Unicode was created. This one "codepage" can represent 64000+
characters and now with the introduction of surrogates it can represent
1,000,000,000+ characters. The use of Unicode in Windows 2000 allows for
easier creation of World-Ready code, because you no longer have to worry
about which codepage you are addressing, nor whether you had to group
character points to represent one character.
.....

It looks as if to handle for example characters coming from different
(asian) languages, 2 bytes are not enough.


I don't see how you infer that from the above. There are certainly
characters for which 2 bytes aren't enough, but I don't see any
indication above that Asian languages fall into that category.


If I have in the same stream characters coming from more (asian) languages,
2 bytes are not enough since ... "depending on the codepage the same
double-byte code points represent different characters for the different
languages."
So, I imagine that there might be
situations when surrogate pairs are indeed necessary.


Don't imagine - look at the code charts on http://www.unicode.org

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Mar 31 '06 #17

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
I don't see how you infer that from the above. There are certainly
characters for which 2 bytes aren't enough, but I don't see any
indication above that Asian languages fall into that category.


If I have in the same stream characters coming from more (asian) languages,
2 bytes are not enough since ... "depending on the codepage the same
double-byte code points represent different characters for the different
languages."


In that case you'll need to pick an encoding which contains all the
characters you need. UTF-16 may well be the encoding of choice here.

However, as I said before, the crucial thing is to work out what the
reader is going to expect. Are you actually able to dictate which
encoding is used, is it specified somewhere, or does the reader have to
guess?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #18

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
apprentice <ye********************@hotmail.nospam.com> wrote:
> I don't see how you infer that from the above. There are certainly
> characters for which 2 bytes aren't enough, but I don't see any
> indication above that Asian languages fall into that category.


If I have in the same stream characters coming from more (asian)
languages,
2 bytes are not enough since ... "depending on the codepage the same
double-byte code points represent different characters for the different
languages."


In that case you'll need to pick an encoding which contains all the
characters you need. UTF-16 may well be the encoding of choice here.

However, as I said before, the crucial thing is to work out what the
reader is going to expect. Are you actually able to dictate which
encoding is used, is it specified somewhere, or does the reader have to
guess?


The developer will have to specify the correct code page for each string
that he/she inputs so that I may encode the string correctly. I wanted to
support different languages on the same document. I should be able to do it
easily.
Mar 31 '06 #19

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com>
???????:MP************************@msnews.microsof t.com...
Lau Lei Cheong <le****@yehoo.com.hk> wrote:
Yet another sorry to contradict, the "Unicode" used in .NET v1.1 is UTF-8
(not sure in .NET v2.0).


No, that's just not true - and nothing that you posted gave any
evidence for it.

From the docs (admittedly for 2.0, but this hasn't changed) for String:

<quote>
Each Unicode character in a string is defined by a Unicode scalar
value, also called a Unicode code point or the ordinal (numeric) value
of the Unicode character. Each code point is encoded using UTF-16
encoding, and the numeric value of each element of the encoding is
represented by a Char object.
</quote>

For .NET v1.1 documentation, verified the same as above:
ms-help://MS.MSDNQTR.2005JUL.1033/cpref/html/frlrfSystemStringClassTopic.htm
Admitted my mistake.

Somehow I remembered the default encoding setting in web.config is utf-8, and
all the text files I have to access here are in utf-8; that led me to
believe everything is in utf-8 here unless explicitly stated otherwise.

Sorry for the misinformation.

Actually, I intended to reply to this thread because of another thread about
"copy char[] to byte[]", but somehow I can't find the title or search it
back. It contained some misunderstandings about Unicode issues, so I went
searching for some references and intended to post back.

Now I seem to have stirred things up. Sorry again.
Apr 1 '06 #20

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
However, as I said before, the crucial thing is to work out what the
reader is going to expect. Are you actually able to dictate which
encoding is used, is it specified somewhere, or does the reader have to
guess?


The developer will have to specify the correct code page for each string
that he/she inputs so that I may encode the string correctly. I wanted to
support different languages on the same document. I should be able to do it
easily.


But how does that encoding information end up in the file? How does the
RTF file itself specify the encoding?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 1 '06 #21

P: n/a
"apprentice" <ye********************@hotmail.nospam.com> wrote in
news:#F**************@tk2msftngp13.phx.gbl:
Well, it is a simple RTF library. For asian languages the RTF specification
seems to expect 2 bytes encoded characters and requires each byte to be
escaped depending on the fact of being below character code 0x20 and above
0x80. That is why I initially thought of breaking up any asian language
string into its composing characters, get the bytes and do the escaping if
required. But this is really not necessary. Having received the string, I
will get its bytes (based on the correct encoding) and will escape them as
required. In fact I can probably handle these asian characters using the \u
control word without even having to get to the character bytes.


This is what Japanese looks like, with the mixture of Unicode and Shift-JIS
required by the spec:

\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea

But this is also valid:
\f1\'93\'fa\'96\'7b\'8c\'ea\par

The condition is that the font number 1 (specified by \f1) has the proper
charset:

{\fonttbl
{\f1\froman\fprq1\fcharset128 MS PGothic;}
}

In fact, this would be a minimal Japanese rtf:

{\rtf1\ansi

{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea
}
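The mixed \u / \' run at the top can be generated mechanically from the string; a sketch (assuming code page 932 is available, and ignoring the \uc keyword that governs how many fallback bytes a reader skips):

```csharp
using System;
using System.Text;

class JapaneseRtfRun
{
    static void Main()
    {
        string nihongo = "\u65E5\u672C\u8A9E"; // 日本語
        Encoding sjis = Encoding.GetEncoding(932);

        var sb = new StringBuilder();
        foreach (char c in nihongo)
        {
            // \uN takes the code point as a *signed* 16-bit value,
            // which is why U+8A9E (35486) appears as -30050.
            sb.Append("\\u").Append((short)c);
            // Shift-JIS fallback bytes for readers that don't understand \u
            foreach (byte b in sjis.GetBytes(c.ToString()))
                sb.AppendFormat("\\'{0:x2}", b);
        }
        Console.WriteLine(sb);
        // prints: \u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea
    }
}
```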

But there are a lot of other useful rtf tags that are important,
like \dbch \langnp1041 \lang1041 and many others.

The official spec is the most important document if you want your own RTF
lib:
http://msdn.microsoft.com/library/de...l=/library/en-
us/dnrtfspec/html/rtfspec.asp
You can also consider the WinWord Converter SDK:
http://support.microsoft.com/kb/q111716/

And saving a lot of RTF files from Write & Word, then "dissecting"
them in Notepad is the best way to understand how this works.
But is there any good reason not to use a standard RTF control?
You can then serialize to/from it, full Unicode, no need to worry about
internal RTF representation, bytes, etc.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 1 '06 #22

P: n/a
> Yet another sorry to contradict, the "Unicode" used in .NET v1.1 is UTF-8
(not sure in .NET v2.0).
We don't talk "Unicode" here, we talk about Unicode in .NET context.
And that is UTF-16 in .NET 1.0, 1.1, 2.0, you name it.
End of story.

See the "Unicode" page from Wikipedia to get clear idea about consequence
involved, but the following is a quote to give you basic difference:


And the "Unicode" page from Wikipedia (and especially the "Storage, transfer,
and processing" section) is not very good (to put it mildly :-)

If you want something, go to the official Unicode web site.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 1 '06 #23

P: n/a
> Now I seems to stir things up. Sorry again.

Checking the facts, then saying "my bad" and pointing ppl to the right
document is something to be praised for, and I don't see it done too often.

Sorry, I just posted a short rebuttal to your post, before reading the full
thread. Although true, you can just ignore it :-)

See? We all learn something (technical or not), every day :-)
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 1 '06 #24

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
apprentice <ye********************@hotmail.nospam.com> wrote:
> However, as I said before, the crucial thing is to work out what the
> reader is going to expect. Are you actually able to dictate which
> encoding is used, is it specified somewhere, or does the reader have to
> guess?
The developer will have to specify the correct code page for each string
that he/she inputs so that I may encode the string correctly. I wanted to
support different languages on the same document. I should be able to do
it
easily.


But how does that encoding information end up in the file? How does the
RTF file itself specify the encoding?


Well, there are RTF control words to specify (1) the ANSI codepage for the
entire document, (2) the char set for each font used in the document and (3)
then there is the encoding of bytes (which I expect require the correct
Encoding class to be used ... and thus again the correct codepage).
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Apr 2 '06 #25

P: n/a
Hello Mihai,

thank you very much for you answer.
Please allow me to ask you a few things ... I've got no way of testing my
code to see if my assumptions are correct.

Using how many bytes is each japanese char encoded??? From my understanding,
depending on the word, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?

In the first example that you provide (i.e.
\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea) it looks like you are
encoding dbch (double byte chars) using the \u control word and single byte
chars using the \' syntax. Did I get you right?

I don't know if you read the entire thread, but to sum it up, my main worry
in my post is if in a piece of code such as the following, a string of
japanese text would be broken up correctly returning chars meaningful for
the japanese language (bytes in the stream are correctly assigned to each
japanese char):

foreach(char ch in text)
{
// is ch indeed a japanese char??? I doubt it!
}

My doubts arise from the fact that I have no way of telling the .NET
Framework that the text in the string is japanese. That would mean that I
cannot rely on having an algorithm break up strings in characters according
to the language of the text they contain but I'll have to work at the byte
level, encoding bytes as required by the RTF spec (using the \' syntax).

I'd appreciate if you could shed some light on the above.
Bob Rock

This is what Japanese looks like, with the mixture of Unicode and Shift-JIS
required by the spec:

\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea

But this is also valid:
\f1\'93\'fa\'96\'7b\'8c\'ea\par

The condition is that the font number 1 (specified by \f1) has the proper
charset:

{\fonttbl
{\f1\froman\fprq1\fcharset128 MS PGothic;}
}

In fact, this would be a minimal Japanese rtf:

{\rtf1\ansi

{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea
}

But there are a lot of other useful rtf tags that are important,
like \dbch \langnp1041 \lang1041 and many others.

The official spec is the most important document if you want your own RTF
lib:
http://msdn.microsoft.com/library/de...l=/library/en-
us/dnrtfspec/html/rtfspec.asp
You can also consider the WinWord Converter SDK:
http://support.microsoft.com/kb/q111716/

And saving a lot of RTF files from Write & Word, then "dissecting"
them in Notepad is the best way to understand how this works.
But is there any good reason not to use a standard RTF control?
You can then serialize to/from it, full Unicode, no need to worry about
internal RTF representation, bytes, etc.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Apr 2 '06 #26

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
But how does that encoding information end up in the file? How does the
RTF file itself specify the encoding?


Well, there are RTF control words to specify (1) the ANSI codepage for the
entire document, (2) the char set for each font used in the document and (3)
then there is the encoding of bytes (which I expect require the correct
Encoding class to be used ... and thus again the correct codepage).


In that case, I'd suggest using UTF-16 everywhere - it'll make life
much easier for you.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #27

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
thank you very much for you answer.
Please allow me to ask you a few things ... I've got no way of testing my
code to see if my assumptions are correct.

Using how many bytes is each japanese char encoded??? From my understanding,
depending on the word, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?
It entirely depends on which encoding you use. In the form of a string,
each UTF-16 code point will take two bytes. Leaving surrogate
characters out of it for the moment, that means each character is two
bytes.

However, when you convert it to a different encoding, it entirely
depends on what that encoding uses.
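For instance, a short sketch (the strings and byte counts here are illustrative, not from the thread):

```csharp
using System;
using System.Text;

class StringVsEncoding
{
    static void Main()
    {
        string nihongo = "\u65E5\u672C\u8A9E"; // three Japanese characters

        // In memory: one .NET char per BMP character, two bytes each.
        Console.WriteLine(nihongo.Length);                          // 3
        Console.WriteLine(Encoding.Unicode.GetByteCount(nihongo));  // 6 (UTF-16)

        // Converted out, the size depends entirely on the target encoding.
        Console.WriteLine(Encoding.UTF8.GetByteCount(nihongo));             // 9
        Console.WriteLine(Encoding.GetEncoding(932).GetByteCount(nihongo)); // 6
    }
}
```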
In the first example that you provide (i.e.
\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea) it looks like you are
encoding dbch (double byte chars) using the \u control word and single byte
chars using the \' syntax. Did I get you right?

I don't know if you read the entire thread, but to sum it up, my main worry
in my post is if in a piece of code such as the following, a string of
japanese text would be broken up correctly returning chars meaningful for
the japanese language (bytes in the stream are correctly assigned to each
japanese char):
A string of Japanese text will always be broken up into a sequence of
UTF-16 code points.
My doubts arise from the fact that I have no way of telling the .NET
Framework that the text in the string is japanese.
You don't need to.
That would mean that I
cannot rely on having an algorithm break up strings in characters according
to the language of the text they contain but I'll have to work at the byte
level, encoding bytes as required by the RTF spec (using the \' syntax).


The string representation isn't concerned about the byte level - it's
concerned about the UTF-16 code point level. If you want to convert
into bytes, *you* provide the encoding, so you can give whichever one
you want.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #28

P: n/a
> Using how many bytes is each japanese char encoded??? From my
understanding, depending on the word, they are encoded using 1 or 2 bytes,
with precise rules on the valid ranges for the leading and the trailing
bytes of double-byte chars (dbch). Could you please confirm?


Sorry, I made a mistake typing the post ... it should be CHAR not WORD:

Using how many bytes is each japanese char encoded??? From my understanding,
depending on the CHAR, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?
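A quick check of the 1-or-2-byte behavior against code page 932 (a sketch; the widths shown assume the standard Shift-JIS tables):

```csharp
using System;
using System.Text;

class ShiftJisWidths
{
    static void Main()
    {
        Encoding sjis = Encoding.GetEncoding(932); // Shift-JIS

        Console.WriteLine(sjis.GetByteCount("A"));      // 1 - ASCII range
        Console.WriteLine(sjis.GetByteCount("\uFF76")); // 1 - halfwidth katakana
        Console.WriteLine(sjis.GetByteCount("\u65E5")); // 2 - kanji

        byte[] kanji = sjis.GetBytes("\u65E5");
        Console.WriteLine("{0:x2} {1:x2}", kanji[0], kanji[1]); // 93 fa
    }
}
```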
Apr 2 '06 #29

P: n/a
>> Well, there are RTF control words to specify (1) the ANSI codepage for
the
entire document, (2) the char set for each font used in the document and
(3)
then there is the encoding of bytes (which I expect require the correct
Encoding class to be used ... and thus again the correct codepage).


In that case, I'd suggest using UTF-16 everywhere - it'll make life
much easier for you.


Jon, I posted a small sample C# project to make you understand why I cannot
do what you are suggesting.
Please take a look at it:

http://backslashzero.united.net.kg/TestEncoding.zip

Hope it will clarify things.
Apr 2 '06 #30

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
In that case, I'd suggest using UTF-16 everywhere - it'll make life
much easier for you.


Jon, I posted a small sample C# project to make you understand why I cannot
do what you are suggesting.
Please take a look at it:

http://backslashzero.united.net.kg/TestEncoding.zip

Hope it will clarify things.


Not really. Your code shows you using different encodings - it doesn't
show any restrictions, as far as I can see. Yes, the output is more
verbose - but the ability to encode *any* Unicode character without
having to guess at which encoding might or might not work is worth
that, isn't it?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #31

P: n/a
>>
Jon, I posted a small sample C# project to make you understand why I
cannot
do what you are suggesting.
Please take a look at it:

http://backslashzero.united.net.kg/TestEncoding.zip

Hope it will clarify things.


Not really. Your code shows you using different encodings - it doesn't
show any restrictions, as far as I can see. Yes, the output is more
verbose - but the ability to encode *any* Unicode character without
having to guess at which encoding might or might not work is worth
that, isn't it?


Jon, I believe I'm not getting you ... and probably you are not getting my
point either. As you might have seen, the number and even what bytes are
being printed out for the same exact unicode string (the one containing the
japanese text) are different. One of the ways that RTF requires you to
encode such double-byte char texts (texts for example in chinese, japanese,
korean and vietnamese) is to precede each byte's ascii code with a \'.
However, I believe (but I admit it, this is only my belief) the correct
encoding must first be selected before getting the ascii code for the bytes
in the stream because otherwise I might end up generating RTF code that does
not display the wanted text but simply a bunch of rubbish.

Apr 2 '06 #32

P: n/a
apprentice <ye********************@hotmail.nospam.com> wrote:
Not really. Your code shows you using different encodings - it doesn't
show any restrictions, as far as I can see. Yes, the output is more
verbose - but the ability to encode *any* Unicode character without
having to guess at which encoding might or might not work is worth
that, isn't it?
Jon, I believe I'm not getting you ... and probably you are not getting my
point either. As you might have seen, the number and even what bytes are
being printed out for the same exact unicode string (the one containing the
japanese text) are different.


Of course they are. There wouldn't be much use in having different
encodings if they all did the same thing, would there? The point of
different encodings is that they take the same text data and represent
it in different binary formats. Think of it in the same kind of way as
image formats - several formats could all take the same picture and
save it in different ways.
One of the ways that RTF requires you to
encode such double-byte char texts (texts for example in chinese, japanese,
korean and vietnamese) is to precede each byte's ascii code with a \'.
There *isn't* an ASCII code for Chinese, Japanese etc characters.
That's why you can't use the ASCII encoding.
However, I believe (but I admit it, this is only my belief) the correct
encoding must first be selected before getting the ascii code for the bytes
in the stream because otherwise I might end up generating RTF code that does
not display the wanted text but simply a bunch of rubbish.


But you said yourself that you can set the codepage for the whole
document. So set it to UTF-16 and use that throughout. Not sure where
the character set to use for the font comes into it, admittedly - you'd
have to read the specs for what that means.

On the other hand, looking at the specs briefly myself, the \uN keyword
seems to cover you fairly reasonably. I'd be tempted to stick to ASCII
and use \uN for every non-ASCII character, just to keep things simple.
That would mean the documents become pretty large, however.
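That ASCII-plus-\uN approach might be sketched as follows (the `ToRtf` helper is hypothetical; the '?' after each \uN is the single fallback character that the default \uc1 setting tells readers to skip):

```csharp
using System;
using System.Text;

class AsciiPlusUnicodeEscapes
{
    // Hypothetical helper: printable ASCII passes through, everything else
    // becomes \uN with N as a signed 16-bit value, followed by a '?' fallback.
    static string ToRtf(string text)
    {
        var sb = new StringBuilder();
        foreach (char ch in text)
        {
            if (ch >= 0x20 && ch < 0x7F && ch != '\\' && ch != '{' && ch != '}')
                sb.Append(ch);
            else
                sb.Append("\\u").Append((short)ch).Append('?');
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToRtf("AB \u65E5\u8A9E"));
        // prints: AB \u26085?\u-30050?
    }
}
```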

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #33

P: n/a
>> One of the ways that RTF requires you to
encode such double-byte char texts (texts for example in chinese,
japanese,
korean and vietnamese) is to precede each byte's ascii code with a \'.
There *isn't* an ASCII code for Chinese, Japanese etc characters.
That's why you can't use the ASCII encoding.


You are right. I used the word ascii, but I meant the hex value for a byte.
However, I believe (but I admit it, this is only my belief) the correct
encoding must first be selected before getting the ascii code for the
bytes
in the stream because otherwise I might end up generating RTF code that
does
not display the wanted text but simply a bunch of rubbish.


But you said yourself that you can set the codepage for the whole
document. So set it to UTF-16 and use that throughout. Not sure where
the character set to use for the font comes into it, admittedly - you'd
have to read the specs for what that means.


Yes, I get your point now. Don't know if it would work ... but I'll try it
and submit it to someone who may test the code.

On the other hand, looking at the specs briefly myself, the \uN keyword
seems to cover you fairly reasonably. I'd be tempted to stick to ASCII
and use \uN for every non-ASCII character, just to keep things simple.
That would mean the documents become pretty large, however.


Yes, but either your idea above works, or the .NET framework will probably
break up the text in a string into chars that are simply junk for the
specific language of the text. Hope you now understand why I was trying to
find a way to help the framework break up the text into chars that are
meaningful for the specific language.

Apr 2 '06 #34

apprentice <ye********************@hotmail.nospam.com> wrote:
There *isn't* an ASCII code for Chinese, Japanese etc characters.
That's why you can't use the ASCII encoding.


You are right. I used the word ascii, but I meant the hex value for a byte.


Right - although "hex value" is unnecessary too, it's really just the
bytes which are important. They're really bytes for the characters,
rather than codes for the bytes, if you see what I mean :)
But you said yourself that you can set the codepage for the whole
document. So set it to UTF-16 and use that throughout. Not sure where
the character set to use for the font comes into it, admittedly - you'd
have to read the specs for what that means.


Yes, I get your point now. Don't know if it would work ... but I'll try it
and submit it to someone who may test the code.


Excellent.
On the other hand, looking at the specs briefly myself, the \uN keyword
seems to cover you fairly reasonably. I'd be tempted to stick to ASCII
and use \uN for every non-ASCII character, just to keep things simple.
That would mean the documents become pretty large, however.


Yes, but either your idea above works, or the .NET framework will probably
break up the text in a string into chars that are simply junk for the
specific language of the text. Hope you now understand why I was trying to
find a way to help the framework break up the text into chars that are
meaningful for the specific language.


But the point is that because .NET uses UTF-16, and UTF-16 can encode
*all* Unicode characters, you shouldn't have a problem. The characters
*can't* be junk for the language of the text, unless the text was
extracted badly to start with - because the characters *are* the text
as far as .NET is concerned.
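The point above can be seen in any language whose strings are Unicode. This Python sketch (Python is used purely for illustration; .NET behaves the same way with its UTF-16 strings) shows that character boundaries are a property of the string itself, while byte counts only appear once you pick an encoding:

```python
s = "日本語"  # three characters, however the text was originally encoded

assert len(s) == 3                      # iteration yields whole characters
assert len(s.encode("utf-16-le")) == 6  # 2 bytes per BMP char in UTF-16
assert len(s.encode("utf-8")) == 9      # 3 bytes per char in UTF-8
assert len(s.encode("cp932")) == 6      # 2 bytes per char in Shift-JIS (932)
```

So there is no per-language "boundary" logic to write: splitting into chars happens on the string, and byte layouts differ only between encodings.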

It sounds like we're definitely making progress though :)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #35

> Using how many bytes is each japanese char encoded??? From my
> understanding, depending on the word, they are encoded using 1 or 2 bytes,
> with precise rules on the valid ranges for the leading and the trailing
> bytes of double-byte chars (dbch). Could you please confirm?

True.

> In the first example that you provide (i.e.
> \u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea) it looks like you are
> encoding dbch (double byte chars) using the \u control word and single byte
> chars using the \' syntax. Did I get you right?

True again :-)

> My doubts arise from the fact that I have no way of telling the .NET
> Framework that the text in the string is japanese.

Well, you have to know. You need to carry that info together with the string.
Where are the strings coming from? Is that info available at some point?

I still don't understand exactly what you need.
Back to my question "is there any good reason not to use a standard RTF
control?"
At what level do you want to work?
Have code producing an rtf "from scratch", no rtf control involved?
I find this a bit tough and probably not worth the effort.
Why not use the standard RTF control? Then you do not need to care
about the internal representation (but you still have to care about the right
fonts).

The same unicode code point looks different in Japanese/Traditional
Chinese/Simplified Chinese, and you need the proper font for the proper
language.

The font gives a hint to the RTF control for what encoding to use.
See my example:
{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea

This reads: font number 1 using charset 128 is "MS PGothic"
Then \f1 tells that the text following used font 1.
Charset 128 is SHIFTJIS_CHARSET (WinGDI.h), which means Japanese,
which means 932 used for the bytes.
On the other side, "the bytes" part is only used by old RTF controls.
For new controls you can even use this:
\u26085\'3f\'3f\u26412\'3f\'3f\u-30050\'3f\'3f
(\'3f = question mark)
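The \'-escaped bytes shown above can be produced mechanically once the right code page is chosen. A minimal Python sketch (the function name `rtf_bytes` is made up; .NET's `Encoding.GetEncoding(932)` plays the same role as Python's `cp932` codec here):

```python
def rtf_bytes(text, codepage="cp932"):
    # Encode with the code page implied by the font's charset (128 -> 932),
    # then emit each byte as an RTF \'hh escape
    return "".join("\\'%02x" % b for b in text.encode(codepage))

# "日本語" -> \'93\'fa\'96\'7b\'8c\'ea, matching the example above
```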
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 3 '06 #36

> My doubts arise from the fact that I have no way of telling the .NET
> Framework that the text in the string is japanese.
>
> Well, you have to know. You need to carry that info together with the
> string. Where are the strings coming from? Is that info available at some
> point?
>
> I still don't understand exactly what you need.


I'm writing an RTF library.
Back to my question "is there any good reason not to use a standard RTF
control?"
Yes, it seems it does not work.
On my system (set to use ansi codepage 1252) the code that makes use of the
RichTextBox control does not work.
Take a look at the project at the following location:

http://backslashzero.united.net.kg/JapaneseRTF.zip

On my system, after setting the Text property to the japanese string, I get
the RTF output (Rtf property on the RichTextBox control) and this is
empty!!!
Can anyone explain it???
At what level do you want to work?
Have code producing an rtf "from scratch", no rtf control involved?
I find this a bit tough and probably not worth the effort.
Why not use the standard RTF control? Then you do not need to care
about the internal representation (but you still have to care about the
right
fonts).

The same unicode code point looks different in Japanese/Traditional
Chinese/Simplified Chinese, and you need the proper font for the proper
language.

Yes, I know.
The font gives a hint to the RTF control for what encoding to use.
See my example:
{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea

This reads: font number 1 using charset 128 is "MS PGothic"
Then \f1 tells that the text following used font 1.
Charset 128 is SHIFTJIS_CHARSET (WinGDI.h), which means Japanese,
which means 932 used for the bytes.
Ok.


On the other side, "the bytes" part is only used by old RTF controls.
For new controls you can even use this:
\u26085\'3f\'3f\u26412\'3f\'3f\u-30050\'3f\'3f
(\'3f = question mark)


I believe using the bytes is simpler. In the code you sent me you do the
following on the unicode value of a char in the provided text string:

if (unicodeValue >= 0x8000)
unicodeValue -= 0x10000;

This means that there are certain chars for which I just cannot print out the
unicode value BUT I somehow need to do a transformation that is
dependent on the text language (japanese, chinese, etc.). Having to write
language-specific code is something I want to avoid. Printing out the bytes
should not require any of these transformations so I'll stick to them.
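The two-line adjustment quoted above amounts to reinterpreting an unsigned 16-bit code unit as a signed 16-bit value, which is a plain cast independent of the text's language. A Python sketch (the helper name `to_signed16` is made up):

```python
import struct

def to_signed16(u):
    # Reinterpret an unsigned 16-bit code unit as the signed 16-bit
    # decimal that RTF's \uN keyword expects; same effect as the
    # "if >= 0x8000: subtract 0x10000" check above
    return struct.unpack("<h", struct.pack("<H", u))[0]

# U+65E5 (日) -> 26085, U+8A9E (語) -> -30050
```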
Bob

Apr 5 '06 #37

>> Back to my question "is there any good reason not to use a standard RTF
control?"
Yes, it seems it does not work.
On my system (set to use ansi codepage 1252) the code that makes use of the
RichTextBox control does not work.

As I explained by email, you probably don't have Japanese support installed.

Take a look at the project at the following location:

http://backslashzero.united.net.kg/JapaneseRTF.zip

On my system, after setting the Text property to the japanese string, I get
the RTF output (Rtf property on the RichTextBox control) and this is
empty!!!
Can anyone explain it???

Working on my system.
And I would appreciate it if, once you ask me something by email, you keep it
there, and do not publicly post code I give you without checking with me
and without the proper credits.

I believe using the bytes is simpler. In the code you sent me you do the
following on the unicode value of a char in the provided text string:

if (unicodeValue >= 0x8000)
unicodeValue -= 0x10000;

This means that there are certain chars when I just cannot print out the
unicode value for the char BUT I somehow need to do a transformation that
is
dependant on the text language (japanese, chinese, etc.). Having to write
language specific code is something I want to avoid. Printing out the bytes
should not require any of these transformations so I'll stick to them.

Also as explained by email, this is a dumb way to cast to a signed short.
Taking code out of context and posting it publicly (again, code I have
privately sent to you by email).
In general, it is my pleasure to help. But I think once a discussion moves
to email, it stays in email. And my code sent by email should not be
made public without my permission, and especially not without giving credit.

If you are in a rush and I don't answer fast enough, then please
keep the thread on the newsgroups.
I have a day job and am doing this on my own time, usually late at night.
This means I don't answer questions (including email) during the day.

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 6 '06 #38

> And I would appreciate if once you ask me something by email, you keep it
there. And not publicly posting code I give you without checking with me
and without the proper credits.


Credits??? Are you alright??? In that code there are only 3 lines (and they
are indeed only 3) of YOUR code.
Apr 6 '06 #39

>> And I would appreciate if once you ask me something by email, you keep it
there. And not publicly posting code I give you without checking with me
and without the proper credits.

Credits??? Are you alright??? In that code there are only 3 lines (and they
are indeed only 3) of YOUR code.


Yes, credits. One line or 5000, it does not matter.
When 3 lines of my code do the job right and replace 40 lines of your code
that do not, then yes, you should give credit.

Programming is about quality, not quantity.
Maybe as an "apprentice" you did not know that.

To close the chapter I will answer here the questions you ask by email.
I don't know why I am doing it, but here it is:

===================
What RTL issues???

RTL = Right To Left, used for scripts like Arabic or Hebrew.
You think there are no issues? Search the specs (from the link I have sent
you) for rtf tags related to this.

You mean that I cannot encode
the text using iso-2022-jp and then print out the bytes, preceding each
byte value with \' ?

Yes, this is what I mean.

Could you please explain why?

Read the RTF specs.

Yes, windows 2003. What you are saying leads however to exclude the use of
the RichTextBox control in my library: I may not count on a client system
to have asian symbols installed.

Then you cannot count on a client system to handle Asian languages.
It is unrealistic to ask for Japanese without Japanese support from the OS.
Unless you want to do everything from scratch and carry all the data with
you (like ICU).

===================
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 7 '06 #40

> Yes, credits. One line or 5000, it does not matter.
When 3 lines of my code do the job right and replace 40 lines of your code
that do not, then yes, you should give credit.

I see that you still do not understand. Your 3 lines of code really solve
nothing.
Your code does not even work. Production and consumption of an RTF document
may very well happen on different systems. I may not count on the fact that
the system that generates the document supports the given foreign language.
Programming is about quality, not quantity.
Maybe as an "apprentice" you did not know that.

I've got 22 years of sw development on my back. And yes, even after 22 years
I still consider myself an apprentice.
You mean that I cannot encode
the text using iso-2022-jp and then print out the bytes, preceding each
byte value with \' ?

Yes, this is what I mean.

Could you please explain why?

Read the RTF specs.


I've read the specs. And I've not found anything that confirms what you are
stating.
To my knowledge there are only a few control words that play a role in this
matter: \ansicpgN (or \ansi, \mac, \pc, \pca) and those used in the font
table (\fcharset and \cpg). As long as the \ansicpgN control word specifies
the correct codepage (i.e. 932 for japanese) I don't see why it would not be
correct to use the encoding corresponding to codepage 932 (i.e. Shift-JIS)
to encode the bytes and then print them with the \' format.
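Putting the control words mentioned above together, a complete minimal document can be sketched like this. It is an illustrative Python sketch, not a tested generator: `make_rtf` is a made-up helper, and the charset/code page pairing (128 = SHIFTJIS_CHARSET, code page 932) is taken from earlier in the thread:

```python
def make_rtf(text, ansicpg=932, encoding="cp932", charset=128, font="MS PGothic"):
    # \ansicpgN declares the document code page; \fcharsetN on the font
    # (128 = SHIFTJIS_CHARSET) tells readers how to interpret the \' bytes
    body = "".join("\\'%02x" % b for b in text.encode(encoding))
    return ("{\\rtf1\\ansi\\ansicpg%d" % ansicpg
            + "{\\fonttbl{\\f1\\fcharset%d %s;}}" % (charset, font)
            + "\\f1 " + body + "}")

# make_rtf("日本語") contains \f1 \'93\'fa\'96\'7b\'8c\'ea
```

Whether a given reader honors \ansicpgN alone, or needs the font's \fcharset as well, would have to be checked against the RTF specs and real RTF controls.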
Yes, windows 2003. What you are saying leads however to exclude the use
of
the RichTextBox control in my library: I may not count on a client system
to have asian symbols installed.

Then you cannot count on a client system to handle Asian languages.
It is unrealistic to ask for Japanese without Japanese support from the
OS.
Unless you want to do everything from scratch and carry all the data with
you (like ICU).


As I said above, I may have to generate an RTF doc on a system to be sent to
someone else who will open it on another system.
An example of the above would be an e-commerce platform that generates
purchase documents for customers around the world in their own native
language.
I see nothing so surprising in this.


Apr 7 '06 #41
