UTF8 / UTF16 / Unicode 3.2 / RFC 3491 - Internationalization of Strings (Framework oversight?)

I'm implementing RFC 3491 in .NET, and running into a strange issue.

Step 1 of RFC 3491 is performing a set of mappings dictated by tables B.1 and
B.2.

I'm having trouble with the following mappings though, and it seems like a
shortcoming of the .NET framework:

When I see Unicode value 0x10400, I'm supposed to map it to value 0x10428.
This list goes on (the left column is the existing value, the right column
is the replacement value):
(values are in HEX)

10400; 10428; Case map
10401; 10429; Case map
10402; 1042A; Case map
10403; 1042B; Case map
10404; 1042C; Case map
10405; 1042D; Case map
10406; 1042E; Case map
10407; 1042F; Case map
10408; 10430; Case map

(... and on for another few thousand lines...)

I've got the strings loaded into a StringBuilder, and am iterating through
it one character at a time, and comparing the character value to the mapping
values. The problem is that a Character cannot have a value greater than
0xFFFF. Both UTF8 and UTF16 encodings of Unicode 3.2 allow for values
larger than 0xFFFF.

Is there a workaround to this approach that I can use, or do I have to
convert everything to Bytes and do this the hard way?
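
A minimal sketch of one possible workaround, assuming .NET 2.0 or later (where `char.IsHighSurrogate`, `char.ConvertToUtf32` and `char.ConvertFromUtf32` are available): walk the string by code point rather than by `char`, so a surrogate pair is compared against the table as a single value. The names `CodePointMapper`, `CaseMap` and `Map` are hypothetical, and the table holds only the first few entries quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodePointMapper
{
    // Hypothetical table: only the first entries of the quoted list.
    static readonly Dictionary<int, int> CaseMap = new Dictionary<int, int>
    {
        { 0x10400, 0x10428 },
        { 0x10401, 0x10429 },
        { 0x10402, 0x1042A },
    };

    public static string Map(string input)
    {
        StringBuilder output = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(input[i]) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // Two chars form one code point above 0xFFFF.
                codePoint = char.ConvertToUtf32(input[i], input[i + 1]);
                i++; // skip the low surrogate
            }
            else
            {
                codePoint = input[i];
            }

            int mapped;
            if (!CaseMap.TryGetValue(codePoint, out mapped))
                mapped = codePoint;

            // ConvertFromUtf32 rejects lone surrogates, so copy those unchanged.
            if (mapped >= 0xD800 && mapped <= 0xDFFF)
                output.Append((char)mapped);
            else
                output.Append(char.ConvertFromUtf32(mapped));
        }
        return output.ToString();
    }

    public static void Main()
    {
        // U+10400 is stored as the pair D801 DC00; U+10428 as D801 DC28.
        Console.WriteLine(Map("a\uD801\uDC00") == "a\uD801\uDC28"); // True
    }
}
```

The comparison happens on `int` values, so no `char` ever needs to hold a value above 0xFFFF.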

--
Chris Mullins

Jul 21 '05 #1
Have you thought about using an array of "long" values? All the string
libraries in .NET assume Unicode, which is 2 bytes per character.

Alternately, you might use a "struct" containing a "long" in place of a
"long." That would just make it easier to group your character conversion
routines.
Jul 21 '05 #2
Unfortunately, 2 bytes per character - which is what much of the libraries in
.NET assume - is not sufficient. The .NET "char" value is only good for the
asymmetric range -32768 to +65535 (this is sufficient for almost
everything... except for surrogate pairs). Because everything is based off
"Chars", I can't figure out how to get an arbitrary Unicode Code Point to
properly encode into any of the encodings. The problem is one of Unicode
surrogate pairs, which are supported, but I can't figure out how to properly
encode one...

If only there were a UTF.Encoder method that encoded a true Unicode Code
Point (any value from 0 to 0x10FFFF), rather than a char() array. There's
got to be a simple way around this, but it's not evident to me...

I suppose I could manually encode my value into a series of UTF8 bytes, but
that sure seems ugly.
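
For reference, the surrogate-pair arithmetic itself is small and does not depend on any particular framework version. The sketch below is an illustration under the standard UTF-16 rules, not an existing framework method; `SurrogateMath` and `ToUtf16` are hypothetical names.

```csharp
using System;

public static class SurrogateMath
{
    // Converts one Unicode code point into one or two UTF-16 code units.
    public static char[] ToUtf16(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF
            || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            throw new ArgumentOutOfRangeException("codePoint");

        if (codePoint < 0x10000)
            return new char[] { (char)codePoint };

        int offset = codePoint - 0x10000;              // 20 significant bits
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void Main()
    {
        char[] pair = ToUtf16(0x10428);
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]); // D801 DC28
    }
}
```

The resulting `char[]` can be appended to a StringBuilder or handed to any of the regular encoders.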

--
Chris

"Jason Smith" <ja***@nospam.com> wrote in message
news:OU**************@TK2MSFTNGP10.phx.gbl...
> Have you thought about using an array of "long" values? All the string
> libraries in .NET assume Unicode, which is 2 bytes per character.
>
> Alternately, you might use a "struct" containing a "long" in place of a
> "long." That would just make it easier to group your character conversion
> routines.
Jul 21 '05 #3
Chris Mullins <cm******@yahoo.com> wrote:
> Unfortunately, 2 bytes per character - which is what much of the libraries in
> .NET assume - is not sufficient. The .NET "char" value is only good for the
> asymmetric range -32768 to +65535 (this is sufficient for almost
> everything... except for surrogate pairs).

Char is actually 0-65535. The range -32768 to 65535 couldn't be stored
in 16 bits.

> Because everything is based off
> "Chars", I can't figure out how to get an arbitrary Unicode Code Point to
> properly encode into any of the encodings. The problem is one of Unicode
> surrogate pairs, which are supported, but I can't figure out how to properly
> encode one...


See my recent post - and
http://uk.geocities.com/BabelStone13...urrogates.html
(amongst other pages - a google search for
Unicode "surrogate pairs"
finds a lot of pages.)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #4
"Jon Skeet" <sk***@pobox.com> wrote:
> Chris Mullins <cm******@yahoo.com> wrote:
>> Because everything is based off
>> "Chars", I can't figure out how to get an arbitrary Unicode Code Point to
>> properly encode into any of the encodings. The problem is one of Unicode
>> surrogate pairs, which are supported, but I can't figure out how to
>> properly encode one...
>
> See my recent post - and
> http://uk.geocities.com/BabelStone13...urrogates.html

I've read and read on surrogate pairs, and I understand what they are at
this point. My problem is how to encode a surrogate pair from an arbitrary
Unicode code point. There doesn't seem to be any support in the .NET framework
for doing this.

I suppose I can manually encode the value I'm looking for using UTF8 or
UTF16 encoding, but that seems like the wrong approach.

.NET Encoders have to convert char arrays into a particular byte encoding,
and turn a byte encoding back into a character array. The problem is that I
don't see any mechanism for encoding a value that won't fit in a char array.

How do I, using the .NET Framework, get U+10FF8 into a UTF-8 encoded string?
This is driving me batty.

There is a fantastic mechanism in the framework for iterating over a string
and pulling out all the graphemes, but I can't find the encoding side of this
equation....
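
One possible answer to the question above, assuming .NET 2.0 or later: `char.ConvertFromUtf32` turns a code point into a string (a surrogate pair for values above 0xFFFF), and the ordinary `Encoding.UTF8` encoder then produces the bytes. On earlier framework versions that method is not available and the pair would have to be built by hand. `Utf8FromCodePoint` and `Encode` are hypothetical names used only for this sketch.

```csharp
using System;
using System.Text;

public static class Utf8FromCodePoint
{
    // Code point -> UTF-16 string -> UTF-8 bytes.
    public static byte[] Encode(int codePoint)
    {
        string utf16 = char.ConvertFromUtf32(codePoint); // two chars above 0xFFFF
        return Encoding.UTF8.GetBytes(utf16);
    }

    public static void Main()
    {
        byte[] bytes = Encode(0x10FF8);
        Console.WriteLine(BitConverter.ToString(bytes)); // F0-90-BF-B8
    }
}
```

The encoder sees a well-formed surrogate pair and emits a single four-byte UTF-8 sequence for it.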

--
Chris Mullins
Jul 21 '05 #5
Since the "char" type is 16 bits, and since strings consist of "char"s, I
don't think you're going to be doing _anything_ with strings and chars and
Unicode code points > 0xffff.
--
John Saunders
Internet Engineer
jo***********@surfcontrol.com
Jul 21 '05 #6
"John Saunders" <jo***********@surfcontrol.com> wrote:

[Unicode Surrogate Pairs]
> Since the "char" type is 16 bits, and since strings consist of "char"s, I
> don't think you're going to be doing _anything_ with strings and chars and
> Unicode code points > 0xffff.

That was originally my thought as well, but further reading has proved both
of us wrong.

.NET strings actually have full support for Unicode surrogate pairs built
into them. I can iterate over the graphemes in the string (rather than all
the characters in the string), with no trouble at all. This functionality is
provided by the StringInfo class (and related family of classes).

It's just the encoding side that I still haven't figured out...
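
The StringInfo iteration mentioned above can be sketched roughly as follows. `StringInfo.GetTextElementEnumerator` is part of `System.Globalization`; the wrapper names `TextElementDemo` and `TextElements` are hypothetical. Note that a text element is not always a single code point (a base character plus combining marks also counts as one element), but a surrogate pair is always kept together.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Splits a string into text elements, keeping surrogate pairs intact.
    public static List<string> TextElements(string s)
    {
        List<string> elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());
        return elements;
    }

    public static void Main()
    {
        string s = "a\uD801\uDC00b"; // 'a', U+10400, 'b'
        Console.WriteLine(s.Length);              // 4 UTF-16 code units
        Console.WriteLine(TextElements(s).Count); // 3 text elements
    }
}
```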

--
Chris Mullins
Jul 21 '05 #7
"Chris Mullins" <cm******@yahoo.com> wrote in message
news:uo***************@TK2MSFTNGP11.phx.gbl...
> "John Saunders" <jo***********@surfcontrol.com> wrote:
>
> [Unicode Surrogate Pairs]
>> Since the "char" type is 16 bits, and since strings consist of "char"s, I
>> don't think you're going to be doing _anything_ with strings and chars and
>> Unicode code points > 0xffff.
>
> That was originally my thought as well, but further reading has proved both
> of us wrong.
>
> .NET strings actually have full support for Unicode Surrogate pairs built
> into them. I can iterate over the graphemes in the string (rather than all
> the characters in the string), with no trouble at all. This functionality
> is provided by the StringInfo class (and related family of classes).


Ok, but it's 16-bit surrogate pairs in the string, not 32-bit characters,
right?
--
John Saunders
Internet Engineer
jo***********@surfcontrol.com
Jul 21 '05 #8
"John Saunders" <jo***********@surfcontrol.com> wrote:
> "Chris Mullins" <cm******@yahoo.com> wrote in message
> news:uo***************@TK2MSFTNGP11.phx.gbl...
>> "John Saunders" <jo***********@surfcontrol.com> wrote:
>>
>> [Unicode Surrogate Pairs]
>>
>> .NET strings actually have full support for Unicode Surrogate pairs built
>> into them. I can iterate over the graphemes in the string (rather than all
>> the characters in the string), with no trouble at all. This functionality
>> is provided by the StringInfo class (and related family of classes).
>
> Ok, but it's 16-bit surrogate pairs in the string, not 32-bit characters,
> right?


True.

But I need to figure out how to encode a 32 bit character (0x10FFA) into a
UTF-8 encoded string. This is a legit thing to do, I just don't know how to
do it....

--
Chris Mullins
Jul 21 '05 #9
Chris Mullins <cm******@yahoo.com> wrote:
> True.
>
> But I need to figure out how to encode a 32 bit character (0x10FFA) into a
> UTF-8 encoded string. This is a legit thing to do, I just don't know how to
> do it....


The UTF-8 encoding of the 32-bit character should be treated the same
as the UTF-8 encoding of the equivalent surrogate pair. In other words,
you shouldn't need to worry too much.
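
A small round-trip sketch of that equivalence, assuming .NET 2.0 or later for `char.ConvertToUtf32`: the four UTF-8 bytes for U+10400 decode to a two-char surrogate pair, and encoding that pair gives back the same four bytes. `Utf8RoundTrip` and `DecodeFirstCodePoint` are hypothetical names.

```csharp
using System;
using System.Text;

public static class Utf8RoundTrip
{
    // Decodes UTF-8 bytes and returns the first code point of the result.
    public static int DecodeFirstCodePoint(byte[] utf8)
    {
        string s = Encoding.UTF8.GetString(utf8);
        return char.ConvertToUtf32(s, 0);
    }

    public static void Main()
    {
        byte[] utf8 = { 0xF0, 0x90, 0x90, 0x80 }; // UTF-8 form of U+10400
        string s = Encoding.UTF8.GetString(utf8);

        Console.WriteLine(s.Length);                            // 2
        Console.WriteLine("{0:X}", DecodeFirstCodePoint(utf8)); // 10400

        // Encoding the surrogate pair yields the original four bytes.
        byte[] again = Encoding.UTF8.GetBytes("\uD801\uDC00");
        Console.WriteLine(BitConverter.ToString(again));        // F0-90-90-80
    }
}
```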

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #10
