I'm implementing RFC 3491 in .NET, and running into a strange issue.
Step 1 of RFC 3491 is performing a set of mappings dictated by tables B.1 and
B.2.
I'm having trouble with the following mappings though, and it seems like a
shortcoming of the .NET framework:
When I see Unicode value 0x10400, I'm supposed to map it to value 0x10428.
This list goes on (the left column is the existing value, the right column
is the replacement value; values are in hex):
10400; 10428; Case map
10401; 10429; Case map
10402; 1042A; Case map
10403; 1042B; Case map
10404; 1042C; Case map
10405; 1042D; Case map
10406; 1042E; Case map
10407; 1042F; Case map
10408; 10430; Case map
(... and on for another few thousand lines...)
I've got the strings loaded into a StringBuilder, and am iterating through
it one character at a time, and comparing the character value to the mapping
values. The problem is that a Character cannot have a value greater than
0xFFFF. Both UTF8 and UTF16 encodings of Unicode 3.2 allow for values
larger than 0xFFFF.
Is there a workaround to this approach that I can use, or do I have to
convert everything to Bytes and do this the hard way?
--
Chris Mullins
Have you thought about using an array of "long" values? All the string
libraries in .NET assume Unicode, which is 2 bytes per character.
Alternately, you might use a "struct" containing a "long" in place of a
"long." That would just make it easier to group your character conversion
routines.
Unfortunately, 2 bytes per character - which is what many of the libraries in
.NET assume - is not sufficient. The .NET "char" value is only good for the
asymmetric range -32768 to +65535 (this is sufficient for almost
everything... except for surrogate pairs). Because everything is based on
"Chars", I can't figure out how to get an arbitrary Unicode Code Point to
properly encode into any of the encodings. The problem is one of Unicode
surrogate pairs, which are supported, but I can't figure out how to properly
encode one...
If only there were a UTF.Encoder method that encoded a true Unicode Code
Point (any value from 0 to 10FFFF), rather than a char() array. There's
got to be a simple way around this, but it's not evident to me...
I suppose I could manually encode my value into a series of UTF8 bytes, but
that sure seems ugly.
--
Chris
Chris Mullins <cm******@yahoo.com> wrote:
Unfortunately, 2 bytes per character - which is what many of the libraries in .NET assume - is not sufficient. The .NET "char" value is only good for the asymmetric range -32768 to +65535 (this is sufficient for almost everything... except for surrogate pairs).
Char is actually 0-65535. The range -32768 to 65535 couldn't be stored
in 16 bits.
Because everything is based off "Chars", I can't figure out how to get an arbitrary Unicode Code Point to properly encode into any of the encodings. The problem is one of Unicode surrogate pairs, which are supported, but I can't figure out how to properly encode one...
See my recent post - and http://uk.geocities.com/BabelStone13...urrogates.html
(amongst other pages - a google search for
Unicode "surrogate pairs"
finds a lot of pages.)
--
Jon Skeet - <sk***@pobox.com> http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
I've read and read on surrogate pairs, and I understand what they are at
this point. My problem is how to encode a surrogate pair from an arbitrary
Unicode point. There doesn't seem to be any support in the .NET framework
for doing this.
I suppose I can manually encode the value I'm looking for using UTF8 or
UTF16 encoding, but that seems like the wrong approach.
.NET Encoders have to convert char arrays into a particular byte encoding,
and to turn a byte encoding into a character array. The problem is that I
don't see any mechanism for encoding a value that won't fit in a char array.
How do I, using the .NET Framework, get U-10FF8 into a UTF-8 encoded string?
This is driving me batty.
There is a fantastic mechanism in the framework for iterating over a string
and pulling out all the graphemes, but I can't find the encoding side of this
equation....
--
Chris Mullins
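For anyone following along: the arithmetic behind "encoding a surrogate pair from an arbitrary code point" is small enough to sketch. The Python below is illustrative only. The function name `to_surrogate_pair` is made up for this example and is not a .NET API, but the same subtraction and shifts should carry over to two consecutive 16-bit .NET chars.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; not part of any framework.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10428)   # replacement value from the first post
print(f"{high:04X} {low:04X}")           # D801 DC28
```

The two resulting values each fit in a 16-bit char, which is how a string of 16-bit units can still carry a code point above 0xFFFF.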
Since the "char" type is 16 bits, and since strings consist of "char"s, I
don't think you're going to be doing _anything_ with strings and chars and
Unicode code points > 0xffff.
--
John Saunders
Internet Engineer jo***********@surfcontrol.com
That was originally my thought as well, but further reading has proved both
of us wrong.
.NET strings actually have full support for Unicode Surrogate pairs built
into them. I can iterate over the graphemes in the string (rather than all
the characters in the string), with no trouble at all. This functionality is
provided by the StringInfo class (and related family of classes).
It's just the encoding side that I still haven't figured out...
--
Chris Mullins
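The reading side can be sketched the same way. The Python below is not the StringInfo API. It is a rough illustration, with made-up names (`code_points`, `map_code_points`, `CASE_MAP`), of walking a sequence of 16-bit units, joining surrogate pairs into full code points, and then applying the kind of mapping table from the first post. Only two table rows are included here; the real tables in the RFC are far longer.

```python
from typing import Dict, Iterator, List

# Excerpt of the mapping quoted in the first post (existing value -> replacement).
CASE_MAP: Dict[int, int] = {0x10400: 0x10428, 0x10401: 0x10429}


def code_points(units: List[int]) -> Iterator[int]:
    """Yield full code points from a list of 16-bit UTF-16 code units."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Recombine the two 10-bit halves and add back the 0x10000 offset.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


def map_code_points(units: List[int]) -> List[int]:
    """Apply CASE_MAP per code point, leaving unmapped values unchanged."""
    return [CASE_MAP.get(cp, cp) for cp in code_points(units)]


units = [0x0041, 0xD801, 0xDC00]         # "A" followed by U+10400 as a pair
print([f"{cp:X}" for cp in map_code_points(units)])   # ['41', '10428']
```

Comparing against the table per code point, rather than per 16-bit unit, is what lets values above 0xFFFF match at all.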
Ok, but it's 16-bit surrogate pairs in the string, not 32-bit characters,
right?
--
John Saunders
Internet Engineer jo***********@surfcontrol.com
True.
But I need to figure out how to encode a 32 bit character (0x10FFA) into a
UTF-8 encoded string. This is a legit thing to do, I just don't know how to
do it....
--
Chris Mullins
The UTF-8 encoding of the 32-bit character should be treated the same
as the UTF-8 encoding of the equivalent surrogate pair. In other words,
you shouldn't need to worry too much.
--
Jon Skeet - <sk***@pobox.com> http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
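A quick way to check that claim outside .NET is with Python's built-in codecs. This is only a sanity check using the 0x10FFA value from the question: the surrogate values D803 and DFFA are worked out by hand from that code point, and nothing here is a .NET call.

```python
code_point = 0x10FFA
pair = [0xD803, 0xDFFA]                  # hand-computed surrogates for U+10FFA

# Path 1: treat the pair as big-endian UTF-16 text, then encode as UTF-8.
utf16_bytes = b"".join(unit.to_bytes(2, "big") for unit in pair)
from_pair = utf16_bytes.decode("utf-16-be").encode("utf-8")

# Path 2: encode the 32-bit code point directly.
from_code_point = chr(code_point).encode("utf-8")

print(from_pair.hex(), from_code_point.hex())   # f090bfba f090bfba
```

Both paths produce the same four bytes, so a string holding the surrogate pair should come out of a UTF-8 encoder as the encoding of the single code point.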
Hi Chris,
I think you may need to do it yourself.
Here is a link, you may have a look.
Regards,
Peter Huang
Microsoft Online Partner Support
Get Secure! www.microsoft.com/security
This posting is provided "as is" with no warranties and confers no rights.
How about that link?
--
Chris Mullins
Hi Chris,
Sorry for leaving out the link: http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
It covers encoding from UCS-4 to UTF-8. You may want to take a look.
Regards,
Peter Huang
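The "do it yourself" route from that RFC comes down to a few shifts and masks. Here is a rough Python sketch; `utf8_encode` is a made-up name, and as an assumption of this example it stops at four bytes (enough for everything up to 10FFFF) and rejects lone surrogate values, whereas RFC 2279 itself also describes longer sequences.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (hypothetical helper).

    Assumption: limited to U+0000..U+10FFFF, surrogates rejected.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x10FFA).hex())        # f090bfba
```

The result can be compared against a library encoder for a handful of values to confirm the bit layout before relying on it.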