Bytes | Software Development & Data Engineering Community

Creating a Unicode Surrogate Pair

I've got a big Unicode character, and I'm trying to build it into a string.

The Unicode character is in the range "0x10400", so it's going to require a
surrogate pair.

I've been through all the logic to iterate over strings that already have
these pairs in them, but how do I encode this Unicode character INTO the
string? The string is UTF-8 encoded, but none of the things I've tried
using the encoders seem to work right...

Breaking it up into words (0x0001 and 0x0400) is obviously incorrect, as this
violates the valid range for a high surrogate.

I'm at a loss as to how to deal with it from here...

--
Chris Mullins
Jul 21 '05 #1
Chris Mullins <cm******@yahoo.com> wrote:
> I've got a big Unicode character, and I'm trying to build it into a string.
>
> The Unicode character is in the range "0x10400", so it's going to require a
> surrogate pair.
>
> I've been through all the logic to iterate over strings that already have
> these pairs in them, but how do I encode this Unicode character INTO the
> string? The string is UTF-8 encoded, but none of the things I've tried
> using the encoders seem to work right...
>
> Breaking it up into words (0x0001 and 0x0400) is obviously incorrect, as this
> violates the valid range for a high surrogate.
>
> I'm at a loss as to how to deal with it from here...


Try breaking the pair into 0xd801 and 0xdc00 - I believe the algorithm
is basically:

o Subtract 0x10000
o High surrogate is 0xd800+(result/0x400)
o Low surrogate is 0xdc00+(result%0x400)

Of course, whatever's reading the string will need to know what to do
with the surrogate. I've managed to avoid using them so far,
fortunately...
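
The three steps above can be sketched as follows. This is a minimal illustration in Python rather than .NET; the function name `to_surrogate_pair` and the range check are assumptions added for the example, and the division is integer division.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    # Hypothetical helper: only code points above 0xFFFF need a pair.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair: %#x" % code_point)
    offset = code_point - 0x10000        # step 1: subtract 0x10000 (20-bit result)
    high = 0xD800 + (offset // 0x400)    # step 2: top 10 bits, integer division
    low = 0xDC00 + (offset % 0x400)      # step 3: bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))  # 0xd801 0xdc00
```

For 0x10400 the offset is 0x400, so the high surrogate is 0xd800 + 1 and the low surrogate is 0xdc00 + 0, matching the pair suggested above.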

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #2
"Jon Skeet" <sk***@pobox.com> wrote:
> Chris Mullins <cm******@yahoo.com> wrote:
> > I've got a big Unicode character, and I'm trying to build it into a string.
> > The Unicode character is in the range "0x10400", so it's going to require a surrogate pair.
>
> Try breaking the pair into 0xd801 and 0xdc00 - I believe the algorithm
> is basically:
>
> o Subtract 0x10000
> o High surrogate is 0xd800+(result/0x400)
> o Low surrogate is 0xdc00+(result%0x400)


After reading more, it looks like your suggestion is the best option for
.NET. I'm going to take any code point > 0xFFFF and break it down into a
surrogate pair, according to the algorithm found at:
http://www.unicode.org/book/ch03.pdf

This says a code point S will be encoded as the surrogate chars:
H = (S-0x10000) / 0x400 + 0xD800
L = (S-0x10000) % 0x400 + 0xDC00
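
One way to sanity-check these two formulas is to invert them and confirm the original code point comes back. The sketch below does that in Python; `encode` and `decode` are illustrative names, and the inverse `S = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000` is the standard UTF-16 decoding rule rather than something quoted in this thread.

```python
def encode(s):
    # H and L exactly as in the formulas above, using integer division.
    h = (s - 0x10000) // 0x400 + 0xD800
    l = (s - 0x10000) % 0x400 + 0xDC00
    return h, l


def decode(h, l):
    # Inverse mapping: rebuild the 20-bit offset, then add 0x10000 back.
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


# Check the lowest and highest supplementary code points plus 0x10400.
for s in (0x10000, 0x10400, 0x10FFFF):
    h, l = encode(s)
    assert 0xD800 <= h <= 0xDBFF   # high-surrogate range
    assert 0xDC00 <= l <= 0xDFFF   # low-surrogate range
    assert decode(h, l) == s
```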

Now, I need to figure out what to do next...

Do I just append the High/Low surrogate pairs into my .NET string, or do I
have to pass this character array through the appropriate UTF8/16 encoder to
turn it into an encoded byte stream, and then somehow massage that into my
string?

--
Chris Mullins
Jul 21 '05 #3
Chris Mullins <cm******@yahoo.com> wrote:
> After reading more, it looks like your suggestion is the best option for
> .NET. I'm going to take any code point > 0xFFFF and break it down into a
> surrogate pair, according to the algorithm found at:
> http://www.unicode.org/book/ch03.pdf
>
> This says a code point S will be encoded as the surrogate chars:
> H = (S-0x10000) / 0x400 + 0xD800
> L = (S-0x10000) % 0x400 + 0xDC00
>
> Now, I need to figure out what to do next...
>
> Do I just append the High/Low surrogate pairs into my .NET string, or do I
> have to pass this character array through the appropriate UTF8/16 encoder to
> turn it into an encoded byte stream, and then somehow massage that into my
> string?


You just append them to your string - anything which can cope with
surrogates should then recognise them appropriately.
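
To illustrate why appending is enough, here is a rough sketch in Python. Python's `str` is not a sequence of UTF-16 code units the way a .NET string is, so the example models the string as a plain list of 16-bit values (an assumption made for illustration); `to_surrogate_pair` is a hypothetical helper name.

```python
import struct


def to_surrogate_pair(code_point):
    # Hypothetical helper implementing the subtract / divide / modulo steps.
    offset = code_point - 0x10000
    return 0xD800 + (offset // 0x400), 0xDC00 + (offset % 0x400)


# Stand-in for a UTF-16 string: a list of 16-bit code units.
units = [ord("A")]
units.extend(to_surrogate_pair(0x10400))  # append high surrogate, then low
units.append(ord("B"))

# Pack the code units as big-endian UTF-16 bytes and let the codec decode them.
raw = struct.pack(">%dH" % len(units), *units)
text = raw.decode("utf-16-be")

# A surrogate-aware consumer sees one character between "A" and "B".
assert text == "A\U00010400B"

# Encoding to UTF-8 happens only at this stage; the pair becomes one 4-byte sequence.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"A\xf0\x90\x90\x80B"
```

The point of the sketch is that the two surrogates sit in the string as ordinary 16-bit units, and conversion to UTF-8 is a separate step that happens only when the string is written out as bytes.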

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #4
