Creating a Unicode Surrogate Pair

I've got a big Unicode character, and I'm trying to build it into a string.

The character is around 0x10400, which is above 0xFFFF, so it's going to
require a surrogate pair.

I've been through all the logic to iterate over strings that already have
these pairs in them, but how do I encode this Unicode character INTO the
string? The string is UTF-8 encoded, but none of the things I've tried
using the encoders seem to work right...

Breaking it up into words (0x0001 and 0x0400) is obviously incorrect, as
0x0001 is outside the valid range for a high surrogate.

I'm at a loss as to how to deal with it from here...

--
Chris Mullins
Jul 21 '05 #1

Try breaking the character into 0xd801 and 0xdc00 - I believe the algorithm
is basically:

o Subtract 0x10000 from the code point
o High surrogate is 0xd800 + (result / 0x400)
o Low surrogate is 0xdc00 + (result % 0x400)

Of course, whatever's reading the string will need to know what to do
with the surrogate. I've managed to avoid using them so far,
fortunately...
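
The three steps above can be sketched in C# roughly as follows. This is only an illustration: the class and method names (SurrogateSketch, ToSurrogatePair) are made up for this example, and the tuple syntax assumes a reasonably recent C# compiler. The round trip at the end uses char.ConvertToUtf32, which is available from .NET 2.0 onwards.

```csharp
using System;

static class SurrogateSketch
{
    // Hypothetical helper: splits a code point in 0x10000..0x10FFFF
    // into a UTF-16 high/low surrogate pair.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;             // step 1: subtract 0x10000
        char high = (char)(0xD800 + offset / 0x400);  // step 2: high surrogate
        char low = (char)(0xDC00 + offset % 0x400);   // step 3: low surrogate
        return (high, low);
    }

    static void Main()
    {
        var (high, low) = ToSurrogatePair(0x10400);
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}");              // D801 DC00

        // Round trip: the framework recombines the pair into the same code point.
        Console.WriteLine(char.ConvertToUtf32(high, low).ToString("X")); // 10400
    }
}
```

For 0x10400 the offset is 0x400, so the high surrogate is 0xD800 + 1 and the low surrogate is 0xDC00 + 0, which matches the 0xd801 / 0xdc00 values given above.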

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #2

After reading more, it looks like your suggestion is the best option for
.NET. I'm going to take any code point > 0xFFFF and break it down into a
surrogate pair, according to the algorithm found at:
http://www.unicode.org/book/ch03.pdf

This says that a code point S is encoded as the pair H, L where:
H = (S-0x10000) / 0x400 + 0xD800
L = (S-0x10000) % 0x400 + 0xDC00

Now, I need to figure out what to do next...

Do I just append the high and low surrogates to my .NET string, or do I
have to pass this character array through the appropriate UTF-8/UTF-16
encoder to turn it into an encoded byte stream, and then somehow massage
that into my string?

--
Chris Mullins
Jul 21 '05 #3

You just append them to your string - anything which can cope with
surrogates should then recognise them appropriately.
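
As a rough illustration of "just append them", here is a small C# sketch. The class and method names (AppendSketch, BuildSample) and the sample text are invented for this example. It relies on the fact that a .NET string is a sequence of UTF-16 code units, so UTF-8 only comes into play when the string is converted to bytes. The char.IsSurrogatePair, char.ConvertToUtf32 and char.ConvertFromUtf32 helpers assume .NET 2.0 or later.

```csharp
using System;
using System.Text;

static class AppendSketch
{
    // Hypothetical helper: builds "A", then U+10400 as two chars, then "B".
    public static string BuildSample()
    {
        var sb = new StringBuilder("A");
        sb.Append('\uD801'); // high surrogate for U+10400
        sb.Append('\uDC00'); // low surrogate for U+10400
        sb.Append('B');
        return sb.ToString();
    }

    static void Main()
    {
        string s = BuildSample();

        Console.WriteLine(s.Length);                                  // 4 (UTF-16 code units, not characters)
        Console.WriteLine(char.IsSurrogatePair(s, 1));                // True
        Console.WriteLine(char.ConvertToUtf32(s, 1).ToString("X"));   // 10400

        // The framework helper produces the same two chars from the code point.
        Console.WriteLine(s == "A" + char.ConvertFromUtf32(0x10400) + "B"); // True

        // Encoding to UTF-8 happens only at the byte level; the pair becomes 4 bytes.
        byte[] utf8 = Encoding.UTF8.GetBytes(s.Substring(1, 2));
        Console.WriteLine(BitConverter.ToString(utf8));               // F0-90-90-80
    }
}
```

Note that the two surrogates must stay adjacent and in high-then-low order; an encoder given a lone surrogate cannot produce valid UTF-8 for it.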

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #4