Python's handling of unicode surrogates

Adam Olsen

As was seen in another thread[1], there's a great deal of confusion
with regard to surrogates. Most programmers assume Python's unicode
type exposes only complete characters. Even CPython's own functions
do this on occasion. This leads to different behaviour across
platforms and makes it unnecessarily difficult to properly support all
languages.

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

Note that this would not harm performance, nor would it affects
programs that already handle UTF-16 and UTF-32 correctly.

To show how things currently differ across platforms, here's an
example using UTF-32:

>>a, b = u'\U00100000', u'\uFFFF'
a, b

(u'\U00100000', u'\uffff')

>>list(a), list(b)

([u'\U00100000'], [u'\uffff'])

>>sorted([a, b])

[u'\uffff', u'\U00100000']

Now contrast the output of sorted() with what you get when using UTF-16:

>>a, b = u'\U00100000', u'\uFFFF'
a, b

(u'\U00100000', u'\uffff')

>>list(a), list(b)

([u'\udbc0', '\udc00'], [u'\uffff'])

>>sorted([a, b])

[u'\U00100000', u'\uffff']

As you can see, the order has be reversed, because the sort operates
on code units rather than scalar values.

Reasons to treat surrogates as undivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values
* "There is no separate character type; a character is represented by
a string of one item."
* iteration would be identical on all platforms
* sorting would be identical on all platforms
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.
* Breaks code which does s[s.find(sub)], where sub is a single
surrogate, expecting half a surrogate (why?). Does NOT break code
where sub is always a single code unit, nor does it break code that
assumes a longer sub using s[s.find(sub):s.f ind(sub) + len(sub)]
* Alters the sort order on UTF-16 platforms (to match that of UTF-32
platforms, not to mention UTF-8 encoded byte strings)
* Current Python is fairly tolerant of ill-formed unicode data.
Changing this may break some code. However, if you do have a need to
twiddle low-level UTF encodings, wouldn't the bytes type be better?
* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Thoughts, from all you readers out there? For/against? If there's
enough support I'll post the idea on python-3000.

[1] http://groups.google.com/group/comp....76e191831da6de
[2] Pages 23-24 of http://unicode.org/versions/Unicode4.0.0/ch03.pdf

--
Adam Olsen, aka Rhamphoryncus

Apr 19 '07 #1

Subscribe Reply

4548

Neil Hodgson

Adam Olsen:

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similar to existing Python
sequences except when code that works perfectly well against other
sequences throws exceptions very rarely.

Reasons to treat surrogates as undivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.

* "There is no separate character type; a character is represented by
a string of one item."

Could amend this to "a string of one or two items".

* iteration would be identical on all platforms

There could be a secondary iterator that iterates over characters
rather than code units.

* sorting would be identical on all platforms

This should be fixable in the current scheme.

* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is for the 16
bit Unicode implementation raising an exception if an operation would
produce an unpaired surrogate or other error. Single element indexing is
a problem although it could yield a non-string type.

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.

The code will work happily for the implementor and then break when
exposed to a surrogate.

* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful one software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.

Neil

Apr 20 '07 #2

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Thoughts, from all you readers out there? For/against?

See PEP 261. This things have all been discussed at that time,
and an explicit decision against what I think (*) your proposal is
was taken. If you want to, you can try to revert that
decision, but you would need to write a PEP.

Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?

Apr 20 '07 #3

Rhamphoryncus

On Apr 19, 11:02 pm, Neil Hodgson <nyamatongwe+th un...@gmail.com >
wrote:

Adam Olsen:

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similar to existing Python
sequences except when code that works perfectly well against other
sequences throws exceptions very rarely.

"Errors should never pass silently."

The only way I can think of to make surrogates unsurprising would be
to use UTF-8, thereby bombarding programmers with variable-length
characters.

Reasons to treat surrogates as undivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.

Indeed. I was actually surprised that it didn't, originally I had it
listed with \U and repr().

* "There is no separate character type; a character is represented by
a string of one item."

Could amend this to "a string of one or two items".

* iteration would be identical on all platforms

There could be a secondary iterator that iterates over characters
rather than code units.

But since you should use that iterator 90%+ of the time, why not make
it the default?

* sorting would be identical on all platforms

This should be fixable in the current scheme.

True.

* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is for the 16
bit Unicode implementation raising an exception if an operation would
produce an unpaired surrogate or other error. Single element indexing is
a problem although it could yield a non-string type.

Err, what would be the point in having a non-string type when you
could just as easily produce a string containing the surrogate pair?
That's half my proposal.

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.

The code will work happily for the implementor and then break when
exposed to a surrogate.

The code may well break already. I just make it explicit.

* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful one software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.

Yet Unicode has deemed them worth including anyway. I see no reason
to make them more painful then they have to be.

A program written to use them today would most likely a) avoid
iteration, and b) replace indexes with slices (s[i] -s[i:i
+len(sub)]. If they need iteration they'll have to reimplement it,
providing the exact behaviour I propose. Or they can recompile Python
to use UTF-32, but why shouldn't such features be available by
default?

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.

Your solution would require code duplication and would be slower. My
solution would have no duplication and would not be slower. I like
mine. ;)

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.

I dream of a day when complete unicode support is universal. With
enough effort we may get there some day. :)

--
Adam Olsen, aka Rhamphoryncus

Apr 20 '07 #4

Rhamphoryncus

(Sorry for the dupe, Martin. Gmail made it look like your reply was
in private.)

On 4/19/07, "Martin v. Löwis" <ma****@v.loewi s.dewrote:

Thoughts, from all you readers out there? For/against?

See PEP 261. This things have all been discussed at that time,
and an explicit decision against what I think (*) your proposal is
was taken. If you want to, you can try to revert that
decision, but you would need to write a PEP.

I don't believe this specific variant has been discussed. The change
I propose would make indexes non-contiguous, making unicode
technically not a sequence. I say that's a case for "practicali ty
beats purity".

Of course I'd appreciate any clarification before I bring it to
python-3000.

Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?

s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

The length of the string will not be changed. s[s.find(sub):] will
not be changed, so long as sub is a well-formed unicode string.
Nothing that properly handles unicode surrogates will be changed.

len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u '\U00100000\uFF FF')) will fail explicitly (rather than
silently).

--
Adam Olsen, aka Rhamphoryncus

Apr 20 '07 #5

Paul Boddie

On 20 Apr, 07:02, Neil Hodgson <nyamatongwe+th un...@gmail.com wrote:

Adam Olsen:

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similar to existing Python
sequences except when code that works perfectly well against other
sequences throws exceptions very rarely.

This thread and the other one have been quite educational, and I've
been looking through some of the background material on the topic. I
think the intention was, in PEP 261 [1] and the surrounding
discussion, that people should be able to treat Unicode objects as
sequences of characters, even though GvR's summary [2] in that
discussion defines a character as representing a code point, not a
logical character. In such a scheme, characters should be indexed
contiguously, and if people should want to access surrogate pairs,
there should be a method (or module function) to expose that
information on individual (logical) characters.

Reasons to treat surrogates as undivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.

This would work with the "substring in string" and
"string.index(s ubstring)" pseudo-sequence API. However, once you've
got a character as a Unicode object, surely the nature of the encoded
character is only of peripheral interest. The Unicode API doesn't
return two or more values per character for those in the Basic
Multilingual Plane read from a UTF-8 source - that's inconsequential
detail at that particular point.

[...]

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.

I think PEP 261 was mostly concerned with providing a "good enough"
solution until such a time as a better solution could be devised.

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.

Do we have a volunteer? ;-)

Paul

[1] http://www.python.org/dev/peps/pep-0261/
[2] http://mail.python.org/pipermail/i18...ne/001107.html

Apr 20 '07 #6

Ross Ridge

Rhamphoryncus <rh****@gmail.c omwrote:

>The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice( u'\U00100000\uF FFF')) will fail explicitly (rather than
silently).

You're falsely assuming that any code that doesn't support surrogates
is broken. Supporting surrogates is no more required than supporting
combining characters, right-to-left languages or lower case letters.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.u waterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //

Apr 20 '07 #7

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

I don't believe this specific variant has been discussed.

Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.

s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

[...]

>
len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That it will be an IndexError if you access with an index between 0
and len(s)???????

If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).

Regards,
Martin

Apr 21 '07 #8

Rhamphoryncus

On Apr 20, 5:49 pm, Ross Ridge <rri...@caffein e.csclub.uwater loo.ca>
wrote:

Rhamphoryncus <rha...@gmail.c omwrote:
The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u '\U00100000\uFF FF')) will fail explicitly (rather than
silently).

You're falsely assuming that any code that doesn't support surrogates
is broken. Supporting surrogates is no more required than supporting
combining characters, right-to-left languages or lower case letters.

No, I'm only assuming code which would raise an error with my change
is broken. My change would have minimal effect because it's building
on the existing correct way to do things, expanding them to handle
some additional cases.

--
Adam Olsen, aka Rhamphoryncus

Apr 21 '07 #9

Rhamphoryncus

On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewi s.dewrote:

I don't believe this specific variant has been discussed.

Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.

Difficult problems sometimes need unexpected solutions.

Although Guido seems to be relenting slightly on the O(1) indexing
requirement, so maybe we'll end up with an O(log n) solution (where n
is the number of surrogates, not the length of the string).

s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

[...]

len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That it will be an IndexError if you access with an index between 0
and len(s)???????

If you pick an index at random you will get IndexError. If you
calculate the index using some combination of len, find, index, rfind,
rindex you will be unaffected by my change. You can even assume the
length of a character so long as you know it fits in 16 bits (ie any
'\uxxxx' escape).

I'm unaware of any practical use cases that would be harmed by my
change, so that leaves only philosophical issues. Considering the
difficulty of the problem it seems like an okay trade-off to me.

If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).

I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.

--
Adam Olsen, aka Rhamphoryncus

Apr 21 '07 #10

Similar topics

2727

python-unicode doesn't support >65535 symbols?

by: gabor | last post by:

hi, today i made some tests... i tested some unicode symbols, that are above the 16bit limit (gothic:http://www.unicode.org/charts/PDF/U10330.pdf) .. i played around with iconv and so on, so at the end i created an utf8 encoded text file,

Python

2453

Prothon should not borrow Python strings!

by: Paul Prescod | last post by:

I skimmed the tutorial and something alarmed me. "Strings are a powerful data type in Prothon. Unlike many languages, they can be of unlimited size (constrained only by memory size) and can hold any arbitrary data, even binary data such as photos and movies.They are of course also good for their traditional role of storing and manipulating text." This view of strings is about a decade out of date with modern programmimg practice. From...

Python

3710

author index for Python Cookbook 2?

by: Andrew Dalke | last post by:

Is there an author index for the new version of the Python cookbook? As a contributor I got my comp version delivered today and my ego wanted some gratification. I couldn't find my entries. Andrew dalke@dalkescientific.com

Python

49757

std::string vs. Unicode UTF-8

by: Wolfgang Draxinger | last post by:

I understand that it is perfectly possible to store UTF-8 strings in a std::string, however doing so can cause some implicaions. E.g. you can't count the amount of characters by length() | size(). Instead one has to iterate through the string, parse all UTF-8 multibytes and count each multibyte as one character. To address this problem the GTKmm bindings for the GTK+ toolkit have implemented a own string class Glib::ustring...

C / C++

13900

Unicode and utf 8 /utf 16

by: archana | last post by:

Hi all, can someone tell me difference between unicode and utf 8 or utf 18 and which one is supporting more character set. whic i should use to support character ucs-2. I want to use ucs-2 character in streamreader and streamwriter. How unicode and utf chacters are stored.

C# / C Sharp

620

unicode

by: Chameleon | last post by:

I am trying to #define this: #ifdef UNICODE_STRINGS #define UC16 L typedef wstring String; #else #define UC16 typedef string String; #endif ....

C / C++

11565

Weekly Python Patch/Bug Summary

by: Kurt B. Kaiser | last post by:

Patch / Bug Summary ___________________ Patches : 342 open (-38) / 3712 closed (+54) / 4054 total (+16) Bugs : 951 open (-14) / 6588 closed (+33) / 7539 total (+19) RFE : 257 open (-15) / 266 closed (+13) / 523 total ( -2) New / Reopened Patches ______________________

Python

122

5604

"index" method only for mutable sequences??

by: C.L. | last post by:

I was looking for a function or method that would return the index to the first matching element in a list. Coming from a C++ STL background, I thought it might be called "find". My first stop was the Sequence Types page of the Library Reference (http://docs.python.org/lib/typesseq.html); it wasn't there. A search of the Library Reference's index seemed to confirm that the function did not exist. A little later I realized it might be called...

Python

10039

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

11354

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10923

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

9732

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

8100

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

5943

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

4778

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

4344

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

3368

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General