unicode surrogates in py2.2/win

Mike Brown

In mid-October 2004, Jeff Epler helped me here with this string iterator:

def chars(s):
    """
    This generator function helps iterate over the characters in a
    string. When the string is unicode and a surrogate pair is
    encountered, the pair is returned together, regardless of whether
    Python was built with UCS-4 ('wide') or UCS-2 code values for
    its internal representation of unicode. This function will raise a
    ValueError if it detects an illegal surrogate pair.
    """
    if isinstance(s, str):
        for i in s:
            yield i
        return
    s = iter(s)
    for i in s:
        if u'\ud800' <= i < u'\udc00':
            try:
                j = s.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if u'\udc00' <= j < u'\ue000':
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i+j))
        elif u'\udc00' <= i < u'\ue000':
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i

I have since discovered that I can't use it on Python 2.2 on Windows because
of some weird module import bug caused by the surrogate code values expressed
in the Python code as u'\ud800' and u'\udc00' -- apparently the string
literals are being coerced to UTF-8 internally, which results in an invalid
byte sequence upon import of the module containing this function.

A simpler test case demonstrates the symptom:

C:\dev\test>echo x = u'\ud800' > testd800.py

C:\dev\test>cat testd800.py
x = u'\ud800'

C:\dev\test>python -c "import testd800"

C:\dev\test>python -c "import testd800"
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

C:\dev\test>python testd800.py

C:\dev\test>python testd800.py

Very strange how it only shows up after the 1st import attempt seems to
succeed, and it doesn't ever show up if I run the code directly or run the
code in the command-line interpreter.

The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid
sequence.

In my function I can use "if u'\ud7ff' > i ..." to work around the d800 case,
but I can't use the same trick for the dc00 case. I will have to go back to
calling ord(i) and comparing against integers. IIRC the explicit ord() call
slowed things down a bit, though, so I'd like to avoid it if I can.
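
For reference, the ord()-based comparison mentioned above might look roughly like the sketch below. The helper name surrogate_kind is hypothetical and not part of the original function; the point is only that integer comparisons keep surrogate literals out of the module source, at the cost of one ord() call per code unit.

```python
def surrogate_kind(ch):
    """Classify a single code unit by its integer value (hypothetical helper)."""
    cp = ord(ch)
    if 0xD800 <= cp < 0xDC00:
        return 'high'   # first half of a surrogate pair
    if 0xDC00 <= cp < 0xE000:
        return 'low'    # second half of a surrogate pair
    return None         # not a surrogate
```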

Can anyone tell me what's causing this, or point me to a reference to show
when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any
release notes up through 2.3. Any other comments/suggestions (besides "stop
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
:)

-Mike

Jul 18 '05 #1

Martin v. Löwis

Mike Brown wrote:

> Very strange how it only shows up after the 1st import attempt seems to
> succeed, and it doesn't ever show up if I run the code directly or run the
> code in the command-line interpreter.

The reason for that is that the Python byte code stores the Unicode
literal in UTF-8. The first time, the byte code is generated, and an
unpaired surrogate is written to disk. The next time, the compiled byte
code is read back in, and the codec complains about the unpaired
surrogate.
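
The write-succeeds, read-fails asymmetry can be imitated on a current interpreter. The sketch below is Python 3, not 2.2, and it only models the behaviour: the 'surrogatepass' error handler stands in (as an assumption) for the lenient writer, and the default strict UTF-8 decoder plays the part of the reader that rejects the unpaired surrogate.

```python
lone = u'\ud800'  # an unpaired high surrogate

# "Write" step: encode leniently, letting the lone surrogate through.
stored = lone.encode('utf-8', 'surrogatepass')

# "Read" step: a strict UTF-8 decoder refuses the same bytes.
strict_read_failed = False
try:
    stored.decode('utf-8')
except UnicodeDecodeError:
    strict_read_failed = True

print(repr(stored), strict_read_failed)
```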

> Can anyone tell me what's causing this, or point me to a reference to show
> when it was fixed?

In Misc/NEWS, we have, for 2.3a1:

- The UTF-8 codec will now encode and decode Unicode surrogates
  correctly and without raising exceptions for unpaired ones.

Essentially, Python now allows surrogates to occur in UTF-8 encodings.

> I'm using 2.2.1 and I couldn't find mention of it in any
> release notes up through 2.3. Any other comments/suggestions (besides "stop
> supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
> :)

I see two options. One is to compile the code with exec, avoiding byte
code generation. Put

exec """

before the code, and

"""

after it. The other option is to use variables instead of literals:

surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)
def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    ....
    if surr1 <= i < surr2:
        ...
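
Filled out, the variables approach might look like the sketch below. It is a variant rather than the original function: it tracks a pending first half instead of calling s.next(), and it drops the byte-string shortcut, so that the same text also runs on a current Python 3 for checking. The unichr fallback is an addition for that purpose. It has not been tried on 2.2 itself, where a module using yield would also need "from __future__ import generators".

```python
try:
    unichr              # exists on Python 2
except NameError:
    unichr = chr        # Python 3: chr() covers the same range

# Built at import time, so no surrogate literal appears in the source
# or in the compiled byte code.
surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)

def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    pending = None      # first half of a pair waiting for its second half
    for i in s:
        if pending is not None:
            if surr2 <= i < surr3:
                yield pending + i
                pending = None
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (pending + i))
        elif surr1 <= i < surr2:
            pending = i
        elif surr2 <= i < surr3:
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i
    if pending is not None:
        raise ValueError("Bad pair: string ends after %r" % pending)
```

Binding the three values as default arguments, as in the outline above, makes them local names inside the generator, so the comparisons avoid a global lookup on every character.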

I would personally go with "stop supporting Py 2.2". Unless you have the
time machine, you can't fix the bugs in old Python releases, and it is
a waste of time (IMO) to uglify the code just to work around limitations
in older interpreter versions.

Regards,
Martin

Jul 18 '05 #2
