473,378 Members | 1,622 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,378 software developers and data experts.

unicode surrogates in py2.2/win

In mid-October 2004, Jeff Epler helped me here with this string iterator:

def chars(s):
"""
This generator function helps iterate over the characters in a
string. When the string is unicode and a surrogate pair is
encountered, the pair is returned together, regardless of whether
Python was built with UCS-4 ('wide') or UCS-2 code values for
its internal representation of unicode. This function will raise a
ValueError if it detects an illegal surrogate pair.
"""
if isinstance(s, str):
for i in s:
yield i
return
s = iter(s)
for i in s:
if u'\ud800' <= i < u'\udc00':
try:
j = s.next()
except StopIteration:
raise ValueError("Bad pair: string ends after %r" % i)
if u'\udc00' <= j < u'\ue000':
yield i + j
else:
raise ValueError("Bad pair: %r (bad second half)" % (i+j))
elif u'\udc00' <= i < u'\ue000':
raise ValueError("Bad pair: %r (no first half)" % i)
else:
yield i
I have since discovered that I can't use it on Python 2.2 on Windows because
of some weird module import bug caused by the surrogate code values expressed
in the Python code as u'\ud800' and u'\udc00' -- apparently the string
literals are being coerced to UTF-8 internally, which results in an invalid
byte sequence upon import of the module containing this function.

A simpler test case demonstrates the symptom:

C:\dev\test>echo x = u'\ud800' > testd800.py

C:\dev\test>cat testd800.py
x = u'\ud800'

C:\dev\test>python -c "import testd800"

C:\dev\test>python -c "import testd800"
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

C:\dev\test>python testd800.py

C:\dev\test>python testd800.py

Very strange how it only shows up after the 1st import attempt seems to
succeed, and it doesn't ever show up if I run the code directly or run the
code in the command-line interpreter.

The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid
sequence.

In my function I can use "if u'\ud7ff' > i ..." to work around the d800 case,
but I can't use the same trick for the dc00 case. I will have to go back to
calling ord(i) and comparing against integers. IIRC the explicit ord() call
slowed things down a bit, though, so I'd like to avoid it if I can.

Can anyone tell me what's causing this, or point me to a reference to show
when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any
release notes up through 2.3. Any other comments/suggestions (besides "stop
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
:)

-Mike
Jul 18 '05 #1
1 1574
Mike Brown wrote:
Very strange how it only shows up after the 1st import attempt seems to
succeed, and it doesn't ever show up if I run the code directly or run the
code in the command-line interpreter.
The reason for that is that the Python byte code stores the Unicode
literal in UTF-8. The first time, the byte code is generated, and an
unpaired surrogate is written to disk. The next time, the compiled byte
code is read back in, and the codec complains about the unpaired
surrogate.
Can anyone tell me what's causing this, or point me to a reference to show
when it was fixed?
In Misc/NEWS, we have, for 2.3a1:

- The UTF-8 codec will now encode and decode Unicode surrogates
correctly and without raising exceptions for unpaired ones.

Essentially, Python now allows surrogates to occur in UTF-8 encodings.
I'm using 2.2.1 and I couldn't find mention of it in any
release notes up through 2.3. Any other comments/suggestions (besides "stop
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
:)


I see two options. One is to compile the code with exec, avoiding byte
code generation. Put

exec """

before the code, and

"""

after it. The other option is to use variables instead of literals:

surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)
def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
....
if surr1 <= i < surr2:
...

I would personally go with "stop supporting Py 2.2". Unless you have the
time machine, you can't fix the bugs in old Python releases, and it is
a waste of time (IMO) to uglify the code just to work around limitations
in older interpreter versions.

Regards,
Martin
Jul 18 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: Nancy | last post by:
I recently completed a web page, "Browser Tests of Entities in 2004". http://www.santagata.us/characters/CharacterEntities.html It shows those characters that work in all of the version 5.2+...
12
by: Chris Mullins | last post by:
I'm implementing RFC 3491 in .NET, and running into a strange issue. Step 1 of RFC 3491 is performing a set of mappings dicated by tables B.1 and B.2. I'm having trouble with the following...
11
by: Patrick Van Esch | last post by:
Hello, I have the following problem of principle: in writing HTML pages containing ancient greek, there are two possibilities: one is to write the unicode characters directly (encoded as two...
32
by: Wolfgang Draxinger | last post by:
I understand that it is perfectly possible to store UTF-8 strings in a std::string, however doing so can cause some implicaions. E.g. you can't count the amount of characters by length() | size()....
5
by: Johannes | last post by:
Is it correct that Unicode characters with code points above 0x10FFFF are not supported by C# I have a hard time believing this since it would eliminate some Asian languages. If it is true, is...
40
by: apprentice | last post by:
Hello, I'm writing an class library that I imagine people from different countries might be interested in using, so I'm considering what needs to be provided to support foreign languages,...
6
by: archana | last post by:
Hi all, can someone tell me difference between unicode and utf 8 or utf 18 and which one is supporting more character set. whic i should use to support character ucs-2. I want to use ucs-2...
18
by: Chameleon | last post by:
I am trying to #define this: #ifdef UNICODE_STRINGS #define UC16 L typedef wstring String; #else #define UC16 typedef string String; #endif ....
17
by: Adam Olsen | last post by:
As was seen in another thread, there's a great deal of confusion with regard to surrogates. Most programmers assume Python's unicode type exposes only complete characters. Even CPython's own...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.