Bytes IT Community

Python's handling of unicode surrogates

As was seen in another thread[1], there's a great deal of confusion
with regard to surrogates. Most programmers assume Python's unicode
type exposes only complete characters. Even CPython's own functions
do this on occasion. This leads to different behaviour across
platforms and makes it unnecessarily difficult to properly support all
languages.

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

Note that this would not harm performance, nor would it affect
programs that already handle UTF-16 and UTF-32 correctly.
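To make the proposed semantics concrete, here's a minimal sketch (the class, its name, and its storage format are my own invention for illustration, not current or proposed CPython code) of a text type over UTF-16 code units that refuses to split surrogate pairs:

```python
def _is_lead(u):
    return 0xD800 <= u <= 0xDBFF

def _is_trail(u):
    return 0xDC00 <= u <= 0xDFFF

class SurrogateAwareText:
    """Hypothetical string type: indexes in code units, with 'gaps'.

    Assumes the stored code units are well-formed UTF-16.
    """
    def __init__(self, units):
        self.units = list(units)   # UTF-16 code units as ints

    def __len__(self):
        return len(self.units)     # length stays in code units

    def __getitem__(self, i):
        u = self.units[i]
        if _is_trail(u):
            # indexing the second half of a pair is an error under the proposal
            raise IndexError("index refers to the second half of a surrogate pair")
        if _is_lead(u):
            return [u, self.units[i + 1]]   # the whole pair: one "character"
        return [u]

    def __iter__(self):
        # iteration yields complete characters, never half a pair
        i = 0
        while i < len(self.units):
            item = self[i]
            yield item
            i += len(item)
```

Note that len() still counts code units here; only indexing and iteration behave differently.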

To show how things currently differ across platforms, here's an
example using UTF-32:
>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\U00100000'], [u'\uffff'])
>>> sorted([a, b])
[u'\uffff', u'\U00100000']

Now contrast the output of sorted() with what you get when using UTF-16:
>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\udbc0', u'\udc00'], [u'\uffff'])
>>> sorted([a, b])
[u'\U00100000', u'\uffff']

As you can see, the order has been reversed, because the sort operates
on code units rather than scalar values.
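Both orders can be reproduced side by side on a single interpreter by choosing the comparison key explicitly (a sketch; comparing UTF-16-BE bytes lexicographically is equivalent to comparing 16-bit code-unit sequences):

```python
a, b = u'\U00100000', u'\uffff'

# scalar-value order: what a UTF-32 build's default sort gives
by_scalar = sorted([a, b], key=lambda s: [ord(c) for c in s])

# code-unit order: what a UTF-16 build's default sort effectively does.
# Big-endian UTF-16 bytes compare the same way 16-bit code units do.
by_code_unit = sorted([a, b], key=lambda s: s.encode('utf-16-be'))

# U+100000 encodes to the surrogate pair 0xDBC0 0xDC00, and since
# 0xDBC0 < 0xFFFF the code-unit sort puts it *before* U+FFFF.
```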

Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values
* "There is no separate character type; a character is represented by
a string of one item."
* iteration would be identical on all platforms
* sorting would be identical on all platforms
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.
* Breaks code which does s[s.find(sub)], where sub is a single
surrogate, expecting half a surrogate (why?). Does NOT break code
where sub is always a single code unit, nor does it break code that
assumes a longer sub using s[s.find(sub):s.find(sub) + len(sub)]
* Alters the sort order on UTF-16 platforms (to match that of UTF-32
platforms, not to mention UTF-8 encoded byte strings)
* Current Python is fairly tolerant of ill-formed unicode data.
Changing this may break some code. However, if you do have a need to
twiddle low-level UTF encodings, wouldn't the bytes type be better?
* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Thoughts, from all you readers out there? For/against? If there's
enough support I'll post the idea on python-3000.

[1] http://groups.google.com/group/comp....76e191831da6de
[2] Pages 23-24 of http://unicode.org/versions/Unicode4.0.0/ch03.pdf

--
Adam Olsen, aka Rhamphoryncus
Apr 19 '07 #1
17 Replies


Adam Olsen:
To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).
I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similarly to existing Python
sequences, except that code which works perfectly well against other
sequences will, very rarely, throw exceptions.
Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values
unichr could return a 2 code unit string without forcing surrogate
indivisibility.
* "There is no separate character type; a character is represented by
a string of one item."
Could amend this to "a string of one or two items".
* iteration would be identical on all platforms
There could be a secondary iterator that iterates over characters
rather than code units.
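Such a character iterator is easy to sketch (a hypothetical helper of my own, operating on raw code units rather than the unicode type itself): walk the UTF-16 code units and join each surrogate pair back into one scalar value:

```python
def iter_scalar_values(units):
    """Yield one Unicode scalar value per character from UTF-16 code units."""
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # recombine a lead/trail surrogate pair into one scalar value
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u  # BMP code unit (an unpaired surrogate passes through)
            i += 1
```

For u'\U00100000' the code units are 0xDBC0, 0xDC00 and the iterator yields the single scalar 0x100000.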
* sorting would be identical on all platforms
This should be fixable in the current scheme.
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].
It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is, the 16-bit
Unicode implementation would raise an exception if an operation would
produce an unpaired surrogate or other error. Single element indexing is
a problem, although it could yield a non-string type.
Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.
The code will work happily for the implementor and then break when
exposed to a surrogate.
* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.
Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful once software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.

Neil
Apr 20 '07 #2

Thoughts, from all you readers out there? For/against?

See PEP 261. These things have all been discussed at that time,
and an explicit decision was taken against what I think (*) your
proposal is. If you want to, you can try to revert that
decision, but you would need to write a PEP.

Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?
Apr 20 '07 #3

On Apr 19, 11:02 pm, Neil Hodgson <nyamatongwe+thun...@gmail.com>
wrote:
Adam Olsen:
To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similarly to existing Python
sequences, except that code which works perfectly well against other
sequences will, very rarely, throw exceptions.
"Errors should never pass silently."

The only way I can think of to make surrogates unsurprising would be
to use UTF-8, thereby bombarding programmers with variable-length
characters.

Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.
Indeed. I was actually surprised that it didn't, originally I had it
listed with \U and repr().

* "There is no separate character type; a character is represented by
a string of one item."

Could amend this to "a string of one or two items".
* iteration would be identical on all platforms

There could be a secondary iterator that iterates over characters
rather than code units.
But since you should use that iterator 90%+ of the time, why not make
it the default?

* sorting would be identical on all platforms

This should be fixable in the current scheme.
True.

* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is, the 16-bit
Unicode implementation would raise an exception if an operation would
produce an unpaired surrogate or other error. Single element indexing is
a problem, although it could yield a non-string type.
Err, what would be the point in having a non-string type when you
could just as easily produce a string containing the surrogate pair?
That's half my proposal.

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.

The code will work happily for the implementor and then break when
exposed to a surrogate.
The code may well break already. I just make it explicit.

* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful once software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.
Yet Unicode has deemed them worth including anyway. I see no reason
to make them more painful than they have to be.

A program written to use them today would most likely a) avoid
iteration, and b) replace indexes with slices (s[i] -> s[i:i+len(sub)]).
If they need iteration they'll have to reimplement it,
providing the exact behaviour I propose. Or they can recompile Python
to use UTF-32, but why shouldn't such features be available by
default?

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
Your solution would require code duplication and would be slower. My
solution would have no duplication and would not be slower. I like
mine. ;)

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.
I dream of a day when complete unicode support is universal. With
enough effort we may get there some day. :)

--
Adam Olsen, aka Rhamphoryncus

Apr 20 '07 #4

(Sorry for the dupe, Martin. Gmail made it look like your reply was
in private.)

On 4/19/07, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Thoughts, from all you readers out there? For/against?

See PEP 261. These things have all been discussed at that time,
and an explicit decision against what I think (*) your proposal is
was taken. If you want to, you can try to revert that
decision, but you would need to write a PEP.
I don't believe this specific variant has been discussed. The change
I propose would make indexes non-contiguous, making unicode
technically not a sequence. I say that's a case for "practicality
beats purity".

Of course I'd appreciate any clarification before I bring it to
python-3000.
Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?
s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

The length of the string will not be changed. s[s.find(sub):] will
not be changed, so long as sub is a well-formed unicode string.
Nothing that properly handles unicode surrogates will be changed.

len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).

--
Adam Olsen, aka Rhamphoryncus

Apr 20 '07 #5

On 20 Apr, 07:02, Neil Hodgson <nyamatongwe+thun...@gmail.com> wrote:
Adam Olsen:
To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similarly to existing Python
sequences, except that code which works perfectly well against other
sequences will, very rarely, throw exceptions.
This thread and the other one have been quite educational, and I've
been looking through some of the background material on the topic. I
think the intention was, in PEP 261 [1] and the surrounding
discussion, that people should be able to treat Unicode objects as
sequences of characters, even though GvR's summary [2] in that
discussion defines a character as representing a code point, not a
logical character. In such a scheme, characters should be indexed
contiguously, and if people should want to access surrogate pairs,
there should be a method (or module function) to expose that
information on individual (logical) characters.
Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.
This would work with the "substring in string" and
"string.index(substring)" pseudo-sequence API. However, once you've
got a character as a Unicode object, surely the nature of the encoded
character is only of peripheral interest. The Unicode API doesn't
return two or more values per character for those in the Basic
Multilingual Plane read from a UTF-8 source - that's inconsequential
detail at that particular point.

[...]
I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
I think PEP 261 was mostly concerned with providing a "good enough"
solution until such a time as a better solution could be devised.
BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.
Do we have a volunteer? ;-)

Paul

[1] http://www.python.org/dev/peps/pep-0261/
[2] http://mail.python.org/pipermail/i18...ne/001107.html

Apr 20 '07 #6

Rhamphoryncus <rh****@gmail.com> wrote:
>The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).
You're falsely assuming that any code that doesn't support surrogates
is broken. Supporting surrogates is no more required than supporting
combining characters, right-to-left languages or lower case letters.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
Apr 20 '07 #7

I don't believe this specific variant has been discussed.

Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.
s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.
[...]
>
len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.
Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That it will be an IndexError if you access with an index between 0
and len(s)???????

If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).

Regards,
Martin
Apr 21 '07 #8

On Apr 20, 5:49 pm, Ross Ridge <rri...@caffeine.csclub.uwaterloo.ca>
wrote:
Rhamphoryncus <rha...@gmail.com> wrote:
The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).

You're falsely assuming that any code that doesn't support surrogates
is broken. Supporting surrogates is no more required than supporting
combining characters, right-to-left languages or lower case letters.
No, I'm only assuming code which would raise an error with my change
is broken. My change would have minimal effect because it's building
on the existing correct way to do things, expanding them to handle
some additional cases.

--
Adam Olsen, aka Rhamphoryncus

Apr 21 '07 #9

On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
I don't believe this specific variant has been discussed.

Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.
Difficult problems sometimes need unexpected solutions.

Although Guido seems to be relenting slightly on the O(1) indexing
requirement, so maybe we'll end up with an O(log n) solution (where n
is the number of surrogates, not the length of the string).

s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

[...]
len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That it will be an IndexError if you access with an index between 0
and len(s)???????
If you pick an index at random you will get IndexError. If you
calculate the index using some combination of len, find, index, rfind,
rindex you will be unaffected by my change. You can even assume the
length of a character so long as you know it fits in 16 bits (ie any
'\uxxxx' escape).

I'm unaware of any practical use cases that would be harmed by my
change, so that leaves only philosophical issues. Considering the
difficulty of the problem it seems like an okay trade-off to me.

If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).
I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.

--
Adam Olsen, aka Rhamphoryncus

Apr 21 '07 #10

Paul Boddie:
Do we have a volunteer? ;-)
I won't volunteer to do a real implementation - the Unicode type in
Python is currently around 7000 lines long and there is other code to
change in, for example, regular expressions. Here's a demonstration C++
implementation that stores an array of surrogate positions for indexing.
For text in which every character is a surrogate, this could lead to
requiring 3 times as much storage (with a size_t requiring 64 bits for
each 32 bit surrogate pair). Indexing is reasonably efficient, using a
binary search through the surrogate positions, so it is proportional to
log(number of surrogates).

Code (not thoroughly tested):

/** @file surstr.cxx
** A simple Unicode string class that stores in UTF-16
** but indexes by character.
**/
// Copyright 2007 by Neil Hodgson <ne***@scintilla.org>
// This source code is public domain.

#include <string.h>
#include <stdio.h>

class surstr {
public:

typedef wchar_t codePoint;
enum { SURROGATE_LEAD_FIRST = 0xD800 };
enum { SURROGATE_LEAD_LAST = 0xDBFF };
enum { measure_length=0xffffffffU};

codePoint *text;
size_t length;
size_t *surrogates;
// Memory use would be decreased by allocating text and
// surrogates as one block but the code is clearer this way
size_t lengthSurrogates;

static bool IsLead(codePoint cp) {
return cp >= SURROGATE_LEAD_FIRST &&
cp <= SURROGATE_LEAD_LAST;
}

void FindSurrogates() {
lengthSurrogates = 0;
for (size_t i=0; i < length; i++) {
if (IsLead(text[i])) {
lengthSurrogates++;
}
}
surrogates = new size_t[lengthSurrogates];
size_t surr = 0;
for (size_t i=0; i < length; i++) {
if (IsLead(text[i])) {
surrogates[surr] = i - surr;
surr++;
}
}
}

size_t LinearIndexFromPosition(size_t position) const {
// For checking that the binary search version works
for (size_t i=0; i<lengthSurrogates; i++) {
if (surrogates[i] >= position) {
return position + i;
}
}
return position + lengthSurrogates;
}

size_t IndexFromPosition(size_t position) const {
// Use a binary search to find index in log(lengthSurrogates)
if (lengthSurrogates == 0)
return position;
if (position > surrogates[lengthSurrogates - 1])
return position + lengthSurrogates;
size_t lower = 0;
size_t upper = lengthSurrogates-1;
do {
size_t middle = (upper + lower + 1) / 2; // Round high
size_t posMiddle = surrogates[middle];
if (position < posMiddle) {
upper = middle - 1;
} else {
lower = middle;
}
} while (lower < upper);
if (surrogates[lower] >= position)
return position + lower;
else
return position + lower + 1;
}

size_t Length() const {
return length - lengthSurrogates;
}

surstr() : text(0), length(0), surrogates(0), lengthSurrogates(0) {}

// start and end are in code points
surstr(codePoint *text_,
size_t start=0, size_t end=measure_length) :
text(0), length(0), surrogates(0), lengthSurrogates(0) {
// Assert text_[start:end] only contains whole surrogate pairs
if (end == measure_length) {
end = 0;
while (text_[end])
end++;
}
length = end - start;
text = new codePoint[length];
memcpy(text, text_ + start, sizeof(codePoint) * length);
FindSurrogates();
}
// start and end are in characters
surstr(const surstr &source,
size_t start=0, size_t end=measure_length) {
size_t startIndex = source.IndexFromPosition(start);
size_t endIndex;
if (end == measure_length)
endIndex = source.IndexFromPosition(source.Length());
else
endIndex = source.IndexFromPosition(end);

length = endIndex - startIndex;
text = new codePoint[length];
memcpy(text, source.text + startIndex,
sizeof(codePoint) * length);
if (start == 0 && end == measure_length) {
surrogates = new size_t[source.lengthSurrogates];
memcpy(surrogates, source.surrogates,
sizeof(size_t) * source.lengthSurrogates);
lengthSurrogates = source.lengthSurrogates;
} else {
FindSurrogates();
}
}
~surstr() {
delete []text;
text = 0;
delete []surrogates;
surrogates = 0;
}
void print() {
for (size_t i=0;i<length;i++) {
if (text[i] < 0x7f) {
printf("%c", text[i]);
} else {
printf("\\u%04x", text[i]);
}
}
}
};

void slicer(surstr &a) {
printf("Length in characters = %d, code units = %d ==>",
a.Length(), a.length);
a.print();
printf("\n");
for (size_t pos = 0; pos < a.Length(); pos++) {
if (a.IndexFromPosition(pos) !=
a.LinearIndexFromPosition(pos)) {
printf(" Failed at position %d -> %d",
pos, a.IndexFromPosition(pos));
}
printf(" [%0d] ", pos);
surstr b(a, pos, pos+1);
b.print();
printf("\n");
}
}

int main() {
surstr n(L"");
slicer(n);
surstr a(L"a");
slicer(a);
surstr b(L"a\u6C348\U0001D11E-\U00010338!");
slicer(b);
printf("\n");
surstr c(L"a\u6C348\U0001D11E\U0001D11E-\U00010338!"
L"\U0001D11E\U0001D11Ea\u6C348\U0001D11E-\U00010338!");
slicer(c);
printf("\n");
}

Test run:

Length in characters = 0, code units = 0 ==>
Length in characters = 1, code units = 1 ==>a
[0] a
Length in characters = 7, code units = 9 ==>a\u6c348\ud834\udd1e-\ud800\udf38!
[0] a
[1] \u6c34
[2] 8
[3] \ud834\udd1e
[4] -
[5] \ud800\udf38
[6] !

Length in characters = 17, code units = 24 ==>a\u6c348\ud834\udd1e\ud834\udd1e-\ud800\udf38!\ud834\udd1e\ud834\udd1ea\u6c348\ud834\udd1e-\ud800\udf38!
[0] a
[1] \u6c34
[2] 8
[3] \ud834\udd1e
[4] \ud834\udd1e
[5] -
[6] \ud800\udf38
[7] !
[8] \ud834\udd1e
[9] \ud834\udd1e
[10] a
[11] \u6c34
[12] 8
[13] \ud834\udd1e
[14] -
[15] \ud800\udf38
[16] !

Neil
Apr 21 '07 #11

On Apr 20, 7:34 pm, Rhamphoryncus <rha...@gmail.com> wrote:
On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
I don't believe this specific variant has been discussed.
Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.

Difficult problems sometimes need unexpected solutions.

Although Guido seems to be relenting slightly on the O(1) indexing
requirement, so maybe we'll end up with an O(log n) solution (where n
is the number of surrogates, not the length of the string).
The last thing I heard with regards to string indexing was that Guido
was very adamant about O(1) indexing. On the other hand, if one is
willing to have O(log n) access time (where n is the number of
surrogate pairs), it can be done in O(n / log n) space (again where n is
the number of surrogate pairs). An early version of the structure can
be found here: http://mail.python.org/pipermail/pyt...er/003937.html

I can't seem to find my later post with an updated version (I do have
the source somewhere).
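The translation step such a structure relies on can be sketched with the bisect module (function names here are mine): record the character position at which each surrogate pair begins, then a binary search converts a character position to a code-unit index in time logarithmic in the number of pairs:

```python
import bisect

def build_lead_positions(units):
    """Character positions at which surrogate pairs begin.

    Stored as (code-unit index - number of earlier pairs), i.e. the
    *character* position of each pair, so bisect can be used directly.
    """
    leads = []
    pairs_before = 0
    for i, u in enumerate(units):
        if 0xD800 <= u <= 0xDBFF:           # lead surrogate starts a pair
            leads.append(i - pairs_before)
            pairs_before += 1
    return leads

def unit_index(leads, char_pos):
    """Translate a character position to a code-unit index, O(log pairs)."""
    # every pair starting at a character position below char_pos adds one unit
    return char_pos + bisect.bisect_left(leads, char_pos)
```

For u'a\u6C348\U0001D11E-\U00010338!' the pairs begin at character positions 3 and 5, so character 6 ('!') lives at code-unit index 8.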
If you pick an index at random you will get IndexError. If you
calculate the index using some combination of len, find, index, rfind,
rindex you will be unaffected by my change. You can even assume the
length of a character so long as you know it fits in 16 bits (ie any
'\uxxxx' escape).

I'm unaware of any practical use cases that would be harmed by my
change, so that leaves only philosophical issues. Considering the
difficulty of the problem it seems like an okay trade-off to me.
It is not ok for s[i] to fail for any index within the range of the
length of the string.

- Josiah

Apr 22 '07 #12

On Apr 20, 7:34 pm, Rhamphoryncus <rha...@gmail.com> wrote:
On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).

I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.
Having the ability to iterate over code points doesn't mean you support
Unicode. For example, if you want to determine whether a string is one
word and you iterate over code points calling isalpha(), you'll get an
incorrect result in some languages (to back up this claim: this isn't
going to work in Russian, at least. Russian uses U+0301 COMBINING ACUTE
ACCENT, which is not part of the alphabet
but is an element of the Russian writing system).
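This is easy to check on any Python (the word below is Russian "во́ды" written with a combining accent):

```python
# U+0301 COMBINING ACUTE ACCENT is not alphabetic on its own, so a naive
# per-code-point word test rejects a perfectly well-formed Russian word.
word = u'\u0432\u043e\u0301\u0434\u044b'   # "во́ды" with a combining accent

assert not u'\u0301'.isalpha()             # the accent alone fails isalpha()
naive_is_word = all(ch.isalpha() for ch in word)
# naive_is_word is False, even though this is one word
```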

IMHO what is really needed is a bunch of high level methods like
.graphemes() - iterate over graphemes
.codepoints() - iterate over codepoints
.isword() - check if the string represents one word
etc...

Then you can actually support all unicode characters in utf-16 build
of Python. Just make all existing unicode methods (except
unicode.__iter__) iterate over code points. Changing __iter__
to iterate over code points will make indexing weird. When the
programmer is *ready* to support unicode he/she will explicitly
call .codepoints() or .graphemes(). As they say: You can lead
a horse to water, but you can't make it drink.
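A very rough sketch of what a .graphemes() helper might do, written as a free function (real segmentation is specified by Unicode's UAX #29 and handles far more cases; this sketch only glues combining marks onto their base character using unicodedata):

```python
import unicodedata

def graphemes(s):
    """Yield clusters of base character + trailing combining marks.

    Only a sketch: full grapheme clusters (UAX #29) also cover Hangul
    jamo, ZWJ sequences, regional indicators, etc.
    """
    cluster = ''
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch            # combining mark extends current cluster
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster
```

With the accented Russian word from earlier in the thread, list(graphemes(u'\u0432\u043e\u0301\u0434\u044b')) yields four clusters, the accent staying attached to its vowel.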

-- Leo

Apr 22 '07 #13

Rhamphoryncus <rh****@gmail.com> wrote:
>I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.
The Unicode standard doesn't require that you support surrogates, or
any other kind of character, so no you wouldn't be lying.

Also since few Python programs claim to support Unicode, why do you
think it's acceptable to break them if they don't support surrogates?

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
Apr 22 '07 #14

IMHO what is really needed is a bunch of high level methods like
.graphemes() - iterate over graphemes
.codepoints() - iterate over codepoints
.isword() - check if the string represents one word
etc...
This doesn't need to come as methods, though. If anybody wants to
provide a library with such functions, they can do so today.

I'd be hesitant to add methods to the string object with no proven
applications.

IMO, the biggest challenge in Unicode support is neither storage
nor iteration, but output (rendering, fonts, etc.), and,
to some degree, input (input methods). As Python has no "native"
GUI library, we currently defer that main challenge to external
libraries already.

Regards,
Martin
Apr 23 '07 #15

The Unicode standard doesn't require that you support surrogates, or
any other kind of character, so no you wouldn't be lying.
There is the notion of Unicode implementation levels, and each of them
does include a set of characters to support. In level 1, combining
characters need not to be supported (which is sufficient for scripts
that can be represented without combining characters, such as Latin
and Cyrillic, using precomposed characters if necessary). In level 2,
combining characters must be supported for some scripts that absolutely
need them, and in level 3, all characters must be supported.

It is probably an interpretation issue what "supported" means. Python
clearly supports Unicode level 1 (if we leave alone the issue that it
can't render all these characters out of the box, as it doesn't ship
any fonts); it could be argued that it implements level 3, as it is
capable of representing all Unicode characters (but, of course, so
does Python 1.5.2, if you put UTF-8 into byte strings).

Regards,
Martin
Apr 23 '07 #16

Ross Ridge writes:
The Unicode standard doesn't require that you support surrogates, or
any other kind of character, so no you wouldn't be lying.
<ma****@v.loewis.de> wrote:
There is the notion of Unicode implementation levels, and each of them
does include a set of characters to support.
There are different levels of implementation for ISO 10646, but not
of Unicode.
It is probably an interpretation issue what "supported" means.
The strongest claim to support Unicode that you can meaningfully make
is that of conformance to the Unicode standard. The Unicode standard's
conformance requirements make it explicit that you don't need to support
any particular character:

C8 A process shall not assume that it is required to interpret
any particular coded character representation.

* Processes that interpret only a subset of Unicode characters
are allowed; there is no blanket requirement to interpret
all Unicode characters.
[...]
Python clearly supports Unicode level 1 (if we leave alone the issue
that it can't render all these characters out of the box, as it doesn't
ship any fonts);
It's not at all clear to me that Python does support ISO 10646's
implementation level 1, if only because I don't, and I assume you don't,
have a copy of ISO 10646 available to verify what the requirements
actually are.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
Apr 23 '07 #17

Ross Ridge <rr****@caffeine.csclub.uwaterloo.ca> writes:
The Unicode standard doesn't require that you support surrogates,
or any other kind of character, so no you wouldn't be lying.
+1 on Ross Ridge's contributions to this thread.

If Unicode is processed using UTF-8 or UTF-32 encoding forms then
there are no surrogates. They would only be present in UTF-16.
CESU-8 is strongly discouraged.

A Unicode 16-bit string is allowed to be ill-formed as UTF-16. The
example they give is one string that ends with a high surrogate code
point and another that starts with a low surrogate code point. The
result of concatenation is a valid UTF-16 string.

The above refers to the Unicode standard. In Python with narrow
Py_UNICODE a unicode string is a sequence of 16-bit Unicode code
points. It is up to the programmer whether they want to specially
handle code points for surrogates. Operations based on concatenation
will conform to Unicode, whether or not there are surrogates in the
strings.
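Using today's Python as an illustration (the surrogatepass error handler postdates this thread): each half alone is ill-formed and refuses strict encoding, but the concatenation round-trips as well-formed UTF-16.

```python
high, low = u'\ud834', u'\udd1e'    # the two halves of U+1D11E

# a lone surrogate is ill-formed UTF-16 and refuses to encode strictly
try:
    high.encode('utf-16-be')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
# strict_ok is False

# but the concatenation forms a valid surrogate pair: written out with
# surrogatepass and read back strictly, it is one scalar value
raw = (high + low).encode('utf-16-be', 'surrogatepass')
decoded = raw.decode('utf-16-be')   # U+1D11E MUSICAL SYMBOL G CLEF
```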
--
Pete Forman -./\.- Disclaimer: This post is originated
WesternGeco -./\.- by myself and does not represent
pe*********@westerngeco.com -./\.- the opinion of Schlumberger or
http://petef.port5.com -./\.- WesternGeco.
Apr 24 '07 #18
