Bytes | Software Development & Data Engineering Community
Python's handling of unicode surrogates

As was seen in another thread[1], there's a great deal of confusion
with regard to surrogates. Most programmers assume Python's unicode
type exposes only complete characters. Even CPython's own functions
do this on occasion. This leads to different behaviour across
platforms and makes it unnecessarily difficult to properly support all
languages.

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).
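
The pairing step this iteration rule implies can be sketched in a few lines (a minimal sketch in modern Python, with code units as plain ints; the function name and representation are illustrative, not part of the proposal):

```python
def iter_scalars(units):
    """Yield Unicode scalar values from a sequence of UTF-16 code
    units (given as ints), combining each surrogate pair into one
    value. Assumes well-formed input (no unpaired surrogates)."""
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:   # lead surrogate: pair it with the trail
            lo = next(it)
            yield 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
        else:
            yield u

# u'\U00100000' is stored as the pair 0xDBC0, 0xDC00 on a UTF-16 build
print([hex(c) for c in iter_scalars([0xDBC0, 0xDC00, 0xFFFF])])
```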

Note that this would not harm performance, nor would it affect
programs that already handle UTF-16 and UTF-32 correctly.

To show how things currently differ across platforms, here's an
example using UTF-32:
>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\U00100000'], [u'\uffff'])
>>> sorted([a, b])
[u'\uffff', u'\U00100000']

Now contrast the output of sorted() with what you get when using UTF-16:
>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\udbc0', u'\udc00'], [u'\uffff'])
>>> sorted([a, b])
[u'\U00100000', u'\uffff']

As you can see, the order has been reversed, because the sort operates
on code units rather than scalar values.
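
On a modern Python 3 (where str always exposes full code points), the narrow-build decomposition can still be reproduced by encoding explicitly; a small check using only the stdlib struct module:

```python
import struct

s = '\U00100000'
raw = s.encode('utf-16-be')
# Each UTF-16 code unit is two bytes; a non-BMP scalar takes two units.
units = struct.unpack('>%dH' % (len(raw) // 2), raw)
print([hex(u) for u in units])   # the lead/trail pair 0xdbc0, 0xdc00
```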

Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values
* "There is no separate character type; a character is represented by
a string of one item."
* iteration would be identical on all platforms
* sorting would be identical on all platforms
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.
* Breaks code which does s[s.find(sub)], where sub is a single
surrogate, expecting half a surrogate (why?). Does NOT break code
where sub is always a single code unit, nor does it break code that
assumes a longer sub using s[s.find(sub):s.find(sub) + len(sub)]
* Alters the sort order on UTF-16 platforms (to match that of UTF-32
platforms, not to mention UTF-8 encoded byte strings)
* Current Python is fairly tolerant of ill-formed unicode data.
Changing this may break some code. However, if you do have a need to
twiddle low-level UTF encodings, wouldn't the bytes type be better?
* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.
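
For what it's worth, the sort-order discrepancy in the list above can be papered over on a UTF-16 build with a key function; on today's Python 3 comparisons are already by code point, so the key changes nothing there (illustrative only):

```python
def scalar_sort_key(s):
    # UTF-32-BE bytes compare bytewise in code-point order, regardless
    # of whether the platform compares strings by UTF-16 code units.
    return s.encode('utf-32-be')

print(sorted(['\U00100000', '\uffff'], key=scalar_sort_key))
```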

Thoughts, from all you readers out there? For/against? If there's
enough support I'll post the idea on python-3000.

[1] http://groups.google.com/group/comp....76e191831da6de
[2] Pages 23-24 of http://unicode.org/versions/Unicode4.0.0/ch03.pdf

--
Adam Olsen, aka Rhamphoryncus
Apr 19 '07 #1
Adam Olsen:
To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).
I expect having sequences with inaccessible indices will prove
overly surprising. They will behave much like existing Python
sequences, except that code which works perfectly well against other
sequences will, very rarely, throw exceptions.
Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values
unichr could return a 2 code unit string without forcing surrogate
indivisibility.
* "There is no separate character type; a character is represented by
a string of one item."
Could amend this to "a string of one or two items".
* iteration would be identical on all platforms
There could be a secondary iterator that iterates over characters
rather than code units.
* sorting would be identical on all platforms
This should be fixable in the current scheme.
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].
It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us: that is, the
16-bit Unicode implementation would raise an exception if an operation
would produce an unpaired surrogate or other error. Single-element
indexing is a problem, although it could yield a non-string type.
Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.
The code will work happily for the implementor and then break when
exposed to a surrogate.
* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.
Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful once software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
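
Neil's flagged-representation idea can be sketched as a toy model (class and attribute names invented for illustration; a real version would live in C):

```python
class U16String:
    """Toy model of a string stored as UTF-16 code units while
    presenting code-point (UTF-32-like) length to consumers."""
    def __init__(self, codepoints):
        self.units = []
        for cp in codepoints:
            if cp > 0xFFFF:          # encode as a surrogate pair
                cp -= 0x10000
                self.units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
            else:
                self.units.append(cp)
        # Flag flipped once at construction; checked on every operation.
        self.has_surrogates = len(self.units) != len(codepoints)

    def __len__(self):
        if not self.has_surrogates:
            return len(self.units)   # fast path: one unit per code point
        # slow path: count units that are not trail surrogates
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)
```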

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.

Neil
Apr 20 '07 #2
Thoughts, from all you readers out there? For/against?

See PEP 261. These things have all been discussed at that time,
and an explicit decision against what I think (*) your proposal is
was taken. If you want to, you can try to revert that
decision, but you would need to write a PEP.

Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?
Apr 20 '07 #3
On Apr 19, 11:02 pm, Neil Hodgson <nyamatongwe+thun...@gmail.com>
wrote:
Adam Olsen:
To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similar to existing Python
sequences except when code that works perfectly well against other
sequences throws exceptions very rarely.
"Errors should never pass silently."

The only way I can think of to make surrogates unsurprising would be
to use UTF-8, thereby bombarding programmers with variable-length
characters.

Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.
Indeed. I was actually surprised that it didn't, originally I had it
listed with \U and repr().

* "There is no separate character type; a character is represented by
a string of one item."

Could amend this to "a string of one or two items".
* iteration would be identical on all platforms

There could be a secondary iterator that iterates over characters
rather than code units.
But since you should use that iterator 90%+ of the time, why not make
it the default?

* sorting would be identical on all platforms

This should be fixable in the current scheme.
True.

* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].

It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is for the 16
bit Unicode implementation raising an exception if an operation would
produce an unpaired surrogate or other error. Single element indexing is
a problem although it could yield a non-string type.
Err, what would be the point in having a non-string type when you
could just as easily produce a string containing the surrogate pair?
That's half my proposal.

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.

The code will work happily for the implementor and then break when
exposed to a surrogate.
The code may well break already. I just make it explicit.

* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful once software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.
Yet Unicode has deemed them worth including anyway. I see no reason
to make them more painful than they have to be.

A program written to use them today would most likely a) avoid
iteration, and b) replace indexes with slices (s[i] -> s[i:i+len(sub)]).
If they need iteration they'll have to reimplement it,
providing the exact behaviour I propose. Or they can recompile Python
to use UTF-32, but why shouldn't such features be available by
default?

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
Your solution would require code duplication and would be slower. My
solution would have no duplication and would not be slower. I like
mine. ;)

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.
I dream of a day when complete unicode support is universal. With
enough effort we may get there some day. :)

--
Adam Olsen, aka Rhamphoryncus

Apr 20 '07 #4
(Sorry for the dupe, Martin. Gmail made it look like your reply was
in private.)

On 4/19/07, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Thoughts, from all you readers out there? For/against?

See PEP 261. These things have all been discussed at that time,
and an explicit decision against what I think (*) your proposal is
was taken. If you want to, you can try to revert that
decision, but you would need to write a PEP.
I don't believe this specific variant has been discussed. The change
I propose would make indexes non-contiguous, making unicode
technically not a sequence. I say that's a case for "practicality
beats purity".

Of course I'd appreciate any clarification before I bring it to
python-3000.
Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?
s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

The length of the string will not be changed. s[s.find(sub):] will
not be changed, so long as sub is a well-formed unicode string.
Nothing that properly handles unicode surrogates will be changed.

len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).

--
Adam Olsen, aka Rhamphoryncus

Apr 20 '07 #5
On 20 Apr, 07:02, Neil Hodgson <nyamatongwe+thun...@gmail.com> wrote:
Adam Olsen:
To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similar to existing Python
sequences except when code that works perfectly well against other
sequences throws exceptions very rarely.
This thread and the other one have been quite educational, and I've
been looking through some of the background material on the topic. I
think the intention was, in PEP 261 [1] and the surrounding
discussion, that people should be able to treat Unicode objects as
sequences of characters, even though GvR's summary [2] in that
discussion defines a character as representing a code point, not a
logical character. In such a scheme, characters should be indexed
contiguously, and if people should want to access surrogate pairs,
there should be a method (or module function) to expose that
information on individual (logical) characters.
Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values

unichr could return a 2 code unit string without forcing surrogate
indivisibility.
This would work with the "substring in string" and
"string.index(substring)" pseudo-sequence API. However, once you've
got a character as a Unicode object, surely the nature of the encoded
character is only of peripheral interest. The Unicode API doesn't
return two or more values per character for BMP characters read from
a UTF-8 source - the encoding is an inconsequential detail at that
point.

[...]
I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
I think PEP 261 was mostly concerned with providing a "good enough"
solution until such a time as a better solution could be devised.
BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.
Do we have a volunteer? ;-)

Paul

[1] http://www.python.org/dev/peps/pep-0261/
[2] http://mail.python.org/pipermail/i18...ne/001107.html

Apr 20 '07 #6
Rhamphoryncus <rh****@gmail.com> wrote:
>The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).
You're falsely assuming that any code that doesn't support surrogates
is broken. Supporting surrogates is no more required than supporting
combining characters, right-to-left languages or lower case letters.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
Apr 20 '07 #7
I don't believe this specific variant has been discussed.

Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.
s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.
[...]
>
len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.
Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That it will be an IndexError if you access with an index between 0
and len(s)???????

If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).

Regards,
Martin
Apr 21 '07 #8
On Apr 20, 5:49 pm, Ross Ridge <rri...@caffeine.csclub.uwaterloo.ca>
wrote:
Rhamphoryncus <rha...@gmail.com> wrote:
The only code that will be changed is that which doesn't handle
surrogates properly. Some will start working properly. Some (ie
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).

You're falsely assuming that any code that doesn't support surrogates
is broken. Supporting surrogates is no more required than supporting
combining characters, right-to-left languages or lower case letters.
No, I'm only assuming code which would raise an error with my change
is broken. My change would have minimal effect because it's building
on the existing correct way to do things, expanding them to handle
some additional cases.

--
Adam Olsen, aka Rhamphoryncus

Apr 21 '07 #9
On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
I don't believe this specific variant has been discussed.

Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.
Difficult problems sometimes need unexpected solutions.

Although Guido seems to be relenting slightly on the O(1) indexing
requirement, so maybe we'll end up with an O(log n) solution (where n
is the number of surrogates, not the length of the string).

s[5] does not exist. You would get an IndexError indicating that it
refers to the second half of a surrogate.

[...]
len(s[k]) would be 2 if it involved a surrogate, yes. One character,
two code units.

Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That it will be an IndexError if you access with an index between 0
and len(s)???????
If you pick an index at random you will get IndexError. If you
calculate the index using some combination of len, find, index, rfind,
rindex you will be unaffected by my change. You can even assume the
length of a character so long as you know it fits in 16 bits (ie any
'\uxxxx' escape).

I'm unaware of any practical use cases that would be harmed by my
change, so that leaves only philosophical issues. Considering the
difficulty of the problem it seems like an okay trade-off to me.

If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).
I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.

--
Adam Olsen, aka Rhamphoryncus

Apr 21 '07 #10
Paul Boddie:
Do we have a volunteer? ;-)
I won't volunteer to do a real implementation - the Unicode type in
Python is currently around 7000 lines long and there is other code to
change in, for example, regular expressions. Here's a demonstration C++
implementation that stores an array of surrogate positions for indexing.
For text in which every character is a surrogate pair, this could lead to
requiring 3 times as much storage (with a size_t requiring 64 bits for
each 32 bit surrogate pair). Indexing is reasonably efficient, using a
binary search through the surrogate positions so is proportional to
log(number of surrogates).

Code (not thoroughly tested):

/** @file surstr.cxx
** A simple Unicode string class that stores in UTF-16
** but indexes by character.
**/
// Copyright 2007 by Neil Hodgson <ne***@scintilla.org>
// This source code is public domain.

#include <string.h>
#include <stdio.h>

class surstr {
public:

typedef wchar_t codePoint;
enum { SURROGATE_LEAD_FIRST = 0xD800 };
enum { SURROGATE_LEAD_LAST = 0xDBFF };
enum { measure_length=0xffffffffU};

codePoint *text;
size_t length;
size_t *surrogates;
// Memory use would be decreased by allocating text and
// surrogates as one block but the code is clearer this way
size_t lengthSurrogates;

static bool IsLead(codePoint cp) {
return cp >= SURROGATE_LEAD_FIRST &&
cp <= SURROGATE_LEAD_LAST;
}

void FindSurrogates() {
lengthSurrogates = 0;
for (size_t i=0; i < length; i++) {
if (IsLead(text[i])) {
lengthSurrogates++;
}
}
surrogates = new size_t[lengthSurrogates];
size_t surr = 0;
for (size_t i=0; i < length; i++) {
if (IsLead(text[i])) {
surrogates[surr] = i - surr;
surr++;
}
}
}

size_t LinearIndexFromPosition(size_t position) const {
// For checking that the binary search version works
for (size_t i=0; i<lengthSurrogates; i++) {
if (surrogates[i] >= position) {
return position + i;
}
}
return position + lengthSurrogates;
}

size_t IndexFromPosition(size_t position) const {
// Use a binary search to find index in log(lengthSurrogates)
if (lengthSurrogates == 0)
return position;
if (position > surrogates[lengthSurrogates - 1])
return position + lengthSurrogates;
size_t lower = 0;
size_t upper = lengthSurrogates-1;
do {
size_t middle = (upper + lower + 1) / 2; // Round high
size_t posMiddle = surrogates[middle];
if (position < posMiddle) {
upper = middle - 1;
} else {
lower = middle;
}
} while (lower < upper);
if (surrogates[lower] >= position)
return position + lower;
else
return position + lower + 1;
}

size_t Length() const {
return length - lengthSurrogates;
}

surstr() : text(0), length(0), surrogates(0), lengthSurrogates(0) {}

// start and end are in code points
surstr(codePoint *text_,
size_t start=0, size_t end=measure_length) :
text(0), length(0), surrogates(0), lengthSurrogates(0) {
// Assert text_[start:end] only contains whole surrogate pairs
if (end == measure_length) {
end = 0;
while (text_[end])
end++;
}
length = end - start;
text = new codePoint[length];
memcpy(text, text_, sizeof(codePoint) * length);
FindSurrogates();
}
// start and end are in characters
surstr(const surstr &source,
size_t start=0, size_t end=measure_length) {
size_t startIndex = source.IndexFromPosition(start);
size_t endIndex;
if (end == measure_length)
endIndex = source.IndexFromPosition(source.Length());
else
endIndex = source.IndexFromPosition(end);

length = endIndex - startIndex;
text = new codePoint[length];
memcpy(text, source.text + startIndex,
sizeof(codePoint) * length);
if (start == 0 && end == measure_length) {
surrogates = new size_t[source.lengthSurrogates];
memcpy(surrogates, source.surrogates,
sizeof(size_t) * source.lengthSurrogates);
lengthSurrogates = source.lengthSurrogates;
} else {
FindSurrogates();
}
}
~surstr() {
delete []text;
text = 0;
delete []surrogates;
surrogates = 0;
}
void print() {
for (size_t i=0;i<length;i++) {
if (text[i] < 0x7f) {
printf("%c", text[i]);
} else {
printf("\\u%04x", text[i]);
}
}
}
};

void slicer(surstr &a) {
printf("Length in characters = %d, code units = %d ==> ",
a.Length(), a.length);
a.print();
printf("\n");
for (size_t pos = 0; pos < a.Length(); pos++) {
if (a.IndexFromPosition(pos) !=
a.LinearIndexFromPosition(pos)) {
printf(" Failed at position %d -> %d",
pos, a.IndexFromPosition(pos));
}
printf(" [%0d] ", pos);
surstr b(a, pos, pos+1);
b.print();
printf("\n");
}
}

int main() {
surstr n(L"");
slicer(n);
surstr a(L"a");
slicer(a);
surstr b(L"a\u6C348\U0001D11E-\U00010338!");
slicer(b);
printf("\n");
surstr c(L"a\u6C348\U0001D11E\U0001D11E-\U00010338!"
L"\U0001D11E\U0001D11Ea\u6C348\U0001D11E-\U00010338!");
slicer(c);
printf("\n");
}

Test run:

Length in characters = 0, code units = 0 ==>
Length in characters = 1, code units = 1 ==> a
[0] a
Length in characters = 7, code units = 9 ==>
a\u6c348\ud834\udd1e-\ud800\udf38!
[0] a
[1] \u6c34
[2] 8
[3] \ud834\udd1e
[4] -
[5] \ud800\udf38
[6] !

Length in characters = 17, code units = 24 ==>
a\u6c348\ud834\udd1e\ud834\udd1e-\ud800\udf38!\ud834\udd1e\ud834\udd1ea\u6c348\ud834\udd1e-\ud800\udf38!
[0] a
[1] \u6c34
[2] 8
[3] \ud834\udd1e
[4] \ud834\udd1e
[5] -
[6] \ud800\udf38
[7] !
[8] \ud834\udd1e
[9] \ud834\udd1e
[10] a
[11] \u6c34
[12] 8
[13] \ud834\udd1e
[14] -
[15] \ud800\udf38
[16] !

Neil
Apr 21 '07 #11
On Apr 20, 7:34 pm, Rhamphoryncus <rha...@gmail.com> wrote:
On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
I don't believe this specific variant has been discussed.
Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.

Difficult problems sometimes need unexpected solutions.

Although Guido seems to be relenting slightly on the O(1) indexing
requirement, so maybe we'll end up with an O(log n) solution (where n
is the number of surrogates, not the length of the string).
The last thing I heard with regards to string indexing was that Guido
was very adamant about O(1) indexing. On the other hand, if one is
willing to have O(log n) access time (where n is the number of
surrogate pairs), it can be done in O(n/logn) space (again where n is
the number of surrogate pairs). An early version of the structure can
be found here: http://mail.python.org/pipermail/pyt...er/003937.html

I can't seem to find my later post with an updated version (I do have
the source somewhere).
If you pick an index at random you will get IndexError. If you
calculate the index using some combination of len, find, index, rfind,
rindex you will be unaffected by my change. You can even assume the
length of a character so long as you know it fits in 16 bits (ie any
'\uxxxx' escape).

I'm unaware of any practical use cases that would be harmed by my
change, so that leaves only philosophical issues. Considering the
difficulty of the problem it seems like an okay trade-off to me.
It is not ok for s[i] to fail for any index within the range of the
length of the string.

- Josiah

Apr 22 '07 #12
On Apr 20, 7:34 pm, Rhamphoryncus <rha...@gmail.com> wrote:
On Apr 20, 6:21 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
If you absolutely think support for non-BMP characters is necessary
in every program, suggesting that Python use UCS-4 by default on
all systems has a higher chance of finding acceptance (in comparison).

I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.
Having the ability to iterate over code points doesn't mean you
support Unicode. For example, if you want to determine whether a
string is one word and you iterate over code points calling isalpha,
you'll get incorrect results in some cases in some languages. (To back
up this claim: this isn't going to work at least in Russian. The
Russian language uses U+0301 COMBINING ACUTE ACCENT, which is not part
of the alphabet but is an element of the Russian writing system.)
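
Leo's point is easy to check in a couple of lines (the word fragment below is illustrative):

```python
# A Russian vowel with a combining acute accent: one logical letter,
# two code points. Per-code-point isalpha() misclassifies the word.
word = '\u0438\u0301'  # CYRILLIC SMALL LETTER I + COMBINING ACUTE ACCENT
print([c.isalpha() for c in word])  # the combining mark is not alphabetic
```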

IMHO what is really needed is a bunch of high level methods like
.graphemes() - iterate over graphemes
.codepoints() - iterate over codepoints
.isword() - check if the string represents one word
etc...

Then you can actually support all unicode characters in utf-16 build
of Python. Just make all existing unicode methods (except
unicode.__iter__) iterate over code points. Changing __iter__
to iterate over code points will make indexing weird. When the
programmer is *ready* to support unicode he/she will explicitly
call .codepoints() or .graphemes(). As they say: You can lead
a horse to water, but you can't make it drink.
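
A crude version of the proposed .graphemes() can be sketched with the stdlib unicodedata module (this only attaches combining marks to the preceding base character; real segmentation follows UAX #29 and handles much more):

```python
import unicodedata

def graphemes(s):
    """Yield simplified grapheme clusters: a base character plus any
    following combining marks (general category M*)."""
    cluster = ''
    for ch in s:
        if cluster and unicodedata.category(ch).startswith('M'):
            cluster += ch          # combining mark joins the cluster
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

print(list(graphemes('\u0438\u0301\u0432\u0430')))
```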

-- Leo

Apr 22 '07 #13
Rhamphoryncus <rh****@gmail.com> wrote:
>I wish to write software that supports Unicode. Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I said I supported
Unicode if I only handled the BMP.
The Unicode standard doesn't require that you support surrogates, or
any other kind of character, so no you wouldn't be lying.

Also since few Python programs claim to support Unicode, why do you
think it's acceptable to break them if they don't support surrogates?

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
Apr 22 '07 #14
IMHO what is really needed is a bunch of high level methods like
.graphemes() - iterate over graphemes
.codepoints() - iterate over codepoints
.isword() - check if the string represents one word
etc...
This doesn't need to come as methods, though. If anybody wants to
provide a library with such functions, they can do so today.

I'd be hesitant to add methods to the string object with no proven
applications.

IMO, the biggest challenge in Unicode support is neither storage
nor iteration, but output (rendering, fonts, etc.), and,
to some degree, input (input methods). As Python has no "native"
GUI library, we currently defer that main challenge to external
libraries already.

Regards,
Martin
Apr 23 '07 #15
The Unicode standard doesn't require that you support surrogates, or
any other kind of character, so no you wouldn't be lying.
There is the notion of Unicode implementation levels, and each of them
does include a set of characters to support. In level 1, combining
characters need not be supported (which is sufficient for scripts
that can be represented without combining characters, such as Latin
and Cyrillic, using precomposed characters if necessary). In level 2,
combining characters must be supported for some scripts that absolutely
need them, and in level 3, all characters must be supported.

It is probably an interpretation issue what "supported" means. Python
clearly supports Unicode level 1 (if we leave alone the issue that it
can't render all these characters out of the box, as it doesn't ship
any fonts); it could be argued that it implements level 3, as it is
capable of representing all Unicode characters (but, of course, so
does Python 1.5.2, if you put UTF-8 into byte strings).

Regards,
Martin
Apr 23 '07 #16
Ross Ridge writes:
The Unicode standard doesn't require that you support surrogates, or
any other kind of character, so no you wouldn't be lying.
<ma****@v.loewis.de> wrote:
There is the notion of Unicode implementation levels, and each of them
does include a set of characters to support.
There are different levels of implementation for ISO 10646, but not
of Unicode.
It is probably an interpretation issue what "supported" means.
The strongest claim to support Unicode that you can meaningfully make
is that of conformance to the Unicode standard. The Unicode standard's
conformance requirements make it explicit that you don't need to support
any particular character:

C8 A process shall not assume that it is required to interpret
any particular coded character representation.

- Processes that interpret only a subset of Unicode characters
are allowed; there is no blanket requirement to interpret
all Unicode characters.
[...]
Python clearly supports Unicode level 1 (if we leave alone the issue
that it can't render all these characters out of the box, as it doesn't
ship any fonts);
It's not at all clear to me that Python does support ISO 10646's
implementation level 1, if only because I don't, and I assume you don't,
have a copy of ISO 10646 available to verify what the requirements
actually are.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
Apr 23 '07 #17
Ross Ridge <rr****@caffeine.csclub.uwaterloo.ca> writes:
The Unicode standard doesn't require that you support surrogates,
or any other kind of character, so no you wouldn't be lying.
+1 on Ross Ridge's contributions to this thread.

If Unicode is processed using UTF-8 or UTF-32 encoding forms then
there are no surrogates. They would only be present in UTF-16.
CESU-8 is strongly discouraged.

A Unicode 16-bit string is allowed to be ill-formed as UTF-16. The
example they give is one string that ends with a high surrogate code
point and another that starts with a low surrogate code point. The
result of concatenation is a valid UTF-16 string.
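On a modern Python 3 (which has no narrow builds, but does permit lone surrogates in str), the same point can be sketched with the `surrogatepass` error handler; the code points below are chosen purely for illustration:

```python
high = '\udbc0'  # unpaired high surrogate: ill-formed UTF-16 on its own
low = '\udc00'   # unpaired low surrogate: likewise ill-formed on its own

# Concatenating the two halves and round-tripping through UTF-16
# produces the single, well-formed scalar value U+100000.
combined = (high + low).encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
print(combined == '\U00100000')  # True
print(len(combined))             # 1 code point after the round trip
```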

The above refers to the Unicode standard. In Python with narrow
Py_UNICODE a unicode string is a sequence of 16-bit Unicode code
points. It is up to the programmer whether they want to specially
handle code points for surrogates. Operations based on concatenation
will conform to Unicode, whether or not there are surrogates in the
strings.
--
Pete Forman -./\.- Disclaimer: This post is originated
WesternGeco -./\.- by myself and does not represent
pe*********@westerngeco.com -./\.- the opinion of Schlumberger or
http://petef.port5.com -./\.- WesternGeco.
Apr 24 '07 #18
