"index" method only for mutable sequences??

C.L.

I was looking for a function or method that would return the index to the first
matching element in a list. Coming from a C++ STL background, I thought it might
be called "find". My first stop was the Sequence Types page of the Library
Reference (http://docs.python.org/lib/typesseq.html); it wasn't there. A search
of the Library Reference's index seemed to confirm that the function did not
exist. A little later I realized it might be called "index" instead. Voila.

My point is that the docs list and describe it as a method that only exists for
MUTABLE sequences. Why only for mutables? The class of objects I would expect it
to cover would be all ordered sequences, or, to phrase it a little more
pointedly, anything that supports ordered INDEXing. My understanding is that
dicts don't fall into that class of objects since their ordering is not
documented or to be depended on. However, tuples do support ordered indexing,
so why don't tuples have an index method?
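
For illustration, the lookup being asked about can be sketched as below. The helper name first_index is hypothetical, not a library function; it relies only on iteration, so it behaves the same for lists and tuples. (Tuples did later gain their own index() method, in Python 2.6.)

```python
def first_index(seq, value):
    """Return the position of the first element equal to value.

    Hypothetical helper: mirrors list.index() but needs only iteration.
    """
    for i, item in enumerate(seq):
        if item == value:
            return i
    raise ValueError("%r is not in sequence" % (value,))


print(first_index([10, 20, 30], 20))   # works on a list
print(first_index((10, 20, 30), 30))   # and on a tuple
```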

P.S.: I know I haven't gotten an answer to my "why" question yet, but,
assuming it's just an oversight or an example of design without the big picture
in mind, an added benefit to fixing that oversight would be that the "index"
method's documentation could be moved from the currently odd seeming location on
the "Mutable Sequence Types" page to a place someone would look for it logically.

P.P.S.: As much as the elementary nature of my question would make it seem, this
isn't my first day using Python. I've used it on and off for several years and I
LOVE Python. It is only because of my love for the language that I question its
ways, so please don't be overly defensive when I guess that the cause for this
possible oversight is a lack of design.

Corey Lubin

Apr 6 '07


Antoon Pardon

On 2007-04-13, Steve Holden <st***@holdenweb.com> wrote:

Antoon Pardon wrote:
>On 2007-04-12, Carsten Haese <ca*****@uniqsys.com> wrote:
>>On Thu, 2007-04-12 at 14:10 +0000, Antoon Pardon wrote:
People are always defending duck-typing in this news group and now python
has chosen to choose the option that makes duck-typing more difficult.
Au contraire! The "inconsistent" behavior of "in" is precisely what
duck-typing is all about: Making the operator behave in a way that makes
sense in its context.

No it isn't. Ducktyping is about similar objects using a similar
interface to invoke similar behaviour and getting similar result.

So that if you write a function you don't concern yourself with
the type of the arguments but depend on the similar behaviour.

Please note that "similar" does not mean "exact".

That is because I don't want to get down in an argument about
whether tp[:3] and ls[:3] is similar behaviour or exact the
same behaviour when tp is a tuple and ls is a list.

The behavior of str.__contains__ and list.__contains__ is similar.

That would depend on how much you find things may differ and
still call them similar. IMO they are not similar enough
since "12" in "123" doesn't behave like [1,2] in [1,2,3]
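
The difference being argued about can be seen directly. This is only a small demonstration of the behaviour of the in operator on the two types, not a statement about which design is preferable:

```python
# For strings, "in" tests for a substring ...
print("12" in "123")          # True

# ... while for lists it tests for a single element.
print([1, 2] in [1, 2, 3])    # False
print([1, 2] in [[1, 2], 3])  # True: here [1, 2] is itself an element
```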

Duck-typing allows natural access to polymorphism. You appear to be
making semantic distinctions merely for the sake of continuing this
rather fatuous thread.

I gave an argument that showed that the specific way the in
functionality was extended in strings makes duck-typing (and
by extension natural access to polymorphism) more difficult,
although it may do so in a way that is not significant to
you and the other developers.

Now if you don't agree with the argument presented that
is fine with me. If you think the problem is not big
enough to bother with, that is fine with me too.
But the argument doesn't disappear simply because you
think ill of my intentions.

And consider that each small inconsistency in itself
may be not important enough to remove. But if you
have enough of them remembering all these special
cases can become tedious.

>Suppose someone writes a function that acts on a sequence.
The algorithm used depending on the following invariant.

i = s.index(e) => s[i] = e

Then this algorithm is no longer guaranteed to work with strings.
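
A minimal check of this invariant, with a hypothetical helper name holds_invariant, shows where strings part ways with lists: the invariant holds for single-character arguments but not for longer substrings.

```python
def holds_invariant(s, e):
    """Check that s[s.index(e)] == e for a sequence s that contains e."""
    i = s.index(e)
    return s[i] == e


print(holds_invariant([1, 2, 3], 2))   # True
print(holds_invariant("123", "2"))     # True
print(holds_invariant("123", "12"))    # False: s[0] is "1", not "12"
```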

Because strings have different properties than other sequences. I can't
help pointing out that your invariant is invalid for tuples also,
because tuples don't have a .index() method.

Strings have some properties that are different and some
properties that are similar with other sequences. My argument
is that if you want to facilitate duck typing and natural access to
polymorphism in peoples functions that work with sequences in general
you'd better take care that the sequence api of strings resembles
the sequence api of other sequences as good as possible.

You on the other hand seem to argue that since strings have
properties where they differ from other sequences it no longer
is so important that the sequence api of strings resembles those
of other sequences.

--
Antoon Pardon

Apr 13 '07 #101

Steve Holden

Antoon Pardon wrote:

On 2007-04-13, Steve Holden <st***@holdenweb.com> wrote:
>Antoon Pardon wrote:
>>On 2007-04-12, Carsten Haese <ca*****@uniqsys.com> wrote:
On Thu, 2007-04-12 at 14:10 +0000, Antoon Pardon wrote:
People are always defending duck-typing in this news group and now python
has chosen to choose the option that makes duck-typing more difficult.
Au contraire! The "inconsistent" behavior of "in" is precisely what
duck-typing is all about: Making the operator behave in a way that makes
sense in its context.
No it isn't. Ducktyping is about similar objects using a similar
interface to invoke similar behaviour and getting similar result.

So that if you write a function you don't concern yourself with
the type of the arguments but depend on the similar behaviour.

Please note that "similar" does not mean "exact".

That is because I don't want to get down in an argument about
whether tp[:3] and ls[:3] is similar behaviour or exact the
same behaviour when tp is a tuple and ls is a list.

>The behavior of str.__contains__ and list.__contains__ is similar.

That would depend on how much you find things may differ and
still call them similar. IMO they are not similar enough
since "12" in "123" doesn't behave like [1,2] in [1,2,3]

And it never will, because of the property of strings I mentioned
previously. Unless you want to introduce a character type into Python
there is no way that you are ever going to be satisfied.

>Duck-typing allows natural access to polymorphism. You appear to be
making semantic distinctions merely for the sake of continuing this
rather fatuous thread.

I gave an argument that showed that the specific way the in
functionality was extended in strings makes duck-typing (and
by extension natural access to polymorphism) more difficult,
although it may do so in a way that is not significant to
you and the other developers.

I am not "a developer".

Now if you don't agree with the argument presented that
is fine with me. If you think the problem is not big
enough to bother with, that is fine with me too.
But the argument doesn't disappear simply because you
think ill of my intentions.

Apparently.

And consider that each small inconsistency in itself
may be not important enough to remove. But if you
have enough of them remembering all these special
cases can become tedious.

But not as tedious as this eternal discussion of already-decided issues.

>>Suppose someone writes a function that acts on a sequence.
The algorithm used depending on the following invariant.

i = s.index(e) => s[i] = e

Then this algorithm is no longer guaranteed to work with strings.

Because strings have different properties than other sequences. I can't
help pointing out that your invariant is invalid for tuples also,
because tuples don't have a .index() method.

Strings have some properties that are different and some
properties that are similar with other sequences. My argument
is that if you want to facilitate duck typing and natural access to
polymorphism in peoples functions that work with sequences in general
you'd better take care that the sequence api of strings resembles
the sequence api of other sequences as good as possible.

This is just a bald restatement of the same argument you feel makes it
desirable to add an index() method to tuples. If taken to its logical
(and ridiculous) extreme there should only be one sequence type in Python.

You on the other hand seem to argue that since strings have
properties where they differ from other sequences it no longer
is so important that the sequence api of strings resembles those
of other sequences.

Well, of course. Programming languages are for human users, and they
should do what human users find most natural. Since humans can disagree
the developers (amongst who I do not count myself, although I *am*
concerned about the development of Python) have to try and go by
consensus, which by and large they do reasonably successfully.

So what I suppose I *am* saying is that your opinions would seem to
differ from the consensus. While you are not in a minority of one you
are in a minority, and it would be nice if we could proceed without
having to continually revisit each small design decision on a continuous
basis.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 13 '07 #102

Brian van den Broek

Antoon Pardon said unto the world upon 04/13/2007 02:46 AM:

On 2007-04-12, Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.au> wrote:

<snip>

>So much fuss over such a little thing... yes it would be nice if tuples
grew an index method, but it isn't hard to work around the lack.

Yes it is a little thing. But if it is such a little thing why don't
the developers simply add it?

It's wafer thin!

--

Brian vdB

Apr 13 '07 #103

Steve Holden

Brian van den Broek wrote:

Antoon Pardon said unto the world upon 04/13/2007 02:46 AM:
>On 2007-04-12, Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.au> wrote:

<snip>

>>So much fuss over such a little thing... yes it would be nice if tuples
grew an index method, but it isn't hard to work around the lack.
Yes it is a little thing. But if it is such a little thing why don't
the developers simply add it?

It's wafer thin!

Quite. [The Python language adds an index() method to tuples and
promptly EXPLODES]. Thank you, Mr. Creosote.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 13 '07 #104

Rhamphoryncus

On Apr 13, 1:32 am, Antoon Pardon <apar...@forel.vub.ac.be> wrote:

Suppose someone writes a function that acts on a sequence.
The algorithm used depending on the following invariant.

i = s.index(e) => s[i] = e

Then this algorithm is no longer guaranteed to work with strings.

It never worked correctly on unicode strings anyway (which becomes the
canonical string in python 3.0). The base unit exposed by the
implementation is rarely what you want to operate upon.

The terminology is pretty confusing, but let's see if I can lay out
the relationships here:

byte ≤ code unit ≤ code point ≤ scalar value ≤ grapheme cluster ~
character ≤ syllable ≤ word ≤ sentence ≤ paragraph

"12" in "123" allows you to handle bytes through scalar values the
same way, glossing over the implementation details (such as UTF-32 on
linux and UTF-16 on windows).

--
Adam Olsen, aka Rhamphoryncus

Apr 13 '07 #105

Paul Rubin

Steve Holden <st***@holdenweb.com> writes:

This is just a bald restatement of the same argument you feel makes it
desirable to add an index() method to tuples. If taken to its logical
(and ridiculous) extreme there should only be one sequence type in
Python.

That doesn't sound ridiculous given type/class unification. There
could be a single sequence class that implements functions like index.
Subclasses like strings, tuples, lists, etc. would inherit from it.
Some of them might have optimized or customized implementations of
those standard operations, others might not.
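
As a sketch of this idea: in present-day Python the standard library's collections.abc.Sequence works roughly this way, supplying index(), count() and membership tests to any subclass that defines __getitem__ and __len__. The Pair class below is hypothetical and exists only for illustration.

```python
from collections.abc import Sequence


class Pair(Sequence):
    """Hypothetical immutable two-item sequence, for illustration only."""

    def __init__(self, first, second):
        self._items = (first, second)

    def __getitem__(self, i):
        return self._items[i]

    def __len__(self):
        return len(self._items)


p = Pair("a", "b")
print(p.index("b"))   # index() is inherited from Sequence
print(p.count("a"))   # so is count()
print("a" in p)       # and the membership test
```

A subclass remains free to override any of the inherited methods with a faster implementation.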

Apr 14 '07 #106

Paul Rubin

"Rhamphoryncus" <rh****@gmail.com> writes:

i = s.index(e) => s[i] = e
Then this algorithm is no longer guaranteed to work with strings.
It never worked correctly on unicode strings anyway (which becomes the
canonical string in python 3.0).

What?! Are you sure? That sounds broken to me.

Apr 14 '07 #107

Hendrik van Rooyen

"Carsten Haese" <ca*****@uniqsys.com> wrote:
8<------------

sense in its context. Nobody seems to be complaining about "+" behaving
"inconsistently" depending on whether you're adding numbers or
sequences.

I would if I thought it would do some good - the plus sign as a joiner
was, I think, a bad decision.

Just write a routine to calculate the checksum of an Intel Hex file record
to see what I mean.
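
For readers unfamiliar with the format, here is a sketch of that calculation. The checksum byte of an Intel HEX record is the two's complement of the low byte of the sum of the record's other bytes, so "+" has to mean addition of decoded byte values, whereas "+" applied to the hex text itself would concatenate. The function name intel_hex_checksum is hypothetical, and the sample record is just an example input.

```python
def intel_hex_checksum(record):
    """Compute the checksum byte for an Intel HEX record.

    `record` is the hex text after the leading ':' and without the
    final checksum byte, e.g. "0300300002337A".
    """
    data = bytes.fromhex(record)   # decode the hex digits to byte values
    return (-sum(data)) & 0xFF     # two's complement of the low byte


print("%02X" % intel_hex_checksum("0300300002337A"))   # 1E
print("03" + "00")                                     # concatenation: 0300
```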

- Hendrik

Apr 14 '07 #108

Hendrik van Rooyen

"Donn Cave" <do**@u.washington.edu> wrote:

>
Well, yes - consider for example the "tm" tuple returned
from time.localtime() - it's all integers, but heterogeneous
as could be - tm[0] is Year, tm[1] is Month, etc., and it
turns out that not one of them is alike. The point is exactly
that we can't discover these differences from the items itself -
so it isn't about Python types - but rather from the position
of the item in the struct/tuple. (For the person who is about
to write to me that localtime() doesn't exactly return a tuple: QED)

This is the point where the whole thing falls apart in my head and
I get real confused - I can't find a reason why, list or tuple, the first
item can't be something, the second something else, etc...

About the only reason you would use a tuple is if you want to
use it as a key to a dict - and then only because you have to,
you can't use a list as the language stands.
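
The dict-key restriction mentioned here is easy to demonstrate. A short sketch:

```python
points = {}
points[(3, 4)] = "tuple key works"   # a tuple of hashable items is hashable

try:
    points[[3, 4]] = "list key"      # lists are mutable, hence unhashable
except TypeError as exc:
    print("TypeError:", exc)

print(points)
```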

- Hendrik

Apr 14 '07 #109

Martin v. Löwis

The use case has already been discussed. Removing the pointless
inconsistency between lists and tuples means you can stop having to
remember it, so you can free up brain cells for implementing useful
things. That increases your programming productivity.

So to increase consistency, the .index method should be removed
from lists, as well, IMO. If you find yourself doing a linear
search, something is wrong.

Regards,
Martin

Apr 14 '07 #110

Steven D'Aprano

On Sat, 14 Apr 2007 08:19:36 +0200, Hendrik van Rooyen wrote:

This is the point where the whole thing falls apart in my head and
I get real confused - I can't find a reason why, list or tuple, the first
item can't be something, the second something else, etc...

It's not that they _can't_ be, but that the two sequence types were
designed for different uses:

Here are some tuples of statistics about people:

fred = (35, 66, 212) # age, height, weight
george = (42, 75, 316)

They are tuples because each one is like a Pascal record or a C struct.

It isn't likely that you'll be in a position where you know that Fred's
age, height or weight is 66, but you don't know which one and so need
fred.index() to find out. Hence, tuples weren't designed to have an index
method.

Here are some lists of statistics about people:

ages = [35, 42, 26, 17, 18]
heights = [66, 75, 70, 61, 59]
weights = [212, 316, 295, 247, 251]
# notice that the first column is fred, the second column is george, etc.

They are mutable lists rather than immutable tuples because you don't know
ahead of time how many data items you need to store.

Now, it is likely that you'll want to know which column(s) have an age of
26, so lists were designed to have an index method.
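
Using the ages list from above, a small sketch of that lookup. Note that index() reports only the first match, so finding every matching column takes a comprehension:

```python
ages = [35, 42, 26, 17, 18]

# index() returns only the first matching column ...
first = ages.index(26)

# ... so a comprehension is needed to collect every match.
all_matches = [i for i, age in enumerate(ages) if age == 26]

print(first, all_matches)
```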

Now, that's the sort of things tuples and lists were designed for. If you
want to use them for something else, you're free to.

--
Steven.

Apr 14 '07 #111

Hendrik van Rooyen

"Martin v. Löwis" <ma****@v.loewis.de> wrote:

>
So to increase consistency, the .index method should be removed
from lists, as well, IMO. If you find yourself doing a linear
search, something is wrong.

I agree.
You should at the very least make it a binary search.
To do that you have to sort the list.
Much more efficient.

; - )
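
Joking aside, a binary-search lookup can be sketched with the standard bisect module. The helper name sorted_index is hypothetical. Sorting costs O(n log n), so this only pays off when the same sorted data is searched many times, and the index returned refers to the sorted order, not the original one.

```python
import bisect


def sorted_index(sorted_seq, value):
    """Binary-search counterpart of list.index() for an already sorted sequence."""
    i = bisect.bisect_left(sorted_seq, value)
    if i < len(sorted_seq) and sorted_seq[i] == value:
        return i
    raise ValueError("%r is not in sequence" % (value,))


heights = sorted([66, 75, 70, 61, 59])
print(heights)
print(sorted_index(heights, 70))
```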

Please Sir, can I have the DoubleDict I asked for elsewhere in this thread?

- Hendrik

Apr 14 '07 #112

Rhamphoryncus

On Apr 13, 11:05 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

"Rhamphoryncus" <rha...@gmail.com> writes:

i = s.index(e) => s[i] = e
Then this algorithm is no longer guaranteed to work with strings.
It never worked correctly on unicode strings anyway (which becomes the
canonical string in python 3.0).

What?! Are you sure? That sounds broken to me.

Nope, it's pretty fundamental to working with text, unicode only being
an extreme example: there's a wide number of ways to break down a
chunk of text, making the odds of "e" being any particular one fairly
low. Python's unicode type only makes this slightly worse, not
promising any particular one is available.

For example, if you had an algorithm designed for ascii that gathered
statistics on how common each "character" is, you'd want to redesign
it to use either grapheme clusters or scalar values, then improve it
to merge duplicate characters. You'd need to roll your own iterator
though, Python doesn't provide a method that's specifically grapheme
clusters or scalar values (and if I'm wrong I'd love to hear it!).
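
A partial sketch of the "merge duplicate characters" step, assuming that canonical normalization (NFC) is what is wanted. The function name char_frequencies is hypothetical. The standard unicodedata module handles the normalization but does not segment text into grapheme clusters, so this counts code points, not clusters.

```python
import unicodedata
from collections import Counter


def char_frequencies(text):
    """Count code points after NFC normalization.

    Normalization merges canonically equivalent spellings, such as
    'e' followed by COMBINING ACUTE ACCENT and the precomposed U+00E9.
    It does not perform grapheme cluster segmentation.
    """
    return Counter(unicodedata.normalize("NFC", text))


counts = char_frequencies("e\u0301 \u00e9")
print(counts["\u00e9"])   # both spellings are counted together
```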

--
Adam Olsen, aka Rhamphoryncus

Apr 14 '07 #113

Paul Rubin

"Rhamphoryncus" <rh****@gmail.com> writes:

i = s.index(e) => s[i] = e
Then this algorithm is no longer guaranteed to work with strings.
It never worked correctly on unicode strings anyway (which becomes the
canonical string in python 3.0).
What?! Are you sure? That sounds broken to me.

Nope, it's pretty fundamental to working with text, unicode only being
an extreme example: there's a wide number of ways to break down a
chunk of text, making the odds of "e" being any particular one fairly
low. Python's unicode type only makes this slightly worse, not
promising any particular one is available.

I don't understand this. I thought that unicode was a character
coding system like ascii, except with an enormous character set
combined with a bunch of different algorithms for encoding unicode
strings as byte sequences. But I've thought of those algorithms
(UTF-8 and so forth) as basically being kludgy data compression
schemes, and unicode strings are still just sequences of code points.

Apr 14 '07 #114

Antoon Pardon

On 2007-04-13, Steve Holden <st***@holdenweb.com> wrote:

Antoon Pardon wrote:
>On 2007-04-13, Steve Holden <st***@holdenweb.com> wrote:
>>Antoon Pardon wrote:
On 2007-04-12, Carsten Haese <ca*****@uniqsys.com> wrote:
On Thu, 2007-04-12 at 14:10 +0000, Antoon Pardon wrote:
>People are always defending duck-typing in this news group and now python
>has chosen to choose the option that makes duck-typing more difficult.
Au contraire! The "inconsistent" behavior of "in" is precisely what
duck-typing is all about: Making the operator behave in a way that makes
sense in its context.
No it isn't. Ducktyping is about similar objects using a similar
interface to invoke similar behaviour and getting similar result.

So that if you write a function you don't concern yourself with
the type of the arguments but depend on the similar behaviour.

Please note that "similar" does not mean "exact".

That is because I don't want to get down in an argument about
whether tp[:3] and ls[:3] is similar behaviour or exact the
same behaviour when tp is a tuple and ls is a list.

>>The behavior of str.__contains__ and list.__contains__ is similar.

That would depend on how much you find things may differ and
still call them similar. IMO they are not similar enough
since "12" in "123" doesn't behave like [1,2] in [1,2,3]

And it never will, because of the property of strings I mentioned
previously. Unless you want to introduce a character type into Python
there is no way that you are ever going to be satisfied.

The properties of strings didn't force the developers to make those
two behave differently. They could have made the choice that "12"
in "123" returned False and could have introduced a method that would
return True or False depending on whether the argument was a substring
or not. The same method could then eventually be used in other sequences
to test whether the argument was a subsequence or not. Either by
the python-developers themselves if they ever thought that useful
or by any programmer who could add this functionality to a subclass.

Yes the properties of strings allowed for the solution the python
developers have chosen, a solution not extendable to other
sequence types. So yes [1,2] in [1,2,3] will never behave like "12" in
"123" currently does and the properties of strings allowed it to evolve this way
but in the end it was a design choice that could have been made differently
and could have been made in a way to allow more duck typing and more
access to polymorphism.

>And consider that each small inconsistency in itself
may be not important enough to remove. But if you
have enough of them remembering all these special
cases can become tedious.

But not as tedious as this eternal discussion of already-decided issues.

A number of those "decided" issues have been changed. Besides,
nobody is forcing you to participate. If you think these
kinds of issues are too tedious for your taste, feel free
to no longer participate.

>Strings have some properties that are different and some
properties that are similar with other sequences. My argument
is that if you want to facilitate duck typing and natural access to
polymorphism in peoples functions that work with sequences in general
you'd better take care that the sequence api of strings resembles
the sequence api of other sequences as good as possible.

This is just a bald restatement of the same argument you feel makes it
desirable to add an index() method to tuples.

No it is not a bald statement. If tuple would have methods like index
and count, more functions could be written that are indifferent to
the argument being a tuple or a list or at least it would make
writing such a function easier, so it would allow for more
duck typing and give more access to polymorphism.

You may think this kind of duck typing and polymorphism insignificant
but that doesn't change the truth about the above statement.

If taken to its logical
(and ridiculous) extreme there should only be one sequence type in Python.

No it doesn't. There is a big difference between having sequences with
different properties because there is a need for those different
properties and making things more different than needed and using
the need for different properties to introduce differences that are
unnecessary.

>You on the other hand seem to argue that since strings have
properties where they differ from other sequences it no longer
is so important that the sequence api of strings resembles those
of other sequences.

Well, of course. Programming languages are for human users, and they
should do what human users find most natural. Since humans can disagree
the developers (amongst who I do not count myself, although I *am*
concerned about the development of Python) have to try and go by
consensus, which by and large they do reasonably successfully.

But the defence of not having tuple.index has never been about
what was natural to the user or not, but has always been about what
tuples were supposedly intended for.

So what I suppose I *am* saying is that your opinions would seem to
differ from the consensus. While you are not in a minority of one you
are in a minority, and it would be nice if we could proceed without
having to continually revisit each small design decision on a continuous
basis.

I am not so sure I'm in a minority. This kind of thing is not decided by
consensus, at least not among the python users. It is the sole decision of
the BDFL. Besides, this is usenet; all kinds of things get revisited here on
a continuous basis. Why should design decisions be an exception?

--
Antoon Pardon

Apr 14 '07 #115

Rhamphoryncus

On Apr 14, 11:59 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

"Rhamphoryncus" <rha...@gmail.com> writes:
Nope, it's pretty fundamental to working with text, unicode only being
an extreme example: there's a wide number of ways to break down a
chunk of text, making the odds of "e" being any particular one fairly
low. Python's unicode type only makes this slightly worse, not
promising any particular one is available.

I don't understand this. I thought that unicode was a character
coding system like ascii, except with an enormous character set
combined with a bunch of different algorithms for encoding unicode
strings as byte sequences. But I've thought of those algorithms
(UTF-8 and so forth) as basically being kludgy data compression
schemes, and unicode strings are still just sequences of code points.

Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

As an aside, I feel the need to clarify the terms "code points" and
"scalar values". The only difference is that "code points" includes
the surrogates, whereas "scalar values" does not. As the surrogates
are just an encoding detail of UTF-16 I feel this makes "scalar
values" the more canonical term. It's all quite confusing though x_x.
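
The distinction can be written down as two small predicates. The function names are hypothetical; the numeric ranges are the ones Unicode defines for surrogates (U+D800 to U+DFFF) and for the code space as a whole (up to U+10FFFF).

```python
def is_surrogate(cp):
    """True if the code point lies in the UTF-16 surrogate range."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp):
    """True for any code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


print(is_scalar_value(0x1D400))   # an ordinary supplementary code point
print(is_scalar_value(0xD835))    # a high surrogate: code point, not scalar value
```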

--
Adam Olsen, aka Rhamphoryncus

Apr 15 '07 #116

Paul Rubin

"Rhamphoryncus" <rh****@gmail.com> writes:

Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

Apr 15 '07 #117

Neil Hodgson

Paul Rubin:

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

Python Unicode strings are arrays of code units which are either 16
or 32 bits wide with the width of a code unit determined when Python is
compiled. s[17] will be the 18th code unit of the string and is found by
indexing with no ancillary data structure or processing to interpret the
string as a sequence of code points.
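
Since Python 3.3 the str type indexes by code point on every build, so the build-dependent behaviour described here can only be simulated today. One way, sketched below with a hypothetical helper name code_units, is to encode the text and count fixed-width units:

```python
def code_units(text, encoding):
    """Number of code units `text` occupies in a fixed-width-unit encoding."""
    width = {"utf-16-le": 2, "utf-32-le": 4}[encoding]   # bytes per code unit
    return len(text.encode(encoding)) // width


s = "\U0001d400"                     # one code point outside the BMP
print(code_units(s, "utf-16-le"))    # two 16-bit units (a surrogate pair)
print(code_units(s, "utf-32-le"))    # one 32-bit unit
```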

This is the same technique used by other languages such as Java.
Implementing the Python string type with a data structure that can
switch between UTF-8, UTF-16 and UTF-32 while preserving the appearance
of a UTF-32 sequence has been proposed but has not gained traction due
to issues of complexity and cost.

Neil

Apr 15 '07 #118

Roel Schroeven

Paul Rubin schreef:

"Rhamphoryncus" <rh****@gmail.com> writes:
>Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

I didn't get it either, but now I understand. Like you, I thought Python
Unicode strings contain a canonical representation (in interface, not
necessarily in implementation) but apparently that is not true; see
Neil's post and the reference manual
(http://docs.python.org/ref/types.html#l2h-22).

A simple example on my Python installation, apparently compiled to use
UTF-16 (sys.maxunicode == 65535):

>>> s = u'\u1d400'
>>> s.index(s)
0
>>> s[0]
u'\u1d40'
>>> s == s[0]
False
In this case s[0] is not the full Unicode scalar, but instead just the
first part of the surrogate pair consisting of 0x1D40 (in s[0]) and
0x0000 (in s[1]).

--
If I have been able to see further, it was only because I stood
on the shoulders of giants. -- Isaac Newton

Roel Schroeven

Apr 15 '07 #119

Rhamphoryncus

On Apr 15, 1:55 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

"Rhamphoryncus" <rha...@gmail.com> writes:
Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

On linux (UTF-32):

>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\U0010ffff']

On windows (UTF-16):

>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\udbff', u'\udfff']

The unicode type's repr hides the distinction but you can see it with
list. Your "single character" is actually two surrogate code points.
s[s.index(c)] would only give you the first surrogate character.

--
Adam Olsen, aka Rhamphoryncus

Apr 15 '07 #120

Rhamphoryncus

On Apr 15, 8:56 am, Roel Schroeven <rschroev_nospam...@fastmail.fm>
wrote:

Paul Rubin schreef:

"Rhamphoryncus" <rha...@gmail.com> writes:
Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

I didn't get it either, but now I understand. Like you, I thought Python
Unicode strings contain a canonical representation (in interface, not
necessarily in implementation) but apparently that is not true; see
Neil's post and the reference manual
(http://docs.python.org/ref/types.html#l2h-22).

A simple example on my Python installation, apparently compiled to use
UTF-16 (sys.maxunicode == 65535):

>>> s = u'\u1d400'

You're confusing \u, which is followed by 4 digits, and \U, which is
followed by eight:

>>> list(u'\u1d400')
[u'\u1d40', u'0']
>>> list(u'\U0001d400')
[u'\U0001d400'] # UTF-32 output, sys.maxunicode == 1114111
[u'\ud835', u'\udc00'] # UTF-16 output, sys.maxunicode == 65535
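
The surrogate values in the UTF-16 output follow from the standard UTF-16 encoding arithmetic, sketched here with a hypothetical function name to_surrogate_pair:

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = cp - 0x10000             # 20 significant bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


print([hex(u) for u in to_surrogate_pair(0x1D400)])    # ['0xd835', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])   # ['0xdbff', '0xdfff']
```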

--
Adam Olsen, aka Rhamphoryncus

Apr 15 '07 #121

Paul Rubin

Roel Schroeven <rs****************@fastmail.fm> writes:

In this case s[0] is not the full Unicode scalar, but instead just the
first part of the surrogate pair consisting of 0x1D40 (in s[0]) and
0x0000 (in s[1]).

Arrrrgggh. After much head scratching I think I now understand what
you are saying. This appears to me to be absolutely nuts. What is
the purpose of having a unicode string type, if its sequence elements
are not guaranteed to be the unicode characters in the string? Might
as well use byte strings for everything.

Come to think of it, I don't understand why we have this plethora of
encodings like utf-16. utf-8 I can sort of understand on pragmatic
grounds, but aside from that I'd think UCS-4 should be used for everything,
and when a space-saving compressed representation is desired, then use
a general purpose data compression algorithm such as gzip.

Apr 15 '07 #122

Donn Cave

In article <ma***************************************@python.org>,
"Hendrik van Rooyen" <ma**@microcorp.co.za> wrote:

"Donn Cave" <do**@u.washington.edu> wrote:

Well, yes - consider for example the "tm" tuple returned
from time.localtime() - it's all integers, but heterogeneous
as could be - tm[0] is Year, tm[1] is Month, etc., and it
turns out that not one of them is alike. The point is exactly
that we can't discover these differences from the items itself -
so it isn't about Python types - but rather from the position
of the item in the struct/tuple. (For the person who is about
to write to me that localtime() doesn't exactly return a tuple: QED)

This is the point where the whole thing falls apart in my head and
I get real confused - I can't find a reason why, list or tuple, the first
item can't be something, the second something else, etc...

Of course, you may do what you like. Don't forget, though,
that there's no "index" method for a tuple.

Donn Cave, do**@u.washington.edu

Apr 16 '07 #123
