Bytes IT Community

Ordering python sets

Hi,
I need a structure to represent a set of integers. I also need to
perform on this set some basic set operations, such as adding or
removing elements, joining with other sets and checking for the
presence of specific elements.

I think that using Python sets would be the best choice, but I also
need integers to be ordered inside the set and I've just found out
that, instead, Python sets are unordered collections.

Is there another convenient structure or shall I use lists and define
the operations I need?

Thanks.
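[Editor's sketch, not from the thread; the class name and API below are invented.] One common way to get both the set operations and ordered iteration is to pair a plain set for O(1) membership with a list kept sorted via the bisect module:

```python
import bisect

class SortedIntSet:
    """A set of ints that iterates in sorted order.

    A plain set gives O(1) membership tests; a parallel list kept
    sorted with bisect gives ordered traversal.
    """
    def __init__(self, iterable=()):
        self._set = set()
        self._sorted = []
        for x in iterable:
            self.add(x)

    def add(self, x):
        if x not in self._set:
            self._set.add(x)
            bisect.insort(self._sorted, x)

    def discard(self, x):
        if x in self._set:
            self._set.remove(x)
            del self._sorted[bisect.bisect_left(self._sorted, x)]

    def __contains__(self, x):
        return x in self._set

    def __iter__(self):
        return iter(self._sorted)

    def union(self, other):
        result = SortedIntSet(self._set)
        for x in other:
            result.add(x)
        return result

s = SortedIntSet([5, 1, 3])
s.add(2)
s.discard(3)
print(list(s))  # [1, 2, 5]
```

Note the tradeoff: membership stays O(1), but add/discard are O(n) because of the list shift; for heavy workloads a dedicated sorted-container library or balanced tree scales better.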
Oct 22 '08 #1
20 Replies


On Sat, 25 Oct 2008 08:58:18 +0000, Lie Ryan wrote:
<anotherrandommusing>
Since Python is a dynamic language, I think it should be possible to do
something like this:

a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')
b = dict({'a': 'A'}, implementation = 'binarytree')
c = dict({'a': 'A'}, implementation = 'binarytree')
Oh I hope not. I think you have mistaken "dynamic" for "chaotic".

When I see a dict, I want to know that any two dicts work the same way. I
don't want to have to search the entire project's source code to find out
if it is a dict implemented as a hash table with O(1) lookups, or a dict
implemented as a binary tree with O(log N) lookups, or a dict implemented
as a linear array with O(N) lookups.

If I wanted that sort of nightmare, I can already do it by shadowing the
builtin:

dict = binarytree
D = dict({'a': 'A'}) # make a binary tree

There is no possible good that can come from this suggestion. The beauty of
Python is that the built-in data structures (list, dict, set) are
powerful enough for 99% of uses[1], and for the other 1%, you can easily
and explicitly use something else.

But *explicitly* is the point. There's never any time where you can do
this:

type(mydict) is dict

and not know exactly what performance characteristics mydict will have.
(Unless you shadow dict or type, or otherwise do something that breaks
the rules.) You never need to ask, "Okay, it's a dict. What sort of dict?"

If you want a binary tree, ask for a binary tree.


[1] Your mileage may vary.
--
Steven
Oct 25 '08 #2

On Sat, 25 Oct 2008 09:21:05 +0000, Steven D'Aprano wrote:
On Sat, 25 Oct 2008 08:58:18 +0000, Lie Ryan wrote:
><anotherrandommusing>
Since Python is a dynamic language, I think it should be possible to do
something like this:

a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')
b = dict({'a': 'A'}, implementation = 'binarytree')
c = dict({'a': 'A'}, implementation = 'binarytree')

Oh I hope not. I think you have mistaken "dynamic" for "chaotic".

When I see a dict, I want to know that any two dicts work the same way.
Oh no, the two dict implementations would work _exactly_ the same from the
outside; they are transparently interchangeable. Only the performance
characteristics differ because of the different implementations. Actually
I got this idea from a book about algorithms and data structures, which
said that an abstract data type (e.g. dict, set, list) has several
competing implementations, or data structures (e.g. binary-tree dict,
hashed dict, array dict). A data type's implementation and interface can
be separated so that we can switch the implementation without changing
the rest of the code. The book is The Algorithm Design Manual by Skiena.

hint: read the PS below
I don't want to have to search the entire project's source code to find
out if it is a dict implemented as a hash table with O(1) lookups, or a
dict implemented as a binary tree with O(log N) lookups, or a dict
implemented as a linear array with O(N) lookups.
No, you'd only need to look at the dict's creation point (or, much better,
at the project's docs, but not everyone creates good docs). The
alternative you mentioned below, shadowing the builtin, is both a hack
and much more likely to be missed.
If I wanted that sort of nightmare, I can already do it by shadowing the
builtin:

dict = binarytree
D = dict({'a': 'A'}) # make a binary tree
I DON'T want THAT sort of nightmare you mentioned...
And it'd be impossible to have two dictionaries with two different
implementations.
There is no possible good that can come from this suggestion. The beauty of
Python is that the built-in data structures (list, dict, set) are
powerful enough for 99% of uses[1], and for the other 1%, you can easily
and explicitly use something else.
Oh really? As far as I know, Python's list is extremely bad if you're
inserting data at the beginning of the list (e.g. lst.insert(0, x)
requires the whole array to be "copied" one address away). This is
because Python's list uses an array data structure, making addressing
(e.g. a[2]) fast but insertion slow. If, on the other hand, it were
implemented using a binary tree, insertion would be O(log n), but
addressing would be a bit tricky.

The keyword is "tradeoffs".
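The tradeoff being described can be demonstrated without any new syntax: collections.deque already trades O(1) indexing for O(1) insertion at both ends. A quick sketch (the timings are printed as a comparison rather than asserted, since they vary by machine):

```python
from collections import deque
import timeit

def prepend_all_list(items):
    lst = []
    for x in items:
        lst.insert(0, x)   # shifts every existing element: O(n) per insert
    return lst

def prepend_all_deque(items):
    d = deque()
    for x in items:
        d.appendleft(x)    # O(1) per insert
    return d

items = range(5000)
# Same result either way, very different cost:
assert prepend_all_list(items) == list(prepend_all_deque(items))

t_list = timeit.timeit(lambda: prepend_all_list(items), number=1)
t_deque = timeit.timeit(lambda: prepend_all_deque(items), number=1)
print(t_list > t_deque)
```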
But *explicitly* is the point. There's never any time where you can do
this:
Yes, true, explicitly IS the point. How much more explicit can you be than:
dict({foo: bar}, implementation = 'binarytree')
type(mydict) is dict
If my memory serves right, a binary-tree dict and a hash-table dict are
both dicts, right? (Duck Typing)
Only their implementation differs. Implementation is... well, an
"implementation detail".
and not know exactly what performance characteristics mydict will have.
Oh... why do I need to know what the performance characteristics of
mydict are? Unless I know what I'm doing.
(Unless you shadow dict or type, or otherwise do something that breaks
the rules.) You never need to ask, "Okay, it's a dict. What sort of
dict?"
Okay, it's a dict. What sort of dict? Who the hell cares? I don't need to
know; they all look and behave the same (Duck Typing)... at least until
I profile them (and since profiling is deep black magic by itself, it
cannot be used to discredit switching implementations).

Sometimes we need a data type to use a specific data structure with a
specific performance characteristic, because we know we'll be doing
one operation a lot more than the others.

If you actually need to know what the implementation is currently being
used, you could implement a dict.implementation property.
If you want a binary tree, ask for a binary tree.
Yeah, you ask for a binary tree EXPLICITLY:
bintreedict = dict({a:b}, implementation = 'binarytree')

while this:
regularhasheddict = dict({a:b})

would have a reasonable default.
PS: I do admit I used the wrong terms in the last post. I used the term
"data structure" where it should have been "abstract data type" ("data
structure" is a synonym for "implementation"). In this post, I hope I've
corrected the usage.

Oct 25 '08 #3

Lie Ryan wrote:
>
<anotherrandommusing>
Since Python is a dynamic language, I think it should be possible to do
something like this:

a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')
For this to work, the abstract list would have to know about all
implementations of the abstraction.
b = dict({'a': 'A'}, implementation = 'binarytree')
c = dict({'a': 'A'}, implementation = 'binarytree')

i.e. basically since a data structure can have different implementations,
and different implementations have different performance characteristics,
it should be possible to dynamically change the implementation used.

In the far future, the data structure and its implementation could be
abstracted even further:

a = list() # ordered list
b = set() # unordered list
c = dict() # unordered dictionary
d = sorteddict() # ordered dictionary

Each of the implementations would share a common subset of methods, plus
possibly a few implementation-dependent methods that would only work on
certain implementations (or would be extremely slow except in the right
implementation).
</anotherrandommusing>
The future is 3.0, at least in part, with Abstract Base Classes.
There are 16 in the collections module.
"In addition to containers, the collections module provides some ABCs
(abstract base classes) that can be used to test whether a class
provides a particular interface, for example, is it hashable or a
mapping, and some of them can also be used as mixin classes."

The ABCs for numbers are in the numbers module.
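In current Python 3 these ABCs live in collections.abc (the numeric tower is in the numbers module), and the builtins come preregistered, so the checks work out of the box:

```python
from collections.abc import Mapping, MutableSequence
from numbers import Integral

# The builtin types are preregistered with the matching ABCs:
print(issubclass(dict, Mapping))          # True
print(issubclass(list, MutableSequence))  # True
print(issubclass(bool, Integral))         # True (bool sits in the numeric tower via int)
```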

tjr

Oct 25 '08 #4

On Sat, 25 Oct 2008 18:20:46 -0400, Terry Reedy wrote:
Lie Ryan wrote:

><anotherrandommusing>
Since Python is a dynamic language, I think it should be possible to do
something like this:

a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')

For this to work, the abstract list would have to know about all
implementations of the abstraction.
/the exact syntax isn't really important/
/abstract type and implementation separation is the important point/

Actually, if I want to force it, that syntax could work using the same
magic used by event-based syst

Oct 25 '08 #5

On Sat, 25 Oct 2008 18:20:46 -0400, Terry Reedy wrote:
Lie Ryan wrote:

><anotherrandommusing>
Since Python is a dynamic language, I think it should be possible to do
something like this:

a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')

For this to work, the abstract list would have to know about all
implementations of the abstraction.
# Sorry the last message is truncated because of an "accident"

/the exact syntax isn't really important/
/abstract type and implementation separation is the important point/

Actually, if I want to force it, that syntax could work using the same
magic used by event-based systems: registration. Although I agree it
might be a bit cumbersome to do registration for something like this, but
as I've said before, exact syntax is not really important.

Oct 25 '08 #6

On Sat, 25 Oct 2008 21:53:10 +0000, Lie Ryan wrote:
On Sat, 25 Oct 2008 09:21:05 +0000, Steven D'Aprano wrote:
>On Sat, 25 Oct 2008 08:58:18 +0000, Lie Ryan wrote:
>><anotherrandommusing>
Since Python is a dynamic language, I think it should be possible to do
something like this:

a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')
b = dict({'a': 'A'}, implementation = 'binarytree')
c = dict({'a': 'A'}, implementation = 'binarytree')

Oh I hope not. I think you have mistaken "dynamic" for "chaotic".

When I see a dict, I want to know that any two dicts work the same way.

Oh no, the two dict implementations would work _exactly_ the same from
the outside; they are transparently interchangeable. Only the
performance characteristics differ because of the different
implementations.
Exactly. That was my point.

[...]
>I don't want to have to search the entire project's source code to find
out if it is a dict implemented as a hash table with O(1) lookups, or a
dict implemented as a binary tree with O(log N) lookups, or a dict
implemented as a linear array with O(N) lookups.

No, you'd only need to look at the dict's creation point (or, much better,
at the project's docs, but not everyone creates good docs).
And how do you find an arbitrary object's creation point without
searching the project's source code?
>If I wanted that sort of nightmare, I can already do it by shadowing
the builtin:

dict = binarytree
D = dict({'a': 'A'}) # make a binary tree

I DON'T want THAT sort of nightmare you mentioned... And it'd be
impossible to have two dictionaries with two different
implementations.
Nonsense.

dict = binarytree
D1 = dict({'a': 'A'}) # make a binary tree "dict"
dict = __builtin__.dict
D2 = dict({'a': 'A'}) # make a standard dict
dict = someothertype
D3 = dict({'a': 'A'})

I'm not suggesting this is a good idea. This is a terrible idea. But it
is not much worse than your idea:

D1 = dict({'a': 'A'}, implementation='binarytree')
D2 = dict({'a': 'A'}, implementation='dict')
D3 = dict({'a': 'A'}, implementation='someothertype')

>There is no possible good that can come from this suggestion. The beauty of
Python is that the built-in data structures (list, dict, set) are
powerful enough for 99% of uses[1], and for the other 1%, you can
easily and explicitly use something else.

Oh really? As far as I know, python's list is extremely bad if you're
inserting data at the beginning of the list
And how often do you do that?

And when you do, use a deque. Just call it a deque.
[...]
>But *explicitly* is the point. There's never any time where you can do
this:

Yes, true, explicitly IS the point. How more explicit can you be than:
dict({foo: bar}, implementation = 'binarytree')
>type(mydict) is dict

You miss the point. With your plan, you can do this:

D1 = dict({foo: bar}, implementation = 'binarytree')
D2 = dict({foo: bar}, implementation = 'dict')
type(D1) is type(D2)

and yet D1 and D2 have UTTERLY different performance characteristics. So
now you need to add ANOTHER test to distinguish dicts-which-are-dicts
from dicts-which-are-binary-trees:

D1.implementation != D2.implementation

And why? So you can avoid calling a goose a goose, and call it a duck
instead.

If my memory serves right, a binary-tree dict and a hash-table dict are
both dicts, right? (Duck Typing)
Only their implementation differs. Implementation is... well, an
"implementation detail".
Duck typing refers to *interface*, not implementation. I have no problem
with you using a type with the same interface as a dict. That's what duck
typing is all about. Just don't call it a dict!

>and not know exactly what performance characteristics mydict will have.

Oh... why do I need to know what the performance characteristic of
mydict is? Unless I know what I'm doing.
http://www.joelonsoftware.com/articl...000000319.html
Because when you do this:

mydict[key] = 1

it's important whether each dict lookup is O(1), O(log N) or O(N). For a
dict with one million items, that means an implementation based on a
binary tree does roughly 20 times more processing than a hash table
(log2 of 10**6 is about 20), and an implementation based on linear
search does roughly a million times more processing.

If you think implementation details don't matter, try this:

s1 = 'c'*(10**6)

versus

s2 = ''
for i in xrange(10**6):
    s2 = 'c' + s2 # defeat optimizer
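The same contrast in modern Python 3 spelling (range for xrange): the quadratic version performs roughly n**2/2 character copies, the linear one a single allocation. A sketch, with the timing printed as a comparison rather than a fixed number:

```python
import timeit

def build_fast(n):
    # One allocation: O(n)
    return 'c' * n

def build_slow(n):
    # Each prepend copies the whole string built so far: O(n**2) overall
    s = ''
    for _ in range(n):
        s = 'c' + s   # prepending defeats CPython's in-place += optimization
    return s

n = 20000
assert build_fast(n) == build_slow(n)   # same result, very different cost

t_fast = timeit.timeit(lambda: build_fast(n), number=1)
t_slow = timeit.timeit(lambda: build_slow(n), number=1)
print(t_slow > t_fast)
```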

>(Unless you shadow dict or type, or otherwise do something that breaks
the rules.) You never need to ask, "Okay, it's a dict. What sort of
dict?"

Okay, it's a dict. What sort of dict? Who the hell cares?
If you don't care, then why are you specifying the implementation type?

mydict = dict({'foo': 'bar'}, implementation="surprise me!")

You can't have it both ways. If you care, then you know enough to want a
hash table based dict (the standard) or a binary tree or something else.
So go ahead and use whatever data structure you want. Just don't call it
a dict.

But if you don't care, then just use the Python standard data types.

I don't need
to know; they all look and behave the same (Duck Typing)...
Until you try a simple operation like len(mydict) and it takes three
minutes because the implementation you chose is really inefficient.

Sometimes we need a data type to use a specific data structure with a
specific performance characteristic, because we know we'll be doing
one operation a lot more than the others.
I'm not arguing that you should never use any other data structure. Go
right ahead and use them all you like.

>If you want a binary tree, ask for a binary tree.

Yeah, you ask for a binary tree EXPLICITLY: bintreedict = dict({a:b},
implementation = 'binarytree')

Or you could do it the right way:

D1 = binarytree(data)
D2 = dict(data)
type(D1), type(D2)
=returns binarytree, dict

versus:

D1 = dict(data, implementation='binarytree')
D2 = dict(data)
type(D1), type(D2)
=returns dict, dict

--
Steven
Oct 26 '08 #7

Lie Ryan wrote:
On Sat, 25 Oct 2008 18:20:46 -0400, Terry Reedy wrote:
>>a = list([1, 2, 3, 4, 5], implementation = 'linkedlist')
For this to work, the abstract list would have to know about all
implementations of the abstraction.

/the exact syntax isn't really important/
/abstract type and implementation separation is the important point/

Actually, if I want to force it, that syntax could work using the same
magic used by event-based systems: registration.
ABCs have a registration method. The builtin ABCs have the appropriate
builtin classes preregistered.
>>> import collections as co
>>> mu = co.MutableSequence
>>> issubclass(list, mu)
True

I believe user classes that inherit from an ABC are also registered, and
others can be registered explicitly.
>Although I agree it
might be a bit cumbersome to do registration for something like this, but
as I've said before, exact syntax is not really important.
Then why do you object to the current
mylist = linkedlist(data)
and request the harder to write and implement
mylist = list(data, implementation = 'linkedlist')
?

tjr

Oct 26 '08 #8

On Sun, 26 Oct 2008 00:53:18 +0000, Steven D'Aprano wrote:
[...]
And how do you find an arbitrary object's creation point without
searching the project's source code?
How is it better with the current way?
Asking the .implementation field isn't much harder than asking the
type(), and it is much more obvious that we're asking for the
"implementation" being used.
>>If I wanted that sort of nightmare, I can already do it by shadowing
the builtin:

dict = binarytree
D = dict({'a': 'A'}) # make a binary tree

I DON'T want THAT sort of nightmare you mentioned... And it'd be
impossible to have two dictionaries with two different
implementations.

Nonsense.
<snip>

There are two good arguments for using a plain string to specify the
implementation used: 1) plain-text configuration, 2) argument passing.

1) The current way requires you to have a map from the configuration text
to a function (i.e. imps = {'linkedlist': linkedlist}), then assign the
function to _a global name_ (i.e. lista = imps[configuration[confkey]] --
namespace pollution), then instantiate an object using that (ls =
lista([...])). The implementation way only requires ls = list([...],
implementation = configuration[confkey]).

2) Simply put, plain-text passing is safer and easier than passing a
function object.
>>There is no possible good that can come from this suggestion. The beauty
of Python is that the built-in data structures (list, dict, set) are
powerful enough for 99% of uses[1], and for the other 1%, you can
easily and explicitly use something else.

Oh really? As far as I know, python's list is extremely bad if you're
inserting data at the beginning of the list

And how often do you do that?
It's not even about deques yet; a regular queue is slow using a list in
Python, and any insertion sort requires you to insert elements at
arbitrary positions. This makes using an array for it O(n**2) overall;
with a binary tree, on the other hand, each insertion is O(log n).
[...]
>>But *explicitly* is the point. There's never any time where you can do
this:

Yes, true, explicitly IS the point. How much more explicit can you be than:
dict({foo: bar}, implementation = 'binarytree')
>>type(mydict) is dict


You miss the point. With your plan, you can do this:

D1 = dict({foo: bar}, implementation = 'binarytree')
D2 = dict({foo: bar}, implementation = 'dict')
type(D1) is type(D2)

and yet D1 and D2 have UTTERLY different performance characteristics.
It does not matter that they have different performance characteristics.
So
now you need to add ANOTHER test to distinguish dicts-which-are-dicts
from dicts-which-are-binary-trees:
How often would you need to care about the specific implementation used,
compared with how often you want to know how to work with the object?
Most of the time, you only need to know how to work with it (i.e. know
its interface); only _sometimes_, when speed DOES matter, do you need to
know and change its implementation.
D1.implementation != D2.implementation

And why? So you can avoid calling a goose a goose, and call it a duck
instead.
I think we have utterly different views here.
I wish list() were an abstract type instead of a data structure
(binarytreelist and arraylist being properties of a list),

while you insist that list() is a data structure, and the abstract type
is some obscure thing somewhere unknown.
>If my memory serves right, a binary-tree dict and a hash-table dict are
both dicts, right? (Duck Typing)
Only their implementation differs. Implementation is... well, an
"implementation detail".

Duck typing refers to *interface*, not implementation.
Well said; you've just supported my view.
I have no problem
with you using a type with the same interface as a dict. That's what
duck typing is all about. Just don't call it a dict!
Why do you think a binary-tree dict is "less of a dict" than a hashed dict?
>>and not know exactly what performance characteristics mydict will
have.

Oh... why do I need to know what the performance characteristic of
mydict is? Unless I know what I'm doing.

http://www.joelonsoftware.com/articl...000000319.html
Sometimes it is good to have control of everything. Sometimes (most of
the time) I know I can set aside some of the details to think about more
important matters, and use profiling to find out which parts need more
thought.
Because when you do this:

mydict[key] = 1

it's important whether each dict lookup is O(1), O(log N) or O(N). For a
dict with one million items, that means an implementation based on a
binary tree does roughly 20 times more processing than a hash table, and
an implementation based on linear search does roughly a million times
more processing.
Well, I don't need to know how the dict works if I'm only using it to
store twenty items, right? (I do know how Python's dict/set/list/etc.
are implemented, but unless that particular dict is central to my
program, I'm not going to design my program around how dict is
implemented; I'm going to design it around how my mind works.)

I don't care how the print statement works unless I'm trying to do
something fancy with print. I don't care that print takes a series of
electrical signals, converts them into a series of bytes, and passes
them to the terminal's stdout; that the terminal then figures out how to
draw the string using the appropriate font at the appropriate place and
passes itself to a window manager, which composes the desktop into an
image; or that the graphics driver converts the image into electrical
signals to be sent to the graphics card, which does some fancy tricks
before sending it to the monitor. All I need is to show the user a
simple welcome message.

The reason we use a high-level language is that we know we don't need to
care about some details. Only when a detail matters do we take care of
it. If I cared about all the details, I'd use C, no, lower, I'd use
assembly, no, lower..., I'd design my own CPU, no, even lower, I'd
design my own non-von-Neumann architecture. I don't think I can change
the laws of physics, though, but if I could... <some more pseudo random
bytes>
>I don't need
to know; they all look and behave the same (Duck Typing)...

Until you try a simple operation like len(mydict) and it takes three
minutes because the implementation you choose is really inefficient.
... then I know I have chosen the wrong data structure, and I know I
need to change it.
>>If you want a binary tree, ask for a binary tree.

Yeah, you ask for binary tree EXPLICITLY: bintreedict = dict({a:b},
implementation = 'binarytree')

Or you could do it the right way:
<snip>

For example, if I have a certain unknown object, what is easier?
>>> type(URO) # Unidentified Round Object
=returns (*&%$^%*

or
>>> type(URO)
=returns diningplate

Only after three thousand years of continuous and laborious research did
I find out that (*&%$^%* is an alien dining plate, and that it has all
the features and functionality of an earth dining plate, although, being
made of CClKO32C32CH42U, it is much more resistant to acidity but melts
when you put a carrot on it.

Only when I want to eat a carrot would I care that it is an alien dining
plate. And only when I want to eat something extremely acidic would I
choose the alien dining plate over the earth dining plate.

Oct 26 '08 #9

On Sat, 25 Oct 2008 21:50:36 -0400, Terry Reedy wrote:
Lie Ryan wrote:
>On Sat, 25 Oct 2008 18:20:46 -0400, Terry Reedy wrote:
Then why do you object to current
mylist = linkedlist(data)
and request the harder to write and implement
mylist = list(data, implementation = 'linkedlist')
I don't wholly object to it. I think it's fine but can be improved.

I tend to be more inclined to this side, though:

mylist = list.linkedlist(data) # or list<othersymbol>linkedlist(data)
as syntax sugar for
mylist = list(data, implementation = 'linkedlist')

This kind of syntax highlights the fact that linkedlist is a choice of
implementation for the list abstract type. The current syntax gives no
indication whatsoever that linkedlist is in any way related to list.

Oct 26 '08 #10

Glenn Linderman:
how does one create a key that corresponds to ascending integer followed by descending character string?
(Others may have already answered you because Google groups is very
slow.)
>>> seq = [(10, "abb"), (5, "zul"), (5, "hal"), (2, "of")]
>>> sorted(seq, key=lambda (n,s): (-n, s), reverse=True)
[(2, 'of'), (5, 'zul'), (5, 'hal'), (10, 'abb')]

A little harder question is how to create a key that corresponds to
ascending string followed by descending string?

To do that you can sort the data two times, relying on the stable
nature of the Python sort.

Another (silly?) solution may be to invent a "negate" function for
strings too:
>>> strneg = lambda txt: txt.translate(itable)
>>> itable = "".join(chr(i) for i in xrange(255, -1, -1))
>>> pairs = [('al', 'boo'), ('zul', 'ao'), ('al', 'den'), ('zul', 'o')]
>>> sorted(pairs, key=lambda (s1,s2): (s1, strneg(s2)))
[('al', 'den'), ('al', 'boo'), ('zul', 'o'), ('zul', 'ao')]
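Both recipes above rely on Python 2 tuple-unpacking lambdas and xrange, which are gone in Python 3. A Python 3 rework of the same data (an editor's sketch) uses indexed keys and the two-pass stable sort mentioned above:

```python
# Ascending integer then descending string, via a negated numeric key:
seq = [(10, "abb"), (5, "zul"), (5, "hal"), (2, "of")]
result = sorted(seq, key=lambda p: (-p[0], p[1]), reverse=True)
print(result)  # [(2, 'of'), (5, 'zul'), (5, 'hal'), (10, 'abb')]

# Ascending string then descending string: two stable sorts, last key first.
pairs = [('al', 'boo'), ('zul', 'ao'), ('al', 'den'), ('zul', 'o')]
pairs.sort(key=lambda p: p[1], reverse=True)  # secondary key, descending
pairs.sort(key=lambda p: p[0])                # primary key, ascending (stable)
print(pairs)  # [('al', 'den'), ('al', 'boo'), ('zul', 'o'), ('zul', 'ao')]
```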

Bye,
bearophile
Oct 27 '08 #11

be************@lycos.com wrote:
Glenn Linderman:
>how does one create a key that corresponds to ascending integer followed
by descending character string?

(Others may have already answered you because Google groups is very
slow.)
>>> seq = [(10, "abb"), (5, "zul"), (5, "hal"), (2, "of")]
>>> sorted(seq, key=lambda (n,s): (-n, s), reverse=True)
[(2, 'of'), (5, 'zul'), (5, 'hal'), (10, 'abb')]

A little harder question is how to create a key that corresponds to
ascending string followed by descending string?

To do that you can sort the data two times, relying on the stable
nature of the Python sort.

Another (silly?) solution may be to invent a "negate" function for
strings too:
>>> strneg = lambda txt: txt.translate(itable)
>>> itable = "".join(chr(i) for i in xrange(255, -1, -1))
>>> pairs = [('al', 'boo'), ('zul', 'ao'), ('al', 'den'), ('zul', 'o')]
>>> sorted(pairs, key=lambda (s1,s2): (s1, strneg(s2)))
[('al', 'den'), ('al', 'boo'), ('zul', 'o'), ('zul', 'ao')]
Here's a class that can negate arbitrary values:
>>> class Neg(object):
...     def __init__(self, value):
...         self.value = value
...     def __cmp__(self, other):
...         return -cmp(self.value, other.value)
...
>>> pairs = [('al', 'boo'), ('zul', 'ao'), ('al', 'den'), ('zul', 'o')]
>>> sorted(pairs, key=lambda (s, t): (s, Neg(t)))
[('al', 'den'), ('al', 'boo'), ('zul', 'o'), ('zul', 'ao')]

Oct 27 '08 #12

Lie Ryan:
>Oh no, the two dict implementation would work _exactly_ the same from the outside, they are transparently interchangeable. Only the performance characteristic differs because of the different implementation.<
I don't agree with the general idea. If the operations done by your
data structure have different computational complexity, then they are
fit for different usages. When you program, you must know what
computational complexity each of the operations of your data
structure has; otherwise there's no way to know the complexity of your
whole program, so instead of programming you just become a mage
who tries magical spells and hopes for the best. So I don't accept
such different data structures having the same name. That's why
I'll never appreciate the Python list type being named list instead of
array, despite it supporting more or less all the functions you expect
from an abstract list type.

Said that, for a high-level language like Python I can see another
possible solution: to have a type that contains several kinds of data
structures, for example a "dict" that contains a hash implementation,
a red-black tree, etc. Such a data structure can use a small part of the
computational power given to it to collect usage statistics for each
object (or each variable, which may contain in succession several
objects of the same type). Such a data structure can then adapt
dynamically at run time, choosing the implementation better suited to
the specific usage of each object or variable (the programmer can give
hints, of course, or sometimes even coerce the use of a specific
implementation). (Such statistics can also be saved on disk to be used
for successive runs of the program, to help them work well from the
start too, and not just after a while.) If this capability works well
in practice, then it can solve the problem you were talking about, I
think.

I presume data structures in future high-level languages will be quite
more able to adapt themselves to their usages. Today it's the second
time I talk about future programming languages :-)

Bye,
bearophile
Oct 27 '08 #13

On approximately 10/27/2008 10:27 AM, came the following characters from
the keyboard of Peter Otten:
Here's a class that can negate arbitrary values

... def __init__(self, value):
... self.value = value
... def __cmp__(self, other):
... return -cmp(self.value, other.value)
Unfortunately, the __cmp__ method has been eliminated in
3.0 as well, so this won't work as-is. You would need to
override __lt__, __eq__, etc.
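A Python 3 rework of the Neg idea using rich comparisons (an editor's sketch; functools.total_ordering fills in the remaining operators from __eq__ and __lt__):

```python
import functools

@functools.total_ordering
class Neg:
    """Wraps a value and reverses its ordering."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return self.value == other.value
    def __lt__(self, other):
        return other.value < self.value  # deliberately reversed

pairs = [('al', 'boo'), ('zul', 'ao'), ('al', 'den'), ('zul', 'o')]
print(sorted(pairs, key=lambda p: (p[0], Neg(p[1]))))
# [('al', 'den'), ('al', 'boo'), ('zul', 'o'), ('zul', 'ao')]
```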

--
Greg
Oct 28 '08 #14

On Mon, 27 Oct 2008 13:18:43 -0700, bearophileHUGS wrote:
So I don't accept such different data structures having the same name
You need to adjust the current mindset slightly (but in an important way,
to understand the "why" behind this idea). The current notion is: list
and dict are data structures. With this idea, list and dict are abstract
types, not data structures; array, linked list, binary tree, red-black
tree, and hash table are data structures. We create a data structure by
passing the data structure's identifier string to a factory function
provided by the abstract type.

There are two possible syntax sugars:
a = list.bintree([...])
b = list([...]) # also for backward compat; have a reasonable default

In short, the data structures don't share the same name; the abstract
data type does. The suggestion is to change the place where we define
what data structure to use.
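The factory-on-the-abstract-type idea can be sketched with today's Python. Everything here (AbstractList, register, create) is a hypothetical illustration, with collections.deque standing in for a linked list:

```python
from collections import deque

class AbstractList:
    """Hypothetical abstract type whose factory dispatches on a string."""
    _impls = {}

    @classmethod
    def register(cls, name, impl):
        cls._impls[name] = impl

    @classmethod
    def create(cls, data, implementation='array'):
        # Look up the concrete data structure by its identifier string.
        return cls._impls[implementation](data)

AbstractList.register('array', list)
AbstractList.register('deque', deque)  # stand-in for a linked list

a = AbstractList.create([1, 2, 3])                       # reasonable default
b = AbstractList.create([1, 2, 3], implementation='deque')
print(type(a).__name__, type(b).__name__)  # list deque
```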
Said that, for a high-level language like Python I can see another
possible solution. To have a type that contains several kinds of data
structures, for example a "dict" that contains a hash implementation, a
red-black tree, etc. Such data structure can use a small part of the
computational power given to it to collect statistics usages of each
object (or each variable, that may contain in succession several ojects
of the same type). Such data structure can then at run time adapt
dynamically, chosing to use the implementation fitter for the specific
usage of each object or variable (the programmer can give hints of
course, or sometimes even coerce the usage a specific implementation).
(such statistics can also be saved on disk to be used for the successive
run of the program, to help them work well from the start too, and not
just after a while). If this capability works well in practice, then it
can solve the problem you were talking about, I think.

I presume data structures in future high-level languages will be quite
more able to adapt themselves to their usages. Today it's the second
time I talk about future programming languages :-)
Although you said you disagree with the general idea, you actually took
the idea two steps further; I take that as implicit agreement with
several parts of the idea.

Nov 1 '08 #15

P: n/a
Lie Ryan:
>Although you said you disagree with the general idea, you actually take the idea two steps further; I take that as implicit agreement with several parts of the idea.<
Think about a bridge: building half a bridge may be bad/useless, while
building it whole may lead to something useful :-)

Bye,
bearophile
Nov 1 '08 #16

P: n/a
2008/10/27 <be************@lycos.com>:
Lie Ryan:
>>Oh no, the two dict implementation would work _exactly_ the same from the outside, they are transparently interchangeable. Only the performance characteristic differs because of the different implementation.<

I don't agree with the general idea. If the operations done by your
data structure have different computational complexity, then they are
fit for different usages. When you program you must know what
computational complexity each of the operations of your data structure
has; otherwise there's no way to know the complexity of your whole
program, so instead of programming you just become a mage who tries
magic spells and hopes for the best.
No, when you program you know how you will be using the data
structure, so you can choose the implementation that's right for the
application. That's what the container libraries for other languages
do. At the moment, you just have one implementation, and have to hope
it's right for the job. Adding an *optional* parameter that says, in
effect, "I want this list optimised for writes and reads at both ends"
or "I want this list optimised for fully random reads but writes only
at the end" doesn't *lose* you any information about what you're
programming with. Of course it's essential that the data structure has
identical /functional/ behaviour whatever optimisation you use. Other
languages can enforce that, but Python programmers are used to taking
care of that side of things for themselves.
>So I don't accept so many different data structures having the same
>name. That's why I'll never appreciate the Python list type being named
>list instead of array, despite supporting more or less all the
>functions you expect from an abstract list type.
They're not different data structures from the client point of view.
"More or less" all the functions wouldn't be enough.

--
Tim Rowe
Nov 1 '08 #17

P: n/a
Lie Ryan wrote:
You need to adjust the current mindset slightly (but in a way that is
important for understanding the "why" behind this idea). The current
notion is: list and dict are data structures. With this idea, list and
dict are abstract types, not data structures; array, linked list, binary
tree, red-black tree, and hash table are data structures.
Did you miss my earlier post? Python already has about 20 abstract
types. It has a naming scheme for these, as well as for concrete
structures, both built-in and imported. Agitating for names to be
swapped around is a waste of time. You are free, however, to do this in
your own code.
We create a data structure by
passing the data structure's identifier string to a factory function
provided by the abstract type.
Python is object-oriented rather than name-oriented. When one knows the
name of an object at compile time, the Python way is to use it directly
in the code, unquoted. In other words, work with and pass around
objects as much as possible and only use names when they are truly
variable and not known constants.

What you propose is equivalent to using getattr for all attribute
access: "getattr(list, 'append')" versus "list.append".
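To make that contrast concrete, here is a small sketch (make_container
and REGISTRY are invented names for illustration) of the
object-oriented style versus the name-oriented style:

```python
# Object-oriented style: pass the class itself, no string lookup.
def make_container(container_type):
    return container_type()

d1 = make_container(dict)       # the object 'dict' is passed directly

# Name-oriented style: an extra, avoidable level of indirection.
REGISTRY = {'dict': dict}
d2 = REGISTRY['dict']()

assert type(d1) is type(d2) is dict
```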

Terry Jan Reedy

Nov 1 '08 #18

P: n/a
On Sun, 02 Nov 2008 02:08:37 +0000, Steven D'Aprano wrote:
On Sat, 01 Nov 2008 18:58:59 +0000, Tim Rowe wrote:
>2008/10/27 <be************@lycos.com>:
>>Lie Ryan:

Oh no, the two dict implementation would work _exactly_ the same from
the outside, they are transparently interchangeable. Only the
performance characteristic differs because of the different
implementation.<

I don't agree with the general idea. If the operations done by your
data structure have different computational complexity, then they are
fit for different usages. When you program you must know what
computational complexity each of the operations of your data structure
has; otherwise there's no way to know the complexity of your whole
program, so instead of programming you just become a mage who tries
magic spells and hopes for the best.

No, when you program you know how you will be using the data structure,
so you can choose the implementation that's right for the application.

I don't understand why you have prefixed that sentence with "No". It
seems like you are agreeing with bearophile's point.
On a linguistic note: as a non-native speaker of English, I've never
relied on the correct usage of Yes and No and instead rely on the text
that follows. In some languages, situations where English speakers
would answer Yes are answered with No, and vice versa.
>That's what the container libraries for other languages do. At the
moment, you just have one implementation, and have to hope it's right
for the job.

To the extent that this is true, it is a sign of the poverty of the
standard Python library. But in fact it isn't true. Python has quite a
decent collection of collection types. Just off the top of my head:

tuple, namedtuple, list, set, frozenset, dict, defaultdict, queue, deque

set, list and dict are highly optimized for general use, and are very
suitable for building new data structures. And you are wrong to say that
we have to "hope it's right for the job", we can build perfectly
acceptable pure-Python data structures from these building blocks, or
add our own C extensions. You're never forced to use a standard library
data structure, it's a trade-off of your time and effort against runtime
efficiency. And because the standard library types are generally so
excellent, that trade-off is often (but not always) a no-brainer.
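For instance, the "optimised for writes and reads at both ends" case
Tim asks for below is already covered by collections.deque; a quick
illustration of picking the right stdlib structure for the workload:

```python
from collections import deque

# "Optimised for writes and reads at both ends": that is deque.
d = deque([1, 2, 3])
d.appendleft(0)   # O(1) at the left end
d.append(4)       # O(1) at the right end
assert d.popleft() == 0
assert d.pop() == 4

# "Fully random reads, writes at the end": that is the built-in list.
lst = [1, 2, 3]
lst.append(4)     # amortised O(1)
assert lst[2] == 3
```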
There are cases where the algorithm you use triggers the standard
library data structure's worst-case behaviour.
>Adding an *optional* parameter that says, in effect, "I want this list
optimised for writes and reads at both ends" or "I want this list
optimised for fully random reads but writes only at the end" doesn't
*lose* you any information about what you're programming with.

It might not lose information, but it does obfuscate it.

Right now, we can do this:
<snip>

I've rebutted this same argument multiple times. How often do you think
you'd need to check an object's implementation? Compare that to how
often we need to ensure that the object we're handling is a list
replacement.
That's assuming the implementation is available at runtime at all; if
it's not, then you have lost information.
The implementation property is always available, because it's a non-
optional argument when registering the data structure.
>Of course it's essential that the data structure has identican
/functional/ behaviour whatever optimisation you use.

"Essential"? That's an interesting point. Let's look at an actual
example.

Consider one of Lie's initial examples (paraphrased):

D1 = dict({1: 'a', 2: 'b'})                           # standard hash-table
D2 = dict({1: 'a', 2: 'b'}, implementation="binary tree")

You say it's "essential" that the binary tree implementation has
identical functional behaviour to standard hash-table implementation of
dict. Okay, let's see where that leads us. The functional behaviour of
hash-tables is that they are O(1) for reads, while binary trees are
O(log(N)). The only way for them to have identical functional behaviour
is to *artificially* slow down dicts to the lowest common speed. That is
a Bad Thing, and I don't understand why you think it is essential.
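To illustrate the complexity difference being argued about, here is a
sketch in today's Python contrasting a hash-table dict lookup with a
binary search over sorted keys (standing in for a tree lookup): both
return the same answer for the same query; only the cost differs.

```python
import bisect

squares = {k: k * k for k in range(1000)}

# Hash-table dict: average O(1) per lookup.
assert squares[500] == 250000

# Binary search over sorted keys: O(log n) per lookup, yet the
# observable result of the query is identical.
keys = sorted(squares)
i = bisect.bisect_left(keys, 500)
assert keys[i] == 500
assert squares[keys[i]] == 250000
```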
I can't see why you consider a function's performance characteristic to
be its behaviour; a "characteristic" of something is one of its
properties, not its behaviour.
we no longer have to
artificially change the behaviour of the type -- but we do have to
artificially cripple the API of the type!
I don't get why you think we've got to restrict it to the lowest common
denominator. We expect people who mess with a field called
"implementation" to know enough about the limitations of the
implementation they chose. Regular Python programmers nowadays have to
be aware that the limitation of the (hashed) dict is that its keys must
be immutable. A true dict/mapping type doesn't have that limitation, so
we already have the so-called "artificial" limitations right now. The
way it is now gives the impression that the hashed dict's limitations
are equal to dict/mapping's limitations. Instead, it should be: dict is
a mapping type from any object to any object; however, the default
hashed dict's limitation is immutable keys.

Aside: also, with my way, we could classify a "feature" into one of
three statuses: limitation, exact replacement, or extension. A
limitation covers minor points the implementation simply couldn't
fulfil (failing to fulfil major points would fail registration); exact
replacement covers the regular things; and extensions are features
that, if used, might make changing implementations harder, so they
should only be used if that implementation's extension is the natural
way to write an algorithm.
But maybe you'll argue that Lie's examples are bad examples.
Yes, I agree it is a bad example for this issue. When I wrote it, what
I had in mind was to demonstrate how it'd look in a simple way. Way
too simple, it seems.
Perhaps
what you have in mind is something like the difference between "dicts
implemented as hash tables with chaining" and "dicts implemented as hash
tables with open addressing". (Aside: what sort of open addressing?
Linear? Quadratic? Something else?) At least now we're talking about
different implementations with the same API and more-or-less the same
functional behaviour.
Not quite. I'm looking for more radical differences. This is more
apparent in array-list vs bintree-list. One is better for straight
traversal and random access, while the other is better for searching
and inserting.
I'd suggest that 99% of the time, client code simply shouldn't care
about the differences. The existing implementations are Good Enough.
Have you *ever* found yourself saying "I can't use the built-in dict,
because it uses chaining and for my application I need linear open
addressing"? (Or vice versa.)
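For the curious, a mapping with linear open addressing can be sketched
in a few lines. This is purely illustrative (no resizing, no deletion,
no tombstones) and is not how CPython's dict actually works:

```python
class LinearProbingDict:
    """Minimal sketch of a hash table with linear open addressing.
    Collisions are resolved by scanning forward to the next free slot."""

    _EMPTY = object()   # sentinel marking an unused slot

    def __init__(self, size=8):
        self._slots = [self._EMPTY] * size

    def _probe(self, key):
        n = len(self._slots)
        i = hash(key) % n
        for _ in range(n):
            slot = self._slots[i]
            if slot is self._EMPTY or slot[0] == key:
                return i
            i = (i + 1) % n          # linear probe: try the next slot
        raise RuntimeError("table full")

    def __setitem__(self, key, value):
        self._slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        slot = self._slots[self._probe(key)]
        if slot is self._EMPTY:
            raise KeyError(key)
        return slot[1]
```

From the client's point of view it answers d[key] exactly like a
chained table would; only the collision strategy differs.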
Yes, that's why we have a reasonable default.
I doubt you've ever said that. I doubt you know anyone who has said it.
I doubt you know *of* anyone who has said it. And even if you have, then
you are free to go and write your own dict type as a C extension type
and call it dict_with_linear_openaddressing and write this:

D = dict_with_linear_openaddressing(data)

instead of:

D = dict(data, implementation = "linear open addressing")

Yet again Lie's proposal gains you nothing. In fact, as an application
writer, you're better off writing your own type rather than waiting for
it to be added to the built-in implementation of dict.
There are times when it'd be too much work to implement it, test it, and
integrate it. There are times when we know that using alternative
implementation could make our program faster, but simply lack enough time
to implement and test it. I was talking about rapid application
development or even the wilder cousin, quick-one-time-disposable scripts.
(I'm not implying that only rapid programming benefits from this, it is
only one example)

Plus, currently it is impossible, without extensive reading of the
whole docs, to search for other data structures that implement the same
abstract type as list/dict/anything. Their documentation is scattered
here and there. It would be helpful if help(list) could list which
implementations are available (no pun intended).

<snip>
Here's an even better way:

class StandardFoo:
    def __init__(self, data):
        self.data = foo(data)
    def __len__(self):
        return len(self.data)

class MagicFoo:
    def __init__(self, data):
        self.data = bar(data)
    def __len__(self):
        return len(self.data[0]) + len(self.data[1:])

class GreenFoo:
    def __init__(self, data):
        self.data = baz(data)
        self._len = len(self.data)
    def __len__(self):
        return self._len
and let the client code call whichever implementation they want.
You missed the last part. Integrating the implementation to current code.
They are the one that makes the difference.

And I wasn't talking about any of those you mentioned.
It's more like this:

class StandardFoo:
    def __init__(self, data):
        self.data = foo(data)
    def __len__(self):
        return len(self.data)

class MagicFoo:
    def __init__(self, data):
        self.data = bar(data)
    def __len__(self):
        return len(self.data[0]) + len(self.data[1:])

class GreenFoo:
    def __init__(self, data):
        self.data = baz(data)
        self._len = len(self.data)
    def __len__(self):
        return self._len

class Foo(AbstractType): pass
Foo.register({'standard': StandardFoo, 'green': GreenFoo,
              'magic': MagicFoo})
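That register idea can be made runnable with a small factory. The
sketch below is one possible reading of the proposal, not an existing
API; every name in it (AbstractType, Foo, the implementation strings)
is illustrative:

```python
class AbstractType:
    """Sketch of the proposed 'register' machinery."""
    _impls = {}

    @classmethod
    def register(cls, mapping):
        # Map implementation names to concrete classes.
        cls._impls = dict(mapping)

    def __new__(cls, data=(), implementation='standard'):
        # Look the concrete class up by name and build that instead.
        try:
            impl = cls._impls[implementation]
        except KeyError:
            raise ValueError('unknown implementation: %r' % (implementation,))
        return impl(data)

# Two stand-in concrete implementations with the same interface.
class StandardFoo(list): pass
class MagicFoo(list): pass

class Foo(AbstractType): pass
Foo.register({'standard': StandardFoo, 'magic': MagicFoo})

f = Foo([1, 2, 3])                          # reasonable default
g = Foo([1, 2, 3], implementation='magic')  # explicit choice
assert type(f) is StandardFoo and type(g) is MagicFoo
```

Client code keeps writing Foo(...) everywhere; only the construction
site names the implementation.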
[snip]
>They're not different data structures from the client point of view.

But of course they are. They have different functional behaviour, and
different APIs.
Unless we artificially cripple the implementations
Then please state the implementation's limitations in the docs. People
who know enough to choose a non-standard implementation should be
knowledgeable enough about the limitations and extensions of the chosen
implementation.
and
make them all identical (which defeats the purpose of having different
implementations!) they are different data structures, and those
differences will be visible to the client code.
Yes, they're different data structure, same Abstract Type.
This scheme is a case of over-generalisation. It's also impractical: who
is going to spend the time and effort to do it? It's hard enough to get
somebody to write a binary tree implementation for the standard library,
without expecting them to *also* write a wrapper for it to make it look
like a dict.
It's hard to write because everyone knows no one is going to use it,
since the implementation would be buried under 60 feet of soil and sand
somewhere in the dark corners of the docs. And when you say "it's hard
to get somebody to write" it, do you realize how many people have, in
desperation, written their own implementation of something because they
couldn't find it in the standard library? And I think the 'register'
way would make writing wrappers unnecessary: as long as you've provided
a compatible data structure, the generic register function would handle
the rest.
It's working against the grain of Python, which is well
suited for the one-type = one-API model instead of the one-type = many-
APIs model which Lie is suggesting. And for what benefit? The only
benefit suggested so far is that we can write:

dict({1: 'a', 2: 'b'}, implementation="binary tree")

instead of

binarytree({1: 'a', 2: 'b'})

If you call that a benefit. I don't.
A real benefit is apparent if we have this kind of snippet in our code
(again, in my habit of oversimplifying: f is handling too many things
for its own good, isinstance is usually evil, and there is a strong
smell of bad design):

def f(listofiterable, someiterable):
    for lst in listofiterable:
        if isinstance(lst, list):
            lst.extend(someiterable)
        elif isinstance(lst, (set, dict)):
            lst.update(someiterable)
        else:
            for n in someiterable:
                lst += n

If you have that kind of code, with many instances of it scattered
around, then when you want to change the data structure you've got to
hunt down all those isinstance calls and change them to the new data
structure. The other solution, shadowing 'list' (i.e. the previous
name), doesn't work if we also want to use the original data structure
in other, unrelated places. With implementation, only the object
creation would need to be searched for, and isinstance(something, list)
would accept the object as an array-list replacement.
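As a side note, modern Python can already dispatch on the abstract type
rather than the concrete class via the ABCs in collections.abc, which
sidesteps part of the isinstance-hunting problem (extend_all is an
invented name; the snippet mirrors the shape of f above):

```python
from collections.abc import MutableMapping, MutableSequence, MutableSet

def extend_all(containers, items):
    # Test against the abstract type, not the concrete class, so any
    # conforming replacement structure passes the same check.
    for c in containers:
        if isinstance(c, MutableSequence):
            c.extend(items)
        elif isinstance(c, (MutableSet, MutableMapping)):
            c.update(items)

lst, st = [1], {1}
extend_all([lst, st], [2, 3])
assert lst == [1, 2, 3]
assert st == {1, 2, 3}
```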

Nov 4 '08 #19

P: n/a
On Nov 4, 8:40 pm, Lie Ryan <lie.1...@gmail.com> wrote:
[snip]
On a linguistic note: as a non-native speaker of English, I've never
relied on the correct usage of Yes and No and instead rely on the text
that follows. In some languages, situations where English speakers
would answer Yes are answered with No, and vice versa.
[snip]
In English the rule tends to be that "yes" and "no" don't mean "I
agree" and "I disagree" like in other languages, but that "yes"
asserts the positive and "no" asserts the negative:

Positive question:

Does it work? Yes, it works.

Does it work? No, it doesn't work.

Negative question:

Doesn't it work? Yes, it does work. ["does" added for
emphasis]

Doesn't it work? No, it doesn't work.
Nov 5 '08 #20

P: n/a
On Wed, Nov 5, 2008 at 10:03 PM, Arnaud Delobelle
<ar*****@googlemail.com> wrote:
Only hashable objects can go in a set. By default a class you define is
not hashable (unless it descends from a hashable class). To remedy this
you can define a __hash__ method in your class. IIRC the only
requirement of a hash function is that two equal objects have the same
hash.
Thanks, now it works.
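For completeness, a minimal hashable class along the lines Arnaud
describes might look like this (the Point class is just an example, not
the original poster's code):

```python
class Point:
    """A user-defined hashable class: __eq__ and __hash__ are defined
    together, so equal points share one hash and can live in sets."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares.
        return hash((self.x, self.y))

s = {Point(1, 2), Point(1, 2), Point(3, 4)}
assert len(s) == 2   # the two equal points collapse into one
```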
Nov 6 '08 #21
