By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
448,617 Members | 1,605 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 448,617 IT Pros & Developers. It's quick & easy.

Py2.3: Feedback on Sets

P: n/a
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.

* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?

* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?

* Is there a compelling need for additional set methods like
Set.powerset() and Set.isdisjoint(s) or are the current
offerings sufficient?

* Does the performance meet your expectations?

* Do you care that sets can only contain hashable elements?

* How about the design constraint that the argument to most
set methods must be another Set (as opposed to any iterable)?

* Are the docs clear? Can you suggest improvements?

* Are sets helpful in your daily work or does the need arise
only rarely?
User feedback is essential to determining the future direction
of sets (whether it will be implemented in C, change API,
and/or be given supporting language syntax).
Raymond Hettinger


Jul 18 '05 #1
Share this Question
Share on Google+
21 Replies


P: n/a
In article <Jp***************@nwrdny03.gnilink.net>, Raymond Hettinger wrote:
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.
I would have to say that while I have looked at the sets module and read its
documentation, I have not used it much (more below).
* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?
I actually prefer the | & syntax because it makes more intuitive sense to
me, as the operators work identically to how they do when using them as
binary number operators.

[snip]
* Do you care that sets can only contain hashable elements?
No.

[snip]
* Are the docs clear? Can you suggest improvements?
The docs for the sets module, like most of the Python docs, are very good.
The example helps, too.
* Are sets helpful in your daily work or does the need arise
only rarely?
I rarely need sets explicitly, but sometimes need the logic that they offer.
For example, when wanting to concantenate two lists so that the resulting list
only has unique elements, that can be done with set logic. However, as it
is another module written in Python, and I only need the logic I usually don't
go through with the overhead of creating the Set classes, etc. If it was
better integrated (ie., a set() builtin type constructor like int(), str(),
etc) then I would feel less of a reservation against using sets. I know this
reservation isn't founded in facts, but more in the feeling of trusting
builtin types more than custom classes provided by a module.
User feedback is essential to determining the future direction
of sets (whether it will be implemented in C, change API,
and/or be given supporting language syntax).


Don't know if my feedback provided above helps, but you asked for it ;-)

Regards,

Troels Therkelsen
Jul 18 '05 #2

P: n/a
Raymond Hettinger wrote:
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.

* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?
I slightly favor | and &.

* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?

* Is there a compelling need for additional set methods like
Set.powerset() and Set.isdisjoint(s) or are the current
offerings sufficient?
I imagine isdisjoint would be useful, although it's easy enough to use
bool(s&t).

* Does the performance meet your expectations?

* Do you care that sets can only contain hashable elements?

* How about the design constraint that the argument to most
set methods must be another Set (as opposed to any iterable)?

* Are the docs clear? Can you suggest improvements?
Yeah: in the library reference, the table entry for s.union(t) should
say "synonym of s|t" instead of repeating the description. This is
especially true because it's not clear from a simple glance whether
"s.union(t)" goes with "s|t" or "s&t", because it sits right between
the two. Better yet, I would change it to a three-column table
(operation, synonym, result).

* Are sets helpful in your daily work or does the need arise
only rarely?


I haven't used sets yet (or Python 2.3), but I expect to use them a
lot. However, I imagine my typical use would be efficient testing for
membership. I have maybe half a dozen places where I use a dictionary
for that now.

--
CARL BANKS
"You don't run Microsoft Windows. Microsoft Windows runs you."
Jul 18 '05 #3

P: n/a
Raymond Hettinger wrote:

First of all, thanks for the work on it, I need to use sets
in my work all the time. I had written my own
(simplistic) implementation but that adds another layer
of headaches when distributing programs since then
I have to distribute multiple modules.

Sometimes I ended up with a little set function in every
big module. Pretty silly. For me sets are a greatly useful
addition.
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?
One pattern that I constantly need is to remove duplicates from
a sequence. I don't know if this an often enough used pattern to
warrant an API change, for me it would be most useful if I could
get the contents of a set as a sequence right away, without having to
explicitly code it.
* Are you overjoyed/outraged by the choice of | and & as
set operators (instead of + and *)?
I think that since you have have - as a difference operator it
would make sense to also have + as a union operator. Takes nothing
away from |. The & operator is the right one, * would not be appropriate
IMO.
* Do you care that sets can only contain hashable elements?
I don't really care, on the other hand, it might be better to call the
class HashSet, so that it conveys right away that it uses hashing
to store the elements.
* Are the docs clear? Can you suggest improvements?
I wondered whether it would be better to specify the immutability
of the class at the constructor level.

Then there is the update method. It feels a little bit redundant
since there is an add() method that seems to be doing the same thing
only that add() adds only one element at a time.
Would it be possible to have add() handle all additions, iterable or
not, then scrap update() altogether.

Then just by looking at the docs, it feels a little bit confusing to
have discard() and remove() do essentially the same thing but only one
of them raising an exception. Which one? I already forgot. I don't know
which one I would prefer though.

Another aspect that I did not understand, what is difference between
update() and union_update().

The long winded method names, such as difference_update() also feel
redundant when one can achieve the same thing with the -= operator. I
would drop these and instead show in the docs how to accomplish these
with the operators. Would considerably cut down on the documentation,
and apparent complexity.

I'm a big fan of having the minimal number of methods as long it is
easy to obtain the result.

For example methods like x.issubset(y) is the same as bool(x-y) so may
not be all that necessary, just a thought.
* Are sets helpful in your daily work or does the need arise
only rarely?


I use them very often and they are extremely useful.

thanks again,

Istvan.

Jul 18 '05 #4

P: n/a
Raymond Hettinger <vz******@verizon.net> wrote:
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.

* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?
I would prefer to have + in addition to |. I do not care
about *, but IMHO + is so intuitive and natural that
it is a pity not to have it (you can add lists, tuples,
strings with +, but not sets???)

* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?
So far my code used dictionaries with values set to None,
I expect that I will use sets soon, it will be more logical
and the code more readable.

* Are sets helpful in your daily work or does the need arise
only rarely?


daily work, production code, but so far I used lists or
dictionaries instead.
I am somewhat afraid of performance, until setrs are implemented
in C, though

--
-----------------------------------------------------------
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Jul 18 '05 #5

P: n/a
In article <bh*************@ID-89407.news.uni-berlin.de>,
ga******************@kassiopeia.juls.savba.sk (Radovan Garabik) wrote:
Raymond Hettinger <vz******@verizon.net> wrote:
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?


So far my code used dictionaries with values set to None,
I expect that I will use sets soon, it will be more logical
and the code more readable.


Same here.

I don't rely on sets heavily (I do have a few implemented as
dictionaries with value=None) and am not yet ready to make my users
upgrade to Python 2.3.

I suspect the upgrade issue will significantly slow the incorporation of
sets and the other new modules, but that over time they're likely to
become quite popular. I am certainly looking forward to using sets and
csv.

I think it'd speed the adoption of new modules if they were explicitly
written to be compatible with one previous generation of Python (and
documented as such) so users could manually include them with their code
until the current generation of Python had a bit more time to be adopted.

I'm not saying this should be a rule, only suggesting it as a useful
goal. Presumably it'd be easy with some modules and not worth the work
in some cases.

-- Russell
Jul 18 '05 #6

P: n/a
Raymond -

Well now that you ask ...

Raymond Hettinger wrote:
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.

* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?
I think that choice appeals to me more than + and * (which are already
more overloaded than I would like). I haven't seen any suggestions that
I liked better.
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?
Yes I need it (desperately). Generally works as I need. However, see
more comments below.
* Is there a compelling need for additional set methods like
Set.powerset() and Set.isdisjoint(s) or are the current
offerings sufficient?
I haven't felt the need yet. So far I've been satisfied with:

if x & y:

as opposed to

if x.isdisjoint(y)
* Does the performance meet your expectations?
So far. However, so far, I haven't been trying meet any demanding
performance requirements.
* Do you care that sets can only contain hashable elements?
This is an interesting question. In particular, I have found myself on
more than one occasion doing the following:

for x in interesting_objects:
x.foo = Set()
while something_to_do:
somex.foo |= something_I_just_computed
for x in interesting_objects:
x.foo = ImmutableSet(x.foo)
build_some_more_sets(somex.foo)

I'm not sure whether I like having to go back and change all my sets to
Immutable ones after I've finished the computation. (Or whether I just
ought to make x.foo immutable all the time.) I did appreciate the
ImmutableSet type, since it allows me to flag to myself that I don't
expect a set to change further.
* How about the design constraint that the argument to most
set methods must be another Set (as opposed to any iterable)?
In some cases I've run into that. Since I can create a set with any
iterable I've been able to do:

set op Set(iterable)

I think I might be interested in using a general iterable if that would
get me some advantage (maybe significantly faster).
* Are the docs clear? Can you suggest improvements?
No problems here. However, my background is math, and I've never had
problems with documentation (I started my career learning IBM mainframe
assembly language programming from the reference manuals) so I don't
think I'm a good test case.
* Are sets helpful in your daily work or does the need arise
only rarely?
I'm working on a project where they are critical. If it hadn't been
supplied I would have implemented one myself. I was using the backported
version of the set module with 2.2 before 2.3 came out.
User feedback is essential to determining the future direction
of sets (whether it will be implemented in C, change API,
and/or be given supporting language syntax).
Raymond Hettinger


Chris

-----= Posted via Newsfeeds.Com, Uncensored Usenet News =-----
http://www.newsfeeds.com - The #1 Newsgroup Service in the World!
-----== Over 80,000 Newsgroups - 16 Different Servers! =-----
Jul 18 '05 #7

P: n/a

Russell> I suspect the upgrade issue will significantly slow the
Russell> incorporation of sets and the other new modules, but that over
Russell> time they're likely to become quite popular. I am certainly
Russell> looking forward to using sets and csv.

The csv module (and the _csv module which underpins it) should work with
2.2.3. If they don't, please file a bug report.

Russell> I think it'd speed the adoption of new modules if they were
Russell> explicitly written to be compatible with one previous
Russell> generation of Python (and documented as such) so users could
Russell> manually include them with their code until the current
Russell> generation of Python had a bit more time to be adopted.

That was the intention with the csv module. I wonder if some limitations to
use of sets with 2.2.x could be gotten around by adding a __future__ import?
Maybe itertools is also needed.

Russell> I'm not saying this should be a rule, only suggesting it as a
Russell> useful goal. Presumably it'd be easy with some modules and not
Russell> worth the work in some cases.

Yes, that's a worthwhile goal.

Skip

Jul 18 '05 #8

P: n/a
Gary Feldman:
Also, I'd like to see "iterable must be <some type spec>",
though this is a general flaw in the Python doc and is perhaps
biased by my C/C++ background where you'd never dream
of doing a reference manual without explicitly indicating the
types of every parameter.
Python uses what is sometimes called "duck typing" (meaning,
if it quacks like a duck...). Lots of objects are iterable - strings,
lists, sets, dict (keys), and user-defined classes. Since you
prefer C++, think of Python more akin to templates. Templates
expect the objects templated on to have certain properties (can
be "+"ed, can be deferenced, has a method named "xyz") and
not that they have given types.
Personally, I have hard time imagining where I'd want
[remove]. If I really cared, I could check beforehand, so I think
I'd just always use discard.
I'm the other way around. I find it hard to imagine where
I would call discard. If I want to remove an element from a
set then I want to know right away if that element isn't there.
It's been handy for tracking down bugs in my code.
5.12.2
engineering_management = engineers & programmers
Actually, I don't like that example because there is too
much text to read through to see the actual symbols used.
PS I suppose I should mention my strongest pet peeve
with the Python documentation, which is the practice of
putting the member functions on a different page than
the class overview. But that's not your issue, either.


And I confess that I like to see everything on one page
and not split up between several pages. That way I can
use my browser's search facility.

Andrew
da***@dalkescientific.com
Jul 18 '05 #9

P: n/a

"Skip Montanaro" <sk**@pobox.com> wrote in message
news:ma**********************************@python.o rg...

Russell> I suspect the upgrade issue will significantly slow the
Russell> incorporation of sets and the other new modules, but that over
Russell> time they're likely to become quite popular. I am certainly
Russell> looking forward to using sets and csv.

The csv module (and the _csv module which underpins it) should work with
2.2.3. If they don't, please file a bug report.

Russell> I think it'd speed the adoption of new modules if they were
Russell> explicitly written to be compatible with one previous
Russell> generation of Python (and documented as such) so users could
Russell> manually include them with their code until the current
Russell> generation of Python had a bit more time to be adopted.

That was the intention with the csv module. I wonder if some limitations to
use of sets with 2.2.x could be gotten around by adding a __future__ import?
Maybe itertools is also needed.


In the documentation for the itertools module, I intensionally included
pure python versions of each tool that make backporting easy. You
can cut and paste the documentation into a module with
from __future__ import generators and have a Py2.2 version of
itertools that would enable the sets module to run just fine.

Still, why not upgrade to Py2.3? The bug fixes were all ported to 2.2.3
and into Py2.3 so that the essential differences are the new modules
and some minor language improvements.
Raymond Hettinger
Jul 18 '05 #10

P: n/a
"Raymond Hettinger" <vz******@verizon.net> writes:
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.

* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?
I'd actually rather sets didn't overload any operators at all, but
appreciate that this may be a minority position.

| and & is the only sane choice, however.
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?
I don't use them as much as I should, I suspect.
* Is there a compelling need for additional set methods like
Set.powerset() and Set.isdisjoint(s) or are the current
offerings sufficient?
I've not reached for something and not found it there yet.
* Does the performance meet your expectations?
My uses so far have not had even the faintest of performance demands,
so, yes.
* Do you care that sets can only contain hashable elements?


Not yet.

Cheers,
mwh

--
The website looks like the Hi-Score sheet from a Bullshit Bingo
tournament. -- Dan Holdsworth, asr
Jul 18 '05 #11

P: n/a
After giving blanket approval to the docs I now add:

I have a mission to set some new guidelines for Python documentation.
Perhaps this is a good place to start.
Example - currently we have:

class Set( [iterable])
Constructs a new empty Set object. If the optional iterable parameter is
supplied, updates the set with elements obtained from iteration. All of the
elements in iterable should be immutable or be transformable to an
immutable using the protocol described in section
<http://www.python.org/doc/current/lib/immutable-transforms.html#immutable-transforms>5.12.3.
Problems:
The result of Set appears to be an empty Set object. The fact that it might
be filled is hidden in the parameter description.
The parameter description itself is hidden in the paragraph, making it
harder to find, especially when the reader is in a hurry.

Some suggested guidelines to improve readability and understandability:
1 - label each paragraph so we know what it is about
2 - have a function paragraph that briefly but completely describes the
function
3 - have labeled sections for things that can be so grouped (e.g. parameters)
4 - start the description of each thing in a new paragraph.

Example:

class Set( [iterable])
function: Constructs a new empty Set object and optionally fills it.
parameters:
iterable [optional] if supplied, updates the set with elements
obtained from
iteration. All of the elements in iterable should be immutable or be
transformable to an immutable using the protocol described in
section
<http://www.python.org/doc/current/lib/immutable-transforms.html#immutable-transforms>5.12.3.
What do you think? If this layout is appealing, let's use the set docs as a
starting point to model this approach. I for one am willing to apply this
model to the rest ot the set docs, and help update other docs, but not all
of them.

BTW I also have a problem with the term "Common uses". "Common" suggests
that these are better, or more frequent. I suggest "Some examples of
application of sets".

I also agree with the suggestion that operations that are synonymous be so
indicated in the table.

Bob Gailer
bg*****@alum.rpi.edu
303 442 2625

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.506 / Virus Database: 303 - Release Date: 8/1/2003

Jul 18 '05 #12

P: n/a
In article <ma**********************************@python.org >,
Skip Montanaro <sk**@pobox.com> wrote:
Russell> I suspect the upgrade issue will significantly slow the
Russell> incorporation of sets and the other new modules, but that over
Russell> time they're likely to become quite popular. I am certainly
Russell> looking forward to using sets and csv.

The csv module (and the _csv module which underpins it) should work with
2.2.3. If they don't, please file a bug report.
That's excellent news. It might be worth adding it to the documentation,
e.g. "new in version 2.3 but compatible with version 2.2.x" (surely x is
1 (with True/False) or 0 (without), or was there really some needed
feature change in 2.2.3?).
That was the intention with the csv module. I wonder if some limitations to
use of sets with 2.2.x could be gotten around by adding a __future__ import?
Maybe itertools is also needed.


That is an interesting question. Mind you, I have no idea if sets is
compatible with 2.2.x or not; I didn't try since it wasn't documented
and I didn't want to risk missing some obscure bug.

-- Russell
Jul 18 '05 #13

P: n/a
In article <bh**********@slb9.atl.mindspring.net>,
"Andrew Dalke" <ad****@mindspring.com> wrote:
I read some mention of using "|" instead of "+", so I knew
to use it. I would have liked +, but not *. I know the logic
for thinking * but & doesn't have the other connotations
* has (like [1] * 2, "a"*9)
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?


After years of using Python without sets, I hand built a specialized
intersection a couple of months ago. Knowing the Sets module was
coming, I did only what I needed at that moment, and didn't bother
optimizing it (it takes a few seconds to do what I need...removing a
second or two isn't useful). (I worked around a "need" for difference
by changing the input generation in the overall problem.)

So..."necessary" is too strong here, but "a good thing" is certainly
apt. If I only get to choose yes or no for "necessary" the answer is
"yes".

--John

--
Email to above address discarded by provider's server. Don't bother sending.
Jul 18 '05 #14

P: n/a
"Istvan Albert"
One pattern that I constantly need is to remove duplicates from
a sequence. I don't know if this an often enough used pattern to
warrant an API change, for me it would be most useful if I could
get the contents of a set as a sequence right away, without having to
explicitly code it.
list(Set('abracadbra'))
['a', 'r', 'b', 'c', 'd']
* Are the docs clear? Can you suggest improvements?


I wondered whether it would be better to specify the immutability
of the class at the constructor level.


ImmutableSet is available as a constructor.

Then there is the update method. It feels a little bit redundant
since there is an add() method that seems to be doing the same thing
only that add() adds only one element at a time.
Would it be possible to have add() handle all additions, iterable or
not, then scrap update() altogether.
Not really.
Set.update() is for vectorizing high volume additions.
There is some analogy to list.append() vs. list.extend().

Then just by looking at the docs, it feels a little bit confusing to

have discard() and remove() do essentially the same thing but only one
of them raising an exception. Which one? I already forgot. I don't know
which one I would prefer though.


Will clarify the docs.

Another aspect that I did not understand, what is difference between
update() and union_update().
update() works with any iterable and union_update() only with another Set.
If the API is liberized to allow any iterable for most operations, then
the distinction will vanish.
The long winded method names, such as difference_update() also feel
redundant when one can achieve the same thing with the -= operator. I
would drop these and instead show in the docs how to accomplish these
with the operators. Would considerably cut down on the documentation,
and apparent complexity.
That is a good thought; however,
some find a.union(b) to be more readable than a|b
and some find that a.symmetric_difference is more memorable than a^b.
For example methods like x.issubset(y) is the same as bool(x-y) so may
not be all that necessary, just a thought.


Granted. However:

* issubset has an early out algorithm and consumes contant memory.
In contrast, bool(x-y) builds a whole new set and then throws it away.
* issubset and issuperset are somewhat basic set operations
* Are sets helpful in your daily work or does the need arise
only rarely?


I use them very often and they are extremely useful.


Me too.
Raymond Hettinger
Jul 18 '05 #15

P: n/a

"Raymond Hettinger" <vz******@verizon.net> wrote in message
news:3b*****************@nwrdny02.gnilink.net...
"Istvan Albert"
Then just by looking at the docs, it feels a little bit
confusing to have discard() and remove() do essentially the same thing but only one of them raising an exception. Which one? I already forgot. I don't know which one I would prefer though.


I agree that this is confusing -- like having both str.find and
str.index. I would prefer one delete function with an optional param
'silent' to switch its 'not there' response from the default (either
True or False, according to what seems to be the more common usage) to
the other choice. (I know, I should have read draft more carefully
and commented last fall -- but this seems like the sort of redundancy
that Guido wants to remove in 3.0.)

Terry J. Reedy
Jul 18 '05 #16

P: n/a
Raymond Hettinger wrote:
Subject: Py2.3: Feedback on Sets * Do you care that sets can only contain hashable elements?
This is the only disadvantage for me.

For the rest, I am happy about it. I am already using it a lot
on places where I used lists before, but where a Set is much
better (no order, no duplicates, it really *is* a set)
User feedback is essential to determining the future direction
of sets (whether it will be implemented in C, change API,
and/or be given supporting language syntax).


I really like them. I would also like to be able to do
{elem for elem in set if foo(elem)} to construct a subset.

Gerrit.

--
255. If he sublet the man's yoke of oxen or steal the seed-corn,
planting nothing in the field, he shall be convicted, and for each one
hundred gan he shall pay sixty gur of corn.
-- 1780 BC, Hammurabi, Code of Law
--
Asperger Syndroom - een persoonlijke benadering:
http://people.nl.linux.org/~gerrit/
Het zijn tijden om je zelf met politiek te bemoeien:
http://www.sp.nl/

Jul 18 '05 #17

P: n/a
"Russell E. Owen"
I don't rely on sets heavily (I do have a few implemented as
dictionaries with value=None) and am not yet ready to make my users
upgrade to Python 2.3.

I suspect the upgrade issue will significantly slow the incorporation of
sets and the other new modules, but that over time they're likely to
become quite popular. I am certainly looking forward to using sets and
csv.

I think it'd speed the adoption of new modules if they were explicitly
written to be compatible with one previous generation of Python (and
documented as such) so users could manually include them with their code
until the current generation of Python had a bit more time to be adopted.


Wish granted!

The sets module now will run under Py2.2.
It should be available for download from CVS after 24 hours:
http://cvs.sourceforge.net/cgi-bin/v...src/Lib/sets.p
y
Raymond Hettinger
Jul 18 '05 #18

P: n/a
"Gary Feldman"
* Are the docs clear? Can you suggest improvements?


I haven't used them yet, but since I'm working my way through
the docs in general, I thought I'd check them out and comment.


All of the issues you found have been fixed (except for the discussion of
what an iterable parameter means -- that will be addressed elsewhere).
Raymond Hettinger
Jul 18 '05 #19

P: n/a
"John Smith"
Suggestion: How about adding Set.isProperSubset() and
Set.isProperSuperset()?
We have them in operator form: a<b a>b
Spelling them out did not seem to add much value.
This is doubly true because some people read it
as s.isProperSubsetOf(t) and others read it as
s.hasTheProperSubset(t).
Raymond Hettinger
Thanks for this wonderful module. I've been working on data mining and
machine
learning area using Python. Set operations are very important to me.


Great. You'll love it even more when I implement it in C.

Raymond Hettinger
Jul 18 '05 #20

P: n/a
On Tue, 12 Aug 2003 06:02:17 GMT, rumours say that "Raymond Hettinger"
<vz******@verizon.net> might have written:

[replying only to those that I have something substantial to say]
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?
I have used sets in:
- Unix sysadm tasks (comparing usernames between passwd and shadow,
finding common files in sync requests et al)
- a hangman game (when the computer guesses words, to continuously
restrict the possibilities based on the human input)
- an image recognition program (comparing haar coefficients)

These come to mind at the moment, but I have used them even in the
python command line; and mostly I care about intersections.
* Does the performance meet your expectations?
In the game and image recognition programs I could use more power;-)
* Are sets helpful in your daily work or does the need arise
only rarely?
I use them often, it's a very helpful construct.
User feedback is essential to determining the future direction
of sets (whether it will be implemented in C, change API,
and/or be given supporting language syntax).


Reimplementation in C sounds appropriate, and supporting language syntax
would be nice.
A quick thought, in the spirit of C implementation: there are cases
where I would like to get the intersection of dicts (based on the keys),
without having to create sets from the dict keys and then getting the
relevant values. That is, given dicts a and b, I'd like:
a & b # imaginary
to mean
dict([x, a[x] for x in sets.Set(a) & sets.Set(b)]) # real


You may notice that a&b wouldn't be equivalent to b&a.
Perhaps the speed difference would not be much; I'll grow a function in
dictobject.c, run some benchmarks and come back with results for you.

Another thought: it is unfortunate that an intersection *has* to be
through continuous lookups (talking about the ordering of dict keys re
their hash values, I'll have to delve into dictobject.c it seems), even
taking into account the great speed of key lookups... although building
the result dict should account for more processing cycles than the
comparisons; and in some cases doing a dict.copy() and then removing the
uncommon elements would be faster. Hm, food for thought, and no more
than two hours to sleep now.

Another slogan: Python keeps your mind awake (and c.l.py keeps your body
away from bed :)
--
TZOTZIOY, I speak England very best,
Microsoft Security Alert: the Matrix began as open source.
Jul 18 '05 #21

P: n/a
On Wed, 20 Aug 2003 06:10:19 +0300, rumours say that Christos "TZOTZIOY"
Georgiou <tz**@sil-tec.gr> might have written:
A quick thought, in the spirit of C implementation: there are cases
where I would like to get the intersection of dicts (based on the keys),
without having to create sets from the dict keys and then getting the
relevant values. That is, given dicts a and b, I'd like:
a & b # imaginary
to mean
dict([x, a[x] for x in sets.Set(a) & sets.Set(b)]) # real
You may notice that a&b wouldn't be equivalent to b&a.
Perhaps the speed difference would not be much; I'll grow a function in
dictobject.c, run some benchmarks and come back with results for you.


I implemented dict.intersect(), and it is *quite* faster than the
equivalent Python code.

************************************************** ********************

Python 2.4a0 (#3, Aug 20 2003, 16:31:22)
[GCC 3.2 (Mandrake Linux 9.0 3.2-1mdk)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
help(dict.intersect) Help on method_descriptor:

intersect(...)
D.intersect(E) -> a subset of D having common keys with E
import sets
odds = dict(zip("abcdefghijklmn", range(1, 55, 2)))
evens= dict(zip("asdfghj", range(2, 55, 2)))

odds {'a': 1, 'c': 5, 'b': 3, 'e': 9, 'd': 7, 'g': 13, 'f': 11, 'i': 17, 'h':
15, 'k': 21, 'j': 19, 'm': 25, 'l': 23, 'n': 27} evens {'a': 2, 'd': 6, 'g': 10, 'f': 8, 'h': 12, 'j': 14, 's': 4}

dict([(k, odds[k]) for k in sets.Set(odds) & sets.Set(evens)]) {'a': 1, 'd': 7, 'g': 13, 'f': 11, 'h': 15, 'j': 19} odds.intersect(evens) {'a': 1, 'h': 15, 'j': 19, 'd': 7, 'g': 13, 'f': 11} dict([(k, evens[k]) for k in sets.Set(odds) & sets.Set(evens)]) {'a': 2, 'd': 6, 'g': 10, 'f': 8, 'h': 12, 'j': 14} evens.intersect(odds) {'a': 2, 'h': 12, 'j': 14, 'd': 6, 'g': 10, 'f': 8}

my_setup= 'import sets; odds=dict(zip("abcdefghijklmn", range(1, 55, 2))); evens=dict(zip("asdfghj", range(2, 55, 2)))'
from timeit import Timer

Timer(stmt="odds.intersect(evens)", setup=my_setup).repeat() [1.3545670509338379, 1.3367550373077393, 1.3366960287094116] Timer(stmt="evens.intersect(odds)", setup=my_setup).repeat() [1.321492075920105, 1.2869999408721924, 1.320341944694519] Timer(stmt="dict([(k, odds[k]) for k in sets.Set(odds) & sets.Set(evens)])", setup=my_setup).repeat() [63.413245916366577, 63.526772975921631, 63.503224968910217] Timer(stmt="dict([(k, evens[k]) for k in sets.Set(odds) & sets.Set(evens)])", setup=my_setup).repeat()

[63.498296976089478, 63.49311900138855, 63.425426959991455]

************************************************** ********************

A substantial difference, over 50x on an Athlon XP 1700. Also note the
difference in the key order of the results.

I believe that dicts should grow such a method, perhaps with another
name.

Attached is the diff -u for dictobject.c compared to the one in last
night's python-latest.tgz
--
TZOTZIOY, I speak England very best,
Microsoft Security Alert: the Matrix began as open source.
Jul 18 '05 #22

This discussion thread is closed

Replies have been disabled for this discussion.