473,320 Members | 2,112 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

set, dict and other structures

I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts. Sets also need the
same memory of dicts (can they be made to use less memory, not storing
values? Maybe this requires too much code rewriting).
I presume such sets are like this because they are kind of dicts. If
this is true, then converting a dict to a set (that means converting
just the keys; this is often useful for a successive processing with
set operators like intersection, etc) can be a very quick operation;
but this little code shows me that this isn't true, set-ify a dict is
even slower than doing it on a list. Can this operation made faster (by
people that writes Python interpreter)?

..from time import clock
..for p in xrange(16, 20):
.. n = 2**p
.. l = range(n)
.. t = clock()
.. set(l)
.. print n, round(clock()-t,3),
.. d = {}.fromkeys(xrange(n))
.. t = clock()
.. set(d)
.. print round(clock()-t,3)

==>
65536 0.082 0.1
131072 0.205 0.218
262144 0.45 0.562
524288 1.014 1.029

----------------------

Dicts and sets are two different data structures, but they share some
similarities. So instead of always converting dicts to sets, maybe some
fast set-like operations/methods can be added to dicts, like
intersection, mass removal, etc. (The result of *some* of such
operations between two dicts can be a regular set, and not a dict. This
is maybe not very "clear" from an formal/theoretical point of view).

-------------

I've studied the nice sets python module from Py2.3, and I am...
well... writing something similar, a (python) graph module for sparse
graphs. I've found 3 modules like that around on the Net, but I think
they can be improved (even if I'm not an expert of Python, and I know,
my code is probably quite rough/naive compared to "sets" module one...)
Graphs are complex data structures, so probably one person needs from a
graph data structure are different from other people needs, so it's
probably not easy to define a "standard" graph module fit for everyone.
Still, I think it can be useful to have a standard module to manage
them.
Do you think it be useful to add a (well made) standard module like
that?

Bear hugs,
Bearophile

Jul 18 '05 #1
8 2126
be************@lycos.com wrote:
I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts.


They look exactly the same speed-wise to me:
t1 = Timer('randrange(100) in foo', 'from random import randrange; foo = set(xrange(1000))') t2 = Timer('randrange(100) in foo', 'from random import randrange; foo = dict.fromkeys(xrange(1000))') t1.timeit() 3.0573790073394775 t2.timeit() 3.064924955368042 t2.timeit()

3.0590860843658447
Jul 18 '05 #2
<be************@lycos.com>
I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts.
Glad to hear that you find them to be a useful abstraction.

Since sets are internally implemented as dicts, they should have almost
identical performance (net of a small difference for the time to forward the
calls).

Sets also need the
same memory of dicts (can they be made to use less memory, not storing
values? Maybe this requires too much code rewriting).
Yes, an alternate implementation could be written that would take less memory
(by about 1/3) and be a bit faster. The approach would be to copy relevant
sections of the dictionary implementation and then remove anything pertaining to
dict values. If the need arises and I find the time, this will probably happen
someday.

In the meantime, the current implementation is rock solid and gives great
performance. Have you encountered situations where a 1/3 memory savings and a
slight speed improvement would have made a critical difference in a real
application?
I presume such sets are like this because they are kind of dicts. If
this is true, then converting a dict to a set (that means converting
just the keys; this is often useful for a successive processing with
set operators like intersection, etc) can be a very quick operation;
but this little code shows me that this isn't true, set-ify a dict is
even slower than doing it on a list. Can this operation made faster (by
people that writes Python interpreter)?
Dicts take longer than lists because dict iteration takes longer than for lists.

If set-ifying a dictionary turns out to be an essential and time-critical
operation, it is possible to add some special case code for this. The question
is whether it is worth it. If you find the need is pressing, file a feature
request explaining why it is essential. Then, I'll implement it for you. On
the other hand, if this is a theoretical nice-to-have, then why clutter up the
code (with pre-sizing and forcing all values to True).
I've studied the nice sets python module from Py2.3, and I am...
well... writing something similar, a (python) graph module for sparse
graphs. I've found 3 modules like that around on the Net, but I think
they can be improved (even if I'm not an expert of Python, and I know,
my code is probably quite rough/naive compared to "sets" module one...)
Graphs are complex data structures, so probably one person needs from a
graph data structure are different from other people needs, so it's
probably not easy to define a "standard" graph module fit for everyone.
Still, I think it can be useful to have a standard module to manage
them.
Do you think it be useful to add a (well made) standard module like
that?


Start by posting a module outside the standard library; gathering user feedback;
and, if popular, propose it for inclusion in the standard library.

My guess is that there will be two issues. One is that no one implementation of
graphs seems to satisfy all users. The second is that only a small fraction of
Python users need for graph support (there is probably a much greater demand for
combinatorics, numeric/numarray, or financial modules). If the appeal is not
broad, it has little chance of making it into the standard library.
Raymond Hettinger
Jul 18 '05 #3
You are really gentle Raymond Hettinger, but surely I'm not asking
you/anyone to write some code just for me; I don't have "real
applications", I'm just learning/playing.
Your words are quite good answers to most of my questions.
The only left small topic is about the group/set-like
operations/methods added to dicts; maybe some other person can comment
that idea too. The semantic of such operations can be a little tricky,
but sometimes they can avoid the use of sets (and the set-ify of
dicts), sometimes they can carry the dict values with them (even if
they work just on the dict keys), etc.

A bear hug,
Bearophile

Jul 18 '05 #4
Raymond Hettinger wrote:
If set-ifying a dictionary turns out to be an essential and
time-critical operation, it is possible to add some special case code
for this. The question is whether it is worth it. If you find the
need is pressing, file a feature request explaining why it is
essential. Then, I'll implement it for you. On the other hand, if
this is a theoretical nice-to-have, then why clutter up the code
(with pre-sizing and forcing all values to True).

Just today I was writing some code where I wanted to use sets for the
abstraction (intersection, etc.), but also carry some values with them to
process. So, yes, I believe that having set-like abstraction for dictionaries
would be great. In fact, for a moment I wondered if dict.keys() was already of
type set in 2.4, because it could have made sense.

My concrete situation was a routine that was updating a current snapshot of
data (stored in a dictionary) with a new snapshot of data (another dictionary).
With 'updating' I mean some kind of algorithm where you have to do different
things for:

- items that were present in the current snapshot but not in the new snapshot
anymore (obsolete items)
- items that were not present in the current snapshot but are presente in the
new snapshot (new items)
- items that were present in both the current and the new snapshot (possibly
modified items)

So, this was a perfect suit for sets. Given my dictionaries d1 and d2, I would
have liked to be able to do:

for k,v in (d1 - d2).iteritems():
...

for k,v in (d2 - d1).iteritems():
...

for k,v in (d1 & d2).iteritems(): # uhm, not fully correct
...

Of course, there are some nitpicks. For instance, in the last case the
dictionary at the intersection should probably hold a tuple of two values, one
coming from each dictionary (since the intersection finds the items whose key
is present in both dictionaries, but the value can obviously be different). So
you'd probably need something:

for k,(v1,v2) in (d1 & d2).iteritems():
...

Another solution would be to say that set-like operations, like (d1 & d2),
return an iterator to a sequence of keys *only*, in this case the sequence of
keys available in both dictionaries.

I also tried a different approach, that is putting my data in some structure
that was suitable for a set (with __hash__ and __key__ methods that were caring
of the key part only):

class FakeForSet:
def __init__(self, k, v):
self.key = key
self.value = value
def __hash__(self):
return hash(self.key)
def __cmp__(self, other):
return cmp(self.key, other.key)

but this looked immediatly weird because I was lying to Python somehow. It then
became clear that it's not going to work, because when you go through (s1 & s2)
you do not know if the items you get are coming from s1 or s2, and that is
important because the value member is going to be different (even if you're
lying to Python saying that those items are equal anyway).

I know of course there are other ways of doing the same thing, like:

# intersection
for k in d1:
if k not in d2:
continue
v1, v2 = d1[k], d2[k]
...

But I don't think there is anything wrong in providing an abstraction for this,
especially since we already decided that set-like abstractions are useful.

So, FWIW, I would find set-like operations on dictionaries an useful addition
to Python. Besides, I guess they can be easily experimented and studied by
subclassing dict.
--
Giovanni Bajo
Jul 18 '05 #5
On Tue, 01 Feb 2005 00:43:21 +0000, Raymond Hettinger wrote:
My guess is that there will be two issues. One is that no one
implementation of graphs seems to satisfy all users. The second is that
only a small fraction of Python users need for graph support (there is
probably a much greater demand for combinatorics, numeric/numarray, or
financial modules). If the appeal is not broad, it has little chance of
making it into the standard library.


This reminded me of a group of people that were supposed to have started
in on this:

http://www.python.org/moin/PythonGraphApi

It's too soon to call it dead, but I don't see a success pattern for it;
I'd say they foundered on trying to shoot for the moon in advance, instead
of starting to write code and incrementally improving it. (Hint, hint, if
any of you are reading this.)

Trying to create a graph library that meets any significant percentage of
the needs of the people who would use it is quite non-trivial. It's not
impossible, but it's a lot lot lot lot lot of work. I remember counting at
least 48 distinct types of graphs when I enumerated my concerns to that
project and I'm sure there are a few more dimensions that didn't make it
into that list (the Wiki talks about "hierachial" graphs which adds
another dimension I didn't count, for instance, and I'm sure there are
more...).
Jul 18 '05 #6
Leif K-Brooks:
They look exactly the same speed-wise to me:
There's a little overhead, you can see it with a bigger test:

..from time import clock
..n = 1*10**6
..
..t1 = clock()
..d = set()
..for i in xrange(n):
.. d.add(i)
..t2 = clock()
..for i in xrange(n):
.. d.remove(i)
..t3 = clock()
..d = set(xrange(n))
..t4 = clock()
..print round(t2-t1,2), round(t3-t2,2), round(t4-t3,2), round(t4-t1,2)
..
..t1 = clock()
..d = {}
..for i in xrange(n):
.. d[i] = None
..t2 = clock()
..for i in xrange(n):
.. del d[i]
..t3 = clock()
..d = {}.fromkeys(xrange(n))
..t4 = clock()
..print round(t2-t1,2), round(t3-t2,2), round(t4-t3,2), round(t4-t1,2)

----------------------

Jeremy Bowers:
I don't see a success pattern for it; I'd say they foundered on trying to shoot for the moon in advance, instead of starting to write code and
incrementally improving it.<

I agree, their ideas seem too much abstract and big (and the resulting
code looks sloow); I am aiming to something quite simpler, like the
"sets" python module of Py2.3 (but the after the python version
development & testing, a C version can be useful, just like the set
objects in Py2.4).
My (simple) graph module is already nearly done ^_^
Trying to create a graph library that meets any significant percentage of
the needs of the people who would use it is quite non-trivial.<

This is right. But:
1) I've found some implementations of graphs:
- One inside Gato: http://www.zpr.uni-koeln.de/~gato/About/
- http://pygraphlib.sourceforge.net/dist/
- http://pygraphlib.sourceforge.net/do...ib-module.html
- http://www.ics.uci.edu/~eppstein/PADS/
Some modules from Julien Burdy can be found, etc. I think I can find 5
different graph python modules, this number tells me that there's some
people that need/want them.
2) A graph data structure written in C probably cannot be the right one
for every person, because there are so many kinds of graphs... but if
it's fast and general enough and it's already there (and simple
enough!), then most people will start using it, adjusting their needs
to it...

----------------------

Giovanni Bajo:
In fact, for a moment I wondered if dict.keys() was already of type set in 2.4, because it could have made sense.<

Well, I presume sometimes you need a list, so maybe something like
dict.keysset() can be a better idea.

Of course, there are some nitpicks.<
Yes.

For instance, in the last case the dictionary at the intersection should probably hold a tuple of two values, one coming from each
dictionary (since the intersection finds the items whose key is present
in both dictionaries, but the value can obviously be different).<

There are other possibilities.

Another solution would be to say that set-like operations, like (d1 &

d2),
return an iterator to a sequence of keys *only*, in this case the
sequence of
keys available in both dictionaries.<

Right ^_^

Then let's see something:

a = {}.fromkeys(range(10),0)
b = {}.fromkeys(range(5,13),1)

1) a - b ==> dict
Its meaning can be something like:
r = {}
for k in set(a).difference(b): r[k] = a[k]

For functional programming people, it can even mean something like:
r = {}
scan(curry(r.__setitem__(Blank, a[k])), set(a).difference(b))
(Unlike map, scan does not build up a new expression to return.)
2) a + b ==> dict
Its meaning:
r = a.copy()
r.update(b)
3) The intersection is a bit tricky. I don't like the idea of resulting
a dict with tuple values, with both the values of the shared keys. The
other idea is nicer:
a.intersection(b)
Its possible meaning:
3a) set(a).intersection(b)
Alternative meaning:
3b) list(set(a).intersection(b))
(Instead of adding this intersection to dicts, it can just be made a
quite faster dict=>set conversion to make set(a).intersection(b) fast.)
4) Another possibility is less common, but easy to implement, so this
is optional:
a.xor(b)
Its meaning:
r = {}
for x in a: if x not in b: r[x] = r[a]
for x in b: if x not in a: r[x] = r[b]

(Probably there are better ways to implement them, here I've just used
python as a kind of pseudocode).

Bear hugs,
Bearophile

Jul 18 '05 #7
r = {}
for x in a: if x not in b: r[x] = r[a]
for x in b: if x not in a: r[x] = r[b]

I know, this is wrong :-]
This looks better:
r = {}
for x in a: if x not in b: r[x] = a[x]
for x in b: if x not in a: r[x] = b[x]

Bearophile

Jul 18 '05 #8
"Giovanni Bajo" <no***@sorry.com> writes:
Just today I was writing some code where I wanted to use sets for
the abstraction (intersection, etc.), but also carry some values
with them to process. So, yes, I believe that having set-like
abstraction for dictionaries would be great. In fact, for a moment I
wondered if dict.keys() was already of type set in 2.4, because it
could have made sense.


Which sets you need depends on each case, so a generic set-ified
extension would be both limited and potentially confusing. For
example, a set of (key,value) tuples is the simplest and unambiguous.
On the other hand, if the values differ you really just need a set of
keys and then you can do the value lookup and collision handling
yourself using the result set as keys to both of your dictionaries.
You could subclass dict to provide aliases such as:

dict.itemset ()
dict.keyset ()
dict.valueset ()

for the following if these look overly bloated to you:

set (dict.iteritems ())
set (dict.iterkeys ())
set (dict.itervalues ())

For example:

py> for k,v in set (d1.iteritems ()) - set (d2.iteritems ()):
py> print k, v

looks quite concise and self-explanatory to me already. A _notable_
performance gain would probably justify native set-ifying getters,
especially if there indeed were implementation-level data structures
to share.
br,
S
Jul 18 '05 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
by: Robin Cull | last post by:
Imagine I have a dict looking something like this: myDict = {"key 1": , "key 2": , "key 3": , "key 4": } That is, a set of keys which have a variable length list of associated values after...
1
by: Alexander Kervero | last post by:
Hi ,today i was reading diveinto python book,in chapter 5 it has a very generic module to get file information,html,mp3s ,etc. The code of the example is here :...
2
by: Alex | last post by:
Entering the following in the Python shell yields >>> help(dict.copy) Help on method_descriptor: copy(...) D.copy() -> a shallow copy of D >>>
0
by: barnesc | last post by:
Nifty, Tim Peters responded to my e-mail. I must've said something interesting. Cool, a PyCelebrity! > >> ... >> I've gotten bored and went back to one of my other projects: >>...
3
by: Bengt Richter | last post by:
Has anyone found a way besides not deriving from dict? Shouldn't there be a way? TIA (need this for what I hope is an improvement on the Larosa/Foord OrderedDict ;-) I guess I can just document...
3
by: Peter Beattie | last post by:
I was wondering whether certain data structures in Python, e.g. dict, might have limits as to the amount of memory they're allowed to take up. Is there any documentation on that? Why am I...
15
by: George Sakkis | last post by:
Although I consider dict(**kwds) as one of the few unfortunate design choices in python since it prevents the future addition of useful keyword arguments (e.g a default value or an orderby...
8
by: Almad | last post by:
Hello, I discovered this behaviour in dictionary which I find confusing. In SneakyLang, I've tried to extend dictionary so it visits another class after something is added: class...
20
by: Seongsu Lee | last post by:
Hi, I have a dictionary with million keys. Each value in the dictionary has a list with up to thousand integers. Follow is a simple example with 5 keys. dict = {1: , 2: , 900000: , 900001:...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
0
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.