By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
448,677 Members | 1,236 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 448,677 IT Pros & Developers. It's quick & easy.

How to test if one dict is subset of another?

P: n/a
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()
where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

Any suggestions? Constraint : the answer has to work for both python
2.2 and 2.3 (and preferably all pythons after that).

JT

Feb 19 '07 #1
Share this Question
Share on Google+
15 Replies


P: n/a
Jay Tee schrieb:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()
where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:
This would still need to run over all items in jobs. No gain.
>
j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).
If you're jobs dictionary is immutable regarding the key-set (not from
it's implementation, but from its usage), the thing you can do to
enhance performance is to create an index. Take a predicate like

def p(j):
return j.get('user') == 'jeff'

and build a list

jeffs_jobs = [j for j in jobs if p(j)]

Then you can test only over these. Alternatively, if you have quite a
few of such predicate/action-pairs, try and loop once over all jobs,
applynig the predicates and actions accordingly.

Diez
Feb 19 '07 #2

P: n/a
Jay Tee wrote:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:

if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()
where subset_attr would see if the dict passed in was a subset of the
underlying attribute dict of j:

j1's dict : { 'user' : 'jeff', 'start' : 43, 'queue' : 'qlong',
'state' : 'running' }
j2's dict : { 'user' : 'jeff', 'start' : 57, 'queue' : 'qlong',
'state' : 'queued' }

so in the second snippet, if j was j1 then subset_attr would return
true, for j2 the answer would be false (because of the 'state' value
not being the same).

Any suggestions? Constraint : the answer has to work for both python
2.2 and 2.3 (and preferably all pythons after that).
Use a RDBMS (a database), they tend to be good at this kind of operations.

Peter

Feb 19 '07 #3

P: n/a
On Feb 19, 11:07 am, Peter Otten <__pete...@web.dewrote:
Use a RDBMS (a database), they tend to be good at this kind of operations.
yeah, one of the options is metakit ... sqlite and buzhug both looked
promising but the constraint of pythons 2.2 and 2.3 ruled that out.
disadvantage of metakit is that it's not pure python, meaning possible
integration problems. the system has to be deployed at 200+ sites
worldwide on a mix of RHEL 3 and 4 based systems, with some Debian
clusters thrown in, and running real production ...

hence my desire to find a pure-python solution if at all possible.
it's looking grim.

JT

Feb 19 '07 #4

P: n/a
Jay Tee wrote:
On Feb 19, 11:07 am, Peter Otten <__pete...@web.dewrote:
>Use a RDBMS (a database), they tend to be good at this kind of
operations.

yeah, one of the options is metakit ... sqlite and buzhug both looked
promising but the constraint of pythons 2.2 and 2.3 ruled that out.
disadvantage of metakit is that it's not pure python, meaning possible
integration problems. the system has to be deployed at 200+ sites
worldwide on a mix of RHEL 3 and 4 based systems, with some Debian
clusters thrown in, and running real production ...

hence my desire to find a pure-python solution if at all possible.
it's looking grim.
The following may speed up things a bit if you have enough memory and your
data is queried more often than changed.

import sys

def generate_id():
for i in xrange(sys.maxint):
yield i
raise ImplementationRestriction

get_id = generate_id().next

class Record(dict):
def __init__(self, *args, **kw):
dict.__init__(self, *args, **kw)
assert not hasattr(self, "_id")
self._id = get_id()
def __setitem__(self, key, value):
raise ImmutableException
def __hash__(self):
return self._id
def __str__(self):
items = self.items()
items.sort()
return ", ".join(["%s: %s" % p for p in items])

records = dict.fromkeys([
Record(user="jack", start=42, state="running"),
Record(user="jack", start=47, state="running"),
Record(user="jack", start=46, state="queued"),
Record(user="jane", start=42, state="running"),
Record(user="jane", start=7),
])

def fill_cache(records):
cache = {}
for record in records:
for p in record.iteritems():
cache.setdefault(p, {})[record] = None
return cache

_cache = fill_cache(records)

def select(*data, **kw):
[data] = data
result = data
for p in kw.iteritems():
try:
c = _cache[p]
except KeyError:
c = {}
if not c:
return {}
result = dict.fromkeys([k for k in result if k in c])
if not result:
break
return result

if __name__ == "__main__":
for filter in [
dict(user="jack"),
dict(state="running"),
dict(user="jack", state="running"),
dict(state="undefined"),
dict(user="jane", state="queued")
]:
print "--- %s ---" % Record(filter)
result = select(records, **filter)
if result:
for record in result:
print record
else:
print "#no matching records"
print

The above code runs with Python 2.3; don't know about 2.2 as I don't have it
on my machine. Of course with sets it would look better...

Peter
Feb 19 '07 #5

P: n/a
"Jay Tee" <je**********@gmail.comwrote:
Hi,

I have some code that does, essentially, the following:

- gather information on tens of thousands of items (in this case, jobs
running on a
compute cluster)
- store the information as a list (one per job) of Job items
(essentially wrapped
dictionaries mapping attribute names to values)

and then does some computations on the data. One of the things the
code needs to do, very often, is troll through the list and find jobs
of a certain class:

for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()

This operation is ultimately the limiting factor in the performance.
What I would like to try, if it is possible, is instead do something
like this:
When you are gathering the data and building the lists - why do you not
simultaneously build a dict of running jobs keyed by the Jeffs?

- Hendrik

Feb 20 '07 #6

P: n/a
"Jay Tee" <je**********@gmail.comwrites:
for j in jobs:
if (j.get('user') == 'jeff' and j.get('state')=='running') :
do_something()
Sounds like you need some backing data structures, like indexes
in a database, e.g. (untested, uses the cool new defaultdicts of 2.5):

index = defaultdict(lambda: defaultdict(set))
for j in jobs:
for k in j's dict:
index[k][j.get(k)].add(j)

Now for
if j.subset_attr({'user' : 'jeff', 'state' : 'running'}) :
do_something()
you'd just write:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()
Feb 20 '07 #7

P: n/a
On Feb 20, 9:12 am, Paul Rubin <http://phr...@NOSPAM.invalidwrote:
do_something()

you'd just write:

for j in (index['user']['jeff'] & index['state']['running']):
do_something()
Hi,

it looks very cool, except that one of the constraints mentioned is
that the solution has to work properly on pythons 2.2 and 2.3.
production systems, certified OS releases, that sort of stuff ...
can't just upgrade a few thousand machines worldwide for my little
program ;-)

Thanks ... I took the previous poster's suggestion (had also been
suggested earlier) and built cached versions of the job lists, indexed
on common queries, while building the master job list.

JT

Feb 20 '07 #8

P: n/a
"Jay Tee" <je**********@gmail.comwrites:
it looks very cool, except that one of the constraints mentioned is
that the solution has to work properly on pythons 2.2 and 2.3.
That thing doesn't really deeply depend on defaultdict, it's just
convenient. You can add a few more lines of code in the preprocessing
step to get around it.
Thanks ... I took the previous poster's suggestion (had also been
suggested earlier) and built cached versions of the job lists, indexed
on common queries, while building the master job list.
You can also cache queries with the scheme I suggested, and it lets
you make the cached lists for new queries quickly.
Feb 20 '07 #9

P: n/a
Hi

your post had the following construct:
for j in (index['user']['jeff'] & index['state']['running']):
do_something()
but

Python 2.3.4 (#1, Oct 11 2006, 06:18:43)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>l1= [3, 4, 7, 2]
l2 = [2, 3]
l2 = [2, 3, 99]
l1 & l2
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for &: 'list' and 'list'

what am I missing?

Thanks

JT

Feb 20 '07 #10

P: n/a
"Jay Tee" <je**********@gmail.comwrites:
>l1= [3, 4, 7, 2]
l2 = [2, 3]
l2 = [2, 3, 99]
l1 & l2
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for &: 'list' and 'list'

what am I missing?
They are sets, not lists.

from sets import Set as set # use in 2.3 and earlier

l1= set([3, 4, 7, 2])
l2 = set([2, 3])
l2 = set([2, 3, 99])
print l1 & l2
Feb 20 '07 #11

P: n/a
Vishal Bhargava wrote:
Is there an inbuilt library in Python which you can use to convert time in
seconds to hh:mm:ss format?
Thanks,
Vishal
Please don't ask a question by editing a reply to an existing thread:
your question is now filed on many people's computers under "How to test
if one dict us a subset of another".

Look at the strftime() function in the time module.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007
Feb 20 '07 #12

P: n/a
Vishal Bhargava wrote:
Is there an inbuilt library in Python which you can use to convert time in
seconds to hh:mm:ss format?
Thanks,
Vishal
Please don't ask a question by editing a reply to an existing thread:
your question is now filed on many people's computers under "How to test
if one dict us a subset of another".

Look at the strftime() function in the time module.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Feb 20 '07 #13

P: n/a
On Feb 20, 6:44 pm, Paul Rubin <http://phr...@NOSPAM.invalidwrote:
They are sets, not lists.

from sets import Set as set # use in 2.3 and earlier

l1= set([3, 4, 7, 2])
l2 = set([2, 3])
l2 = set([2, 3, 99])
print l1 & l2
Thanks Paul, but:

bosui:~python
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-20)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>from sets import Set as set
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ImportError: No module named sets

Feb 20 '07 #14

P: n/a
"Jay Tee" <je**********@gmail.comwrites:
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
ImportError: No module named sets
Hmm, well I think the sets module in 2.3 is written in Python, so you
could drop it into your application for use in 2.2. Better would be
to use the C version from 2.4 if you can. Or you could fake it with
dicts. Sets are really just dicts under the skin. Instead of
set([1,2,3]) you'd use {1:True, 2:True, 3:True}. To intersect
two of them (untested):

def intersection(d1,d2):
if len(d2) < len(d1):
# swap them so d1 is the smaller one
d1,d2 = d2,d1
r = {}
for k in d1.iterkeys():
if k in d2:
r[k] = True
return r

The d1,d2 swap is just an optimization, since access is constant-time
you want to scan the smaller dictionary to minimize the number of
lookups.
Feb 20 '07 #15

P: n/a
Hi,

thanks! the code lift from 2.3 to 2.2 worked (thank Guido et al for
BACKWARDS COMPATIBILITY ;-)) ... unfortunately I was in a hurry to get
the release out since a colleague's cluster was croaking under the
load of the old, non-indexed version. Your solution is nicer looking
than mine, and leads to a less complex implementation. Rolling it
into the next release is going to have to wait a few weeks, I hope I
can remember it that long. Thanks!!

JT

On Feb 20, 8:54 pm, Paul Rubin <http://phr...@NOSPAM.invalidwrote:
"Jay Tee" <jeff.temp...@gmail.comwrites:
Python 2.2.3 (#1, Oct 26 2003, 11:49:53)
ImportError: No module named sets

Hmm, well I think the sets module in 2.3 is written in Python, so you
could drop it into your application for use in 2.2. Better would be
to use the C version from 2.4 if you can. Or you could fake it with
dicts. Sets are really just dicts under the skin. Instead of
set([1,2,3]) you'd use {1:True, 2:True, 3:True}. To intersect
two of them (untested):

def intersection(d1,d2):
if len(d2) < len(d1):
# swap them so d1 is the smaller one
d1,d2 = d2,d1
r = {}
for k in d1.iterkeys():
if k in d2:
r[k] = True
return r

The d1,d2 swap is just an optimization, since access is constant-time
you want to scan the smaller dictionary to minimize the number of
lookups.

Feb 20 '07 #16

This discussion thread is closed

Replies have been disabled for this discussion.