By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
440,621 Members | 1,074 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 440,621 IT Pros & Developers. It's quick & easy.

Pickling limitation with instances defining __cmp__/__hash__?

P: n/a
I've come across a limitation in unpickling certain types of complex
data structures which involve instances that override __hash__, and was
wondering if it was known (basic searches didn't seem to come up with
anything similar) and if there is a workaround for it short of
restructuring the data structures in question.

The fundamental issue rests with defining classes which override __cmp__
and __hash__ in order to be used as keys in dictionaries (and elements
of sets). __cmp__ and __hash__ are defined to manipulate a single
attribute of the class, which never changes for the lifetime of an
object. In a simplified form:

class C:

def __init__(self, x):
self.x = x

def __cmp__(self, other):
return cmp(self.x, other.x)

def __hash__(self):
return hash(self.x)

Even if C contains other members which are manipulated, making it
technically mutable, since the one attribute (in this example, x) which
is used for __cmp__ and __hash__ is never changed after the creation of
the object, it is legal to use as a dictionary key. (Formally, the
atrribute in question is a name which is guaranteed to be unique.)

The difficulty arises when the data structures that are built up in C
contain a circular reference to itself as a dictionary key. In my
particular case the situation is rather involved, but the simplest
example which reproduces the problem (using C) would be:

c = C(1)
c.m = {c: '1'}

So far this is fine and behaves as expected. Pickling the object c
results in no problems. Unpickling it, however, results in an error:

data = pickle.dumps(c)
d = pickle.loads(data) # line 25

Traceback (most recent call last):
File "/home/max/tmp/hash.py", line 25, in ?
d = pickle.loads(data)
File "/usr/local/lib/python2.4/pickle.py", line 1394, in loads
return Unpickler(file).load()
File "/usr/local/lib/python2.4/pickle.py", line 872, in load
dispatch[key](self)
File "/usr/local/lib/python2.4/pickle.py", line 1218, in load_setitem
dict[key] = value
File "/home/max/tmp/hash.py", line 15, in __hash__
return hash(self.x)
AttributeError: C instance has no attribute 'x'

By poking around, one can see that the error is occurring because the
unpickler algorithm is trying to use the instance as a key in a
dictionary before the instance has been completely initialized (in fact,
the __dict__ of this object is the empty dictionary!).

The error happens regardless of whether pickle or cPickle is used (so I
used pickle to give a more meaningful traceback above), nor whether the
protocol is 0 or HIGHEST_PROTOCOL.

Is this issue known? I don't see any mention of this kind of
circularity in the Python Library Reference 3.14.4. Second, is there
any reasonably straightforward workaround to this limitation, short of
reworking things so that these self-referenced objects aren't used as
dictionary keys?

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
You'll learn / Life is worth it / Watch the tables turn
-- TLC
Jul 19 '05 #1
Share this Question
Share on Google+
1 Reply


P: n/a
Erik Max Francis wrote:
I've come across a limitation in unpickling certain types of complex
data structures which involve instances that override __hash__, and was
wondering if it was known (basic searches didn't seem to come up with
anything similar) and if there is a workaround for it short of
restructuring the data structures in question.


Replying to my own (old) post here, I finally got back to this and found
the best solution was to define surrogate set and dictionary classes
that internally used the IDs as keys, eliminating the circular
dependency. Examples of SafeSet and SafeDict serving this purpose
follow, though note that I only defined the methods that I used, rather
than the full and complete interfaces for sets and dictionaries (though
it should serve as an example for those who need to do more):

class SafeSet(_ReprMixin):

@staticmethod
def ider(thing):
return thing.id

def __init__(self, ider=None):
if ider is not None:
self.ider = ider
self._map = {} # id -> thing

def __len__(self):
return len(self._map)

def __contains__(self, thing):
return self.ider(thing) in self._map

def add(self, thing):
key = self.ider(thing)
if self._map.has_key(key):
assert self._map[key] is thing
self._map[key] = thing

def remove(self, thing):
del self._map[self.ider(thing)]

def pop(self):
iterator = self._map.iterkeys()
next = iterator.next()
return self._map.pop(next)

def clear(self):
self._map.clear()

def copy(self):
return copy.copy(self)

def update(self, sequence):
for thing in sequence:
self.add(thing)

def difference(self, other):
thisSet = set(self._map.iterkeys())
otherSet = set(other._map.iterkeys())
newSet = thisSet.difference(otherSet)
safeSet = SafeSet()
for key in newSet:
safeSet.add(self._map[key])
return safeSet

def __iter__(self):
return self._map.itervalues()

def __str__(self):
return 'set(' + str(self._map.keys()) + ')'
class SafeDict(_ReprMixin):

@staticmethod
def ider(thing):
return thing.id

def __init__(self, ider=None):
if ider is not None:
self.ider = ider
self._keys = {} # id -> key
self._values = {} # id -> value

def __len__(self):
return len(self._keys)

def __contains__(self, thing):
return self.ider(thing) in self._keys

def __getitem__(self, thing):
return self._values[self.ider(thing)]

def __setitem__(self, thing, value):
key = self.ider(thing)
self._keys[key] = thing
self._values[key] = value

def __delitem__(self, thing, value):
key = self.ider(thing)
del self._keys[key]
del self._values[key]

def keys(self):
return self._keys.values()

def iterkeys(self):
return self._keys.itervalues()

def values(self):
return self._values.values()

def itervalues(self):
return self._values.itervalues()

def items(self):
return [(self._keys[x], self._values[x]) for x in self._keys]

def iteritems(self):
return ((self._keys[x], self._values[x]) for x in self._keys)

def clear(self):
self._keys.clear()
self._values.clear()

def copy(self):
return copy.copy(self)

def update(self, mapping):
for key, value in mapping.iteritems():
self[key] = value

def has_key(self, thing):
return self._keys.has_key(self.ider(thing))

def get(self, thing, default=None):
return self._values.get(self.ider(thing), default)

def setdefault(self, thing, default):
key = self.ider(thing)
if key in self._keys:
return self._values[key]
else:
self._keys[key] = thing
self._values[key] = default

def __iter__(self):
return self._keys.itervalues()

def __str__(self):
return str(self._values)

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
The only completely consistent people are the dead.
-- Aldous Huxley
Aug 9 '05 #2

This discussion thread is closed

Replies have been disabled for this discussion.