473,388 Members | 1,390 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,388 software developers and data experts.

Better dict of dicts

I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

Here is an example:
>>a = {"string_1": {"string_2":1,
.... "string_3":5,
.... "string_4":10},
.... "string_2": {"string_2":12,
.... "string_6":2,
.... "string_1":4}}

So as my strings get longer and longer, it seems that the dictionary of
dictionary representation is less and less efficient.

My thought was to subclass the dictionary structure....

keys = {"string_1":1,
"string_2":2,
"string_3":3,
"string_4":4,
"string_6":5}

Then the underlying dictionary of dictionaries would look like:

a = {1:{2:1,3:5,4:10},2:{2:12,5:2,1:4}}

Somehow I need to intercept every possible call though....such that

a["string_1"]["string_2"] actually calls a[1][2]

and

a.keys() returns ["string_1", "string_2", "string_3"....]
rather than [1,2,3,4,5]

etc.

Ideally, I would like the option to have different key hashes for the
rows and columns as well.

Any ideas?
Apr 19 '07 #1
12 1799
On Apr 19, 5:24 pm, Bill Jackson <jack...@hotmail.comwrote:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory.
I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.

Apr 19 '07 #2
Adam Atlas wrote:
On Apr 19, 5:24 pm, Bill Jackson <jack...@hotmail.comwrote:
>I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory.

I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.
Apr 19 '07 #3
Bill Jackson schrieb:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary.
What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?
The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.

Here is an example:
>>>a = {"string_1": {"string_2":1,
... "string_3":5,
... "string_4":10},
... "string_2": {"string_2":12,
... "string_6":2,
... "string_1":4}}

So as my strings get longer and longer, it seems that the dictionary of
dictionary representation is less and less efficient.
You seem to assume that every occurrence of "string_2" has separate
memory. That may or may not be the case, depending on where the strings
come from. For example, it may be that the keys are the very same strings:

s1 = "string_1"
s2 = "string_2"

a[s1][s2] = 1
a[s2][s1] = 4

Then, the memory for the string would exist only once.
My thought was to subclass the dictionary structure....

keys = {"string_1":1,
"string_2":2,
"string_3":3,
"string_4":4,
"string_6":5}
Instead of doing that, I would use a procedure called "interning".
You may want to use the builtin intern function, or your can
come up with your own interning:

interns = {}
def local_intern(s):
return interns.setdefault(s, s)

Then, instead of

a[k1][k2] = value

do

a[local_intern(k1)][local_intern(k2)] = value

then all strings occur only once as keys, as long as the interns
dictionary isn't cleared.
Any ideas?
HTH,
Martin
Apr 19 '07 #4
John Bauman wrote:
Adam Atlas wrote:
>On Apr 19, 5:24 pm, Bill Jackson <jack...@hotmail.comwrote:
>>I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory.

I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.
Maybe this will help:

http://wiki.python.org/moin/DictionaryKeys

-Larry
Apr 19 '07 #5
On Thursday 19 April 2007, Bill Jackson wrote:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.
I think you may want to look into that rarely used function "intern"
(under "on-essential Built-in Functions").

Basically, Python keeps a cache of certain strings are are frequently used so
comparisons and dictionary lookups only require a pointer comparison. You
could then subclass dict (though using "DictMixin" could be better) like:

class IDict(DictMixin):
def __setitem__(self, key, value):
key=intern(key)
self.__dict[key]=value

That's totally untested and incomplete, but you hopefully get the idea.

Python (or at least CPython) seems to auto intern some strings occasionally
(you could look at the source if you care about the exact rules).
Example:
>>a="1234567890"
b="1234567890"
a is b
True

So you don't have all that much to worry about.

Apr 19 '07 #6
Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
Bill Jackson schrieb:
>I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary.

What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?
Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.
Instead of doing that, I would use a procedure called "interning".
You may want to use the builtin intern function, or your can
come up with your own interning:

interns = {}
def local_intern(s):
return interns.setdefault(s, s)

Then, instead of

a[k1][k2] = value

do

a[local_intern(k1)][local_intern(k2)] = value

then all strings occur only once as keys, as long as the interns
dictionary isn't cleared.
So, my structure is something like this:

a = {tuple_1: {tuple_2:value_1, tuple_3:value_2},
tuple_4: {tuple_1:value_3, tuple_3:value_4}
tuple_5: {tuple_2:value_5, tuple_3:value_6, tuple_4:value_7}}

Since the tuples from the inner dictionaries and the outer dictionaries
are frequently the same, I would benefit from using a single intern
function. Then, the tuples will always be "pointing" to the values
stored in the intern dictionary.

Now suppose there is little overlap between the keys for the outer
dictionary and the inner dictionaries...but still much overlap between
the various inner dictionaries. Then, there is no point in using an
intern function for the outer dictionary, but still a benefit for the
inner dictionary. Thus, something like the following would be appropriate:

a[k1][local_intern(k2)] = value

Have I understood this properly?
Apr 19 '07 #7
On Thu, 19 Apr 2007 17:40:27 -0400, John Bauman wrote:
Adam Atlas wrote:
>On Apr 19, 5:24 pm, Bill Jackson <jack...@hotmail.comwrote:
>>I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory.

I wouldn't worry about it. Try doing hash('string_2') in the
interpreter -- the output thereof is what's really being used as the
key. It doesn't use up any more memory than the integer 2.
Are you sure about that? Most dictionaries need to store the actual key,
in case of a collision, so when you lookup a key they can tell which
you're really looking for.

The key is already stored, so long as the string 'string_2' exists. And it
will continue to exist so long as the dictionary includes it as a key. An
extra copy isn't made (unless you make an extra copy yourself).

In other words, the dictionary stores a reference to the string, not a
copy of it:
>>s = "this is a long string"
d = {s: 1}
id(s)
-1209073840
>>id(d.keys()[0])
-1209073840
--
Steven.

Apr 19 '07 #8
On Thursday 19 April 2007, Bill Jackson wrote:
Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
Bill Jackson schrieb:
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary.
What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?

Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.
That'll change things a bit because intern only works with strings.

Of course, It's not so big a deal, but you will have to put together a class
to implement interning.
I wrote one for fun:

class ITuple(tuple):
_interns={}
def __new__(cls, tup):
if tup not in cls._interns:
itup=tuple.__new__(cls, tup)
cls._interns[tup]=itup
return cls._interns[tup]
def __init__(self, *args):
#Prevent multiple calls to __init__
if hasattr(self, "_inited"): return
tuple.__init__(self, *args)
self._inited=True
def __eq__(self, o):
#If the other is an ITuple, self==o iff self is o
if isinstance(o, ITuple):
return self is o
return tuple.__eq__(self, o)

>>t1=(1,2,3,4); t2=(1,2,3,4)
ti1=ITuple(t1); ti2=ITuple(t2)
print t1==t2, t1 is t2
True False
>>print ti1==ti2, ti1 is ti2
True True

That seems to work. Something to note is that the call overhead of the __eq__
function is large enough that unless you have a slow comparing tuple,
comparisons will be faster without it. Comparisons are fast if they are done
internally; so between builtin objects or identical (" is ") objects.

For an example, suppose you have:

class TTT(object):
def __eq__(self, o): return True
a,b=TTT(),TTT()

Then the follow comparisons are fast:
(1,2,3)==(1,2,3)
(1,2,3,a)==(1,2,3,a)
(0,0,0,a)==(1,2,3,b)

The following are slow:
(1,2,3,a)==(1,2,3,b)

Note that the only slow case is the one where a.__eq__(b) is called. However,
a.__eq__(b) is assumed True is "a is b" is True. So chances are you'll want
to comment out the __eq__ function.
Apr 20 '07 #9
On Apr 20, 1:38 am, DillonCo <dillo...@comcast.netwrote:
On Thursday 19 April 2007, Bill Jackson wrote:
Martin v. Löwis wrote the following on 04/19/2007 02:43 PM:
Bill Jackson schrieb:
>I have a dictionary of dictionaries where the keys are typically very
>long tuples and repeated in each inner dictionary.
What I don't understand here: you say the keys are tuples, yet later,
you show that the keys are strings. Which one is it?
Sorry, I was just lazy. The keys will always be tuples...tuple of
strings, tuples of numbers, tuples of objects....simply tuples.

That'll change things a bit because intern only works with strings.

Of course, It's not so big a deal, but you will have to put together a class
to implement interning.
I wrote one for fun:

class ITuple(tuple):
_interns={}
def __new__(cls, tup):
if tup not in cls._interns:
itup=tuple.__new__(cls, tup)
cls._interns[tup]=itup
return cls._interns[tup]
def __init__(self, *args):
#Prevent multiple calls to __init__
if hasattr(self, "_inited"): return
tuple.__init__(self, *args)
self._inited=True
def __eq__(self, o):
#If the other is an ITuple, self==o iff self is o
if isinstance(o, ITuple):
return self is o
return tuple.__eq__(self, o)
>t1=(1,2,3,4); t2=(1,2,3,4)
ti1=ITuple(t1); ti2=ITuple(t2)
print t1==t2, t1 is t2
True False
>print ti1==ti2, ti1 is ti2

True True

That seems to work. Something to note is that the call overhead of the __eq__
function is large enough that unless you have a slow comparing tuple,
comparisons will be faster without it. Comparisons are fast if they are done
internally; so between builtin objects or identical (" is ") objects.

For an example, suppose you have:

class TTT(object):
def __eq__(self, o): return True
a,b=TTT(),TTT()

Then the follow comparisons are fast:
(1,2,3)==(1,2,3)
(1,2,3,a)==(1,2,3,a)
(0,0,0,a)==(1,2,3,b)

The following are slow:
(1,2,3,a)==(1,2,3,b)

Note that the only slow case is the one where a.__eq__(b) is called. However,
a.__eq__(b) is assumed True is "a is b" is True. So chances are you'll want
to comment out the __eq__ function.
Hi DillonCo,
Martins earlier local_intern function would work for tuples as well as
strings.

- Paddy.

Apr 20 '07 #10
On Thursday 19 April 2007, Paddy wrote:
Martins earlier local_intern function would work for tuples as well as
strings.
It certainly would. I had written that class, though, primarily to offer a
performance improvement in the __eq__ and perhaps __hash__ methods. However,
I ended up being rather surprised by just how much overhead there was in
calling the Python methods vs. the builtin ones.

So really mine only ends up being useful if the tuple consists of a couple
(i.e. 2+) of objects with Python __eq__ methods. Oh well.

As an interesting aside:
>>class A(object):
.... def __eq__(self, o):
.... return False
....
>>a=A()
a==a
False
>>t=(a,)
t==t
True

Apparently the tuple's __eq__ assumes: "a is b" ="a==b"
Bug? Or am I creating evil corner cases ;)?


Apr 20 '07 #11
Now suppose there is little overlap between the keys for the outer
dictionary and the inner dictionaries...but still much overlap between
the various inner dictionaries. Then, there is no point in using an
intern function for the outer dictionary, but still a benefit for the
inner dictionary. Thus, something like the following would be appropriate:

a[k1][local_intern(k2)] = value

Have I understood this properly?
Correct. Notice that this thing is a time-space-tradeoff resolved in
favor of space - quite uncommon, as people these days always chose to
resolve them in favor of time. Interning a tuple will take some time,
as it is a dictionary lookup (requiring to hash the tuple); then
you will pass the interned tuple again for another dictionary lookup.

So make sure that you don't apply local_intern() twice to the very
same tuple object. E.g. when you also store the tuple in a variable,
don't do

var = (..., ..., ...)
a[foo][local_intern(var)] = foovar
print a[foo][local_intern(var)]

instead do

var = local_intern((..., ..., ...))
a[foo][var] = foovar
print a[foo][var]

Strictly speaking, you can omit the interning on read operation,
as that won't affect the keys stored in the dictionary. However,
performing the lookup with the interned key gives you speed
back, as then dictionary lookup won't need to compare the tuples,
but will find that the search key is identical to the stored
key, so the keys are certainly equal.

Regards,
Martin
Apr 20 '07 #12
Bill Jackson a écrit :
I have a dictionary of dictionaries where the keys are typically very
long tuples and repeated in each inner dictionary. The dictionary
representation is nice because it handles sparseness well...and it is
nice to be able to look up values based on a string rather than a
number. However, since my keys are quite long, I worry that I am
wasting a lot of memory. I'm looking for better data structures.
Is this actually a *real* problem (or do you have evidences - based on
both measurements of the behaviour with test data sets and knowledge of
the real data sets- that there will be a problem) ? Or this this just
"worrying" ? In the second case, I suggest that you bench and profile
your code to know for sure...
Apr 21 '07 #13

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

17
by: Pierre Fortin | last post by:
Hi, Is the following a reasonable generic approach for converting python returned tuples/lists into dicts...? I'm not advocating library functions also return dicts (I'd probably spend more...
8
by: bearophileHUGS | last post by:
I'm frequently using Py2.4 sets, I find them quite useful, and I like them, even if they seem a little slower than dicts. Sets also need the same memory of dicts (can they be made to use less...
6
by: David Rasmussen | last post by:
If I have a collection of dicts like: john = {'id': 1, 'name': "John Cleese", 'year': 1939} graham = {'id': 2, 'name': "Graham Chapman", 'year': 1941} I could store all of them in a list. But...
5
by: Alexander Kozlovsky | last post by:
Hello all! I have small silly syntax suggestion (SSSS) In many cases, keys in dictionary-like objects are strings, and moreover - valid Python identifiers. Something like: foo.x = y How...
15
by: George Sakkis | last post by:
Although I consider dict(**kwds) as one of the few unfortunate design choices in python since it prevents the future addition of useful keyword arguments (e.g a default value or an orderby...
3
by: aking | last post by:
Dear Python people, im a newbie to python and here...so hello! Im trying to iterate through values in a dictionary so i can find the closest value and then extract the key for that...
13
by: vd12005 | last post by:
Hello, i would like to sort(ed) and reverse(d) the result of many huge dictionaries (a single dictionary will contain ~ 150000 entries). Keys are words, values are count (integer). i'm...
4
by: bearophileHUGS | last post by:
I have started doing practice creating C extensions for CPython, so here are two ideas I have had, possibly useless. If you keep adding elements to a CPython dict/set, it periodically rebuilds...
7
by: Benjamin | last post by:
How would I go about "flattening" a dict with many nested dicts within? The dicts might look like this: {"mays" : {"eggs" : "spam"}, "jam" : {"soda" : {"love" : "dump"}}, "lamba" : 23 } I'd...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.