472,126 Members | 1,554 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,126 software developers and data experts.

a question on python dict

Python dict is a hash table, isn't it?

I know that hashtable has the concept of "bucket size" and "min bucket
count" stuff,
and they should be configurable so I can set them to the proper value
when I know how many items I'm going to handle.

If these two values can't be set, the hashtable will give them default
values. When there are more and more items being added to the
hashtable, it increase its buckets and copy the old items into the new
one and re-calculate the hash value of each item.

I think this will waste some time doing the increasing-copying thing.
If I know I'm going to handle about 20000 items, I can set the size of
hashtable to 30000.

So, can I do this in python? I can't figure it out so I come here for
help.

Thanks!

Dec 21 '06 #1
10 1777
[co*******@gmail.com]
Python dict is a hash table, isn't it?
Yup.
I know that hashtable has the concept of "bucket size" and "min bucket
count" stuff,
Some implementations of hash tables do. Python's does not. Python's
uses what's called "open addressing" instead.
and they should be configurable so I can set them to the proper value
when I know how many items I'm going to handle.

If these two values can't be set, the hashtable will give them default
values. When there are more and more items being added to the
hashtable, it increase its buckets and copy the old items into the new
one and re-calculate the hash value of each item.
That's several distinct claims, each of which is true of some hash
table implementations but not of others. Python's implementation has
no "buckets", so all that's simply irrelevant. Python's
implementation (not all do) saves the hash value of each item, so when
it does need to grow (or shrink) the space allocated to the table, it
does not need to recalculate the hash value of any item.

You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
I think this will waste some time doing the increasing-copying thing.
It does take time to resize. "Waste" is a value judgment ;-)
If I know I'm going to handle about 20000 items, I can set the size of
hashtable to 30000.

So, can I do this in python?
You can't. Unless you build up to 20000 items over and over (many
thousands of times), it's unlikely you'd be able to measure the time
difference even if you could.
Dec 21 '06 #2
Thank you very much for your explanation!

I made a mistake that I said the hash value should be recalculated each
time the dict resize, what I wanted to say was that the position of
each item should be recalculated.

Maybe I should take a look at the source code of python dict some day.
I'll some here for help then.

Thanks again!
Tim Peters wrote:
[co*******@gmail.com]
Python dict is a hash table, isn't it?

Yup.
I know that hashtable has the concept of "bucket size" and "min bucket
count" stuff,

Some implementations of hash tables do. Python's does not. Python's
uses what's called "open addressing" instead.
and they should be configurable so I can set them to the proper value
when I know how many items I'm going to handle.

If these two values can't be set, the hashtable will give them default
values. When there are more and more items being added to the
hashtable, it increase its buckets and copy the old items into the new
one and re-calculate the hash value of each item.

That's several distinct claims, each of which is true of some hash
table implementations but not of others. Python's implementation has
no "buckets", so all that's simply irrelevant. Python's
implementation (not all do) saves the hash value of each item, so when
it does need to grow (or shrink) the space allocated to the table, it
does not need to recalculate the hash value of any item.

You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
I think this will waste some time doing the increasing-copying thing.

It does take time to resize. "Waste" is a value judgment ;-)
If I know I'm going to handle about 20000 items, I can set the size of
hashtable to 30000.

So, can I do this in python?

You can't. Unless you build up to 20000 items over and over (many
thousands of times), it's unlikely you'd be able to measure the time
difference even if you could.
Dec 21 '06 #3
co*******@gmail.com wrote:
Thank you very much for your explanation!

I made a mistake that I said the hash value should be recalculated
each time the dict resize, what I wanted to say was that the position
of each item should be recalculated.

Maybe I should take a look at the source code of python dict some day.
I'll some here for help then.
In your original message you said:
I think this will waste some time doing the increasing-copying thing.
If I know I'm going to handle about 20000 items, I can set the size of
hashtable to 30000.
Have you actually tried timing how much time is wasted doing this? 20000
items really isn't very many for a dictionary to handle.

I ran a quick test to see how long it takes to create a dictionary with
20000 items. Remember, as has been explained to you, the items are not
copied, nor are the hash values recalculated when the dictionary is
resized, so if your dictionary takes much longer to create than the test
below gives you (on your machine that is rather than mine), then the
problem lies elsewhere. e.g. a slow hash function, or the time taken to
create whatever you are storing in the dictionary.

Just creating the dictionary with no other Python overhead:

timeit.py -s "import random; r = [ random.random() for i in range(20000)]"
"d = dict.fromkeys(r)"
100 loops, best of 3: 11.3 msec per loop

Creating the dictionary using a Python for loop:

timeit.py -s "import random; r = [ random.random() for i in range(20000)]"
"d={}" "for k in r: d[k] = None"
100 loops, best of 3: 11.7 msec per loop

Assigning to the elements in a pre-populated (and therefore the right
size) dictionary:

timeit.py -s "import random; r = [ random.random() for i in range(20000)];
d=dict.fromkeys(r)" "for k in r: d[k] = None"
100 loops, best of 3: 10.1 msec per loop

How many times do you create this dictionary that shaving 1.5 milliseconds
off 11 millseconds matters to you?

FWIW, for a 2,000,000 element dictionary the times are 2.5s and 1.48s
respectively so the situation does get progressively worse as dictionaries
get very large.
Dec 21 '06 #4
In message <ma***************************************@python. org>, Tim
Peters wrote:
You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
Actually, no. It also consists of updating reference counts as well.
Dec 31 '06 #5
[Tim Peters]
>You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
[Lawrence D'Oliveiro]
Actually, no. It also consists of updating reference counts as well.
Not in context: dict resizing is refcount-neutral, and the CPython
implementation of dict resizing exploits that (quite directly in 2.5;
indirectly before 2.5). It's not just that refcounts end up the same
/across/ dict resizing, it's that they're not changed internally either for
the duration of the resizing operation.
Dec 31 '06 #6
In message <Xn***********************@216.196.97.136>, Tim Peters wrote:
[Tim Peters]
>>You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).

[Lawrence D'Oliveiro]
>Actually, no. It also consists of updating reference counts as well.

Not in context: dict resizing is refcount-neutral...
The question wasn't about resizing, it was about "copying a dict key or
value".
Dec 31 '06 #7
[Tim Peters]
>>>You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
[Lawrence D'Oliveiro]
>>Actually, no. It also consists of updating reference counts as well.
[Tim Peters]
>Not in context: dict resizing is refcount-neutral...
[Lawrence D'Oliveiro]
The question wasn't about resizing, it was about "copying a dict key or
value".
Then you misunderstood the OP. He was asking about dict resizing:

...

When there are more and more items being added to the
hashtable, it increase its buckets and copy the old items into the
new one and re-calculate the hash value of each item.

I think this will waste some time doing the increasing-copying
thing.

Taking my response out of context to begin with doesn't really change that
I answered the question he asked ;-)
Dec 31 '06 #8
In message <Xn***********************@216.196.97.136>, Tim Peters wrote:
[Tim Peters]
>>>>You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).

[Lawrence D'Oliveiro]
>>>Actually, no. It also consists of updating reference counts as well.

[Tim Peters]
>>Not in context: dict resizing is refcount-neutral...

[Lawrence D'Oliveiro]
>The question wasn't about resizing, it was about "copying a dict key or
value".

Then you misunderstood the OP. He was asking about dict resizing:
Which is not what you were talking about, in the part of your posting that I
was responding to (above).
Jan 1 '07 #9
Tim Peters wrote:
Taking my response out of context to begin with doesn't really change that
I answered the question he asked ;-)
welcome to comp.lang.python.

</F>

Jan 3 '07 #10
[Tim Peters]
>...
Taking my response out of context to begin with doesn't really change
that I answered the question he asked ;-)
[Fredrik Lundh]
welcome to comp.lang.python.

</F>
Thanks for the welcome! It's tough to be a newbie here ;-)
Jan 3 '07 #11

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

6 posts views Thread by Dave Benjamin | last post: by
105 posts views Thread by Peter Hickman | last post: by
3 posts views Thread by fdsl ysnh | last post: by
13 posts views Thread by Steven Bethard | last post: by
18 posts views Thread by Marko.Cain.23 | last post: by
reply views Thread by rkmr.em | last post: by
20 posts views Thread by Mr.SpOOn | last post: by
reply views Thread by leo001 | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.