471,107 Members | 1,700 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 471,107 software developers and data experts.

dict.get and str.xsplit

When I use dictionaries I find the dict.get() method very handy, and
I'd like to use it often, but I think it's quite slow. This is a small
benchmark:

from random import randrange
from timeit import default_timer as clock

def main(N, M):
keys = list(set(randrange(2*N) for n in xrange(N)))
n2 = len(keys) / 2
keys1, keys2 = keys[:n2], keys[n2:]
adict = dict.fromkeys(keys1)

t = clock()
for _ in xrange(M):
for k in keys1:
r = adict.get(k, None)
for k in keys2:
r = adict.get(k, None)
print round(clock() - t, 2), "s"

t = clock()
for _ in xrange(M):
for k in keys1:
if k in adict:
r = adict[k]
else:
r = None
for k in keys2:
if k in adict:
r = adict[k]
else:
r = None
print round(clock() - t, 2), "s"

main(40000, 30)
On my old PC the output is:
2.36 s
1.59 s
(Your timings may be 4-5 times smaller, so you can tweak N and M to
have significant results).
And note that the if/then version is faster even if it calls the hash
function two times, while get() is supposed to do it once (in this
case the hashing function is very simple and fast. If the keys are
long strings the results of a similar benchmark may be different,
because the computing of the hashing function may become the slowest
part of that code).

This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

------------------

Another method I'd like to see, is the str.xsplit() I was talking
about time ago too. It's like str.split() but it's lazy. It avoids to
build a potentially huge list. In real D programs I have seen such
string function may speed up heavy-string-processing code (the kind of
code you may want to use Python/Perl/Ruby for) up to 2-3 times. Seeing
how such kind of processing is common in Python I think this isn't an
optimization to ignore.
(I am now getting better in writing C code, so in few months I may
even become able to submit a xsplit patch myself. I think it's not too
much complex to write, it can borrow lot of code from the str.split
method).

Bye, and thank you for your kind attention,
bearophile
Feb 26 '08 #1
6 1507
On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?
I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

Ciao,
Marc 'BlackJack' Rintsch
Feb 26 '08 #2
Marc 'BlackJack' Rintsch wrote:
On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
>This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)
And think about using the timeit module, since it's been included among
the batteries for your convenience, and it removes some common pitfalls
with cross-platform timings.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Feb 26 '08 #3
Marc 'BlackJack' Rintsch:
I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::
I did know of that optimization, but sometimes I "forget" about it...
The new timings:

Output timings, best of 3, unloaded CPU:
2.32 s with adict.get
1.56 s with "in" + if/then/else
1.78 s with adict_get

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now :-) In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

Bye and thank you,
bearophile
Feb 26 '08 #4
bearophileH...@lycos.com napisał(a):
It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now :-) In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.
AFAIK, method/function call is generally slow in Python (similarly to
the dot attribute lookup). As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

from collections import defaultdict

def main(N, M):

# <snip>

addict = defaultdict(lambda: None, adict)
t = clock()
for _ in xrange(M):
for k in keys1:
r = addict[k]
for k in keys2:
r = addict[k]
print round(clock() - t, 2), "s"

adict.get: 3.12 s
adict_get: 2.24 s
in: 1.62 s
defaultdict: 1.05 s
Feb 26 '08 #5
marek.ro...@wp.pl:
As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:
It adds the new keys, I can't accept that:
>>from collections import defaultdict as dd
adict = {1:2, 3:4}
addict = dd(lambda: None, adict)
r = addict[10]
addict
defaultdict(<function <lambdaat 0x008E6FB0>, {1: 2, 10: None, 3: 4})

Bye,
bearophile
Feb 26 '08 #6
bearophileH...@lycos.com napisał(a):
marek.ro...@wp.pl:
As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

It adds the new keys, I can't accept that:
Right, my fault not noticing that. I experimented further with
__missing__ method and with "ask forgiveness" idiom (try... except
KeyError) and they were both noticeably slower. So checking with "in"
may be the most efficient way we have.

Regards,
Marek
Feb 26 '08 #7

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

9 posts views Thread by Robin Cull | last post: by
2 posts views Thread by GrelEns | last post: by
15 posts views Thread by Cruella DeVille | last post: by
15 posts views Thread by George Sakkis | last post: by
12 posts views Thread by jeremito | last post: by
10 posts views Thread by Gerard flanagan | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.