Bytes IT Community

Defaultdict and speed

This post summarizes some things I have written in another Python
newsgroup. More than 40% of the time I use defaultdict like this, to
count things:
>>> from collections import defaultdict as DD
>>> s = "abracadabra"
>>> d = DD(int)
>>> for c in s: d[c] += 1
...
>>> d
defaultdict(<type 'int'>, {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1})

But I have seen that if the keys are quite sparse, so int() gets called
too often, then code like this is faster:
>>> d = {}
>>> for c in s:
...     if c in d: d[c] += 1
...     else: d[c] = 1
...
>>> d
{'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1}
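A third variant (my addition, not from the original post) uses dict.get with an inline default, which needs no if/else and no factory call, though it still performs two hash lookups per character:

```python
# Counting with dict.get: the default 0 is supplied inline,
# so missing keys never raise and no factory is invoked.
s = "abracadabra"
d = {}
for c in s:
    d[c] = d.get(c, 0) + 1
```

Whether this beats the if/else form depends on the hit/miss ratio, since the bound-method call to get has its own cost.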

So, to improve speed in this special but common situation, defaultdict
could handle the default_factory=int case in a different and faster
way.

Bye,
bearophile

Nov 3 '06 #1
3 Replies


be************@lycos.com wrote:
> So to improve the speed for such special but common situation, the
> defaultdict can manage the case with default_factory=int in a
> different and faster way.
Benchmarks? I doubt it is worth complicating defaultdict's code (and
slowing down other uses of the class) for this improvement...
especially when the faster alternative is so easy to code. If that
performance difference matters, you would likely find more fruitful
gains in coding it in C, using PyDict_SetItem.

-Mike

Nov 4 '06 #2

Klaas wrote:
> Benchmarks?
There is one (fixed in a successive post) in the original thread I was
referring to:
http://groups.google.com/group/it.co...f60c644969f9b/
If you want I can give more of them (a bit less silly, with strings
too, etc.).

from time import clock
from collections import defaultdict

def ddict(n):
    t = clock()
    d = defaultdict(int)
    for i in xrange(n):
        d[i] += 1
    print round(clock()-t, 2)

def ndict(n):
    t = clock()
    d = {}
    for i in xrange(n):
        if i in d:
            d[i] += 1
        else:
            d[i] = 1
    print round(clock()-t, 2)

ddict(300000)
ndict(300000)

> (and slowing down other uses of the class)
All it has to do is check if the default_factory is int; it's just an
"if" done only once, so I don't think it slows down the other cases
significantly.

> especially when the faster alternative is so easy to code.
The faster alternative is easy to create, but the best faster
alternative can't be coded, because if you code it in Python you need
two hash accesses, while defaultdict can require only one of them:

if n in d:
d[n] += 1
else:
d[n] = 1
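For comparison (my addition, not from the original thread), a try/except version drops the separate membership test: for an existing key it does only the get and set that d[n] += 1 implies, and the exception cost is paid only the first time each key is seen:

```python
# EAFP counting: no explicit "in" probe before the update.
s = "abracadabra"
d = {}
for n in s:
    try:
        d[n] += 1      # existing key: just the get and the set
    except KeyError:
        d[n] = 1       # new key: exception path, taken once per key
```

This tends to win when most updates hit existing keys and lose when many keys are new, since raising KeyError is comparatively expensive.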

> If that performance difference matters,
With Python it's usually difficult to tell whether some performance
difference matters. In some programs it may matter, but in most other
programs it doesn't. This is probably true of all the performance
tweaks I may invent in the future, too.

> you would likely find more fruitful gains in coding it in C, using
> PyDict_SetItem
I've just started creating a C lib for related purposes. I'd like to
show it to you all on c.l.p, but first I have to find a place to put
it :-) (It's not easy to find a suitable place; it's Python + C + a
pyd, and it's mostly an exercise.)

Bye,
bearophile

Nov 4 '06 #3

be************@lycos.com wrote:
> There is one (fixed in a successive post) in the original thread I
> was referring to:
> http://groups.google.com/group/it.co...f60c644969f9b/

Sorry, I didn't see any numbers. I ran it myself and found the
defaultdict version to be approximately twice as slow. This, as you
suggest, is the worst case, since you are using integers as hash keys
(essentially no hashing cost) and accessing each key exactly once.
> All it has to do is to check if the default_factory is an int, it's
> just an "if" done only once, so I don't think it slows down the
> other cases significantly.
Once it makes that check, surely it must check a flag or some such
every time it is about to invoke the key constructor function?
> The faster alternative is easy to create, but the best faster
> alternative can't be coded, because if you code it in Python you need
> two hash accesses, while the defaultdict can require only one of
> them.
How do you think that defaultdict is implemented? It must perform the
dictionary access to determine that the value is missing. It must then
go through the method dispatch machinery to look for the __missing__
method, and execute it. If you _really_ want to make this fast, you
should write a custom dictionary subclass which accepts an object (not
a function) as the default value, and assigns it directly.
> With Python it's usually difficult to tell if some performance
> difference matters. Probably in some programs it may matter, but in
> most other programs it doesn't matter.
In general, I agree, but in this case it is quite clear. The only
possible speed-up is for defaultdict(int). The rewrite using regular
dicts is trivial; hence, for a given piece of code it is quite clear
whether the performance gain is important. This is not an
interpreter-wide change, after all.

Consider also that the performance gains would be relatively
insubstantial when more complicated keys and a more realistic data
distribution are used. Consider further that the __missing__ machinery
would still be called. Would the resulting construct be faster than
the use of a vanilla dict? I doubt it.

But you can prove me wrong by implementing it and benchmarking it.
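One way to run that comparison today (my sketch, using the standard timeit module; the key distribution and repeat counts are illustrative, and absolute numbers vary by machine and Python version):

```python
import timeit

# Shared setup: 1000 distinct integer keys, each hit 10 times.
setup = "from collections import defaultdict; data = list(range(1000)) * 10"

t_dd = timeit.timeit(
    "d = defaultdict(int)\n"
    "for k in data: d[k] += 1",
    setup=setup, number=200)

t_plain = timeit.timeit(
    "d = {}\n"
    "for k in data:\n"
    "    if k in d: d[k] += 1\n"
    "    else: d[k] = 1",
    setup=setup, number=200)

print("defaultdict: %.3fs  plain dict: %.3fs" % (t_dd, t_plain))
```

With mostly repeated keys the factory is rarely called, so this measures the steady-state cost of each construct rather than the worst case discussed above.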
> I've just started creating a C lib for related purposes, I'd like to
> show it to you all on c.l.p, but first I have to find a place to put
> it on :-)
Would suggesting a webpage be too trite?

-Mike

Nov 5 '06 #4

This discussion thread is closed; replies have been disabled.