Optimize function similiar to dict.update() but adds common values

Gregory Piñero

Hey guys,

I thought I'd throw this out there since everyone loves optimization
questions (it's true, check out the number of replies to those type of
questions!)

Right now I have a function shown below. I want to combine two
dictionaries and add the values together where-ever there is overlap.
I figured this function would be reasonably fast, but I have around 1
million keys in each dictionary (~80% overlap) and it just runs all
night.

So maybe my function is not O(n)? I really figured it was but maybe
dictionary lookups are O(nlogn)?

Anyway here is the function. Any ideas on doing this task a lot
faster would be appriciated.

Thanks!

<code apology=sorry if gmail kills my tabs>

def add_freqs(freq1,freq2):
"""Add two word freq dicts"""
newfreq={}
for key,value in freq1.items():
newfreq[key]=value+freq2.get(key,0)
for key,value in freq2.items():
newfreq[key]=value+freq1.get(key,0)
return newfreq

freq1={
'dog':1,
'cat':2,
'human':3
}
freq2={
'perro':1,
'gato':1,
'human':2
}
print add_freqs(freq1,freq2)

answer_I_want={
'dog':1,
'cat':2,
'perro':1,
'gato':1,
'human':5
}
</code>
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Dec 14 '05 #1

Subscribe Post Reply

1514

Peter Otten

Gregory Piñero wrote:

def add_freqs(freq1,freq2):
****"""Add*two*word*freq*dicts"""
****newfreq={}
****for*key,value*in*freq1.items():
********newfreq[key]=value+freq2.get(key,0)
****for*key,value*in*freq2.items():
********newfreq[key]=value+freq1.get(key,0)
****return*newfreq Any*ideas*on*doing*this*task*a*lot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Peter

Dec 14 '05 #2

Gregory Piñero

Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote:

Gregory Piñero wrote:
def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq

Anyideasondoingthistaskalot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Peter
--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Dec 14 '05 #3

Steve Holden

Gregory Piñero wrote:
[top-posting rearranged]

On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:

def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq

Anyideasondoingthistaskalot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

No, because the statement

total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Dec 14 '05 #4

Peter Otten

Gregory Piñero wrote:

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?

No. As I start with a copy of freq1, all keys of freq1 are already there.
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter

Dec 14 '05 #5

Steve Holden

Gregory Piñero wrote:
[top-posting rearranged]

On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:

def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq

Anyideasondoingthistaskalot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

Dec 14 '05 #6

Gregory Piñero

Yes, that makes sense. I can't wait to run it tonight. Sorry I can't
give you the running time of my original function as it never finished
:-(

I'll report back the running time of the new function though, assuming
it finishes ;-)

Thanks again,

-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote:

Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?

No. As I start with a copy of freq1, all keys of freq1 are already there..
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter

--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Dec 14 '05 #7

Gregory Piñero

OK, I ran Peter's add_freq3 and it ran four times on really large
dictionaries in about 3000 seconds. So I'd say that at a minimum
that's ten times faster than my original function since it ran all
last night and didn't finish.

Much obliged, Peter!

-Greg
On 12/14/05, Gregory Piñero <gr********@gmail.com> wrote:

Yes, that makes sense. I can't wait to run it tonight. Sorry I can't
give you the running time of my original function as it never finished
:-(

I'll report back the running time of the new function though, assuming
it finishes ;-)

Thanks again,

-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?

No. As I start with a copy of freq1, all keys of freq1 are already there.
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter

--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Dec 15 '05 #8

Christopher Subich

Peter Otten wrote:

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value

Dec 15 '05 #9

bonono

Christopher Subich wrote:

Peter Otten wrote:

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value

Would that have a performance impact as it seems that key needs to be
hashed twice or may be += also need to do it twice ? Just curious.

Dec 15 '05 #10

Peter Otten

Christopher Subich wrote:

Peter Otten wrote:

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value

That depends on the data. For an 80/20 matches to misses ratio it ranges
between the two approaches I gave:

$ python -m timeit -r1 -n1 -s"from bm import fin as f" "f()"
1 loops, best of 1: 70.5 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fget as f" "f()"
1 loops, best of 1: 91.1 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fex as f" "f()"
1 loops, best of 1: 142 msec per loop

The break even for fget/fex is at about 92% hit rate:

$ python -m timeit -r1 -n1 -s"from bm import fget as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 88 msec per loop

$ python -m timeit -r1 -n1 -s"from bm import fex as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 87.4 msec per loop

Here's the bm.py module so you can verify my reasoning:

freqs = dict.fromkeys(range(8*10**4), 0)
keys = range(10**5)

def fin(d=freqs, keys=keys):
for key in keys:
if key in d:
d[key] += 42
else:
d[key] = 42
return d

def fget(d=freqs, keys=keys):
for key in keys:
d[key] = d.get(key, 0) + 42
return d

def fex(d=freqs, keys=keys):
for key in keys:
try:
d[key] += 42
except KeyError:
d[key] = 42
return d

if __name__ == "__main__":
assert fin(dict(freqs)) == fex(dict(freqs)) == fget(dict(freqs))

Peter

Dec 15 '05 #11

Similar topics

'inverting' a dict

by: Irmen de Jong | last post by:

Hi I have this dict that maps a name to a sequence of other names. I want to have it reversed, i.e., map the other names each to the key they belong to (yes, the other names are unique and they...

Python

set, dict and other structures

by: bearophileHUGS | last post by:

I'm frequently using Py2.4 sets, I find them quite useful, and I like them, even if they seem a little slower than dicts. Sets also need the same memory of dicts (can they be made to use less...

Python

changing local namespace of a function

by: Bo Peng | last post by:

Dear list, I have many dictionaries with the same set of keys and I would like to write a function to calculate something based on these values. For example, I have a = {'x':1, 'y':2} b =...

Python

dict and __cmp__() question

by: Alex | last post by:

Entering >>> help(dict) Help on class dict in module __builtin__: class dict(object) | dict() -> new empty dictionary. | dict(mapping) -> new dictionary initialized from a mapping object's...

Python

Editing a function in-memory and in-place

by: Ian Bicking | last post by:

I got a puzzler for y'all. I want to allow the editing of functions in-place. I won't go into the reason (it's for HTConsole -- http://blog.ianbicking.org/introducing-htconsole.html), except that...

Python

a dict trick

by: james_027 | last post by:

hi for example I have this dictionary dict = {'name':'james', 'language':'english'} value = 'sex' in dict and dict or 'unknown' is a right pythonic of doing this one? I am trying to get a...

Python

trap setting attribute when the attribute is dict

by: | last post by:

Hello all, I have a question which might be simple or need some work around. I want to do something like this. My class/instance has a dict as a property. I want the instance to catch the...

Python

dict.update() useful or not?

by: Steven D'Aprano | last post by:

dict1.update(dict2) is of course equivalent to this code: for key, value in dict2.iteritems(): dict1 = value Note that it replaces values in dict1 with the value taken from dict2. I don't...

Python

How best to pass arbitrary parameters from one function to another

by: John O'Hagan | last post by:

Hi Pythonistas, I'm looking for the best way to pass an arbitrary number and type of variables created by one function to another. They can't be global because they may have different values...

Python

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA