473,398 Members | 2,188 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,398 software developers and data experts.

Optimize function similiar to dict.update() but adds common values

Hey guys,

I thought I'd throw this out there since everyone loves optimization
questions (it's true, check out the number of replies to those type of
questions!)

Right now I have a function shown below. I want to combine two
dictionaries and add the values together where-ever there is overlap.
I figured this function would be reasonably fast, but I have around 1
million keys in each dictionary (~80% overlap) and it just runs all
night.

So maybe my function is not O(n)? I really figured it was but maybe
dictionary lookups are O(nlogn)?

Anyway here is the function. Any ideas on doing this task a lot
faster would be appriciated.

Thanks!

<code apology=sorry if gmail kills my tabs>

def add_freqs(freq1,freq2):
"""Add two word freq dicts"""
newfreq={}
for key,value in freq1.items():
newfreq[key]=value+freq2.get(key,0)
for key,value in freq2.items():
newfreq[key]=value+freq1.get(key,0)
return newfreq

freq1={
'dog':1,
'cat':2,
'human':3
}
freq2={
'perro':1,
'gato':1,
'human':2
}
print add_freqs(freq1,freq2)

answer_I_want={
'dog':1,
'cat':2,
'perro':1,
'gato':1,
'human':5
}
</code>
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Dec 14 '05 #1
10 1514
Gregory Piñero wrote:
def add_freqs(freq1,freq2):
****"""Add*two*word*freq*dicts"""
****newfreq={}
****for*key,value*in*freq1.items():
********newfreq[key]=value+freq2.get(key,0)
****for*key,value*in*freq2.items():
********newfreq[key]=value+freq1.get(key,0)
****return*newfreq Any*ideas*on*doing*this*task*a*lot faster would be appriciated.


With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Peter
Dec 14 '05 #2
Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:
def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq

Anyideasondoingthistaskalot faster would be appriciated.


With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Peter
--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Dec 14 '05 #3
Gregory Piñero wrote:
[top-posting rearranged]
On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:

def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq

Anyideasondoingthistaskalot faster would be appriciated.


With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

No, because the statement

total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Dec 14 '05 #4
Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?


No. As I start with a copy of freq1, all keys of freq1 are already there.
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter

Dec 14 '05 #5
Gregory Piñero wrote:
[top-posting rearranged]
On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:

def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq

Anyideasondoingthistaskalot faster would be appriciated.


With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

No, because the statement

total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Dec 14 '05 #6
Yes, that makes sense. I can't wait to run it tonight. Sorry I can't
give you the running time of my original function as it never finished
:-(

I'll report back the running time of the new function though, assuming
it finishes ;-)

Thanks again,

-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?


No. As I start with a copy of freq1, all keys of freq1 are already there..
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter

--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Dec 14 '05 #7
OK, I ran Peter's add_freq3 and it ran four times on really large
dictionaries in about 3000 seconds. So I'd say that at a minimum
that's ten times faster than my original function since it ran all
last night and didn't finish.

Much obliged, Peter!

-Greg
On 12/14/05, Gregory Piñero <gr********@gmail.com> wrote:
Yes, that makes sense. I can't wait to run it tonight. Sorry I can't
give you the running time of my original function as it never finished
:-(

I'll report back the running time of the new function though, assuming
it finishes ;-)

Thanks again,

-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?


No. As I start with a copy of freq1, all keys of freq1 are already there.
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter

--
http://mail.python.org/mailman/listinfo/python-list

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Dec 15 '05 #8
Peter Otten wrote:

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total


Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value
Dec 15 '05 #9

Christopher Subich wrote:
Peter Otten wrote:

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total


Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value

Would that have a performance impact as it seems that key needs to be
hashed twice or may be += also need to do it twice ? Just curious.

Dec 15 '05 #10
Christopher Subich wrote:
Peter Otten wrote:

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total


Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value


That depends on the data. For an 80/20 matches to misses ratio it ranges
between the two approaches I gave:

$ python -m timeit -r1 -n1 -s"from bm import fin as f" "f()"
1 loops, best of 1: 70.5 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fget as f" "f()"
1 loops, best of 1: 91.1 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fex as f" "f()"
1 loops, best of 1: 142 msec per loop

The break even for fget/fex is at about 92% hit rate:

$ python -m timeit -r1 -n1 -s"from bm import fget as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 88 msec per loop

$ python -m timeit -r1 -n1 -s"from bm import fex as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 87.4 msec per loop

Here's the bm.py module so you can verify my reasoning:

freqs = dict.fromkeys(range(8*10**4), 0)
keys = range(10**5)

def fin(d=freqs, keys=keys):
for key in keys:
if key in d:
d[key] += 42
else:
d[key] = 42
return d

def fget(d=freqs, keys=keys):
for key in keys:
d[key] = d.get(key, 0) + 42
return d

def fex(d=freqs, keys=keys):
for key in keys:
try:
d[key] += 42
except KeyError:
d[key] = 42
return d

if __name__ == "__main__":
assert fin(dict(freqs)) == fex(dict(freqs)) == fget(dict(freqs))

Peter
Dec 15 '05 #11

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

15
by: Irmen de Jong | last post by:
Hi I have this dict that maps a name to a sequence of other names. I want to have it reversed, i.e., map the other names each to the key they belong to (yes, the other names are unique and they...
8
by: bearophileHUGS | last post by:
I'm frequently using Py2.4 sets, I find them quite useful, and I like them, even if they seem a little slower than dicts. Sets also need the same memory of dicts (can they be made to use less...
31
by: Bo Peng | last post by:
Dear list, I have many dictionaries with the same set of keys and I would like to write a function to calculate something based on these values. For example, I have a = {'x':1, 'y':2} b =...
2
by: Alex | last post by:
Entering >>> help(dict) Help on class dict in module __builtin__: class dict(object) | dict() -> new empty dictionary. | dict(mapping) -> new dictionary initialized from a mapping object's...
5
by: Ian Bicking | last post by:
I got a puzzler for y'all. I want to allow the editing of functions in-place. I won't go into the reason (it's for HTConsole -- http://blog.ianbicking.org/introducing-htconsole.html), except that...
8
by: james_027 | last post by:
hi for example I have this dictionary dict = {'name':'james', 'language':'english'} value = 'sex' in dict and dict or 'unknown' is a right pythonic of doing this one? I am trying to get a...
1
by: | last post by:
Hello all, I have a question which might be simple or need some work around. I want to do something like this. My class/instance has a dict as a property. I want the instance to catch the...
3
by: Steven D'Aprano | last post by:
dict1.update(dict2) is of course equivalent to this code: for key, value in dict2.iteritems(): dict1 = value Note that it replaces values in dict1 with the value taken from dict2. I don't...
2
by: John O'Hagan | last post by:
Hi Pythonistas, I'm looking for the best way to pass an arbitrary number and type of variables created by one function to another. They can't be global because they may have different values...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.