Hey guys,
I thought I'd throw this out there since everyone loves optimization
questions (it's true, check out the number of replies to those type of
questions!)
Right now I have a function shown below. I want to combine two
dictionaries and add the values together where-ever there is overlap.
I figured this function would be reasonably fast, but I have around 1
million keys in each dictionary (~80% overlap) and it just runs all
night.
So maybe my function is not O(n)? I really figured it was but maybe
dictionary lookups are O(nlogn)?
Anyway here is the function. Any ideas on doing this task a lot
faster would be appriciated.
Thanks!
<code apology=sorry if gmail kills my tabs>
def add_freqs(freq1,freq2):
"""Add two word freq dicts"""
newfreq={}
for key,value in freq1.items():
newfreq[key]=value+freq2.get(key,0)
for key,value in freq2.items():
newfreq[key]=value+freq1.get(key,0)
return newfreq
freq1={
'dog':1,
'cat':2,
'human':3
}
freq2={
'perro':1,
'gato':1,
'human':2
}
print add_freqs(freq1,freq2)
answer_I_want={
'dog':1,
'cat':2,
'perro':1,
'gato':1,
'human':5
}
</code>
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
( www.blendedtechnologies.com) 10 1514
Gregory Piñero wrote: def add_freqs(freq1,freq2): ****"""Add*two*word*freq*dicts""" ****newfreq={} ****for*key,value*in*freq1.items(): ********newfreq[key]=value+freq2.get(key,0) ****for*key,value*in*freq2.items(): ********newfreq[key]=value+freq1.get(key,0) ****return*newfreq
Any*ideas*on*doing*this*task*a*lot faster would be appriciated.
With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.
With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.
Dump the symmetry and try one of these:
def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total
def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total
My guess is that add_freqs3() will perform best.
Peter
Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.
-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote: Gregory Piñero wrote:
def add_freqs(freq1,freq2): """Addtwowordfreqdicts""" newfreq={} forkey,valueinfreq1.items(): newfreq[key]=value+freq2.get(key,0) forkey,valueinfreq2.items(): newfreq[key]=value+freq1.get(key,0) returnnewfreq
Anyideasondoingthistaskalot faster would be appriciated.
With items() you copy the whole dictionary into a list of tuples; iteritems() just walks over the existing dictionary and creates one tuple at a time.
With "80% overlap", you are looking up and setting four out of five values twice in your for-loops.
Dump the symmetry and try one of these:
def add_freqs2(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): if key in freq1: total[key] += value else: total[key] = value return total
def add_freqs3(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): try: total[key] += value except KeyError: total[key] = value return total
My guess is that add_freqs3() will perform best.
Peter -- http://mail.python.org/mailman/listinfo/python-list
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
( www.blendedtechnologies.com)
Gregory Piñero wrote:
[top-posting rearranged] On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:
def add_freqs(freq1,freq2): """Addtwowordfreqdicts""" newfreq={} forkey,valueinfreq1.items(): newfreq[key]=value+freq2.get(key,0) forkey,valueinfreq2.items(): newfreq[key]=value+freq1.get(key,0) returnnewfreq
Anyideasondoingthistaskalot faster would be appriciated.
With items() you copy the whole dictionary into a list of tuples; iteritems() just walks over the existing dictionary and creates one tuple at a time.
With "80% overlap", you are looking up and setting four out of five values twice in your for-loops.
Dump the symmetry and try one of these:
def add_freqs2(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): if key in freq1: total[key] += value else: total[key] = value return total
def add_freqs3(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): try: total[key] += value except KeyError: total[key] = value return total
My guess is that add_freqs3() will perform best. Thanks Peter, those are some really good ideas. I can't wait to try them out tonight.
Here's a question about your functions. if I only look at the keys in freq2 then won't I miss any keys that are in freq1 and not in freq2? That's why I have the two loops in my original function.
No, because the statement
total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Gregory Piñero wrote: Here's a question about your functions. if I only look at the keys in freq2 then won't I miss any keys that are in freq1 and not in freq2?
No. As I start with a copy of freq1, all keys of freq1 are already there.
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.
What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.
Peter
Gregory Piñero wrote:
[top-posting rearranged] On 12/14/05, Peter Otten <__*******@web.de> wrote:
Gregory Piñero wrote:
def add_freqs(freq1,freq2): """Addtwowordfreqdicts""" newfreq={} forkey,valueinfreq1.items(): newfreq[key]=value+freq2.get(key,0) forkey,valueinfreq2.items(): newfreq[key]=value+freq1.get(key,0) returnnewfreq
Anyideasondoingthistaskalot faster would be appriciated.
With items() you copy the whole dictionary into a list of tuples; iteritems() just walks over the existing dictionary and creates one tuple at a time.
With "80% overlap", you are looking up and setting four out of five values twice in your for-loops.
Dump the symmetry and try one of these:
def add_freqs2(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): if key in freq1: total[key] += value else: total[key] = value return total
def add_freqs3(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): try: total[key] += value except KeyError: total[key] = value return total
My guess is that add_freqs3() will perform best. Thanks Peter, those are some really good ideas. I can't wait to try them out tonight.
Here's a question about your functions. if I only look at the keys in freq2 then won't I miss any keys that are in freq1 and not in freq2? That's why I have the two loops in my original function.
No, because the statement
total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Yes, that makes sense. I can't wait to run it tonight. Sorry I can't
give you the running time of my original function as it never finished
:-(
I'll report back the running time of the new function though, assuming
it finishes ;-)
Thanks again,
-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote: Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in freq2 then won't I miss any keys that are in freq1 and not in freq2?
No. As I start with a copy of freq1, all keys of freq1 are already there.. There is probably a loop involved, but it in Python's underlying C implementation, which is a bit faster.
What is left to do is to (1) add key and value for the 20% of freq2 that are not in freq1, and to (2) increase the value for the 80% where the key occurs in both freq1 and freq2. This is done by the for-loop.
Peter
-- http://mail.python.org/mailman/listinfo/python-list
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
( www.blendedtechnologies.com)
OK, I ran Peter's add_freq3 and it ran four times on really large
dictionaries in about 3000 seconds. So I'd say that at a minimum
that's ten times faster than my original function since it ran all
last night and didn't finish.
Much obliged, Peter!
-Greg
On 12/14/05, Gregory Piñero <gr********@gmail.com> wrote: Yes, that makes sense. I can't wait to run it tonight. Sorry I can't give you the running time of my original function as it never finished :-(
I'll report back the running time of the new function though, assuming it finishes ;-)
Thanks again,
-Greg
On 12/14/05, Peter Otten <__*******@web.de> wrote: Gregory Piñero wrote:
Here's a question about your functions. if I only look at the keys in freq2 then won't I miss any keys that are in freq1 and not in freq2?
No. As I start with a copy of freq1, all keys of freq1 are already there. There is probably a loop involved, but it in Python's underlying C implementation, which is a bit faster.
What is left to do is to (1) add key and value for the 20% of freq2 that are not in freq1, and to (2) increase the value for the 80% where the key occurs in both freq1 and freq2. This is done by the for-loop.
Peter
-- http://mail.python.org/mailman/listinfo/python-list
-- Gregory Piñero Chief Innovation Officer Blended Technologies (www.blendedtechnologies.com)
--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
( www.blendedtechnologies.com)
Peter Otten wrote: def add_freqs3(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): try: total[key] += value except KeyError: total[key] = value return total
Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:
total[key] = total.get(key,0) + value
Christopher Subich wrote: Peter Otten wrote: def add_freqs3(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): try: total[key] += value except KeyError: total[key] = value return total
Untested, but replacing the try/except pair with the following should be semantically equivalent and a bit faster:
total[key] = total.get(key,0) + value
Would that have a performance impact as it seems that key needs to be
hashed twice or may be += also need to do it twice ? Just curious.
Christopher Subich wrote: Peter Otten wrote: def add_freqs3(freq1, freq2): total = dict(freq1) for key, value in freq2.iteritems(): try: total[key] += value except KeyError: total[key] = value return total
Untested, but replacing the try/except pair with the following should be semantically equivalent and a bit faster:
total[key] = total.get(key,0) + value
That depends on the data. For an 80/20 matches to misses ratio it ranges
between the two approaches I gave:
$ python -m timeit -r1 -n1 -s"from bm import fin as f" "f()"
1 loops, best of 1: 70.5 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fget as f" "f()"
1 loops, best of 1: 91.1 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fex as f" "f()"
1 loops, best of 1: 142 msec per loop
The break even for fget/fex is at about 92% hit rate:
$ python -m timeit -r1 -n1 -s"from bm import fget as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 88 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fex as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 87.4 msec per loop
Here's the bm.py module so you can verify my reasoning:
freqs = dict.fromkeys(range(8*10**4), 0)
keys = range(10**5)
def fin(d=freqs, keys=keys):
for key in keys:
if key in d:
d[key] += 42
else:
d[key] = 42
return d
def fget(d=freqs, keys=keys):
for key in keys:
d[key] = d.get(key, 0) + 42
return d
def fex(d=freqs, keys=keys):
for key in keys:
try:
d[key] += 42
except KeyError:
d[key] = 42
return d
if __name__ == "__main__":
assert fin(dict(freqs)) == fex(dict(freqs)) == fget(dict(freqs))
Peter This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: Irmen de Jong |
last post by:
Hi
I have this dict that maps a name to a sequence of other names.
I want to have it reversed, i.e., map the other names each to
the key they belong to (yes, the other names are unique and
they...
|
by: bearophileHUGS |
last post by:
I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts. Sets also need the
same memory of dicts (can they be made to use less...
|
by: Bo Peng |
last post by:
Dear list,
I have many dictionaries with the same set of keys and I would like to
write a function to calculate something based on these values. For
example, I have
a = {'x':1, 'y':2}
b =...
|
by: Alex |
last post by:
Entering
>>> help(dict)
Help on class dict in module __builtin__:
class dict(object)
| dict() -> new empty dictionary.
| dict(mapping) -> new dictionary initialized from a mapping object's...
|
by: Ian Bicking |
last post by:
I got a puzzler for y'all. I want to allow the editing of functions
in-place. I won't go into the reason (it's for HTConsole --
http://blog.ianbicking.org/introducing-htconsole.html), except that...
|
by: james_027 |
last post by:
hi
for example I have this dictionary
dict = {'name':'james', 'language':'english'}
value = 'sex' in dict and dict or 'unknown'
is a right pythonic of doing this one? I am trying to get a...
|
by: |
last post by:
Hello all,
I have a question which might be simple or need some work around.
I want to do something like this. My class/instance has a dict as a
property. I want the instance to catch the...
|
by: Steven D'Aprano |
last post by:
dict1.update(dict2) is of course equivalent to this code:
for key, value in dict2.iteritems():
dict1 = value
Note that it replaces values in dict1 with the value taken from dict2. I
don't...
|
by: John O'Hagan |
last post by:
Hi Pythonistas,
I'm looking for the best way to pass an arbitrary number and type of variables
created by one function to another. They can't be global because they may
have different values...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers,...
|
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new...
| |