
Pythonic search of list of dictionaries

Hello everyone,

I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.

The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:

o={'TermID':'4', 'English':'System Administration',
'Polish':'Zarzadzanie systemem'}

I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

...and then search for matching source terms in two loops:

for o in oldl:
for n in newl:
if n['English'] == o['English']:
...

Now, this works. However, not only is this very un-Pythonic, it is
also very inefficient: the complexity is O(n**2), so it scales very
badly.

What I want to know is if there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n,
contained in a list (or a dictionary) dy, where dx_i contains
a specific value. Or possibly just the first dx_1 dictionary.

I HAVE to search for values corresponding to the key 'English', since
there are big gaps in both files (i.e. there are a lot of rows
in the old file that do not correspond to rows in the new
file and vice versa). I don't want to do ugly things like converting a
dictionary to a string so that I could use the string.find() method.

Obviously it does not have to be implemented this way. If
data structures here could be designed in a proper
(Pythonesque ;-) way, great.

I do realize that this resembles doing some operation on
matrices. But I have never tried doing something like this in
Python.
#---------- Code follows ---------

import sys
import csv

class excelpoldialect(csv.Dialect):
    delimiter=';'
    doublequote=True
    lineterminator='\r\n'
    quotechar='"'
    quoting=0
    skipinitialspace=False

epdialect=excelpoldialect()
csv.register_dialect('excelpol',epdialect)

try:
    ofile=open(sys.argv[1],'rb')
except IOError:
    print "Old file %s could not be opened" % (sys.argv[1])
    sys.exit(1)

try:
    tfile=open(sys.argv[2],'rb')
except IOError:
    print "New file %s could not be opened" % (sys.argv[2])
    sys.exit(1)

titles=csv.reader(ofile, dialect='excelpol').next()
orig=csv.DictReader(ofile, titles, dialect='excelpol')
transl=csv.DictReader(tfile, titles, dialect='excelpol')

cfile=open('cmpfile.csv','wb')
titles.append('New')
titles.append('RowChanged')
cm=csv.DictWriter(cfile, titles, dialect='excelpol')
cm.writerow(dict(zip(titles,titles)))
print titles
print "-------------"

oldl=[x for x in orig]
newl=[x for x in transl]

all=[]

for o in oldl:
    for n in newl:
        if n['English'] == o['English']:
            if n['Polish'] == o['Polish']:
                status=''
            else:
                status='CHANGED'
            combined={'TermID': o['TermID'], 'English': o['English'],
                      'Polish': o['Polish'], 'New': n['Polish'], 'RowChanged': status}
            cm.writerow(combined)
            all.append(combined)

# duplicates
dfile=open('dupes.csv','wb')
dupes=csv.DictWriter(dfile, titles, dialect='excelpol')
dupes.writerow(dict(zip(titles,titles)))

"""for i in xrange(0,len(all)-2):
    for j in xrange(i+1, len(all)-1):
        if (all[i]['English']==all[j]['English']) and
                all[i]['RowChanged']=='CHANGED':
            dupes.writerow(all[i])
            dupes.writerow(all[j])"""

cfile.close()
ofile.close()
tfile.close()
dfile.close()




--

Real world is perfectly indifferent to lies that
are the foundation of leftist "thinking".
Jul 18 '05 #1


Bulba! wrote:
Hello everyone,

I'm reading the rows from a CSV file. csv.DictReader puts
those rows into dictionaries.

The actual files contain old and new translations of software
strings. The dictionary containing the row data looks like this:

o={'TermID':'4', 'English':'System Administration',
'Polish':'Zarzadzanie systemem'}

I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

..and then search for matching source terms in two loops:

for o in oldl:
for n in newl:
if n['English'] == o['English']:
...

Now, this works. However, not only this is very un-Pythonic, but also
very inefficient: the complexity is O(n**2), so it scales up very
badly.

What I want to know is if there is some elegant and efficient
way of doing this, i.e. finding all the dictionaries dx_1 ... dx_n,
contained in a list (or a dictionary) dy, where dx_i contains
a specific value. Or possibly just the first dx_1 dictionary.
Sure, just do a little preprocessing. Something like (untested):

####

def make_map(l):
    # This assumes that each English key is unique in a given l;
    # if it's not, you'll have to use a list of o instead of o itself.
    map = {}
    for d in l:
        if 'English' in d:
            key = d['English']
            map[key] = d
    return map

old_map = make_map(oldl)
new_map = make_map(newl)

for engphrase in old_map:
    if engphrase in new_map:
        o = old_map[engphrase]
        n = new_map[engphrase]
        if n['Polish'] == o['Polish']:
            status=''
        else:
            status='CHANGED'
        # process....

####

I've assumed that the English key is unique in both the old and new
lists. If it's not this will need some adjustment. However, your
original algorithm is going to behave weirdly in that case anyway
(spitting out multiple lines with the same id, but potentially different
new terms and update status).

Hope that's useful.

-tim

[rest of quoted message snipped -- the original post and its code appear in full above]

Jul 18 '05 #2


Bulba> I put those dictionaries into the list:

Bulba> oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...

Bulba> ..and then search for matching source terms in two loops:

Bulba> for o in oldl:
Bulba> for n in newl:
Bulba> if n['English'] == o['English']:
Bulba> ...

Bulba> Now, this works. However, not only this is very un-Pythonic, but
Bulba> also very inefficient: the complexity is O(n**2), so it scales up
Bulba> very badly.

How about using sets?

oenglish = set([item['English'] for item in oldl])
nenglish = set([item['English'] for item in newl])

matching = oenglish & nenglish

Once you have those that match, you can constrain your outer loop to just
those cases where

o['English'] in matching
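
For instance, a minimal sketch of that filter (untested, reusing oldl, newl
and the matching set from above):

for o in oldl:
    if o['English'] not in matching:
        continue                # this old row has no counterpart in the new file
    for n in newl:
        if n['English'] == o['English']:
            pass                # process the matched pair as before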

If you're not using 2.4 yet, then get sets via:

from sets import Set as set

That's still not all that Pythonic, but should be a bit faster.

You might want to sort your lists by the 'English' key. I don't know how to
use the new key arg to list.sort(), but you can still do it the
old-fashioned way:

oldl.sort(lambda a,b: cmp(a['English'], b['English']))
newl.sort(lambda a,b: cmp(a['English'], b['English']))

Once sorted, you can then march through the lists in parallel, which should
give you an O(n) algorithm.
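
A bare-bones sketch of that parallel march (untested, and assuming each
'English' value occurs at most once in each sorted list) might look like:

o = n = 0
while o < len(oldl) and n < len(newl):
    oeng = oldl[o]['English']
    neng = newl[n]['English']
    if oeng == neng:
        # matched pair -- this is where you'd process it
        o += 1
        n += 1
    elif oeng < neng:
        # phrase appears only in the old list
        o += 1
    else:
        # phrase appears only in the new list
        n += 1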

Skip
Jul 18 '05 #3

Skip Montanaro wrote:
...lotsa great stuff ...
You might want to sort your lists by the 'English' key. I don't know how to
use the new key arg to list.sort(), but you can still do it the
old-fashioned way:

oldl.sort(lambda a,b: cmp(a['English'], b['English']))
newl.sort(lambda a,b: cmp(a['English'], b['English']))
To complete the thought, for 2.4 and after the new-fashioned way is:

import operator

oldl.sort(key=operator.itemgetter('English'))
newl.sort(key=operator.itemgetter('English'))
Once sorted, you can then march through the lists in parallel, which should
give you an O(n) algorithm.

But overall you will have O(n log n) because of the sorts.

--Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #4

In <lq********************************@4ax.com>, Bulba! wrote:
I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...


If you don't "do" anything with each `x` you can write this as:

oldl = list(orig)

Ciao,
Marc 'BlackJack' Rintsch
Jul 18 '05 #5

Bulba! wrote:
[big snip]

Forget the csv-induced dicts for the moment, they're just an artifact
of your first solution attempt. Whether english = csv_row[1], or
english = csv_row_dict["english"], doesn't matter yet. Let's take a few
steps back, and look at what you are trying to do through a telescope
instead of a microscope.

Core basic root problem: you have a table with 3 columns, id, english,
polish. Nothing is said about id so let's assume nothing. Evidently
contents of "english" is the "key" for this exercise, but it's not
necessarily unique. You have two instances of the table, and you want
to "diff" them using "english" as the key.

You want to collect together all rows that have the same value of
"english", and then process them somehow. You need to define an object
containing a list of all "left" rows and a list of all "right" rows.

Processing the contents of the object: do whatever you have to with
obj.left_list and obj.right_list depending on the lengths; you have 3
cases of length to consider (0, 1, many). 3 x 3 = 9, but if both are
zero you have a bug (which you should of course assert does not exist),
so that leaves 8 cases to think about.

Now, how do we get the rows together:

(a) The sort method

Step 1: sort each dataset on (english, id, polish) or (english, polish,
id) -- your choice; sorting on the whole record instead of just english
makes the ordering predictable and repeatable.

Step 2: read the two sorted datasets ("left" and "right") in parallel:

when left key < right key: do_stuff(); read another left record
when left key > right key: the converse
when left key == right key: do_stuff(); read another record from both
where do_stuff() includes appending to left_list and right_list as
appropriate and, at the right moment, processing a completed object.
This is a little tricky, and handling end of file needs a little care.
However, this algorithm can be implemented with minimal memory ("core"),
no disk drive at all :-O and a minimum of 3 (preferably 4+)
serially-readable, rewindable, re-writable storage devices, e.g. magnetic
tape drives. Once upon a time, it was *the* way of maintaining a
database (left = old database, right = transactions, output file -> new
database).

(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english, obj)
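
A minimal sketch of those three steps (untested; MyObject and process() are
the placeholders named above, and left_rows/right_rows stand in for however
the two tables are actually read):

class MyObject(object):
    def __init__(self):
        self.left_list = []
        self.right_list = []

mydict = {}
for row in left_rows:                               # the "left" / old table
    if row['English'] not in mydict:
        mydict[row['English']] = MyObject()
    mydict[row['English']].left_list.append(row)
for row in right_rows:                              # the "right" / new table
    if row['English'] not in mydict:
        mydict[row['English']] = MyObject()
    mydict[row['English']].right_list.append(row)
for english, obj in mydict.iteritems():
    process(english, obj)                           # look at obj.left_list / obj.right_list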

As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too much.
You are however correct to avoid O(N**2) solutions.

Hoping this sketch of the view through the telescope (and the
rear-vision mirror!) is helpful,
John

Jul 18 '05 #6

On Tue, 04 Jan 2005 21:57:46 +0100, Marc 'BlackJack' Rintsch
<bj****@gmx.net> wrote:
In <lq********************************@4ax.com>, Bulba! wrote:
I put those dictionaries into the list:

oldl=[x for x in orig] # where orig=csv.DictReader(ofile ...
If you don't "do" anything with each `x` you can write this as:

oldl = list(orig)


Thanks!

I need to wrap my mind around Python more tightly. :-)

--
It's a man's life in a Python Programming Association.
Jul 18 '05 #7

On 4 Jan 2005 14:33:34 -0800, "John Machin" <sj******@lexicon.net>
wrote:
(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english, obj)
As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too much.
You are however correct to avoid O(N**2) solutions.


Following the advice of two posters here (thanks), I have written two
versions of the same program, and both of them work, but the
difference in speed is drastic: about 6 seconds vs 190 seconds
for about 15,000 processed records, taken from 2 lists of
dictionaries.

I've read "Python Performance Tips" at

http://manatee.mojam.com/~skip/python/fastpython.html

...but still don't understand why the difference is so big.

Both versions use local variables, etc. Both have their
lists initially sorted. Both essentially use a loop with a
conditional for comparison, then process the record in the
same way. The overhead of the second version is that it also
uses the cmp() function and two additional integer
variables - that should not slow the program down _so much_.

I have measured the run of snippet 2 with time checkpoints
written to a log (write the time delta to the log every 200 records),
and even made a graph of the time deltas in a spreadsheet. In fact
snippet 2, after an initial slowdown, looks exactly linear,
like this:

^ (time)
|
| /-----------
| /
|/
---------------> (# of records written)

So yes, it would scale to big files. However, why is it so
frigging slow?!

# snippet 1, this runs in about 6 seconds
!def prepend(l):
!    map = {}
!    for d in l:
!        key = d['English']
!        map[key] = d
!    return map
!
!old_map = prepend(oldl)
!new_map = prepend(newl)
!
!for engphrase in old_map:
!    if engphrase in new_map:
!        o = old_map[engphrase]
!        n = new_map[engphrase]
!        cm.writerow(matchpol(o,n))

# snippet 2, this needs 190 seconds
!while 1:
!    if len(oldl) == 0 or len(newl) == 0:
!        break
!    if oldl[o]['English'] == newl[n]['English']:
!        cm.writerow(matchpol(oldl[o], newl[n]))
!        del oldl[o]
!        del newl[n]
!        o, n = 0, 0
!        continue
!    elif cmp(oldl[o]['English'], newl[n]['English']) < 0:
!        if o == len(oldl):
!            cm.writerow(newl[0])
!            del(newl[0])
!            o, n = 0, 0
!            continue
!        o+=1
!    elif cmp(oldl[o]['English'], newl[n]['English']) > 0:
!        if n == len(newl):
!            cm.writerow(newl[0])
!            del(oldl[0])
!            o, n = 0, 0
!            continue
!        n+=1

--
It's a man's life in a Python Programming Association.
Jul 18 '05 #8

Bulba! wrote:
Following advice of two posters here (thanks) I have written two
versions of the same program, and both of them work, but the
difference in speed is drastic, about 6 seconds vs 190 seconds
for about 15000 of processed records, taken from 2 lists of
dictionaries.

I've read "Python Performance Tips" at

http://manatee.mojam.com/~skip/python/fastpython.html

..but still don't understand why the difference is so big.
[snip]
[snippets 1 and 2 snipped -- quoted in full in the previous message]


I believe you're running into the fact that deleting from anywhere but
the end of a list in Python is O(n), where n is the number of items in
the list. Consider:

---------- test.py ----------
def delfromstart(lst):
    while lst:
        del lst[0]

def delfromend(lst):
    for i in range(len(lst)-1, -1, -1):
        del lst[i]
-----------------------------

[D:\Steve]$ python -m timeit -s "import test" "test.delfromstart(range(1000))"
1000 loops, best of 3: 1.09 msec per loop

[D:\Steve]$ python -m timeit -s "import test" "test.delfromend(range(1000))"
1000 loops, best of 3: 301 usec per loop

Note that Python lists are implemented basically as arrays, which means
that deleting an item from anywhere but the end of the list is O(n)
because all items in the list must be moved down to fill the hole.

Repeated deletes from a list are generally not the way to go, as your
example shows. =)

Steve
Jul 18 '05 #9


Bulba! wrote:
On 4 Jan 2005 14:33:34 -0800, "John Machin" <sj******@lexicon.net>
wrote:
(b) Fast forwarding 30+ years, let's look at the dictionary method,
assuming you have enough memory to hold all your data at once:

Step 1: read the "left" table; for each row, if english not in mydict,
then do mydict[english] = MyObject(). In any case, do
mydict[english].left_list.append(row)
Step 2: same for the "right" table.
Step 3: for english, obj in mydict.iteritems(): process(english, obj)
As your datasets are stored in MS Excel spreadsheets, N < 64K so
whether your solution is O(N) or O(N*log(N)) doesn't matter too
much. You are however correct to avoid O(N**2) solutions.


Following advice of two posters here (thanks) I have written two
versions of the same program, and both of them work, but the
difference in speed is drastic, about 6 seconds vs 190 seconds
for about 15000 of processed records, taken from 2 lists of
dictionaries.

I've read "Python Performance Tips" at

http://manatee.mojam.com/~skip/python/fastpython.html

..but still don't understand why the difference is so big.

Both versions use local variables, etc. Both have their
lists initially sorted. Both essentially use a loop with
conditional for comparison,
then process the record in the
same way.


"process the record in the same way"??? That's an interesting use of
"same".
The overhead of second version is that it also
uses cmp() function and two additional integer
variables - that should not slow the program _so much_.

I have measured the run of snippet 2 with time checkpoints
written to a log (write time delta to log every 200 records),
even made a graph of time deltas in spreadsheet and in fact
snippet 2 seems after initial slowdown looks exactly linear,
like that:

^ (time)
|
| /-----------
| /
|/
---------------> (# of records written)

So yes, it would scale to big files.
On your empirical evidence, as presented. However, do read on ...
However, why is it so
frigging slow?!


Mainly, because you are (unnecessarily) deleting the first item of a
list. This requires copying the remaining items. It is O(N), not O(1).
You are doing this O(N) times, so the overall result is O(N**2). Your
graph has no obvious explanation; after how many cycles does the speed
become constant?

Secondly, you are calling cmp() up to THREE times when once is enough.
Didn't it occur to you that your last elif needed an else to finish it
off, and the only possible action for the else suite was "assert
False"?

It would appear after reading your "snippet 2" a couple of times that
you are trying to implement the old 3-tape update method.

It would also appear that you are assuming/hoping that there are never
more than one instance of a phrase in either list.

You need something a little more like the following.

Note that in your snippet 2 it was not very obvious what you want to do
in the case where a phrase is in "new" but not in "old", and vice versa
-- under one circumstance (you haven't met "end of file") you do
nothing, but in the other circumstance you do something, yet seem to
have not only a polarity problem but also a copy-paste-edit problem. In
the following code I have abstracted the real requirements as
handle_XXX_unmatched().

!o = n = 0
!lenold = len(oldl)
!lennew = len(newl)
!while o < lenold and n < lennew:
!    cmp_result = cmp(oldl[o]['English'], newl[n]['English'])
!    if cmp_result == 0:
!        # Eng phrase is in both "new" and "old"
!        cm.writerow(matchpol(oldl[o], newl[n]))
!        o += 1
!        n += 1
!    elif cmp_result < 0:
!        # Eng phrase is in "old", not in "new"
!        handle_old_unmatched(o)
!        o += 1
!    else:
!        assert cmp_result > 0 # :-)
!        # Eng phrase is in "new", not in "old"
!        handle_new_unmatched(n)
!        n += 1
!while o < lenold:
!    # EOF on new, some old remain
!    handle_old_unmatched(o)
!    o += 1
!while n < lennew:
!    # EOF on old, some new remain
!    handle_new_unmatched(n)
!    n += 1

Some general hints: Try stating your requirements clearly, to yourself
first and then to us. Try to ensure that your code is meeting those
requirements before you bother timing it. Try not to use single-letter
names -- in particular, using l (that's "L".lower(), not 1 i.e.
str(4-3)) is barf-inducing and makes people likely not to want to read
your code.

HTH,
John

Jul 18 '05 #10

On Mon, 10 Jan 2005 17:52:42 +0100, Bulba! <bu***@bulba.com> wrote:
I don't see why should deleting element from a list be O(n), while
saying L[0]='spam' when L[0] previously were, say, 's', not have the
O(n) cost, if a list in Python is just an array containing the
objects itself?

Why should JUST deletion have an O(n) cost?


Because after deletion L[1] moved to L[0], L[2] moved to L[1],
L[3] moved to L[2] and so on. To delete the first element you
have to move n-1 pointers and this is where O(n) comes from.
When you reassign any element there is no need to move the
others around, so that's why you have O(1) complexity.

With a data structure slightly more complex than an array
you can have random access in O(1), deletion of elements in
O(1) at *both ends* and insertion in amortized O(1) at
*both ends*. This data structure is called a double-ended
queue (nickname "deque") and is available in Python.
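
For instance, a small illustration (collections.deque is in the standard
library from Python 2.4 on):

from collections import deque

d = deque(['a', 'b', 'c'])
d.popleft()          # removes 'a' in O(1); del somelist[0] would be O(n)
d.append('d')        # O(1) append at the right end
d.appendleft('z')    # O(1) insert at the left end as well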

The decision was that for the basic list object the overhead
added by deques for element access (it's still O(1), but a
bit more complex than just bare pointer arithmetic) and, I
guess, the hassle of changing a lot of working code and
breaking compatibility with extensions that manipulate lists
directly (no idea if such a thing exists) was not worth the gain.

The gain would have been that someone who doesn't know what O(n)
means and uses lists for long FIFOs would get fast
programs anyway, without understanding why. With the current
solution they just have to use deques instead of lists.

After thinking about it for a while I agree that this is a
reasonable choice. The gain is anyway IMO very little,
because if a programmer doesn't understand what O(n) is,
then the probability that any reasonably complex program
they write is going to be fast is zero anyway... time would
just be wasted somewhere else for no reason.

Andrea
Jul 18 '05 #11
