469,272 Members | 1,403 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,272 developers. It's quick & easy.

how can I sort a bunch of lists over multiple fields?

I didn't think this would be as difficult as it now seems to me.

I am reading in a csv file that documents a bunch of different info on
about 200 books, such as title, author, publisher, isbn, date and
several other bits of info too.

I can do a simple sort over the first field (title as it turns out),
and that is fine as far as it gets:
import string

bookData = open(r'D:\path\to\books.csv', 'r')
sBookData = bookData.read()
lBooks = string.split(sBookData, '\n')

lBooks.sort()
sorted = string.join(lBooks, '\n')
output = open(r'D:\path\to\output.csv', 'w')
output.close()
I really want to be able to sort the list of books based on other
criterium, and even multiple criteria (such as by author, and then by
date.)

I am using python 2.4 and have found this site:

http://aspn.activestate.com/ASPN/Coo.../Recipe/305304

and so I tried doing things like
lBooks.sort(cmp=cmp5) Traceback (most recent call last):
File "<interactive input>", line 1, in ?
NameError: name 'cmp5' is not defined

(I was hoping that cmp5 meant it would use the 5th item in the lists to
sort across)
lBooks.sort(key=lambda i:i[4])

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 1, in <lambda>
IndexError: string index out of range
(I was hoping for similar things)
would you be so kind as to point me in the right direction?

THanks!

googleboy

Jul 19 '05 #1
18 5903
I think if you work backwards, it comes out:

a = ['bob', 'greg', 'cindy', 'alice']
b = ['fred',barney','betty','wilma','pebbles','bambam']
c = ['jed', 'granny', 'jethro', 'ellie-mae']
d = ['bob','carol','ted','alice']

e = [a,b,c,d]

for ary in e:
print ary

e.sort(lambda x,y:cmp(x[1],y[1]))

for ary in e:
print ary

e.sort(lambda x,y:cmp(x[0],y[0]))

for ary in e:
print ary

James

On Wednesday 27 April 2005 10:34 am, so sayeth googleboy:
I didn't think this would be as difficult as it now seems to me.

I am reading in a csv file that documents a bunch of different info on
about 200 books, such as title, author, publisher, isbn, date and
several other bits of info too.

I can do a simple sort over the first field (title as it turns out),
and that is fine as far as it gets:
import string

bookData = open(r'D:\path\to\books.csv', 'r')
sBookData = bookData.read()
lBooks = string.split(sBookData, '\n')

lBooks.sort()
sorted = string.join(lBooks, '\n')
output = open(r'D:\path\to\output.csv', 'w')
output.close()
I really want to be able to sort the list of books based on other
criterium, and even multiple criteria (such as by author, and then by
date.)

I am using python 2.4 and have found this site:

http://aspn.activestate.com/ASPN/Coo.../Recipe/305304

and so I tried doing things like
lBooks.sort(cmp=cmp5)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
NameError: name 'cmp5' is not defined

(I was hoping that cmp5 meant it would use the 5th item in the lists to
sort across)
lBooks.sort(key=lambda i:i[4])


Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 1, in <lambda>
IndexError: string index out of range
(I was hoping for similar things)
would you be so kind as to point me in the right direction?

THanks!

googleboy


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Jul 19 '05 #2
What you want I guess is to read first all lines of the file into a
string as you did, and then let the split method split it based on
newlines only - see example below.

Then you use split again to put all elements of one line into another
list - split it on commas.

Now you can define sortfunctions for all columns you want to sort, e.g.
like below - and use those to compare elements. You get a script like:
-#!/usr/bin/env python
-
-def cmp_index(a, b, ndx):
- if a[ndx] < b[ndx]:
- return -1
- elif a[ndx] > b[ndx]:
- return 1
- else:
- return 0
-
-def cmp_0(a, b):
- return cmp_index(a, b, 0)
-
-def cmp_1(a, b):
- return cmp_index(a, b, 1)
-
-s = 'Kikker en Eend,Max Veldhuis\nDikkie Dik,Jet Boeke\nRuminations on
C++,Andrew Koenig & Barbara Moo'
-s = s.split('\n')
-l = []
-for i in s:
- l.append(i.split(','))
-
-l.sort(cmp_0)
-print l
-l.sort(cmp_1)
-print l

with output like:
martin@ubuntu:~ $ ./test.py
[['Dikkie Dik', 'Jet Boeke'], ['Kikker en Eend', 'Max Veldhuis'],
['Ruminations on C++', 'Andrew Koenig & Barbara Moo']]
[['Ruminations on C++', 'Andrew Koenig & Barbara Moo'], ['Dikkie Dik',
'Jet Boeke'], ['Kikker en Eend', 'Max Veldhuis']]
martin@ubuntu:~ $

Jul 19 '05 #3
Oops, last one had a typo:
a = ['bob', 'greg', 'cindy', 'alice']
b = ['fred','barney','betty','wilma','pebbles','bambam']
c = ['jed', 'granny', 'jethro', 'ellie-mae']
d = ['bob','carol','ted','alice']

e = [a,b,c,d]

for ary in e:
print ary

e.sort(lambda x,y:cmp(x[1],y[1]))

for ary in e:
print ary

e.sort(lambda x,y:cmp(x[0],y[0]))

for ary in e:
print ary

James

On Wednesday 27 April 2005 10:34 am, so sayeth googleboy:
I didn't think this would be as difficult as it now seems to me.

I am reading in a csv file that documents a bunch of different info on
about 200 books, such as title, author, publisher, isbn, date and
several other bits of info too.

I can do a simple sort over the first field (title as it turns out),
and that is fine as far as it gets:
import string

bookData = open(r'D:\path\to\books.csv', 'r')
sBookData = bookData.read()
lBooks = string.split(sBookData, '\n')

lBooks.sort()
sorted = string.join(lBooks, '\n')
output = open(r'D:\path\to\output.csv', 'w')
output.close()
I really want to be able to sort the list of books based on other
criterium, and even multiple criteria (such as by author, and then by
date.)

I am using python 2.4 and have found this site:

http://aspn.activestate.com/ASPN/Coo.../Recipe/305304

and so I tried doing things like
lBooks.sort(cmp=cmp5)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
NameError: name 'cmp5' is not defined

(I was hoping that cmp5 meant it would use the 5th item in the lists to
sort across)
lBooks.sort(key=lambda i:i[4])


Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 1, in <lambda>
IndexError: string index out of range
(I was hoping for similar things)
would you be so kind as to point me in the right direction?

THanks!

googleboy


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Jul 19 '05 #4
James Stroud wrote:
Oops, last one had a typo:
a = ['bob', 'greg', 'cindy', 'alice']
b = ['fred','barney','betty','wilma','pebbles','bambam']
c = ['jed', 'granny', 'jethro', 'ellie-mae']
d = ['bob','carol','ted','alice']

e = [a,b,c,d]

for ary in e:
print ary

e.sort(lambda x,y:cmp(x[1],y[1]))

for ary in e:
print ary

e.sort(lambda x,y:cmp(x[0],y[0]))

for ary in e:
print ary


I would probably use the key= argument instead of the cmp= argument.
Not only is it easier, but it'll probably be faster too:

py> lst = [['bob', 'greg', 'cindy', 'alice'],
.... ['fred', 'barney', 'betty', 'wilma', 'pebbles', 'bambam'],
.... ['jed', 'granny', 'jethro', 'ellie-mae'],
.... ['bob', 'carol', 'ted', 'alice']]
py> import operator
py> lst.sort(key=operator.itemgetter(3))
py> lst
[['bob', 'greg', 'cindy', 'alice'], ['bob', 'carol', 'ted', 'alice'],
['jed', 'granny', 'jethro', 'ellie-mae'], ['fred', 'barney', 'betty',
'wilma', 'pebbles', 'bambam']]
py> lst.sort(key=operator.itemgetter(0))
py> lst
[['bob', 'greg', 'cindy', 'alice'], ['bob', 'carol', 'ted', 'alice'],
['fred', 'barney', 'betty', 'wilma', 'pebbles', 'bambam'], ['jed',
'granny', 'jethro', 'ellie-mae']]
py> lst.sort(key=operator.itemgetter(slice(2)))
py> lst
[['bob', 'carol', 'ted', 'alice'], ['bob', 'greg', 'cindy', 'alice'],
['fred', 'barney', 'betty', 'wilma', 'pebbles', 'bambam'], ['jed',
'granny', 'jethro', 'ellie-mae']]

Note that you can pass slice objects to operator.itemgetter, so the last
example is like "key=lambda x: x[:2]".

STeVe
Jul 19 '05 #5
Might as well make a class for Book instead of dealing with buckets of
lists...

class Book(object):
def __init__(self, title, author, publisher, isbn, date): # add
more fields according to CSV
self.__dict__.update(locals()) # lazy!
def __repr__(self):
return '<"%s" by %s>' % (self.title, self.author)

def read_books(filename):
import csv # (this battery is included)
csv_file = open(filename, "rb")
reader = csv.reader(csv_file)
books = [Book(*[field.strip() for field in row]) for row in reader]
csv_file.close()
return books

if __name__ == '__main__':
books = read_books("my_csv_file.csv")
import operator

# example - Sort by author
books.sort(key = operator.attrgetter("author"))

for book in books:
print book

Jul 19 '05 #6
Lonnie Princehouse wrote:
Might as well make a class for Book instead of dealing with buckets of
lists...

class Book(object):
def __init__(self, title, author, publisher, isbn, date): # add
more fields according to CSV
self.__dict__.update(locals()) # lazy!


Just for any newbies not paying attention, a slightly less lazy approach
that doesn't leave the Book instance referencing itself would do
something like:

params = locals()
del params['self']
self.__dict__.update(params)

or just write it out:

self.title = title
self.author = author
self.publisher = publisher
self.isbn = isbn
self.date = date

=)

STeVe
Jul 19 '05 #7
I'd be just such a newbie; I don't understand why it would matter if I
left the book instance referencing itself....

However these wonderful responses have gotten me a very long way
towards my goal. I just have a couple of quick questions.

firstly, I am trying hard to figure out how to create a new file with
the list rather than print to standard out. I haev done this:

for book in books:
print book # just to be sure it works as I expect
sort1 = open(r'D:\path to\sort1.csv', 'w+')
print >> sort1, book
sort1.close()

and this creates the file as I expect, however it creates it populated
with only the information of the final book in the sorted list. I am
guessing I need to figure out how to append as part of this loop, but
the only info I have found so far suggests this should append by
default....?

Secondly, I am wondering how I can get a search algorithm that will
search by multiple fields here, so that I can (as one example) sort
the books out by author and then date, to present a list of the book
grouped by authors and having each group presented in a chronological
order, or by author and title, grouping all the books up into authors
presenting each group alphabetically by title. Or by publisher and
date, or by publisher and code....

I have tried things like

books.sort(key = operator.attrgetter("author"), key =
operator.attrgetter("title") and
books.sort(key = operator.attrgetter("author", "title")

but they both give errors.

Is this where using cmd functions instead of keys becomes necessary?

Thanks!

googleboy

Jul 19 '05 #8
How about using the csv module instead of splitting ?
wi******@hotmail.com wrote:
What you want I guess is to read first all lines of the file into a
string as you did, and then let the split method split it based on
newlines only - see example below.

Then you use split again to put all elements of one line into another
list - split it on commas.

Now you can define sortfunctions for all columns you want to sort, e.g.
like below - and use those to compare elements. You get a script like:
-#!/usr/bin/env python
-
-def cmp_index(a, b, ndx):
- if a[ndx] < b[ndx]:
- return -1
- elif a[ndx] > b[ndx]:
- return 1
- else:
- return 0
-
-def cmp_0(a, b):
- return cmp_index(a, b, 0)
-
-def cmp_1(a, b):
- return cmp_index(a, b, 1)
-
-s = 'Kikker en Eend,Max Veldhuis\nDikkie Dik,Jet Boeke\nRuminations on
C++,Andrew Koenig & Barbara Moo'
-s = s.split('\n')
-l = []
-for i in s:
- l.append(i.split(','))
-
-l.sort(cmp_0)
-print l
-l.sort(cmp_1)
-print l

with output like:
martin@ubuntu:~ $ ./test.py
[['Dikkie Dik', 'Jet Boeke'], ['Kikker en Eend', 'Max Veldhuis'],
['Ruminations on C++', 'Andrew Koenig & Barbara Moo']]
[['Ruminations on C++', 'Andrew Koenig & Barbara Moo'], ['Dikkie Dik',
'Jet Boeke'], ['Kikker en Eend', 'Max Veldhuis']]
martin@ubuntu:~ $


Jul 19 '05 #9
googleboy wrote:
firstly, I am trying hard to figure out how to create a new file with
the list rather than print to standard out. I haev done this:

for book in books:
print book # just to be sure it works as I expect
sort1 = open(r'D:\path to\sort1.csv', 'w+')
print >> sort1, book
sort1.close()

and this creates the file as I expect, however it creates it populated
with only the information of the final book in the sorted list.
You're reopening the file on each iteration of the loop. I think you
want to open it only once, before the loop, e.g.

sort1_file = open(r'D:\path to\sort1.csv', 'w+')
for book in books:
sort1_file.write('%s\n' % book) # same as "print >> sort1, book"
sort1_file.close()

Note that the opening and closing of the file is outside the loop.
Secondly, I am wondering how I can get a search algorithm that will
search by multiple fields here, so that I can (as one example) sort
the books out by author and then date, to present a list of the book
grouped by authors and having each group presented in a chronological
order, or by author and title, grouping all the books up into authors
presenting each group alphabetically by title. Or by publisher and
date, or by publisher and code....

I have tried things like

books.sort(key = operator.attrgetter("author"), key =
operator.attrgetter("title") and
books.sort(key = operator.attrgetter("author", "title")

but they both give errors.


The problem is that operator.attrgetter only accepts a single attribute.
Basically, attrgetter looks something like:

def attrgetter(attr_name):
def func(obj):
return getattr(obj, attr_name)
return func

So attrgetter can't really solve your problem. However, you can create
a similar function that should do the job. Something like (untested):

def get_key(*attr_names):
def key(book):
return [getattr(book, name) for name in attr_names)]
return key

Then you should be able to do something like:

books.sort(key=get_key("author", "title"))

The trick is that the inner function, 'key', looks up a sequence of
attributes on the book object, instead of just a single attribute like
attrgetter does.

HTH,

STeVe
Jul 19 '05 #10
> I'd be just such a newbie; I don't understand why it would matter if
I
left the book instance referencing itself....
It's just kind of sloppy and unnecessary to have self.self
firstly, I am trying hard to figure out how to create a new file with the list rather than print to standard out. I haev done this:
I think you want 'a' instead of 'w+' as the file's mode. You can also
open and close the file outside of the loop; it would be more
efficient.
Secondly, I am wondering how I can get a search algorithm that will
search by multiple fields here, so that I can (as one example) sort
the books out by author and then date, to present a list of the book
grouped by authors and having each group presented in a chronological
order, or by author and title, grouping all the books up into authors presenting each group alphabetically by title. Or by publisher and
date, or by publisher and code....


So far, we've been using the "key" parameter of list.sort. If you want
sort criteria more complicated than a single attribute, you can sort
based on a custom comparison function. Comparison functions compare
two objects (let's call them A and B), and return one of three possible
values:

A is greater than B => 1
A is less than B => -1
A and B are equal => 0

The built-in function "cmp" can be used to compare objects; behavior is
defined for built-in types:

cmp(0, 1) => -1
cmp("Zylophone", "Abstract") => 1 # alphabetical ordering for strings
cmp( [1,2,3], [1,2,3] ) => 0 # compare sequence elements from left to
right

So another way to do a sort-by-author for your books would be:

def compare_authors(book1, book2):
return cmp(book1.author, book2.author)

books.sort(compare_authors)

A more complicated comparison function might nest two others:

def compare_dates(book1, book2):
# Assuming that your dates are either numerical or are strings
for which
# alphabetical sorting is identical to chronological...
return cmp(book1.date, book2.date)

def compare_author_and_date(book1, book2):
different_authors = compare_authors(book1, book2)
if different_authors: # different authors
return different_authors
else: # same author. sort by date.
return compare_dates(book1, book2)

books.sort(compare_author_and_date)

Jul 19 '05 #11
Lonnie Princehouse wrote:
So far, we've been using the "key" parameter of list.sort. If you want
sort criteria more complicated than a single attribute, you can sort
based on a custom comparison function.
Actually, the key= parameter can do anything the cmp= parameter can:

class Key(object):
def __init__(self, item)
self.item = item
def __cmp__(self, other):
# put your usual cmp code here
cmp(self.item, other)
lst.sort(key=Key)

Of course this is a pretty silly way to write a cmp function. But the
point is that you shouldn't think of the key= parameter as only useful
for simple comparisons. See
http://mail.python.org/pipermail/pyt...il/277448.html
for a recent example of a pretty complex key function.

I would guess that 80-90% of all uses of sort that need a custom
comparison function can be met easily by using the key= parameter and
will be more efficient than using the cmp= parameter.

The above "Key" example has the same inefficiency problems that the cmp=
parameter normally does, but in most cases, you won't need to define a
custom __cmp__ function, and you can rely on __cmp__ functions
implemented in C, like those of strs and tuples (as I do below).
So another way to do a sort-by-author for your books would be:

def compare_authors(book1, book2):
return cmp(book1.author, book2.author)

books.sort(compare_authors)
This is definitely not a case where you want to use a comparison
function. It will be much more efficient to write:

def author_key(book):
return book.author
books.sort(key=author_key)
A more complicated comparison function might nest two others:

def compare_dates(book1, book2):
# Assuming that your dates are either numerical or are strings
for which
# alphabetical sorting is identical to chronological...
return cmp(book1.date, book2.date)

def compare_author_and_date(book1, book2):
different_authors = compare_authors(book1, book2)
if different_authors: # different authors
return different_authors
else: # same author. sort by date.
return compare_dates(book1, book2)

books.sort(compare_author_and_date)


Likewise, the above is basically just an inefficient way of writing:

def date_key(book):
return book.data

def author_and_date_key(book):
return (author_key(book), date_key(book))

books.sort(key=author_and_date_key)

Note that the thing I take advantage of here is that tuples are
comparable, and compare as you'd expect them to (in lexicographic order).

STeVe
Jul 19 '05 #12
> Likewise, the above is basically just an inefficient way of writing:

def date_key(book):
return book.data

def author_and_date_key(book):
return (author_key(book), date_key(book))


It's certainly more elegant, but I wanted to talk about the mechanics
of comparison functions =)

I don't know that it's more or less efficient in execution. That
depends on a few factors.

Jul 19 '05 #13
Lonnie Princehouse wrote:
Likewise, the above is basically just an inefficient way of writing:

def date_key(book):
return book.data

def author_and_date_key(book):
return (author_key(book), date_key(book))

It's certainly more elegant, but I wanted to talk about the mechanics
of comparison functions =)

I don't know that it's more or less efficient in execution. That
depends on a few factors.


For most cases, key= will be more efficient than cmp=. The key=
argument will only be called once for each item in the list. The cmp=
argument will be called once for each comparison:

py> class Cmp(object):
.... def __init__(self):
.... self.count = 0
.... def __call__(self, x, y):
.... self.count += 1
.... return cmp(x, y)
....
py> class Key(object):
.... def __init__(self):
.... self.count = 0
.... def __call__(self, x):
.... self.count += 1
.... return x
....
py> import random
py> lst = range(1000)
py> random.shuffle(lst)
py> lst2 = list(lst)
py> c = Cmp()
py> lst.sort(cmp=c)
py> c.count
8599
py> k = Key()
py> lst.sort(key=k)
py> k.count
1000

Since in all but a few corner cases (e.g. where the list is already
sorted) there will be way more comparisons than items, the key= argument
will minimize the number of function calls. Since function calls are
one of the more expensive operations in Python, my guess is that there
are very few cases where you would want to use the cmp= argument if you
know you can use the key= argument.

STeVe
Jul 19 '05 #14
key and cmp are equivalent for non-trivial cases (defined below), and
for trivial cases the performance will be a trade-off between more
function calls with cmp versus more memory use with key. Granted, for
the smallish lists that the majority of Python sorts are done on, key
is probably the better choice.

Non-trivial:
Any time the user decides to (1) cache intermediary results for a cmp
function or (2) define a __cmp__ function for the key return value.
These arise when comparison of partial keys is often sufficient to
establish order between two objects.

Jul 19 '05 #15
googleboy wrote:

I am reading in a csv file that documents a bunch of different info on about 200 books, such as title, author, publisher, isbn, date and
several other bits of info too.
...
I really want to be able to sort the list of books based on other
criterium, and even multiple criteria (such as by author, and then by
date.)


import string

input = open(r'c:\books.csv', 'r')
records = input.readlines()
input.close()

# assuming first line contains headers
headers = records.pop(0)
records = [x.strip().split(',') for x in records]

# header order
p_headers ='(title, author, publisher, isbn, date, other)'
p_sorts = '(author, title, date, publisher, other, isbn)'

temp_records = []
for r in records:
exec '%(p_headers)s = r' % vars()
exec 't = %(p_sorts)s' % vars()
temp_records.append(t)

temp_records.sort()

sorted_records = []
for t in temp_records:
exec '%(p_sorts)s = t' % vars()
exec 'r = %(p_headers)s' % vars()
sorted_records.append(r)

lines = [headers] + [','.join(x)+'\n' for x in sorted_records]
output = open(r'c:\output.csv', 'w')
output.writelines(lines)
output.close()

Jul 19 '05 #16
Lonnie Princehouse wrote:
Non-trivial:
Any time the user decides to (1) cache intermediary results for a cmp
function or (2) define a __cmp__ function for the key return value.
These arise when comparison of partial keys is often sufficient to
establish order between two objects.


If you want to compare partial keys, often the simplest approach is to
use tuples as keys, with the elements of the tuple being the "parts" of
the key. Python's standard tuple comparison mechanism takes care of the
rest.

STeVe
Jul 19 '05 #17
> If you want to compare partial keys, often the simplest approach is
to
use tuples as keys, with the elements of the tuple being the "parts" of the key. Python's standard tuple comparison mechanism takes care of the rest.


Right, but suppose it's expensive to generate each part of the key -
e.g. a database query or some kind of hash, so it doesn't make sense to
compute the whole key in advance if the first part is sufficient to
establish an ordering most of the time. This is what I had in mind (I
should have been more clear about that)

Sure, you could put together a lazy evaluation object to use for
elements of the tuple, but this again becomes equivalent to writing a
__cmp__ function...

Jul 19 '05 #18
Lonnie Princehouse wrote:
Right, but suppose it's expensive to generate each part of the key -
e.g. a database query or some kind of hash, so it doesn't make sense to
compute the whole key in advance if the first part is sufficient to
establish an ordering most of the time. This is what I had in mind (I
should have been more clear about that)


Yeah, that's probably a reasonable use case for the cmp= argument, as is
any case where you're *really* tight for memory (the key= sort requires
small extra auxiliary structures to hold the key values).

For newbies though, I still suspect that key= is the easiest and fastest
answer to most sorting problems. Advanced users will, of course, need
more advanced techniques. =)

STeVe
Jul 19 '05 #19

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

9 posts views Thread by Dave H | last post: by
12 posts views Thread by Eva | last post: by
1 post views Thread by neelu | last post: by
8 posts views Thread by markww | last post: by
3 posts views Thread by ian.brady1 | last post: by
reply views Thread by suresh191 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.