Perl and Python, a practical side-by-side example.

Shawn Milo

I'm new to Python and fairly experienced in Perl, although that
experience is limited to the things I use daily.

I wrote the same script in both Perl and Python, and the output is
identical. The run speed is similar (very fast) and the line count is
similar.

Now that they're both working, I was looking at the code and wondering
what Perl-specific and Python-specific improvements to the code would
look like, as judged by others more knowledgeable in the individual
languages.

I am not looking for the smallest number of lines, or anything else
that would make the code more difficult to read in six months. Just
any instances where I'm doing something inefficiently or in a "bad"
way.

I'm attaching both the Perl and Python versions, and I'm open to
comments on either. The script reads a file from standard input and
finds the best record for each unique ID (piid). The best is defined
as follows: The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6). If
there is no record matching the desired state, then just take the
newest expiration date.

Thanks for taking the time to look at these.

Shawn

################################################## ########################
Perl code:
################################################## ########################
#! /usr/bin/env perl

use warnings;
use strict;

my $piid;
my $row;
my %input;
my $best;
my $curr;

foreach $row (<>){

chomp($row);
$piid = (split(/\t/, $row))[0];

push ( @{$input{$piid}}, $row );
}

for $piid (keys(%input)){

$best = "";

for $curr (@{$input{$piid}}){
if ($best eq ""){
$best = $curr;
}else{
#If the current record is the correct state

if ((split(/\t/, $curr))[1] eq (split(/\t/, $curr))[6]){
#If existing record is the correct state
if ((split(/\t/, $best))[1] eq (split(/\t/, $curr))[6]){
if ((split(/\t/, $curr))[5] gt (split(/\t/, $best))[5]){
$best = $curr;
}
}else{
$best = $curr;
}
}else{
#if the existing record does not have the correct state
#and the new one has a newer expiration date
if (((split(/\t/, $best))[1] ne (split(/\t/, $curr))[6]) and
((split(/\t/, $curr))[5] gt (split(/\t/, $best))[5])){
$best = $curr;
}
}
}
}
print "$best\n";
}

################################################## ########################
End Perl code
################################################## ########################

################################################## ########################
Python code
################################################## ########################

#! /usr/bin/env python

import sys

input = sys.stdin

recs = {}

for row in input:
row = row.rstrip('\n')
piid = row.split('\t')[0]
if recs.has_key(piid) is False:
recs[piid] = []
recs[piid].append(row)

for piid in recs.keys():
best = ""
for current in recs[piid]:
if best == "":
best = current;
else:
#If the current record is the correct state
if current.split("\t")[1] == current.split("\t")[6]:
#If the existing record is the correct state
if best.split("\t")[1] == best.split("\t")[6]:
#If the new record has a newer exp. date
if current.split("\t")[5] best.split("\t")[5]:
best = current
else:
best = current
else:
#If the existing record does not have the correct state
#and the new record has a newer exp. date
if best.split("\t")[1] != best.split("\t")[6] and
current.split("\t")[5] best.split("\t")[5]:
best = current

print best
################################################## ########################
End Python code
################################################## ########################

Mar 2 '07 #1

Subscribe Post Reply

2183

bearophileHUGS

Few suggestions, some important, some less important. All my
suggestions are untested.
Use 4 spaces to indent.
If you want to speed up this code you can move it inside a function.
After that, if you want to make it even faster you can use Psyco too.
Ho are the dates represented? How do you test what is the older one?
You seem to compute current.split("\t") and best.split("\t") many
times, so you can compute it once only.
You can keep best and best_splitted.
You can split the last line:
if best.split("\t")[1] != best.split("\t")[6] and \
current.split("\t")[5] best.split("\t")[5]:
input() is a builtin function, so you may want to use a different
name, or just:
for row in sys.stdin:
Instead of:
row = row.rstrip('\n')
You often may use just:
row = row.strip()
Instead of:
piid = row.split('\t')[0]
You can probably use (test it):
piid = row.split('\t', 1)[0]
Instead of:
if recs.has_key(piid) is False:
Better:
if piid not in recs:
Instead of (on Python 2.5):
recs = {}
for row in input:
row = ...
piid = ...
if recs.has_key(piid) is False:
recs[piid] = []
recs[piid].append(row)
You can probably use:
from collection import defaultdict
recs = defaultdict(list)
for row in input:
row = ...
piid = ...
recs[piid].append(row)
Instead of:
for piid in recs.keys():
You can use this, lazily:
for piid in recs:
Instead of:
for piid in recs.keys():
best = ""
for current in recs[piid]:
You can probably use:
for piid, piii_recs in recs.iteritems():
best = ""
for current in piii_recs:
But your version may be a little faster anyway.
Instead of:
best = ""
for current in recs[piid]:
if best == "":
best = current;
You may want to use the singleton None:
best = None
for current in recs[piid]:
if best is None:
best = current
Note that to read such files you may use the csv module (present in
the standard library).

You can try to modify the code and show us the second working version.

Bye,
bearophile

Mar 3 '07 #2

John Machin

On Mar 3, 9:44 am, "Shawn Milo" <S...@Milochik.comwrote:

I'm new to Python and fairly experienced in Perl, although that
experience is limited to the things I use daily.

I wrote the same script in both Perl and Python, and the output is
identical. The run speed is similar (very fast) and the line count is
similar.

Now that they're both working, I was looking at the code and wondering
what Perl-specific and Python-specific improvements to the code would
look like, as judged by others more knowledgeable in the individual
languages.

I am not looking for the smallest number of lines, or anything else
that would make the code more difficult to read in six months. Just
any instances where I'm doing something inefficiently or in a "bad"
way.

I'm attaching both the Perl and Python versions, and I'm open to
comments on either. The script reads a file from standard input and
finds the best record for each unique ID (piid). The best is defined
as follows: The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6). If
there is no record matching the desired state, then just take the
newest expiration date.

[big snip]
Here is my rewrite in what I regard as idiomatic reasonably-modern
Python (OMMV of course). A few of the comments are applicable
irrespective of the language used.

HTH,
John
8<------
#! /usr/bin/env python

### Layout: Use 4-space indent. Don't use tabs. Don't exceed 79 chars
per line.

import sys

def process_file(opened_file=sys.stdin):
### Local variable access is faster than global

### input = sys.stdin ### shadowing built_in function "input"

### Use names to access elements in rows
PIID = 0
ACTUAL_STATE = 1
DESIRED_STATE = 6
EXPIRY_DATE = 5

recs = {}

for row in opened_file:
row = row.rstrip('\n').split('\t')
### Do the split('\t') *once* per row
piid = row[PIID]
### if recs.has_key(piid) is False:
### has_key() is ancient
### "if boolean_expression is False" is megabletchworthy;
### use "if not boolean_expression"
if piid not in recs:
recs[piid] = []
recs[piid].append(row)

### for piid in recs.keys():
for piid in recs:
best = None ### use out-of-band sentinel
for current in recs[piid]:
if best is None:
best = current ### had cockroach crap (";") at EOL
else:
#If the current record is the correct state
### Clear code (like the next line) doesn't need
comments
### like the above line
if current[ACTUAL_STATE] == current[DESIRED_STATE]:
if best[ACTUAL_STATE] == best[DESIRED_STATE]:
if current[EXPIRY_DATE] best[EXPIRY_DATE]:
best = current
else:
best = current
else:
if (best[ACTUAL_STATE] != best[ACTUAL_STATE]
and current[EXPIRY_DATE] best[EXPIRY_DATE]):
best = current
print "\t".join(best)

if __name__ == "__main__":
### Standard idiom to avoid executing script content
### when/if file is imported
process_file()
8<----

Mar 3 '07 #3

Bruno Desthuilliers

Shawn Milo a écrit :

I'm new to Python and fairly experienced in Perl, although that
experience is limited to the things I use daily.

I wrote the same script in both Perl and Python, and the output is
identical. The run speed is similar (very fast) and the line count is
similar.

Now that they're both working, I was looking at the code and wondering
what Perl-specific and Python-specific improvements to the code would
look like, as judged by others more knowledgeable in the individual
languages.

I am not looking for the smallest number of lines, or anything else
that would make the code more difficult to read in six months. Just
any instances where I'm doing something inefficiently or in a "bad"
way.

(snip)

>
#! /usr/bin/env python

import sys

input = sys.stdin

recs = {}

for row in input:
row = row.rstrip('\n')
piid = row.split('\t')[0]
if recs.has_key(piid) is False:

'is' is the identity operator - practically, in CPython, it compares
memory addresses. You *dont* want to use it here.

Another way to test if a key is already in a dict is to use the 'in'
operator, ie:
if not piid in recs:

recs[piid] = []
recs[piid].append(row)

As a last point, dicts have a setdefault(key, default) method that
returns the value associated with key if it already exists, else add the
key=>default pair and returns default, but it's a little bit slower.
for row in input:
row = row.rstrip('\n')
piid = row.split('\t')[0]
recs.setdefault(piid, []).append(row)

or

for row in input:
row = row.rstrip('\n')
piid = row.split('\t')[0]
if piid not in recs:
recs[piid] = []
recs[piid].append(row)

for piid in recs.keys():

You don't need to call key() here - it's already the default iteration
for dicts.

But since you want the values too, you should use dict.items() or
dict.iteritems():

for piid, rows in recs.iteritems()

best = ""
for current in recs[piid]:
if best == "":
best = current;

Get rid of this ";" !-)

Better to use None - it's the "no value" object, and it let you use an
identity test:

best = None
for current in rows:
if best is None:
best = row

And while we're at it, you can save yourself a test here:

best = row[0]
for current in row[1:]:
# start tests

#If the current record is the correct state
if current.split("\t")[1] == current.split("\t")[6]:
#If the existing record is the correct state
if best.split("\t")[1] == best.split("\t")[6]:
#If the new record has a newer exp. date
if current.split("\t")[5] best.split("\t")[5]:

You're repeatingly calling split() on the same objects. This is a
serious waste of time and CPU.

best = current
else:
best = current
else:
#If the existing record does not have the correct state
#and the new record has a newer exp. date
if best.split("\t")[1] != best.split("\t")[6] and
current.split("\t")[5] best.split("\t")[5]:
best = current

print best

Here's a somewhat corrected version:
# ---
import sys

def findbests(input=sys.stdin, output=sys.stdout):
DATE = 5
TARGET = 6
STATE = 1
recs = {}

for row in input:
row = row.rstrip('\n').split("\t")
piid = row[0]
if piid not in recs:
recs[piid] = []
recs[piid].append(row)

for piid, rows in recs.iteritems():
best = rows[0]
for current in rows[1:]:
if current[STATE] == current[TARGET]:
if best[STATE] == best[TARGET]:
if current[DATE] best[DATE]:
best = current
else:
best = current
elif best[STATE] != best[TARGET] \
and current[DATE] best[DATE]:
best = current

print >output, "\t".join(best)

if __name__ == '__main__':
findbdest()

# ---

It's somewhat shorter, a bit more readable IMHO, and somewhat faster too
(=~ 30/40% faster on my machine). Also, it's usable as both a program
and an importable module, and with any line input and file-like output.

Now for the bad news: I'm afraid your algorithm is broken : here are my
test data and results:

input = [
#ID STATE ... ... ... TARG DATE
"aaa\tAAA\t...\t...\t...\tBBB\t20071212\n",
"aaa\tAAA\t...\t...\t...\tAAA\t20070120\n",
"aaa\tAAA\t...\t...\t...\tAAA\t20070101\n",
"aaa\tAAA\t...\t...\t...\tBBB\t20071010\n",
"aaa\tAAA\t...\t...\t...\tBBB\t20071111\n",
"ccc\tAAA\t...\t...\t...\tBBB\t20071201\n",
"ccc\tAAA\t...\t...\t...\tAAA\t20070101\n",
"ccc\tAAA\t...\t...\t...\tBBB\t20071212\n",
"ccc\tAAA\t...\t...\t...\tAAA\t20071212\n", # oops !
"bbb\tAAA\t...\t...\t...\tAAA\t20070101\n",
"bbb\tAAA\t...\t...\t...\tAAA\t20070101\n",
"bbb\tAAA\t...\t...\t...\tAAA\t20071212\n",
"bbb\tAAA\t...\t...\t...\tAAA\t20070612\n",
"bbb\tAAA\t...\t...\t...\tBBB\t20071212\n",
]

=>
aaa AAA ... ... ... BBB 20071212
bbb AAA ... ... ... BBB 20071212
ccc AAA ... ... ... BBB 20071201

At least for piid 'ccc', there's a record with matching state and newest
date. (exact same result as with your original version).

HTH

Mar 3 '07 #4

Bruno Desthuilliers

John Machin a écrit :

On Mar 3, 9:44 am, "Shawn Milo" <S...@Milochik.comwrote:

(snip)

>
[big snip]
Here is my rewrite in what I regard as idiomatic reasonably-modern
Python (OMMV of course).

(snip)

John, I *swear* I didn't read your code before posting my own version !

Mar 3 '07 #5

Paul Rubin

Here's my version (not tested much). Main differences from yours:

1. It defines a Python class to hold row data, and defines the __cmp__ operation
on the class, so given two Row objects r1,r2, you can say simply
if r1 r2: ...
to see which is "better".

2. Instead of reading all the rows into memory and then scanning the
list of records of each piid, it simply remembers the best it has seen
for each piid.

By putting the "better than" logic into the class definition, the main
loop becomes very simple. It does parse out and store fields on the
Row objects consuming some extra memory, but you could eliminate that
at the cost of a little code and speed by re-parsing as needed in the
comparison function.

================================================== ==============

#! /usr/bin/env python

import sys

class Row:
def __init__(self, row):
self.row = row.rstrip('\n')
fields = self.row.split('\t')
self.piid = fields[0]
self.state = fields[1]
self.expiration_date = fields[5]
self.desired_state = fields[6]

def __cmp__(self, other):
# return +1 if self is better than other, -1 if other is better
# than self, or 0 if they are equally good
if self.state == self.desired_state:
if other.state != other.desired_state:
return 1
return cmp(self.expiration_date, other.expiration_date)
elif other.expiration_date self.expiration_date:
# other record is better only if its exp date is newer
return 1
return 0

best = {}
input = sys.stdin

for row in input:
r = Row(row)
if r.piid not in best or r best[r.piid]:
best[r.piid] = r

for piid,r in best.iteritems():
print r.row

Mar 3 '07 #6

John Machin

On Mar 3, 12:36 pm, Bruno Desthuilliers >
[snip]

DATE = 5
TARGET = 6

[snip]

Now for the bad news: I'm afraid your algorithm is broken : here are my
test data and results:

input = [
#ID STATE ... ... ... TARG DATE
"aaa\tAAA\t...\t...\t...\tBBB\t20071212\n",

[snip]

Bruno, The worse news is that your test data is broken. According to
the OP (and your own code (DATE = 5 etc)), the target state comes
last. GIGO. At least you have demonstrated to the OP why naming the
field indexes is a good idea :-)
Cheers,
John

Mar 3 '07 #7

Bjoern Schliessmann

Bruno Desthuilliers wrote:

Shawn Milo a écrit :

> if recs.has_key(piid) is False:

'is' is the identity operator - practically, in CPython, it
compares memory addresses. You *dont* want to use it here.

It's recommended to use "is None"; why not "is False"? Are there
multiple False instances or is False generated somehow?

Regards,
Björn

--
BOFH excuse #135:

You put the disk in upside down.

Mar 3 '07 #8

Ben Finney

Bjoern Schliessmann <us**************************@spamgourmet.comwrite s:

Bruno Desthuilliers wrote:
Shawn Milo a Ã©crit :

if recs.has_key(piid) is False:
'is' is the identity operator - practically, in CPython, it
compares memory addresses. You *dont* want to use it here.

It's recommended to use "is None"; why not "is False"? Are there
multiple False instances or is False generated somehow?

I'd recommend against using "is False" simply because it's more
confusing. This is better::

if not recs.has_key(piid): # [1]

Moreover, any 'if' statement expects a boolean expression *anyway*, so
there's no point testing for identity against False. Otherwise, the
following would be just as reasonable::

if (recs.has_key(piid) is False) is True:

Or perhaps:

if (((((recs.has_key(piid) is False) is True) is False) is False) is True):
[1]: yes, this is even better written as 'if not piid in recs', but
that's beside the point for this discussion.

--
\ "To be is to do" -- Plato |
`\ "To do is to be" -- Aristotle |
_o__) "Do be do be do" -- Sinatra |
Ben Finney

Mar 3 '07 #9

attn.steven.kuo

On Mar 2, 2:44 pm, "Shawn Milo" <S...@Milochik.comwrote:

(snipped)

I'm attaching both the Perl and Python versions, and I'm open to
comments on either. The script reads a file from standard input and
finds the best record for each unique ID (piid). The best is defined
as follows: The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6). If
there is no record matching the desired state, then just take the
newest expiration date.

Thanks for taking the time to look at these.

My attempts:

### Perl ###

#!/usr/bin/perl
use strict;
use warnings;

use List::Util qw/reduce/;
use constant {
STATE =1,
DATE =6,
TARGET =5,
};

sub keep_best {
my ($best, $current) = @_;
if ($current->[STATE] eq $current->[TARGET]) {
if ($best->[STATE] eq $best->[TARGET]) {
if ($current->[DATE] gt $best->[DATE]) {
return 0;
}
} else {
return 0;
}
} elsif (
$best->[STATE] ne $best->[TARGET]
and
$current->[DATE] gt $best->[DATE]) {
return 0;
}
return 1;
}
my %input;

# while uses less memory than for:
# the former is an iterator

while (<>)
{
chomp;
my @results = split(/\t/, $_);
my $key = $results[0];
push @{$input{$key}}, [ @results, $_ ];
}

# while uses less memory than for:
# the former is an iterator

while (my ($key, $aref ) = each %input)
{
my $best = reduce {
keep_best( $a, $b ) ? $a : $b
} @$aref;

print $best->[-1], "\n";
}
### Python (re-working John's code) ###

import sys

def keep_best(best, current):

ACTUAL_STATE = 1
# John had these swapped
DESIRED_STATE = 5
EXPIRY_DATE = 6

keep = True
if current[ACTUAL_STATE] == current[DESIRED_STATE]:
if best[ACTUAL_STATE] == best[DESIRED_STATE]:
if current[EXPIRY_DATE] best[EXPIRY_DATE]:
keep = False
else:
keep = False
else:
if (best[ACTUAL_STATE] != best[ACTUAL_STATE]
and current[EXPIRY_DATE] best[EXPIRY_DATE]):
keep = False
return keep

def process_file(opened_file=sys.stdin):

PIID = 0
recs = {}

for line in opened_file:
line = line.rstrip('\n')
row = line.split('\t')
row.append(line)
piid = row[PIID]
if piid not in recs:
recs[piid] = []
recs[piid].append(row)

for piid in recs:
best = reduce(lambda b, c: keep_best(b, c) and b or c,
recs[piid])
print best[-1]

if __name__ == "__main__":
process_file()
# "reduce" seems to be Lispish, Pythonic, and Perlish!

--
Hope this helps,
Steve

Mar 3 '07 #10

Marc 'BlackJack' Rintsch

In <54*************@mid.individual.net>, Bjoern Schliessmann wrote:

Bruno Desthuilliers wrote:
>Shawn Milo a Ã©crit :

>> if recs.has_key(piid) is False:

'is' is the identity operator - practically, in CPython, it
compares memory addresses. You *dont* want to use it here.

It's recommended to use "is None"; why not "is False"? Are there
multiple False instances or is False generated somehow?

Before `True` and `False` existed many people defined them as aliases to 1
and 0. And of course there are *many* other objects that can be used in a
boolean context of an ``if`` statement for testing "trueness" and
"falseness".

Ciao,
Marc 'BlackJack' Rintsch

Mar 3 '07 #11

John Machin

On Mar 3, 7:08 pm, attn.steven....@gmail.com wrote:

On Mar 2, 2:44 pm, "Shawn Milo" <S...@Milochik.comwrote:

(snipped)

I'm attaching both the Perl and Python versions, and I'm open to
comments on either. The script reads a file from standard input and
finds the best record for each unique ID (piid). The best is defined
as follows: The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6). If
there is no record matching the desired state, then just take the
newest expiration date.

Thanks for taking the time to look at these.

My attempts:
### Python (re-working John's code) ###

import sys

def keep_best(best, current):

ACTUAL_STATE = 1
# John had these swapped
DESIRED_STATE = 5
EXPIRY_DATE = 6

*Bullshit* -- You are confusing me with Bruno; try (re)?reading what
the OP wrote (and which you quoted above):
"""
The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6).
"""

and his code (indented a little less boisterously):

"""
#If the current record is the correct state
if current.split("\t")[1] == current.split("\t")[6]:
#If the existing record is the correct state
if best.split("\t")[1] == best.split("\t")[6]:
#If the new record has a newer exp. date
if current.split("\t")[5] best.split("\t")[5]:
"""

Mar 3 '07 #12

William Heymann

On Saturday 03 March 2007, Ben Finney wrote:

Bjoern Schliessmann <us**************************@spamgourmet.comwrite s:

if not recs.has_key(piid): # [1]

Why not

if piid not in recs:

That is shorter, simpler, easier to read and very slightly faster. Plus you
can change the data structure of recs later without changing that line so
long as it implements containment testing.

Mar 3 '07 #13

Ben Finney

William Heymann <ko**@aesaeion.comwrites:

On Saturday 03 March 2007, Ben Finney wrote:
Bjoern Schliessmann <us**************************@spamgourmet.comwrite s:

if not recs.has_key(piid): # [1]
Why not

if piid not in recs:

That is shorter, simpler, easier to read and very slightly faster.

Perhaps if I'd made my posting shorter, simpler, easier to read and
slightly faster, you might have read the footnote to which the '[1]'
referred.

--
\ "Choose mnemonic identifiers. If you can't remember what |
`\ mnemonic means, you've got a problem." -- Larry Wall |
_o__) |
Ben Finney

Mar 3 '07 #14

Paddy

On Mar 2, 10:44 pm, "Shawn Milo" <S...@Milochik.comwrote:

I'm new to Python and fairly experienced in Perl, although that
experience is limited to the things I use daily.

I wrote the same script in both Perl and Python, and the output is
identical. The run speed is similar (very fast) and the line count is
similar.

Now that they're both working, I was looking at the code and wondering
what Perl-specific and Python-specific improvements to the code would
look like, as judged by others more knowledgeable in the individual
languages.

Hi Shawn, there is a web page that gives examples from Perl's
Datastructures Cookbook re-implemented in Python. It might be of help
for future Python projects:
http://wiki.python.org/moin/PerlPhrasebook

- Paddy.

Mar 3 '07 #15

Jussi Salmela

Shawn Milo kirjoitti:

<snip>
I am not looking for the smallest number of lines, or anything else
that would make the code more difficult to read in six months. Just
any instances where I'm doing something inefficiently or in a "bad"
way.

I'm attaching both the Perl and Python versions, and I'm open to
comments on either. The script reads a file from standard input and
finds the best record for each unique ID (piid). The best is defined
as follows: The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6). If
there is no record matching the desired state, then just take the
newest expiration date.

I don't know if this attempt satisfies your criteria but here goes!

This is not a rewrite of your program but was created using your problem
description above. I've not included the reading of the data because it
has not much to do with the problem per se.

#================================================= ===========
input = [
"aaa\tAAA\t...\t...\t...\t20071212\tBBB\n",
"aaa\tAAA\t...\t...\t...\t20070120\tAAA\n",
"aaa\tAAA\t...\t...\t...\t20070101\tAAA\n",
"aaa\tAAA\t...\t...\t...\t20071010\tBBB\n",
"aaa\tAAA\t...\t...\t...\t20071111\tBBB\n",
"ccc\tAAA\t...\t...\t...\t20071201\tBBB\n",
"ccc\tAAA\t...\t...\t...\t20070101\tAAA\n",
"ccc\tAAA\t...\t...\t...\t20071212\tBBB\n",
"ccc\tAAA\t...\t...\t...\t20071212\tAAA\n",
"bbb\tAAA\t...\t...\t...\t20070101\tAAA\n",
"bbb\tAAA\t...\t...\t...\t20070101\tAAA\n",
"bbb\tAAA\t...\t...\t...\t20071212\tAAA\n",
"bbb\tAAA\t...\t...\t...\t20070612\tAAA\n",
"bbb\tAAA\t...\t...\t...\t20071212\tBBB\n",
]

input = [x[:-1].split('\t') for x in input]
recs = {}
for row in input:
recs.setdefault(row[0], []).append(row)

for key in recs:
rows = recs[key]
rows.sort(key=lambda x:x[5], reverse=True)
for current in rows:
if current[1] == current[6]:
break
else:
current = rows[0]
print '\t'.join(current)
#================================================= ===========
The output is:

aaa AAA ... ... ... 20070120 AAA
bbb AAA ... ... ... 20071212 AAA
ccc AAA ... ... ... 20071212 AAA

and it is the same as the output of your original code on this data.
Further testing would naturally be beneficial.

Cheers,
Jussi

Mar 3 '07 #16

Bruno Desthuilliers

John Machin a écrit :

On Mar 3, 12:36 pm, Bruno Desthuilliers >
[snip]

> DATE = 5
TARGET = 6

[snip]

>>Now for the bad news: I'm afraid your algorithm is broken : here are my
test data and results:

input = [
#ID STATE ... ... ... TARG DATE
"aaa\tAAA\t...\t...\t...\tBBB\t20071212\n",

[snip]

Bruno, The worse news is that your test data is broken.

Re-reading the OP's specs, the bad news is that my only neuron left is
broken. Shouldn't code at 2 o'clock in the morning :(

Mar 4 '07 #17

Bruno Desthuilliers

Bjoern Schliessmann a écrit :

Bruno Desthuilliers wrote:

>>Shawn Milo a écrit :

>> if recs.has_key(piid) is False:

'is' is the identity operator - practically, in CPython, it
compares memory addresses. You *dont* want to use it here.

It's recommended to use "is None"; why not "is False"? Are there
multiple False instances or is False generated somehow?

Once upon a time, Python didn't have a "proper" boolean type. It only
had rules for boolean evaluation of a given object. According to these
rules - that of course still apply -, empty strings, lists, tuples or
dicts, numeric zeros and None are false in a boolean context. IOW, an
expression can eval to false without actually being the False object
itself. So the result of using the identity operator to test against
such an expression, while being clearly defined, may not be exactly what
you'd think.

To make a long story short:

if not []:
print "the empty list evals to false in a boolean context"

if [] is False:
print "this python interpreter is broken"

HTH

Mar 4 '07 #18

Bruno Desthuilliers

Shawn Milo a écrit :
(snip)

The script reads a file from standard input and
finds the best record for each unique ID (piid). The best is defined
as follows: The newest expiration date (field 5) for the record with
the state (field 1) which matches the desired state (field 6). If
there is no record matching the desired state, then just take the
newest expiration date.

Here's a fixed (wrt/ test data) version with a somewhat better (and
faster) algorithm using Decorate/Sort/Undecorate (aka schwarzian transform):

import sys
output = sys.stdout

input = [
#ID STATE ... ... ... DATE TARGET
"aaa\tAAA\t...\t...\t...\t20071212\tBBB\n",
"aaa\tAAA\t...\t...\t...\t20070120\tAAA\n",
"aaa\tAAA\t...\t...\t...\t20070101\tAAA\n",
"aaa\tAAA\t...\t...\t...\t20071010\tBBB\n",
"aaa\tAAA\t...\t...\t...\t20071111\tBBB\n",
"ccc\tAAA\t...\t...\t...\t20071201\tBBB\n",
"ccc\tAAA\t...\t...\t...\t20070101\tAAA\n",
"ccc\tAAA\t...\t...\t...\t20071212\tBBB\n",
"ccc\tAAA\t...\t...\t...\t20071212\tAAA\n",
"bbb\tAAA\t...\t...\t...\t20070101\tBBB\n",
"bbb\tAAA\t...\t...\t...\t20070101\tBBB\n",
"bbb\tAAA\t...\t...\t...\t20071212\tBBB\n",
"bbb\tAAA\t...\t...\t...\t20070612\tBBB\n",
"bbb\tAAA\t...\t...\t...\t20071212\tBBB\n",
]

def find_best_match(input=input, output=output):
PIID = 0
STATE = 1
EXP_DATE = 5
DESIRED_STATE = 6

recs = {}
for line in input:
line = line.rstrip('\n')
row = line.split('\t')
sort_key = (row[STATE] == row[DESIRED_STATE], row[EXP_DATE])
recs.setdefault(row[PIID], []).append((sort_key, line))

for decorated_lines in recs.itervalues():
print >output, sorted(decorated_lines, reverse=True)[0][1]

Lines are sorted first on whether the state matches the desired state,
then on the expiration date. Since it's a reverse sort, we first have
lines that match (if any) sorted by date descending, then the lines that
dont match sorted by date descending. So in both cases, the 'best match'
is the first item in the list. Then we just have to get rid of the sort
key, et voilà !-)

HTH

Mar 4 '07 #19

Peter Otten

Bruno Desthuilliers wrote:

print >output, sorted(decorated_lines, reverse=True)[0][1]

Or just
print >output, max(decorated_lines)[1]

Peter

Mar 4 '07 #20

Bruno Desthuilliers

Peter Otten a écrit :

Bruno Desthuilliers wrote:

> print >output, sorted(decorated_lines, reverse=True)[0][1]

Or just
print >output, max(decorated_lines)[1]

Good point. More explicit, and a bit faster too. Thanks Peter.

Mar 4 '07 #21

Perl and Python, a practical side-by-side example.

Similar topics