ben moretti wrote:
hi
i'm learning python, and one area i'd use it for is data management in
scientific computing. in the case i've tried i want to reformat a data
file from a normalised list to a matrix with some sorted columns. to
do this at the moment i am using perl, which is very easy to do, and i
want to see if python is as easy.
so, the data i am using is some epiphyte population abundance data for
particular sites, and it looks like this:
1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00
(i have changed the data to protect the innocent) the first four
columns relate to the location, the fifth to the substrate, the sixth
is the epiphyte species and the seventh the abundance. i need to turn
this into a substrate x species matrix with columns 1 to 4 retained as
sorting columns and the intersection of speces and substrate being the
abundance. the species name needs to be the column headers. this is
going to go into a multivariate analysis of variance programme that
only takes its data in that format. here is an example of the output
region location site stand substrate genus species 1 genus species
2 genus species 3 genus species 4 genus species 5 genus species
6 genus species 7
<..etc..>
1 1 1 MO 906 0 0 0 0 0 0 0 0 0 0 0 0 0 0
<..etc...>
so, to do this in perl - and i won't bore you with the whole script -
i read the file, split it into tokens and then populate a hash of
hashes, the syntax of which is
$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]} = $tokens[6]
with the various location and species values are the keys of the hash,
and the abundance is the $tokens[6] value. this now gives me a
multidimensional data structure that i can use to loop over the keys
and sort them by each as i go, then to write out the data into a
matrix as above. the syntax for this is generally like
# level 1 - region
foreach $region (sort {$a <=> $b} keys %HoH) {
# level 2 - location
foreach $location (sort {$a <=> $b} keys %{ $HoH{$region} }) {
# level 3 - site
foreach $site (sort {$a <=> $b} keys %{ $HoH{$region}{$location} })
<... etc ...>
there is a bit more perl obviously, but that is the general gist of
it. multidimensional hash and then looping and sorting to get the data
out.
ok. so how do i do this in python? i've tried the "perlish" way but
didn't get very far, however i know it must be able to be done!
The best solution would probably to be to rely on a database that supports
pivot tables.
However, I've put together a simple class to generate a pivot table to get
you started. It's only 2D, i. e. f(row,col) -> value, but if I have
understood you correctly that should be sufficient (I am not good at
reading perl).
To read your data from a (text) file, have a look at Python's csv module.
Peter
<code>
import sets
class Adder(object):
""" Adds all values entered via set()
"""
def __init__(self, value=0):
self.value = value
def set(self, value):
self.value += value
def get(self):
return self.value
_none = object()
class First(object):
""" Accepts any value the first time set() is called,
requires the same value on subsequent calls of set().
"""
def __init__(self):
self.value = _none
def set(self, value):
if self.value is _none:
self.value = value
else:
if value != self.value:
raise ValueError, "%s expected but got %s" % (self.value,
value)
def get(self):
return self.value
class Pivot(object):
""" A simple Pivot table generator class
"""
def __init__(self, valueAccumulator, rowHeaders):
self.rows = sets.Set()
self.columns = sets.Set()
self.values = {}
self.valueAccumulator = valueAccumulator
self.rowHeaders = rowHeaders
def extend(self, table, extractRow, extractColumn, extractValue):
for record in table:
r = extractRow(record)
c = extractColumn(record)
self.rows.add(r)
self.columns.add(c)
try:
fxy = self.values[r, c]
except KeyError:
fxy = self.valueAccumulator()
self.values[r, c] = fxy
fxy.set(extractValue(record))
def toTable(self, defaultValue=None, columnCompare=None,
rowCompare=None):
""" returns a list of lists.
"""
table = []
rows = list(self.rows)
rows.sort(rowCompare)
columns = list(self.columns)
columns.sort(columnCompare)
headers = self.rowHeaders + [c for c in columns]
table.append(headers)
for row in rows:
record = list(row)
for column in columns:
v = self.values.get((row, column), None)
if v is not None:
v = v.get()
record.append(v)
table.append(record)
return table
def printTable(p):
for row in p.toTable():
print row
if __name__ == "__main__":
table = [
"Jack Welsh Beer 1",
"Richard Maier Beer 1",
"Bill Bush Wine 2",
"Bill Bush Wine 2",
]
table = [row.split() for row in table]
print table
print "-" * 10
p = Pivot(Adder, ["Christian", "Surname"])
def extractRow(record):
return record[0], record[1]
def extractValue(record):
return int(record[3])
def extractColumn(record):
return record[2]
p.extend(table, extractRow, extractColumn, extractValue)
printTable(p)
columns = "region location site stand substrate species
abundance".split()
table = [
[1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 2", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 3", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 706, "species 4", 1],
# [1.0, 1.0, 1.0, "GO", 706, "species 4", 2],# uncomment me
[1.0, 1.0, 1.0, "GO", 806, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 106, "species 1", 1],
]
p = Pivot(First, columns[:5])
p.extend(table, lambda r: tuple(r[:5]),
lambda r: r[5],
lambda r: r[6])
printTable(p)
</code>