data management with python from perl

hi

i'm learning python, and one area i'd use it for is data management in
scientific computing. in the case i've tried i want to reformat a data
file from a normalised list to a matrix with some sorted columns. to
do this at the moment i am using perl, which is very easy to do, and i
want to see if python is as easy.

so, the data i am using is some epiphyte population abundance data for
particular sites, and it looks like this:

1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00

(i have changed the data to protect the innocent) the first four
columns relate to the location, the fifth to the substrate, the sixth
is the epiphyte species and the seventh the abundance. i need to turn
this into a substrate x species matrix with columns 1 to 4 retained as
sorting columns and the intersection of species and substrate being the
abundance. the species name needs to be the column headers. this is
going to go into a multivariate analysis of variance programme that
only takes its data in that format. here is an example of the output

region location site stand substrate genus species 1 genus species 2 genus species 3 genus species 4 genus species 5 genus species 6 genus species 7

<..etc..>

1 1 1 MO 906 0 0 0 0 0 0 0 0 0 0 0 0 0 0

<..etc...>

so, to do this in perl - and i won't bore you with the whole script -
i read the file, split it into tokens and then populate a hash of
hashes, the syntax of which is

$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]} = $tokens[6]

with the various location and species values as the keys of the hash,
and the abundance is the $tokens[6] value. this now gives me a
multidimensional data structure that i can use to loop over the keys
and sort them by each as i go, then to write out the data into a
matrix as above. the syntax for this is generally like

# level 1 - region
foreach $region (sort {$a <=> $b} keys %HoH) {

    # level 2 - location
    foreach $location (sort {$a <=> $b} keys %{ $HoH{$region} }) {

        # level 3 - site
        foreach $site (sort {$a <=> $b} keys %{ $HoH{$region}{$location} }) {

            <... etc ...>

there is a bit more perl obviously, but that is the general gist of
it. multidimensional hash and then looping and sorting to get the data
out.

ok. so how do i do this in python? i've tried the "perlish" way but
didn't get very far, however i know it must be able to be done!

if you want to respond to this, try benmoretti at yahoo dot com dot au
as i get too much spam otherwise

cheers

ben
Jul 18 '05 #1

The best solution would probably be to rely on a database that supports
pivot tables.
However, I've put together a simple class to generate a pivot table to get
you started. It's only 2D, i.e. f(row, col) -> value, but if I have
understood you correctly that should be sufficient (I am not good at
reading Perl).
To read your data from a (text) file, have a look at Python's csv module.
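
As a rough sketch of that csv suggestion (an illustration, not part of the class below): a space delimiter plus a double-quote quote character keeps a quoted species name such as "genus species 1" together as one field. The names `SAMPLE` and `read_records` are made up for this example, and `io.StringIO` merely stands in for a real file.

```python
import csv
import io

# Hypothetical stand-in for the data file, copied from the sample in the question.
SAMPLE = '''1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
'''

def read_records(stream):
    """Return one list of string fields per non-empty input line."""
    reader = csv.reader(stream, delimiter=" ", quotechar='"',
                        skipinitialspace=True)
    return [row for row in reader if row]

records = read_records(io.StringIO(SAMPLE))
for record in records:
    print(record)
```

Every field comes back as a string, so the numeric columns still need converting before they are sorted or summed.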

Peter

<code>
from functools import cmp_to_key

class Adder(object):
    """ Adds all values entered via set()
    """
    def __init__(self, value=0):
        self.value = value
    def set(self, value):
        self.value += value
    def get(self):
        return self.value

_none = object()
class First(object):
    """ Accepts any value the first time set() is called,
    requires the same value on subsequent calls of set().
    """
    def __init__(self):
        self.value = _none
    def set(self, value):
        if self.value is _none:
            self.value = value
        else:
            if value != self.value:
                raise ValueError("%s expected but got %s" % (self.value, value))
    def get(self):
        return self.value

class Pivot(object):
    """ A simple Pivot table generator class
    """
    def __init__(self, valueAccumulator, rowHeaders):
        self.rows = set()
        self.columns = set()
        self.values = {}
        self.valueAccumulator = valueAccumulator
        self.rowHeaders = rowHeaders
    def extend(self, table, extractRow, extractColumn, extractValue):
        for record in table:
            r = extractRow(record)
            c = extractColumn(record)
            self.rows.add(r)
            self.columns.add(c)
            try:
                fxy = self.values[r, c]
            except KeyError:
                fxy = self.valueAccumulator()
                self.values[r, c] = fxy
            fxy.set(extractValue(record))

    def toTable(self, defaultValue=None, columnCompare=None,
                rowCompare=None):
        """ returns a list of lists.
        """
        table = []
        # rowCompare and columnCompare are cmp-style functions f(a, b)
        rows = sorted(self.rows,
                      key=cmp_to_key(rowCompare) if rowCompare else None)
        columns = sorted(self.columns,
                         key=cmp_to_key(columnCompare) if columnCompare else None)
        headers = self.rowHeaders + columns
        table.append(headers)
        for row in rows:
            record = list(row)
            for column in columns:
                v = self.values.get((row, column))
                # cells that were never set get defaultValue
                v = defaultValue if v is None else v.get()
                record.append(v)
            table.append(record)
        return table

def printTable(p):
    for row in p.toTable():
        print(row)

if __name__ == "__main__":
    table = [
        "Jack Welsh Beer 1",
        "Richard Maier Beer 1",
        "Bill Bush Wine 2",
        "Bill Bush Wine 2",
    ]
    table = [row.split() for row in table]
    print(table)
    print("-" * 10)
    p = Pivot(Adder, ["Christian", "Surname"])
    def extractRow(record):
        return record[0], record[1]
    def extractValue(record):
        return int(record[3])
    def extractColumn(record):
        return record[2]
    p.extend(table, extractRow, extractColumn, extractValue)

    printTable(p)

    columns = "region location site stand substrate species abundance".split()

    table = [
        [1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
        [1.0, 1.0, 1.0, "MO", 906, "species 2", 1],
        [1.0, 1.0, 1.0, "MO", 906, "species 3", 1],
        [1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
        [1.0, 1.0, 1.0, "GO", 706, "species 4", 1],
        # [1.0, 1.0, 1.0, "GO", 706, "species 4", 2],# uncomment me
        [1.0, 1.0, 1.0, "GO", 806, "species 1", 1],
        [1.0, 1.0, 1.0, "GO", 906, "species 1", 1],
        [1.0, 1.0, 1.0, "GO", 106, "species 1", 1],
    ]
    p = Pivot(First, columns[:5])
    p.extend(table, lambda r: tuple(r[:5]),
             lambda r: r[5],
             lambda r: r[6])
    printTable(p)
</code>
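
For the layout in the question, the same pivot idea can also be sketched with plain dictionaries. This is only an illustration under assumptions: the rows are already parsed into lists, the first five fields form the row key, the sixth is the column label, the seventh is the value, and missing cells are filled with 0. The function name `pivot_rows` and the sample rows are hypothetical.

```python
def pivot_rows(rows, n_key=5, fill=0):
    """Pivot normalised rows into sorted column labels plus one line per key.

    Each input row is assumed to hold n_key key fields, then a column
    label, then a value. A repeated (key, label) pair keeps the last value.
    """
    cells = {}      # (key, label) -> value
    keys = set()
    labels = set()
    for row in rows:
        key, label, value = tuple(row[:n_key]), row[n_key], row[n_key + 1]
        keys.add(key)
        labels.add(label)
        cells[key, label] = value
    ordered_labels = sorted(labels)
    table = []
    for key in sorted(keys):
        # One output line: the key fields, then one cell per column label.
        table.append(list(key) + [cells.get((key, label), fill)
                                  for label in ordered_labels])
    return ordered_labels, table

rows = [
    [1, 1, 1, "MO", 906, "species 1", 1],
    [1, 1, 1, "MO", 906, "species 2", 3],
    [1, 1, 1, "GO", 706, "species 2", 2],
]
labels, table = pivot_rows(rows)
# Tab-separated output keeps species names that contain spaces unambiguous.
print("region", "location", "site", "stand", "substrate", *labels, sep="\t")
for line in table:
    print(*line, sep="\t")
```

Sorting the key tuples gives the region, location, site, stand, substrate ordering in one step, because tuples compare field by field.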
Jul 18 '05 #2
bm******@chariot.net.au (ben moretti) wrote:
> i'm learning python, and one area i'd use it for is data management in
> scientific computing. in the case i've tried i want to reformat a data
> file from a normalised list to a matrix with some sorted columns. to
> do this at the moment i am using perl, which is very easy to do, and i
> want to see if python is as easy.
Not being too familiar with Perl (or scientific computing), I'm not
sure if I understood everything correctly...
> 1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
> 1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00
I _think_ you want your data as a nested dictionary like so:
{1: {1: {1: {"MO": {906: {"genus species 1": 1,
"genus species 2": 1,
"genus species 3": 1,
"genus species 4": 1} }}}}}

> so, to do this in perl - and i won't bore you with the whole script -
> i read the file, split it into tokens
I hope I will NOT bore you with a whole script, but I've expanded your
data a bit to have a somewhat more complicated/structured data file to
work with (not shown here, this's more than long enough as it is); so
I'll first read it in and split it up:

###

import csv

f = open(r"i:\python\nestedtest.txt", "r") # my testdata
csvreader = csv.reader(f, delimiter=' ', quotechar='"')

###

From your output I gather that maybe you want the numbers as numbers, and
not as strings, so I'll convert the data while populating an
intermediate list:

###

def parselist(lst):
    """convert the list's values to floats or integers where
    appropriate"""
    parsed = []
    for itm in lst:
        try:
            f = float(itm)
            i = int(f)
            if i == f:
                parsed.append(int(i))
            else:
                parsed.append(f)
        except ValueError:
            parsed.append(itm)
    return parsed

datalist = []
for line in csvreader:
    datalist.append(parselist(line))
f.close() # don't need that anymore

###
> and then populate a hash of
> hashes, the syntax of which is
>
> $HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]} = $tokens[6]
Now, if that does what I think it does (create a nested hash), then
hats off to Perl! I haven't found anything as concise built into
Python (but then I'm not a guru, maybe someone else knows a better
way?), so I rolled my own:

###

def nestdict(lst):
    """create a recursively nested dictionary from a _flat_ list"""
    dct = {}
    if len(lst) > 2:
        dct[lst[0]] = nestdict(lst[1:])
    elif len(lst) == 2:
        dct[lst[0]] = lst[1]
    return dct

###

which is good for ONE line of input; since I have a list of those, I
want to build up the dictionary line by line, for which I need another
function:

###

def nestextend(dct, upd):
    """recursively extend/update a nested dictionary with another one"""
    try:
        items = upd.items()
        for key, val in items:
            if key not in dct:
                dct[key] = val
            else:
                nestextend(dct[key], upd[key])
    except AttributeError:
        dct.update(upd)

datadict = {}
for lst in datalist:
    nestextend(datadict, nestdict(lst))

###

datadict now holds all the data from the testfile in a nested
dictionary with the various locations and species values as the keys
of the hash, which is what (I hope) you wanted.
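
For comparison, here is a sketch that stays closer to the Perl one-liner. It assumes a Python version that provides `collections.defaultdict`: a dictionary that creates missing keys on first access, so a whole chain of keys can be assigned in one statement. The names `tree` and `hoh` are illustrative only.

```python
from collections import defaultdict

def tree():
    """Return a dict that creates nested dicts of the same kind on demand."""
    return defaultdict(tree)

hoh = tree()
tokens = [1, 1, 1, "MO", 906, "genus species 1", 1]
# Same shape as the Perl assignment: six keys, then the abundance.
hoh[tokens[0]][tokens[1]][tokens[2]][tokens[3]][tokens[4]][tokens[5]] = tokens[6]

print(hoh[1][1][1]["MO"][906]["genus species 1"])
```

One caveat: merely indexing a missing key also creates it, so membership is better tested with `in` than by looking the key up.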
> and the abundance is the $tokens[6] value. this now gives me a
> multidimensional data structure
Reading that I'm not sure I've understood anything - shouldn't you
want to use a multidimensional array for that? Anyone familiar with
Python's scientific/number crunching/array libraries should be able to
clear that up...
> that i can use to loop over the keys and sort them by each as i go,
> then to write out the data into a matrix as above.
I'm not sure how you arrive at your matrix output, but looping over
the dictionary shouldn't be a problem now. However, since you also
want to sort the data (by key), and dictionaries notoriously don't
support that, I've written another function:

###

def nestsort(dct):
    """convert a nested dictionary to a nested (key, value) list,
    recursively sorting it by key"""
    lst = []
    try:
        # a leaf value has no items() and is returned unchanged below
        items = sorted(dct.items())
        for key, value in items:
            lst.append([key, nestsort(dct[key])])
        return lst
    except AttributeError:
        return dct

sorteddata = nestsort(datadict)

###

So now the data from the beginning looks like:

[[1, [[1, [[1, [["MO", [[906, [["genus species 1", 1],
                                ["genus species 2", 1],
                                ["genus species 3", 1],
                                ["genus species 4", 1]]]]]]]]]]]]

which you probably could have had cheaper...

Now you can do something like:

###

for region, rdata in sorteddata:
    print("Region", region)
    for location, ldata in rdata:
        print(" " * 2 + "Location", location)
        for site, sitedata in ldata:
            print(" " * 4 + "Site", site)
            for stand, stdata in sitedata:
                print(" " * 6 + "Stand", stand)
                for substrate, subdata in stdata:
                    print(" " * 8 + "Substrate", substrate)
                    for genus, abundance in subdata:
                        print(" " * 10 + "Genus", genus, "Abundance", abundance)

###

to test my script and your (real) data.
There's next to no error-checking and it sure'd be more
pythonic/beautiful/reusable if I'd subclass'd dict, but it works --
for my data at least.
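
A possibly cheaper route than building and sorting the nested structure, sketched under the assumption that the parsed rows are flat lists like those in `datalist`: lists compare element by element, so sorting the rows directly orders them by region, then location, and so on, and `itertools.groupby` can then recover the grouping. The sample `rows` and the helper `location_key` are invented for illustration.

```python
from itertools import groupby

rows = [
    [1, 2, 1, "MO", 906, "genus species 2", 1],
    [1, 1, 1, "MO", 906, "genus species 1", 4],
    [1, 1, 1, "MO", 906, "genus species 3", 2],
]

# Lists sort lexicographically: region first, then location, and so on.
rows.sort()

def location_key(row):
    """The first five fields identify one line of the output matrix."""
    return tuple(row[:5])

grouped = []
for key, group in groupby(rows, key=location_key):
    # Each group holds the (species, abundance) pairs for one key.
    grouped.append((key, [(row[5], row[6]) for row in group]))

for key, pairs in grouped:
    print(key, pairs)
```

`groupby` only merges adjacent rows, which is why the sort has to come first.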
> ok. so how do i do this in python? i've tried the "perlish" way but
Once more, it seems that "the perlish way" <> "the python way".
> didn't get very far, however i know it must be able to be done!
I don't think there's much of anything either language can do that the
other can't, but of course some things are harder than others...
> if you want to respond to this, try benmoretti at yahoo dot com dot au
> as i get too much spam otherwise


<posted to the NG and forwarded to you>
--
Christopher
Jul 18 '05 #3
