473,385 Members | 2,274 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

code for Computer Language Shootout

There are a bunch of new tests up at shootout.alioth.debian.org for which
Python does not yet have code. I've taken a crack at one of them, a task
to print the reverse complement of a gene transcription. Since there are a
lot of minds on this newsgroup that are much better at optimization than
I, I'm posting the code I came up with to see if anyone sees any
opportunities for substantial improvement. Without further ado:

table = string.maketrans('ACBDGHK\nMNSRUTWVY', 'TGVHCDM\nKNSYAAWBR')

def show(s):
i = 0
for char in s.upper().translate(table)[::-1]:
if i == 60:
print
i = 0
sys.stdout.write(char)
i += 1
print

def main():
seq = ''
for line in sys.stdin:
if line[0] == '>' or line[0] == ';':
if seq != '':
show(seq)
seq = ''
print line,
else:
seq += line[:-1]
show(seq)

main()
Making seq into a list instead of a string (and using .extend instead of
the + operator) didn't give any speed improvements. Neither did using a
dictionary instead of the translate function, or using reversed() instead
of s[::-1]. The latter surprised me, since I would have guessed using an
iterator to be more efficient. Since the shootout also tests memory usage,
should I be using reversed for that reason? Does anyone have any other
ideas to optimize this code?

By the way - is there a good way to find out the maximum memory a program
used (in the manner of the "time" command)? Other than downloading and
running the shootout benchmark scripts, of course.

--
Jacob Lee
je****@uiuc.edu | www.nearestneighbor.net

Jul 18 '05 #1
14 1576
Jacob Lee wrote:
By the way - is there a good way to find out the maximum memory a program
used (in the manner of the "time" command)? Other than downloading and
running the shootout benchmark scripts, of course.


Inserting appropriate pauses with raw_input() and recording the memory
usage using top(1) is a quick and dirty way.

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
Jul 18 '05 #2
Here's my solution to the problem[1]:

[1] http://shootout.alioth.debian.org/be...p?test=revcomp

import sys
import string

basetable = string.maketrans('ACBDGHKMNSRUTWVYacbdghkmnsrutwvy ',
'TGVHCDMKNSYAAWBRTGVHCDMKNSYAAWBR')
def revcomp(seqlines, linelength=60, basetable=basetable):
seq = ''.join(seqlines)
complement = seq.translate(basetable)
revcomplement = complement[::-1]
for i in xrange(0, len(revcomplement), linelength):
print revcomplement[i:i+linelength]

def main():
seqlines = []
for line in sys.stdin:
if line.startswith('>') or line.startswith(';'):
if seqlines:
revcomp(seqlines)
sys.stdout.write(line)
seqlines = []
else:
seqlines.append(line.strip())
revcomp(seqlines)

if __name__ == '__main__':
main()

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
Jul 18 '05 #3
Jacob Lee wrote:
There are a bunch of new tests up at shootout.alioth.debian.org for which
Python does not yet have code. I've taken a crack at one of them, a task
to print the reverse complement of a gene transcription. Since there are a
lot of minds on this newsgroup that are much better at optimization than
I, I'm posting the code I came up with to see if anyone sees any
opportunities for substantial improvement. Without further ado:

table = string.maketrans('ACBDGHK\nMNSRUTWVY', 'TGVHCDM\nKNSYAAWBR')
string.translate is a good idea. Note you can handle upper-casing and
translation in one operation by adding a mapping from lower case to the
complement i.e.,

table = string.maketrans('ACBDGHK\nMNSRUTWVYacbdghkmnsrutw vy',
'TGVHCDM\nKNSYAAWBRTGVHCDMKNSYAAWBR')
def show(s):
i = 0
for char in s.upper().translate(table)[::-1]:
if i == 60:
print
i = 0
sys.stdout.write(char)
i += 1
print
def main():
seq = ''
for line in sys.stdin:
if line[0] == '>' or line[0] == ';':
if seq != '':
show(seq)
seq = ''
print line,
else:
seq += line[:-1]
show(seq)

main()


This looks unwieldy - especially writing to sys.stdout oe character at a time.
I may not have understood the spec exactly, but isn't this the same as:

for line in sys.stdin:
if line and (line[0] == ">" or line[0] == ";"):
print line
else:
print line.translate(table)

HTH

Michael

Jul 18 '05 #4
Jacob Lee wrote:
There are a bunch of new tests up at shootout.alioth.debian.org for which
Python does not yet have code. I've taken a crack at one of them, a task
to print the reverse complement of a gene transcription. Since there are a
lot of minds on this newsgroup that are much better at optimization than
I, I'm posting the code I came up with to see if anyone sees any
opportunities for substantial improvement. Without further ado:

table = string.maketrans('ACBDGHK\nMNSRUTWVY', 'TGVHCDM\nKNSYAAWBR')

def show(s):
i = 0
for char in s.upper().translate(table)[::-1]:
if i == 60:
print
i = 0
sys.stdout.write(char)
i += 1
print

def main():
seq = ''
for line in sys.stdin:
if line[0] == '>' or line[0] == ';':
if seq != '':
show(seq)
seq = ''
print line,
else:
seq += line[:-1]
show(seq)

main()
Don't know if this is faster for your data, but I think you could also
write this as (untested):

# table as default argument value so you don't have to do
# a global lookup each time it's used

def show(seq, table=string.maketrans('ACBDGHK\nMNSRUTWVY',
'TGVHCDM\nKNSYAAWBR')
seq = seq.upper().translate(table)[::-1]
# print string in slices of length 60
for i in range(0, len(seq), 60):
print seq[i:i+60]

def main():
seq = []
# alias methods to avoid repeated lookup
join = ''.join
append = seq.append
for line in sys.stdin:
# note only one "line[0]" by using "in" test
if line[0] in ';>':
# note no need to check if seq is empty; show now prints
# nothing for an empty string
show(join(seq))
print line,
del seq[:]
else:
append(line[:-1])

Making seq into a list instead of a string (and using .extend instead of
the + operator) didn't give any speed improvements. Neither did using a
dictionary instead of the translate function, or using reversed() instead
of s[::-1]. The latter surprised me, since I would have guessed using an
iterator to be more efficient. Since the shootout also tests memory usage,
should I be using reversed for that reason?


reversed() won't save you any memory -- you're already loading the
entire string into memory anyway.
Interesting tidbit:
del seq[:]
tests faster than
seq = []

$ python -m timeit -s "lst = range(1000)" "lst = []"
10000000 loops, best of 3: 0.159 usec per loop

$ python -m timeit -s "lst = range(1000)" "del lst[:]"
10000000 loops, best of 3: 0.134 usec per loop

It's probably the right way to go in this case anyway -- no need to
create a new empty list each time.

STeVe
Jul 18 '05 #5
On Tue, 15 Mar 2005 21:38:48 -0800, Michael Spencer wrote:
string.translate is a good idea. Note you can handle upper-casing and
translation in one operation by adding a mapping from lower case to the
complement i.e.,

table = string.maketrans('ACBDGHK\nMNSRUTWVYacbdghkmnsrutw vy',
'TGVHCDM\nKNSYAAWBRTGVHCDMKNSYAAWBR')
Good call.
This looks unwieldy - especially writing to sys.stdout oe character at a
time. I may not have understood the spec exactly, but isn't this the
same as:

for line in sys.stdin:
if line and (line[0] == ">" or line[0] == ";"):
print line
else:
print line.translate(table)

That's the immediately obvious solution, but it doesn't actually fulfill
the problem requirements. What if your last line is less than 60
characters long? You no longer will be displaying the input in reverse
order. Otherwise you'd be right - my solution would be unnecessarily
unwieldy (and the problem would be much simpler...) .

--
Jacob Lee
je****@uiuc.edu | www.nearestneighbor.net

Jul 18 '05 #6
On Tue, 15 Mar 2005 22:45:48 -0700, Steven Bethard wrote:
# table as default argument value so you don't have to do
# a global lookup each time it's used

def show(seq, table=string.maketrans('ACBDGHK\nMNSRUTWVY',
'TGVHCDM\nKNSYAAWBR')
seq = seq.upper().translate(table)[::-1]
# print string in slices of length 60
for i in range(0, len(seq), 60):
print seq[i:i+60]

def main():
seq = []
# alias methods to avoid repeated lookup
join = ''.join
append = seq.append
for line in sys.stdin:
# note only one "line[0]" by using "in" test
if line[0] in ';>':
# note no need to check if seq is empty; show now prints
# nothing for an empty string
show(join(seq))
print line,
del seq[:]
else:
append(line[:-1])

Wow - that ran about 10 times faster on a 10MB input file. The significant
change was the for loop inside the show function: your method avoids the
increment and if statement and of course has 60x fewer iterations total.
reversed() won't save you any memory -- you're already loading the
entire string into memory anyway.
Interesting tidbit:
del seq[:]
tests faster than
seq = []

$ python -m timeit -s "lst = range(1000)" "lst = []"
10000000 loops, best of 3: 0.159 usec per loop

$ python -m timeit -s "lst = range(1000)" "del lst[:]"
10000000 loops, best of 3: 0.134 usec per loop

It's probably the right way to go in this case anyway -- no need to
create a new empty list each time.


Fascinating - I hadn't thought about that.

Besides redoing that loop, the remaining optimizations produce less than a
10% speed-up; in particular, adding the function aliases increases the
number of lines of code (another benchmark in the shootout), and I would
imagine that the organizers don't really want that type of special
optimization (no developer is going to write that in their programs
unless they have really time-critical code, so this is the sort of hit
that Python really should take in the contest as penalty for being so
dynamically nifty ;)). So here's a tentative contest version of the code:

import sys
import string

def show(seq, table=string.maketrans('ACBDGHK\nMNSRUTWVYacbdghkm nsrutwvy',
'TGVHCDM\nKNSYAAWBRTGVHCDMKNSYAAWBR')):
seq = seq.translate(table)[::-1]
for i in range(0, len(seq), 60):
print seq[i:i+60]

def main():
seq = []
for line in sys.stdin:
if line[0] in ';>':
show(''.join(seq))
print line,
del seq[:]
else:
seq.append(line[:-1])
show(''.join(seq))

main()

--
Jacob Lee
je****@uiuc.edu | www.nearestneighbor.net

Jul 18 '05 #7
Jacob Lee wrote:
So here's a tentative contest version of the code:

import sys
import string

def show(seq, table=string.maketrans('ACBDGHK\nMNSRUTWVYacbdghkm nsrutwvy',
'TGVHCDM\nKNSYAAWBRTGVHCDMKNSYAAWBR')):
seq = seq.translate(table)[::-1]
for i in range(0, len(seq), 60):
print seq[i:i+60]

def main():
seq = []
for line in sys.stdin:
if line[0] in ';>':
show(''.join(seq))
print line,
del seq[:]
else:
seq.append(line[:-1])
show(''.join(seq))

main()


Looks pretty good. (And yes, I definitely prefer the unaliased ''.join
and seq.append for readability's sake. Glad to know they try to grade
on that too.) =)

Only one other suggestion: "range" in the show function should probably
be "xrange". "range" is going to create an actual list of however many
integers, while "xrange" will just create the integers as needed.
"xrange" thus will be more memory friendly. (It's also occasionally
faster, though this depends a lot on the rest of the code).

STeVe
Jul 18 '05 #8
Jacob Lee wrote:
On Tue, 15 Mar 2005 21:38:48 -0800, Michael Spencer wrote:

string.translate is a good idea. Note you can handle upper-casing and
translation in one operation by adding a mapping from lower case to the
complement i.e.,

table = string.maketrans('ACBDGHK\nMNSRUTWVYacbdghkmnsrutw vy',
'TGVHCDM\nKNSYAAWBRTGVHCDMKNSYAAWBR')


Good call.

This looks unwieldy - especially writing to sys.stdout oe character at a
time. I may not have understood the spec exactly, but isn't this the
same as:

for line in sys.stdin:
if line and (line[0] == ">" or line[0] == ";"):
print line
else:
print line.translate(table)


That's the immediately obvious solution, but it doesn't actually fulfill
the problem requirements. What if your last line is less than 60
characters long? You no longer will be displaying the input in reverse
order. Otherwise you'd be right - my solution would be unnecessarily
unwieldy (and the problem would be much simpler...) .

yes it would be, wouldn't it ;-)

How about this then:

basetable = string.maketrans('ACBDGHKMNSRUTWVYacbdghkmnsrutwvy ',
'TGVHCDMKNSYAAWBRTGVHCDMKNSYAAWBR')

from collections import deque
from itertools import islice

def output(seq, linelength = 60):
if seq:
iterseq = iter(seq)
while iterseq:
print "".join(islice(iterseq,linelength))

def revcomp(input = sys.stdin):
seqlines = deque()
for line in input:
if line[0] in ">;":
output(seqlines)
print line,
seqlines.clear()
else:
seqlines.extendleft(line.translate(basetable, "\n\r"))
output(seqlines)
It would be nice to inline the output function, and re-arrange the iteration so
that EOF triggers the same suite as line[0] in ">;"

Michael

Jul 18 '05 #9
Le Tue, 15 Mar 2005 23:21:02 -0800, Michael Spencer a écrit :
Jacob Lee wrote:
On Tue, 15 Mar 2005 21:38:48 -0800, Michael Spencer wrote:

Good call.
How about this then:

basetable = string.maketrans('ACBDGHKMNSRUTWVYacbdghkmnsrutwvy ',
'TGVHCDMKNSYAAWBRTGVHCDMKNSYAAWBR')

from collections import deque
from itertools import islice

def output(seq, linelength = 60):
if seq:
iterseq = iter(seq)
while iterseq:
print "".join(islice(iterseq,linelength))

I suppose you mean :
print "".join( str(item) for item in islice(iterseq, linelength) )
# using python 2.4 genexp
def revcomp(input = sys.stdin):
seqlines = deque()
for line in input:
if line[0] in ">;":
output(seqlines)
print line,
seqlines.clear()
else:
seqlines.extendleft(line.translate(basetable, "\n\r"))
output(seqlines)
It would be nice to inline the output function, and re-arrange the iteration so
that EOF triggers the same suite as line[0] in ">;"

Michael

Jul 18 '05 #10
Consider keeping the alias for append because it occurs in the
innermost loop. For maximum readability, write: addline = seq.append

Move the ''.join() to the show() function. That eliminates a little
redundancy.

The test dataset doesn't use the semi-colon comment field. So,
consider reversing ';>' to '>;'.
Raymond Hettinger

Jul 18 '05 #11
F. Petitjean wrote:
Le Tue, 15 Mar 2005 23:21:02 -0800, Michael Spencer a écrit :


def output(seq, linelength = 60):
if seq:
iterseq = iter(seq)
while iterseq:
print "".join(islice(iterseq,linelength))


I suppose you mean :
print "".join( str(item) for item in islice(iterseq, linelength) )
# using python 2.4 genexp

No, as written, given the seq is a sequence of single character strings already
(changing the signature might clarify that):

def output(charseq, linelength = 60):
if charseq:
iterseq = iter(charseq)
while iterseq:
print "".join(islice(iterseq,linelength))
output("*" * 120, 40) ****************************************
****************************************
**************************************** output(['*'] * 120, 40) ****************************************
****************************************
****************************************


Cheers
Michael

Jul 18 '05 #12
Michael Spencer wrote:
def output(seq, linelength = 60):
if seq:
iterseq = iter(seq)
while iterseq:
print "".join(islice(iterseq,linelength))


Worth noting: "while iterseq" only works because for this case, you have
a list iterator, which provides a __len__ method. This approach will
not work when you have an iterator that does not provide a __len__
method. For a nice infinite printing loop, try:

def gen():
yield '+'*100
yield '*'*100

output(gen())

STeVe
Jul 18 '05 #13
Steven Bethard wrote:
Michael Spencer wrote:
def output(seq, linelength = 60):
if seq:
iterseq = iter(seq)
while iterseq:
print "".join(islice(iterseq,linelength))

Worth noting: "while iterseq" only works because for this case, you have
a list iterator, which provides a __len__ method.


Thanks! I had noted that the file iterator didn't behave like this, but I hadn't
deduced the reason. Unfortunately, the above construct, while cute, is also
not terribly speedy.
print "\n".join(body[-index:-index-linelength:-1]

... for index in xrange(1, len(body), linelength))

is ugly but much faster with an already-existing string

So, my second attempt is:

from itertools import groupby

def revcomp2(input = sys.stdin, linelength = 60):
basetable = string.maketrans('ACBDGHKMNSRUTWVYacbdghkmnsrutwvy ',
'TGVHCDMKNSYAAWBRTGVHCDMKNSYAAWBR')

def record(item):
return item[0] in ">;"

for header, body in groupby(input, record):
body = "".join(body)
if header:
print body,
else:
body = body.translate(basetable, "\n\r")
print "\n".join(body[-index:-index-linelength:-1]
for index in xrange(1, len(body), linelength))
Michael

Jul 18 '05 #14
We've made it somewhat easier to contribute programs.

No need to subscribe to the mailing-list.
No need for a user-id or login.

See the FAQ "How can I contribute a program?"
http://shootout.alioth.debian.org/faq.php

Jul 18 '05 #15

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
by: Isaac Gouy | last post by:
"The Great Computer Language Shootout" has been revived on Debian: http://shootout.alioth.debian.org/index.php Only 15 of the 25 programs have PHP implementations...
0
by: Isaac Gouy | last post by:
"The Great Computer Language Shootout" was mentioned a few months ago - note that the Shootout is active and welcoming contributions at: http://shootout.alioth.debian.org/
26
by: HackingYodel | last post by:
Hello all! I'm learning to program at home. I can't imagine a better language than Python for this. The ideal situation, for me, would be to study two languages at the same time. Probably...
2
by: Nicolas Pernetty | last post by:
Hello, I'm looking (without success so far) for code sources for common problems in different languages, but especially Python, C, Fortran, Ada and Matlab. It would help people coming from...
2
by: Greg Buchholz | last post by:
I was browsing the C++ programs on "The Computer Language Shootout" http://shootout.alioth.debian.org/ and for the heck of it, I thought I'd try my hand at seeing if I couldn't shorten the entry...
14
by: bearophileHUGS | last post by:
The The Computer Language Shootout has just published results for Python 2.5 and Psyco 1.5.2. Comparing the old (Python 2.4) Gentoo Pentium 4 results (now not visible anymore) with the new results,...
80
by: tech | last post by:
Hi, i have the following problem In file1.h namespace A { class Bar { void foo();
13
by: brad | last post by:
Still learning C++. I'm writing some regex using boost. It works great. Only thing is... this code seems slow to me compared to equivelent Perl and Python. I'm sure I'm doing something incorrect....
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.