Hello,
I have strings represented as a combination of an alphabet (AGCT) and an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:
"ATT/GATA/G"
gets split to
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
I have written a very ugly function to do this (listed below for the curious),
but intuitively I think this should only take a couple of lines for one
skilled in regex and/or listcomp. Any takers?
James
p.s. Here is the ugly function I wrote:
def build_consensus(astr):
    consensus = []      # the lol that will be returned
    possibilities = []  # one element of consensus
    consecutives = 0    # keeps track of how many in a row
    for achar in astr:
        if (achar == "/"):
            consecutives = 0
            continue
        else:
            consecutives += 1
        if (consecutives > 1):
            consensus.append(possibilities)
            possibilities = [achar]
        else:
            possibilities.append(achar)
    if possibilities:
        consensus.append(possibilities)
    return consensus
--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/

Using a parser may sound like overkill, but why not when it's this
easy? Get the latest pyparsing at http://pyparsing.sourceforge.net.
-- Paul
from pyparsing import oneOf, Group, OneOrMore, Literal
testdata = "ATT/GATA/G"
marker = oneOf( "A T G C")
SLASH = Literal("/").suppress()
genDegenList = OneOrMore( Group( marker + SLASH + marker ) | Group( marker ) )
print genDegenList.parseString( testdata )

prints:
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

In article <ma*************************************@python.org>,
James Stroud <js*****@mbi.ucla.edu> wrote:
> Hello,
> I have strings represented as a combination of an alphabet (AGCT) and an
> operator "/", that signifies degeneracy. I want to split these strings into
> lists of lists, where the degeneracies are members of the same list and
> non-degenerates are members of single item lists. An example will clarify
> this:
> "ATT/GATA/G"
> gets split to
> [['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?
import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]
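
The pattern above pairs each base with at most one alternative. If a position can carry more than two alternatives (say "C/C/T"), one possible extension is to let the regex consume repeated "/x" groups. This is only a sketch, assuming Python 3; the name `split_degenerate` is hypothetical and not from the thread:

```python
import re

def split_degenerate(s):
    """Split e.g. "ATT/GATA/G" into one list per position.

    Hypothetical helper, not from the thread. Each regex match is one
    base followed by zero or more "/base" alternatives.
    """
    return [m.split("/") for m in re.findall(r"[^/](?:/[^/])*", s)]

print(split_degenerate("ATT/GATA/G"))
print(split_degenerate("AGC/C/TGA/T"))
```

Note that `re.findall` simply skips a "/" that has no base on one side, so malformed input is dropped silently rather than reported.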
--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.

On Sun, 27 Mar 2005 14:39:06 -0800, James Stroud
<js*****@mbi.ucla.edu> wrote:
> Hello,
> I have strings represented as a combination of an alphabet (AGCT) and an
> operator "/", that signifies degeneracy. I want to split these strings into
> lists of lists, where the degeneracies are members of the same list and
> non-degenerates are members of single item lists. An example will clarify
> this:
> "ATT/GATA/G"
> gets split to
> [['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Here are two ways without using regular expressions. Both are about the
same.

s = list("ATT/GATA/G")
result = []
while len(s)>0:
    a = [s.pop(0)]
    # guard: without "s and", s[0] raises IndexError when the
    # string ends in a plain (non-degenerate) base
    if s and s[0] == '/':
        b = s.pop(0)
        a.append(s.pop(0))
    result.append(a)
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

s = "ATT/GATA/G"
result = []
while len(s)>0:
    if s[1:2] == '/':
        result.append([s[0],s[2]])
        s = s[3:]
    else:
        result.append([s[0]])
        s = s[1:]
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
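
Both loops above shrink the input on every step (`s.pop(0)` on a list, `s = s[1:]` on a string), and each of those operations costs time proportional to what is left. For long inputs, a sketch that walks an index instead might look like this. It assumes Python 3 and well-formed input, and the name `group_by_index` is hypothetical:

```python
def group_by_index(s):
    """Group bases joined by "/" without copying or popping the input.

    Hypothetical helper, not from the thread. Assumes well-formed input:
    no leading, trailing, or doubled "/".
    """
    result = []
    i = 0
    n = len(s)
    while i < n:
        group = [s[i]]
        i += 1
        # Absorb every "/x" pair that follows the current base.
        while i + 1 < n and s[i] == "/":
            group.append(s[i + 1])
            i += 2
        result.append(group)
    return result

print(group_by_index("ATT/GATA/G"))
# [['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
```

Because of the inner `while`, this version also groups longer runs such as "C/C/T" into a single list.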

For the fans of functional programming:

>>> s='ATT/GATA/G'
>>> [y.split('/') for y in (' '.join([x for x in s]).replace(' / ', '/')).split()]
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

This is shorter:
map(list,' '.join(s).replace(' / ','').split())
but for very long genomes Michael Spencer's nice version can be faster.
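
One caveat for anyone running the one-liner today: it relies on `map` returning a list, which holds for the Python 2 of this thread. Under Python 3 `map` returns a lazy iterator, so a sketch of the equivalent (assuming Python 3; the intermediate name `tokens` is only for illustration) wraps it in `list`:

```python
s = "ATT/GATA/G"

# Space-separate every character, then delete the " / " separators so that
# alternatives end up adjacent ("T / G" -> "TG"); each whitespace-separated
# token is then one position of the consensus.
tokens = " ".join(s).replace(" / ", "").split()
consensus = list(map(list, tokens))

print(tokens)
print(consensus)
```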
Hugs,
Bearophile

On 28 Mar 2005 04:12:15 -0800, be************@lycos.com
<be************@lycos.com> wrote:
> This is shorter:
> map(list,' '.join(s).replace(' / ','').split())
> but for very long genomes Michael Spencer's nice version can be faster.

for very long genomes he might want a generator:
def xgen(s):
    l = len(s) - 1
    e = enumerate(s)
    for i,c in e:
        if i < l and s[i+1] == '/':
            e.next()
            i2, c2 = e.next()
            yield [c, c2]
        else:
            yield [c]

>>> for g in xgen('ATT/GATA/G'): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']
Peace
Bill Mill
bill.mill at gmail.com

Bill Mill wrote:
> for very long genomes he might want a generator:
>
> def xgen(s):
>     l = len(s) - 1
>     e = enumerate(s)
>     for i,c in e:
>         if i < l and s[i+1] == '/':
>             e.next()
>             i2, c2 = e.next()
>             yield [c, c2]
>         else:
>             yield [c]
>
> >>> for g in xgen('ATT/GATA/G'): print g
> ...
> ['A']
> ['T']
> ['T', 'G']
> ['A']
> ['T']
> ['A', 'G']
>
> Peace
> Bill Mill
> bill.mill at gmail.com

works according to the original spec, but there are a couple of issues:
1. the output is specified to be a list, so delaying the creation of the list
isn't a win
2. this version falls down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post:

>>> list(xgen("AGC/C/TGA/T"))
[['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

Michael

On Mon, 28 Mar 2005 09:18:38 -0800, Michael Spencer
<ma**@telcopartners.com> wrote:
> Bill Mill wrote:
> > for very long genomes he might want a generator:
> >
> > def xgen(s):
> >     l = len(s) - 1
> >     e = enumerate(s)
> >     for i,c in e:
> >         if i < l and s[i+1] == '/':
> >             e.next()
> >             i2, c2 = e.next()
> >             yield [c, c2]
> >         else:
> >             yield [c]
> >
> > >>> for g in xgen('ATT/GATA/G'): print g
> > ...
> > ['A']
> > ['T']
> > ['T', 'G']
> > ['A']
> > ['T']
> > ['A', 'G']
> >
> > Peace
> > Bill Mill
> > bill.mill at gmail.com
>
> works according to the original spec, but there are a couple of issues:
> 1. the output is specified to be a list, so delaying the creation of the list
> isn't a win

True. However, if it is a really long genome, he's not going to want
to have both a string of the genome and a list of the genome in
memory. Instead, I thought it might be useful to iterate through the
genome so that it doesn't have to be stored in memory. Since he didn't
specify what he wants the list for, it's possible that he just needs
to iterate through the genome, grouping degeneracies as he goes.

> 2. this version falls down in the presence of "double degeneracies" (if that's
> what they should be called) - which were not in the OP spec, but which cropped
> up in a later post:
> >>> list(xgen("AGC/C/TGA/T"))
> [['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:
def xgen(s):
    e = enumerate(s)
    stack = [e.next()[1]] #push the first char into the stack
    for i,c in e:
        if c != '/':
            yield stack
            stack = [c]
        else:
            stack.append(e.next()[1])
    yield stack

>>> gn
'ATT/GATA/G/AT'
>>> for g in xgen(gn): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G', 'A']
['T']
Peace
Bill Mill
bill.mill at gmail.com

Bill Mill wrote:
> [long genomes might justify a generator approach]

That's a good point. I should have said: *If* you are going to put the items
into a list anyway, then there is no point generating the list items individually.

> Michael Spencer wrote:
> > [Bill's solution didn't work for multiple-degeneracies]
> This is simple enough to fix, in basically the same way your function
> works. I think it actually makes the function simpler:
>
> def xgen(s):
>     e = enumerate(s)
>     stack = [e.next()[1]] #push the first char into the stack
>     for i,c in e:
>         if c != '/':
>             yield stack
>             stack = [c]
>         else:
>             stack.append(e.next()[1])
>     yield stack

That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):
def xgen(s):
    srciter = iter(s)
    item = [srciter.next()]
    for i in srciter:
        if i == '/':
            item.append(srciter.next())
        else:
            yield item
            item = [i]
    yield item
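
The `.next()` method used above is the Python 2 spelling. For anyone on Python 3, a sketch of the same generator uses the built-in `next()` instead. The empty-input guard and the counting example after it are illustrations added here, not something from the thread:

```python
def xgen(s):
    """Yield one list per position, grouping bases joined by "/"."""
    srciter = iter(s)
    try:
        item = [next(srciter)]
    except StopIteration:
        return  # empty input: yield nothing
    for ch in srciter:
        if ch == "/":
            # Assumes a base follows every "/"; a trailing "/" is not
            # handled in this sketch.
            item.append(next(srciter))
        else:
            yield item
            item = [ch]
    yield item

# Count degenerate positions without materialising the whole result.
degenerate = sum(1 for group in xgen("ATT/GATA/G/AT") if len(group) > 1)
print(degenerate)  # 2
```

The `sum` call consumes the groups one at a time, which is the case where a generator actually saves memory over building the full list.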
Cheers
Michael

Michael Spencer wrote:
> def xgen(s):
>     srciter = iter(s)
>     item = [srciter.next()]
>     for i in srciter:
>         if i == '/':
>             item.append(srciter.next())
>         else:
>             yield item
>             item = [i]
>     yield item

Note that the generator-based solution doesn't generate an error on some
invalid data (e.g. where there is a final '/'), where the previous
list-based solution did:
py> group("AGC/C/TGA/T")
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> group("AGC/C/TGA/T/")
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 6, in group
StopIteration
py> list(xgen("AGC/C/TGA/T"))
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> list(xgen("AGC/C/TGA/T/"))
[['A'], ['G'], ['C', 'C', 'T'], ['G']]
Not sure which is the desired behavior, but I figured the OP should be
aware of this in case it's possible to have strings in an invalid
format. If this needs to be fixed, you can just wrap the srciter.next()
call in an appropriate try/except.
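
A sketch of that try/except, assuming Python 3. The name `xgen_strict` is hypothetical, and `ValueError` is just one reasonable choice of error since the thread does not specify one. Under Python 3.7 and later the wrapper matters even more, because a `StopIteration` escaping inside a generator body is turned into a `RuntimeError` instead of quietly ending the iteration:

```python
def xgen_strict(s):
    """Like xgen, but reject a "/" that is not between two bases."""
    srciter = iter(s)
    item = None
    for ch in srciter:
        if ch == "/":
            try:
                nxt = next(srciter)
            except StopIteration:
                # The input ran out right after a "/".
                raise ValueError("input ends with '/'") from None
            if item is None or nxt == "/":
                raise ValueError("'/' must sit between two bases")
            item.append(nxt)
        else:
            if item is not None:
                yield item
            item = [ch]
    if item is not None:
        yield item

print(list(xgen_strict("AGC/C/TGA/T")))
```

Since this is still a generator, the error only surfaces when the offending part of the input is actually consumed, for example inside `list(...)`.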
STeVe

<snip everything>
> That is clearer. At this point, though, you don't need the enumerator any more
> (so you can avoid indexing each item):

Good point.

> def xgen(s):
>     srciter = iter(s)
>     item = [srciter.next()]
>     for i in srciter:
>         if i == '/':
>             item.append(srciter.next())
>         else:
>             yield item
>             item = [i]
>     yield item

For some reason, keeping the != first feels a lot more logical to me,
but I think that's just a reflection of my particular mental model of
the problem. Also, item is a much clearer name than stack; I chose
stack just to point out how similar the solution to objection 2 was to
yours.
Peace
Bill Mill
bill.mill at gmail.com