By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,768 Members | 2,004 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,768 IT Pros & Developers. It's quick & easy.

String Splitter Brain Teaser

P: n/a
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

I have written a very ugly function to do this (listed below for the curious),
but intuitively I think this should only take a couple of lines for one
skilled in regex and/or listcomp. Any takers?

James

p.s. Here is the ugly function I wrote:

def build_consensus(astr):

consensus = [] # the lol that will be returned
possibilities = [] # one element of consensus
consecutives = 0 # keeps track of how many in a row

for achar in astr:
if (achar == "/"):
consecutives = 0
continue
else:
consecutives += 1
if (consecutives > 1):
consensus.append(possibilities)
possibilities = [achar]
else:
possibilities.append(achar)
if possibilities:
consensus.append(possibilities)
return consensus

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Jul 18 '05 #1
Share this Question
Share on Google+
13 Replies


P: n/a
Using a parser may sound like overkill, but why not when it's this
easy? Get the latest pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import oneOf, Group, OneOrMore, Literal

testdata = "ATT/GATA/G"

marker = oneOf( "A T G C")
SLASH = Literal("/").suppress()
genDegenList = OneOrMore( Group( marker + SLASH + marker ) | Group(
marker ) )

print genDegenList.parseString( testdata )

(prints:
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Jul 18 '05 #2

P: n/a
In article <ma*************************************@python.or g>,
James Stroud <js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.
Jul 18 '05 #3

P: n/a
On Sun, 27 Mar 2005 14:39:06 -0800, James Stroud
<js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Here's two ways without using regular expression. Both about the
same.

s = list("ATT/GATA/G")
result = []
while len(s)>0:
a = [s.pop(0)]
if s[0] == '/':
b = s.pop(0)
a.append(s.pop(0))
result.append(a)
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
s = "ATT/GATA/G"
result = []
while len(s)>0:
if s[1:2] == '/':
result.append([s[0],s[2]])
s = s[3:]
else:
result.append([s[0]])
s = s[1:]
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Jul 18 '05 #4

P: n/a
For the fans of funtional programming:
s='ATT/GATA/G'
[y.split('/') for y in (' '.join([x for x in s]).replace(' / ',

'/')).split()]
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Jul 18 '05 #5

P: n/a
This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.

Hugs,
Bearophile

Jul 18 '05 #6

P: n/a
On 28 Mar 2005 04:12:15 -0800, be************@lycos.com
<be************@lycos.com> wrote:
This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.


for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]
for g in xgen('ATT/GATA/G'): print g

....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #7

P: n/a
Bill Mill wrote:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

for g in xgen('ATT/GATA/G'): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com


works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post :
list(xgen("AGC/C/TGA/T"))

[['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

Michael

Jul 18 '05 #8

P: n/a
On Mon, 28 Mar 2005 09:18:38 -0800, Michael Spencer
<ma**@telcopartners.com> wrote:
Bill Mill wrote:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

>for g in xgen('ATT/GATA/G'): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com


works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win


True. However, if it is a really long genome, he's not going to want
to have both a string of the genome and a list of the genome in
memory. Instead, I thought it might be useful to iterate through the
genome so that it doesn't have to be stored in memory. Since he didn't
specify what he wants the list for, it's possible that he just needs
to iterate through the genome, grouping degeneracies as he goes.

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post :
>>> list(xgen("AGC/C/TGA/T")) [['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]


This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

gn 'ATT/GATA/G/AT' for g in xgen(gn): print g

....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G', 'A']
['T']

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #9

P: n/a
Bill Mill wrote:
[long genomes might justify a generator approach]

That's a good point. I should have said: *If* you are going to put the items
into a list anyway, then there is no point generating the list items individually.

Michael Spencer wrote:
[Bill's solution didn't work for multiple-degeneracies]

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item = [i]
yield item

Cheers
Michael

Jul 18 '05 #10

P: n/a
Michael Spencer wrote:
def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item = [i]
yield item


Note that the generator-based solution doesn't generate an error on some
invalid data (e.g. where there is a final '/'), where the previous
list-based solution did:

py> group("AGC/C/TGA/T")
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> group("AGC/C/TGA/T/")
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 6, in group
StopIteration
py> list(xgen("AGC/C/TGA/T"))
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> list(xgen("AGC/C/TGA/T/"))
[['A'], ['G'], ['C', 'C', 'T'], ['G']]

Not sure which is the desired behavior, but I figured the OP should be
aware of this in case it's possible to have strings in an invalid
format. If this needs to be fixed, you can just wrap the srciter.next()
call in an appropriate try/except.

STeVe
Jul 18 '05 #11

P: n/a
<snip everything>
That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):
Good point.

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item = [i]
yield item


For some reason, keeping the != first feels a lot more logical to me,
but I think that's just a reflection of my particular mental model of
the problem. Also, item is a much clearer name than stack; I chose
stack just to point out how similar the solution to objection 2 was to
yours.

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #12

P: n/a
In article <ma*************************************@python.or g>,
James Stroud <js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.
Jul 18 '05 #13

P: n/a
In article <ma*************************************@python.or g>,
James Stroud <js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.
Jul 18 '05 #14

This discussion thread is closed

Replies have been disabled for this discussion.