473,320 Members | 1,719 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

String Splitter Brain Teaser

Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

I have written a very ugly function to do this (listed below for the curious),
but intuitively I think this should only take a couple of lines for one
skilled in regex and/or listcomp. Any takers?

James

p.s. Here is the ugly function I wrote:

def build_consensus(astr):

consensus = [] # the lol that will be returned
possibilities = [] # one element of consensus
consecutives = 0 # keeps track of how many in a row

for achar in astr:
if (achar == "/"):
consecutives = 0
continue
else:
consecutives += 1
if (consecutives > 1):
consensus.append(possibilities)
possibilities = [achar]
else:
possibilities.append(achar)
if possibilities:
consensus.append(possibilities)
return consensus

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Jul 18 '05 #1
13 1501
Using a parser may sound like overkill, but why not when it's this
easy? Get the latest pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import oneOf, Group, OneOrMore, Literal

testdata = "ATT/GATA/G"

marker = oneOf( "A T G C")
SLASH = Literal("/").suppress()
genDegenList = OneOrMore( Group( marker + SLASH + marker ) | Group(
marker ) )

print genDegenList.parseString( testdata )

(prints:
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Jul 18 '05 #2
In article <ma*************************************@python.or g>,
James Stroud <js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.
Jul 18 '05 #3
On Sun, 27 Mar 2005 14:39:06 -0800, James Stroud
<js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Here's two ways without using regular expression. Both about the
same.

s = list("ATT/GATA/G")
result = []
while len(s)>0:
a = [s.pop(0)]
if s[0] == '/':
b = s.pop(0)
a.append(s.pop(0))
result.append(a)
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
s = "ATT/GATA/G"
result = []
while len(s)>0:
if s[1:2] == '/':
result.append([s[0],s[2]])
s = s[3:]
else:
result.append([s[0]])
s = s[1:]
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Jul 18 '05 #4
For the fans of funtional programming:
s='ATT/GATA/G'
[y.split('/') for y in (' '.join([x for x in s]).replace(' / ',

'/')).split()]
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Jul 18 '05 #5
This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.

Hugs,
Bearophile

Jul 18 '05 #6
On 28 Mar 2005 04:12:15 -0800, be************@lycos.com
<be************@lycos.com> wrote:
This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.


for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]
for g in xgen('ATT/GATA/G'): print g

....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #7
Bill Mill wrote:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

for g in xgen('ATT/GATA/G'): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com


works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post :
list(xgen("AGC/C/TGA/T"))

[['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

Michael

Jul 18 '05 #8
On Mon, 28 Mar 2005 09:18:38 -0800, Michael Spencer
<ma**@telcopartners.com> wrote:
Bill Mill wrote:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

>for g in xgen('ATT/GATA/G'): print g
...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com


works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win


True. However, if it is a really long genome, he's not going to want
to have both a string of the genome and a list of the genome in
memory. Instead, I thought it might be useful to iterate through the
genome so that it doesn't have to be stored in memory. Since he didn't
specify what he wants the list for, it's possible that he just needs
to iterate through the genome, grouping degeneracies as he goes.

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post :
>>> list(xgen("AGC/C/TGA/T")) [['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]


This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

gn 'ATT/GATA/G/AT' for g in xgen(gn): print g

....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G', 'A']
['T']

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #9
Bill Mill wrote:
[long genomes might justify a generator approach]

That's a good point. I should have said: *If* you are going to put the items
into a list anyway, then there is no point generating the list items individually.

Michael Spencer wrote:
[Bill's solution didn't work for multiple-degeneracies]

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item = [i]
yield item

Cheers
Michael

Jul 18 '05 #10
Michael Spencer wrote:
def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item = [i]
yield item


Note that the generator-based solution doesn't generate an error on some
invalid data (e.g. where there is a final '/'), where the previous
list-based solution did:

py> group("AGC/C/TGA/T")
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> group("AGC/C/TGA/T/")
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 6, in group
StopIteration
py> list(xgen("AGC/C/TGA/T"))
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> list(xgen("AGC/C/TGA/T/"))
[['A'], ['G'], ['C', 'C', 'T'], ['G']]

Not sure which is the desired behavior, but I figured the OP should be
aware of this in case it's possible to have strings in an invalid
format. If this needs to be fixed, you can just wrap the srciter.next()
call in an appropriate try/except.

STeVe
Jul 18 '05 #11
<snip everything>
That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):
Good point.

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item = [i]
yield item


For some reason, keeping the != first feels a lot more logical to me,
but I think that's just a reflection of my particular mental model of
the problem. Also, item is a much clearer name than stack; I chose
stack just to point out how similar the solution to objection 2 was to
yours.

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #12
In article <ma*************************************@python.or g>,
James Stroud <js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.
Jul 18 '05 #13
In article <ma*************************************@python.or g>,
James Stroud <js*****@mbi.ucla.edu> wrote:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

--
Doug Schwarz
dmschwarz&urgrad,rochester,edu
Make obvious changes to get real email address.
Jul 18 '05 #14

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: brendan | last post by:
here's a brain teaser for someone with more skills or at least more lateral thinking capability than me - done my nut over this one... have written a list manager in PHP which a) posts out new...
15
by: Chung Leong | last post by:
Here's a little brain teaser distilled from a bug that took me a rather long time to figure out. The two functions in the example below behave differently. The difference is easy to spot, of...
3
by: datagal | last post by:
I have a requirement (motivated by a SOX thing) that is just giving me fits. I know it should be easy and I'm probably overthinking it, but I just can seem to find the best way to get where I need...
2
RedSon
by: RedSon | last post by:
Given a directed graph of k-nodes such that the last node creates a cycle with any other node determine which node the last node's edge points to using the minimum amount of resources and without...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.