By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
455,587 Members | 1,702 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 455,587 IT Pros & Developers. It's quick & easy.

grouping subsequences with BIO tags

P: n/a
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

This is the best I've come up with so far:

py> class K(object):
.... def __init__(self):
.... self.last_result = False
.... self.last_label = 'O'
.... def __call__(self, label):
.... if label[:2] in ('O', 'B_'):
.... self.last_result = not self.last_result
.... elif self.last_label[2:] != label[2:]:
.... raise ValueError('%s followed by %s' %
.... (self.last_label, label))
.... self.last_label = label
.... return self.last_result
....
py> def get_runs(lst):
.... for _, item in itertools.groupby(lst, K()):
.... result = list(item)
.... if result != ['O']:
.... yield result
....
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

STeVe
Jul 19 '05 #1
Share this Question
Share on Google+
6 Replies


P: n/a
Steven Bethard wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

I'd have done it the same way as you, but here's 'another' way:
def grp(lst): ... stack = []
... for label in lst:
... prefix = label[0]
... if prefix == 'B':
... group = [label]
... stack.append(group)
... elif prefix == 'I':
... if group[0][2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (group[0], label))
... group.append(label)
... elif prefix == 'O':
... group = [label]
... return stack
... grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'O', 'I_X', 'B_X'])

Traceback (most recent call last):
File "<input>", line 1, in ?
File "\\CC1040907-A\MichaelDocuments\PyDev\Junk\BIO.py", line 32, in grp
raise ValueError('%s followed by %s' %
ValueError: O followed by I_X

Michael

Jul 19 '05 #2

P: n/a
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <st************@gmail.com> wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
def get_runs(seq): ... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
... list(get_runs(L)) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

But note that I allowed multiple I_X, did you want to do that? list(get_runs('B_X I_X I_X'.split())) [['B_X', 'I_X', 'I_X']]

Did you want all these "errors" caught? list(get_runs('B_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 15, in get_runs
ValueError: Unrecognized element: '?_X' list(get_runs('I_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X' list(get_runs('B_X I_Y ?_X'.split()))

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

This is the best I've come up with so far:

py> class K(object):
... def __init__(self):
... self.last_result = False
... self.last_label = 'O'
... def __call__(self, label):
... if label[:2] in ('O', 'B_'):
... self.last_result = not self.last_result Thanks for nice reminder that groupby doesn't need distinct keys
for all groups, just different from last ;-)
... elif self.last_label[2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (self.last_label, label))
... self.last_label = label
... return self.last_result
...
py> def get_runs(lst):
... for _, item in itertools.groupby(lst, K()):
... result = list(item)
... if result != ['O']:
... yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

Well, there's no end to "another way", but above
is my .02USD ;-)

Regards,
Bengt Richter
Jul 19 '05 #3

P: n/a
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <st************@gmail.com> wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
>>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>> def get_runs(seq): ... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
... >>> list(get_runs(L)) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]


Yeah, I started this route, and got confused by it. Of course it makes
perfect sense when someone writes you a working version. ;) Thanks!
But note that I allowed multiple I_X, did you want to do that? >>> list(get_runs('B_X I_X I_X'.split())) [['B_X', 'I_X', 'I_X']]
Yeah, that's right. Multiple 'I_...'s should be grouped together.
Did you want all these "errors" caught? >>> list(get_runs('B_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 15, in get_runs
ValueError: Unrecognized element: '?_X' >>> list(get_runs('I_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X' >>> list(get_runs('B_X I_Y ?_X'.split()))

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)


Yeah, those are the right errors. I'll have to think about whether I
should be trying to catch the [^BI]_ error. It doesn't appear in my
data now, but that doesn't mean it might not in the future. Thanks!

STeVe
Jul 19 '05 #4

P: n/a
Steven Bethard wrote:
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard
<st************@gmail.com> wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,


I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
timethem(L)

get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call
Michael

Jul 19 '05 #5

P: n/a
Michael Spencer wrote:
Steven Bethard wrote:
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard
<st************@gmail.com> wrote:

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,


I'm curious why you (Bengt or Steve) think the generator is an advantage
here. As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit
faster.


Aren't they basically just the same solution, with your stack.append
replaced by a yield (and with a little additional error checking)? As
far as I'm concerned, either solution is great and writes the code that
I couldn't. ;)

If you're still interested, in the real problem, the data doesn't exist
as a list of strings; it exists as a list of objects for which there is
a Python wrapper to a C API that retrieves the string. I don't know
exactly what happens in the wrapping, but it's possible that I can
conserve some memory by using the generator function. But I'd have to
profile it to know for sure.

STeVe
Jul 19 '05 #6

P: n/a
On Fri, 22 Apr 2005 16:01:42 -0700, Michael Spencer <ma**@telcopartners.com> wrote:
Steven Bethard wrote:
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard
<st************@gmail.com> wrote:

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.
I hadn't seen your post[1], which I think is a nice crisp and clever solution ;-)

I just wrote what I thought was a straightforward solution, anticipating that
the imput list might be some huge bioinfo thing, and you might want to iterate
through the sublists one at a time and not want to build the whole list of
lists as represented by your stack.

[1] I don't know why stuff arrives almost instantly sometimes, and sometimes quite
delayed and out of order, but it is a bit embarrassing to post a me-too without
relevant comment, or being able to decide whether to play open source leapfrog.
In this case, I don't see a lily pad on the other side of your code, other than
the memory aspect ;-)

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
>>> timethem(L)

get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call
Michael


Regards,
Bengt Richter
Jul 19 '05 #7

This discussion thread is closed

Replies have been disabled for this discussion.