473,320 Members | 1,857 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

grouping subsequences with BIO tags

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

This is the best I've come up with so far:

py> class K(object):
.... def __init__(self):
.... self.last_result = False
.... self.last_label = 'O'
.... def __call__(self, label):
.... if label[:2] in ('O', 'B_'):
.... self.last_result = not self.last_result
.... elif self.last_label[2:] != label[2:]:
.... raise ValueError('%s followed by %s' %
.... (self.last_label, label))
.... self.last_label = label
.... return self.last_result
....
py> def get_runs(lst):
.... for _, item in itertools.groupby(lst, K()):
.... result = list(item)
.... if result != ['O']:
.... yield result
....
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

STeVe
Jul 19 '05 #1
6 1053
Steven Bethard wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

I'd have done it the same way as you, but here's 'another' way:
def grp(lst): ... stack = []
... for label in lst:
... prefix = label[0]
... if prefix == 'B':
... group = [label]
... stack.append(group)
... elif prefix == 'I':
... if group[0][2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (group[0], label))
... group.append(label)
... elif prefix == 'O':
... group = [label]
... return stack
... grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'O', 'I_X', 'B_X'])

Traceback (most recent call last):
File "<input>", line 1, in ?
File "\\CC1040907-A\MichaelDocuments\PyDev\Junk\BIO.py", line 32, in grp
raise ValueError('%s followed by %s' %
ValueError: O followed by I_X

Michael

Jul 19 '05 #2
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <st************@gmail.com> wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
def get_runs(seq): ... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
... list(get_runs(L)) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

But note that I allowed multiple I_X, did you want to do that? list(get_runs('B_X I_X I_X'.split())) [['B_X', 'I_X', 'I_X']]

Did you want all these "errors" caught? list(get_runs('B_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 15, in get_runs
ValueError: Unrecognized element: '?_X' list(get_runs('I_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X' list(get_runs('B_X I_Y ?_X'.split()))

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

This is the best I've come up with so far:

py> class K(object):
... def __init__(self):
... self.last_result = False
... self.last_label = 'O'
... def __call__(self, label):
... if label[:2] in ('O', 'B_'):
... self.last_result = not self.last_result Thanks for nice reminder that groupby doesn't need distinct keys
for all groups, just different from last ;-)
... elif self.last_label[2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (self.last_label, label))
... self.last_label = label
... return self.last_result
...
py> def get_runs(lst):
... for _, item in itertools.groupby(lst, K()):
... result = list(item)
... if result != ['O']:
... yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

Well, there's no end to "another way", but above
is my .02USD ;-)

Regards,
Bengt Richter
Jul 19 '05 #3
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <st************@gmail.com> wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
>>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>> def get_runs(seq): ... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
... >>> list(get_runs(L)) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]


Yeah, I started this route, and got confused by it. Of course it makes
perfect sense when someone writes you a working version. ;) Thanks!
But note that I allowed multiple I_X, did you want to do that? >>> list(get_runs('B_X I_X I_X'.split())) [['B_X', 'I_X', 'I_X']]
Yeah, that's right. Multiple 'I_...'s should be grouped together.
Did you want all these "errors" caught? >>> list(get_runs('B_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 15, in get_runs
ValueError: Unrecognized element: '?_X' >>> list(get_runs('I_X I_X ?_X'.split())) Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X' >>> list(get_runs('B_X I_Y ?_X'.split()))

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)


Yeah, those are the right errors. I'll have to think about whether I
should be trying to catch the [^BI]_ error. It doesn't appear in my
data now, but that doesn't mean it might not in the future. Thanks!

STeVe
Jul 19 '05 #4
Steven Bethard wrote:
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard
<st************@gmail.com> wrote:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,


I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
timethem(L)

get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call
Michael

Jul 19 '05 #5
Michael Spencer wrote:
Steven Bethard wrote:
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard
<st************@gmail.com> wrote:

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,


I'm curious why you (Bengt or Steve) think the generator is an advantage
here. As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit
faster.


Aren't they basically just the same solution, with your stack.append
replaced by a yield (and with a little additional error checking)? As
far as I'm concerned, either solution is great and writes the code that
I couldn't. ;)

If you're still interested, in the real problem, the data doesn't exist
as a list of strings; it exists as a list of objects for which there is
a Python wrapper to a C API that retrieves the string. I don't know
exactly what happens in the wrapping, but it's possible that I can
conserve some memory by using the generator function. But I'd have to
profile it to know for sure.

STeVe
Jul 19 '05 #6
On Fri, 22 Apr 2005 16:01:42 -0700, Michael Spencer <ma**@telcopartners.com> wrote:
Steven Bethard wrote:
Bengt Richter wrote:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard
<st************@gmail.com> wrote:

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.
I hadn't seen your post[1], which I think is a nice crisp and clever solution ;-)

I just wrote what I thought was a straightforward solution, anticipating that
the imput list might be some huge bioinfo thing, and you might want to iterate
through the sublists one at a time and not want to build the whole list of
lists as represented by your stack.

[1] I don't know why stuff arrives almost instantly sometimes, and sometimes quite
delayed and out of order, but it is a bit embarrassing to post a me-too without
relevant comment, or being able to decide whether to play open source leapfrog.
In this case, I don't see a lily pad on the other side of your code, other than
the memory aspect ;-)

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
>>> timethem(L)

get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call
Michael


Regards,
Bengt Richter
Jul 19 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
by: Gilles Kuhn | last post by:
Hello all, here is my problem: Let's assume I have a movie list containing movies from different countries. The xml looks like: <root> <movie> <title>matrix</title> <country>USA</country>...
6
by: Jon Bosker | last post by:
Help! This is probably easy but I just don't get it. I am trying to relate and merge 2 datasets. My XML file has 2 datasets (1st level nodes) TimeDetail and TimeSummary. The report is supposed to...
1
by: amber | last post by:
Hello, I have a report in VB.NET/Crystal Reports. I have a criteria form that users select between 2 different types of grouping (group by category or group by year). Can I programmatically...
6
by: Sugapablo | last post by:
Lets say I have several DIVs, each with an ID like X1, X3424, X234, etc. I'd like to render all elements that have an ID that starts with "X" invisible (for example), regardless of what followed...
1
by: masonmarc | last post by:
I have a document like: <all> <phone> <number>12345</number> <r comp="1"/> <r comp="2"/> <r comp="3"/> <r comp="1"> <r address="a"/>
2
by: Andreas Håkansson | last post by:
Seeing how my previous post seem to have fallen between the cracks, I thought I would have a second, more direct, go at it. So my question is "Is it possible to group (Muenchian method) over...
3
by: ahaque38 | last post by:
Hello. Using A2K SP3, I am having the following problem with a report using "Sorting and Grouping". I have recently added a grouping in the reports for "Category2<>'CONTRACTS'". I have...
8
by: Mike MacSween | last post by:
tblCourses one to many to tblEvents. A course may have an intro workshop (a type of event), a mid course workshop, a final exam. Or any combination. Or something different in the future. At...
2
savanm
by: savanm | last post by:
I have a doubt in the following code. $temp=~s/<chapter_contents_B>(.*?)<i>(.*?)<\/i><\/chapter_contents_B>/<item1 id="*"><p>$1<pageNum pageId="$2"><i>$2<\/i><\/pageNum><\/p><\/item1>/g; In the...
0
by: Roman Bertle | last post by:
Hello, I try to format monetary values using the locale module, python2.5: Python 2.5.2a0 (r251:54863, Jan 3 2008, 17:59:56) on linux2 Type "help", "copyright", "credits" or "license" for...
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
0
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
0
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.