Hey everyone,
For the regular expression gurus...
I'm trying to write a string matching algorithm for genomic
sequences. I'm pulling out Genes from a large genomic pattern, with
certain start and stop codons on either side. This is simple
enough... for example:
start = AUG, stop = AGG
BBBBBBAUGWWWWWWAGGBBBBBB
So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
This works great with my current regular expression.
The problem, however, is that codons come in sets of 3 bases. So
there are actually three different 'frames' I could be using. For
example:
ABCDEFGHIJ
I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
So finally, my question. How can I represent this in a regular
expression? :) This is what I'd like to do:
(Find all groups of any three characters) (Find a start codon) (find
any other codons) (Find an end codon)
Is this possible? It seems that I'd want to do something like this:
(\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
three non-whitespace characters, followed by AUG \s AGG, and then
anything else. I hope I am making sense. Obviously, however, this
will make sure that ANY set of three characters exist before a start
codon. Is there a way to match exactly, to say something like 'Find
all sets of three, then AUG and AGG, etc.'. This way, I could scan
for genes, remove the first letter, scan for more genes, remove the
first letter again, and scan for more genes. This would
hypothetically yield different genes, since the frame would be
shifted.
This might be a lot of information... I appreciate any insight. Thank
you!
Blaine
Here's one idea (untested):
    s = {}
    for x in range(len(genes) - 2):  # - 2 so the final triple is included
        s[x] = genes[x:x + 3]
You might like Python's 'string slicing' feature.
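The slicing idea can also be applied per reading frame. A minimal sketch, assuming Python 3; the helper name `codons` and its `frame` parameter are made up for illustration and are not from the post above:

```python
def codons(seq, frame=0):
    """Split seq into complete 3-letter codons, starting at offset frame (0, 1 or 2)."""
    # Stop at len(seq) - 2 so a trailing partial codon is dropped.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "ABCDEFGHIJ"  # the example string from the question
for frame in range(3):
    print(frame, codons(seq, frame))
```

For the ABCDEFGHIJ example this gives ABC DEF GHI, then BCD EFG HIJ, then CDE FGH, with the incomplete last codon of the third frame left out.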
True - I could try something like that. In fact I have a 'codon'
function that does exactly that. The problem is that I then have to
go back through and loop over the list. I'm trying to use Regular
Expressions so that my processing is quicker. Complexity is key since
this genomic string is pretty large.
Thanks for the suggestion though!
In article
<e6**********************************@8g2000hse.googlegroups.com>,
blaine <fr*****@gmail.com> wrote:
Hey everyone,
For the regular expression gurus...
I'm trying to write a string matching algorithm for genomic
sequences.
I strongly suggest you stop trying to reinvent the wheel and read up on the
Biopython project (http://biopython.org/wiki/Main_Page).
blaine <fr*****@gmail.com> wrote:
Is this possible? It seems that I'd want to do something like this:
(\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
three non-whitespace characters, followed by AUG \s AGG, and then
anything else.
I'm not sure what the \s are doing in there - there doesn't appear to
be any whitespace in your examples.
I hope I am making sense. Obviously, however, this will make sure
that ANY set of three characters exist before a start codon. Is
there a way to match exactly, to say something like 'Find all sets
of three, then AUG and AGG, etc.'.
I think you want
^(\w\w\w)*(AUG)((\w\w\w)*?)(AGG)
which will match zero or more triples, then AUG, then zero or more triples,
then AGG. The ? makes it a minimal (non-greedy) match; otherwise you'll match
more than you expect if there are two AUG...AGG sequences in a given genome.
>>> import re
>>> m = re.compile(r"^(\w\w\w)*(AUG)((\w\w\w)*?)(AGG)")
>>> m.search("BBBBBBAUGWWWWWWAGGBBBBBB").groups()
('BBB', 'AUG', 'WWWWWW', 'WWW', 'AGG')
>>> m.search("BBBQBBBAUGWWWWWWAGGBBBBBB")
>>> m.search("BBBQQBBBAUGWWWWWWAGGBBBBBB")
>>> m.search("BBBQQBBQBAUGWWWWWWAGGBBBBBB")
<_sre.SRE_Match object at 0xb7de33e0>
>>> m.search("BBBQQBBQBAUGWWWWWWAGGBBBBBB").groups()
('BQB', 'AUG', 'WWWWWW', 'WWW', 'AGG')
>>> m.search("BBBQQBBQBAUGWQWWWWWAGGBBBBBB")
>>> m.search("BBBQQBBQBAUGWWWWQWWAGGBBBBBB")
>>> m.search("BBBQQBBQBAUGWWQWWQWWAGGBBBBBB")
>>> m.search("BBBQQBBQBAUGWWQWAWQWWAGGBBBBBB")
<_sre.SRE_Match object at 0xb7de33e0>
>>> m.search("BBBQQBBQBAUGWWQWAWQWWAGGBBBBBB").groups()
('BQB', 'AUG', 'WWQWAWQWW', 'QWW', 'AGG')
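To see what the ? buys you, here is a small sketch comparing the minimal and greedy forms of the middle group. It assumes Python 3; the test string is made up and contains two AUG...AGG stretches:

```python
import re

seq = "AUGWWWAGGBBBAUGWWWAGG"  # made-up data: two genes separated by BBB

lazy = re.compile(r"AUG(?:\w\w\w)*?AGG")   # stop at the first in-frame AGG
greedy = re.compile(r"AUG(?:\w\w\w)*AGG")  # run on to the last in-frame AGG

lazy_hits = lazy.findall(seq)
greedy_hits = greedy.findall(seq)
print(lazy_hits)    # ['AUGWWWAGG', 'AUGWWWAGG']
print(greedy_hits)  # ['AUGWWWAGGBBBAUGWWWAGG']
```

The greedy version swallows both genes and the filler between them as one match, which is the "more than you expect" case.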
This way, I could scan for genes, remove the first letter, scan for
more genes, remove the first letter again, and scan for more genes.
This would hypothetically yield different genes, since the frame
would be shifted.
Or you could just unconstrain the first match and it will do them all
at once:
(AUG)((\w\w\w)*?)(AGG)
You could run this with re.findall, but beware that this will only
return non-overlapping matches which may not be what you want.
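If overlapping matches are wanted, one common workaround is to wrap the whole pattern in a zero-width lookahead so that nothing is consumed. This is a sketch under that assumption, in Python 3; the names `GENE` and `find_genes` and the test string are made up for illustration:

```python
import re

# Zero-width lookahead: the engine consumes nothing, so the scan resumes one
# character later and a gene starting inside another gene is still reported.
GENE = re.compile(r"(?=(AUG(?:\w\w\w)*?AGG))")

def find_genes(seq):
    """Return (start, frame, gene) for every AUG that has an in-frame AGG."""
    return [(m.start(), m.start() % 3, m.group(1)) for m in GENE.finditer(seq)]

seq = "AUGCAUGCCAGGCAGG"  # made-up data with two overlapping genes
hits = find_genes(seq)
for start, frame, gene in hits:
    print(start, frame, gene)
# 0 0 AUGCAUGCCAGG
# 4 1 AUGCCAGGCAGG
```

A plain re.findall with the unanchored pattern reports only the first of these two, because the second starts inside it. The start offset modulo 3 gives the frame, so no letters need to be stripped between passes. On a very large sequence the lookahead is retried at every position, so this may be slower than a single non-overlapping scan.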
I'm not sure re's are the best tool for the job, but they should give
you a quick idea of what the answers might be.
--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Regular expressions for that sort of thing can get *really* big. The
most efficient way would be to programmatically compose the regular
expression to be as exact as possible.
import re
def permutation(lst):
    """
    From http://labix.org/snippets/permutations/. Computes permutations
    of a list iteratively.
    """
    queue = [-1]
    lenlst = len(lst)
    while queue:
        i = queue[-1] + 1
        if i == lenlst:
            # Every index has been tried at this depth: backtrack.
            queue.pop()
        elif i not in queue:
            queue[-1] = i
            if len(queue) == lenlst:
                yield [lst[j] for j in queue]
            queue.append(-1)
        else:
            queue[-1] = i
def segment_re(a, b):
    """
    Creates grouped regular expression pattern to match text between all
    possibilities of three-letter sets a and b.
    """
    def pattern(n):
        return "(%s)" % '|'.join([''.join(grp) for grp in permutation(n)])
    return re.compile(r'%s(\w+?)%s' % (pattern(a), pattern(b)))
# Show the generated pattern text rather than the compiled object.
print(segment_re(["a", "b", "c"], ["d", "e", "f"]).pattern)
You could extend segment_re to accept an integer to limit the (\w+?)
to a definite quantifier. This will grow the compiled expression in
memory but make matching faster (such as \w{3,n} to match from 3 to n
characters).
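A rough sketch of that extension, assuming Python 3. The name `bounded_segment_re` and the `max_len` parameter are hypothetical, and the standard library's itertools.permutations is swapped in for the hand-written generator:

```python
import re
from itertools import permutations

def bounded_segment_re(a, b, max_len=None):
    """Like segment_re, but optionally cap the middle at max_len characters."""
    def pattern(letters):
        # Alternation of every ordering of the letters, e.g. (abc|acb|...).
        return "(%s)" % "|".join("".join(p) for p in permutations(letters))
    # Unbounded lazy match by default; \w{3,n}? when a limit is given.
    middle = r"(\w+?)" if max_len is None else r"(\w{3,%d}?)" % max_len
    return re.compile(pattern(a) + middle + pattern(b))

rx = bounded_segment_re(["a", "b", "c"], ["d", "e", "f"], max_len=6)
m = rx.search("xxabcWWWWdefxx")
print(m.group(2))
```

Whether the bounded form is actually faster depends on the data, so it is worth timing both on a real sequence before committing to one.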
See http://artfulcode.net/articles/optim...r-expressions/ for
specifics on optimizing regexes.
As an alternative - if you do need speed - have a look at http://www.egenix.com/products/pytho...e/mxTextTools/
Helmut.
--
Helmut Jarausch
Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
Thank you! Your suggestion was extremely helpful.
Also thank you for the package suggestions. BioPython is on my plate
to check out, but I needed a kind of quick fix for this one. The
documentation for biopython seems pretty thick - I'm not a biologist
so I'm not even sure what kind of packages I'm even looking for.
thanks!
Blaine