searching substrings with interpositions

borges2003xx

Hi everyone.
A problem:
two binary strings, a="0101" b="000011110100";
I am looking for a function f(a,b) that gives 1 if a is "contained" in b with
any substrings interposed.
In this example a is contained because 000<01>111<01>00, but also
0<0>00111<101>00,
but also <0>000111<101>00, but also 000<0>1111<01>0<0>, etc.
Any idea?
Thanks in advance.
Giorgi Borghi

Jul 19 '05 #1


bplumhoff

Hello Giorgi,

I suggest googling for "python boyer moore" to find a fast
implementation of a string search algorithm in Python (the Boyer-Moore
algorithm).

One promising hit seems to be:
http://www.egenix.com/files/python/mxTextTools.html

HTH,
Bernd

Jul 19 '05 #2

Kent Johnson

bo**********@yahoo.it wrote:

> Hi everyone.
> A problem:
> two binary strings, a="0101" b="000011110100";
> I am looking for a function f(a,b) that gives 1 if a is "contained" in b with
> any substrings interposed.
> In this example a is contained because 000<01>111<01>00, but also
> 0<0>00111<101>00,
> but also <0>000111<101>00, but also 000<0>1111<01>0<0>, etc.
> Any idea?
> Thanks in advance.
> Giorgi Borghi

You can do this easily with regular expressions, though I guess performance may be poor with long strings:

>>> import re
>>> re.search('0.*1.*0.*1', '000011110100')
<_sre.SRE_Match object at 0x008D9BF0>
>>> _.span()
(0, 10)

Put the chars of the search string in groups if you need to know where they were found.
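
For the unlimited-gap case, a plain left-to-right scan avoids regex backtracking on long strings. The following is only a minimal sketch, not something posted in the thread; the name is_subsequence is made up for illustration.

```python
def is_subsequence(a, b):
    """Return True if the characters of a occur in b in order,
    with any other characters allowed in between."""
    remaining = iter(b)
    # `ch in remaining` consumes the iterator until it finds ch (or runs out),
    # so every character of b is looked at no more than once.
    return all(ch in remaining for ch in a)


print(is_subsequence("0101", "000011110100"))    # True
print(is_subsequence("010101", "000011110100"))  # False
```

This only answers yes or no; it does not report where the characters were found and it has no notion of a maximum gap.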

Kent

Jul 19 '05 #3

Claudio Grondi

> I am looking for a function f(a,b) that gives 1 if a is "contained" in b with
> any substrings interposed.

If I understand it right, it should be something
like this:

def blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB):
    intNoOfCharsFound = 0
    intPtrToBeginOfSubsectionOfB = 0
    intLenA = len(strA)
    intLenB = len(strB)
    blnStrAinB = False
    for chrA in strA:
        blnFoundChrA = False
        # print chrA
        # look for chrA in the part of strB that has not been used yet
        for indxToB in range(intPtrToBeginOfSubsectionOfB, intLenB):
            if(chrA == strB[indxToB]):
                intNoOfCharsFound += 1
                # print " ",chrA, strB[indxToB], indxToB
                intPtrToBeginOfSubsectionOfB = indxToB + 1
                blnFoundChrA = True
                break
            #:if
        #:for
        if(intNoOfCharsFound == intLenA):
            blnStrAinB = True
            print("sequence '%s' found in '%s'"%(strA, strB))
            break
        #:if
        if(blnFoundChrA == False):
            break
        #:if
    #:for
    if blnStrAinB == False:
        print("sequence '%s' not in '%s'"%(strA, strB))
    #:if
    # also hand the result back to the caller
    return blnStrAinB
#:def

strA = "0101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB)

strA = "010101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB)

Note: code above is intended to help clarify things only,
so that a bunch of examples can be tested.
If you need production quality code, maybe someone else
can provide it.

Is it what you are looking for?

By the way:
it looks to me like a standard problem when
comparing DNA sequences, so I suppose
that there are ready-to-use fast libraries
providing this kind of function.

Claudio


Jul 19 '05 #4

borges2003xx

Thanks everyone, this is what I need.
As Claudio argues, it's a standard problem of DNA sequence
comparison.
The next step of my job is to put limits on the length of the interposed
sequences (if someone can help me with this I'd appreciate it a lot).
Thanks everyone.
giorgio

Jul 19 '05 #5

Andrew Dalke

bo**********@yahoo.it wrote:

> The next step of my job is to put limits on the length of the interposed
> sequences (if someone can help me with this I'd appreciate it a lot).
> Thanks everyone.

Kent Johnson had the right approach, with regular expressions.
For a bit of optimization, use non-greedy groups. That will
give you shorter matches.
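
A small illustration of the greedy versus non-greedy difference, using a made-up target string (not one from the thread) in which the two forms give different spans:

```python
import re

target = "01010101"

# Greedy: each .* grabs as much as it can, so the match stretches
# to the last possible '1'.
greedy = re.search("0.*1.*0.*1", target)

# Non-greedy: each .*? grabs as little as it can, so the match stops
# at the first place where the whole pattern is satisfied.
lazy = re.search("0.*?1.*?0.*?1", target)

print(greedy.span())  # (0, 8)
print(lazy.span())    # (0, 4)
```

Both forms agree on whether a match exists; they differ only in which of the possible matches is reported.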

Suppose you want no more than 10 bases between terms. You could
use this pattern.

a.{,10}?t.{,10}?c.{,10}?g.{,10}?

>>> import re
>>> pat = re.compile('a.{,10}?t.{,10}?c.{,10}?g.{,10}?')
>>> m = pat.search("tcgaacccgtaaaaagctaatcg")
>>> m.group(0), m.start(0), m.end(0)
('aacccgtaaaaagctaatcg', 3, 23)
>>> pat.search("tcgaacccgtaaaaagctaatttttttg")
<_sre.SRE_Match object at 0x9b950>
>>> pat.search("tcgaacccgtaaaaagctaattttttttg")
>>>

If you want to know the location of each of the bases, and
you'll have less than 100 of them (I think that's the limit)
then you can use groups in the regular expression language:

>>> def make_pattern(s, limit = None):
...     if limit is None:
...         t = ".*?"
...     else:
...         t = ".{,%d}?" % (limit,)
...     text = []
...     for c in s:
...         text.append("(%s)%s" % (c, t))
...     return "".join(text)
...
>>> make_pattern("atcg")
'(a).*?(t).*?(c).*?(g).*?'
>>> make_pattern("atcg", 10)
'(a).{,10}?(t).{,10}?(c).{,10}?(g).{,10}?'
>>> pat = re.compile(make_pattern("atcg", 10))
>>> m = pat.search("tcgaacccgtaaaaagctaatttttttg")
>>> m
<_sre.SRE_Match object at 0x8ea70>
>>> m.groups()
('a', 't', 'c', 'g')
>>> for i in range(1, len("atcg")+1):
...     print m.group(i), m.start(i), m.end(i)
...
a 3 4
t 9 10
c 16 17
g 27 28

Andrew
da***@dalkescientific.com

Jul 19 '05 #6

Claudio Grondi

<bo**********@yahoo.it> wrote in message
news:11*********************@z14g2000cwz.googlegroups.com...

> Thanks everyone, this is what I need.
> As Claudio argues, it's a standard problem of DNA sequence
> comparison.
> The next step of my job is to put limits on the length of the interposed
> sequences (if someone can help me with this I'd appreciate it a lot).
> Thanks everyone.
> giorgio

Note: code below is intended to help clarify things only,
so that a bunch of examples can be tested.
If you need bug-free production quality code, maybe
someone else can provide it.

I have introduced two additional parameters to the function.
If intMaxLenOfGap == 0, the gap size doesn't matter.
lstStartEndOfRangeOfBwithOccurenceOfA returns, in its
elements 0 and 1, the begin and end of the range in which
strA was found in strB.

Hope this does what you mean by
"put limits on the length of the interposed sequences",
does it?

Claudio
P.S. Here is the code:

def blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB,
        intMaxLenOfGap = 0, lstStartEndOfRangeOfBwithOccurenceOfA = None):

    # use a fresh list when the caller does not pass one in
    if lstStartEndOfRangeOfBwithOccurenceOfA is None:
        lstStartEndOfRangeOfBwithOccurenceOfA = []
    #:if
    intNoOfCharsFound = 0
    intPtrToFirstCharFound = 0
    intPtrToBeginOfSubsectionOfB = 0
    intLenA = len(strA)
    intLenB = len(strB)
    blnStrAinB = False
    indxToA = 0
    intGapLen = 0  # set up front because it is read after the loop

    while(indxToA < intLenA):
        # print chrA
        if(indxToA == 0):
            blnFoundChrA = False
            for indxToB in range(intPtrToBeginOfSubsectionOfB, intLenB):
                if(strA[indxToA] == strB[indxToB]):
                    intNoOfCharsFound += 1
                    # print " ",chrA, strB[indxToB], indxToB
                    intPtrToFirstCharFound = indxToB
                    intPtrToBeginOfSubsectionOfB = indxToB + 1
                    blnFoundChrA = True
                    break
                #:if
            #:for
            if(intNoOfCharsFound == intLenA):
                blnStrAinB = True
                print("sequence '%s' found in '%s'"%(strA, strB))
                break
            #:if
            if(blnFoundChrA == False):
                break
            #:if
            indxToA += 1
        else:
            intGapLen = 0
            blnFoundChrA = False
            for indxToB in range(intPtrToBeginOfSubsectionOfB, intLenB):
                if(strA[indxToA] == strB[indxToB]):
                    intNoOfCharsFound += 1
                    # print " ",chrA, strB[indxToB], indxToB
                    intPtrToBeginOfSubsectionOfB = indxToB + 1
                    blnFoundChrA = True
                    break
                #:if
                intGapLen += 1
                if(intMaxLenOfGap > 0 and intGapLen > intMaxLenOfGap):
                    # gap too long; blnFoundChrA stays False, so the
                    # check after this loop ends the search
                    indxToA = 0
                    blnFoundChrA = False
                    intPtrToBeginOfSubsectionOfB = intPtrToFirstCharFound + 1
                    intNoOfCharsFound = 0
                    break
                #:if
            #:for
            if(intNoOfCharsFound == intLenA):
                blnStrAinB = True
                print("sequence '%s' found in '%s' at range(%i, %i)"%(strA, strB,
                    intPtrToFirstCharFound, indxToB+1))
                lstStartEndOfRangeOfBwithOccurenceOfA.append(intPtrToFirstCharFound)
                lstStartEndOfRangeOfBwithOccurenceOfA.append(indxToB+1)
                break
            #:if
            if(blnFoundChrA == False):
                break
            #:if
            indxToA += 1
        #:if/else
    #:while
    if blnStrAinB == False:
        if(intMaxLenOfGap > 0 and intGapLen > intMaxLenOfGap):
            print("sequence '%s' not in '%s' (maybe allowed gap of %i chars was too small?)"%(strA, strB, intMaxLenOfGap))
        else:
            print("sequence '%s' not in '%s'"%(strA, strB))
    #:if
    # also hand the result back to the caller
    return blnStrAinB
#:def

print("")

lstStartEndOfRangeOfB = []
strA = "0101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB)

lstStartEndOfRangeOfB = []
strA = "0101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB, 2)

strA = "010101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB, 6,
    lstStartEndOfRangeOfB)

strA = "010101"
strB = "00001111010000001"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB, 4,
    lstStartEndOfRangeOfB)

strA = "010101"
strB = "00001111010000001"
blnFindCharSequenceAevenIfSpreadOverEntireStringB( strA, strB, 5,
    lstStartEndOfRangeOfB)
print("")
print("usage of lstStartEndOfRangeOfB parameter passed to function for use as return value:")
print("sequence '%s' was found in '%s' at range(%i, %i)"%(strA, strB,
    lstStartEndOfRangeOfB[0], lstStartEndOfRangeOfB[1]))

Jul 19 '05 #7

Andrew Dalke

Claudio Grondi wrote:

Note: code below is intended to help to clarify things only,
so that a bunch of examples can be tested.
If you need bugfree production quality code, maybe
someone else can provide it.

Still not tested enough to ensure that it's bug free, but more
concise. Here's one that implements the algorithm directly and
another that uses a regexp. The latter should be the preferred
approach. My intent was that the algorithm implements the given
pattern, so they should give identical results.

# Doing the work ourselves
def find_spread_substring(query, target, limit = None):
    stack = []
    ti = qi = 0
    Nq = len(query)
    Nt = len(target)
    delta = 0

    while ti < Nt:
        # We have a match
        if query[qi] == target[ti]:
            stack.append( (qi, ti, delta) )
            qi = qi + 1
            if qi == Nq:
                return [ti for (qi, ti, delta) in stack]
            ti = ti + 1
            delta = 0
        else:
            # No match
            while 1:
                # If we have a partial match, check if we've
                # gone over the limit.
                if stack:
                    delta = delta + 1
                    if limit is not None and delta > limit:
                        # backtrack, treating it as an invalid match
                        # (so retry this 'else:' block)
                        qi, ti, delta = stack.pop()
                        continue
                # No backtracking needed
                break
            # Advance to check the next character in the target
            ti = ti + 1

    # Failure
    return None

# Using regular expressions
import re
def find_spread_substring2(query, target, limit = None):
    if limit is None:
        template = "(%s).*?"
    else:
        template = "(%%s).{,%d}?" % (limit,)
    terms = [template % c for c in query]
    pattern = "".join(terms)

    pat = re.compile(pattern)
    m = pat.search(target)
    if not m:
        return None
    return [m.start(i) for i in range(1, len(query)+1)]

def test():
    for (q, t, limit, is_valid) in (
        ("1010", "10001001", None, True),
        ("1010", "100011", None, False),
        ("1010", "100010", 3, True),
        ("1010", "100010", 1, True),
        ("1010", "1000010", 1, False),
        ("1010", "01000010", 2, True),
        ("1010", "01000010", 1, False),
        ("1010", "0100000", None, False),

        ):
        result = find_spread_substring(q, t, limit)
        result2 = find_spread_substring2(q, t, limit)
        if result != result2:
            raise AssertionError( (result, result2) )

        if result is not None:
            if limit is not None:
                # check that it's a proper subset
                for (x, y) in zip(result[:-1], result[1:]):
                    # +1 because 'limit' is the maximum gap size
                    if (y-x) > limit+1:
                        raise AssertionError((q, t, limit, result, x, y))
            s = "".join([t[i] for i in result])
            if s != q:
                raise AssertionError((q, t, limit, result, s))

        if result is None and not is_valid:
            pass
        elif result is not None and is_valid:
            pass
        else:
            raise AssertionError( (q, t, limit, is_valid, result) )

if __name__ == "__main__":
    test()
    print("All tests passed.")
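
As an independent cross-check that needs neither an explicit stack nor a regexp, one can track the set of target positions at which each prefix of the query can end. This is only a sketch under my own assumptions, not part of the code above: the name has_spread_match is hypothetical, `limit` means the maximum number of characters between two consecutive matched characters (None for unlimited), and it answers only yes or no.

```python
def has_spread_match(query, target, limit=None):
    """Return True if query occurs in target as a subsequence with at most
    `limit` characters between consecutive matched characters."""
    if not query:
        return True
    # positions in target where the first query character can be matched
    ends = {i for i, ch in enumerate(target) if ch == query[0]}
    for qch in query[1:]:
        nxt = set()
        for p in ends:
            # a gap of at most `limit` means the next match is at p+1 .. p+limit+1
            stop = len(target) if limit is None else min(len(target), p + limit + 2)
            for j in range(p + 1, stop):
                if target[j] == qch:
                    nxt.add(j)
        ends = nxt
        if not ends:
            return False
    return bool(ends)


print(has_spread_match("1010", "100010", 1))    # True
print(has_spread_match("1010", "1000010", 1))   # False
```

Because every reachable position is kept, no backtracking is needed; the price is that the positions of the individual characters are not reported.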
Andrew
da***@dalkescientific.com

Jul 19 '05 #8
