473,563 Members | 2,747 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

searching substrings with interpositions

hi everyone.
a problem:
two binary strings, a="0101" b="000011110100 ";
i search a function f(a,b) that gives 1 if a is "contained" in b with
any sub strings interposed.
in this example a in contained cause 000<01>111<01>0 0 but also
0<0>001111<101> 00"
but also <0>0001111<101> 00 but also 000<0>1111<01>0 <0> etc....
any idea?
Thanx in advance.
Giorgi Borghi

Jul 19 '05 #1
7 1356
Hello Giorgi,

I suggest to google for "python boyer moore" to get a fast
implementation of a string search algorithm in Python (the Boyer-Moore
algorithm).

One promising hit seems to be:
http://www.egenix.com/files/python/mxTextTools.html

HTH,
Bernd

Jul 19 '05 #2
bo**********@ya hoo.it wrote:
hi everyone.
a problem:
two binary strings, a="0101" b="000011110100 ";
i search a function f(a,b) that gives 1 if a is "contained" in b with
any sub strings interposed.
in this example a in contained cause 000<01>111<01>0 0 but also
0<0>001111<101> 00"
but also <0>0001111<101> 00 but also 000<0>1111<01>0 <0> etc....
any idea?
Thanx in advance.
Giorgi Borghi


You can do this easily with regular expressions though I guess may be poor with long strings:
import re
re.search('0.*1 .*0.*1', '000011110100') <_sre.SRE_Mat ch object at 0x008D9BF0> _.span()

(0, 10)
Put the chars of the search string in groups if you need to know where they were found.

Kent
Jul 19 '05 #3
> i search a function f(a,b) that gives 1 if a is "contained" in b with
any sub strings interposed.
If I understand it right, it should be something
like this:

def blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB):
intNoOfCharsFou nd = 0
intPtrToBeginOf SubsectionOfB = 0
intLenA = len(strA)
intLenB = len(strB)
blnStrAinB = False
for chrA in strA:
blnFoundChrA = False
# print chrA
for indxToB in range(intPtrToB eginOfSubsectio nOfB, intLenB):
if(chrA == strB[indxToB]):
intNoOfCharsFou nd += 1
# print " ",chrA, strB[indxToB], indxToB
intPtrToBeginOf SubsectionOfB = indxToB + 1
blnFoundChrA = True
break
#:if
#:for
if(intNoOfChars Found == intLenA):
blnStrAinB = True
print "sequence '%s' found in '%s'"%(strA, strB)
break
#:if
if(blnFoundChrA == False):
break
#:if
#:for
if blnStrAinB == False:
print "sequence '%s' not in '%s'"%(strA, strB)
#:if
#:def

strA = "0101"
strB = "0000111101 00"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB)

strA = "010101"
strB = "0000111101 00"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB)

Note: code above is intended to help clarify things only,
so that a bunch of examples can be tested.
If you need production quality code, maybe someone else
can provide it.

Is it what you are looking for?

By the way:
it looks to me like a standard problem while
comparing DNA sequences, so I suppose
that there are ready to use fast libraries
providing such kind of function.

Claudio

<bo**********@y ahoo.it> schrieb im Newsbeitrag
news:11******** **************@ g43g2000cwa.goo glegroups.com.. . hi everyone.
a problem:
two binary strings, a="0101" b="000011110100 ";
i search a function f(a,b) that gives 1 if a is "contained" in b with
any sub strings interposed.
in this example a in contained cause 000<01>111<01>0 0 but also
0<0>001111<101> 00"
but also <0>0001111<101> 00 but also 000<0>1111<01>0 <0> etc....
any idea?
Thanx in advance.
Giorgi Borghi


Jul 19 '05 #4
thanx everyone, is what i need.
As Claudio argues, it's a standard problem of dna sequences
comparation.
the next step of my job is to make limits of lenght of interposed
sequences (if someone can help me in this way i'll apreciate a lot)
thanx everyone.
giorgio

Jul 19 '05 #5
bo**********@ya hoo.it wrote:
the next step of my job is to make limits of lenght of interposed
sequences (if someone can help me in this way i'll apreciate a lot)
thanx everyone.


Kent Johnson had the right approach, with regular expressions.
For a bit of optimization, use non-greedy groups. That will
give you shorter matches.

Suppose you want no more than 10 bases between terms. You could
use this pattern.

a.{,10}?t.{,10} ?c.{,10}?g.{,10 }?

import re
pat = re.compile('a.{ ,10}t.{,10}c.{, 10}g.{,10}?')
m = pat.search("tcg aacccgtaaaaagct aatcg")
m.group(0), m.start(0), m.end(0) ('aacccgtaaaaag ctaatcg', 3, 23) pat.search("tcg aacccgtaaaaagct aatttttttg") <_sre.SRE_Mat ch object at 0x9b950> pat.search("tcg aacccgtaaaaagct aattttttttg")

If you want to know the location of each of the bases, and
you'll have less than 100 of them (I think that's the limit)
then you can use groups in the regular expression language
def make_pattern(s, limit = None): .... if limit is None:
.... t = ".*?"
.... else:
.... t = ".{,%d}?" % (limit,)
.... text = []
.... for c in s:
.... text.append("(% s)%s" % (c, t))
.... return "".join(tex t)
.... make_pattern("a tcg") '(a).*?(t).*?(c ).*?(g).*?' make_pattern("a tcg", 10) '(a).{,10}?(t). {,10}?(c).{,10} ?(g).{,10}?' pat = re.compile(make _pattern("atcg" , 10))
m = pat.search("tcg aacccgtaaaaagct aatttttttg")
m <_sre.SRE_Mat ch object at 0x8ea70> m.groups() ('a', 't', 'c', 'g') for i in range(1, len("atcg")+1): .... print m.group(i), m.start(i), m.end(i)
....
a 3 4
t 9 10
c 16 17
g 27 28


Andrew
da***@dalkescie ntific.com

Jul 19 '05 #6

<bo**********@y ahoo.it> schrieb im Newsbeitrag
news:11******** *************@z 14g2000cwz.goog legroups.com...
thanx everyone, is what i need.
As Claudio argues, it's a standard problem of dna sequences
comparation.
the next step of my job is to make limits of lenght of interposed
sequences (if someone can help me in this way i'll apreciate a lot)
thanx everyone.
giorgio

Note: code below is intended to help to clarify things only,
so that a bunch of examples can be tested.
If you need bugfree production quality code, maybe
someone else can provide it.

I have introduced two additional parameter to the function.
If intMaxLenOfGap == 0 the gap size doesn't matter.
lstStartEndOfRa ngeOfBwithOccur enceOfA returns in its
0,1 elements the begin and end of the range strA was
found in strB.

Hope this does what you mean with
"make limits of lenght of interposed sequences",
does it?

Claudio
P.S. Here the code:

def blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB,
intMaxLenOfGap = 0, lstStartEndOfRa ngeOfBwithOccur enceOfA = []):

lstStartEndOfRa ngeOfBwithOccur enceOfA = []
intNoOfCharsFou nd = 0
intPtrToFirstCh arFound = 0
intPtrToBeginOf SubsectionOfB = 0
intLenA = len(strA)
intLenB = len(strB)
blnStrAinB = False
indxToA = 0

while(indxToA < intLenA):
# print chrA
if(indxToA == 0):
blnFoundChrA = False
for indxToB in range(intPtrToB eginOfSubsectio nOfB, intLenB):
if(strA[indxToA] == strB[indxToB]):
intNoOfCharsFou nd += 1
# print " ",chrA, strB[indxToB], indxToB
intPtrToFirstCh arFound = indxToB
intPtrToBeginOf SubsectionOfB = indxToB + 1
blnFoundChrA = True
break
#:if
#:for
if(intNoOfChars Found == intLenA):
blnStrAinB = True
print "sequence '%s' found in '%s'"%(strA, strB)
break
#:if
if(blnFoundChrA == False):
break
#:if
indxToA += 1
else:
intGapLen = 0
blnFoundChrA = False
for indxToB in range(intPtrToB eginOfSubsectio nOfB, intLenB):
if(strA[indxToA] == strB[indxToB]):
intNoOfCharsFou nd += 1
# print " ",chrA, strB[indxToB], indxToB
intPtrToBeginOf SubsectionOfB = indxToB + 1
blnFoundChrA = True
break
#:if
intGapLen += 1
if(intMaxLenOfG ap > 0 and intGapLen > intMaxLenOfGap) :
indxToA = 0
blnFoundChrA = False
intPtrToBeginOf SubsectionOfB = intPtrToFirstCh arFound + 1
intNoOfCharsFou nd = 0
break
#:if
#:for
if(intNoOfChars Found == intLenA):
blnStrAinB = True
print "sequence '%s' found in '%s' at range(%i, %i)"%(strA, strB,
intPtrToFirstCh arFound, indxToB+1)
lstStartEndOfRa ngeOfB.append(i ntPtrToFirstCha rFound)
lstStartEndOfRa ngeOfB.append(i ndxToB+1)
break
#:if
if(blnFoundChrA == False):
break
#:if
indxToA += 1
#:if/else
#:while
if blnStrAinB == False:
if(intMaxLenOfG ap > 0 and intGapLen > intMaxLenOfGap) :
print "sequence '%s' not in '%s' (maybe allowed gap of %i chars was
too small?)"%(strA, strB, intMaxLenOfGap)
else:
print "sequence '%s' not in '%s'"%(strA, strB)
#:if
#:def

print

lstStartEndOfRa ngeOfB = []
strA = "0101"
strB = "0000111101 00"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB)

lstStartEndOfRa ngeOfB = []
strA = "0101"
strB = "0000111101 00"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB, 2)

strA = "010101"
strB = "0000111101 00"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB, 6,
lstStartEndOfRa ngeOfB)

strA = "010101"
strB = "00001111010000 001"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB, 4,
lstStartEndOfRa ngeOfB)

strA = "010101"
strB = "00001111010000 001"
blnFindCharSequ enceAevenIfSpre adOverEntireStr ingB(strA, strB, 5,
lstStartEndOfRa ngeOfB)
print
print "usage of lstStartEndOfRa ngeOfB parameter passed to function for use
as return value:"
print "sequence '%s' was found in '%s' at range(%i, %i)"%(strA, strB,
lstStartEndOfRa ngeOfB[0], lstStartEndOfRa ngeOfB[1])
Jul 19 '05 #7
Claudio Grondi wrote:
Note: code below is intended to help to clarify things only,
so that a bunch of examples can be tested.
If you need bugfree production quality code, maybe
someone else can provide it.


Still not tested enough to ensure that it's bug free, but more
concise. Here's one the implements the algorithm directly and
another that uses a regexp. The latter should be the preferred
approach. My intent was that the algorithm implements the given
pattern so they should given identical results.

# Doing the work ourselves
def find_spread_sub string(query, target, limit = None):
stack = []
ti = qi = 0
Nq = len(query)
Nt = len(target)
delta = 0

while ti < Nt:
# We have a match
if query[qi] == target[ti]:
stack.append( (qi, ti, delta) )
qi = qi + 1
if qi == Nq:
return [ti for (qi, ti, delta) in stack]
ti = ti + 1
delta = 0
else:
# No match
while 1:
# If we have a partial match, check if we've
# gone over the limit.
if stack:
delta = delta + 1
if limit is not None and delta > limit:
# backtrack, treating it as an invalid match
# (so retry this 'else:' block)
qi, ti, delta = stack.pop()
continue
# No backtracking needed
break
# Advance to check the next character in the target
ti = ti + 1

# Failure
return None

# Using regular expressions
import re
def find_spread_sub string2(query, target, limit = None):
if limit is None:
template = "(%s).*?"
else:
template = "(%%s).{,%d }?" % (limit,)
terms = [template % c for c in query]
pattern = "".join(ter ms)

pat = re.compile(patt ern)
m = pat.search(targ et)
if not m:
return None
return [m.start(i) for i in range(1, len(query)+1)]
def test():
for (q, t, limit, is_valid) in (
("1010", "10001001", None, True),
("1010", "100011", None, False),
("1010", "100010", 3, True),
("1010", "100010", 1, True),
("1010", "1000010", 1, False),
("1010", "01000010", 2, True),
("1010", "01000010", 1, False),
("1010", "0100000", None, False),

):
result = find_spread_sub string(q, t, limit)
result2 = find_spread_sub string2(q, t, limit)
if result != result2:
raise AssertionError( (result, result2) )

if result is not None:
if limit is not None:
# check that it's a proper subset
for (x, y) in zip(result[:-1], result[1:]):
# +1 because 'limit' is the maximum gap size
if (y-x) > limit+1:
raise AssertionError( (q, t, limit, result, x, y))
s = "".join([t[i] for i in result])
if s != q:
raise AssertionError( (q, t, limit, result, s))

if result is None and not is_valid:
pass
elif result is not None and is_valid:
pass
else:
raise AssertionError( (q, t, limit, is_valid, result) )

if __name__ == "__main__":
test()
print "All tests passed."
Andrew
da***@dalkescie ntific.com

Jul 19 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
11430
by: Leandro Pardini | last post by:
Hello there, I'm trying to process a binary file and I really don't know how. The story: gPhoto2 downloads the images from my camera just fine, but small areas of 10x3 pixels are screwed up. I found exactly where the problem happens and wrote a simple interpolation filter that works like a charm... under DOS, using the old QuickBasic. I've...
4
4654
by: spam | last post by:
Is there a well-known algorithm for replacing many substrings in a string? For example, I'd like to take the string "abc def ghi jkl mno pqr" and replace, say, every instance of "abc", "ghi", and "mno" with another value. Of course the brute-force approach is straight forward. Just iterate over the full string N times (once for "abc",...
3
1825
by: Will McGugan | last post by:
Hi, Is there a simple way of replacing a large number of substrings in a string? I was hoping that str.replace could take a dictionary and use it to replace the occurrences of the keys with the dict values, but that doesnt seem to be the case. To clarify, something along these lines.. >>> dict_replace( "a b c", dict(a="x", b="y") )
9
1795
by: C3 | last post by:
I have to process some data in C that is given to me as a char * array. I have a fairly large number of substrings (well, they're not actually printable, but let's treat them as strings) that I have to search for in the data. I need to keep a count of how often each of these substrings occurs in my original data and then print it out at the...
4
4168
by: Robert Dodier | last post by:
Hello all, I'm trying to find substrings that look like 'FOO blah blah blah' in a string. For example give 'blah FOO blah1a blah1b FOO blah2 FOO blah3a blah3b blah3b' I want to get three substrings, 'FOO blah1a blah1b', 'FOO blah2', and 'FOO blah3a blah3b blah3b'. I've tried numerous variations on '.*(FOO((?!FOO).)*)+.*' and everything...
3
2172
by: Girish Sahani | last post by:
Given a length k string,i want to search for 2 substrings (overlap possible) in a list consisting of length k-1 strings. These 2 substrings when 'united' give the original string. e.g given 'abc' i want to search in the list of 2-length strings to extract either 1) 'ab and 'ac' OR ('a' common) 2) 'ab' and 'bc' OR ('b' common) 3) 'ac' and...
8
5030
by: girish | last post by:
Hi, I want to generate all non-empty substrings of a string of length >=2. Also, each substring is to be paired with 'string - substring' part and vice versa. Thus, gives me , , , , , ] etc. Similarly, 'abcd' should give me , , , , , ,, , , , , , ,] I've tried the following but i cant prevent duplicates and i'm missing
1
1750
by: Jason S | last post by:
Is there a way to get the position of multiple substrings that match a regexp without using closures? match() returns the substrings themselves, not the positions, and search() seems to only return the first position. Here's what seems to work (under Shanti Rao's jsdb.exe shell) but I get a bit nervous about using closures... function...
4
2986
by: Costa | last post by:
I am looking for a c/c++ text search engine library that supports: - free text searching - not only beginning of words but substrings as well - wildcard searching - I want strings such as *test* to be supported - regular expressions I know about clucene, but, unless I am mistaken, lucene doesn't support, for instance, having the * at...
2
2812
by: Pilcrow | last post by:
This problem was raised in comp.lang.perl.misc, and the poster was concerned, among other things, by the speed of execution. Since C is faster than perl, I wonder how a C coder would solve it? Given a *very* long string, A, and a shorter string, B, extract all substrings of A that differ from B in at most N positions. For example: N = 2...
0
7664
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7583
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
0
7885
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
7948
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
0
6250
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
1
5484
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
5213
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert...
0
3626
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2082
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.