pattern finding algorithm

Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons?

Thanks

Jul 31 '07 #1

Subscribe Post Reply

2499

bartonc

6,596

Expert 4TB

Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons?

Thanks

Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment:

Expand|Select|Wrap|Line Numbers

 
>>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n'

>>> s = s * 100

>>> import re

>>> patternA = re.compile('AC....T..', re.MULTILINE)

>>> patternA.match(s)
 
>>> res = patternA.findall(s)

>>> len(res)

200

>>>

[EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT]

Aug 1 '07 #2

bartonc

6,596

Expert 4TB

Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment:

Expand|Select|Wrap|Line Numbers

>>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n'

>>> s = s * 100

>>> import re

>>> patternA = re.compile('AC....T..', re.MULTILINE)

>>> patternA.match(s)

>>> res = patternA.findall(s)

>>> len(res)

200

>>>

[EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT]

I think that I've got it:

Expand|Select|Wrap|Line Numbers

 
>>> patterns = ['AC....T..', '.C.G....C', '.C.......']

>>> reObjs = [re.compile(pat) for pat in patterns]

>>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC'

>>> for i, reObj in enumerate(reObjs):

...     print "pattern: %s matches at the following positions:" %patterns[i]

...     result = reObj.finditer(s)

...     for match in result:

...         print (match.start() / 10) + 1

...         

pattern: AC....T.. matches at the following positions:

1

2

pattern: .C.G....C matches at the following positions:

2

3

pattern: .C....... matches at the following positions:

1

2

3

>>>

And it points out an error in your example.

Aug 1 '07 #3

bartonc

6,596

Expert 4TB

I think that I've got it:

Expand|Select|Wrap|Line Numbers

>>> patterns = ['AC....T..', '.C.G....C', '.C.......']

>>> reObjs = [re.compile(pat) for pat in patterns]

>>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC'

>>> for i, reObj in enumerate(reObjs):

...     print "pattern: %s matches at the following positions:" %patterns[i]

...     result = reObj.finditer(s)

...     for match in result:

...         print (match.start() / 10) + 1

...

pattern: AC....T.. matches at the following positions:

1

2

pattern: .C.G....C matches at the following positions:

2

3

pattern: .C....... matches at the following positions:

1

2

3

>>>

And it points out an error in your example.

Putting it all together, you get:

Expand|Select|Wrap|Line Numbers

 
>>> for i, reObj in enumerate(reObjs):

...     positions = [((match.start() / 10) + 1) for match in reObj.finditer(s)]

...     nTimes = len(positions)

...     print "pattern: %s matches %d times, at the following positions: %s" %(patterns[i], nTimes, str(positions))
 
pattern: AC....T.. matches 2 times, at the following positions: [1, 2]

pattern: .C.G....C matches 2 times, at the following positions: [2, 3]

pattern: .C....... matches 3 times, at the following positions: [1, 2, 3]

Aug 1 '07 #4

ghostdog74

511

Expert 256MB

Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons?

Thanks

i don't see why you can't use string slicing and indexing.
eg

Expand|Select|Wrap|Line Numbers

 
...

if line[0:2] == "AC" and line[6] == "T":

  #do something and so on

....

Aug 1 '07 #5

kdt

hi

thanks for all your replies, but I dont think I made my requirements clear.

I am looking to find all existing patterns in a bunch of sequences, I will not know the patterns beforehand so I want to find all patterns that exist in more than two sequences

thanks

Aug 1 '07 #6

bvdet

2,851

Expert Mod 2GB

hi

thanks for all your replies, but I dont think I made my requirements clear.

I am looking to find all existing patterns in a bunch of sequences, I will not know the patterns beforehand so I want to find all patterns that exist in more than two sequences

thanks

This took about a minute to run on my machine on 2000 strings, and produced a data file with 83,000,000 bytes. 'sList' is the data sequence.

Expand|Select|Wrap|Line Numbers

 from sets import Set as set

import re

import time
 
patt = re.compile('[ACGT]')
 
outList = ['Opening output file: %s' % time.ctime(), ]
 
dd = {}

indx = 0
 
while len(sList) > 0:

    s1 = sList[0]

    for j, item in enumerate(sList[1:]):

        res = ''

        for i, s in enumerate(s1):

            if s == item[i]:

                res += s

            else:

                res += '.'

        if patt.search(res):

            if dd.has_key(res):

                dd[res].append([indx, j+1+indx])

            else:

                dd[res] = [[indx, j+1+indx], ]

    indx += 1
 
    sList.pop(0)
 
keys = dd.keys()

keys.sort()
 
for key in keys:

    quan = len(set([item[j] for j in range(2) for item in dd[key]]))

    outList.append('(%s) %d occurrences:' % (key, quan))

    for v in dd[key]:

        outList.append('    Pattern between sequence %d and %d' % (v[0], v[1]))
 
outList.append('Writing data to output file: %s' % time.ctime())

fn = r'H:\TEMP\temsys\string_patterns.txt'

f = open(fn, 'w')

f.write('\n'.join(outList))

f.close()

This is probably similar to the algorithm you were going to use. Here's a sample from the output:

Expand|Select|Wrap|Line Numbers

 Opening output file: Wed Aug 01 08:50:53 2007

(........A) 676 occurrences:

    Pattern between sequence 0 and 72

    Pattern between sequence 0 and 89

.......................

(....CA.G.) 31 occurrences:

    Pattern between sequence 30 and 60

    Pattern between sequence 30 and 76

    Pattern between sequence 30 and 242

    Pattern between sequence 30 and 388

.......................

(TTGCCA.A.) 2 occurrences:

    Pattern between sequence 612 and 1016

(TTGCCAA..) 2 occurrences:

    Pattern between sequence 33 and 1016

Writing data to output file: Wed Aug 01 08:51:53 2007

Aug 1 '07 #7

kdt

Very nice indeed bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you soo much.

Would anyone happen to know some tips to speed this up. I think you can use psycho to compile parts in C, but is there a way of converting to binary and then performing manipulations - would this be faster?

Thanks

Aug 1 '07 #8

bartonc

6,596

Expert 4TB

Very nice indeed bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you soo much.

Would anyone happen to know some tips to speed this up. I think you can use psycho to compile parts in C, but is there a way of converting to binary and then performing manipulations - would this be faster?

Thanks

Tests that I've run on other code suggest that using tuples instead of lists gives a significant performance boost:

Expand|Select|Wrap|Line Numbers

 # line 21 here

                dd[res].append((indx, j+1+indx))

            else:

                dd[res] = [(indx, j+1+indx), ]

If that list is appended to elsewhere, the tuples can be added together more quickly than the append, also.

Aug 1 '07 #9

kdt

Hi,

A quick question on the above algorithm, it seems to performing more comparisons than necessary, though I can't seem to find where.

I decided to experiment to get the number of comparisons required to compare each line with every other line - and not against itself (line 1 to 1) or lines already compared ( as comparing line 1 to 2 is the same as comparing line 2 to 1).

Expand|Select|Wrap|Line Numbers

 
line_num = 0

counter = 0
 
for line in myList:

    elmt_num = 0
 
    for elmt in myList:

        i = 0

        if line_num < elmt_num # As I don't want to compare twice and against the same seq

            while i<9:

                counter += 1

                i += 1
 
        elmt_num += 1
 
    line_num += 1
 
print "%d comparisons were made between %d lines" % (counter, line_num)

I get the following output for my dataset

7587459 comparisons were made between 1299 lines

Which is exactly what I would expect from comparing 9 times (n**2 -n)/2 where n is the number of sequences.

The output I get from bvdet's algorithm is however 8430510 comparisons (10 time (n**2-n)/2) I fail to see where the extra 1 comparison each time is coming from.

If someone can let me know, I would be thankful.

Cheers

Aug 2 '07 #10

bvdet

2,851

Expert Mod 2GB

Hi,

A quick question on the above algorithm, it seems to performing more comparisons than necessary, though I can't seem to find where.

I decided to experiment to get the number of comparisons required to compare each line with every other line - and not against itself (line 1 to 1) or lines already compared ( as comparing line 1 to 2 is the same as comparing line 2 to 1).

Expand|Select|Wrap|Line Numbers

line_num = 0

counter = 0

for line in myList:

    elmt_num = 0

    for elmt in myList:

        i = 0

        if line_num < elmt_num # As I don't want to compare twice and against the same seq

            while i<9:

                counter += 1

                i += 1

        elmt_num += 1

    line_num += 1

print "%d comparisons were made between %d lines" % (counter, line_num)

I get the following output for my dataset

7587459 comparisons were made between 1299 lines

Which is exactly what I would expect from comparing 9 times (n**2 -n)/2 where n is the number of sequences.

The output I get from bvdet's algorithm is however 8430510 comparisons (10 time (n**2-n)/2) I fail to see where the extra 1 comparison each time is coming from.

If someone can let me know, I would be thankful.

Cheers

kdt - I ran the following function and counted the comparisons:

Expand|Select|Wrap|Line Numbers

 def patt_match(sList):

    global count

    count = 0

    sList = sList[:]

    patt = re.compile('[ACGT]')

    dd = {}

    indx = 0

    sList = strList[:]

    while len(sList) > 0:

        s1 = sList[0]

        for j, item in enumerate(sList[1:]):

            res = ''

            for i, s in enumerate(s1):

                count += 1

                if s == item[i]:

                    res += s

                else:

                    res += '.'

            if patt.search(res):

                if dd.has_key(res):

                    dd[res].append([indx, j+1+indx])

                else:

                    dd[res] = [[indx, j+1+indx], ]

        indx += 1      

        sList.pop(0)

    return dd

Output:

>>> dd = patt_match(strList)
>>> count
7587459
>>>

Aug 2 '07 #11

kdt

hi bvdet,

thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line

Expand|Select|Wrap|Line Numbers

while len(sList) > 0:

Thanks

Aug 4 '07 #12

bartonc

6,596

Expert 4TB

hi bvdet,

thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line

Expand|Select|Wrap|Line Numbers

while len(sList) > 0:

Thanks

Expand|Select|Wrap|Line Numbers

 # line 2 & 3 would read

    global count linecount

    count = linecount = 0

# insert at the level of (current line 10) - this looks like the lines to me -

        linecount += 1

Aug 4 '07 #13

bvdet

2,851

Expert Mod 2GB

hi bvdet,

thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line

Expand|Select|Wrap|Line Numbers

while len(sList) > 0:

Thanks

Here are the results:

>>> len(strList)
1299
>>> dd = patt_match(strList)
>>> len(dd)
53939
>>> count
7587459
>>> linecount
843051
>>> itemcount
1299
>>>

Here is the function:

Expand|Select|Wrap|Line Numbers

 def patt_match(sList):

    global count, linecount, itemcount

    count = linecount = itemcount = 0

    sList = sList[:]

    patt = re.compile('[ACGT]')

    dd = {}

    indx = 0

    while len(sList) > 0:

        s1 = sList[0]

        for j, item in enumerate(sList[1:]):

            res = ''

            for i, s in enumerate(s1):

                count += 1

                if s == item[i]:

                    res += s

                else:

                    res += '.'

            if patt.search(res):

                if dd.has_key(res):

                    dd[res].append([indx, j+1+indx])

                else:

                    dd[res] = [[indx, j+1+indx], ]

            linecount += 1

        indx += 1

        itemcount += 1

        sList.pop(0)

    return dd

Aug 4 '07 #14

by: kittykat | last post by:

Hi, I was wondering if you could help me. I am writing a program in C++, and the problem is, i have very limited experience in this language. I would like my user to enter a specific pattern, and...

C / C++

Sequence and/or pattern matching

by: Séb | last post by:

Hi everyone, I'm relatively new to python and I want to write a piece of code who do the following work for data mining purpose : 1) I have a list of connexion between some computers. This...

Python

New object-oriented parallel pattern matching algorithm

by: bpontius | last post by:

The GES Algorithm A Surprisingly Simple Algorithm for Parallel Pattern Matching "Partially because the best algorithms presented in the literature are difficult to understand and to implement,...

C / C++

Convert.ToString( double ) + xsd:pattern + RegEx == ?

by: aevans1108 | last post by:

expanding this message to microsoft.public.dotnet.xml Greetings Please direct me to the right group if this is an inappropriate place to post this question. Thanks. I want to format a...

.NET Framework

Strategy Design Pattern

by: Daniel Santa Cruz | last post by:

Hello all, I've been trying to go over my OO Patterns book, and I decided to try to implement them in Python this time around. I figured this would help me learn the language better. Well,...

Python

Observer Design Pattern

by: Krivenok Dmitry | last post by:

Hello All! I am trying to implement my own Design Patterns Library. I have read the following documentation about Observer Pattern: 1) Design Patterns by GoF Classic description of Observer....

C / C++

The Template Pattern in PHP 5

by: ltruett | last post by:

....and I've finally completed my series of GOF design patterns using PHP 5 with the Template Pattern. http://www.fluffycat.com/PHP-Design-Patterns/Template/ This is a pretty useful pattern,...

PHP

Generating strings based on pattern

by: dhinakar_ve | last post by:

Hi All, I am writing a function to generate the strings based on a pattern. For example A will generate A1, A2 and A3. If the pattern is A then it will generate the strings A11, A12, A21,...

C / C++

Finding square root of a matrix

by: krishnai888 | last post by:

I had already asked this question long back but no one has replied to me..I hope someone replies to me because its very important for me as I am doing my internship. I am currently writing a code...

Software Development

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

pattern finding algorithm

Similar topics