443,996 Members | 1,498 Online Need help? Post your question and get tips & solutions from a community of 443,996 IT Pros & Developers. It's quick & easy.

# pattern finding algorithm

 P: 50 Hi, I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences. For example in the following: ACGCATTCA ACTGGATAC TCAGCCATC I would like the following output (where a full stop represents any character) (AC....T..) 2 occurrences (pattern between sequence 1 and 2) (.C.G...C) 2 occurrences (pattern between sequence 2 and 3) (.C.......) 2 occurrences (pattern between sequence 1 and 3) As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons? Thanks Jul 31 '07 #1
13 Replies

 Expert 5K+ P: 6,596 Hi, I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences. For example in the following: ACGCATTCA ACTGGATAC TCAGCCATC I would like the following output (where a full stop represents any character) (AC....T..) 2 occurrences (pattern between sequence 1 and 2) (.C.G...C) 2 occurrences (pattern between sequence 2 and 3) (.C.......) 2 occurrences (pattern between sequence 1 and 3) As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons? Thanks Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment: Expand|Select|Wrap|Line Numbers >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n' >>> s = s * 100 >>> import re >>> patternA = re.compile('AC....T..', re.MULTILINE) >>> patternA.match(s)   >>> res = patternA.findall(s) >>> len(res) 200 >>>  [EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT] Aug 1 '07 #2

 Expert 5K+ P: 6,596 Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment: Expand|Select|Wrap|Line Numbers >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n' >>> s = s * 100 >>> import re >>> patternA = re.compile('AC....T..', re.MULTILINE) >>> patternA.match(s)   >>> res = patternA.findall(s) >>> len(res) 200 >>>  [EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT] I think that I've got it: Expand|Select|Wrap|Line Numbers >>> patterns = ['AC....T..', '.C.G....C', '.C.......'] >>> reObjs = [re.compile(pat) for pat in patterns] >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC' >>> for i, reObj in enumerate(reObjs): ...     print "pattern: %s matches at the following positions:" %patterns[i] ...     result = reObj.finditer(s) ...     for match in result: ...         print (match.start() / 10) + 1 ...          pattern: AC....T.. matches at the following positions: 1 2 pattern: .C.G....C matches at the following positions: 2 3 pattern: .C....... matches at the following positions: 1 2 3 >>>  And it points out an error in your example. Aug 1 '07 #3

 Expert 5K+ P: 6,596 I think that I've got it: Expand|Select|Wrap|Line Numbers >>> patterns = ['AC....T..', '.C.G....C', '.C.......'] >>> reObjs = [re.compile(pat) for pat in patterns] >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC' >>> for i, reObj in enumerate(reObjs): ...     print "pattern: %s matches at the following positions:" %patterns[i] ...     result = reObj.finditer(s) ...     for match in result: ...         print (match.start() / 10) + 1 ...          pattern: AC....T.. matches at the following positions: 1 2 pattern: .C.G....C matches at the following positions: 2 3 pattern: .C....... matches at the following positions: 1 2 3 >>>  And it points out an error in your example. Putting it all together, you get: Expand|Select|Wrap|Line Numbers >>> for i, reObj in enumerate(reObjs): ...     positions = [((match.start() / 10) + 1) for match in reObj.finditer(s)] ...     nTimes = len(positions) ...     print "pattern: %s matches %d times, at the following positions: %s" %(patterns[i], nTimes, str(positions))     pattern: AC....T.. matches 2 times, at the following positions: [1, 2] pattern: .C.G....C matches 2 times, at the following positions: [2, 3] pattern: .C....... matches 3 times, at the following positions: [1, 2, 3] Aug 1 '07 #4

 Expert 100+ P: 511 Hi, I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences. For example in the following: ACGCATTCA ACTGGATAC TCAGCCATC I would like the following output (where a full stop represents any character) (AC....T..) 2 occurrences (pattern between sequence 1 and 2) (.C.G...C) 2 occurrences (pattern between sequence 2 and 3) (.C.......) 2 occurrences (pattern between sequence 1 and 3) As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons? Thanks i don't see why you can't use string slicing and indexing. eg Expand|Select|Wrap|Line Numbers ... if line[0:2] == "AC" and line == "T":   #do something and so on ....   Aug 1 '07 #5

 P: 50 hi thanks for all your replies, but I dont think I made my requirements clear. I am looking to find all existing patterns in a bunch of sequences, I will not know the patterns beforehand so I want to find all patterns that exist in more than two sequences thanks Aug 1 '07 #6

 Expert Mod 2.5K+ P: 2,851 hi thanks for all your replies, but I dont think I made my requirements clear. I am looking to find all existing patterns in a bunch of sequences, I will not know the patterns beforehand so I want to find all patterns that exist in more than two sequences thanks This took about a minute to run on my machine on 2000 strings, and produced a data file with 83,000,000 bytes. 'sList' is the data sequence. Expand|Select|Wrap|Line Numbers from sets import Set as set import re import time   patt = re.compile('[ACGT]')   outList = ['Opening output file: %s' % time.ctime(), ]   dd = {} indx = 0   while len(sList) > 0:     s1 = sList     for j, item in enumerate(sList[1:]):         res = ''         for i, s in enumerate(s1):             if s == item[i]:                 res += s             else:                 res += '.'         if patt.search(res):             if dd.has_key(res):                 dd[res].append([indx, j+1+indx])             else:                 dd[res] = [[indx, j+1+indx], ]     indx += 1       sList.pop(0)   keys = dd.keys() keys.sort()   for key in keys:     quan = len(set([item[j] for j in range(2) for item in dd[key]]))     outList.append('(%s) %d occurrences:' % (key, quan))     for v in dd[key]:         outList.append('    Pattern between sequence %d and %d' % (v, v))   outList.append('Writing data to output file: %s' % time.ctime()) fn = r'H:\TEMP\temsys\string_patterns.txt' f = open(fn, 'w') f.write('\n'.join(outList)) f.close() This is probably similar to the algorithm you were going to use. Here's a sample from the output: Expand|Select|Wrap|Line Numbers Opening output file: Wed Aug 01 08:50:53 2007 (........A) 676 occurrences:     Pattern between sequence 0 and 72     Pattern between sequence 0 and 89 ....................... (....CA.G.) 31 occurrences:     Pattern between sequence 30 and 60     Pattern between sequence 30 and 76     Pattern between sequence 30 and 242     Pattern between sequence 30 and 388 ....................... (TTGCCA.A.) 2 occurrences:     Pattern between sequence 612 and 1016 (TTGCCAA..) 2 occurrences:     Pattern between sequence 33 and 1016 Writing data to output file: Wed Aug 01 08:51:53 2007 Aug 1 '07 #7

 P: 50 Very nice indeed bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you soo much. Would anyone happen to know some tips to speed this up. I think you can use psycho to compile parts in C, but is there a way of converting to binary and then performing manipulations - would this be faster? Thanks Aug 1 '07 #8

 Expert 5K+ P: 6,596 Very nice indeed bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you soo much. Would anyone happen to know some tips to speed this up. I think you can use psycho to compile parts in C, but is there a way of converting to binary and then performing manipulations - would this be faster? Thanks Tests that I've run on other code suggest that using tuples instead of lists gives a significant performance boost: Expand|Select|Wrap|Line Numbers # line 21 here                 dd[res].append((indx, j+1+indx))             else:                 dd[res] = [(indx, j+1+indx), ] If that list is appended to elsewhere, the tuples can be added together more quickly than the append, also. Aug 1 '07 #9

 P: 50 Hi, A quick question on the above algorithm, it seems to performing more comparisons than necessary, though I can't seem to find where. I decided to experiment to get the number of comparisons required to compare each line with every other line - and not against itself (line 1 to 1) or lines already compared ( as comparing line 1 to 2 is the same as comparing line 2 to 1). Expand|Select|Wrap|Line Numbers line_num = 0 counter = 0   for line in myList:     elmt_num = 0       for elmt in myList:         i = 0         if line_num < elmt_num # As I don't want to compare twice and against the same seq             while i<9:                 counter += 1                 i += 1           elmt_num += 1       line_num += 1   print "%d comparisons were made between %d lines" % (counter, line_num)   I get the following output for my dataset 7587459 comparisons were made between 1299 lines Which is exactly what I would expect from comparing 9 times (n**2 -n)/2 where n is the number of sequences. The output I get from bvdet's algorithm is however 8430510 comparisons (10 time (n**2-n)/2) I fail to see where the extra 1 comparison each time is coming from. If someone can let me know, I would be thankful. Cheers Aug 2 '07 #10

 Expert Mod 2.5K+ P: 2,851 Hi, A quick question on the above algorithm, it seems to performing more comparisons than necessary, though I can't seem to find where. I decided to experiment to get the number of comparisons required to compare each line with every other line - and not against itself (line 1 to 1) or lines already compared ( as comparing line 1 to 2 is the same as comparing line 2 to 1). Expand|Select|Wrap|Line Numbers line_num = 0 counter = 0   for line in myList:     elmt_num = 0       for elmt in myList:         i = 0         if line_num < elmt_num # As I don't want to compare twice and against the same seq             while i<9:                 counter += 1                 i += 1           elmt_num += 1       line_num += 1   print "%d comparisons were made between %d lines" % (counter, line_num)   I get the following output for my dataset 7587459 comparisons were made between 1299 lines Which is exactly what I would expect from comparing 9 times (n**2 -n)/2 where n is the number of sequences. The output I get from bvdet's algorithm is however 8430510 comparisons (10 time (n**2-n)/2) I fail to see where the extra 1 comparison each time is coming from. If someone can let me know, I would be thankful. Cheers kdt - I ran the following function and counted the comparisons: Expand|Select|Wrap|Line Numbers def patt_match(sList):     global count     count = 0     sList = sList[:]     patt = re.compile('[ACGT]')     dd = {}     indx = 0     sList = strList[:]     while len(sList) > 0:         s1 = sList         for j, item in enumerate(sList[1:]):             res = ''             for i, s in enumerate(s1):                 count += 1                 if s == item[i]:                     res += s                 else:                     res += '.'             if patt.search(res):                 if dd.has_key(res):                     dd[res].append([indx, j+1+indx])                 else:                     dd[res] = [[indx, j+1+indx], ]         indx += 1               sList.pop(0)     return dd Output:>>> dd = patt_match(strList) >>> count 7587459 >>> Aug 2 '07 #11

 P: 50 hi bvdet, thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line Expand|Select|Wrap|Line Numbers while len(sList) > 0: Thanks Aug 4 '07 #12

 Expert 5K+ P: 6,596 hi bvdet, thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line Expand|Select|Wrap|Line Numbers while len(sList) > 0: Thanks Expand|Select|Wrap|Line Numbers # line 2 & 3 would read     global count linecount     count = linecount = 0 # insert at the level of (current line 10) - this looks like the lines to me -         linecount += 1 Aug 4 '07 #13

 Expert Mod 2.5K+ P: 2,851 hi bvdet, thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line Expand|Select|Wrap|Line Numbers while len(sList) > 0: Thanks Here are the results:>>> len(strList) 1299 >>> dd = patt_match(strList) >>> len(dd) 53939 >>> count 7587459 >>> linecount 843051 >>> itemcount 1299 >>> Here is the function: Expand|Select|Wrap|Line Numbers def patt_match(sList):     global count, linecount, itemcount     count = linecount = itemcount = 0     sList = sList[:]     patt = re.compile('[ACGT]')     dd = {}     indx = 0     while len(sList) > 0:         s1 = sList         for j, item in enumerate(sList[1:]):             res = ''             for i, s in enumerate(s1):                 count += 1                 if s == item[i]:                     res += s                 else:                     res += '.'             if patt.search(res):                 if dd.has_key(res):                     dd[res].append([indx, j+1+indx])                 else:                     dd[res] = [[indx, j+1+indx], ]             linecount += 1         indx += 1         itemcount += 1         sList.pop(0)     return dd Aug 4 '07 #14 