By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,996 Members | 1,498 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,996 IT Pros & Developers. It's quick & easy.

pattern finding algorithm

P: 50
Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons?

Thanks
Jul 31 '07 #1
Share this Question
Share on Google+
13 Replies


bartonc
Expert 5K+
P: 6,596
Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons?

Thanks
Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment:
Expand|Select|Wrap|Line Numbers
  1. >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n'
  2. >>> s = s * 100
  3. >>> import re
  4. >>> patternA = re.compile('AC....T..', re.MULTILINE)
  5. >>> patternA.match(s)
  6.  
  7. >>> res = patternA.findall(s)
  8. >>> len(res)
  9. 200
  10. >>> 
[EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT]
Aug 1 '07 #2

bartonc
Expert 5K+
P: 6,596
Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment:
Expand|Select|Wrap|Line Numbers
  1. >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n'
  2. >>> s = s * 100
  3. >>> import re
  4. >>> patternA = re.compile('AC....T..', re.MULTILINE)
  5. >>> patternA.match(s)
  6.  
  7. >>> res = patternA.findall(s)
  8. >>> len(res)
  9. 200
  10. >>> 
[EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT]
I think that I've got it:
Expand|Select|Wrap|Line Numbers
  1. >>> patterns = ['AC....T..', '.C.G....C', '.C.......']
  2. >>> reObjs = [re.compile(pat) for pat in patterns]
  3. >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC'
  4. >>> for i, reObj in enumerate(reObjs):
  5. ...     print "pattern: %s matches at the following positions:" %patterns[i]
  6. ...     result = reObj.finditer(s)
  7. ...     for match in result:
  8. ...         print (match.start() / 10) + 1
  9. ...         
  10. pattern: AC....T.. matches at the following positions:
  11. 1
  12. 2
  13. pattern: .C.G....C matches at the following positions:
  14. 2
  15. 3
  16. pattern: .C....... matches at the following positions:
  17. 1
  18. 2
  19. 3
  20. >>> 
And it points out an error in your example.
Aug 1 '07 #3

bartonc
Expert 5K+
P: 6,596
I think that I've got it:
Expand|Select|Wrap|Line Numbers
  1. >>> patterns = ['AC....T..', '.C.G....C', '.C.......']
  2. >>> reObjs = [re.compile(pat) for pat in patterns]
  3. >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC'
  4. >>> for i, reObj in enumerate(reObjs):
  5. ...     print "pattern: %s matches at the following positions:" %patterns[i]
  6. ...     result = reObj.finditer(s)
  7. ...     for match in result:
  8. ...         print (match.start() / 10) + 1
  9. ...         
  10. pattern: AC....T.. matches at the following positions:
  11. 1
  12. 2
  13. pattern: .C.G....C matches at the following positions:
  14. 2
  15. 3
  16. pattern: .C....... matches at the following positions:
  17. 1
  18. 2
  19. 3
  20. >>> 
And it points out an error in your example.
Putting it all together, you get:
Expand|Select|Wrap|Line Numbers
  1. >>> for i, reObj in enumerate(reObjs):
  2. ...     positions = [((match.start() / 10) + 1) for match in reObj.finditer(s)]
  3. ...     nTimes = len(positions)
  4. ...     print "pattern: %s matches %d times, at the following positions: %s" %(patterns[i], nTimes, str(positions))
  5.  
  6.  
  7. pattern: AC....T.. matches 2 times, at the following positions: [1, 2]
  8. pattern: .C.G....C matches 2 times, at the following positions: [2, 3]
  9. pattern: .C....... matches 3 times, at the following positions: [1, 2, 3]
Aug 1 '07 #4

Expert 100+
P: 511
Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way that I am planning on doing this now requires sum(n-1...1) comparisons. Is there a more efficient way of doing this with less comparisons?

Thanks
i don't see why you can't use string slicing and indexing.
eg
Expand|Select|Wrap|Line Numbers
  1. ...
  2. if line[0:2] == "AC" and line[6] == "T":
  3.   #do something and so on
  4. ....
  5.  
Aug 1 '07 #5

P: 50
kdt
hi

thanks for all your replies, but I dont think I made my requirements clear.

I am looking to find all existing patterns in a bunch of sequences, I will not know the patterns beforehand so I want to find all patterns that exist in more than two sequences

thanks
Aug 1 '07 #6

bvdet
Expert Mod 2.5K+
P: 2,851
hi

thanks for all your replies, but I dont think I made my requirements clear.

I am looking to find all existing patterns in a bunch of sequences, I will not know the patterns beforehand so I want to find all patterns that exist in more than two sequences

thanks
This took about a minute to run on my machine on 2000 strings, and produced a data file with 83,000,000 bytes. 'sList' is the data sequence.
Expand|Select|Wrap|Line Numbers
  1. from sets import Set as set
  2. import re
  3. import time
  4.  
  5. patt = re.compile('[ACGT]')
  6.  
  7. outList = ['Opening output file: %s' % time.ctime(), ]
  8.  
  9. dd = {}
  10. indx = 0
  11.  
  12. while len(sList) > 0:
  13.     s1 = sList[0]
  14.     for j, item in enumerate(sList[1:]):
  15.         res = ''
  16.         for i, s in enumerate(s1):
  17.             if s == item[i]:
  18.                 res += s
  19.             else:
  20.                 res += '.'
  21.         if patt.search(res):
  22.             if dd.has_key(res):
  23.                 dd[res].append([indx, j+1+indx])
  24.             else:
  25.                 dd[res] = [[indx, j+1+indx], ]
  26.     indx += 1
  27.  
  28.     sList.pop(0)
  29.  
  30. keys = dd.keys()
  31. keys.sort()
  32.  
  33. for key in keys:
  34.     quan = len(set([item[j] for j in range(2) for item in dd[key]]))
  35.     outList.append('(%s) %d occurrences:' % (key, quan))
  36.     for v in dd[key]:
  37.         outList.append('    Pattern between sequence %d and %d' % (v[0], v[1]))
  38.  
  39. outList.append('Writing data to output file: %s' % time.ctime())
  40. fn = r'H:\TEMP\temsys\string_patterns.txt'
  41. f = open(fn, 'w')
  42. f.write('\n'.join(outList))
  43. f.close()
This is probably similar to the algorithm you were going to use. Here's a sample from the output:
Expand|Select|Wrap|Line Numbers
  1. Opening output file: Wed Aug 01 08:50:53 2007
  2. (........A) 676 occurrences:
  3.     Pattern between sequence 0 and 72
  4.     Pattern between sequence 0 and 89
  5. .......................
  6. (....CA.G.) 31 occurrences:
  7.     Pattern between sequence 30 and 60
  8.     Pattern between sequence 30 and 76
  9.     Pattern between sequence 30 and 242
  10.     Pattern between sequence 30 and 388
  11. .......................
  12. (TTGCCA.A.) 2 occurrences:
  13.     Pattern between sequence 612 and 1016
  14. (TTGCCAA..) 2 occurrences:
  15.     Pattern between sequence 33 and 1016
  16. Writing data to output file: Wed Aug 01 08:51:53 2007
Aug 1 '07 #7

P: 50
kdt
Very nice indeed bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you soo much.

Would anyone happen to know some tips to speed this up. I think you can use psycho to compile parts in C, but is there a way of converting to binary and then performing manipulations - would this be faster?

Thanks
Aug 1 '07 #8

bartonc
Expert 5K+
P: 6,596
Very nice indeed bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you soo much.

Would anyone happen to know some tips to speed this up. I think you can use psycho to compile parts in C, but is there a way of converting to binary and then performing manipulations - would this be faster?

Thanks
Tests that I've run on other code suggest that using tuples instead of lists gives a significant performance boost:
Expand|Select|Wrap|Line Numbers
  1. # line 21 here
  2.                 dd[res].append((indx, j+1+indx))
  3.             else:
  4.                 dd[res] = [(indx, j+1+indx), ]
If that list is appended to elsewhere, the tuples can be added together more quickly than the append, also.
Aug 1 '07 #9

P: 50
kdt
Hi,

A quick question on the above algorithm, it seems to performing more comparisons than necessary, though I can't seem to find where.

I decided to experiment to get the number of comparisons required to compare each line with every other line - and not against itself (line 1 to 1) or lines already compared ( as comparing line 1 to 2 is the same as comparing line 2 to 1).

Expand|Select|Wrap|Line Numbers
  1. line_num = 0
  2. counter = 0
  3.  
  4. for line in myList:
  5.     elmt_num = 0
  6.  
  7.     for elmt in myList:
  8.         i = 0
  9.         if line_num < elmt_num # As I don't want to compare twice and against the same seq
  10.             while i<9:
  11.                 counter += 1
  12.                 i += 1
  13.  
  14.         elmt_num += 1
  15.  
  16.     line_num += 1
  17.  
  18. print "%d comparisons were made between %d lines" % (counter, line_num)
  19.  
I get the following output for my dataset

7587459 comparisons were made between 1299 lines

Which is exactly what I would expect from comparing 9 times (n**2 -n)/2 where n is the number of sequences.

The output I get from bvdet's algorithm is however 8430510 comparisons (10 time (n**2-n)/2) I fail to see where the extra 1 comparison each time is coming from.

If someone can let me know, I would be thankful.

Cheers
Aug 2 '07 #10

bvdet
Expert Mod 2.5K+
P: 2,851
Hi,

A quick question on the above algorithm, it seems to performing more comparisons than necessary, though I can't seem to find where.

I decided to experiment to get the number of comparisons required to compare each line with every other line - and not against itself (line 1 to 1) or lines already compared ( as comparing line 1 to 2 is the same as comparing line 2 to 1).

Expand|Select|Wrap|Line Numbers
  1. line_num = 0
  2. counter = 0
  3.  
  4. for line in myList:
  5.     elmt_num = 0
  6.  
  7.     for elmt in myList:
  8.         i = 0
  9.         if line_num < elmt_num # As I don't want to compare twice and against the same seq
  10.             while i<9:
  11.                 counter += 1
  12.                 i += 1
  13.  
  14.         elmt_num += 1
  15.  
  16.     line_num += 1
  17.  
  18. print "%d comparisons were made between %d lines" % (counter, line_num)
  19.  
I get the following output for my dataset

7587459 comparisons were made between 1299 lines

Which is exactly what I would expect from comparing 9 times (n**2 -n)/2 where n is the number of sequences.

The output I get from bvdet's algorithm is however 8430510 comparisons (10 time (n**2-n)/2) I fail to see where the extra 1 comparison each time is coming from.

If someone can let me know, I would be thankful.

Cheers
kdt - I ran the following function and counted the comparisons:
Expand|Select|Wrap|Line Numbers
  1. def patt_match(sList):
  2.     global count
  3.     count = 0
  4.     sList = sList[:]
  5.     patt = re.compile('[ACGT]')
  6.     dd = {}
  7.     indx = 0
  8.     sList = strList[:]
  9.     while len(sList) > 0:
  10.         s1 = sList[0]
  11.         for j, item in enumerate(sList[1:]):
  12.             res = ''
  13.             for i, s in enumerate(s1):
  14.                 count += 1
  15.                 if s == item[i]:
  16.                     res += s
  17.                 else:
  18.                     res += '.'
  19.             if patt.search(res):
  20.                 if dd.has_key(res):
  21.                     dd[res].append([indx, j+1+indx])
  22.                 else:
  23.                     dd[res] = [[indx, j+1+indx], ]
  24.         indx += 1      
  25.         sList.pop(0)
  26.     return dd
Output:
>>> dd = patt_match(strList)
>>> count
7587459
>>>
Aug 2 '07 #11

P: 50
kdt
hi bvdet,

thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line

Expand|Select|Wrap|Line Numbers
  1. while len(sList) > 0:
Thanks
Aug 4 '07 #12

bartonc
Expert 5K+
P: 6,596
hi bvdet,

thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line

Expand|Select|Wrap|Line Numbers
  1. while len(sList) > 0:
Thanks
Expand|Select|Wrap|Line Numbers
  1. # line 2 & 3 would read
  2.     global count linecount
  3.     count = linecount = 0
  4. # insert at the level of (current line 10) - this looks like the lines to me -
  5.         linecount += 1
Aug 4 '07 #13

bvdet
Expert Mod 2.5K+
P: 2,851
hi bvdet,

thanks for getting back to me on this. Unfortunately when I run your script I still get the same results. Would it be possible for you to check the number of lines by adding another counter after the line

Expand|Select|Wrap|Line Numbers
  1. while len(sList) > 0:
Thanks
Here are the results:
>>> len(strList)
1299
>>> dd = patt_match(strList)
>>> len(dd)
53939
>>> count
7587459
>>> linecount
843051
>>> itemcount
1299
>>>
Here is the function:
Expand|Select|Wrap|Line Numbers
  1. def patt_match(sList):
  2.     global count, linecount, itemcount
  3.     count = linecount = itemcount = 0
  4.     sList = sList[:]
  5.     patt = re.compile('[ACGT]')
  6.     dd = {}
  7.     indx = 0
  8.     while len(sList) > 0:
  9.         s1 = sList[0]
  10.         for j, item in enumerate(sList[1:]):
  11.             res = ''
  12.             for i, s in enumerate(s1):
  13.                 count += 1
  14.                 if s == item[i]:
  15.                     res += s
  16.                 else:
  17.                     res += '.'
  18.             if patt.search(res):
  19.                 if dd.has_key(res):
  20.                     dd[res].append([indx, j+1+indx])
  21.                 else:
  22.                     dd[res] = [[indx, j+1+indx], ]
  23.             linecount += 1
  24.         indx += 1
  25.         itemcount += 1
  26.         sList.pop(0)
  27.     return dd
Aug 4 '07 #14

Post your reply

Sign in to post your reply or Sign up for a free account.