Bytes | Software Development & Data Engineering Community

pattern finding algorithm

kdt
50 New Member
Hi,

I'm checking to see if you guys may be able to help me with an algorithm for finding patterns. I have around 2000 short sequences (of length 9) that are aligned. I want to be able to extract all common patterns on the same positions and report the number of occurrences.

For example in the following:

ACGCATTCA
ACTGGATAC
TCAGCCATC

I would like the following output (where a full stop represents any character)

(AC....T..) 2 occurrences (pattern between sequence 1 and 2)
(.C.G...C) 2 occurrences (pattern between sequence 2 and 3)
(.C.......) 2 occurrences (pattern between sequence 1 and 3)

As you can see, the way I am planning to do this requires sum(n-1...1) = n(n-1)/2 comparisons. Is there a more efficient way of doing this with fewer comparisons?
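
To make the comparison concrete, here is a minimal sketch of a single pairwise comparison and of the pair count for 2000 sequences. The helper name `pairwise_pattern` is made up for illustration and does not come from the thread.

```python
def pairwise_pattern(a, b):
    """Keep characters where two aligned sequences agree; use '.' elsewhere."""
    return "".join(x if x == y else "." for x, y in zip(a, b))

seq1, seq2, seq3 = "ACGCATTCA", "ACTGGATAC", "TCAGCCATC"

print(pairwise_pattern(seq1, seq2))  # AC....T..
print(pairwise_pattern(seq2, seq3))  # .C.G....C
print(pairwise_pattern(seq1, seq3))  # .C.......

# Comparing every unordered pair once takes n * (n - 1) / 2 comparisons.
n = 2000
print(n * (n - 1) // 2)  # 1999000
```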

Thanks
Jul 31 '07 #1
bartonc
6,596 Recognized Expert Expert
Coincidentally enough, regular expressions use the "full stop" as you have specified (but in a regex, it's called a "dot"). I can give more detail, but am in a rush at the moment:
    >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC\n'
    >>> s = s * 100
    >>> import re
    >>> patternA = re.compile('AC....T..', re.MULTILINE)
    >>> patternA.match(s)

    >>> res = patternA.findall(s)
    >>> len(res)
    200
[EDIT] I've re-read the problem and have an idea. Back soon.[/EDIT]
Aug 1 '07 #2
bartonc
6,596 Recognized Expert Expert
I think that I've got it:
    >>> patterns = ['AC....T..', '.C.G....C', '.C.......']
    >>> reObjs = [re.compile(pat) for pat in patterns]
    >>> s = 'ACGCATTCA\nACTGGATAC\nTCAGCCATC'
    >>> for i, reObj in enumerate(reObjs):
    ...     print("pattern: %s matches at the following positions:" % patterns[i])
    ...     result = reObj.finditer(s)
    ...     for match in result:
    ...         # each sequence plus its newline occupies 10 characters
    ...         print((match.start() // 10) + 1)
    ...
    pattern: AC....T.. matches at the following positions:
    1
    2
    pattern: .C.G....C matches at the following positions:
    2
    3
    pattern: .C....... matches at the following positions:
    1
    2
    3
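And it points out an error in your example: .C....... matches all three sequences, not just sequences 1 and 3. (The second pattern in your post is also one dot short; it should be .C.G....C.)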
Aug 1 '07 #3
bartonc
6,596 Recognized Expert Expert
Putting it all together, you get:
    >>> for i, reObj in enumerate(reObjs):
    ...     positions = [(match.start() // 10) + 1 for match in reObj.finditer(s)]
    ...     nTimes = len(positions)
    ...     print("pattern: %s matches %d times, at the following positions: %s" % (patterns[i], nTimes, str(positions)))
    ...
    pattern: AC....T.. matches 2 times, at the following positions: [1, 2]
    pattern: .C.G....C matches 2 times, at the following positions: [2, 3]
    pattern: .C....... matches 3 times, at the following positions: [1, 2, 3]
Aug 1 '07 #4
ghostdog74
511 Recognized Expert Contributor
I don't see why you can't use string slicing and indexing, e.g.:
    ...
    if line[0:2] == "AC" and line[6] == "T":
        pass  # do something, and so on
    ...
Aug 1 '07 #5
kdt
50 New Member
Hi,

Thanks for all your replies, but I don't think I made my requirements clear.

I am looking to find all existing patterns in a bunch of sequences. I will not know the patterns beforehand, so I want to find all patterns that exist in more than two sequences.
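
One way to find patterns without knowing them in advance, and without comparing every pair, is to enumerate every masked version of each sequence and count them. This is a different technique from the pairwise comparison discussed in this thread, and only a sketch: it assumes short sequences (length 9 gives 2**9 - 1 = 511 non-empty masks per sequence). The function name `count_patterns` and its `min_count` parameter are hypothetical.

```python
from collections import Counter
from itertools import product

def count_patterns(seqs, min_count=2):
    """Count, for every masked pattern, how many sequences it matches."""
    length = len(seqs[0])
    # All keep/hide choices per position, except the all-hidden mask.
    masks = [m for m in product((True, False), repeat=length) if any(m)]
    counts = Counter()
    for seq in seqs:
        for mask in masks:
            counts["".join(c if keep else "." for c, keep in zip(seq, mask))] += 1
    return {p: n for p, n in counts.items() if n >= min_count}

found = count_patterns(["ACGCATTCA", "ACTGGATAC", "TCAGCCATC"])
print(found[".C......."])  # 3
print(found["AC....T.."])  # 2
print(found[".C.G....C"])  # 2
```

Note that this also reports every sub-pattern (for example `A........`), so the result is larger than a pairwise comparison would give; raise `min_count` to keep only patterns shared by more sequences.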

thanks
Aug 1 '07 #6
bvdet
2,851 Recognized Expert Moderator Specialist
This took about a minute to run on my machine on 2000 strings and produced an output file of 83,000,000 bytes. 'sList' is the list of input sequences.
  1. # set() is a builtin since Python 2.4; the deprecated sets module is not needed
  2. import re
  3. import time
  4.  
  5. patt = re.compile('[ACGT]')
  6.  
  7. outList = ['Opening output file: %s' % time.ctime(), ]
  8.  
  9. dd = {}
  10. indx = 0
  11.  
  12. while len(sList) > 0:
  13.     s1 = sList[0]
  14.     for j, item in enumerate(sList[1:]):
  15.         res = ''
  16.         for i, s in enumerate(s1):
  17.             if s == item[i]:
  18.                 res += s
  19.             else:
  20.                 res += '.'
  21.         if patt.search(res):
  22.             if res in dd:
  23.                 dd[res].append([indx, j+1+indx])
  24.             else:
  25.                 dd[res] = [[indx, j+1+indx], ]
  26.     indx += 1
  27.  
  28.     sList.pop(0)
  29.  
  30. keys = list(dd.keys())
  31. keys.sort()
  32.  
  33. for key in keys:
  34.     quan = len(set([item[j] for j in range(2) for item in dd[key]]))
  35.     outList.append('(%s) %d occurrences:' % (key, quan))
  36.     for v in dd[key]:
  37.         outList.append('    Pattern between sequence %d and %d' % (v[0], v[1]))
  38.  
  39. outList.append('Writing data to output file: %s' % time.ctime())
  40. fn = r'H:\TEMP\temsys\string_patterns.txt'
  41. f = open(fn, 'w')
  42. f.write('\n'.join(outList))
  43. f.close()
This is probably similar to the algorithm you were going to use. Here's a sample from the output:
    Opening output file: Wed Aug 01 08:50:53 2007
    (........A) 676 occurrences:
        Pattern between sequence 0 and 72
        Pattern between sequence 0 and 89
    .......................
    (....CA.G.) 31 occurrences:
        Pattern between sequence 30 and 60
        Pattern between sequence 30 and 76
        Pattern between sequence 30 and 242
        Pattern between sequence 30 and 388
    .......................
    (TTGCCA.A.) 2 occurrences:
        Pattern between sequence 612 and 1016
    (TTGCCAA..) 2 occurrences:
        Pattern between sequence 33 and 1016
    Writing data to output file: Wed Aug 01 08:51:53 2007
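
For readers on a current Python, the same pairwise idea can be sketched more compactly. This is an illustrative rewrite, not the code posted above: the function name `pairwise_patterns` is made up, and it returns a dict instead of writing a file.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_patterns(seqs):
    """Map each shared pattern to the set of sequence indices that produced it."""
    members = defaultdict(set)
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        pattern = "".join(x if x == y else "." for x, y in zip(a, b))
        if pattern.strip("."):  # skip pairs with nothing in common
            members[pattern].update((i, j))
    return members

result = pairwise_patterns(["ACGCATTCA", "ACTGGATAC", "TCAGCCATC"])
for pattern in sorted(result):
    print("(%s) %d occurrences: sequences %s"
          % (pattern, len(result[pattern]), sorted(result[pattern])))
```

Like the pairwise approach in general, this counts only the sequences that took part in a pair producing exactly that pattern, so `.C.......` is reported with 2 occurrences here even though all three sequences contain it.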
Aug 1 '07 #7
kdt
50 New Member
Very nice indeed, bvdet. I need to make some slight changes, but this is a great step in the right direction. Thank you so much.

Would anyone happen to know some tips to speed this up? I think you can use Psyco to compile parts to machine code, but is there a way of converting the sequences to binary and then performing the manipulations on that? Would this be faster?
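
On the binary idea, one possible encoding is one-hot with four bits per base. It is sketched here as an assumption, not something proposed in the thread. The bitwise AND of two encoded sequences then has a non-zero nibble exactly where the sequences agree, and the result is itself the encoded pattern. The names `ONE_HOT`, `encode`, and `decode` are made up for illustration, and whether this beats plain string comparison would need measuring.

```python
ONE_HOT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
LETTER = {bits: base for base, bits in ONE_HOT.items()}

def encode(seq):
    """Pack a sequence into one integer, four bits per position."""
    value = 0
    for base in seq:
        value = (value << 4) | ONE_HOT[base]
    return value

def decode(value, length):
    """Turn an encoded pattern back into text; a zero nibble becomes '.'."""
    chars = []
    for _ in range(length):
        chars.append(LETTER.get(value & 0xF, "."))
        value >>= 4
    return "".join(reversed(chars))

a, b = encode("ACGCATTCA"), encode("ACTGGATAC")
print(decode(a & b, 9))  # AC....T..
```

Because the AND result is a plain integer, it can be used directly as a dictionary key when grouping pairs by pattern.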

Thanks
Aug 1 '07 #8
bartonc
6,596 Recognized Expert Expert
Tests that I've run on other code suggest that using tuples instead of lists gives a significant performance boost:
    # line 21 here
                    dd[res].append((indx, j+1+indx))
                else:
                    dd[res] = [(indx, j+1+indx), ]
If that list is appended to elsewhere, the tuples can be added together more quickly than the append, also.
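
The list-versus-tuple claim is easy to check on your own machine. A minimal sketch using the standard `timeit` module follows; the statement strings are illustrative, and the timings will vary by Python version and hardware, so none are shown here.

```python
import timeit

# Time building the two-element pair each way, one million times.
as_list = timeit.timeit("[indx, j + 1 + indx]", setup="indx, j = 3, 7", number=1000000)
as_tuple = timeit.timeit("(indx, j + 1 + indx)", setup="indx, j = 3, 7", number=1000000)

print("list:  %.3f s" % as_list)
print("tuple: %.3f s" % as_tuple)
```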
Aug 1 '07 #9
kdt
50 New Member
Hi,

A quick question on the above algorithm: it seems to be performing more comparisons than necessary, though I can't seem to find where.

I decided to experiment to get the number of comparisons required to compare each line with every other line, but not against itself (line 1 to 1) or against lines already compared (comparing line 1 to 2 is the same as comparing line 2 to 1).

    line_num = 0
    counter = 0

    for line in myList:
        elmt_num = 0

        for elmt in myList:
            i = 0
            if line_num < elmt_num:  # As I don't want to compare twice and against the same seq
                while i < 9:
                    counter += 1
                    i += 1

            elmt_num += 1

        line_num += 1

    print("%d comparisons were made between %d lines" % (counter, line_num))

I get the following output for my dataset:

7587459 comparisons were made between 1299 lines

Which is exactly what I would expect: 9 * (n**2 - n)/2 comparisons, where n is the number of sequences.

The output I get from bvdet's algorithm, however, is 8430510 comparisons (10 * (n**2 - n)/2). I fail to see where the one extra comparison per pair is coming from.
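
The two totals are consistent with 9 and 10 inner-loop steps per pair, as the arithmetic below confirms. One possible source of a tenth step, offered only as a guess since the thread does not settle it, is that each string is 10 characters long, for example because lines read from a file still carry their trailing newline. bvdet's inner loop iterates over every character of `s1`, whereas the counting script hard-codes 9.

```python
n = 1299
pairs = (n**2 - n) // 2

print(pairs)       # 843051
print(9 * pairs)   # 7587459
print(10 * pairs)  # 8430510

# Hypothetical illustration: a line read from a file keeps its newline.
line = "ACGCATTCA\n"
print(len(line))          # 10
print(len(line.strip()))  # 9
```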

If someone can let me know, I would be thankful.

Cheers
Aug 2 '07 #10
