473,507 Members | 2,389 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

looping through a big file containing a set of files.

111 New Member
hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//

the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46.  
  47. def readfasta():
  48.     file1= open("chr011.py",'r')
  49.     file_content=file1.readlines()
  50.     first=1
  51.     list1=""    
  52.     for line in file_content:
  53.         if line[0]==">":
  54.             if first==0:
  55.                 print "***********"
  56.                 list1+=sequence
  57.                 print "***********"
  58.             else:
  59.                 first=0
  60.                 sequence=""
  61.                 seq=""
  62.                 for i in range(0,len(line)-1):
  63.                     seq+=line[i]
  64.         else:
  65.                 for i in range(0,len(line)-1):
  66.             sequence+=line[i]  
  67.     list1+=sequence
  68.     return list1
  69.  
  70.  
  71.  
  72. p=readfasta()
  73.  
  74.  
  75.  
  76.  
  77.  
  78. res=1
  79. part=""
  80. q=len(p)
  81. seqq=""
  82.  
  83. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  84. for i in range(q-16):
  85.     part=p[i:i+16]
  86.     seqq=part
  87.     res=1
  88.     score=1
  89.     for j in range(16):
  90.         key=seqq[j]
  91.         res=res*datadict1[key]["%02d"%(j+1)]
  92.         #print res
  93.     for key in seqq:
  94.         score=score * value[key]
  95.     #print score,"*******************",res
  96.     log_ratio=log10(res/score)
  97.     print i,log_ratio
  98.  
what changes should i make and how?/
waiting for your reply,
cheers!
Jul 13 '07
103 5784
aboxylica
111 New Member
i want to print the header only if it satifies the following condition:
how do i do it??
Expand|Select|Wrap|Line Numbers
  1. return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
  2.  
Dec 20 '07 #101
bvdet
2,851 Recognized Expert Moderator Specialist
i want to print the header only if it satifies the following condition:
how do i do it??
Expand|Select|Wrap|Line Numbers
  1. return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
  2.  
Expand|Select|Wrap|Line Numbers
  1. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  2.     # sequence factor dictionary
  3.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  4.  
  5.     dataArray = parseArray(fnArray, arraySet)
  6.     if dataArray:
  7.         dataSeq = parseData(fnSeq, seqSet)
  8.         if not dataSeq:
  9.             return False
  10.     else:
  11.         return None
  12.  
  13.     # This is the complete sequence  
  14.     seq = ''.join(dataSeq[1:])
  15.     # These are the subkeys of dataArray - '01', '02', '03',.............
  16.     subKeys = dataArray['A'].keys()
  17.     subKeys.sort()
  18.  
  19.     # Calculate num/den for each slice of sequence
  20.     # Each sequence slice length = length of subKeys
  21.     # Example:
  22.     # seq = 'ATCGATA'
  23.     # subKeys length = 3
  24.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  25.     numList = []
  26.     denList = []
  27.     seqList = []
  28.     for i in range(len(seq) - len(subKeys) + 1):
  29.         subseq = seq[0:len(subKeys)]
  30.         seqList.append(subseq)
  31.         num, den = 1, 1
  32.         for j, s in enumerate(subseq):
  33.             num *= dataArray[s][subKeys[j]]
  34.             den *= value[s]
  35.         numList.append(num)
  36.         denList.append(den)
  37.         seq = seq[1:]
  38.  
  39.     resultList = []
  40.     seqList1 = []
  41.  
  42.     for i, num in enumerate(numList):
  43.         result = log10(num/denList[i])
  44.         # limit output to results greater than 1
  45.         if result > 2:
  46.             # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
  47.             resultList.append(result)
  48.             seqList1.append(seqList[i])
  49.  
  50.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
  51.                         (seqList1[i], res) for i, res in enumerate(resultList)])
  52.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
  53.            (arraySet, seqSet, dataSeq[0], outStr)
Dec 20 '07 #102
aboxylica
111 New Member
i am getting an error and still many sequence headers whose values are less than 2 are printed!
but i want the headers only for those value is more than 2
Traceback (most recent call last):
File "C:\Python25\this_final_1.py", line 165, in <module>
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet )
File "C:\Python25\this_final_1.py", line 137, in compileD
seqList1.append(seqList[i])
IndexError: list index out of range
Dec 20 '07 #103
bvdet
2,851 Recognized Expert Moderator Specialist
I am baffled. The code works fine for me, and limits the output to values over 2. Add print statements to check the status of seqList and the calculation results.
Dec 20 '07 #104

Sign in to post your reply or Sign up for a free account.

Similar topics

8
4038
by: kaptain kernel | last post by:
i've got a while loop thats iterating through a text file and pumping the contents into a database. the file is quite large (over 150mb). the looping causes my CPU load to race up to 100 per...
5
9418
by: B-Dog | last post by:
I have an old dos program that uses dat files to store the data and I'm trying to convert to dotnet. I'd like to be able to import the data into an access database but I don't know which format...
2
24192
by: deko | last post by:
I have a table that contains a bunch of pictures. When the user selects a particular image in a form, I need a way to extract the selected bitmap image (stored in an OLE Object table field) to the...
10
24036
by: bienwell | last post by:
Hi, I have a question about file included in ASP.NET. I have a file that includes all the Sub functions (e.g FileFunct.vb). One of the functions in this file is : Sub TestFunct(ByVal...
3
9528
by: Chung Leong | last post by:
Here's the rest of the tutorial I started earlier: Aside from text within a document, Indexing Service let you search on meta information stored in the files. For example, MusicArtist and...
1
2942
by: Alex | last post by:
Hello, I have a stored procedure that processes an individual file from a directory and archives it in a subdirectory.Now, the problem is, when i execute it , it will only process one file. What...
1
6441
by: laredotornado | last post by:
Hi, I'm using PHP 4.4.4 on Apache 2 on Fedora Core 5. PHP was installed using Apache's apxs and the php library was installed to /usr/local/php. However, when I set my "error_reporting"...
5
2683
by: Mark | last post by:
Hi I have an application (in vb.NET 2005) which holds data in SQL Server and some of the SQL records are simply paths to related files. I would like to be able to do a text search on both the...
0
1951
by: anthon | last post by:
Hi all - first post! anywho; I need to create a function for speeding up and down a looping clip. imagine a rotating object, triggered by an action, and slowly decreasing in speed, till it...
0
7223
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
7114
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
7321
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
1
7034
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
7488
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
1
5045
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
3179
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
1544
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
0
412
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.