472,096 Members | 2,215 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,096 software developers and data experts.

looping through a big file containing a set of files.

111 100+
hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//

the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46.  
  47. def readfasta():
  48.     file1= open("chr011.py",'r')
  49.     file_content=file1.readlines()
  50.     first=1
  51.     list1=""    
  52.     for line in file_content:
  53.         if line[0]==">":
  54.             if first==0:
  55.                 print "***********"
  56.                 list1+=sequence
  57.                 print "***********"
  58.             else:
  59.                 first=0
  60.                 sequence=""
  61.                 seq=""
  62.                 for i in range(0,len(line)-1):
  63.                     seq+=line[i]
  64.         else:
  65.                 for i in range(0,len(line)-1):
  66.             sequence+=line[i]  
  67.     list1+=sequence
  68.     return list1
  69.  
  70.  
  71.  
  72. p=readfasta()
  73.  
  74.  
  75.  
  76.  
  77.  
  78. res=1
  79. part=""
  80. q=len(p)
  81. seqq=""
  82.  
  83. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  84. for i in range(q-16):
  85.     part=p[i:i+16]
  86.     seqq=part
  87.     res=1
  88.     score=1
  89.     for j in range(16):
  90.         key=seqq[j]
  91.         res=res*datadict1[key]["%02d"%(j+1)]
  92.         #print res
  93.     for key in seqq:
  94.         score=score * value[key]
  95.     #print score,"*******************",res
  96.     log_ratio=log10(res/score)
  97.     print i,log_ratio
  98.  
what changes should i make and how?/
waiting for your reply,
cheers!
Jul 13 '07
103 5481
aboxylica
111 100+
i want to print the header only if it satifies the following condition:
how do i do it??
Expand|Select|Wrap|Line Numbers
  1. return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
  2.  
Dec 20 '07 #101
bvdet
2,851 Expert Mod 2GB
i want to print the header only if it satifies the following condition:
how do i do it??
Expand|Select|Wrap|Line Numbers
  1. return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
  2.  
Expand|Select|Wrap|Line Numbers
  1. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  2.     # sequence factor dictionary
  3.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  4.  
  5.     dataArray = parseArray(fnArray, arraySet)
  6.     if dataArray:
  7.         dataSeq = parseData(fnSeq, seqSet)
  8.         if not dataSeq:
  9.             return False
  10.     else:
  11.         return None
  12.  
  13.     # This is the complete sequence  
  14.     seq = ''.join(dataSeq[1:])
  15.     # These are the subkeys of dataArray - '01', '02', '03',.............
  16.     subKeys = dataArray['A'].keys()
  17.     subKeys.sort()
  18.  
  19.     # Calculate num/den for each slice of sequence
  20.     # Each sequence slice length = length of subKeys
  21.     # Example:
  22.     # seq = 'ATCGATA'
  23.     # subKeys length = 3
  24.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  25.     numList = []
  26.     denList = []
  27.     seqList = []
  28.     for i in range(len(seq) - len(subKeys) + 1):
  29.         subseq = seq[0:len(subKeys)]
  30.         seqList.append(subseq)
  31.         num, den = 1, 1
  32.         for j, s in enumerate(subseq):
  33.             num *= dataArray[s][subKeys[j]]
  34.             den *= value[s]
  35.         numList.append(num)
  36.         denList.append(den)
  37.         seq = seq[1:]
  38.  
  39.     resultList = []
  40.     seqList1 = []
  41.  
  42.     for i, num in enumerate(numList):
  43.         result = log10(num/denList[i])
  44.         # limit output to results greater than 1
  45.         if result > 2:
  46.             # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
  47.             resultList.append(result)
  48.             seqList1.append(seqList[i])
  49.  
  50.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
  51.                         (seqList1[i], res) for i, res in enumerate(resultList)])
  52.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
  53.            (arraySet, seqSet, dataSeq[0], outStr)
Dec 20 '07 #102
aboxylica
111 100+
i am getting an error and still many sequence headers whose values are less than 2 are printed!
but i want the headers only for those value is more than 2
Traceback (most recent call last):
File "C:\Python25\this_final_1.py", line 165, in <module>
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet )
File "C:\Python25\this_final_1.py", line 137, in compileD
seqList1.append(seqList[i])
IndexError: list index out of range
Dec 20 '07 #103
bvdet
2,851 Expert Mod 2GB
I am baffled. The code works fine for me, and limits the output to values over 2. Add print statements to check the status of seqList and the calculation results.
Dec 20 '07 #104

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

8 posts views Thread by kaptain kernel | last post: by
5 posts views Thread by B-Dog | last post: by
10 posts views Thread by bienwell | last post: by
5 posts views Thread by Mark | last post: by
reply views Thread by leo001 | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.