469,282 Members | 1,704 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,282 developers. It's quick & easy.

looping through a big file containing a set of files.

111 100+
hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//

the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46.  
  47. def readfasta():
  48.     file1= open("chr011.py",'r')
  49.     file_content=file1.readlines()
  50.     first=1
  51.     list1=""    
  52.     for line in file_content:
  53.         if line[0]==">":
  54.             if first==0:
  55.                 print "***********"
  56.                 list1+=sequence
  57.                 print "***********"
  58.             else:
  59.                 first=0
  60.                 sequence=""
  61.                 seq=""
  62.                 for i in range(0,len(line)-1):
  63.                     seq+=line[i]
  64.         else:
  65.                 for i in range(0,len(line)-1):
  66.             sequence+=line[i]  
  67.     list1+=sequence
  68.     return list1
  69.  
  70.  
  71.  
  72. p=readfasta()
  73.  
  74.  
  75.  
  76.  
  77.  
  78. res=1
  79. part=""
  80. q=len(p)
  81. seqq=""
  82.  
  83. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  84. for i in range(q-16):
  85.     part=p[i:i+16]
  86.     seqq=part
  87.     res=1
  88.     score=1
  89.     for j in range(16):
  90.         key=seqq[j]
  91.         res=res*datadict1[key]["%02d"%(j+1)]
  92.         #print res
  93.     for key in seqq:
  94.         score=score * value[key]
  95.     #print score,"*******************",res
  96.     log_ratio=log10(res/score)
  97.     print i,log_ratio
  98.  
what changes should i make and how?/
waiting for your reply,
cheers!
Jul 13 '07
103 5088
aboxylica
111 100+
i want to print the header only if it satifies the following condition:
how do i do it??
Expand|Select|Wrap|Line Numbers
  1. return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
  2.  
Dec 20 '07 #101
bvdet
2,851 Expert Mod 2GB
i want to print the header only if it satifies the following condition:
how do i do it??
Expand|Select|Wrap|Line Numbers
  1. return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
  2.  
Expand|Select|Wrap|Line Numbers
  1. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  2.     # sequence factor dictionary
  3.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  4.  
  5.     dataArray = parseArray(fnArray, arraySet)
  6.     if dataArray:
  7.         dataSeq = parseData(fnSeq, seqSet)
  8.         if not dataSeq:
  9.             return False
  10.     else:
  11.         return None
  12.  
  13.     # This is the complete sequence  
  14.     seq = ''.join(dataSeq[1:])
  15.     # These are the subkeys of dataArray - '01', '02', '03',.............
  16.     subKeys = dataArray['A'].keys()
  17.     subKeys.sort()
  18.  
  19.     # Calculate num/den for each slice of sequence
  20.     # Each sequence slice length = length of subKeys
  21.     # Example:
  22.     # seq = 'ATCGATA'
  23.     # subKeys length = 3
  24.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  25.     numList = []
  26.     denList = []
  27.     seqList = []
  28.     for i in range(len(seq) - len(subKeys) + 1):
  29.         subseq = seq[0:len(subKeys)]
  30.         seqList.append(subseq)
  31.         num, den = 1, 1
  32.         for j, s in enumerate(subseq):
  33.             num *= dataArray[s][subKeys[j]]
  34.             den *= value[s]
  35.         numList.append(num)
  36.         denList.append(den)
  37.         seq = seq[1:]
  38.  
  39.     resultList = []
  40.     seqList1 = []
  41.  
  42.     for i, num in enumerate(numList):
  43.         result = log10(num/denList[i])
  44.         # limit output to results greater than 1
  45.         if result > 2:
  46.             # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
  47.             resultList.append(result)
  48.             seqList1.append(seqList[i])
  49.  
  50.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
  51.                         (seqList1[i], res) for i, res in enumerate(resultList)])
  52.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
  53.            (arraySet, seqSet, dataSeq[0], outStr)
Dec 20 '07 #102
aboxylica
111 100+
i am getting an error and still many sequence headers whose values are less than 2 are printed!
but i want the headers only for those value is more than 2
Traceback (most recent call last):
File "C:\Python25\this_final_1.py", line 165, in <module>
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet )
File "C:\Python25\this_final_1.py", line 137, in compileD
seqList1.append(seqList[i])
IndexError: list index out of range
Dec 20 '07 #103
bvdet
2,851 Expert Mod 2GB
I am baffled. The code works fine for me, and limits the output to values over 2. Add print statements to check the status of seqList and the calculation results.
Dec 20 '07 #104

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

8 posts views Thread by kaptain kernel | last post: by
5 posts views Thread by B-Dog | last post: by
10 posts views Thread by bienwell | last post: by
5 posts views Thread by Mark | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.