looping through a big file containing a set of files.

111 New Member
Hey!
I have a program that takes two input files: one holding a weight matrix and one holding a sequence. My problem is that I now have to give it a matrix file containing many matrices and a sequence file containing many sequences, and calculate the same log score that I calculated for one matrix file and one sequence file.
How it should work: for every sequence, it should calculate the log values for all the weight matrices, then go to the second sequence and calculate all the log values using the matrices again.
My matrix file is huge and contains many matrices. A part of it is here:

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
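A minimal sketch, assuming the layout shown above ('//' separators, an 'NA' name line, a 'PO' column header, then one row of counts per position), of how such a file could be split into separate matrices. The function name `parse_matrices` and the returned structure are illustrative choices, not code from this thread, and the sketch is written for Python 3 while the thread's code is Python 2.

```python
def parse_matrices(lines):
    """Split a multi-matrix file into a list of (name, rows) pairs.

    rows holds one dict per matrix position, e.g. {'A': 10.19, 'C': 0.0, ...}.
    """
    matrices = []
    name, bases, rows = None, None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('//'):
            # '//' closes the current matrix (and also precedes the next one)
            if name is not None and rows:
                matrices.append((name, rows))
            name, bases, rows = None, None, []
        elif line.startswith('NA'):
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else ''
        elif line.startswith('PO'):
            bases = line.split()[1:]              # ['A', 'C', 'G', 'T']
        elif bases is not None:
            counts = [float(x) for x in line.split()[1:]]
            rows.append(dict(zip(bases, counts)))
    if name is not None and rows:                 # file without a final '//'
        matrices.append((name, rows))
    return matrices


sample = """//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
//
//
NA His2B
PO A C G T
01 0.41 0.61 0.53 21.43
//
"""
matrices = parse_matrices(sample.splitlines())
for name, rows in matrices:
    print(name, len(rows))
```

Because each matrix keeps its own list of rows, matrices of different lengths (11, 15, 16 or 8 positions in the excerpt) need no special handling. With a real file, `parse_matrices(open(path))` would work the same way, since an open file iterates over its lines.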

The sequence file is here (this is also only a part of my file). The actual sequence starts at "CC"; the line before it is just a heading, which we omit. This excerpt contains two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGGTTTCTAAATT GGATTATTTCTACCT GAC
CCTGGAGCCATCGTC CTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTAGAAGCTAAAT CCATGTCCACGTCAA ACC
AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCACACCATACAC CACACACCACACATA CAC
ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAAGGAAATTATT AAAAAAAATAATAAA ATT
AATACAAAAAAAATA TATATATATA
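A minimal sketch, assuming a FASTA-style file like the excerpt above (a '>' header line followed by one or more sequence lines), of reading every record into a (header, sequence) pair. The name `read_fasta` is hypothetical; it is not the `readfasta()` from this thread, and it is written for Python 3. Spaces inside sequence lines are dropped, on the assumption that they are display artefacts rather than data.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs, one per '>' record."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # remove embedded whitespace and normalise to upper case
            chunks.append(''.join(line.split()).upper())
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


sample = """>seq1
ACG TAC
gt
>seq2
TTGA
"""
records = read_fasta(sample.splitlines())
for header, seq in records:
    print(header, len(seq), seq)
```

Keeping each sequence separate, instead of concatenating them into one long string, is what makes it possible to report scores per sequence header later on.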
This is my code, which works (it prints the log value for one sequence and one matrix):
from math import *
import random

# Read the first matrix in the file: skip ahead to its 'PO' header line
f=open("deeps1.txt","r")
line=f.next()
while not line.startswith('PO'):
    line=f.next()

headerlist=line.strip().split()[1:]   # ['A', 'C', 'G', 'T']
linelist=[]

# Collect the data rows up to the closing '//'
line=f.next().strip()
while not line.startswith('/'):
    if line != '':
        linelist.append(line.strip().split())
    line=f.next().strip()

keys=[i[0] for i in linelist]
values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]

array={}
linedict=dict(zip(keys,values))
keys = linedict.keys()
keys.sort()
for key in keys:
    array=[key,linedict[key]]

# datadict[base][position] = count
datadict={}
datadict1={}
for i,item in enumerate(headerlist):
    datadict[item]={}
    for key_ in linedict:
        datadict[item][key_]=linedict[key_][i]

# Add a pseudocount of 1 to every count
for keymain in datadict:
    for keysub in datadict[keymain]:
        datadict[keymain][keysub]+=1.0

# Normalise each count by its row total plus the 4 pseudocounts.
# The outer loop variable must be keymain; the posted code reused keysub here,
# so only one base was normalised.
datadict1=datadict.copy()
for keymain in datadict:
    for keysub in datadict[keymain]:
        datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)


def readfasta():
    file1= open("chr011.py",'r')
    file_content=file1.readlines()
    first=1
    list1=""
    for line in file_content:
        if line[0]==">":
            if first==0:
                print "***********"
                list1+=sequence
                print "***********"
            else:
                first=0
                sequence=""
                seq=""
                for i in range(0,len(line)-1):
                    seq+=line[i]
        else:
            for i in range(0,len(line)-1):
                sequence+=line[i]
    list1+=sequence
    return list1


p=readfasta()

res=1
part=""
q=len(p)
seqq=""

# Background base frequencies
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
for i in range(q-16):
    part=p[i:i+16]
    seqq=part
    res=1
    score=1
    for j in range(16):
        key=seqq[j]
        res=res*datadict1[key]["%02d"%(j+1)]
        #print res
    for key in seqq:
        score=score * value[key]
    #print score,"*******************",res
    log_ratio=log10(res/score)
    print i,log_ratio

What changes should I make, and how?
Waiting for your reply,
Cheers!
Jul 13 '07
aboxylica
111 New Member
The partial sequence for which you are calculating the log has the length of the matrix, doesn't it? So you need to combine the code from this thread and the matrix thread, so you have both the sequences and matrices, and the scoring, and then run it over all matrices and over all sequences.
But my matrix file looks like this. I am not going to be specific about the datasets; it is a huge file. This is a part of it:
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
NA BEAF-32
PO A C G T
01 16.78 0.91 0.00 3.45
02 0.62 0.92 11.18 8.41
03 0.07 20.94 0.00 0.14
04 0.45 0.47 19.97 0.25
05 11.06 2.12 4.95 3.01
06 0.90 0.00 9.47 10.77
07 12.46 3.27 0.00 5.41
08 0.45 6.88 13.48 0.33
09 0.10 1.02 0.00 20.03
10 9.15 1.11 5.14 5.75
11 2.37 0.29 0.00 18.48
12 0.00 8.76 8.01 4.37
13 0.42 8.63 11.09 1.00
14 7.27 1.53 12.08 0.26
15 1.82 0.05 3.23 16.04
//
//
NA BEAF-32A
PO A C G T
01 1.00 0.00 0.24 1.30
02 0.93 0.00 1.53 0.08
03 1.53 0.00 1.00 0.00
04 1.53 1.00 0.00 0.00
05 0.00 0.00 2.54 0.00
06 0.00 1.69 0.77 0.08
07 0.00 0.00 2.46 0.08
08 0.00 0.64 1.30 0.60
09 0.00 0.08 2.46 0.00
10 0.00 1.05 0.00 1.49
11 0.24 0.00 2.30 0.00
12 0.08 0.11 0.00 2.35
13 0.24 0.00 2.30 0.00
14 0.00 0.93 0.00 1.61
15 0.00 0.00 2.54 0.00
16 0.08 1.53 0.00 0.93
//
//
NA BEAF-32B
PO A C G T
01 0.00 7.91 0.00 0.00
02 0.00 0.00 7.91 0.00
03 7.91 0.00 0.00 0.00
04 0.00 0.00 0.00 7.91
05 7.91 0.00 0.00 0.00
06 0.00 1.67 3.51 2.73
07 0.00 0.00 0.00 7.91
08 3.49 0.16 0.00 4.27
09 0.00 0.00 0.00 7.91
10 0.00 5.11 0.91 1.89
11 0.00 4.31 3.60 0.00
12 0.16 7.64 0.00 0.11
13 7.00 0.00 0.91 0.00
14 0.00 6.18 0.00 1.73
15 4.27 2.80 0.00 0.84
16 1.84 5.11 0.84 0.11
//
//
NA Cf2-II
PO A C G T
01 0.00 0.00 0.43 12.03
02 0.00 10.74 0.00 1.72
03 6.27 0.00 6.19 0.00
04 0.00 11.76 0.00 0.70
05 0.78 0.00 11.25 0.43
06 0.00 0.00 0.00 12.46
07 11.91 0.00 0.12 0.43
08 6.27 0.00 0.00 6.19
09 11.56 0.12 0.78 0.00
10 5.88 0.00 0.00 6.58
11 8.86 0.00 3.60 0.00
12 5.77 0.12 0.00 6.58
13 0.00 6.27 6.19 0.00
14 0.00 12.46 0.00 0.00
15 6.69 0.00 5.77 0.00
16 3.52 0.00 8.94 0.00
//
//
NA Deaf1
PO A C G T
01 5.42 5.98 1.71 0.42
02 7.31 4.33 0.25 1.64
03 12.16 1.24 0.13 0.00
04 13.04 0.13 0.00 0.36
05 7.25 1.66 4.62 0.00
06 0.37 1.29 11.76 0.11
07 0.00 13.47 0.00 0.05
08 0.75 1.71 11.07 0.00
09 11.53 0.13 0.05 1.81
10 0.37 0.00 0.00 13.16
11 0.00 12.82 0.00 0.71
12 0.00 0.00 12.84 0.68
13 8.00 0.25 4.24 1.04
14 0.00 6.03 0.00 7.50
15 0.42 0.13 4.38 8.60
16 0.05 0.98 7.93 4.57
//
//
NA Dfd
PO A C G T
01 0.50 1.66 0.07 68.40
02 52.59 9.34 8.31 0.39
03 69.57 0.66 0.00 0.40
04 2.22 0.14 0.41 67.86
05 0.44 0.18 23.53 46.49
06 36.75 5.44 26.74 1.70
07 16.27 4.86 18.49 31.01
08 8.79 3.43 17.07 41.35
09 1.40 3.62 29.62 36.00
10 1.89 20.88 10.86 37.00
11 30.75 25.66 13.32 0.91
//
//
NA Dref
PO A C G T
01 1.28 2.13 14.78 18.90
02 5.33 12.15 12.68 6.92
03 4.99 8.72 21.15 2.22
04 10.42 6.71 18.00 1.95
05 22.25 0.51 10.62 3.70
06 15.72 3.00 0.00 18.36
07 26.44 3.26 4.01 3.38
08 10.33 5.61 9.50 11.64
09 8.67 18.41 0.18 9.83
10 2.83 0.84 0.24 33.17
11 35.50 0.91 0.60 0.08
12 0.35 0.08 1.05 35.60
13 0.22 34.76 0.79 1.31
14 4.00 0.88 31.28 0.93
15 23.50 6.09 0.33 7.16
16 4.83 1.79 1.77 28.69
//
//
NA E-spl-
PO A C G T
01 0.26 16.93 0.00 0.00
02 16.31 0.88 0.00 0.00
03 0.00 11.13 0.00 6.05
04 0.00 0.00 10.52 6.67
05 8.95 2.38 0.00 5.86
06 0.21 0.07 16.91 0.00
07 8.38 8.81 0.00 0.00
08 0.00 17.07 0.12 0.00
09 17.13 0.05 0.00 0.00
10 0.81 13.88 2.38 0.12
11 8.89 0.00 8.22 0.07
12 8.08 0.07 2.31 6.72
13 0.21 2.38 14.60 0.00
14 0.00 8.45 8.74 0.00
15 11.34 0.00 0.00 5.85
16 0.00 0.00 2.58 14.60
//
//
NA Eip74EF
PO A C G T
01 26.64 2.84 3.15 0.66
02 28.55 3.74 0.18 0.82
03 13.77 1.02 15.46 3.05
04 5.12 14.05 4.35 9.78
05 14.07 17.06 1.63 0.52
06 31.86 0.47 0.07 0.89
07 15.27 1.33 14.82 1.88
08 9.00 10.66 5.19 8.44
09 16.44 0.08 3.58 13.18
10 8.17 0.00 14.74 10.38
11 7.69 7.01 16.85 1.75
12 13.60 6.89 2.36 10.44
13 2.20 0.34 26.98 3.77
14 4.30 0.43 2.94 25.63
15 4.05 2.54 3.78 22.93
//
//
NA HLHm5
PO A C G T
01 6.96 4.69 0.00 0.00
02 0.00 3.00 0.00 8.65
03 0.00 6.96 4.69 0.00
04 0.00 11.65 0.00 0.00
05 4.69 0.00 0.00 6.96
06 0.00 0.00 0.00 11.65
07 0.00 0.00 6.96 4.69
08 0.00 0.00 0.00 11.65
09 0.00 0.00 11.65 0.00
10 4.69 0.00 6.96 0.00
11 0.00 11.65 0.00 0.00
12 4.69 0.00 0.00 6.96
13 0.00 11.65 0.00 0.00
14 0.00 0.00 11.65 0.00
15 0.00 0.00 0.00 11.65
16 0.00 0.00 11.65 0.00
//
//
NA His2B
PO A C G T
01 0.41 0.61 0.53 21.43
02 0.00 0.00 0.00 22.97
03 22.97 0.00 0.00 0.00
04 0.00 22.97 0.00 0.00
05 0.00 22.97 0.00 0.00
06 0.00 0.00 0.00 22.97
07 22.97 0.00 0.00 0.00
08 22.97 0.00 0.00 0.00
//
//
Jul 15 '07 #21
aboxylica
111 New Member
I don't know how to incorporate the length of the sequence each time; it is going to change. This is a part of my sequence file:
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGGTTTCTAAATT GGATTATTTCTACCT GAC
CCTGGAGCCATCGTC CTCGTCCTCCGTCCC TTAGCGCCTCCTGCA TGGATGTCGTTTTTG GGTTTCATACCTTTT CAC
ACTGGAAAAATACGG AATTTGTTGTAAGCC CTTTCAAGACGAATG GGATTTAGCTTCGGA TGTCAACGTCACCAT AAT
CATATTAGGAATATT TCTACTCAATTGCAA TATTGGTACTTTTCT GACTGTAAACGCGAT GATAATTACAAATAT GCC
TAATTTGCTGTCTTT ATAATCAAATGGAGT TCTTTATATTTCCAA AATATTGAAATTCCG ATTCCCTAGAAAATA ATA
CGTTTTTCTGTTATT AATAAAAAACCAATA GGAAAGTTCTCAAAA ATTACTCTGTTGTAT TTGATCATTTCTTTT CCG
GTATAATCTTTTATT TTAAGCATTCCCATG TGAATAAATTTCAGA CTAATGTATTAATAA GATGTCGTGTTTTTC CAC
TTACAAATTTCTCAT ACAGCTGGATATATA CTACGAGTACTATAC ACATGCTCTGGG
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTAGAAGCTAAAT CCATGTCCACGTCAA ACC
AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCACACCATACAC CACACACCACACATA CAC
ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAAGGAAATTATT AAAAAAAATAATAAA ATT
AATACAAAAAAAATA TATATATATACAAAA ATTTGTTGTGTTTGA ATTGAATTAAGAGCT TATCAAGAAAAAAAT TTC
AGTGACTCATAATAC ACTACTCTACAAGTT TAAATTGAATCAACA ATTTAACTTTCATTG CTCAGGTTTTTAGTA ACA
ATGTTTATATAAGTT TAGGTATAACAAATG ATTTAAATATAAGAT ACTGTATTTCACATT GAGACGAAACAATCC ACC
GAAAATCATAAAATA TAAGAATGTTGCATT TTATTTTTAAAAATA AAGATGCCTTTTAAG AGGAATAACTTAAAT GTC
TTTAATACCTTTGAA TTTAATTATATGGCT AATAAACACAAACTT AAAGCTTAAAACTGC ATCGAATTGAATGCG GTT
ATAAATGTACTTATA TATCTAATATAATCT GCTAATATGGTTTAC ATGGTATATCTTTCT CGGAAATTTTTACAA AAA
TTATCTATTCATATA TCTCGAGCGTAAGAT ATTTATCAGTTTATA GATAACATCTTTAAA TTTGGGTGATTAAAA AAA
AACATTG
>Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
TCTAGAGATCTGGGC ACGATGGCGAGACAA AGATGCGGCGCAAAA TCGGAAATGGAGATG GATCACGTAGCCGGC CAT
GGCGG
>Him_distal|Drosophila melanogaster|Him|FBgn0030900|X:18039896..18043470
GGTTTTCTGCGATGG CTTCCGCGCCAGCTG AAGTATCTGATTTGC TGCCTTGTTTTTGTT GATATTTCTGCGAAG GGA
CTTGTGCTTTTCAAA TGGCCTTTTTTTGGG ATTACGGCAAGGGCG CGTTTCCCACGCTCG ATCCCCACTTACCAT TGG
TGCACGCGATTGCGG CAAGCTGCTGAGGCA AGCTATTAAACGCCA CACTGGGCCGGGGGG CGGTACCGGTGGGCG TGG
CAGGGGAGTCGACAC ATGTTGTGTGCCAGA GAACTTTGCTCCGAT CCCCAGATCATCAAA TAGTTGTCGCTGTCT GCT
CGTGCGCAAATTGCA ATACTTTGCATACCC TTACTGCAGGGTATC TGAGCTTGGACTTTA AATAAGGGGGTATAA CAT
AGCTTATACTCTCTA TCTCTGTTATAAAGT CAATTTTCCTTAGAT CTTTAGTACAGTGGG TAGTTAAGGAGACAT AAC
TTCCAAAAAAAAAAA CTATAAAATTGCAAT AATTTATGCAAAATA TGTATTTTATTGAAT GGGATGAATAATTTA CCT
TATACGACTGTAAAA CATTTCTAACGATTA AATGCACTTCTAAAA GTTTTCCCACAAGTA GGTGAGCTATTATGC TAA
GCGTTCCATGACTTG GAATCTAAGATCTTG TTTTGATCTTCGCTG ATCTTTGAGAACTCG GGGATTACTTACACA TTT
CTGGGCAGGCACAAG TGGGCCGAGGCAGTG TAGATTCATCACGTT TTCACTCAACACACG CAGCTCATTAACAGC CCC
GCTGACAACTTGTCA GGACTTCCCCCTCGT GAATCCCCCTGCTAC GCAACCCCCATTCCC CGCCCATTCCAACAC TTC
CCGCCGGGAGCGTGG GAAATTATGCGTGTT GGTGGGACGTCGGGC GGTGAAAATTGGCGC GCTCTTCGGGGGGCC ACA
CCGCGTGGCATTGAC AACTCTTCCACATTT CGCGCCCAACGATGC GTTGGCATCAGTGGG TCACAGGGATTACGG CTG
GCTGGGATTCCAGAG CCAGATCTTTTTCAG CCAAAACTTTCAGCT TTCGAAGACCTCAAG CGATAGGAGAGTGTC GGA
AGTCCAGAAATAGAC GCGTAGCACATAAAT TATGGATCGTATCGA GTATCGATTAGCCCG GGACAAGCGAAGCGA TAG
GGAGACATATTTTTA TTACCCTCTCGGGGA CCTGCACTTGTTGGC TTCGCTTCTATGAAA GATCCCTCTACCATA TCA
CGTATGTGGGCTCCC CCAATCGAACCGAGT TGTGGGAAATGTTTT CCCAGGCCAACAGCT AATTGTCACTCCAAG GGT
TGTCCCCGCAGCCCA GACGACAGATAAGCG GGCAAGTGAAGCCCA GCGATCTGAGTCAAG TGAAGGGCTTCAATT TCT
TTCCCGAGTGGAACT GGGATATCGAAATTA CATTTGTAACAGACG TTTTAGTCCGCAATC CTCAGCTAATGGGAC TTA
CGAACATATATTCAT CTGAAATTCAAGAAC ATGCGCACTTAAAGA GCAGGGAAGTCGCAC ACGCGCAAGTCAGGC GCT
CAAAAAGGGATCTTC GGAGGTACAGTGGGC AAAAGACTGTAAATA AATAATATAAATAAA ATAATATTTAGCTCT ATG
TGTTTATATAATCTA CAAAGTAGTTAACAA AAAATATAAAATGGA TATAAAAATACATCT TATATATCCCTATAA TAA
GAAATAAATAATAAT TTTAGTAAATTAATT TTGTTACACAAAGTA CCTGTATTATTACCT CTTTTTTGTTGGTTG GTT
CTTTTTTGATGTGGC CCCACTGTGCTCTCT TATCAGTGCGACAAT CAGGCATTGCCTTTC CCCATCGGGGGATTC TAA
TTCCGTGGACGATGG GCCGAAACGCCTATA AAGTCGCTCATTAAA AATGTTTAATTATGG CCCATCTTGCATCTT GCA
CCGATGTGGATGGGG TTTGTCGGCAATGAT TTACATTATAAAAAT GCCCGTTATCTGAGC ATTTTGTACGCTCCA CTC
CCTCTTCCCCCCTCC AAAAAAAAAAAAAAC AGATATGTATATTCC CCGAGATATTCCCAA GCGGCCAAAAATAGA CGC
AAATTGTAACGCACT TGAAGTGCACTCTGA AACATCTTGAAGTCC AAATAAAATAGCAGA GAGACCCACAATAAT ATA
CGTTGATATACACAT GTATATATGTATGTA TGTACATAAAGGGCC AGGAGCAGGAACGTT AGGCATGCGGTGGTA CGA
GCACCGTGGTGCGAG CGAGAGCGCTGTGCT GCCTGAGGGAGAGGT AGCGAGTGGGTTGCA TTGCGCACACAGAAC ATG
TGAATGCAGAGTTCA AGTGCATGCCGTGAC ACAGACACGCACACA CACACACGCACACAC AGATGAGTAGCCGCT GCA
AAGTGTTTTTTCCCA GGCGCTATTTATAAT ATGCATCCCGTCGCC GATCCGATCCGATCC AATCCAATCCGATTG GAT
CCCATCTTGCGGCAC TACGATTATGACGCT CGACACGATGATGCA TTCGCAGAGTTTCCC GATCGCAGAGTACCC TGT
ACTCGAGTAGTTTTT AGATGCAGTATTATT AAGTAGAAAATTGTA ACCGTATAATATTCC ATTATATTAAATATT TTT
ATAGCACTAAAGAAA TAAAAGCCCATTTTA TAATTTATATTACAA AAATACTTAACCATA GAAACTTATGATATG ATA
CCAATATTTAAGTTC CAAAAAATGTAGAAC ATTTTTAAGTATATA CTCGAAAATATTAAT TTTCAAAATTGATAT TCA
AGAGATATTATAAAA AGATCCCCATTCTAA ATATCTAACATCATG CCATGCTTTCTAATG AGTATAGTATACCCC TGC
TACCCTGTCAATCCG CAAAACAGGCGCCGA AACATGCGGTTTCTC GCAGCAGACTGCCAC GGGAAAAATTCGGTT CGA
GATTTGGGAATGGAT GTATGACGGAGCAGA AGGAGCAGGACCCGG ATTTCGGATTTCGGA ATGGATATGGAAATG AAG
ATGGAAATGGGACTT TGACTGCGCGACGGC CACATGCGCCGCTGG CGATGCCGCTGGATG TTGCATGTGGCAGCG GTC
GGTGCAGCAGCGAAA GTGTTGCAGCTGTAT GAGAGGGTCTATTTT TGGGGCGATTGTGCG GCGCTGGTGCTGCCA CAT
GTGTTCTGTGTTGGG CTGCTAAAAGGCATT GTAATGAGAGCAGAA AATAGAATTGACTCC ACTTGAGCAATGTCC CAT
AAAGCGGGAGTTTCG AGTTTGGCGCGCAAT GTGCCGCACCAGCAA ACGAACAAAAGAAAA AAAAAAAAAAAAAAC ACA
GCCAGTAACACATGG GCCCACGAGTTATGT TTTATTTTTAATCCC ACAAAGAGTCGATCT CCAAAACAAACCCGC AGA
GAGCACATATAAAGA GACTCGGTGGACGAG TGGTTCGAAACAGTC TTCCGCCGCAGCTCG ACGCGCTCGCATATC GGG
AATATATAGATCGGA GATATCGCAGGACCC ACAGCAGAGCAGAGC CGCAGAGCCACCAAC CTCG
>Him_proximal|Drosophila melanogaster|Him|FBgn0030900|X:18041232..18043470
GCCCAGACGACAGAT AAGCGGGCAAGTGAA GCCCAGCGATCTGAG TCAAGTGAAGGGCTT CAATTTCTTTCCCGA GTG
GAACTGGGATATCGA AATTACATTTGTAAC AGACGTTTTAGTCCG CAATCCTCAGCTAAT GGGACTTACGAACAT ATA
TTCATCTGAAATTCA AGAACATGCGCACTT AAAGAGCAGGGAAGT CGCACACGCGCAAGT CAGGCGCTCAAAAAG GGA
TCTTCGGAGGTACAG TGGGCAAAAGACTGT AAATAAATAATATAA ATAAAATAATATTTA GCTCTATGTGTTTAT ATA
ATCTACAAAGTAGTT AACAAAAAATATAAA ATGGATATAAAAATA CATCTTATATATCCC TATAATAAGAAATAA ATA
ATAATTTTAGTAAAT TAATTTTGTTACACA AAGTACCTGTATTAT TACCTCTTTTTTGTT GGTTGGTTCTTTTTT GAT
GTGGCCCCACTGTGC TCTCTTATCAGTGCG ACAATCAGGCATTGC CTTTCCCCATCGGGG GATTCTAATTCCGTG GAC
GATGGGCCGAAACGC CTATAAAGTCGCTCA TTAAAAATGTTTAAT TATGGCCCATCTTGC ATCTTGCACCGATGT GGA
TGGGGTTTGTCGGCA ATGATTTACATTATA AAAATGCCCGTTATC TGAGCATTTTGTACG CTCCACTCCCTCTTC CCC
CCTCCAAAAAAAAAA AAAACAGATATGTAT ATTCCCCGAGATATT CCCAAGCGGCCAAAA ATAGACGCAAATTGT AAC
GCACTTGAAGTGCAC TCTGAAACATCTTGA AGTCCAAATAAAATA GCAGAGAGACCCACA ATAATATACGTTGAT ATA
CACATGTATATATGT ATGTATGTACATAAA GGGCCAGGAGCAGGA ACGTTAGGCATGCGG TGGTACGAGCACCGT GGT
GCGAGCGAGAGCGCT GTGCTGCCTGAGGGA GAGGTAGCGAGTGGG TTGCATTGCGCACAC AGAACATGTGAATGC AGA
GTTCAAGTGCATGCC GTGACACAGACACGC ACACACACACACGCA CACACAGATGAGTAG CCGCTGCAAAGTGTT TTT
TCCCAGGCGCTATTT ATAATATGCATCCCG TCGCCGATCCGATCC GATCCAATCCAATCC GATTGGATCCCATCT TGC
GGCACTACGATTATG ACGCTCGACACGATG ATGCATTCGCAGAGT TTCCCGATCGCAGAG TACCCTGTACTCGAG TAG
TTTTTAGATGCAGTA TTATTAAGTAGAAAA TTGTAACCGTATAAT ATTCCATTATATTAA ATATTTTTATAGCAC TAA
AGAAATAAAAGCCCA TTTTATAATTTATAT TACAAAAATACTTAA CCATAGAAACTTATG ATATGATACCAATAT TTA
AGTTCCAAAAAATGT AGAACATTTTTAAGT ATATACTCGAAAATA TTAATTTTCAAAATT GATATTCAAGAGATA TTA
TAAAAAGATCCCCAT TCTAAATATCTAACA TCATGCCATGCTTTC TAATGAGTATAGTAT ACCCCTGCTACCCTG TCA
ATCCGCAAAACAGGC GCCGAAACATGCGGT TTCTCGCAGCAGACT GCCACGGGAAAAATT CGGTTCGAGATTTGG GAA
TGGATGTATGACGGA GCAGAAGGAGCAGGA CCCGGATTTCGGATT TCGGAATGGATATGG AAATGAAGATGGAAA TGG
GACTTTGACTGCGCG ACGGCCACATGCGCC GCTGGCGATGCCGCT GGATGTTGCATGTGG CAGCGGTCGGTGCAG CAG
CGAAAGTGTTGCAGC TGTATGAGAGGGTCT ATTTTTGGGGCGATT GTGCGGCGCTGGTGC TGCCACATGTGTTCT GTG
TTGGGCTGCTAAAAG GCATTGTAATGAGAG CAGAAAATAGAATTG ACTCCACTTGAGCAA TGTCCCATAAAGCGG GAG
TTTCGAGTTTGGCGC GCAATGTGCCGCACC AGCAAACGAACAAAA GAAAAAAAAAAAAAA AAAACACAGCCAGTA ACA
CATGGGCCCACGAGT TATGTTTTATTTTTA ATCCCACAAAGAGTC GATCTCCAAAACAAA CCCGCAGAGAGCACA TAT
AAAGAGACTCGGTGG ACGAGTGGTTCGAAA CAGTCTTCCGCCGCA GCTCGACGCGCTCGC ATATCGGGAATATAT AGA
TCGGAGATATCGCAG GACCCACAGCAGAGC AGAGCCGCAGAGCCA CCAACCTCG
>Obp18a_prom|Drosophila melanogaster|Obp18a|FBgn0030985|X:18969778..18972746
ATGGCGAAAATCTGT TTCCCAACTAACAAT GAGCGCATCATCACA GCTCTATATATATAA CCCATCGATTTGCTA ATT
CAGCTCAAAAGTAGA CAGGAGATTTTAATT AAATAATTGGATGCT ACTTTACATTCGCCA CACACCAACAAATAA AGT
CTATAATTGAAATTT TAAGCGCAGTTCCCG ATTATGAGCTACACG TATGTCGTATGCGCA ATATCTGCATTACAA TTG
CCAATAGTAAATTAC CAACTTGGTTTTCTT CATATTTATTAAGAT AGAAAACATACAATT TTTGGCTTTTACACT CCA
AGCATCTCTGAAGTT TAAACAAAAAACATA TGTGTAGCCTATCTA CTGTATTGGACTTTA TTCGTATATTTTATA TGG
TTCATTAATATAGGT ATAAATACAAATTAT ATTCACGCTTTGCGA TTTGCAGCGAATATC ACATCTTATACACGA TGT
AAAAAAAAAAAAAAT ATTTCGTCATGTTTT TAGGTTGGCCGCAGG CAGTGCTCACTGTAC CGCCACAATGTTTAT CGT
TTTGCATTTTTTTTT TCTTTGTTTTCTTGC GGTTTCCCCTAATTA TCTTTAGTATAAACT TAGTCTACTGTCTTT TTT
GGTAAGTATTTTCGT GATGGGCTCGTCTAT GCGAATTCCCATTTC CAATGAATAAATAAA GTAATTAGAACATTA AAA
TTAGCAATAAAACAC GTACATTTAAAGCTG ACAACAAAAAAAAAA AGTATTCTTATGTTA AACTGTAGTATGTGC CTA
TGCAATATTAAGAAC AATTAAATAAAATAG CATATTAACTTATGG CAGCACTTTGTTGCT ATGTTTATGTTTATG TTT
ATGCACGCAGTTAGG CCAGGGCGGATGTAA CATGATCACCCACTC GAAGGCAAAAAGTAT AAGTGCATGGTCAGC ATT
CACACGCCGACCAAA TACATATTACATACG TACATACATATCTCG CTCTCCCGATAAGCC TAGATATATAAGATA TAC
ATAAGAACGCCGCTC CGCTGCTGGCGTACC CGGCAGCGCAGCTAC GCGGATTAGCCTAAG TCCAAATATATTAAA AAC
TGTAAAATCAGAGAG ACTCTGTAGACGTTG AGCTGACAGAACCAT TTCTGCCTACTCTAA AATCAAAAGAAGAAA TTG
AATAAATATATGTCA GCCCGACGGCTGCCT TCAACTTAAAACGGA CTTGTGTTCTGAATT GGAGTTCATCATTAC ATG
GCGACCGTGACAGTC GTCCAACGCTGGACG AATTGACCAAAGCTG GTGAAAACAAAGGAA CAAAGGAACACTGGA CTG
GAAGAAGACTGGACT AATTAAATGGAACTG CAAAAACCAAGGAAA AATCTGAGTGAGTAG AGTTCTATTGAGTAT GGG
CAAACACCGTGGCGG TTTGAAAACTAAGCT GAATAAACGTATAGC CCACGTAAGGTGGCT AATATACGGTCAGCA AAC
GCCACCGGTTTGGTC GAAAGCTCTAAAGCT ACATGCAGAGCTAGA CCACTTGTTGCAATA TCAGCAAGAATTAAA GAC
CCATAAGCTCGAGAA AACTCACTCAGATAA TATTAAAAATATACC CACAATTAATGAAGT TCCAAAATACCAGGC ATG
TCCAGCACCAGCACC AGCATTAACAAAACC AAAGAAGTCCTGCCC CCCTGGCTGCGAAGG AATCTGGAGTCCCCA CTG
CCTGGGGACTTGTGA GCGACCATCGACGTC TTCAGCGGCGAAGAA ATAGACAGCAGCGAG GGAGTGTCAGCGTGC CAC
CCCCGGCGACGCCCA GCTGACACCTGATGA GCATCATCAACAGCA GAATATAATAATAAA TATATATAAATATAA AGT
AAATATAAAATATAT ATAGATAAGAAAAAT TGTAAGAAATATTGT AAAACGGAGCATATA CTATTATGCCCTGTT AAC
CCAATATGGCCCGTG AAGCCATAGCTAGAA TCAGGCAGGCAACAA TGTAAAATACAATTT TTTTTTACTCTTGCG AAC
ATTGAAAGATTTTAT AAATAGATAATTCCA AACATAAATGTCTAT AGAGACAAATGAAAT AAGTAAAACTGAAAA TAA
AAGTATATACAAAGG AAATTTTCTATTCTA TTCTCCAAAATATAA AATTAGTATACCCAA AATGGGTCTAATAGA CAC
TAAAACTGTGGACTC TACAGCCAATGTAAT AAATAAAGTAGAAGT CCAAAATGCAGACTT GTTCTGGATAACCAT AAT
ACTAATTGTAATTGC ATTAATTATGGTATC CAATGCATTAATAAA AATATACAAACTGCA TAACAAGTGTCTTAA GAA
ACGATACCGTAGCAC TGCTAACGGTATAGA TAATATTTAAGGAAG ATCTTTAATAAAGTC AATTATGAATGAAAA TAT
GAGAAAAATTATATG AAAAAAAAAAAATAA TAAATAAAAAAAAAA ATATAAAACGTAATA TTGAATTTATCTACG TTA
AAAAAAAAAATATAT ACAAATGAATAAATT TGAAGTTATGAGTAT ACCACAGCATGGACT GGGAAAAGCTTGTTG ATC
AGATAAAAGATCAAA ATGAAAATTTCAGAA AATCCTATAAGTGCT TAACGCAAAACAGAT CAACACAAGCTGTAA CAA
TCAATAGGAATGCCC AAGTCTTGGTAAATA GTTATAATGAAATCA GAGAGTTGATCCAAC AAAATAGAAAGAATT TGG
AACGCAAACAGTGTG CTAAGGCTTTGAACC TACTGGTGACATTAA GAGAAAAATTAATAT TTATAAAAAATAAAT TCA
GTCTCCAGATAGAAA TTCCAACCATAGTAA ACACCCCACTAAGAA TAAATTTGAATGAAG ACAGCACTAACTCTG ACG
AGGAAGATAGGACTA TAGTCAAGGAAGACA TTAAAGAGGAAGATC TTCACGATCTAACTA TACCAGCAAAATTAA TGC
TGAA
Jul 15 '07 #22
bvdet
2,851 Recognized Expert Moderator Specialist
What I should do is this: for every sequence (from the sequence file) and every matrix (from the matrix file), I should calculate the scores.
The output should be in the form: sequence header (the heading that follows the > symbol in the file), position of the sequence read, log value.
For example:
matrix1:
>ref1
01,2.0012
02,3.0047
.
.
.
>ref2
01,3.0047
02,8.0067
matrix 2
>ref1
01,2.0012
02,3.0047
.
.
.
>ref2
01,3.0047
02,8.0067

and so on.

The problem is that for every sequence in the sequence file, and for every position in it, I am going to calculate the log value. Is it clear now?
Not to me. I can't seem to follow the logic in your code:
def readfasta():
    file1= open("chr011.py",'r')
    file_content=file1.readlines()
    first=1
    list1=""
    for line in file_content:
        if line[0]==">":
            if first==0:
                print "***********"
                list1+=sequence
                print "***********"
            else:
                first=0
                sequence=""
                seq=""
                for i in range(0,len(line)-1):
                    seq+=line[i]
        else:
            for i in range(0,len(line)-1):
                sequence+=line[i]
    list1+=sequence
    return list1

p=readfasta()

res=1
part=""
q=len(p)
seqq=""

value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
for i in range(q-16):
    part=p[i:i+16]
    seqq=part
    res=1
    score=1
    for j in range(16):
        key=seqq[j]
        res=res*datadict1[key]["%02d"%(j+1)]
        #print res
    for key in seqq:
        score=score * value[key]
    #print score,"*******************",res
    log_ratio=log10(res/score)
    print i,log_ratio
I have modified function parseData() to include the header:
def parseData(fn, dataset=1, key='>'):
    '''
    Read a formatted data file of alpha sequences
    Return a list of sequences
    The first element in the list is the header
    '''
    # initialize output list
    dataList = []

    # open file for reading
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            s = f.next()
            while not s.startswith(key):
                s = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    # initialize output list
    dataList = [s,]

    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break

    f.close()
    return dataList

dataSeq = parseData(fnSeq, dataset)
print dataSeq[0]
Output:
>>> >Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513

I have given you the code to parse the matrix data and sequence data so it can easily be manipulated. I don't understand the log calculation you want. Maybe someone smarter than me can figure it out.
Jul 15 '07 #23
elbin
27 New Member
for i in range(q-16):
    part=d[i:i+16]
    seqq=part
    res=1
    score=1
    for j in range(16):
        key=seqq[j]
        res=res*datadict1[key]["%02d"%(j+1)]
This means that you take for granted that the subsequence you are examining is 16 characters long, and the matrix you are using is 16 lines too. But they are not all 16 lines. So you need to change this part to
len(datadict['A'])
for example.
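As a rough illustration of that suggestion, here is a sketch in which the window width comes from the matrix instead of the hard-coded 16. The `windows` helper and the toy `datadict1` values are made up for the example (Python 3); only the `datadict1[base][position]` layout is taken from the code posted in this thread.

```python
def windows(seq, width):
    """Yield (start, window) for every full-width window of seq."""
    # the + 1 includes the final window; range(q-16) in the posted loop
    # stops one window short of the end
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]


# Toy stand-in for datadict1[base][position]; the numbers are invented.
datadict1 = {
    'A': {'01': 0.7, '02': 0.1, '03': 0.1},
    'C': {'01': 0.1, '02': 0.1, '03': 0.1},
    'G': {'01': 0.1, '02': 0.1, '03': 0.7},
    'T': {'01': 0.1, '02': 0.7, '03': 0.1},
}

width = len(datadict1['A'])      # 3 here; 8, 11, 15 or 16 for the real matrices
for start, part in windows('ATGCA', width):
    probs = [datadict1[base]['%02d' % (j + 1)] for j, base in enumerate(part)]
    print(start, part, probs)
```

A sequence shorter than the matrix simply yields no windows, so no extra length check is needed.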
And what do you mean by "integrate the length of the sequence"?

To bvdet: For the log see http://www.thescripts.com/forum/thread672978.html
Jul 15 '07 #24
aboxylica
111 New Member
Yes, sixteen is not fixed; it is going to vary throughout. These aspects confuse me. :(
As to how my code should work, what I should do is this. Given sequences like
>seq1
atattatatat
>seq2
atatattatatata
>seq3
attattatatatatat
... and so on,
and weight matrices like
weightmat1
po
values
weightmat2
values
weightmat3
values
I am calculating the log-odds ratio (you must know it, since you are into this). I calculate the log value for each position.
Am I clear now?
You told me previously how to do it for one sequence and one weight matrix. Now there are multiple matrices and multiple sequences, and I have to calculate the log-odds ratio for each position.
Waiting for your reply,
Cheers
Jul 15 '07 #25
elbin
27 New Member
I think you already have all the needed code for this task, please make an effort and combine it, and you will get the result.
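One possible way of combining the pieces, as a sketch only: it assumes the matrices are already available as (name, rows) pairs with one dict of counts per position, and the sequences as (header, sequence) pairs. The names `to_probabilities`, `scan` and `scan_all` are hypothetical, the pseudocount of 1 and the background values follow the code posted earlier in the thread, and the grouping (matrix first, then sequence) follows the output format requested above. It is written for Python 3.

```python
from math import log10

BACKGROUND = {'A': 0.3, 'T': 0.3, 'C': 0.2, 'G': 0.2}   # values used in the thread


def to_probabilities(rows, pseudocount=1.0):
    """Turn per-position counts into probabilities, adding a pseudocount."""
    probs = []
    for row in rows:
        total = sum(row.values()) + pseudocount * len(row)
        probs.append({b: (c + pseudocount) / total for b, c in row.items()})
    return probs


def scan(seq, probs):
    """Yield (1-based position, log-odds) for each window as wide as the matrix."""
    width = len(probs)
    for start in range(len(seq) - width + 1):
        score = 0.0
        for offset, base in enumerate(seq[start:start + width]):
            # summing logs is the same as taking the log of the product ratio
            score += log10(probs[offset][base] / BACKGROUND[base])
        yield start + 1, score


def scan_all(sequences, matrices):
    """Score every sequence against every matrix; return the output lines."""
    out = []
    for name, rows in matrices:
        probs = to_probabilities(rows)
        out.append(name)
        for header, seq in sequences:
            out.append('>' + header)
            for pos, score in scan(seq, probs):
                out.append('%02d,%.4f' % (pos, score))
    return out


# Toy data, invented for the example
matrices = [('toy', [{'A': 8.0, 'C': 0.0, 'G': 0.0, 'T': 0.0},
                     {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 8.0}])]
sequences = [('ref1', 'ATA')]
lines = scan_all(sequences, matrices)
print('\n'.join(lines))
```

Any character outside A, C, G and T (an N, for instance) would raise a KeyError in this sketch, so real data may need a rule for such positions.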
Jul 15 '07 #26
aboxylica
111 New Member
Okay, I'll try that. I'm not confident, though, but I'll try.
Thanks!
Jul 15 '07 #27
bvdet
2,851 Recognized Expert Moderator Specialist
See if this is what you need:
if __name__ == '__main__':

    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    fnArray = 'arraydata.txt'
    fnSeq = 'seqdata.txt'
    dataset = 3
    dataArray = parseArray(fnArray, dataset)
    dataSeq = parseData(fnSeq, dataset)

    seq = ''.join(dataSeq[1:])
    subKeys = dataArray['A'].keys()
    subKeys.sort()

    i,j = divmod(len(seq), len(subKeys))
    keys = subKeys*i + subKeys[:j]

    print dataSeq[0],
    outList = ['%s[%s]*%s = %0.4f' % (s, keys[i], s, dataArray[s][keys[i]]*value[s]) for i, s in enumerate(seq)]
    print '\n'.join(outList)
    print sum([float(s.split('=')[1]) for s in outList])
Output:
>>> >Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
T[01]*T = 0.0131
C[02]*C = 0.0015
T[03]*T = 0.0019
A[04]*A = 0.0017
G[05]*G = 0.0014
A[06]*A = 0.2515
G[07]*G = 0.0969
A[01]*A = 0.0624
T[02]*T = 0.0014
C[03]*C = 0.0755
T[04]*T = 0.2952
G[05]*G = 0.0014
G[06]*G = 0.0022
G[07]*G = 0.0969
C[01]*C = 0.0093
A[02]*A = 0.0016
C[03]*C = 0.0755
G[04]*G = 0.0010
A[05]*A = 0.0014
T[06]*T = 0.0424
G[07]*G = 0.0969
G[01]*G = 0.1403
C[02]*C = 0.0015
G[03]*G = 0.0011
A[04]*A = 0.0017
G[05]*G = 0.0014
A[06]*A = 0.2515
C[07]*C = 0.0054
A[01]*A = 0.0624
A[02]*A = 0.0016
A[03]*A = 0.1832
G[04]*G = 0.0010
A[05]*A = 0.0014
T[06]*T = 0.0424
G[07]*G = 0.0969
C[01]*C = 0.0093
G[02]*G = 0.1965
G[03]*G = 0.0011
C[04]*C = 0.0011
G[05]*G = 0.0014
C[06]*C = 0.0019
A[07]*A = 0.1154
A[01]*A = 0.0624
A[02]*A = 0.0016
A[03]*A = 0.1832
T[04]*T = 0.2952
C[05]*C = 0.0128
G[06]*G = 0.0022
G[07]*G = 0.0969
A[01]*A = 0.0624
A[02]*A = 0.0016
A[03]*A = 0.1832
T[04]*T = 0.2952
G[05]*G = 0.0014
G[06]*G = 0.0022
A[07]*A = 0.1154
G[01]*G = 0.1403
A[02]*A = 0.0016
T[03]*T = 0.0019
G[04]*G = 0.0010
G[05]*G = 0.0014
A[06]*A = 0.2515
T[07]*T = 0.0310
C[01]*C = 0.0093
A[02]*A = 0.0016
C[03]*C = 0.0755
G[04]*G = 0.0010
T[05]*T = 0.2773
A[06]*A = 0.2515
G[07]*G = 0.0969
C[01]*C = 0.0093
C[02]*C = 0.0015
G[03]*G = 0.0011
G[04]*G = 0.0010
C[05]*C = 0.0128
C[06]*C = 0.0019
A[07]*A = 0.1154
T[01]*T = 0.0131
G[02]*G = 0.1965
G[03]*G = 0.0011
C[04]*C = 0.0011
G[05]*G = 0.0014
G[06]*G = 0.0022
5.0655
>>> seq
'TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCATGGCGG'
Jul 15 '07 #28
aboxylica
111 New Member
Okay. One thing I am unsure about: what does dataset refer to?
Also, in the calculation in the last code you sent me, does something like A[01]*A mean that you are multiplying the normalised value of A at position one and dividing it by the standard A value? Please tell me, and also which statement of the code does that.
What I should be doing is this. If I have a sequence like
>header
ATTTATTATATATATATTATTATAATTAAATAT
then, using the matrix, I calculate A[01]*T[02]*T[03]*... divided by the standard values, which are
A=0.3, T=0.3,
C=0.2, G=0.2
So for the sequence it should be done like (A[01]*T[02]*T[03]*...)/(0.3*0.3*0.3*...)
and then I take the log of this value. Then I move to the next window of the sequence,
TTTATTATATATATATTATTATAATTAAATAT (I am just leaving off the first A), and calculate the same way with T in the first position.
I have to do it this way for all the sequences.
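A minimal sketch of the window calculation described above, in case it helps pin down the arithmetic. It is written for Python 3, while the code in this thread is Python 2. The function name `window_log_scores`, the list-of-dicts layout for the normalised matrix, and the toy 3-position matrix are all assumptions made up for illustration. Only the background values A=0.3, T=0.3, C=0.2, G=0.2 come from the post.

```python
from math import log10

# Background ("standard") frequencies quoted in the post above.
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def window_log_scores(seq, probs, background=BACKGROUND):
    """Return one log10 odds score per window of len(probs) bases.

    probs is a hypothetical layout: a list with one dict per matrix
    position, mapping base -> normalised probability.
    """
    width = len(probs)
    scores = []
    # The +1 makes sure the last full window is included.
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        matrix_product = 1.0
        background_product = 1.0
        for pos, base in enumerate(window):
            matrix_product *= probs[pos][base]       # e.g. A[01]*T[02]*T[03]
            background_product *= background[base]   # e.g. 0.3*0.3*0.3
        scores.append(log10(matrix_product / background_product))
    return scores


# Toy 3-position matrix, made up for illustration only.
toy_probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
scores = window_log_scores("ATTTA", toy_probs)
for start, score in enumerate(scores):
    print(start, round(score, 4))
```

Each window is shifted by one base, so a sequence of length n scored with a matrix of w positions gives n - w + 1 scores. A base outside A, C, G, T (for example N) would raise a KeyError in this sketch.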
Jul 16 '07 #29
aboxylica
111 New Member
from math import log10
import random

# Read the first matrix in the file: skip ahead to its 'PO' header line.
f = open("deeps1.txt", "r")
line = next(f)
while not line.startswith('PO'):
    line = next(f)

headerlist = line.strip().split()[1:]   # column labels: A, C, G, T
linelist = []

# Collect the matrix rows up to the closing '//' line.
line = next(f).strip()
while not line.startswith('/'):
    if line != '':
        linelist.append(line.split())
    line = next(f).strip()
f.close()

keys = [i[0] for i in linelist]                           # position labels '01', '02', ...
values = [[float(s) for s in j[1:]] for j in linelist]    # raw counts per position
linedict = dict(zip(keys, values))
width = len(keys)                                         # 16 positions for this matrix

# Re-key by base: datadict[base][position] = raw count.
datadict = {}
for i, item in enumerate(headerlist):
    datadict[item] = {}
    for key_ in linedict:
        datadict[item][key_] = linedict[key_][i]

# Add a pseudocount of 1 to every cell.
for keymain in datadict:
    for keysub in datadict[keymain]:
        datadict[keymain][keysub] += 1.0

# Normalise every cell by its position's total (raw row sum + 4 pseudocounts).
# A new nested dict is built so that datadict itself is left unchanged.
datadict1 = {}
for keymain in datadict:
    datadict1[keymain] = {}
    for keysub in datadict[keymain]:
        datadict1[keymain][keysub] = datadict[keymain][keysub] / (sum(linedict[keysub]) + 4)


def random_seq(nchars, insertat, astring):
    # Build nchars random bases and splice astring in before index insertat.
    seq = ""
    for i in range(nchars):
        if i == insertat:
            seq += astring
        ch = random.choice("ATGC")
        seq += ch
    print(seq)
    return seq

thestring = "CGTCAAGTTCAAGTGCAAAA"
count = 50 - len(thestring)
p = random_seq(count, 15, thestring)
with open("temp.txt", 'w') as out:
    out.write(p)

q = len(p)

# Background frequencies for each base.
value = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}

# Slide a window of `width` bases along the sequence; the +1 includes the last full window.
for i in range(q - width + 1):
    part = p[i:i + width]
    res = 1
    score = 1
    for j in range(width):
        key = part[j]
        res = res * datadict1[key]["%02d" % (j + 1)]   # product of matrix probabilities
    for key in part:
        score = score * value[key]                     # product of background frequencies
    log_ratio = log10(res / score)
    print("%d %s" % (i, log_ratio))

This is the code that works: it calculates the scores for a single sequence and a single matrix (containing 16 positions). I want to do the same for many sequences and many matrices. I have shown what my sequences and matrices look like; I just need to generalize the code. I hope that is clearer now.
Waiting for your reply.
Cheers!
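One possible way to generalize this, as a rough sketch only and not a tested replacement for the code above: parse every matrix in the file into a dict keyed by its NA name, parse every sequence, then loop over sequences and, inside that, over matrices. It is written for Python 3. The function names (`parse_matrices`, `normalise`, `parse_fasta`, `window_scores`, `score_all`) and the toy matrices and sequences are invented for illustration. The sketch assumes the matrix file follows the `//`, `NA`, `PO` layout shown earlier in the thread, that the sequences are in FASTA style with `>` header lines, and that they contain only A, C, G and T.

```python
from math import log10

BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def parse_matrices(lines):
    """Return {name: rows}, each row a dict base -> raw count, in file order."""
    matrices = {}
    name, bases = None, None
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("//"):
            continue                      # blank line or matrix separator
        if fields[0] == "NA":
            name = fields[1]
            matrices[name] = []
        elif fields[0] == "PO":
            bases = fields[1:]            # column labels, e.g. A C G T
        elif name is not None and bases is not None:
            counts = [float(x) for x in fields[1:]]
            matrices[name].append(dict(zip(bases, counts)))
    return matrices


def normalise(rows, pseudocount=1.0):
    """Add a pseudocount to every cell and divide by the position total."""
    probs = []
    for row in rows:
        total = sum(row.values()) + pseudocount * len(row)
        probs.append({b: (c + pseudocount) / total for b, c in row.items()})
    return probs


def parse_fasta(lines):
    """Return a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def window_scores(seq, probs):
    """One log10 odds score per window of len(probs) bases."""
    width = len(probs)
    out = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for pos, base in enumerate(seq[start:start + width]):
            ratio *= probs[pos][base] / BACKGROUND[base]
        out.append(log10(ratio))
    return out


def score_all(matrix_lines, fasta_lines):
    """For every sequence, score it against every matrix."""
    matrices = {n: normalise(r) for n, r in parse_matrices(matrix_lines).items()}
    results = {}
    for header, seq in parse_fasta(fasta_lines):
        for name, probs in matrices.items():
            results[(header, name)] = window_scores(seq, probs)
    return results


# Toy input, made up for illustration only.
matrix_text = """//
NA toy1
PO A C G T
01 8.00 0.00 1.00 1.00
02 0.00 0.00 0.00 10.00
//
//
NA toy2
PO A C G T
01 1.00 1.00 7.00 1.00
02 2.00 6.00 1.00 1.00
03 0.00 0.00 0.00 10.00
//
"""
fasta_text = """>seq1
ATGCT
>seq2
GCTAT
"""
results = score_all(matrix_text.splitlines(), fasta_text.splitlines())
for (header, name), scores in sorted(results.items()):
    print(header, name, [round(s, 3) for s in scores])
```

With real files, open file objects can be passed in place of the `splitlines()` lists, since both parsers only iterate over lines. The window width is taken from each matrix, so matrices with 11, 16 or any other number of positions are handled by the same loop.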
Jul 16 '07 #30
