469,288 Members | 2,357 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,288 developers. It's quick & easy.

looping through a big file containing a set of files.

111 100+
hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//

the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46.  
  47. def readfasta():
  48.     file1= open("chr011.py",'r')
  49.     file_content=file1.readlines()
  50.     first=1
  51.     list1=""    
  52.     for line in file_content:
  53.         if line[0]==">":
  54.             if first==0:
  55.                 print "***********"
  56.                 list1+=sequence
  57.                 print "***********"
  58.             else:
  59.                 first=0
  60.                 sequence=""
  61.                 seq=""
  62.                 for i in range(0,len(line)-1):
  63.                     seq+=line[i]
  64.         else:
  65.                 for i in range(0,len(line)-1):
  66.             sequence+=line[i]  
  67.     list1+=sequence
  68.     return list1
  69.  
  70.  
  71.  
  72. p=readfasta()
  73.  
  74.  
  75.  
  76.  
  77.  
  78. res=1
  79. part=""
  80. q=len(p)
  81. seqq=""
  82.  
  83. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  84. for i in range(q-16):
  85.     part=p[i:i+16]
  86.     seqq=part
  87.     res=1
  88.     score=1
  89.     for j in range(16):
  90.         key=seqq[j]
  91.         res=res*datadict1[key]["%02d"%(j+1)]
  92.         #print res
  93.     for key in seqq:
  94.         score=score * value[key]
  95.     #print score,"*******************",res
  96.     log_ratio=log10(res/score)
  97.     print i,log_ratio
  98.  
what changes should i make and how?/
waiting for your reply,
cheers!
Jul 13 '07 #1
103 5090
bartonc
6,596 Expert 4TB
Start with a list of input files (hopefully, generated automatically when there are many). Like:
Expand|Select|Wrap|Line Numbers
  1. matrixFileList = ['matfile1', 'matfile2']
  2. sequenceFileList = ['seqFile1', 'seqFile2']
  3.  
  4. # use zip() to pair them
  5. filesList = zip(matrixFileList, sequenceFileList)
  6.  
  7. # define your function to take file names
  8. def mainFunc(matFileName, seqFileName):
  9.     print matFileName, seqFileName
  10.  
  11.  
  12. # call the function for ever file pair in the list
  13. for pair in filesList:
  14.     mainFunc(*pair)   # the * breaks the tuple into its parts.
  15.  
Jul 13 '07 #2
aboxylica
111 100+
my sequence file:>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGAT GTCGTTTTTGGGTTTCATACCTTTTCAC
ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATT TAGCTTCGGATGTCAACGTCACCATAAT
CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTG TAAACGCGATGATAATTACAAATATGCC
TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATAT TGAAATTCCGATTCCCTAGAAAATAATA
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTAC TCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAAT GTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATG CTCTGGG
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATACAAAAATTTGTTGTGTTTGAATTGA ATTAAGAGCTTATCAAGAAAAAAATTTC
AGTGACTCATAATACACTACTCTACAAGTTTAAATTGAATCAACAATTTA ACTTTCATTGCTCAGGTTTTTAGTAACA
ATGTTTATATAAGTTTAGGTATAACAAATGATTTAAATATAAGATACTGT ATTTCACATTGAGACGAAACAATCCACC
GAAAATCATAAAATATAAGAATGTTGCATTTTATTTTTAAAAATAAAGAT GCCTTTTAAGAGGAATAACTTAAATGTC
TTTAATACCTTTGAATTTAATTATATGGCTAATAAACACAAACTTAAAGC TTAAAACTGCATCGAATTGAATGCGGTT
ATAAATGTACTTATATATCTAATATAATCTGCTAATATGGTTTACATGGT ATATCTTTCTCGGAAATTTTTACAAAAA
TTATCTATTCATATATCTCGAGCGTAAGATATTTATCAGTTTATAGATAA CATCTTTAAATTTGGGTGATTAAAAAAA
AACATTG
>Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGA AATGGAGATGGATCACGTAGCCGGCCAT
GGCGG
>Him_distal|Drosophila melanogaster|Him|FBgn0030900|X:18039896..18043470
GGTTTTCTGCGATGGCTTCCGCGCCAGCTGAAGTATCTGATTTGCTGCCT TGTTTTTGTTGATATTTCTGCGAAGGGA
CTTGTGCTTTTCAAATGGCCTTTTTTTGGGATTACGGCAAGGGCGCGTTT CCCACGCTCGATCCCCACTTACCATTGG
TGCACGCGATTGCGGCAAGCTGCTGAGGCAAGCTATTAAACGCCACACTG GGCCGGGGGGCGGTACCGGTGGGCGTGG
CAGGGGAGTCGACACATGTTGTGTGCCAGAGAACTTTGCTCCGATCCCCA GATCATCAAATAGTTGTCGCTGTCTGCT
CGTGCGCAAATTGCAATACTTTGCATACCCTTACTGCAGGGTATCTGAGC TTGGACTTTAAATAAGGGGGTATAACAT
AGCTTATACTCTCTATCTCTGTTATAAAGTCAATTTTCCTTAGATCTTTA GTACAGTGGGTAGTTAAGGAGACATAAC
TTCCAAAAAAAAAAACTATAAAATTGCAATAATTTATGCAAAATATGTAT TTTATTGAATGGGATGAATAATTTACCT
TATACGACTGTAAAACATTTCTAACGATTAAATGCACTTCTAAAAGTTTT CCCACAAGTAGGTGAGCTATTATGCTAA
GCGTTCCATGACTTGGAATCTAAGATCTTGTTTTGATCTTCGCTGATCTT TGAGAACTCGGGGATTACTTACACATTT
CTGGGCAGGCACAAGTGGGCCGAGGCAGTGTAGATTCATCACGTTTTCAC TCAACACACGCAGCTCATTAACAGCCCC
GCTGACAACTTGTCAGGACTTCCCCCTCGTGAATCCCCCTGCTACGCAAC CCCCATTCCCCGCCCATTCCAACACTTC
CCGCCGGGAGCGTGGGAAATTATGCGTGTTGGTGGGACGTCGGGCGGTGA AAATTGGCGCGCTCTTCGGGGGGCCACA
CCGCGTGGCATTGACAACTCTTCCACATTTCGCGCCCAACGATGCGTTGG CATCAGTGGGTCACAGGGATTACGGCTG
GCTGGGATTCCAGAGCCAGATCTTTTTCAGCCAAAACTTTCAGCTTTCGA AGACCTCAAGCGATAGGAGAGTGTCGGA
AGTCCAGAAATAGACGCGTAGCACATAAATTATGGATCGTATCGAGTATC GATTAGCCCGGGACAAGCGAAGCGATAG
GGAGACATATTTTTATTACCCTCTCGGGGACCTGCACTTGTTGGCTTCGC TTCTATGAAAGATCCCTCTACCATATCA
CGTATGTGGGCTCCCCCAATCGAACCGAGTTGTGGGAAATGTTTTCCCAG GCCAACAGCTAATTGTCACTCCAAGGGT
TGTCCCCGCAGCCCAGACGACAGATAAGCGGGCAAGTGAAGCCCAGCGAT CTGAGTCAAGTGAAGGGCTTCAATTTCT
TTCCCGAGTGGAACTGGGATATCGAAATTACATTTGTAACAGACGTTTTA GTCCGCAATCCTCAGCTAATGGGACTTA
CGAACATATATTCATCTGAAATTCAAGAACATGCGCACTTAAAGAGCAGG GAAGTCGCACACGCGCAAGTCAGGCGCT
CAAAAAGGGATCTTCGGAGGTACAGTGGGCAAAAGACTGTAAATAAATAA TATAAATAAAATAATATTTAGCTCTATG
TGTTTATATAATCTACAAAGTAGTTAACAAAAAATATAAAATGGATATAA AAATACATCTTATATATCCCTATAATAA
GAAATAAATAATAATTTTAGTAAATTAATTTTGTTACACAAAGTACCTGT ATTATTACCTCTTTTTTGTTGGTTGGTT
CTTTTTTGATGTGGCCCCACTGTGCTCTCTTATCAGTGCGACAATCAGGC ATTGCCTTTCCCCATCGGGGGATTCTAA
TTCCGTGGACGATGGGCCGAAACGCCTATAAAGTCGCTCATTAAAAATGT TTAATTATGGCCCATCTTGCATCTTGCA
CCGATGTGGATGGGGTTTGTCGGCAATGATTTACATTATAAAAATGCCCG TTATCTGAGCATTTTGTACGCTCCACTC
CCTCTTCCCCCCTCCAAAAAAAAAAAAAACAGATATGTATATTCCCCGAG ATATTCCCAAGCGGCCAAAAATAGACGC
AAATTGTAACGCACTTGAAGTGCACTCTGAAACATCTTGAAGTCCAAATA AAATAGCAGAGAGACCCACAATAATATA
CGTTGATATACACATGTATATATGTATGTATGTACATAAAGGGCCAGGAG CAGGAACGTTAGGCATGCGGTGGTACGA
GCACCGTGGTGCGAGCGAGAGCGCTGTGCTGCCTGAGGGAGAGGTAGCGA GTGGGTTGCATTGCGCACACAGAACATG
TGAATGCAGAGTTCAAGTGCATGCCGTGACACAGACACGCACACACACAC ACGCACACACAGATGAGTAGCCGCTGCA
AAGTGTTTTTTCCCAGGCGCTATTTATAATATGCATCCCGTCGCCGATCC GATCCGATCCAATCCAATCCGATTGGAT
CCCATCTTGCGGCACTACGATTATGACGCTCGACACGATGATGCATTCGC AGAGTTTCCCGATCGCAGAGTACCCTGT
ACTCGAGTAGTTTTTAGATGCAGTATTATTAAGTAGAAAATTGTAACCGT ATAATATTCCATTATATTAAATATTTTT
ATAGCACTAAAGAAATAAAAGCCCATTTTATAATTTATATTACAAAAATA CTTAACCATAGAAACTTATGATATGATA
CCAATATTTAAGTTCCAAAAAATGTAGAACATTTTTAAGTATATACTCGA AAATATTAATTTTCAAAATTGATATTCA
AGAGATATTATAAAAAGATCCCCATTCTAAATATCTAACATCATGCCATG CTTTCTAATGAGTATAGTATACCCCTGC
TACCCTGTCAATCCGCAAAACAGGCGCCGAAACATGCGGTTTCTCGCAGC AGACTGCCACGGGAAAAATTCGGTTCGA
GATTTGGGAATGGATGTATGACGGAGCAGAAGGAGCAGGACCCGGATTTC GGATTTCGGAATGGATATGGAAATGAAG
ATGGAAATGGGACTTTGACTGCGCGACGGCCACATGCGCCGCTGGCGATG CCGCTGGATGTTGCATGTGGCAGCGGTC
GGTGCAGCAGCGAAAGTGTTGCAGCTGTATGAGAGGGTCTATTTTTGGGG CGATTGTGCGGCGCTGGTGCTGCCACAT
GTGTTCTGTGTTGGGCTGCTAAAAGGCATTGTAATGAGAGCAGAAAATAG AATTGACTCCACTTGAGCAATGTCCCAT
AAAGCGGGAGTTTCGAGTTTGGCGCGCAATGTGCCGCACCAGCAAACGAA CAAAAGAAAAAAAAAAAAAAAAAACACA
GCCAGTAACACATGGGCCCACGAGTTATGTTTTATTTTTAATCCCACAAA GAGTCGATCTCCAAAACAAACCCGCAGA
GAGCACATATAAAGAGACTCGGTGGACGAGTGGTTCGAAACAGTCTTCCG CCGCAGCTCGACGCGCTCGCATATCGGG
AATATATAGATCGGAGATATCGCAGGACCCACAGCAGAGCAGAGCCGCAG AGCCACCAACCTCG
>Him_proximal|Drosophila melanogaster|Him|FBgn0030900|X:18041232..18043470
GCCCAGACGACAGATAAGCGGGCAAGTGAAGCCCAGCGATCTGAGTCAAG TGAAGGGCTTCAATTTCTTTCCCGAGTG
GAACTGGGATATCGAAATTACATTTGTAACAGACGTTTTAGTCCGCAATC CTCAGCTAATGGGACTTACGAACATATA
TTCATCTGAAATTCAAGAACATGCGCACTTAAAGAGCAGGGAAGTCGCAC ACGCGCAAGTCAGGCGCTCAAAAAGGGA
TCTTCGGAGGTACAGTGGGCAAAAGACTGTAAATAAATAATATAAATAAA ATAATATTTAGCTCTATGTGTTTATATA
ATCTACAAAGTAGTTAACAAAAAATATAAAATGGATATAAAAATACATCT TATATATCCCTATAATAAGAAATAAATA
ATAATTTTAGTAAATTAATTTTGTTACACAAAGTACCTGTATTATTACCT CTTTTTTGTTGGTTGGTTCTTTTTTGAT
GTGGCCCCACTGTGCTCTCTTATCAGTGCGACAATCAGGCATTGCCTTTC CCCATCGGGGGATTCTAATTCCGTGGAC
GATGGGCCGAAACGCCTATAAAGTCGCTCATTAAAAATGTTTAATTATGG CCCATCTTGCATCTTGCACCGATGTGGA
TGGGGTTTGTCGGCAATGATTTACATTATAAAAATGCCCGTTATCTGAGC ATTTTGTACGCTCCACTCCCTCTTCCCC
CCTCCAAAAAAAAAAAAAACAGATATGTATATTCCCCGAGATATTCCCAA GCGGCCAAAAATAGACGCAAATTGTAAC
GCACTTGAAGTGCACTCTGAAACATCTTGAAGTCCAAATAAAATAGCAGA GAGACCCACAATAATATACGTTGATATA
CACATGTATATATGTATGTATGTACATAAAGGGCCAGGAGCAGGAACGTT AGGCATGCGGTGGTACGAGCACCGTGGT
GCGAGCGAGAGCGCTGTGCTGCCTGAGGGAGAGGTAGCGAGTGGGTTGCA TTGCGCACACAGAACATGTGAATGCAGA
GTTCAAGTGCATGCCGTGACACAGACACGCACACACACACACGCACACAC AGATGAGTAGCCGCTGCAAAGTGTTTTT
TCCCAGGCGCTATTTATAATATGCATCCCGTCGCCGATCCGATCCGATCC AATCCAATCCGATTGGATCCCATCTTGC
GGCACTACGATTATGACGCTCGACACGATGATGCATTCGCAGAGTTTCCC GATCGCAGAGTACCCTGTACTCGAGTAG
TTTTTAGATGCAGTATTATTAAGTAGAAAATTGTAACCGTATAATATTCC ATTATATTAAATATTTTTATAGCACTAA
AGAAATAAAAGCCCATTTTATAATTTATATTACAAAAATACTTAACCATA GAAACTTATGATATGATACCAATATTTA
AGTTCCAAAAAATGTAGAACATTTTTAAGTATATACTCGAAAATATTAAT TTTCAAAATTGATATTCAAGAGATATTA
TAAAAAGATCCCCATTCTAAATATCTAACATCATGCCATGCTTTCTAATG AGTATAGTATACCCCTGCTACCCTGTCA
ATCCGCAAAACAGGCGCCGAAACATGCGGTTTCTCGCAGCAGACTGCCAC GGGAAAAATTCGGTTCGAGATTTGGGAA
TGGATGTATGACGGAGCAGAAGGAGCAGGACCCGGATTTCGGATTTCGGA ATGGATATGGAAATGAAGATGGAAATGG
GACTTTGACTGCGCGACGGCCACATGCGCCGCTGGCGATGCCGCTGGATG TTGCATGTGGCAGCGGTCGGTGCAGCAG
CGAAAGTGTTGCAGCTGTATGAGAGGGTCTATTTTTGGGGCGATTGTGCG GCGCTGGTGCTGCCACATGTGTTCTGTG
TTGGGCTGCTAAAAGGCATTGTAATGAGAGCAGAAAATAGAATTGACTCC ACTTGAGCAATGTCCCATAAAGCGGGAG
TTTCGAGTTTGGCGCGCAATGTGCCGCACCAGCAAACGAACAAAAGAAAA AAAAAAAAAAAAAACACAGCCAGTAACA
CATGGGCCCACGAGTTATGTTTTATTTTTAATCCCACAAAGAGTCGATCT CCAAAACAAACCCGCAGAGAGCACATAT
AAAGAGACTCGGTGGACGAGTGGTTCGAAACAGTCTTCCGCCGCAGCTCG ACGCGCTCGCATATCGGGAATATATAGA
TCGGAGATATCGCAGGACCCACAGCAGAGCAGAGCCGCAGAGCCACCAAC CTCG
>Obp18a_prom|Drosophila melanogaster|Obp18a|FBgn0030985|X:18969778..189727 46
ATGGCGAAAATCTGTTTCCCAACTAACAATGAGCGCATCATCACAGCTCT ATATATATAACCCATCGATTTGCTAATT
CAGCTCAAAAGTAGACAGGAGATTTTAATTAAATAATTGGATGCTACTTT ACATTCGCCACACACCAACAAATAAAGT
CTATAATTGAAATTTTAAGCGCAGTTCCCGATTATGAGCTACACGTATGT CGTATGCGCAATATCTGCATTACAATTG
CCAATAGTAAATTACCAACTTGGTTTTCTTCATATTTATTAAGATAGAAA ACATACAATTTTTGGCTTTTACACTCCA
AGCATCTCTGAAGTTTAAACAAAAAACATATGTGTAGCCTATCTACTGTA TTGGACTTTATTCGTATATTTTATATGG
TTCATTAATATAGGTATAAATACAAATTATATTCACGCTTTGCGATTTGC AGCGAATATCACATCTTATACACGATGT
AAAAAAAAAAAAAATATTTCGTCATGTTTTTAGGTTGGCCGCAGGCAGTG CTCACTGTACCGCCACAATGTTTATCGT
TTTGCATTTTTTTTTTCTTTGTTTTCTTGCGGTTTCCCCTAATTATCTTT AGTATAAACTTAGTCTACTGTCTTTTTT
GGTAAGTATTTTCGTGATGGGCTCGTCTATGCGAATTCCCATTTCCAATG AATAAATAAAGTAATTAGAACATTAAAA
TTAGCAATAAAACACGTACATTTAAAGCTGACAACAAAAAAAAAAAGTAT TCTTATGTTAAACTGTAGTATGTGCCTA
TGCAATATTAAGAACAATTAAATAAAATAGCATATTAACTTATGGCAGCA CTTTGTTGCTATGTTTATGTTTATGTTT
ATGCACGCAGTTAGGCCAGGGCGGATGTAACATGATCACCCACTCGAAGG CAAAAAGTATAAGTGCATGGTCAGCATT
CACACGCCGACCAAATACATATTACATACGTACATACATATCTCGCTCTC CCGATAAGCCTAGATATATAAGATATAC
ATAAGAACGCCGCTCCGCTGCTGGCGTACCCGGCAGCGCAGCTACGCGGA TTAGCCTAAGTCCAAATATATTAAAAAC
TGTAAAATCAGAGAGACTCTGTAGACGTTGAGCTGACAGAACCATTTCTG CCTACTCTAAAATCAAAAGAAGAAATTG
AATAAATATATGTCAGCCCGACGGCTGCCTTCAACTTAAAACGGACTTGT GTTCTGAATTGGAGTTCATCATTACATG
GCGACCGTGACAGTCGTCCAACGCTGGACGAATTGACCAAAGCTGGTGAA AACAAAGGAACAAAGGAACACTGGACTG
GAAGAAGACTGGACTAATTAAATGGAACTGCAAAAACCAAGGAAAAATCT GAGTGAGTAGAGTTCTATTGAGTATGGG
CAAACACCGTGGCGGTTTGAAAACTAAGCTGAATAAACGTATAGCCCACG TAAGGTGGCTAATATACGGTCAGCAAAC
GCCACCGGTTTGGTCGAAAGCTCTAAAGCTACATGCAGAGCTAGACCACT TGTTGCAATATCAGCAAGAATTAAAGAC
CCATAAGCTCGAGAAAACTCACTCAGATAATATTAAAAATATACCCACAA TTAATGAAGTTCCAAAATACCAGGCATG
TCCAGCACCAGCACCAGCATTAACAAAACCAAAGAAGTCCTGCCCCCCTG GCTGCGAAGGAATCTGGAGTCCCCACTG
CCTGGGGACTTGTGAGCGACCATCGACGTCTTCAGCGGCGAAGAAATAGA CAGCAGCGAGGGAGTGTCAGCGTGCCAC
CCCCGGCGACGCCCAGCTGACACCTGATGAGCATCATCAACAGCAGAATA TAATAATAAATATATATAAATATAAAGT
AAATATAAAATATATATAGATAAGAAAAATTGTAAGAAATATTGTAAAAC GGAGCATATACTATTATGCCCTGTTAAC
CCAATATGGCCCGTGAAGCCATAGCTAGAATCAGGCAGGCAACAATGTAA AATACAATTTTTTTTTACTCTTGCGAAC
ATTGAAAGATTTTATAAATAGATAATTCCAAACATAAATGTCTATAGAGA CAAATGAAATAAGTAAAACTGAAAATAA
AAGTATATACAAAGGAAATTTTCTATTCTATTCTCCAAAATATAAAATTA GTATACCCAAAATGGGTCTAATAGACAC
TAAAACTGTGGACTCTACAGCCAATGTAATAAATAAAGTAGAAGTCCAAA ATGCAGACTTGTTCTGGATAACCATAAT
ACTAATTGTAATTGCATTAATTATGGTATCCAATGCATTAATAAAAATAT ACAAACTGCATAACAAGTGTCTTAAGAA
ACGATACCGTAGCACTGCTAACGGTATAGATAATATTTAAGGAAGATCTT TAATAAAGTCAATTATGAATGAAAATAT
GAGAAAAATTATATGAAAAAAAAAAAATAATAAATAAAAAAAAAAATATA AAACGTAATATTGAATTTATCTACGTTA
AAAAAAAAAATATATACAAATGAATAAATTTGAAGTTATGAGTATACCAC AGCATGGACTGGGAAAAGCTTGTTGATC
AGATAAAAGATCAAAATGAAAATTTCAGAAAATCCTATAAGTGCTTAACG CAAAACAGATCAACACAAGCTGTAACAA
TCAATAGGAATGCCCAAGTCTTGGTAAATAGTTATAATGAAATCAGAGAG TTGATCCAACAAAATAGAAAGAATTTGG
AACGCAAACAGTGTGCTAAGGCTTTGAACCTACTGGTGACATTAAGAGAA AAATTAATATTTATAAAAAATAAATTCA
GTCTCCAGATAGAAATTCCAACCATAGTAAACACCCCACTAAGAATAAAT TTGAATGAAGACAGCACTAACTCTGACG
AGGAAGATAGGACTATAGTCAAGGAAGACATTAAAGAGGAAGATCTTCAC GATCTAACTATACCAGCAAAATTAATGC
TGAA
>Obp19a_prom|Drosophila melanogaster|Obp19a|FBgn0031109|X:20223943..202264 46
CCACCTGCGAAATGGGTCATAGTATATGTATTTGTAAAAAATGTATGTAA AAAAATGTTAAATTAATAATTTTGAATT
TCAATTTGGAGCTGAAAATAATATTTTGTGTCCATCAACAGCTCCAAAGC GATGGTTCATTTTATCTTGTGTGCGTTC
AATAGAATCACTCTTACGTTAGCGCGTCCATTGATGGTTGTCCCATTGAA GTACTTCTTAAAGCCGTCGGCCATTGCT
ACTGGACTGGATCTGGAGATCTGGAGATCTGGATTTGGGGTCGGGTCCGG GTGAGAGCTGAGTGTGTTCTGCCTATAG
CTCCGAGCGAGAACCTAATGACAAGCAGCGAAGTGCAAAGCTCGGCCAAC TAGATTACAAAGTCGATTCATTGGCAGG
ATTCGATTTTTATTGACTCAACGAGGTGGTACATGAGTTTGGTCCCCAAG CCTTTAACTGTGGCATCGAGGACCGGAA
AGGGGGTGCTGATTATAAATAGTTATGGATTGCTGACGGGTCGAATGGGT CGGAGCGGTGGGGAGCCATGACTTCAAT
GATTTGGCAGCATCGGCGCCCTAGCCATGGAGCATGGCCTGCTGGCAGCC CTTGCAGTAGAGCTTGGTCTCGCGCCGC
TTCGTGTTGCGGCGGTGCATCTTGACCAGGACGTAGACGAGTCCCAACGA GGCCCAGGTGGCCTTGGCTACCTGTGGG
TTTCGGTGGCGTATTTGGGCGCATCTTGTGTACTGCCGTGTACTGAATCA CTTACATTGGCGCGACCACGCATGGTCT
GGCTGTTGAAGGCTTCGTTGAAGTTGAAATGATCGGACATCTTTGGATCG TTGTTGACCGGATTGGCGTGGCTTTTAA
CAAAAGATTAAAATTTGGATTCGATATTCGACCTGTATTTTAGACCGGGA TTCGGATTGTGACTTTTAAACGTTCGAA
ATGAAAGGAATGTTACTGACAGTCGTCAAAGCCGACTCGGGTTTCCCAAC TAGAGAGAATGCTGAAGTCTAGTACCGA
CTAATGGGATACCCATTAATTACTGCTTAAATACTGTGATGAAAATTGAG ATATGCAAGAGGCAAATCGAAAGTTTTG
GACATTTTCATATTGTACCTTTAACCAACTTCAGAATTCATTGAGCTAAA TACCATTTACAATTTTATGAAATTTTTA
AGCATGTTACAGCTATAACTATTTTTAAACCAGTTACTAGATTCGTTGAA AATTGTATGTCACACAGAACTTCTTGCC
ATCCTGGTCGGAATTAGGATCACTAGCCAAGCCGATATGGCTATGTCTGT CCGTATGAAAGTCTTGGAATCTGATATT
AACATCGCATATCGATCGACCATTATATATCTAATATATCCTCTACAAAT GTATTTTATCACCTAGCTAGCATGTAAA
CATTCTGGCCTATTTAGCTGTACGCTTCAGTTATGCTAATGCAAACATAA GCCTTTTGTGATATTATAATTTACATTT
ATTATTTATTGCAGTTAGCTTTATCAGCGATTTGGGCTCATGCCACACGC AATACTACTTATTTCAACGTCATCAGTT
GTACTAAATGCACAAATGAAATACATTTCGCCAAATAAATGCCAACTTGC AACTAATTTGAATGCTAATCAAACCGAA
CTACTCATTTGCATACAAGGTAATAGGTGGTTAAAGTGAGTGTAATGGAC TTACTTAAGGGGTTACAAGGCTTATATT
TAAAATGCCTGCCTTGTAATTAAATTTTTAAATATATTGGAAAAAAATGG CCACTTGTTATGTGAGTCTCCAGAAAAA
AAACAAAAAAACAGCAACCATCTGGTATGCAAAATATCTGGTGGTAGCAA AATATCTGGTGGTATCTGGTGGACTATC
AAAATATAAAAACTTTTTTTTCCAGATAGTATATCTTAAAATCAGCATCT TGAAGGAGTATATGTAAATAGCAAACTA
TTTGTAAAAATAGATTTTATTTTATAATTTTTTAAGATATATACCAAACA TTATTACCGATTGTGATTATCTTTACAT
TGTTTGACCTCAAAACGGAAAACTGGATGCGCGGTATCCATGCGACCCTA ACTCTGGAACCGATTTTGGAACCGCCCC
GTTAGATCTCAGATTGAAACCTTATTTGCATTCGCATGATCGCTGATGAA CACTGGGGAAATGCGGCCCAGCAATGGG
ATTGTCAACGCATCTCGGCCAGAATCGCGCCTCGCATGCCACCTCGCACG GTGACCACATACCTGTGTACACTGTCAA
TTAACGTGGCAAGATTATAGCCCGGCCAGAAAGTAATCCGCCCCAGGAAC ACCACCCACCGCCCGCCCATTTGGATAT
GGAAATGGGCAGTGGGGGCGGCGATTGGCGCTAACCCATAATTCCCACAC CCACTTAGCGGTTCGATCGAACCAATAT
GAAGTCATTTGCATGTCGGGGGCCGTGTATAAAAGGAGTCGCCGATGGGT CTGGAGTCTGGAATCCGCCAAATCGTCT
CGGAAAT
>Obp19b_prom|Drosophila melanogaster|Obp19b|FBgn0031110|X:20224439..202274 40
ATTGCTGACGGGTCGAATGGGTCGGAGCGGTGGGGAGCCATGACTTCAAT GATTTGGCAGCATCGGCGCCCTAGCCAT
GGAGCATGGCCTGCTGGCAGCCCTTGCAGTAGAGCTTGGTCTCGCGCCGC TTCGTGTTGCGGCGGTGCATCTTGACCA
GGACGTAGACGAGTCCCAACGAGGCCCAGGTGGCCTTGGCTACCTGTGGG TTTCGGTGGCGTATTTGGGCGCATCTTG
TGTACTGCCGTGTACTGAATCACTTACATTGGCGCGACCACGCATGGTCT GGCTGTTGAAGGCTTCGTTGAAGTTGAA
ATGATCGGACATCTTTGGATCGTTGTTGACCGGATTGGCGTGGCTTTTAA CAAAAGATTAAAATTTGGATTCGATATT
CGACCTGTATTTTAGACCGGGATTCGGATTGTGACTTTTAAACGTTCGAA ATGAAAGGAATGTTACTGACAGTCGTCA
AAGCCGACTCGGGTTTCCCAACTAGAGAGAATGCTGAAGTCTAGTACCGA CTAATGGGATACCCATTAATTACTGCTT
AAATACTGTGATGAAAATTGAGATATGCAAGAGGCAAATCGAAAGTTTTG GACATTTTCATATTGTACCTTTAACCAA
CTTCAGAATTCATTGAGCTAAATACCATTTACAATTTTATGAAATTTTTA AGCATGTTACAGCTATAACTATTTTTAA
ACCAGTTACTAGATTCGTTGAAAATTGTATGTCACACAGAACTTCTTGCC ATCCTGGTCGGAATTAGGATCACTAGCC
AAGCCGATATGGCTATGTCTGTCCGTATGAAAGTCTTGGAATCTGATATT AACATCGCATATCGATCGACCATTATAT
ATCTAATATATCCTCTACAAATGTATTTTATCACCTAGCTAGCATGTAAA CATTCTGGCCTATTTAGCTGTACGCTTC
AGTTATGCTAATGCAAACATAAGCCTTTTGTGATATTATAATTTACATTT ATTATTTATTGCAGTTAGCTTTATCAGC
GATTTGGGCTCATGCCACACGCAATACTACTTATTTCAACGTCATCAGTT GTACTAAATGCACAAATGAAATACATTT
CGCCAAATAAATGCCAACTTGCAACTAATTTGAATGCTAATCAAACCGAA CTACTCATTTGCATACAAGGTAATAGGT
GGTTAAAGTGAGTGTAATGGACTTACTTAAGGGGTTACAAGGCTTATATT TAAAATGCCTGCCTTGTAATTAAATTTT
TAAATATATTGGAAAAAAATGGCCACTTGTTATGTGAGTCTCCAGAAAAA AAACAAAAAAACAGCAACCATCTGGTAT
GCAAAATATCTGGTGGTAGCAAAATATCTGGTGGTATCTGGTGGACTATC AAAATATAAAAACTTTTTTTTCCAGATA
GTATATCTTAAAATCAGCATCTTGAAGGAGTATATGTAAATAGCAAACTA TTTGTAAAAATAGATTTTATTTTATAAT
TTTTTAAGATATATACCAAACATTATTACCGATTGTGATTATCTTTACAT TGTTTGACCTCAAAACGGAAAACTGGAT
GCGCGGTATCCATGCGACCCTAACTCTGGAACCGATTTTGGAACCGCCCC GTTAGATCTCAGATTGAAACCTTATTTG
CATTCGCATGATCGCTGATGAACACTGGGGAAATGCGGCCCAGCAATGGG ATTGTCAACGCATCTCGGCCAGAATCGC
GCCTCGCATGCCACCTCGCACGGTGACCACATACCTGTGTACACTGTCAA TTAACGTGGCAAGATTATAGCCCGGCCA
"how can i make a list of the individual sequences.here the new sequences start with">" symbol.so i want to put them in individual lists.and i cant specify the number of sequences in the entire file.. how can i do?"
Jul 13 '07 #3
bvdet
2,851 Expert Mod 2GB
my sequence file:>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
.........................
GCGCGGTATCCATGCGACCCTAACTCTGGAACCGATTTTGGAACCGCCCC GTTAGATCTCAGATTGAAACCTTATTTG
CATTCGCATGATCGCTGATGAACACTGGGGAAATGCGGCCCAGCAATGGG ATTGTCAACGCATCTCGGCCAGAATCGC
GCCTCGCATGCCACCTCGCACGGTGACCACATACCTGTGTACACTGTCAA TTAACGTGGCAAGATTATAGCCCGGCCA
"how can i make a list of the individual sequences.here the new sequences start with">" symbol.so i want to put them in individual lists.and i cant specify the number of sequences in the entire file.. how can i do?"
Python is great for this kind of stuff:
Expand|Select|Wrap|Line Numbers
  1. def parseData(fn, dataset=1, key='>'):
  2.     # initialize output list
  3.     dataList = []
  4.  
  5.     # open file for reading
  6.     f = open(fn)
  7.  
  8.     # skip to required data set
  9.     for _ in range(dataset):
  10.         try:
  11.             s = f.next()
  12.             while not s.startswith(key):
  13.                 s = f.next()
  14.         except StopIteration, e:
  15.             print 'We have reached the end of the file!'
  16.             f.close()
  17.             return False
  18.  
  19.     for line in f:
  20.         if not line.startswith(key):
  21.             dataList.append(line.strip())
  22.         else:
  23.             break
  24.  
  25.     f.close()
  26.     return dataList
  27.  
  28. fn = 'your_file'
  29. data = []
  30. i = 1
  31. while True:
  32.     d = parseData(fn, i)
  33.     if d:
  34.         data.append(d)
  35.     else:
  36.         break
  37.     i += 1
  38.  
  39. for item in data:
  40.     for i in item:
  41.         print i
Jul 13 '07 #4
aboxylica
111 100+
This is my code. it always seems to go to exception loop..i donno why??
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46.  
  47. def readfasta(fn,dataset=1,key=">"):
  48.     datalist=[]
  49.     fn= open(fn)
  50.     for _ in range(dataset):
  51.         try:
  52.             s=f.next()
  53.             while not s.startswith(key):
  54.                 s=f.next()
  55.         except StopIteration,e:
  56.             print "we have reached the end of file!"
  57.             f.close()
  58.             return False
  59.     for line in f:
  60.         if not line.startswith(key):
  61.             datalist.append(line.strip())
  62.         else:
  63.             break
  64.     f.close()
  65.     print datalist
  66.     return datalist
  67.  
  68. fn="redfly_sequence.fasta"
  69. data=[]
  70. i=1
  71. while True:
  72.     d=readfasta(fn,i)
  73.     print d
  74.     if d:
  75.         data.append(d)
  76.     else:
  77.         break
  78.     i=i+1
  79.  
  80. for item in data:
  81.     for i in item:
  82.         print i
  83.  
  84.  
  85.  
  86.  
  87. #p=readfasta()
  88.  
  89.  
  90.  
  91.  
  92.  
  93. res=1
  94. part=""
  95.  
  96. #q=len(d)
  97. #print q
  98. seqq=""
  99.  
  100. #value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  101. #for i in range(q-16):
  102.  #   part=d[i:i+16]
  103.   #  seqq=part
  104.    # res=1
  105.     #score=1
  106.    # for j in range(16):
  107.     #    key=seqq[j]
  108.      #   res=res*datadict1[key]["%02d"%(j+1)]
  109.         #print res
  110.    # for key in seqq:
  111.     #    score=score * value[key]
  112.     #print score,"*******************",res
  113.     #log_ratio=log10(res/score)
  114.    # print i,log_ratio
  115.  
waiting for your reply
cheers!!
Jul 15 '07 #5
bartonc
6,596 Expert 4TB
If you mean this,[code=python]# useing
Expand|Select|Wrap|Line Numbers
  1.  tags#
  2.                 s=f.next()
  3.         except StopIteration,e:
  4.             print "we have reached the end of file!"
it is supposed to do that. Simply remove the print statement.

Exceptions are often used to control program flow just like if, while, etc. That is the case here.
Jul 15 '07 #6
aboxylica
111 100+
even if i do that..it is going to false.. i want to see if it is reading the entire file or not..
Jul 15 '07 #7
aboxylica
111 100+
what i want to do specifically is store the sequences seperately in lists and calculate the scores for them individually how do i do it??
Jul 15 '07 #8
aboxylica
111 100+
i executed the code you gave me it works..it takes the entire file as a single list.. i want to read the entire file but take the "individual files" as lists..calculate the scores for thrm..how do i do this?
waiting for your reply,
cheers!
Jul 15 '07 #9
aboxylica
111 100+
hey,
am stuck..plz help
Jul 15 '07 #10
bvdet
2,851 Expert Mod 2GB
i executed the code you gave me it works..it takes the entire file as a single list.. i want to read the entire file but take the "individual files" as lists..calculate the scores for thrm..how do i do this?
waiting for your reply,
cheers!
The output is actually a list of lists. The StopIteration happens because there is no more data to read, and will happen EVERY TIME. This is necessary because we do not know how many sets of data are in there. The function was coded to read one data set only. You could rewrite the function to iterate only once and return the same list of lists. I will leave that up to you. Here is part of the output, printed in a different way:
Expand|Select|Wrap|Line Numbers
  1. >>> for i, item in enumerate(data):
  2. ...     print 'Sequence number %d: %s' % (i, item)
  3. ...     
  4. Sequence number 0: ['CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG', 'GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT', 'CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC', 'CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC', 'ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT', 'CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC', 'TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA', 'CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG', 'GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC', 'TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG']
  5. Sequence number 1: ['AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTAGAAGCTAAATCCATGTCCACGTCAAACC', 
  6. .........................
Jul 15 '07 #11
bvdet
2,851 Expert Mod 2GB
aboxylica,

I have reworked the parse functions a bit for compatibility, and tested for a specific data set (set number 3). The output is a dictionary of the data set matrix after processing and a list of the corresponding(?) data set sequences. Can you show us what needs to be done from here?
Expand|Select|Wrap|Line Numbers
  1. # Parse matrix data and sequence data files
  2.  
  3. def parseArray(fn, dataset=1, key='PO', term='/'):
  4.     '''
  5.     Read a formatted data file in matrix format and
  6.     compile data into a dictionary
  7.     '''
  8.     f = open(fn)
  9.  
  10.     # skip to required data set
  11.     for _ in range(dataset):
  12.         try:
  13.             line = f.next()
  14.             while not line.startswith(key):
  15.                 line = f.next()
  16.         except StopIteration, e:
  17.             print 'We have reached the end of the file!'
  18.             f.close()
  19.             return False
  20.  
  21.     headerList = line.strip().split()[1:]
  22.     lineList = []
  23.  
  24.     line = f.next().strip()
  25.     while not line.startswith(term):
  26.         if line != '':
  27.             lineList.append(line.strip().split())
  28.         line = f.next().strip()
  29.  
  30.     f.close()
  31.  
  32.     # Key list
  33.     keys = [i[0] for i in lineList]
  34.     # Values list
  35.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  36.  
  37.     # Create a dictionary from keys and values
  38.     lineDict = dict(zip(keys, values))
  39.  
  40.     dataDict = {}
  41.  
  42.     for i, item in enumerate(headerList):
  43.         dataDict[item] = {}
  44.         for key in lineDict:
  45.             dataDict[item][key] = lineDict[key][i]
  46.  
  47.     # Add 1.0 to every element in dataDict subdictionaries
  48.     for keyMain in dataDict:
  49.         for keySub in dataDict[keyMain]:
  50.             dataDict[keyMain][keySub] += 1.0
  51.  
  52.     # Normalize original data (with 1 added) and update data
  53.     valueSums = [sum(item)+4 for item in values]
  54.  
  55.     for keyMain in dataDict:
  56.         for keySub in dataDict[keyMain]:
  57.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  58.  
  59.     return dataDict
  60.  
  61.  
  62. def parseData(fn, dataset=1, key='>'):
  63.     '''
  64.     Read a formatted data file of alpha sequences
  65.     Return a list of sequences
  66.     '''    
  67.     # initialize output list
  68.     dataList = []
  69.  
  70.     # open file for reading
  71.     f = open(fn)
  72.  
  73.     # skip to required data set
  74.     for _ in range(dataset):
  75.         try:
  76.             s = f.next()
  77.             while not s.startswith(key):
  78.                 s = f.next()
  79.         except StopIteration, e:
  80.             print 'We have reached the end of the file!'
  81.             f.close()
  82.             return False
  83.  
  84.     for line in f:
  85.         if not line.startswith(key):
  86.             dataList.append(line.strip())
  87.         else:
  88.             break
  89.  
  90.     f.close()
  91.     return dataList
  92.  
  93. if __name__ == '__main__':
  94.     fnArray = 'matrixdata.txt'
  95.     fnSeq = 'seqdata.txt'
  96.     dataset = 3
  97.     dataArray = parseArray(fnArray, dataset)
  98.     dataSeq = parseData(fnSeq, dataset)
  99.  
  100. '''
  101. >>> for key in dataArray:
  102. ...     print '%s = %s' % (key, dataArray[key])
  103. ...     
  104. A = {'02': 0.0054464669628780287, '03': 0.61075011944577162, '01': 0.20807453416149066, '06': 0.8383737040752951, '07': 0.38474033729874352, '04': 0.0057811753463927378, '05': 0.0047776025990158141}
  105. C = {'02': 0.0073097319764941961, '03': 0.37754419493549923, '01': 0.046583850931677016, '06': 0.0093163250680808381, '07': 0.027232334814390139, '04': 0.0055900621118012426, '05': 0.063924322774831593}
  106. T = {'02': 0.0047776025990158141, '03': 0.0064022933588150982, '01': 0.043669374104156715, '06': 0.14122593282690746, '07': 0.10338732024270221, '04': 0.98385093167701865, '05': 0.92446610290955999}
  107. G = {'02': 0.98246619846161198, '03': 0.0053033922599139988, '01': 0.70167224080267565, '06': 0.01108403802971669, '07': 0.48464000764416415, '04': 0.0047778308647873869, '05': 0.0068319717165926134}
  108. >>> for item in dataSeq:
  109. ...     print item
  110. ...     
  111. TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCAT
  112. GGCGG
  113. >>> 
  114. '''
Jul 15 '07 #12
elbin
27
Just a suggestion, the sequence is needed in a single string maybe, not a list of strings, and the same evaluation as in the matrix-thread should be done, each 16-character fragment at a time, using the normalized matrix. And then the same thing for all sequence/matrix pairs (I don't think they are corresponding)...

Is that so, aboxylica?
Jul 15 '07 #13
bvdet
2,851 Expert Mod 2GB
Just a suggestion, the sequence is needed in a single string maybe, not a list of strings, and the same evaluation as in the matrix-thread should be done, each 16-character fragment at a time, using the normalized matrix. And then the same thing for all sequence/matrix pairs (I don't think they are corresponding)...

Is that so, aboxylica?
Not to be contrary, but the matrix data set #3 has only 7 elements for each of A, C, G, T. The sequence data set #3 has 78 characters on one line and 5 on another. The sequence can easily be joined if needed for evaluation:
Expand|Select|Wrap|Line Numbers
  1. >>> ''.join(dataSeq)
  2. 'TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCATGGCGG'
  3.  
The sequence/matrix data may not be corresponding as you suggested.
Jul 15 '07 #14
elbin
27
Not to be contrary, but the matrix data set #3 has only 7 elements for each of A, C, G, T. The sequence data set #3 has 78 characters on one line and 5 on another. The sequence can easily be joined if needed for evaluation:
Expand|Select|Wrap|Line Numbers
  1. >>> ''.join(dataSeq)
  2. 'TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCATGGCGG'
  3.  
The sequence/matrix data may not be corresponding as you suggested.
Then I suppose the window-size changes for each matrix... It was just some thoughts on the task, because I am into genetics as well... Thanks for the clarification :), I overlooked that.
Jul 15 '07 #15
aboxylica
111 100+
what i exactly should do is that. for every sequence(from the list of file) and every matrix(that is also a list of matrices in file form) i should calculate the scores.
the output should be in in the form of sequence header(this is the heading that follows the> symbol in the file),position of the sequence read",log value
for eg:
matrix1:
>ref1
01,2.0012
02,3.0047
.
.
.
>ref2
01,3.0047
02,8.0067
matrix 2
>ref1
01,2.0012
02,3.0047
.
.
.
>ref2
01,3.0047
02,8.0067

so ..on

the problem is for every sequence in the sequence file and for every position in the file, i am going to calculate the log value.is it clear now?
Jul 15 '07 #16
bvdet
2,851 Expert Mod 2GB
Then I suppose the window-size changes for each matrix... It was just some thoughts on the task, because I am into genetics as well... Thanks for the clarification :), I overlooked that.
Thank you for your comments and suggestions. I hope we have helped the OP find a solution.
Jul 15 '07 #17
aboxylica
111 100+
sorry i am not sure if i got what u meant. but i have posted how exactly my o/p should be.so tell me accordingly
Jul 15 '07 #18
elbin
27
the problem is for every sequence in the sequence file and for every position in the file, i am going to calculate the log value.is it clear now?
The partial sequence for which you are calculating the log has the length of the matrix, doesn't it? So you need to combine the code from this thread and the matrix thread, so you have both the sequences and matrices, and the scoring, and then run it over all matrices and over all sequences.
Jul 15 '07 #19
aboxylica
111 100+
After that for each sequence in the file .for every A,T,G,C I am going to calculate the values,like i give specific scores for each alphabet,then calculate the divide the score of the martix divided by the score of this(at each position) take a log for this. and give the o/p in the form i mentioned.
q=len(d)
print q
seqq=""

value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
for i in range(q-16):
part=d[i:i+16]
seqq=part
res=1
score=1
for j in range(16):
key=seqq[j]
res=res*datadict1[key]["%02d"%(j+1)]
print res
for key in seqq:
score=score * value[key]
log_ratio=log10(res/score)
print i,log_ratio
Jul 15 '07 #20
aboxylica
111 100+
The partial sequence for which you are calculating the log has the length of the matrix, doesn't it? So you need to combine the code from this thread and the matrix thread, so you have both the sequences and matrices, and the scoring, and then run it over all matrices and over all sequences.
but my matrix file is something like this. i am not gonna be specific about the datasets it is a huge file.like this.(this ia a part)
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1//

PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
NA BEAF-32
PO A C G T
01 16.78 0.91 0.00 3.45
02 0.62 0.92 11.18 8.41
03 0.07 20.94 0.00 0.14
04 0.45 0.47 19.97 0.25
05 11.06 2.12 4.95 3.01
06 0.90 0.00 9.47 10.77
07 12.46 3.27 0.00 5.41
08 0.45 6.88 13.48 0.33
09 0.10 1.02 0.00 20.03
10 9.15 1.11 5.14 5.75
11 2.37 0.29 0.00 18.48
12 0.00 8.76 8.01 4.37
13 0.42 8.63 11.09 1.00
14 7.27 1.53 12.08 0.26
15 1.82 0.05 3.23 16.04
//
//
NA BEAF-32A
PO A C G T
01 1.00 0.00 0.24 1.30
02 0.93 0.00 1.53 0.08
03 1.53 0.00 1.00 0.00
04 1.53 1.00 0.00 0.00
05 0.00 0.00 2.54 0.00
06 0.00 1.69 0.77 0.08
07 0.00 0.00 2.46 0.08
08 0.00 0.64 1.30 0.60
09 0.00 0.08 2.46 0.00
10 0.00 1.05 0.00 1.49
11 0.24 0.00 2.30 0.00
12 0.08 0.11 0.00 2.35
13 0.24 0.00 2.30 0.00
14 0.00 0.93 0.00 1.61
15 0.00 0.00 2.54 0.00
16 0.08 1.53 0.00 0.93
//
//
NA BEAF-32B
PO A C G T
01 0.00 7.91 0.00 0.00
02 0.00 0.00 7.91 0.00
03 7.91 0.00 0.00 0.00
04 0.00 0.00 0.00 7.91
05 7.91 0.00 0.00 0.00
06 0.00 1.67 3.51 2.73
07 0.00 0.00 0.00 7.91
08 3.49 0.16 0.00 4.27
09 0.00 0.00 0.00 7.91
10 0.00 5.11 0.91 1.89
11 0.00 4.31 3.60 0.00
12 0.16 7.64 0.00 0.11
13 7.00 0.00 0.91 0.00
14 0.00 6.18 0.00 1.73
15 4.27 2.80 0.00 0.84
16 1.84 5.11 0.84 0.11
//
//
NA Cf2-II
PO A C G T
01 0.00 0.00 0.43 12.03
02 0.00 10.74 0.00 1.72
03 6.27 0.00 6.19 0.00
04 0.00 11.76 0.00 0.70
05 0.78 0.00 11.25 0.43
06 0.00 0.00 0.00 12.46
07 11.91 0.00 0.12 0.43
08 6.27 0.00 0.00 6.19
09 11.56 0.12 0.78 0.00
10 5.88 0.00 0.00 6.58
11 8.86 0.00 3.60 0.00
12 5.77 0.12 0.00 6.58
13 0.00 6.27 6.19 0.00
14 0.00 12.46 0.00 0.00
15 6.69 0.00 5.77 0.00
16 3.52 0.00 8.94 0.00
//
//
NA Deaf1
PO A C G T
01 5.42 5.98 1.71 0.42
02 7.31 4.33 0.25 1.64
03 12.16 1.24 0.13 0.00
04 13.04 0.13 0.00 0.36
05 7.25 1.66 4.62 0.00
06 0.37 1.29 11.76 0.11
07 0.00 13.47 0.00 0.05
08 0.75 1.71 11.07 0.00
09 11.53 0.13 0.05 1.81
10 0.37 0.00 0.00 13.16
11 0.00 12.82 0.00 0.71
12 0.00 0.00 12.84 0.68
13 8.00 0.25 4.24 1.04
14 0.00 6.03 0.00 7.50
15 0.42 0.13 4.38 8.60
16 0.05 0.98 7.93 4.57
//
//
NA Dfd
PO A C G T
01 0.50 1.66 0.07 68.40
02 52.59 9.34 8.31 0.39
03 69.57 0.66 0.00 0.40
04 2.22 0.14 0.41 67.86
05 0.44 0.18 23.53 46.49
06 36.75 5.44 26.74 1.70
07 16.27 4.86 18.49 31.01
08 8.79 3.43 17.07 41.35
09 1.40 3.62 29.62 36.00
10 1.89 20.88 10.86 37.00
11 30.75 25.66 13.32 0.91
//
//
NA Dref
PO A C G T
01 1.28 2.13 14.78 18.90
02 5.33 12.15 12.68 6.92
03 4.99 8.72 21.15 2.22
04 10.42 6.71 18.00 1.95
05 22.25 0.51 10.62 3.70
06 15.72 3.00 0.00 18.36
07 26.44 3.26 4.01 3.38
08 10.33 5.61 9.50 11.64
09 8.67 18.41 0.18 9.83
10 2.83 0.84 0.24 33.17
11 35.50 0.91 0.60 0.08
12 0.35 0.08 1.05 35.60
13 0.22 34.76 0.79 1.31
14 4.00 0.88 31.28 0.93
15 23.50 6.09 0.33 7.16
16 4.83 1.79 1.77 28.69
//
//
NA E-spl-
PO A C G T
01 0.26 16.93 0.00 0.00
02 16.31 0.88 0.00 0.00
03 0.00 11.13 0.00 6.05
04 0.00 0.00 10.52 6.67
05 8.95 2.38 0.00 5.86
06 0.21 0.07 16.91 0.00
07 8.38 8.81 0.00 0.00
08 0.00 17.07 0.12 0.00
09 17.13 0.05 0.00 0.00
10 0.81 13.88 2.38 0.12
11 8.89 0.00 8.22 0.07
12 8.08 0.07 2.31 6.72
13 0.21 2.38 14.60 0.00
14 0.00 8.45 8.74 0.00
15 11.34 0.00 0.00 5.85
16 0.00 0.00 2.58 14.60
//
//
NA Eip74EF
PO A C G T
01 26.64 2.84 3.15 0.66
02 28.55 3.74 0.18 0.82
03 13.77 1.02 15.46 3.05
04 5.12 14.05 4.35 9.78
05 14.07 17.06 1.63 0.52
06 31.86 0.47 0.07 0.89
07 15.27 1.33 14.82 1.88
08 9.00 10.66 5.19 8.44
09 16.44 0.08 3.58 13.18
10 8.17 0.00 14.74 10.38
11 7.69 7.01 16.85 1.75
12 13.60 6.89 2.36 10.44
13 2.20 0.34 26.98 3.77
14 4.30 0.43 2.94 25.63
15 4.05 2.54 3.78 22.93
//
//
NA HLHm5
PO A C G T
01 6.96 4.69 0.00 0.00
02 0.00 3.00 0.00 8.65
03 0.00 6.96 4.69 0.00
04 0.00 11.65 0.00 0.00
05 4.69 0.00 0.00 6.96
06 0.00 0.00 0.00 11.65
07 0.00 0.00 6.96 4.69
08 0.00 0.00 0.00 11.65
09 0.00 0.00 11.65 0.00
10 4.69 0.00 6.96 0.00
11 0.00 11.65 0.00 0.00
12 4.69 0.00 0.00 6.96
13 0.00 11.65 0.00 0.00
14 0.00 0.00 11.65 0.00
15 0.00 0.00 0.00 11.65
16 0.00 0.00 11.65 0.00
//
//
NA His2B
PO A C G T
01 0.41 0.61 0.53 21.43
02 0.00 0.00 0.00 22.97
03 22.97 0.00 0.00 0.00
04 0.00 22.97 0.00 0.00
05 0.00 22.97 0.00 0.00
06 0.00 0.00 0.00 22.97
07 22.97 0.00 0.00 0.00
08 22.97 0.00 0.00 0.00
//
//
Jul 15 '07 #21
aboxylica
111 100+
i donno how to incorporate the length of the sequence each time.its gonna change.this is a part of my sequence file:
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGAT GTCGTTTTTGGGTTTCATACCTTTTCAC
ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATT TAGCTTCGGATGTCAACGTCACCATAAT
CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTG TAAACGCGATGATAATTACAAATATGCC
TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATAT TGAAATTCCGATTCCCTAGAAAATAATA
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTAC TCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAAT GTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATG CTCTGGG
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATACAAAAATTTGTTGTGTTTGAATTGA ATTAAGAGCTTATCAAGAAAAAAATTTC
AGTGACTCATAATACACTACTCTACAAGTTTAAATTGAATCAACAATTTA ACTTTCATTGCTCAGGTTTTTAGTAACA
ATGTTTATATAAGTTTAGGTATAACAAATGATTTAAATATAAGATACTGT ATTTCACATTGAGACGAAACAATCCACC
GAAAATCATAAAATATAAGAATGTTGCATTTTATTTTTAAAAATAAAGAT GCCTTTTAAGAGGAATAACTTAAATGTC
TTTAATACCTTTGAATTTAATTATATGGCTAATAAACACAAACTTAAAGC TTAAAACTGCATCGAATTGAATGCGGTT
ATAAATGTACTTATATATCTAATATAATCTGCTAATATGGTTTACATGGT ATATCTTTCTCGGAAATTTTTACAAAAA
TTATCTATTCATATATCTCGAGCGTAAGATATTTATCAGTTTATAGATAA CATCTTTAAATTTGGGTGATTAAAAAAA
AACATTG
>Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGA AATGGAGATGGATCACGTAGCCGGCCAT
GGCGG
>Him_distal|Drosophila melanogaster|Him|FBgn0030900|X:18039896..18043470
GGTTTTCTGCGATGGCTTCCGCGCCAGCTGAAGTATCTGATTTGCTGCCT TGTTTTTGTTGATATTTCTGCGAAGGGA
CTTGTGCTTTTCAAATGGCCTTTTTTTGGGATTACGGCAAGGGCGCGTTT CCCACGCTCGATCCCCACTTACCATTGG
TGCACGCGATTGCGGCAAGCTGCTGAGGCAAGCTATTAAACGCCACACTG GGCCGGGGGGCGGTACCGGTGGGCGTGG
CAGGGGAGTCGACACATGTTGTGTGCCAGAGAACTTTGCTCCGATCCCCA GATCATCAAATAGTTGTCGCTGTCTGCT
CGTGCGCAAATTGCAATACTTTGCATACCCTTACTGCAGGGTATCTGAGC TTGGACTTTAAATAAGGGGGTATAACAT
AGCTTATACTCTCTATCTCTGTTATAAAGTCAATTTTCCTTAGATCTTTA GTACAGTGGGTAGTTAAGGAGACATAAC
TTCCAAAAAAAAAAACTATAAAATTGCAATAATTTATGCAAAATATGTAT TTTATTGAATGGGATGAATAATTTACCT
TATACGACTGTAAAACATTTCTAACGATTAAATGCACTTCTAAAAGTTTT CCCACAAGTAGGTGAGCTATTATGCTAA
GCGTTCCATGACTTGGAATCTAAGATCTTGTTTTGATCTTCGCTGATCTT TGAGAACTCGGGGATTACTTACACATTT
CTGGGCAGGCACAAGTGGGCCGAGGCAGTGTAGATTCATCACGTTTTCAC TCAACACACGCAGCTCATTAACAGCCCC
GCTGACAACTTGTCAGGACTTCCCCCTCGTGAATCCCCCTGCTACGCAAC CCCCATTCCCCGCCCATTCCAACACTTC
CCGCCGGGAGCGTGGGAAATTATGCGTGTTGGTGGGACGTCGGGCGGTGA AAATTGGCGCGCTCTTCGGGGGGCCACA
CCGCGTGGCATTGACAACTCTTCCACATTTCGCGCCCAACGATGCGTTGG CATCAGTGGGTCACAGGGATTACGGCTG
GCTGGGATTCCAGAGCCAGATCTTTTTCAGCCAAAACTTTCAGCTTTCGA AGACCTCAAGCGATAGGAGAGTGTCGGA
AGTCCAGAAATAGACGCGTAGCACATAAATTATGGATCGTATCGAGTATC GATTAGCCCGGGACAAGCGAAGCGATAG
GGAGACATATTTTTATTACCCTCTCGGGGACCTGCACTTGTTGGCTTCGC TTCTATGAAAGATCCCTCTACCATATCA
CGTATGTGGGCTCCCCCAATCGAACCGAGTTGTGGGAAATGTTTTCCCAG GCCAACAGCTAATTGTCACTCCAAGGGT
TGTCCCCGCAGCCCAGACGACAGATAAGCGGGCAAGTGAAGCCCAGCGAT CTGAGTCAAGTGAAGGGCTTCAATTTCT
TTCCCGAGTGGAACTGGGATATCGAAATTACATTTGTAACAGACGTTTTA GTCCGCAATCCTCAGCTAATGGGACTTA
CGAACATATATTCATCTGAAATTCAAGAACATGCGCACTTAAAGAGCAGG GAAGTCGCACACGCGCAAGTCAGGCGCT
CAAAAAGGGATCTTCGGAGGTACAGTGGGCAAAAGACTGTAAATAAATAA TATAAATAAAATAATATTTAGCTCTATG
TGTTTATATAATCTACAAAGTAGTTAACAAAAAATATAAAATGGATATAA AAATACATCTTATATATCCCTATAATAA
GAAATAAATAATAATTTTAGTAAATTAATTTTGTTACACAAAGTACCTGT ATTATTACCTCTTTTTTGTTGGTTGGTT
CTTTTTTGATGTGGCCCCACTGTGCTCTCTTATCAGTGCGACAATCAGGC ATTGCCTTTCCCCATCGGGGGATTCTAA
TTCCGTGGACGATGGGCCGAAACGCCTATAAAGTCGCTCATTAAAAATGT TTAATTATGGCCCATCTTGCATCTTGCA
CCGATGTGGATGGGGTTTGTCGGCAATGATTTACATTATAAAAATGCCCG TTATCTGAGCATTTTGTACGCTCCACTC
CCTCTTCCCCCCTCCAAAAAAAAAAAAAACAGATATGTATATTCCCCGAG ATATTCCCAAGCGGCCAAAAATAGACGC
AAATTGTAACGCACTTGAAGTGCACTCTGAAACATCTTGAAGTCCAAATA AAATAGCAGAGAGACCCACAATAATATA
CGTTGATATACACATGTATATATGTATGTATGTACATAAAGGGCCAGGAG CAGGAACGTTAGGCATGCGGTGGTACGA
GCACCGTGGTGCGAGCGAGAGCGCTGTGCTGCCTGAGGGAGAGGTAGCGA GTGGGTTGCATTGCGCACACAGAACATG
TGAATGCAGAGTTCAAGTGCATGCCGTGACACAGACACGCACACACACAC ACGCACACACAGATGAGTAGCCGCTGCA
AAGTGTTTTTTCCCAGGCGCTATTTATAATATGCATCCCGTCGCCGATCC GATCCGATCCAATCCAATCCGATTGGAT
CCCATCTTGCGGCACTACGATTATGACGCTCGACACGATGATGCATTCGC AGAGTTTCCCGATCGCAGAGTACCCTGT
ACTCGAGTAGTTTTTAGATGCAGTATTATTAAGTAGAAAATTGTAACCGT ATAATATTCCATTATATTAAATATTTTT
ATAGCACTAAAGAAATAAAAGCCCATTTTATAATTTATATTACAAAAATA CTTAACCATAGAAACTTATGATATGATA
CCAATATTTAAGTTCCAAAAAATGTAGAACATTTTTAAGTATATACTCGA AAATATTAATTTTCAAAATTGATATTCA
AGAGATATTATAAAAAGATCCCCATTCTAAATATCTAACATCATGCCATG CTTTCTAATGAGTATAGTATACCCCTGC
TACCCTGTCAATCCGCAAAACAGGCGCCGAAACATGCGGTTTCTCGCAGC AGACTGCCACGGGAAAAATTCGGTTCGA
GATTTGGGAATGGATGTATGACGGAGCAGAAGGAGCAGGACCCGGATTTC GGATTTCGGAATGGATATGGAAATGAAG
ATGGAAATGGGACTTTGACTGCGCGACGGCCACATGCGCCGCTGGCGATG CCGCTGGATGTTGCATGTGGCAGCGGTC
GGTGCAGCAGCGAAAGTGTTGCAGCTGTATGAGAGGGTCTATTTTTGGGG CGATTGTGCGGCGCTGGTGCTGCCACAT
GTGTTCTGTGTTGGGCTGCTAAAAGGCATTGTAATGAGAGCAGAAAATAG AATTGACTCCACTTGAGCAATGTCCCAT
AAAGCGGGAGTTTCGAGTTTGGCGCGCAATGTGCCGCACCAGCAAACGAA CAAAAGAAAAAAAAAAAAAAAAAACACA
GCCAGTAACACATGGGCCCACGAGTTATGTTTTATTTTTAATCCCACAAA GAGTCGATCTCCAAAACAAACCCGCAGA
GAGCACATATAAAGAGACTCGGTGGACGAGTGGTTCGAAACAGTCTTCCG CCGCAGCTCGACGCGCTCGCATATCGGG
AATATATAGATCGGAGATATCGCAGGACCCACAGCAGAGCAGAGCCGCAG AGCCACCAACCTCG
>Him_proximal|Drosophila melanogaster|Him|FBgn0030900|X:18041232..18043470
GCCCAGACGACAGATAAGCGGGCAAGTGAAGCCCAGCGATCTGAGTCAAG TGAAGGGCTTCAATTTCTTTCCCGAGTG
GAACTGGGATATCGAAATTACATTTGTAACAGACGTTTTAGTCCGCAATC CTCAGCTAATGGGACTTACGAACATATA
TTCATCTGAAATTCAAGAACATGCGCACTTAAAGAGCAGGGAAGTCGCAC ACGCGCAAGTCAGGCGCTCAAAAAGGGA
TCTTCGGAGGTACAGTGGGCAAAAGACTGTAAATAAATAATATAAATAAA ATAATATTTAGCTCTATGTGTTTATATA
ATCTACAAAGTAGTTAACAAAAAATATAAAATGGATATAAAAATACATCT TATATATCCCTATAATAAGAAATAAATA
ATAATTTTAGTAAATTAATTTTGTTACACAAAGTACCTGTATTATTACCT CTTTTTTGTTGGTTGGTTCTTTTTTGAT
GTGGCCCCACTGTGCTCTCTTATCAGTGCGACAATCAGGCATTGCCTTTC CCCATCGGGGGATTCTAATTCCGTGGAC
GATGGGCCGAAACGCCTATAAAGTCGCTCATTAAAAATGTTTAATTATGG CCCATCTTGCATCTTGCACCGATGTGGA
TGGGGTTTGTCGGCAATGATTTACATTATAAAAATGCCCGTTATCTGAGC ATTTTGTACGCTCCACTCCCTCTTCCCC
CCTCCAAAAAAAAAAAAAACAGATATGTATATTCCCCGAGATATTCCCAA GCGGCCAAAAATAGACGCAAATTGTAAC
GCACTTGAAGTGCACTCTGAAACATCTTGAAGTCCAAATAAAATAGCAGA GAGACCCACAATAATATACGTTGATATA
CACATGTATATATGTATGTATGTACATAAAGGGCCAGGAGCAGGAACGTT AGGCATGCGGTGGTACGAGCACCGTGGT
GCGAGCGAGAGCGCTGTGCTGCCTGAGGGAGAGGTAGCGAGTGGGTTGCA TTGCGCACACAGAACATGTGAATGCAGA
GTTCAAGTGCATGCCGTGACACAGACACGCACACACACACACGCACACAC AGATGAGTAGCCGCTGCAAAGTGTTTTT
TCCCAGGCGCTATTTATAATATGCATCCCGTCGCCGATCCGATCCGATCC AATCCAATCCGATTGGATCCCATCTTGC
GGCACTACGATTATGACGCTCGACACGATGATGCATTCGCAGAGTTTCCC GATCGCAGAGTACCCTGTACTCGAGTAG
TTTTTAGATGCAGTATTATTAAGTAGAAAATTGTAACCGTATAATATTCC ATTATATTAAATATTTTTATAGCACTAA
AGAAATAAAAGCCCATTTTATAATTTATATTACAAAAATACTTAACCATA GAAACTTATGATATGATACCAATATTTA
AGTTCCAAAAAATGTAGAACATTTTTAAGTATATACTCGAAAATATTAAT TTTCAAAATTGATATTCAAGAGATATTA
TAAAAAGATCCCCATTCTAAATATCTAACATCATGCCATGCTTTCTAATG AGTATAGTATACCCCTGCTACCCTGTCA
ATCCGCAAAACAGGCGCCGAAACATGCGGTTTCTCGCAGCAGACTGCCAC GGGAAAAATTCGGTTCGAGATTTGGGAA
TGGATGTATGACGGAGCAGAAGGAGCAGGACCCGGATTTCGGATTTCGGA ATGGATATGGAAATGAAGATGGAAATGG
GACTTTGACTGCGCGACGGCCACATGCGCCGCTGGCGATGCCGCTGGATG TTGCATGTGGCAGCGGTCGGTGCAGCAG
CGAAAGTGTTGCAGCTGTATGAGAGGGTCTATTTTTGGGGCGATTGTGCG GCGCTGGTGCTGCCACATGTGTTCTGTG
TTGGGCTGCTAAAAGGCATTGTAATGAGAGCAGAAAATAGAATTGACTCC ACTTGAGCAATGTCCCATAAAGCGGGAG
TTTCGAGTTTGGCGCGCAATGTGCCGCACCAGCAAACGAACAAAAGAAAA AAAAAAAAAAAAAACACAGCCAGTAACA
CATGGGCCCACGAGTTATGTTTTATTTTTAATCCCACAAAGAGTCGATCT CCAAAACAAACCCGCAGAGAGCACATAT
AAAGAGACTCGGTGGACGAGTGGTTCGAAACAGTCTTCCGCCGCAGCTCG ACGCGCTCGCATATCGGGAATATATAGA
TCGGAGATATCGCAGGACCCACAGCAGAGCAGAGCCGCAGAGCCACCAAC CTCG
>Obp18a_prom|Drosophila melanogaster|Obp18a|FBgn0030985|X:18969778..189727 46
ATGGCGAAAATCTGTTTCCCAACTAACAATGAGCGCATCATCACAGCTCT ATATATATAACCCATCGATTTGCTAATT
CAGCTCAAAAGTAGACAGGAGATTTTAATTAAATAATTGGATGCTACTTT ACATTCGCCACACACCAACAAATAAAGT
CTATAATTGAAATTTTAAGCGCAGTTCCCGATTATGAGCTACACGTATGT CGTATGCGCAATATCTGCATTACAATTG
CCAATAGTAAATTACCAACTTGGTTTTCTTCATATTTATTAAGATAGAAA ACATACAATTTTTGGCTTTTACACTCCA
AGCATCTCTGAAGTTTAAACAAAAAACATATGTGTAGCCTATCTACTGTA TTGGACTTTATTCGTATATTTTATATGG
TTCATTAATATAGGTATAAATACAAATTATATTCACGCTTTGCGATTTGC AGCGAATATCACATCTTATACACGATGT
AAAAAAAAAAAAAATATTTCGTCATGTTTTTAGGTTGGCCGCAGGCAGTG CTCACTGTACCGCCACAATGTTTATCGT
TTTGCATTTTTTTTTTCTTTGTTTTCTTGCGGTTTCCCCTAATTATCTTT AGTATAAACTTAGTCTACTGTCTTTTTT
GGTAAGTATTTTCGTGATGGGCTCGTCTATGCGAATTCCCATTTCCAATG AATAAATAAAGTAATTAGAACATTAAAA
TTAGCAATAAAACACGTACATTTAAAGCTGACAACAAAAAAAAAAAGTAT TCTTATGTTAAACTGTAGTATGTGCCTA
TGCAATATTAAGAACAATTAAATAAAATAGCATATTAACTTATGGCAGCA CTTTGTTGCTATGTTTATGTTTATGTTT
ATGCACGCAGTTAGGCCAGGGCGGATGTAACATGATCACCCACTCGAAGG CAAAAAGTATAAGTGCATGGTCAGCATT
CACACGCCGACCAAATACATATTACATACGTACATACATATCTCGCTCTC CCGATAAGCCTAGATATATAAGATATAC
ATAAGAACGCCGCTCCGCTGCTGGCGTACCCGGCAGCGCAGCTACGCGGA TTAGCCTAAGTCCAAATATATTAAAAAC
TGTAAAATCAGAGAGACTCTGTAGACGTTGAGCTGACAGAACCATTTCTG CCTACTCTAAAATCAAAAGAAGAAATTG
AATAAATATATGTCAGCCCGACGGCTGCCTTCAACTTAAAACGGACTTGT GTTCTGAATTGGAGTTCATCATTACATG
GCGACCGTGACAGTCGTCCAACGCTGGACGAATTGACCAAAGCTGGTGAA AACAAAGGAACAAAGGAACACTGGACTG
GAAGAAGACTGGACTAATTAAATGGAACTGCAAAAACCAAGGAAAAATCT GAGTGAGTAGAGTTCTATTGAGTATGGG
CAAACACCGTGGCGGTTTGAAAACTAAGCTGAATAAACGTATAGCCCACG TAAGGTGGCTAATATACGGTCAGCAAAC
GCCACCGGTTTGGTCGAAAGCTCTAAAGCTACATGCAGAGCTAGACCACT TGTTGCAATATCAGCAAGAATTAAAGAC
CCATAAGCTCGAGAAAACTCACTCAGATAATATTAAAAATATACCCACAA TTAATGAAGTTCCAAAATACCAGGCATG
TCCAGCACCAGCACCAGCATTAACAAAACCAAAGAAGTCCTGCCCCCCTG GCTGCGAAGGAATCTGGAGTCCCCACTG
CCTGGGGACTTGTGAGCGACCATCGACGTCTTCAGCGGCGAAGAAATAGA CAGCAGCGAGGGAGTGTCAGCGTGCCAC
CCCCGGCGACGCCCAGCTGACACCTGATGAGCATCATCAACAGCAGAATA TAATAATAAATATATATAAATATAAAGT
AAATATAAAATATATATAGATAAGAAAAATTGTAAGAAATATTGTAAAAC GGAGCATATACTATTATGCCCTGTTAAC
CCAATATGGCCCGTGAAGCCATAGCTAGAATCAGGCAGGCAACAATGTAA AATACAATTTTTTTTTACTCTTGCGAAC
ATTGAAAGATTTTATAAATAGATAATTCCAAACATAAATGTCTATAGAGA CAAATGAAATAAGTAAAACTGAAAATAA
AAGTATATACAAAGGAAATTTTCTATTCTATTCTCCAAAATATAAAATTA GTATACCCAAAATGGGTCTAATAGACAC
TAAAACTGTGGACTCTACAGCCAATGTAATAAATAAAGTAGAAGTCCAAA ATGCAGACTTGTTCTGGATAACCATAAT
ACTAATTGTAATTGCATTAATTATGGTATCCAATGCATTAATAAAAATAT ACAAACTGCATAACAAGTGTCTTAAGAA
ACGATACCGTAGCACTGCTAACGGTATAGATAATATTTAAGGAAGATCTT TAATAAAGTCAATTATGAATGAAAATAT
GAGAAAAATTATATGAAAAAAAAAAAATAATAAATAAAAAAAAAAATATA AAACGTAATATTGAATTTATCTACGTTA
AAAAAAAAAATATATACAAATGAATAAATTTGAAGTTATGAGTATACCAC AGCATGGACTGGGAAAAGCTTGTTGATC
AGATAAAAGATCAAAATGAAAATTTCAGAAAATCCTATAAGTGCTTAACG CAAAACAGATCAACACAAGCTGTAACAA
TCAATAGGAATGCCCAAGTCTTGGTAAATAGTTATAATGAAATCAGAGAG TTGATCCAACAAAATAGAAAGAATTTGG
AACGCAAACAGTGTGCTAAGGCTTTGAACCTACTGGTGACATTAAGAGAA AAATTAATATTTATAAAAAATAAATTCA
GTCTCCAGATAGAAATTCCAACCATAGTAAACACCCCACTAAGAATAAAT TTGAATGAAGACAGCACTAACTCTGACG
AGGAAGATAGGACTATAGTCAAGGAAGACATTAAAGAGGAAGATCTTCAC GATCTAACTATACCAGCAAAATTAATGC
TGAA
Jul 15 '07 #22
bvdet
2,851 Expert Mod 2GB
what i exactly should do is that. for every sequence(from the list of file) and every matrix(that is also a list of matrices in file form) i should calculate the scores.
the output should be in in the form of sequence header(this is the heading that follows the> symbol in the file),position of the sequence read",log value
for eg:
matrix1:
>ref1
01,2.0012
02,3.0047
.
.
.
>ref2
01,3.0047
02,8.0067
matrix 2
>ref1
01,2.0012
02,3.0047
.
.
.
>ref2
01,3.0047
02,8.0067

so ..on

the problem is for every sequence in the sequence file and for every position in the file, i am going to calculate the log value.is it clear now?
Not to me. I can't seem to follow the logic in your code:
Expand|Select|Wrap|Line Numbers
  1. def readfasta():
  2.     file1= open("chr011.py",'r')
  3.     file_content=file1.readlines()
  4.     first=1
  5.     list1=""    
  6.     for line in file_content:
  7.         if line[0]==">":
  8.             if first==0:
  9.                 print "***********"
  10.                 list1+=sequence
  11.                 print "***********"
  12.             else:
  13.                 first=0
  14.                 sequence=""
  15.                 seq=""
  16.                 for i in range(0,len(line)-1):
  17.                     seq+=line[i]
  18.         else:
  19.                 for i in range(0,len(line)-1):
  20.             sequence+=line[i]  
  21.     list1+=sequence
  22.     return list1
  23.  
  24. p=readfasta()
  25.  
  26. res=1
  27. part=""
  28. q=len(p)
  29. seqq=""
  30.  
  31. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  32. for i in range(q-16):
  33.     part=p[i:i+16]
  34.     seqq=part
  35.     res=1
  36.     score=1
  37.     for j in range(16):
  38.         key=seqq[j]
  39.         res=res*datadict1[key]["%02d"%(j+1)]
  40.         #print res
  41.     for key in seqq:
  42.         score=score * value[key]
  43.     #print score,"*******************",res
  44.     log_ratio=log10(res/score)
  45.     print i,log_ratio
I have modified function parseData() to include the header:
Expand|Select|Wrap|Line Numbers
  1. def parseData(fn, dataset=1, key='>'):
  2.     '''
  3.     Read a formatted data file of alpha sequences
  4.     Return a list of sequences
  5.     The first element in the list is the header
  6.     '''    
  7.     # initialize output list
  8.     dataList = []
  9.  
  10.     # open file for reading
  11.     f = open(fn)
  12.  
  13.     # skip to required data set
  14.     for _ in range(dataset):
  15.         try:
  16.             s = f.next()
  17.             while not s.startswith(key):
  18.                 s = f.next()
  19.         except StopIteration, e:
  20.             print 'We have reached the end of the file!'
  21.             f.close()
  22.             return False
  23.  
  24.     # initialize output list
  25.     dataList = [s,]
  26.  
  27.     for line in f:
  28.         if not line.startswith(key):
  29.             dataList.append(line.strip())
  30.         else:
  31.             break
  32.  
  33.     f.close()
  34.     return dataList
  35.  
  36. dataSeq = parseData(fnSeq, dataset)
  37. print dataSeq[0]
Output:
>>> >Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513

I have given you the code to parse the matrix data and sequence data so it can easily be manipulated. I don't understand the log calculation you want. Maybe someone smarter than me can figure it out.
Jul 15 '07 #23
elbin
27
Expand|Select|Wrap|Line Numbers
  1. for i in range(q-16):
  2.     part=d[i:i+16]
  3.     seqq=part
  4.     res=1
  5.     score=1
  6.     for j in range(16):
  7.         key=seqq[j]
  8.         res=res*datadict1[key]["%02d"%(j+1)]
This means that you take for granted that the subsequence you are examining is 16 characters long, and the matrix you are using is 16 lines too. But they are not all 16 lines. So you need to change this part to
Expand|Select|Wrap|Line Numbers
  1. len(datadict['A'])
for example.
And what do you mean by "integrate the length of the sequence"?

To bvdet: For the log see http://www.thescripts.com/forum/thread672978.html
Jul 15 '07 #24
aboxylica
111 100+
yes, sixteen is not fixed its gonna varry all through.these aspects confuse me.:(
as to how my code should be
and what i should exactly do is that
>seq1
atattatatat
>seq2
atatattatatata
>seq3
attattatatatatat
...so on..
weightmat1
po
values
weightmat2
values
weightmat3
values
Now
i am actually calculating the log odds ratio(u must be knowing since you are into this)
i calculate for each position the log value..
am i clear now??
you told me previously how to do it for one seq and one weight matrix..now there are multiple matrices and multiple sequences i have to calculate the logodds ratio for each position
am i clear?
waiting for ur reply
cheers
Jul 15 '07 #25
elbin
27
I think you already have all the needed code for this task, please make an effort and combine it, and you will get the result.
Jul 15 '07 #26
aboxylica
111 100+
okay il try that.but i don seem to be confident though.but il try.
thanks!
Jul 15 '07 #27
bvdet
2,851 Expert Mod 2GB
See if this is what you need:
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.  
  3.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  4.  
  5.     fnArray = 'arraydata.txt'
  6.     fnSeq = 'seqdata.txt'
  7.     dataset = 3
  8.     dataArray = parseArray(fnArray, dataset)
  9.     dataSeq = parseData(fnSeq, dataset)
  10.  
  11.     seq = ''.join(dataSeq[1:])
  12.     subKeys = dataArray['A'].keys()
  13.     subKeys.sort()
  14.  
  15.     i,j = divmod(len(seq), len(subKeys))
  16.     keys = subKeys*i + subKeys[:j]
  17.  
  18.     print dataSeq[0],
  19.     outList = ['%s[%s]*%s = %0.4f' % (s, keys[i], s, dataArray[s][keys[i]]*value[s]) for i, s in enumerate(seq)]
  20.     print '\n'.join(outList)
  21.     print sum([float(s.split('=')[1]) for s in outList])
Output:
Expand|Select|Wrap|Line Numbers
  1. >>> >Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
  2. T[01]*T = 0.0131
  3. C[02]*C = 0.0015
  4. T[03]*T = 0.0019
  5. A[04]*A = 0.0017
  6. G[05]*G = 0.0014
  7. A[06]*A = 0.2515
  8. G[07]*G = 0.0969
  9. A[01]*A = 0.0624
  10. T[02]*T = 0.0014
  11. C[03]*C = 0.0755
  12. T[04]*T = 0.2952
  13. G[05]*G = 0.0014
  14. G[06]*G = 0.0022
  15. G[07]*G = 0.0969
  16. C[01]*C = 0.0093
  17. A[02]*A = 0.0016
  18. C[03]*C = 0.0755
  19. G[04]*G = 0.0010
  20. A[05]*A = 0.0014
  21. T[06]*T = 0.0424
  22. G[07]*G = 0.0969
  23. G[01]*G = 0.1403
  24. C[02]*C = 0.0015
  25. G[03]*G = 0.0011
  26. A[04]*A = 0.0017
  27. G[05]*G = 0.0014
  28. A[06]*A = 0.2515
  29. C[07]*C = 0.0054
  30. A[01]*A = 0.0624
  31. A[02]*A = 0.0016
  32. A[03]*A = 0.1832
  33. G[04]*G = 0.0010
  34. A[05]*A = 0.0014
  35. T[06]*T = 0.0424
  36. G[07]*G = 0.0969
  37. C[01]*C = 0.0093
  38. G[02]*G = 0.1965
  39. G[03]*G = 0.0011
  40. C[04]*C = 0.0011
  41. G[05]*G = 0.0014
  42. C[06]*C = 0.0019
  43. A[07]*A = 0.1154
  44. A[01]*A = 0.0624
  45. A[02]*A = 0.0016
  46. A[03]*A = 0.1832
  47. T[04]*T = 0.2952
  48. C[05]*C = 0.0128
  49. G[06]*G = 0.0022
  50. G[07]*G = 0.0969
  51. A[01]*A = 0.0624
  52. A[02]*A = 0.0016
  53. A[03]*A = 0.1832
  54. T[04]*T = 0.2952
  55. G[05]*G = 0.0014
  56. G[06]*G = 0.0022
  57. A[07]*A = 0.1154
  58. G[01]*G = 0.1403
  59. A[02]*A = 0.0016
  60. T[03]*T = 0.0019
  61. G[04]*G = 0.0010
  62. G[05]*G = 0.0014
  63. A[06]*A = 0.2515
  64. T[07]*T = 0.0310
  65. C[01]*C = 0.0093
  66. A[02]*A = 0.0016
  67. C[03]*C = 0.0755
  68. G[04]*G = 0.0010
  69. T[05]*T = 0.2773
  70. A[06]*A = 0.2515
  71. G[07]*G = 0.0969
  72. C[01]*C = 0.0093
  73. C[02]*C = 0.0015
  74. G[03]*G = 0.0011
  75. G[04]*G = 0.0010
  76. C[05]*C = 0.0128
  77. C[06]*C = 0.0019
  78. A[07]*A = 0.1154
  79. T[01]*T = 0.0131
  80. G[02]*G = 0.1965
  81. G[03]*G = 0.0011
  82. C[04]*C = 0.0011
  83. G[05]*G = 0.0014
  84. G[06]*G = 0.0022
  85. 5.0655
  86. >>> seq
  87. 'TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCATGGCGG'
Jul 15 '07 #28
aboxylica
111 100+
okay.one thing I am doubtful about what does dataset refer to??
and in the last code the calculation u sent me. is it something like
A[01]*A which means your multiplying the normalised value of A at position one and dividing it by the standard A value??so please tell me. and which statement of the code does that??
what I should ba doing is
if i have a sequence like
>header
ATTTATTATATATATATTATTATAATTAAATAT
and using the matrix
calculate A[01]*T[02]*T[03]*.................divided by standard values which is
A=0.3,T=0.3.
C=0.2,G=0.2
so it should be done like A[01]*T[02]*T[03]*................./0.3*0.3*0.3............
for the sequence
then take a log for this value.Then move to the next window of the sequence
TTTATTATATATATATTATTATAATTAAATAT(I am just leaving the A) calculate the same way with T in the first position.
I have to do this way for all the sequences.
Jul 16 '07 #29
aboxylica
111 100+
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46. def random_seq(nchars,insertat,astring):
  47.     seq=""
  48.     for i in range(nchars):
  49.       if i== insertat:
  50.           seq+=astring
  51.       ch=random.choice(("ATGC"))
  52.       seq+=ch
  53.     print seq
  54.     return seq
  55.  
  56. thestring="CGTCAAGTTCAAGTGCAAAA"
  57. count=50-len(thestring)
  58. p=random_seq(count,15,thestring)
  59. file=open("temp.txt",'w')
  60. file.write(str(p))
  61. file.close()
  62.  
  63.  
  64.  
  65.  
  66. res=1
  67. part=""
  68. q=len(p)
  69. seqq=""
  70.  
  71. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  72. for i in range(q-16):
  73.     part=p[i:i+16]
  74.     seqq=part
  75.     res=1
  76.     score=1
  77.     for j in range(16):
  78.         key=seqq[j]
  79.         res=res*datadict1[key]["%02d"%(j+1)]
  80.         #print res
  81.     for key in seqq:
  82.         score=score * value[key]
  83.     #print score,"*******************",res
  84.     log_ratio=log10(res/score)
  85.     print i,log_ratio
  86.  
This is the code that works and calculates for a single sequence and a single matrix(containing 16 positions) I want to do it for many sequences and many matrices.I guess am clearer now.I have given how my sequences and matrices look like.I just need to generalize it.am i clearer now
waiting for ur reply
cheers!
Jul 16 '07 #30
aboxylica
111 100+
I dont seem to get a hang of it. it is not working for all the sequences and all the matrices.what do i do?
waiting for your reply,
cheers!
Jul 16 '07 #31
bvdet
2,851 Expert Mod 2GB

This is the code that works and calculates for a single sequence and a single matrix(containing 16 positions) I want to do it for many sequences and many matrices.I guess am clearer now.I have given how my sequences and matrices look like.I just need to generalize it.am i clearer now
waiting for ur reply
cheers!
You have a mistake in the code where the data is normalized:
Expand|Select|Wrap|Line Numbers
  1. datadict1=datadict.copy()
  2. for keymain in datadict:
  3.     for keysub in datadict[keymain]:
  4.         datadict1[keymain][keysub] /= (sum(values[int(keysub)-1])+4)
You had keysub where it should have been keymain. The division on the last line can be done with the division in place operator '/='.
Jul 16 '07 #32
bvdet
2,851 Expert Mod 2GB
Expand|Select|Wrap|Line Numbers
  1.  
  2. res=1
  3. part=""
  4. q=len(p)
  5. seqq=""
  6.  
  7. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  8. for i in range(q-16):
  9.     part=p[i:i+16]
  10.     seqq=part
  11.     res=1
  12.     score=1
  13.     for j in range(16):
  14.         key=seqq[j]
  15.         res=res*datadict1[key]["%02d"%(j+1)]
  16.         #print res
  17.     for key in seqq:
  18.         score=score * value[key]
  19.     #print score,"*******************",res
  20.     log_ratio=log10(res/score)
  21.     print i,log_ratio
  22.  
This is the code that works ......
There are multiple sets of arrays and sequences in the data files you presented. By data set, I am referring to one set of data in each file, and assume that the first set of array data and the first set of sequence data should work together, and the second set of array data .........

The code above is where I get lost. The sequence is 50 characters long. There are 16 sets of numbers for 'A', 'T', 'C', and 'G'. Why does the code generate 33 calculations?

I am trying to understand the calculation you want.
Example:
sequence = 'ATTC.........'
dataDict is defined with main keys of 'A', 'C', 'G', and 'T'
Each key has 16 sub keys, 01-16 in the first set of data

(dataDict['A']['01']/0.3) * (dataDict['T']['02']/0.3) * (dataDict['T']['03']/0.3) * (dataDict['C']['04']/0.2) * ..................

What is a sequence exactly? This is the first set of sequence data in your file:
Expand|Select|Wrap|Line Numbers
  1. >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
  2. CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
  3. GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
  4. CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
  5. CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC
  6. ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT
  7. CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC
  8. TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA
  9. CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG
  10. GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC
  11. TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG
  12.  
Is each line a separate sequence, or all the lines combined one sequence?
Jul 16 '07 #33
aboxylica
111 100+
This is my full code now.there seems to be problems.But I cant really figure it out!!

Expand|Select|Wrap|Line Numbers
  1. def parseArray(fn, dataset=1, key='PO', term='/'):
  2.     '''
  3.     Read a formatted data file in matrix format and
  4.     compile data into a dictionary
  5.     '''
  6.     f = open(fn)
  7.  
  8.     # skip to required data set
  9.     for _ in range(dataset):
  10.         try:
  11.             line = f.next()
  12.             while not line.startswith(key):
  13.                 line = f.next()
  14.         except StopIteration, e:
  15.             print 'We have reached the end of the file!'
  16.             f.close()
  17.             return False
  18.  
  19.     headerList = line.strip().split()[1:]
  20.     lineList = []
  21.  
  22.     line = f.next().strip()
  23.     while not line.startswith(term):
  24.         if line != '':
  25.             lineList.append(line.strip().split())
  26.         line = f.next().strip()
  27.  
  28.     f.close()
  29.  
  30.     # Key list
  31.     keys = [i[0] for i in lineList]
  32.     print keys
  33.     # Values list
  34.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  35.  
  36.     # Create a dictionary from keys and values
  37.     lineDict = dict(zip(keys, values))
  38.  
  39.     dataDict = {}
  40.  
  41.     for i, item in enumerate(headerList):
  42.         dataDict[item] = {}
  43.         for key in lineDict:
  44.             dataDict[item][key] = lineDict[key][i]
  45.  
  46.     # Add 1.0 to every element in dataDict subdictionaries
  47.     for keyMain in dataDict:
  48.         for keySub in dataDict[keyMain]:
  49.             dataDict[keyMain][keySub] += 1.0
  50.  
  51.     # Normalize original data (with 1 added) and update data
  52.     valueSums = [sum(item)+4 for item in values]
  53.     dataDict1=dataDict.copy()
  54.  
  55.     for keyMain in dataDict:
  56.         for keySub in dataDict[keyMain]:
  57.             dataDict1[keyMain][keySub] /= (sum(valueSums[int(keySub)-1]+4
  58.  
  59.         return dataDict1
  60.  
  61.  
  62. def parseData(fn, dataset=1, key='>'):
  63.     '''
  64.     Read a formatted data file of alpha sequences
  65.     Return a list of sequences
  66.     '''    
  67.     # initialize output list
  68.     dataList = []
  69.  
  70.     # open file for reading
  71.     f = open(fn)
  72.  
  73.     # skip to required data set
  74.     for _ in range(dataset):
  75.         try:
  76.             s = f.next()
  77.             while not s.startswith(key):
  78.                 s = f.next()
  79.         except StopIteration, e:
  80.             print 'We have reached the end of the file!'
  81.             f.close()
  82.             return False
  83.  
  84.     for line in f:
  85.         if not line.startswith(key):
  86.             dataList.append(line.strip())
  87.         else:
  88.             break
  89.  
  90.     f.close()
  91.     return dataList
  92.  
  93. if __name__ == '__main__':
  94.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  95.     fnArray = 'all_redfly.transfac.txt'
  96.     fnSeq = 'redfly_sequence.fasta'
  97.     dataset = 3
  98.     dataArray = parseArray(fnArray, dataset)
  99.     dataSeq = parseData(fnSeq, dataset)
  100.     seq = ''.join(dataSeq[1:])
  101.     subKeys = dataArray['A'].keys()
  102.     subKeys.sort()
  103.  
  104.     i,j = divmod(len(seq), len(subKeys))
  105.     keys = subKeys*i + subKeys[:j]
  106.  
  107.     print dataSeq[0],
  108.     outList = ['%s[%s]*%s = %0.4f' % (s, keys[i], s, dataArray[s][keys[i]]*value[s]) for i, s in enumerate(seq)]
  109.     print '\n'.join(outList)
  110.     print sum([float(s.split('=')[1]) for s in outList])
  111.  
  112.  
  113.  
waiting for ur reply,
cheers!
Jul 16 '07 #34
aboxylica
111 100+
There are multiple sets of arrays and sequences in the data files you presented. By data set, I am referring to one set of data in each file, and assume that the first set of array data and the first set of sequence data should work together, and the second set of array data .........

The code above is where I get lost. The sequence is 50 characters long. There are 16 sets of numbers for 'A', 'T', 'C', and 'G'. Why does the code generate 33 calculations?

I am trying to understand the calculation you want.
Example:
sequence = 'ATTC.........'
dataDict is defined with main keys of 'A', 'C', 'G', and 'T'
Each key has 16 sub keys, 01-16 in the first set of data

(dataDict['A']['01']/0.3) * (dataDict['T']['02']/0.3) * (dataDict['T']['03']/0.3) * (dataDict['C']['04']/0.2) * ..................

What is a sequence exactly? This is the first set of sequence data in your file:
Expand|Select|Wrap|Line Numbers
  1. >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
  2. CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
  3. GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
  4. CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
  5. CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC
  6. ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT
  7. CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC
  8. TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA
  9. CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG
  10. GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC
  11. TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG
  12.  
Is each line a separate sequence, or all the lines combined one sequence?
Each sequence refers starts with
>sign(list of A<T<G<C characters... and then second sequence starts with > sign followed by characters.. so ..on.. am i clearer??
Jul 16 '07 #35
aboxylica
111 100+
THIS IS THE FIRST SEQUENCE
:>CG9571_O-E|Drosophila
melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGAT GTCGTTTTTGGGTTTCATACCTTTTCAC
ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATT TAGCTTCGGATGTCAACGTCACCATAAT
CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTG TAAACGCGATGATAATTACAAATATGCC
TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATAT TGAAATTCCGATTCCCTAGAAAATAATA
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTAC TCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAAT GTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATG CTCTGGG
# THIS IS THE SECOND SEQUENCE
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATACAAAAATTTGTTGTGTTTGAATTGA ATTAAGAGCTTATCAAGAAAAAAATTTC
AGTGACTCATAATACACTACTCTACAAGTTTAAATTGAATCAACAATTTA ACTTTCATTGCTCAGGTTTTTAGTAACA
ATGTTTATATAAGTTTAGGTATAACAAATGATTTAAATATAAGATACTGT ATTTCACATTGAGACGAAACAATCCACC
GAAAATCATAAAATATAAGAATGTTGCATTTTATTTTTAAAAATAAAGAT GCCTTTTAAGAGGAATAACTTAAATGTC
TTTAATACCTTTGAATTTAATTATATGGCTAATAAACACAAACTTAAAGC TTAAAACTGCATCGAATTGAATGCGGTT
ATAAATGTACTTATATATCTAATATAATCTGCTAATATGGTTTACATGGT ATATCTTTCTCGGAAATTTTTACAAAAA
TTATCTATTCATATATCTCGAGCGTAAGATATTTATCAGTTTATAGATAA CATCTTTAAATTTGGGTGATTAAAAAAA
AACATTG
#THIS IS THE THIRD SEQUENCE
>Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGA AATGGAGATGGATCACGTAGCCGGCCAT
GGCGG
Jul 16 '07 #36
aboxylica
111 100+
and one more thing.The calculation is not corresponding.
not like first sequence and first dataset.
to make it simpler
just say i have first sequence I am going to find the log value using all the matrices..
like I get a set of log values for matrix 1,then matrix 2,matrix3.......so..on
Now, just realize that there are many sequences too..
so what il do is. for seqence one find all the values using all the matrices..
then move to sequence 2..find all the values using all the matrices..
then move to third sequence..find all the values for all the matrices..so on........
I think its better now??
Jul 16 '07 #37
bvdet
2,851 Expert Mod 2GB
This is my full code now.there seems to be problems.But I cant really figure it out!!

Expand|Select|Wrap|Line Numbers
  1. def parseArray(fn, dataset=1, key='PO', term='/'):
  2.     '''
  3.     Read a formatted data file in matrix format and
  4.     compile data into a dictionary
  5.     '''
  6.     .....................................
  7.     .....................................
  8.     print '\n'.join(outList)
  9.     print sum([float(s.split('=')[1]) for s in outList])
  10.  
  11.  
  12.  
waiting for ur reply,
cheers!
'parseArray' returns the array dictionary after adding 1 and normalizing each array element. 'parseData' returns a list of strings - the first string is the header and the rest comprise the sequence. Based on what you have communicated, this is part of the calculation printout:
Expand|Select|Wrap|Line Numbers
  1. >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
  2. C[01]/C = 1.922438822066
  3. C[02]/C = 0.207382828702
  4. A[03]/A = 0.939745715865
  5. G[04]/G = 0.746578183326
  6. T[05]/T = 1.200055302088
  7. C[06]/C = 0.207296849088
  8. C[07]/C = 0.327664869349
  9. A[08]/A = 2.918567675930
  10. C[09]/C = 0.217751970137
  11. C[10]/C = 0.207382828702
  12. G[11]/G = 0.207382828702
  13. G[12]/G = 0.207382828702
  14. C[13]/C = 0.279850746269
  15. C[14]/C = 0.709249274160
  16. G[15]/G = 3.334715885525
  17. C[16]/C = 0.300705101618
  18. C[01]/C = 1.922438822066
  19. G[02]/G = 2.285358772294
  20. A[03]/A = 0.939745715865
  21. T[04]/T = 0.138255219135
  22. C[05]/C = 0.207382828702
  23. T[06]/T = 0.138197899392
  24. A[07]/A = 0.145167980091
  25. T[08]/T = 0.138255219135
  26. T[09]/T = 0.138255219135
  27. T[10]/T = 0.138255219135
  28. A[11]/A = 0.138255219135
  29. T[12]/T = 0.138255219135
  30. A[13]/A = 2.209784411277
  31. C[14]/C = 0.709249274160
  32. G[15]/G = 3.334715885525
  33. A[16]/A = 1.137840453477
  34. G[01]/G = 0.207382828702
  35. A[02]/A = 0.138255219135
  36. G[03]/G = 0.207296849088
  37. G[04]/G = 0.746578183326
In the first calculation, dataDict['C']['01'] is divided by value['C']. Second - dataDict['C']['02'] is divided by value['C']. When we get to the 17th calculation, we start over with dataDict['X']['01']. Is this right?

If we multiply each element together,
so it should be done like A[01]*T[02]*T[03]*................./0.3*0.3*0.3............
we get an incredibly small number like:
Expand|Select|Wrap|Line Numbers
  1. >>> result
  2. 2.4243160673268744e-210
  3. >>> 
Jul 16 '07 #38
aboxylica
111 100+
yes, we have to multiply it together only
to be clearer
CCTATA..for simplicity let us store them seperately and then divide
numerator=C[01]*C{02]*c[01]..... this is for postion one
corresponding denominator will be C*C*T*A*T*T*A(the standard values should me multiplied)
formula is log(numerator/denominator)
so it will be
FOR POSITION ONE -->log ((C[01]*C{02]*c[01]..... )/ C*C*T*A*T*T*A(the standard values should me multiplied))
FOR POSITION 2--->log((C[01]*T[01]*A[01].../C*T*A
(the standard values here))
is it clearer?
waiting for ur reply,
cheers!!
Jul 16 '07 #39
aboxylica
111 100+
we should not divide individualy!!we should calculatte for the entire seuence in the first position)
if this is the ccccctaaaaaaa
get the corresponding values for all the positions say that will be numerator
then get the standard values for the denominator multiply them
then do log(numerator/denominator)
ok??we are not dividing individually!!
am sorry for all the trouble but am finding it difficult to do..thats y am bugging u!!
Jul 16 '07 #40
elbin
27
Bvdet, to make it easier for you too, this is the code generating the calculation (log value)

Expand|Select|Wrap|Line Numbers
  1. q=len(p)
  2. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  3. for i in range(q-len(datadict['A'])):
  4.     part=p[i:i+len(datadict['A'])]
  5.     seqq=part
  6.     res=1
  7.     score=1
  8.     for j in range(len(datadict['A'])):
  9.         key=seqq[j]
  10.         res=res*datadict[key]["%02d"%(j+1)]
  11.     for key in seqq:
  12.         score=score * value[key]
  13.     log_ratio=log10(res/score)
  14.     print i,log_ratio
where datadict is the normalized matrix dictionary, and p is the WHOLE sequence between the header line starting with '>' and the next header, as a string. So, for a sequence of 50 chars length and matrix of 16 lines you get 50 - 16 = 34 log values - from 0 to 33 in the initial example. That is because you evaluate with the matrix a moving window of X characters, where X is the matrix length. And each step the window moves with 1 character forward. Then it should be done as aboxylica said - for each sequence and each matrix.

Hope this helps, as I don't have time to do the whole thing... although there's not much to be done.
Jul 16 '07 #41
bvdet
2,851 Expert Mod 2GB
yes, we have to multiply it together only
to be clearer
CCTATA..for simplicity let us store them seperately and then divide
numerator=C[01]*C{02]*c[01]..... this is for postion one
corresponding denominator will be C*C*T*A*T*T*A(the standard values should me multiplied)
formula is log(numerator/denominator)
so it will be
FOR POSITION ONE -->log ((C[01]*C{02]*c[01]..... )/ C*C*T*A*T*T*A(the standard values should me multiplied))
FOR POSITION 2--->log((C[01]*T[01]*A[01].../C*T*A
(the standard values here))
is it clearer?
waiting for ur reply,
cheers!!
Now I am confused even more. Your examples are not consistent. The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.......... result in incredibly small numbers. The calculation result is 0.0/0.0!

For a sequence of CCTATA.........
Numerator = C[01]*C{02]*c[01]........ ?
Denominator = C*C*T*A*T*T*A......... ?
Jul 16 '07 #42
aboxylica
111 100+
oops... am sorry!!
Numerator = C[01]*C{02]*T[03]*A[04]*T[05]*A[06]........until the end of the sequence..we are adding one and then normalizing the matrix so the value will not be 0.0 at any cost..(the values which we are multiplying will be normalized values and not the original values in the matrix).
then for the same sequence we calulate the denominator which will be
0.2*0.2*0.3*0.3*0.3*0.3=denominator(these are simple standard values)
log(numerator/denominator) this is for position one.. then we will do for position two.. then...so on..until there are enough alphabets to calculate(like say the matrix one has 16 postions)then we should stop in a position where the sequence length is less than 16.
ok?


Now I am confused even more. Your examples are not consistent. The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.......... result in incredibly small numbers. The calculation result is 0.0/0.0!

For a sequence of CCTATA.........
Numerator = C[01]*C{02]*c[01]........ ?
Denominator = C*C*T*A*T*T*A......... ?
Jul 16 '07 #43
elbin
27
The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.......... result in incredibly small numbers. The calculation result is 0.0/0.0!
They are small, but divided they give a reasonable number between -5 and 5 (roughly). You have to use numbers like
Expand|Select|Wrap|Line Numbers
  1. >>> 0.3**16
  2. 4.3046720999999976e-009
but there is a result.
Jul 16 '07 #44
bvdet
2,851 Expert Mod 2GB
oops... am sorry!!
Numerator = C[01]*C{02]*T[03]*A[04]*T[05]*A[06]........until the end of the sequence..we are adding one and then normalizing the matrix so the value will not be 0.0 at any cost..(the values which we are multiplying will be normalized values and not the original values in the matrix).
then for the same sequence we calulate the denominator which will be
0.2*0.2*0.3*0.3*0.3*0.3=denominator(these are simple standard values)
log(numerator/denominator) this is for position one.. then we will do for position two.. then...so on..until there are enough alphabets to calculate(like say the matrix one has 16 postions)then we should stop in a position where the sequence length is less than 16.
ok?
The first sequence in the sequence data file is 759 characters long. The first array in the array data file has 16 elements. How many numbers are multiplied together? 759 or 16?
Jul 16 '07 #45
aboxylica
111 100+
The first sequence in the sequence data file is 759 characters long. The first array in the array data file has 16 elements. How many numbers are multiplied together? 759 or 16?
first sixteen elements
CCAGTCCACCGGCCGC(these are the first sixteen elements)the values are calculated numerator and corresponding denominator)this is for first position.
for the second position
CAGTCCACCGGCCGCC
then
so on until the end of the sequence..
Jul 16 '07 #46
bvdet
2,851 Expert Mod 2GB
first sixteen elements
CCAGTCCACCGGCCGC(these are the first sixteen elements)the values are calculated numerator and corresponding denominator)this is for first position.
for the second position
CAGTCCACCGGCCGCC
then
so on until the end of the sequence..
This seems to work. The example is run on the first data set in the array file and the third data set in the sequence file.
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.  
  3.     arraySet = 1
  4.     seqSet = 3
  5.  
  6.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  7.  
  8.     fnArray = r'H:\TEMP\temsys\data9.txt'
  9.     fnSeq = r'H:\TEMP\temsys\data12.txt'
  10.  
  11.     dataArray = parseArray(fnArray, arraySet)
  12.     dataSeq = parseData(fnSeq, seqSet)
  13.  
  14.     # This is the complete sequence  
  15.     seq = ''.join(dataSeq[1:])
  16.     # These are the subkeys of dataArray - '01', '02', '03',.............
  17.     subKeys = dataArray['A'].keys()
  18.     subKeys.sort()
  19.  
  20.     # Calculate num/den for each slice of sequence
  21.     # Each sequence slice length = length of subKeys
  22.     # Example:
  23.     # seq = 'ATCGATA'
  24.     # subKeys length = 3
  25.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  26.     numList = []
  27.     denList = []
  28.     seqList = []
  29.     for i in xrange(len(seq) - len(subKeys) + 1):
  30.         subseq = seq[0:len(subKeys)]
  31.         seqList.append(subseq)
  32.         num, den = 1, 1
  33.         for j, s in enumerate(subseq):
  34.             num *= dataArray[s][subKeys[j]]
  35.             den *= value[s]
  36.         numList.append(num)
  37.         denList.append(den)
  38.         seq = seq[1:]
  39.  
  40.     resultList = []
  41.     for i, num in enumerate(numList):
  42.         resultList.append(num/denList[i])
  43.  
  44.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
  45.     print 'Array set # = %d\nSequence set # = %d' % (arraySet, seqSet)
  46.     print 'Sequence Header: %s' % dataSeq[0]
  47.     print outStr
Output
Expand|Select|Wrap|Line Numbers
  1.  
  2. >>> Array set # = 1
  3. Sequence set # = 3
  4. Sequence Header: >Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
  5.  
  6. Sequence = TCTAGAGATCTGGGCA Calculation = 0.000520377928
  7. Sequence = CTAGAGATCTGGGCAC Calculation = 0.000011324924
  8. Sequence = TAGAGATCTGGGCACG Calculation = 0.000010676845
  9. Sequence = AGAGATCTGGGCACGA Calculation = 0.000043154836
  10. Sequence = GAGATCTGGGCACGAT Calculation = 0.000049390322
  11. Sequence = AGATCTGGGCACGATG Calculation = 0.000000078869
  12. Sequence = GATCTGGGCACGATGG Calculation = 0.003679435071
  13. Sequence = ATCTGGGCACGATGGC Calculation = 0.000004580993
  14. Sequence = TCTGGGCACGATGGCG Calculation = 0.000025964167
  15. Sequence = CTGGGCACGATGGCGA Calculation = 0.000190953272
  16. Sequence = TGGGCACGATGGCGAG Calculation = 0.000209084862
  17. Sequence = GGGCACGATGGCGAGA Calculation = 0.000349499483
  18. Sequence = GGCACGATGGCGAGAC Calculation = 0.000014551293
  19. Sequence = GCACGATGGCGAGACA Calculation = 0.000253266698
  20. Sequence = CACGATGGCGAGACAA Calculation = 0.000002088444
  21. Sequence = ACGATGGCGAGACAAA Calculation = 0.000085759837
  22. Sequence = CGATGGCGAGACAAAG Calculation = 0.000719291466
  23. Sequence = GATGGCGAGACAAAGA Calculation = 0.108603646410
  24. Sequence = ATGGCGAGACAAAGAT Calculation = 0.000022105017
  25. Sequence = TGGCGAGACAAAGATG Calculation = 0.074916911295
  26. Sequence = GGCGAGACAAAGATGC Calculation = 0.000654673006
  27. Sequence = GCGAGACAAAGATGCG Calculation = 0.002905350767
  28. Sequence = CGAGACAAAGATGCGG Calculation = 0.040711263424
  29. Sequence = GAGACAAAGATGCGGC Calculation = 0.000066332349
  30. Sequence = AGACAAAGATGCGGCG Calculation = 0.000844706696
  31. Sequence = GACAAAGATGCGGCGC Calculation = 0.001363986600
  32. Sequence = ACAAAGATGCGGCGCA Calculation = 0.000000158236
  33. Sequence = CAAAGATGCGGCGCAA Calculation = 0.000248960708
  34. Sequence = AAAGATGCGGCGCAAA Calculation = 0.000003482795
  35. Sequence = AAGATGCGGCGCAAAA Calculation = 0.000003790517
  36. Sequence = AGATGCGGCGCAAAAT Calculation = 0.000062906122
  37. Sequence = GATGCGGCGCAAAATC Calculation = 0.000000630359
  38. Sequence = ATGCGGCGCAAAATCG Calculation = 0.000041339176
  39. Sequence = TGCGGCGCAAAATCGG Calculation = 0.007412276588
  40. Sequence = GCGGCGCAAAATCGGA Calculation = 0.000109927284
  41. Sequence = CGGCGCAAAATCGGAA Calculation = 0.032381958151
  42. Sequence = GGCGCAAAATCGGAAA Calculation = 0.027066447384
  43. Sequence = GCGCAAAATCGGAAAT Calculation = 0.000038441301
  44. Sequence = CGCAAAATCGGAAATG Calculation = 0.016863436369
  45. Sequence = GCAAAATCGGAAATGG Calculation = 0.016099091359
  46. Sequence = CAAAATCGGAAATGGA Calculation = 0.000929346454
  47. Sequence = AAAATCGGAAATGGAG Calculation = 0.000186989034
  48. Sequence = AAATCGGAAATGGAGA Calculation = 0.003120608869
  49. Sequence = AATCGGAAATGGAGAT Calculation = 0.000031851876
  50. Sequence = ATCGGAAATGGAGATG Calculation = 0.000387934984
  51. Sequence = TCGGAAATGGAGATGG Calculation = 0.000028928662
  52. Sequence = CGGAAATGGAGATGGA Calculation = 0.858721770074
  53. Sequence = GGAAATGGAGATGGAT Calculation = 0.000032582474
  54. Sequence = GAAATGGAGATGGATC Calculation = 0.000194328378
  55. Sequence = AAATGGAGATGGATCA Calculation = 0.000000025115
  56. Sequence = AATGGAGATGGATCAC Calculation = 0.000005746845
  57. Sequence = ATGGAGATGGATCACG Calculation = 0.000000225826
  58. Sequence = TGGAGATGGATCACGT Calculation = 0.093243689191
  59. Sequence = GGAGATGGATCACGTA Calculation = 0.000581140752
  60. Sequence = GAGATGGATCACGTAG Calculation = 0.000002101908
  61. Sequence = AGATGGATCACGTAGC Calculation = 0.000016524721
  62. Sequence = GATGGATCACGTAGCC Calculation = 0.000029313806
  63. Sequence = ATGGATCACGTAGCCG Calculation = 0.000535232860
  64. Sequence = TGGATCACGTAGCCGG Calculation = 0.000015091041
  65. Sequence = GGATCACGTAGCCGGC Calculation = 0.000010864488
  66. Sequence = GATCACGTAGCCGGCC Calculation = 0.000023539371
  67. Sequence = ATCACGTAGCCGGCCA Calculation = 0.001552014384
  68. Sequence = TCACGTAGCCGGCCAT Calculation = 0.000000040841
  69. Sequence = CACGTAGCCGGCCATG Calculation = 0.000005420914
  70. Sequence = ACGTAGCCGGCCATGG Calculation = 0.000010765295
  71. Sequence = CGTAGCCGGCCATGGC Calculation = 0.002425152785
  72. Sequence = GTAGCCGGCCATGGCG Calculation = 0.000000198520
  73. Sequence = TAGCCGGCCATGGCGG Calculation = 0.000220954056
  74. >>> 
Can you verify these results?
Jul 16 '07 #47
aboxylica
111 100+
I dont think I can really verify but i think its happening right.in the entire file and using matrices we will get some values more than 2.u are taking the normalised values only right??? then it has to be right
Jul 16 '07 #48
aboxylica
111 100+
Then, I think only thing left is just generalising it that is when u donno number of dataset and number of sequences.is that easy??
Jul 16 '07 #49
aboxylica
111 100+
and what does this line mean??
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.  
Jul 16 '07 #50

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

8 posts views Thread by kaptain kernel | last post: by
5 posts views Thread by B-Dog | last post: by
10 posts views Thread by bienwell | last post: by
5 posts views Thread by Mark | last post: by
1 post views Thread by CARIGAR | last post: by
reply views Thread by suresh191 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.