looping through a big file containing a set of files.

111 New Member
Hey!
I have a program that takes two input files: one containing a weight matrix and one containing a sequence. My problem is that I now have to give it a matrix file containing many matrices and a sequence file containing many sequences, and calculate the same log score as I did for one matrix and one sequence.
It should work like this: for every sequence, calculate the log values for all the weight matrices, then move on to the second sequence and calculate all the log values using the matrices again.
My matrix file is huge and contains many matrices. Here is part of it:

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
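A minimal sketch of reading every matrix from a file laid out like the excerpt above in a single pass. It assumes that `NA` carries the matrix name, `PO` carries the column order, and `//` closes a block; the function name `parse_matrices` and the shortened sample are hypothetical, and the code is written for Python 3 rather than the Python 2 used elsewhere in this thread.

```python
def parse_matrices(lines):
    """Return a list of (name, columns, rows) for every matrix found.

    rows holds one list of float counts per position, in column order.
    """
    matrices = []
    name, columns, rows = None, None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith('NA'):
            name = line.split()[1]
        elif line.startswith('PO'):
            columns = line.split()[1:]
            rows = []
        elif line.startswith('//'):
            # '//' closes a block; ignore it if no rows were collected
            if columns and rows:
                matrices.append((name, columns, rows))
            name, columns, rows = None, None, []
        elif line and columns:
            # data row: position label followed by one count per column
            rows.append([float(x) for x in line.split()[1:]])
    # keep a final matrix that is not followed by '//'
    if columns and rows:
        matrices.append((name, columns, rows))
    return matrices


# Shortened sample in the same layout as the excerpt
sample = """//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
//
""".splitlines()

for name, columns, rows in parse_matrices(sample):
    print(name, columns, len(rows))
```

With a real file, an open file object could be passed in place of `sample`, since the function only iterates over lines.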

The sequence file is below (again, this is only part of my file). The actual sequence starts at "CC"; the line before it is just a header, which we omit. This excerpt contains two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTAGAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCACACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAAGGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
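A minimal sketch of splitting a sequence file like the one above into one (header, sequence) pair per `>` line, so that each sequence can later be scored on its own. The function name `parse_fasta` and the sample headers are hypothetical, and the code is written for Python 3.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-style lines."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line, []
        elif line and header is not None:
            # drop embedded blanks so the sequence is one unbroken string
            chunks.append(line.replace(' ', ''))
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


sample = [
    '>seq1 (hypothetical header)',
    'CCAGTCCACC',
    'GGCCG',
    '>seq2 (hypothetical header)',
    'AGTCGACC',
]
records = parse_fasta(sample)
for header, seq in records:
    print(header, len(seq))
```

Returning pairs keeps each header next to its sequence, which makes it easy to label the scores in the output.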
This is my code, which works (it prints the log values for one sequence and one matrix):
from math import *
import random

# Read the first matrix: skip to the 'PO' header line
f=open("deeps1.txt","r")
line=f.next()
while not line.startswith('PO'):
    line=f.next()

headerlist=line.strip().split()[1:]
linelist=[]

# Collect the data rows up to the terminating '//'
line=f.next().strip()
while not line.startswith('/'):
    if line != '':
        linelist.append(line.strip().split())
    line=f.next().strip()

keys=[i[0] for i in linelist]
values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]

array={}
linedict=dict(zip(keys,values))
keys = linedict.keys()
keys.sort()
for key in keys:
    array=[key,linedict[key]]

# datadict[base][position] = count
datadict={}
datadict1={}
for i,item in enumerate(headerlist):
    datadict[item]={}
    for key_ in linedict:
        datadict[item][key_]=linedict[key_][i]

# Add a pseudocount of 1 to every count
for keymain in datadict:
    for keysub in datadict[keymain]:
        datadict[keymain][keysub]+=1.0

# Normalize each position by its row sum plus 4 (one pseudocount per base).
# The outer loop variable must be keymain; the original reused keysub here,
# so only one base was normalized. copy() is shallow, so datadict1 shares
# the inner dictionaries with datadict.
datadict1=datadict.copy()
for keymain in datadict:
    for keysub in datadict[keymain]:
        datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)


def readfasta():
    file1= open("chr011.py",'r')
    file_content=file1.readlines()
    first=1
    list1=""
    for line in file_content:
        if line[0]==">":
            if first==0:
                print "***********"
                list1+=sequence
                print "***********"
            else:
                first=0
                sequence=""
                seq=""
                for i in range(0,len(line)-1):
                    seq+=line[i]
        else:
            # copy the line without its trailing newline
            for i in range(0,len(line)-1):
                sequence+=line[i]
    list1+=sequence
    return list1


p=readfasta()

res=1
part=""
q=len(p)
seqq=""

# Background base frequencies
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
for i in range(q-16):
    part=p[i:i+16]
    seqq=part
    res=1
    score=1
    for j in range(16):
        key=seqq[j]
        res=res*datadict1[key]["%02d"%(j+1)]
        #print res
    for key in seqq:
        score=score * value[key]
    #print score,"*******************",res
    log_ratio=log10(res/score)
    print i,log_ratio

What changes should I make, and how?
Waiting for your reply,
cheers!
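A minimal sketch of the score described in this post: each count gets a pseudocount of 1, each position is normalized by its row sum plus 4, and every window is scored as the log10 ratio of the matrix probabilities to the background frequencies. The function names are hypothetical, and two choices differ from the posted code and are assumptions: the window width is taken from the matrix instead of being fixed at 16 (the excerpt contains an 11-row matrix), and the logs are summed rather than multiplying the probabilities first. Written for Python 3.

```python
from math import log10

# Background frequencies taken from the post
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def to_probabilities(rows, columns=("A", "C", "G", "T")):
    """Convert count rows to per-position probabilities with a +1 pseudocount."""
    probs = []
    for counts in rows:
        total = sum(counts) + len(counts)  # +1 per base, so +4 overall
        probs.append({b: (c + 1.0) / total for b, c in zip(columns, counts)})
    return probs


def window_scores(seq, probs, background=BACKGROUND):
    """Return (offset, log10 ratio) for every full-width window of seq."""
    width = len(probs)
    scores = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(log10(probs[j][b] / background[b])
                    for j, b in enumerate(window))
        scores.append((i, score))
    return scores


# First two rows of the Abd-B matrix from the excerpt
rows = [[10.19, 0.00, 10.65, 6.24], [5.79, 0.67, 10.50, 10.11]]
probs = to_probabilities(rows)
for offset, score in window_scores("CCAG", probs):
    print(offset, round(score, 4))
```

Summing logs gives the same value as taking the log of the product, but it avoids very small intermediate products for long matrices. The loop bound `len(seq) - width + 1` also includes the last full window, which `range(q-16)` leaves out.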
Jul 13 '07
aboxylica
111 New Member
I am getting a "list index out of range" error!
But I can't add a -1 to the loop, can I? What's wrong with this?

from math import *
def parseArray(fn, dataset=1, key='PO', term='/'):
    '''
    Read a formatted data file in matrix format and
    compile data into a dictionary
    '''
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            line = f.next()
            while not line.startswith(key):
                line = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    headerList = line.strip().split()[1:]
    lineList = []

    line = f.next().strip()
    while not line.startswith(term):
        if line != '':
            lineList.append(line.strip().split())
        line = f.next().strip()

    f.close()

    # Key list
    keys = [i[0] for i in lineList]
    # Values list
    values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]

    # Create a dictionary from keys and values
    lineDict = dict(zip(keys, values))

    dataDict = {}

    for i, item in enumerate(headerList):
        dataDict[item] = {}
        for key in lineDict:
            dataDict[item][key] = lineDict[key][i]

    # Add 1.0 to every element in dataDict subdictionaries
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] += 1.0

    # Normalize original data (with 1 added) and update data
    valueSums = [sum(item)+4 for item in values]
    # print valueSums

    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]

    return dataDict


def parseData(fn, dataset=1, key='>'):
    '''
    Read a formatted data file of sequences
    Return a list of sequences
    The first element in the list is the header
    '''
    # initialize output list
    dataList = []

    # open file for reading
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            s = f.next()
            while not s.startswith(key):
                s = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    # initialize output list
    dataList = [s,]

    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break

    f.close()
    return dataList


if __name__ == '__main__':

    arraySet = 4
    #print arraySet
    seqSet = 4
    #print seqSet

    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    fnArray = r'all_redfly.transfac.txt'
    fnSeq = r'redfly_sequence.fasta'
    indxSeq=1
    while True:
        dataSeq=parseData(fnSeq,indxSeq)
        if not dataSeq:
            break
        indxArray=1
        while True:
                dataArray = parseArray(fnArray, arraySet)
                #dataSeq = parseData(fnSeq, seqSet)
                if not dataArray:
                    break
                # This is the complete sequence
                seq = ''.join(dataSeq[1:])
                # These are the subkeys of dataArray - '01', '02', '03',.............
                subKeys = dataArray['A'].keys()
                subKeys.sort()

    # Calculate num/den for each slice of sequence
    # Each sequence slice length = length of subKeys
    # Example:
    # seq = 'ATCGATA'
    # subKeys length = 3
    # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
                numList = []
                denList = []
                seqList = []
                for i in xrange(len(seq) - len(subKeys) + 1):
                    subseq = seq[0:len(subKeys)]
                    seqList.append(subseq)
                    num, den = 1, 1
                    for j, s in enumerate(subseq):
                        num *= dataArray[s][subKeys[j]]
                        den *= value[s]
                        numList.append(num)
                        denList.append(den)
                        seq = seq[1:]

                        resultList = []
                        for i, num in enumerate(numList):
                            resultList.append(log10(num/denList[i]))
                    indxArray+=1
                indxSeq +=1

                outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
                print 'Array set # = %d\nSequence set # = %d' % (arraySet, seqSet)
                print 'Sequence Header: %s' % dataSeq[0]
                print outStr

Jul 17 '07 #61
bvdet
2,851 Recognized Expert Moderator Specialist
Let's make a new function, iterate on it, and write the results to a file:
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
    # sequence factor dictionary
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    dataArray = parseArray(fnArray, arraySet)
    if dataArray:
        dataSeq = parseData(fnSeq, seqSet)
        if not dataSeq:
            return False
    else:
        return None

    # This is the complete sequence
    seq = ''.join(dataSeq[1:])
    # These are the subkeys of dataArray - '01', '02', '03',.............
    subKeys = dataArray['A'].keys()
    subKeys.sort()

    # Calculate num/den for each slice of sequence
    # Each sequence slice length = length of subKeys
    # Example:
    # seq = 'ATCGATA'
    # subKeys length = 3
    # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
    numList = []
    denList = []
    seqList = []
    for i in xrange(len(seq) - len(subKeys) + 1):
        subseq = seq[0:len(subKeys)]
        seqList.append(subseq)
        num, den = 1, 1
        for j, s in enumerate(subseq):
            num *= dataArray[s][subKeys[j]]
            den *= value[s]
        numList.append(num)
        denList.append(den)
        seq = seq[1:]

    resultList = []
    for i, num in enumerate(numList):
        resultList.append(num/denList[i])

    outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
    return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)

if __name__ == '__main__':

    fnArray = 'array.txt'
    fnSeq = 'seq.txt'

    outputfile = 'sequence_calc_data.txt'

    arraySet = 1
    outList = []
    calcdata = 1
    while not calcdata is None:
        seqSet = 1
        while True:
            calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
            if calcdata:
                outList.append(calcdata)
                seqSet += 1
            else:
                break
        arraySet += 1

    f = open(outputfile, 'w')
    f.write('\n'.join(outList))
    f.close()
This resulted in a 3.1 MB file. Following are the first few lines of the first and last compilations:
Array set # = 1
Sequence set # = 1
Sequence Header: >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133

Sequence = CCAGTCCACCGGCCGC Calculation = 0.000025722315
Sequence = CAGTCCACCGGCCGCC Calculation = 0.000000000318
Sequence = AGTCCACCGGCCGCCG Calculation = 0.000595631200
Sequence = GTCCACCGGCCGCCGA Calculation = 0.000120125057
Sequence = TCCACCGGCCGCCGAT Calculation = 0.000000089016
...........................
Array set # = 4
Sequence set # = 8
Sequence Header: >Obp19b_prom|Drosophila melanogaster|Obp19b|FBgn0031110|X:20224439..20227440

Sequence = ATTGCTGACGGGTCGA Calculation = 0.000005535136
Sequence = TTGCTGACGGGTCGAA Calculation = 0.000003984295
Sequence = TGCTGACGGGTCGAAT Calculation = 0.000053179344
Sequence = GCTGACGGGTCGAATG Calculation = 0.000031549069
.............................
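A possible variation on the driver above, sketched under the assumption that both files fit in memory: the matrices and sequences are parsed once, then every sequence is scored against every matrix in a plain nested loop, with the sequences in the outer loop as the original question asked. The function `score_all`, the tiny inputs, and the `Matrix:` label are hypothetical; `compileData` above reopens both files on each call, which this layout avoids. Written for Python 3.

```python
import io
from math import log10

# Background frequencies taken from the thread
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def score_all(sequences, matrices, out, background=BACKGROUND):
    """Write one block per (sequence, matrix) pair to the file-like object out.

    sequences: list of (header, sequence) pairs
    matrices:  list of (name, probs), where probs holds one
               {base: probability} dict per matrix position
    """
    for header, seq in sequences:        # outer loop: every sequence
        for name, probs in matrices:     # inner loop: every matrix
            width = len(probs)
            out.write('Matrix: %s\nSequence Header: %s\n' % (name, header))
            for i in range(len(seq) - width + 1):
                window = seq[i:i + width]
                score = sum(log10(probs[j][b] / background[b])
                            for j, b in enumerate(window))
                out.write('Sequence = %s Calculation = %0.12f\n'
                          % (window, score))


# Tiny hypothetical inputs, already parsed and normalized
matrices = [
    ('m1', [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
            {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]),
    ('m2', [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]),
]
sequences = [('>s1', 'ATG'), ('>s2', 'TTAT')]

buf = io.StringIO()
score_all(sequences, matrices, buf)
report = buf.getvalue()
print(report)
```

Passing an open file instead of the `StringIO` buffer writes the results to disk as they are produced, so the full report never has to be held in a list.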
Jul 18 '07 #62
aboxylica
111 New Member
This is the code. My output is an empty array. Why is this happening?

from math import *
def parseArray(fn, dataset=1, key='PO', term='/'):
    '''
    Read a formatted data file in matrix format and
    compile data into a dictionary
    '''
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            line = f.next()
            while not line.startswith(key):
                line = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    headerList = line.strip().split()[1:]
    lineList = []

    line = f.next().strip()
    while not line.startswith(term):
        if line != '':
            lineList.append(line.strip().split())
        line = f.next().strip()

    f.close()

    # Key list
    keys = [i[0] for i in lineList]
    # Values list
    values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]

    # Create a dictionary from keys and values
    lineDict = dict(zip(keys, values))

    dataDict = {}

    for i, item in enumerate(headerList):
        dataDict[item] = {}
        for key in lineDict:
            dataDict[item][key] = lineDict[key][i]

    # Add 1.0 to every element in dataDict subdictionaries
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] += 1.0

    # Normalize original data (with 1 added) and update data
    valueSums = [sum(item)+4 for item in values]
    # print valueSums

    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]

    return dataDict


def parseData(fn, dataset=1, key='>'):
    '''
    Read a formatted data file of sequences
    Return a list of sequences
    The first element in the list is the header
    '''
    # initialize output list
    dataList = []

    # open file for reading
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            s = f.next()
            while not s.startswith(key):
                s = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    # initialize output list
    dataList = [s,]

    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break

    f.close()
    return dataList


def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
    # sequence factor dictionary
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    dataArray = parseArray(fnArray, arraySet)
    if dataArray:
        dataSeq = parseData(fnSeq, seqSet)
        if not dataSeq:
            return False
        else:
            return None

        # This is the complete sequence
        seq = ''.join(dataSeq[1:])
        # These are the subkeys of dataArray - '01', '02', '03',.............
        subKeys = dataArray['A'].keys()
        subKeys.sort()

        # Calculate num/den for each slice of sequence
        # Each sequence slice length = length of subKeys
        # Example:
        # seq = 'ATCGATA'
        # subKeys length = 3
        # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
        numList = []
        denList = []
        seqList = []
        for i in xrange(len(seq) - len(subKeys) + 1):
            subseq = seq[0:len(subKeys)]
            seqList.append(subseq)
            num, den = 1, 1
            for j, s in enumerate(subseq):
                num *= dataArray[s][subKeys[j]]
                den *= value[s]
                numList.append(num)
                denList.append(den)
                seq = seq[1:]

        resultList = []
        for i, num in enumerate(numList):
            resultList.append(num/denList[i])

            outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)   for i, res in enumerate(resultList)])
            return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)

if __name__ == '__main__':

    fnArray =r'all_redfly.transfac'
    fnSeq = r'redfly_sequence.fasta'

    outputfile =  "sequence_calc_data.txt"

    arraySet = 1
    outList = []
    calcdata = 1
    while not calcdata is None:
        seqSet = 1
        while True:
            calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
            print calcdata
            if calcdata:
                outList.append(calcdata)
                seqSet += 1
            else:
                break
        arraySet += 1

    f = open(outputfile, 'w')
    f.write('\n'.join(outList))
    f.close()
    f=open(outputfile,"r")
    file_con=f.readlines()
    print file_con
    for line in file_con:
        print line

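One likely reason for the empty output, sketched with stand-in functions: in the listing above, `else: return None` is indented under the inner `if not dataSeq:` rather than under `if dataArray:`, so the function returns `None` whenever both files were read successfully and the calculation is never reached. The two functions below are hypothetical reductions of `compileData` that keep only this control flow. Written for Python 3.

```python
def check_misplaced(data_array, data_seq):
    """Control flow as posted: the else belongs to the inner if."""
    if data_array:
        if not data_seq:
            return False
        else:
            return None        # taken whenever both inputs are valid
        return 'computed'      # unreachable


def check_intended(data_array, data_seq):
    """Control flow of the earlier reply: the else belongs to the outer if."""
    if data_array:
        if not data_seq:
            return False
    else:
        return None            # only when the matrix could not be read
    return 'computed'


valid_array = {'A': {'01': 0.5}}
valid_seq = ['>header', 'ACGT']

print(check_misplaced(valid_array, valid_seq))
print(check_intended(valid_array, valid_seq))
```

Because the main loop stops as soon as `calcdata` is `None`, the first call already ends the run, nothing is appended to `outList`, and the output file is written empty.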
Jul 18 '07 #63
aboxylica
111 New Member
I seem to get a "list index out of range" error:
Traceback (most recent call last):
  File "newbie1.py", line 311, in <module>
    calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  File "newbie1.py", line 285, in compileData
    outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)for i, res in enumerate(resultList)])
IndexError: list index out of range
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fn, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.     if dataArray:
  206.  
  207.         dataSeq = parseData(fnSeq, seqSet)
  208.  
  209.  
  210.         if not dataSeq:
  211.  
  212.             return False
  213.  
  214.     else:
  215.  
  216.         return None
  217.  
  218.  
  219.  
  220.  
  221.     # This is the complete sequence 
  222.  
  223.     seq = ''.join(dataSeq[1:])
  224.  
  225.  
  226.  
  227.     # These are the subkeys of dataArray - '01', '02', '03',.............
  228.  
  229.     subKeys = dataArray['A'].keys()
  230.  
  231.     subKeys.sort()
  232.  
  233.  
  234.  
  235.  
  236.     # Calculate num/den for each slice of sequence
  237.  
  238.     # Each sequence slice length = length of subKeys
  239.  
  240.     # Example:
  241.     # seq = 'ATCGATA'
  242.  
  243.     # subKeys length = 3
  244.  
  245.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  246.  
  247.     numList = []
  248.  
  249.     denList = []
  250.  
  251.     seqList = []
  252.  
  253.     for i in xrange(len(seq) - len(subKeys) + 1):
  254.  
  255.         subseq = seq[0:len(subKeys)]
  256.  
  257.         seqList.append(subseq)
  258.  
  259.  
  260.         num, den = 1, 1
  261.  
  262.         for j, s in enumerate(subseq):
  263.  
  264.             num *= dataArray[s][subKeys[j]]
  265.  
  266.             den *= value[s]
  267.  
  268.             numList.append(num)
  269.  
  270.             denList.append(den)
  271.  
  272.             seq = seq[1:]
  273.  
  274.  
  275.  
  276.     resultList = []
  277.  
  278.     for i, num in enumerate(numList):
  279.  
  280.         resultList.append(log10(num/denList[i]))
  281.         print (resultList)
  282.  
  283.  
  284.  
  285.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)for i, res in enumerate(resultList)])
  286.  
  287.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  288.  
  289. if __name__ == '__main__':
  290.  
  291.  
  292.     fnArray ='all_redfly.transfac' 
  293.     fnSeq = 'redfly_sequence.fasta'
  294.  
  295.     outputfile =  "sequence_calc_data.txt"
  296.  
  297.  
  298.  
  299.     arraySet = 1
  300.  
  301.     outList = []
  302.  
  303.     calcdata = 1
  304.  
  305.     while not calcdata is None:
  306.  
  307.         seqSet = 1
  308.  
  309.         while True:
  310.  
  311.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  312.  
  313.             if calcdata:
  314.  
  315.                 outList.append(calcdata)
  316.  
  317.                 seqSet += 1
  318.  
  319.             else:
  320.  
  321.                 break
  322.  
  323.         arraySet += 1
  324.  
  325.  
  326.  
  327.  
  328.  
  329.     f = open(outputfile, 'w')
  330.  
  331.     f.write('\n'.join(outList))
  332.  
  333.     f.close()
  334.     f=open(outputfile,"r")
  335.     file_con=f.readlines()
  336.     print file_con
  337.     for line in file_con:
  338.         print line
  339.  
Waiting for your reply,
cheers!
Jul 18 '07 #64
bvdet
2,851 Recognized Expert Moderator Specialist
I am not sure why you add so many spaces in between the lines of code. I personally find it unreadable. Anyway, when you were adding all the spaces, some of the code ended up at the incorrect indentation:
........for j, s in enumerate(subseq):
            num *= dataArray[s][subKeys[j]]
            den *= value[s]
            numList.append(num)
            denList.append(den)
            seq = seq[1:]
SHOULD BE:
Expand|Select|Wrap|Line Numbers
  1. ........for j, s in enumerate(subseq):
  2.             num *= dataArray[s][subKeys[j]]
  3.             den *= value[s]
  4.         numList.append(num)
  5.         denList.append(den)
  6.         seq = seq[1:]
Jul 18 '07 #65
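To see why that indentation matters, here is a minimal sketch written in Python 3 syntax. The function names and the toy `scores` table are made up for illustration and are not the thread's data. It is also simplified: in the mis-indented original, `seq = seq[1:]` ran inside the inner loop as well, which shifted the sequence once per character instead of once per window.

```python
def window_products_wrong(seq, width, scores):
    """Append inside the inner loop: one entry per character (the bug)."""
    out = []
    for start in range(len(seq) - width + 1):
        num = 1.0
        for ch in seq[start:start + width]:
            num *= scores[ch]
            out.append(num)   # runs once per character
    return out


def window_products_right(seq, width, scores):
    """Append after the inner loop: one entry per window."""
    out = []
    for start in range(len(seq) - width + 1):
        num = 1.0
        for ch in seq[start:start + width]:
            num *= scores[ch]
        out.append(num)       # runs once per window
    return out


# Toy per-base values, chosen only so the products are easy to check by hand.
scores = {"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.5}
print(len(window_products_wrong("ATCG", 2, scores)))
print(len(window_products_right("ATCG", 2, scores)))
```

With a 4-character sequence and a width of 2 there are 3 windows, so the correctly indented version returns 3 products while the mis-indented one returns 6 partial products.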
aboxylica
111 New Member
Hey,
That was the mistake. Amazing! Thanks a million!
I have some doubts about the program. I will first try to understand it and then get back to you.
Cheers!
Jul 18 '07 #66
aboxylica
111 New Member
Hey,
Here is the code where I tried removing the try/except block and a couple of other things to make it easier for me to understand. It looks like there is a problem, though. I will of course keep them in my main program; I was just trying to understand how it works. When I tried executing it, the iteration was not happening, and it still did not iterate when I used print outList instead of storing the output in a file. This is the code.
Can you tell me what's happening?
from math import *
def parseArray(fn,dataset=1,key='PO',term='/'):
    f=open(fn)
    for _ in range(dataset):
        line=f.next()
        while not line.startswith(key):
            line=f.next()
    headerList=line.strip().split()[1:]
    lineList=[]
    line=f.next().strip()
    while not line.startswith(term):
        if line!='':
            lineList.append(line.strip().split())
        line=f.next().strip()
        # f.close()
    keys=[i[0] for i in lineList]
    values=[[float(s) for s in item] for item in [j[1:] for j in lineList]]
    lineDict=dict(zip(keys,values))
    dataDict={}
    for i,item in enumerate(headerList):
        dataDict[item]={}
        for key in lineDict:
            dataDict[item][key]=lineDict[key][i]
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub]+=1.0
    valueSums=[sum(item)+4 for item in values]
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub]/=valueSums[int(keySub)-1]
    return dataDict
#fn="weight_matrix.transfac.txt"
#p=parseArray(fn)
#print p
def parseData(fn,dataset=1,key='>'):
    dataList=[]
    f=open(fn)
    for _ in range(dataset):
        s=f.next()
    dataList=[s,]

    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break
    return dataList
#fn="redfly_sequence.fasta"
#p=parseData(fn)
#print p
def compileData(fnArray,fnSeq,arraySet=1,seqSet=1):
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
    dataArray=parseArray(fnArray,arraySet)
    if dataArray:
        dataSeq=parseData(fnSeq,seqSet)
    seq=''.join(dataSeq[1:])
    subKeys=dataArray['A'].keys()
    subKeys.sort()
    numList=[]
    denList=[]
    seqList=[]
    for i in xrange(len(seq)-len(subKeys)):
        subseq=seq[0:len(subKeys)]
        seqList.append(subseq)
        num,den=1,1
        for j,s in enumerate(subseq):
            num*=dataArray[s][subKeys[j]]
            den*=value[s]
        numList.append(num)
        denList.append(den)
        seq=seq[1:]
    resultList=[]
    for i,num in enumerate(numList):
        if (log10(num/denList[i]))>2:
            resultList.append(log10(num/denList[i]))
    outStr='\n'.join(['sequence=%s Calculation=%0.12f'%(seqList[i],res) for i,res in enumerate(resultList)])
    return 'array set#= %d\nSequence set #=%d\nSequence Header: %s\n%s' %(arraySet,seqSet,dataSeq[0],outStr)
fnArray='weight_matrix.transfac.txt'
fnSeq='redfly_sequence.fasta'
arraySet=1
outList=[]
calcdata=1
while not calcdata is None:
    seqSet=1
    while True:
        calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
        if calcdata:
            outList.append(calcdata)
            seqSet+=1
        else:
            break

    arraySet+=1
print outList
f=open(outputfile,'w')
f.write('/n'.join(outList))
f.close()
Waiting for your reply.
Cheers!
Jul 18 '07 #67
bvdet
2,851 Recognized Expert Moderator Specialist
After running the script, I can do this:
>>> print outList[1]
Array set # = 1
Sequence set # = 2
Sequence Header: >Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136

Sequence = AGTCGACCAGCACGAG Calculation = -0.872390330485
Sequence = GTCGACCAGCACGAGA Calculation = -3.287525755636
Sequence = TCGACCAGCACGAGAT Calculation = -4.346213357398
Sequence = CGACCAGCACGAGATC Calculation = -2.329064001005
.........................
I don't want to print the entire outList because it's over 3 MB.
You may have changed something you should not have. Maybe you should copy the code again. If you need to change things, change only one thing at a time and test to make sure it still works.
Jul 18 '07 #68
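One likely candidate for "something you should not have changed" is the try/except block removed in post #67. In Python 2, `f.next()` raises `StopIteration` at end of file, and the outer `while` loops only stop when `compileData` returns a false value, so without the handler the script would probably end with a traceback before it reaches `print outList`. The trimmed `parseData` in that post also dropped the `while not s.startswith(key)` loop, so it reads the n-th line rather than the n-th header, and the script writes to `outputfile` without defining it. The sketch below, in Python 3 syntax with a made-up helper name `skip_to_record` and made-up sample data, shows the pattern the original functions rely on: catch `StopIteration` and return `False` as the "no more records" signal.

```python
import io


def skip_to_record(f, n, key=">"):
    """Advance file-like object f to the n-th line that starts with key.

    Returns that header line, or False if the file ends first.
    """
    try:
        for _ in range(n):
            line = next(f)
            while not line.startswith(key):
                line = next(f)
    except StopIteration:
        # End of file: report "no such record" instead of crashing.
        return False
    return line


# Two tiny FASTA-style records, invented for the demonstration.
data = ">seq1\nACGT\n>seq2\nTTGA\n"
print(repr(skip_to_record(io.StringIO(data), 2)))
print(repr(skip_to_record(io.StringIO(data), 3)))
```

Asking for record 2 returns its header line, while asking for record 3 returns `False`, which is what lets the calling loop `break` cleanly.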
aboxylica
111 New Member
Hello!
I hope you remember the problem above. I have a small problem with it.
Previously I was opening a single file containing many sequences. Now I will be opening a directory containing different sequence files.
This is how the code looks now.
I am trying to change the input from a file to a folder by giving the path of the folder, but it always goes to the exception part. Can you tell me why?
from math import *
def parseArray(fn, dataset=1, key='PO', term='/'):
    '''
    Read a formatted data file in matrix format and
    compile data into a dictionary
    '''
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            line = f.next()
            while not line.startswith(key):
                line = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    headerList = line.strip().split()[1:]
    lineList = []

    line = f.next().strip()
    while not line.startswith(term):
        if line != '':
            lineList.append(line.strip().split())
        line = f.next().strip()

    f.close()

    # Key list
    keys = [i[0] for i in lineList]
    # Values list
    values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]

    # Create a dictionary from keys and values
    lineDict = dict(zip(keys, values))

    dataDict = {}

    for i, item in enumerate(headerList):
        dataDict[item] = {}
        for key in lineDict:
            dataDict[item][key] = lineDict[key][i]

    # Add 1.0 to every element in dataDict subdictionaries
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] += 1.0

    # Normalize original data (with 1 added) and update data
    valueSums = [sum(item)+4 for item in values]
    # print valueSums

    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]

    return dataDict


def parseData(fn, dataset=1, key='>'):
    '''
    Read a formatted data file of sequences
    Return a list of sequences
    The first element in the list is the header
    '''
    # initialize output list
    dataList = []

    # open file for reading
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            s = f.next()
            while not s.startswith(key):
                s = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False

    # initialize output list
    dataList = [s,]

    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break

    f.close()
    return dataList


def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
    # sequence factor dictionary
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    dataArray = parseArray(fnArray, arraySet)

    if dataArray:
        dataSeq = parseData(fnSeq, seqSet)
        if not dataSeq:
            return False
    else:
        return None

    # This is the complete sequence
    seq = ''.join(dataSeq[1:])

    # These are the subkeys of dataArray - '01', '02', '03',.............
    subKeys = dataArray['A'].keys()
    subKeys.sort()

    # Calculate num/den for each slice of sequence
    # Each sequence slice length = length of subKeys
    # Example:
    # seq = 'ATCGATA'
    # subKeys length = 3
    # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
    numList = []
    denList = []
    seqList = []
    for i in xrange(len(seq) - len(subKeys)):
        subseq = seq[0:len(subKeys)]
        seqList.append(subseq)
        num, den = 1, 1
        for j, s in enumerate(subseq):
            num *= dataArray[s][subKeys[j]]
            den *= value[s]
        numList.append(num)
        denList.append(den)
        seq = seq[1:]

    resultList = []
    for i, num in enumerate(numList):
        if (log10(num/denList[i]))>=2:
            resultList.append(int(abs(1)))

    outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])

    return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)


if __name__ == '__main__':

    fnArray ='half.txt'
    fnSeq = 'C:\\python25\ding\YAL005C.txt'

    outputfile =  "sequence_calc_data.txt"

    arraySet = 1
    outList = []
    calcdata = 1

    while not calcdata is None:
        seqSet = 1
        while True:
            calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
            print calcdata
            if calcdata:
                outList.append(calcdata)
                seqSet += 1
            else:
                break
        arraySet += 1

    f = open(outputfile, 'w')
    f.write('\n'.join(outList))
    f.close()
    #f=open(outputfile,"r")
    #file_con=f.readlines()
    #for line in file_con:
     #   print line
Please tell me what I can do.
Dec 11 '07 #69
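For reference, the window scoring that `compileData` performs can be sketched on its own, in Python 3 syntax. This is only an illustration under stated assumptions: `score_windows` and the toy 3-position `pwm` are invented here, while the `background` values are the ones from the `value` dictionary in the posted code. Two things differ from the posted code: the sketch adds logarithms instead of multiplying probabilities and taking `log10` at the end (the result is the same up to rounding, and long matrices cannot underflow), and it loops over `len(seq) - width + 1` start positions so that the last window, 'ATA' in the comment's own 'ATCGATA' example, is included. The posted `xrange(len(seq) - len(subKeys))` appears to stop one window short of that example.

```python
from math import log10


def score_windows(seq, pwm, background):
    """Return (window, log10 odds) for every window of len(pwm) in seq.

    pwm is a list of per-position dicts mapping base -> probability.
    """
    width = len(pwm)
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Sum of per-position log ratios == log10 of the product ratio.
        score = sum(log10(pwm[j][base] / background[base])
                    for j, base in enumerate(window))
        results.append((window, score))
    return results


# Background frequencies as in the thread's `value` dictionary.
background = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
# Hypothetical, already-normalized 3-position matrix favouring 'ATC'.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
results = score_windows("ATCGATA", pwm, background)
for window, score in results:
    print("Sequence = %s Calculation = %0.12f" % (window, score))
```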
aboxylica
111 New Member
Here is my code, which reads a directory containing files. It always seems to go to the exception part, and I don't know why. I think it checks the first file in the folder and then exits. How do I check whether it is going through all the files?
from math import *
def parseArray(fn, dataset=1, key='PO', term='/'):
    '''
    Read a formatted data file in matrix format and
    compile data into a dictionary
    '''
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            line = f.next()
            print "am here"
            while not line.startswith(key):
                print "oh yes"
                line = f.next()
        except StopIteration, e:
            print '###############################'
            print 'We have reached the end of the file!'
            f.close()
            return False

    headerList = line.strip().split()[1:]
    lineList = []

    line = f.next().strip()
    while not line.startswith(term):
        if line != '':
            lineList.append(line.strip().split())
        line = f.next().strip()

    f.close()

    # Key list
    keys = [i[0] for i in lineList]
    # Values list
    values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]

    # Create a dictionary from keys and values
    lineDict = dict(zip(keys, values))

    dataDict = {}

    for i, item in enumerate(headerList):
        dataDict[item] = {}
        for key in lineDict:
            dataDict[item][key] = lineDict[key][i]

    # Add 1.0 to every element in dataDict subdictionaries
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] += 1.0

    # Normalize original data (with 1 added) and update data
    valueSums = [sum(item)+4 for item in values]
    # print valueSums

    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]

    return dataDict


def parseData(fn, dataset=1, key='>'):
    '''
    Read a formatted data file of sequences
    Return a list of sequences
    The first element in the list is the header
    '''
    # initialize output list
    dataList = []

    # open file for reading
    f = open(fn)

    # skip to required data set
    for _ in range(dataset):
        try:
            s = f.next()
            while not s.startswith(key):
                s = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
            f.close()
            return False

    # initialize output list
    dataList = [s,]

    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break

    f.close()
    return dataList


def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
    # sequence factor dictionary
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    dataArray = parseArray(fnArray, arraySet)

    if dataArray:
        dataSeq = parseData(fnSeq, seqSet)
        if not dataSeq:
            return False
    else:
        return None

    # This is the complete sequence
    seq = ''.join(dataSeq[1:])

    # These are the subkeys of dataArray - '01', '02', '03',.............
    subKeys = dataArray['A'].keys()
    subKeys.sort()

    # Calculate num/den for each slice of sequence
    # Each sequence slice length = length of subKeys
    # Example:
    # seq = 'ATCGATA'
    # subKeys length = 3
    # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
    numList = []
    denList = []
    seqList = []
    for i in xrange(len(seq) - len(subKeys)):
        subseq = seq[0:len(subKeys)]
        seqList.append(subseq)
        num, den = 1, 1
        for j, s in enumerate(subseq):
            num *= dataArray[s][subKeys[j]]
            den *= value[s]
        numList.append(num)
        denList.append(den)
        seq = seq[1:]

    resultList = []
    for i, num in enumerate(numList):
        #p=log10(num/denList[i])
        #if (p) >=2:
            #print "#########",abs(int(p))
        if (log10(num/denList[i]))>=2:
            #print "i am here"
            resultList.append(int(abs(1)))
    #print resultList
    #for i in resultList:
    #mean=sum(resultList)/len(resultList)
        #sub=mean-i
        #queue = []
        #queue = (sub)**2
        #print sqrt(queue/len(resultList))

    #print mean,"@@@@@@@@@@"

    outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
    #print "this is line 294"

    return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)


if __name__ == '__main__':

    fnArray ='C:\\python25\\half.txt'
    import os
    seq_=os.listdir("ding")
    print seq_
    os.chdir("C:\\python25\\New Folder")
    for file_ in seq_:
        if os.path.isfile(file_):
            rem=open(file_)
            dingg=rem.readlines()
    fnSeq = dingg

    outputfile =  "sequence_calc_data.txt"

    arraySet = 1
    outList = []
    calcdata = 1

    while not calcdata is None:
        seqSet = 1
        while True:
            calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
            print calcdata
            if calcdata:
                outList.append(calcdata)
                seqSet += 1
            else:
                break
        arraySet += 1

    f = open(outputfile, 'w')
    f.write('\n'.join(outList))
    f.close()
    #f=open(outputfile,"r")
    #file_con=f.readlines()
    #for line in file_con:
     #   print line
Waiting for your reply.
Cheers!
Dec 11 '07 #70
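A few things in the posted main block look like they would prevent every file from being processed, going only by the code as shown. `os.listdir("ding")` returns bare file names, but `os.path.isfile(file_)` is evaluated after `os.chdir` to a different folder, so the names may not resolve there. `fnSeq = dingg` is assigned once, after the loop, so at most the last file's contents survive. And `dingg` is a list of lines, whereas `compileData` passes `fnSeq` on to `parseData`, which calls `open(fn)` and therefore expects a file name. A minimal sketch of the usual pattern, in Python 3 syntax: the helper name `sequence_files`, the second file name and the temporary directory are all stand-ins invented for the demonstration.

```python
import os
import tempfile


def sequence_files(folder):
    """Yield the full path of every regular file in folder, sorted by name."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)   # listdir returns bare names
        if os.path.isfile(path):
            yield path


# Demonstration with a throwaway directory standing in for the real folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("YAL005C.txt", "example2.txt"):   # second name is made up
        with open(os.path.join(folder, name), "w") as f:
            f.write(">%s\nACGT\n" % name)
    os.mkdir(os.path.join(folder, "subdir"))       # skipped: not a file

    found = [os.path.basename(p) for p in sequence_files(folder)]
    print(found)
```

In the thread's script, the idea would be to wrap the existing `while` loops in `for fnSeq in sequence_files(...)`, resetting `arraySet` for each file, so that `compileData` receives one path at a time rather than a list of lines.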
