473,382 Members | 1,332 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,382 software developers and data experts.

looping through a big file containing a set of files.

111 100+
hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.

//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//

the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. import random
  3. f=open("deeps1.txt","r")
  4. line=f.next()
  5. while not line.startswith('PO'):
  6.     line=f.next()
  7.  
  8. headerlist=line.strip().split()[1:]
  9. linelist=[]
  10.  
  11.  
  12. line=f.next().strip()
  13. while not line.startswith('/'):
  14.     if line != '':
  15.         linelist.append(line.strip().split())
  16.     line=f.next().strip()
  17.  
  18. keys=[i[0] for i in linelist]
  19. values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
  20.  
  21. array={}
  22. linedict=dict(zip(keys,values))
  23. keys = linedict.keys()
  24. keys.sort()
  25. for key in keys:
  26.     array=[key,linedict[key]]
  27.  
  28. datadict={}
  29. datadict1={}
  30. for i,item in enumerate(headerlist):
  31.     datadict[item]={}
  32.     for key_ in linedict:
  33.         datadict[item][key_]=linedict[key_][i]
  34.  
  35.  
  36. for keymain in datadict:
  37.     for keysub in datadict[keymain]:
  38.         datadict[keymain][keysub]+=1.0
  39.  
  40. datadict1=datadict.copy()
  41. for keysub in datadict:
  42.     for keysub in datadict[keymain]:
  43.         datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
  44.  
  45.  
  46.  
  47. def readfasta():
  48.     file1= open("chr011.py",'r')
  49.     file_content=file1.readlines()
  50.     first=1
  51.     list1=""    
  52.     for line in file_content:
  53.         if line[0]==">":
  54.             if first==0:
  55.                 print "***********"
  56.                 list1+=sequence
  57.                 print "***********"
  58.             else:
  59.                 first=0
  60.                 sequence=""
  61.                 seq=""
  62.                 for i in range(0,len(line)-1):
  63.                     seq+=line[i]
  64.         else:
  65.                 for i in range(0,len(line)-1):
  66.             sequence+=line[i]  
  67.     list1+=sequence
  68.     return list1
  69.  
  70.  
  71.  
  72. p=readfasta()
  73.  
  74.  
  75.  
  76.  
  77.  
  78. res=1
  79. part=""
  80. q=len(p)
  81. seqq=""
  82.  
  83. value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  84. for i in range(q-16):
  85.     part=p[i:i+16]
  86.     seqq=part
  87.     res=1
  88.     score=1
  89.     for j in range(16):
  90.         key=seqq[j]
  91.         res=res*datadict1[key]["%02d"%(j+1)]
  92.         #print res
  93.     for key in seqq:
  94.         score=score * value[key]
  95.     #print score,"*******************",res
  96.     log_ratio=log10(res/score)
  97.     print i,log_ratio
  98.  
what changes should i make and how?/
waiting for your reply,
cheers!
Jul 13 '07
103 5771
aboxylica
111 100+
and please do send the full code. i tried executing mine.looks like there are some errors.so il check up with urs and try executing.
waiting,
cheers!!
Jul 16 '07 #51
bvdet
2,851 Expert Mod 2GB
and what does this line mean??
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.  
A module type is a container that holds objects loaded with the import statement. The identifier '__name__' is assigned to the module name. If the module is run as a top level program, '__name__' is assigned to the string '__main__'.

Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.     .... execute code ....
  3.  
The code under the 'if' statement only executes if the module is run as a program. That way the functions defined above can be imported by other programs without executing the conditional code.
Jul 16 '07 #52
bvdet
2,851 Expert Mod 2GB
and please do send the full code. i tried executing mine.looks like there are some errors.so il check up with urs and try executing.
waiting,
cheers!!
The whole thing:
Expand|Select|Wrap|Line Numbers
  1. def parseArray(fn, dataset=1, key='PO', term='/'):
  2.     '''
  3.     Read a formatted data file in matrix format and
  4.     compile data into a dictionary
  5.     '''
  6.     f = open(fn)
  7.  
  8.     # skip to required data set
  9.     for _ in range(dataset):
  10.         try:
  11.             line = f.next()
  12.             while not line.startswith(key):
  13.                 line = f.next()
  14.         except StopIteration, e:
  15.             print 'We have reached the end of the file!'
  16.             f.close()
  17.             return False
  18.  
  19.     headerList = line.strip().split()[1:]
  20.     lineList = []
  21.  
  22.     line = f.next().strip()
  23.     while not line.startswith(term):
  24.         if line != '':
  25.             lineList.append(line.strip().split())
  26.         line = f.next().strip()
  27.  
  28.     f.close()
  29.  
  30.     # Key list
  31.     keys = [i[0] for i in lineList]
  32.     # Values list
  33.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  34.  
  35.     # Create a dictionary from keys and values
  36.     lineDict = dict(zip(keys, values))
  37.  
  38.     dataDict = {}
  39.  
  40.     for i, item in enumerate(headerList):
  41.         dataDict[item] = {}
  42.         for key in lineDict:
  43.             dataDict[item][key] = lineDict[key][i]
  44.  
  45.     # Add 1.0 to every element in dataDict subdictionaries
  46.     for keyMain in dataDict:
  47.         for keySub in dataDict[keyMain]:
  48.             dataDict[keyMain][keySub] += 1.0
  49.  
  50.     # Normalize original data (with 1 added) and update data
  51.     valueSums = [sum(item)+4 for item in values]
  52.     # print valueSums
  53.  
  54.     for keyMain in dataDict:
  55.         for keySub in dataDict[keyMain]:
  56.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  57.  
  58.     return dataDict
  59.  
  60.  
  61. def parseData(fn, dataset=1, key='>'):
  62.     '''
  63.     Read a formatted data file of alpha sequences
  64.     Return a list of sequences
  65.     The first element in the list is the header
  66.     '''    
  67.     # initialize output list
  68.     dataList = []
  69.  
  70.     # open file for reading
  71.     f = open(fn)
  72.  
  73.     # skip to required data set
  74.     for _ in range(dataset):
  75.         try:
  76.             s = f.next()
  77.             while not s.startswith(key):
  78.                 s = f.next()
  79.         except StopIteration, e:
  80.             print 'We have reached the end of the file!'
  81.             f.close()
  82.             return False
  83.  
  84.     # initialize output list
  85.     dataList = [s,]
  86.  
  87.     for line in f:
  88.         if not line.startswith(key):
  89.             dataList.append(line.strip())
  90.         else:
  91.             break
  92.  
  93.     f.close()
  94.     return dataList
  95.  
  96. if __name__ == '__main__':
  97.  
  98.     arraySet = 1
  99.     seqSet = 3
  100.  
  101.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  102.  
  103.     fnArray = r'H:\TEMP\temsys\data9.txt'
  104.     fnSeq = r'H:\TEMP\temsys\data12.txt'
  105.  
  106.     dataArray = parseArray(fnArray, arraySet)
  107.     dataSeq = parseData(fnSeq, seqSet)
  108.  
  109.     # This is the complete sequence  
  110.     seq = ''.join(dataSeq[1:])
  111.     # These are the subkeys of dataArray - '01', '02', '03',.............
  112.     subKeys = dataArray['A'].keys()
  113.     subKeys.sort()
  114.  
  115.     # Calculate num/den for each slice of sequence
  116.     # Each sequence slice length = length of subKeys
  117.     # Example:
  118.     # seq = 'ATCGATA'
  119.     # subKeys length = 3
  120.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  121.     numList = []
  122.     denList = []
  123.     seqList = []
  124.     for i in xrange(len(seq) - len(subKeys) + 1):
  125.         subseq = seq[0:len(subKeys)]
  126.         seqList.append(subseq)
  127.         num, den = 1, 1
  128.         for j, s in enumerate(subseq):
  129.             num *= dataArray[s][subKeys[j]]
  130.             den *= value[s]
  131.         numList.append(num)
  132.         denList.append(den)
  133.         seq = seq[1:]
  134.  
  135.     resultList = []
  136.     for i, num in enumerate(numList):
  137.         resultList.append(num/denList[i])
  138.  
  139.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
  140.     print 'Array set # = %d\nSequence set # = %d' % (arraySet, seqSet)
  141.     print 'Sequence Header: %s' % dataSeq[0]
  142.     print outStr
Check the values in variables dataArray and dataSeq for appropriateness.
Jul 16 '07 #53
aboxylica
111 100+
has log(num/deno) been done??
Jul 17 '07 #54
aboxylica
111 100+
looks like it is correct, I inserted a known sequence at a particular position which should yield a high value,it did give a hike at that position:) ..i did the log calculation..
Expand|Select|Wrap|Line Numbers
  1.  for i, num in enumerate(numList):
  2.  
  3.         resultList.append(log10(num/denList[i]))
  4.         print resultList
  5.  
looks right to me.!only generalising this is left!!
cheers!!
Jul 17 '07 #55
aboxylica
111 100+
hey,
I am waiting for ur reply bvdet.
I know am disturbing you too much but am really new to programming and python.so thats why I couldn find a way out.
waiting,
cheers!!
Jul 17 '07 #56
bvdet
2,851 Expert Mod 2GB
hey,
I am waiting for ur reply bvdet.
I know am disturbing you too much but am really new to programming and python.so thats why I couldn find a way out.
waiting,
cheers!!
You have taken on an ambitious project if you are new to programming. I have given you all that you need to complete the project. All you need to do is:
Iterate on each data file until you reach the end of each.
Compile the data for each iteration.
Something like this (untested):
Expand|Select|Wrap|Line Numbers
  1. indxSeq = 1
  2. while True:
  3.     dataSeq = parseData(fnSeq, indxSeq)
  4.     if not dataSeq:
  5.         break
  6.     indxArray = 1
  7.         while True:
  8.             dataArray = parseArray(fnArray, indxArray)
  9.             if not dataArray:
  10.                 break
  11.             ............. calculations ................ 
  12.             indxArray += 1
  13.         indxSeq += 1
It is not fair to me or others on this list for me to write the whole program for you. Study some tutorials, read Python documentation, practice with some simple exercises, and give it a shot. Come back with some code, and we will try to help if you need it. This project is interesting. Good luck and have fun!
Jul 17 '07 #57
aboxylica
111 100+
hey,
Thanks a lot. I will try and come back with queries.sorry for bugging you.
Jul 17 '07 #58
bvdet
2,851 Expert Mod 2GB
hey,
Thanks a lot. I will try and come back with queries.sorry for bugging you.
You don't have to apologize. I only did what I wanted to. :)
Jul 17 '07 #59
aboxylica
111 100+
hey,
Thats very sweet of you:)
I dont really know how to improve.as in when am reading the tutorial.I just gain theoritical knowledge.How do I put it to practical knowledge.
I have gone through the python documentation.But when it comes to programming I cant really find a way out.like this case.. I had the code to work for a single sequence and a single matrix.I could have improvised that..but i was not able to think.how do I crack it and become better?
Jul 17 '07 #60
aboxylica
111 100+
Am getting a list index out of range error!!
but I cant add a -1 to the loop.can i??whats wrong with this??

Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.     lineList = []
  42.  
  43.  
  44.  
  45.     line = f.next().strip()
  46.  
  47.     while not line.startswith(term):
  48.  
  49.         if line != '':
  50.  
  51.             lineList.append(line.strip().split())
  52.  
  53.         line = f.next().strip()
  54.  
  55.  
  56.  
  57.     f.close()
  58.  
  59.  
  60.  
  61.     # Key list
  62.  
  63.     keys = [i[0] for i in lineList]
  64.  
  65.     # Values list
  66.  
  67.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  68.  
  69.  
  70.  
  71.     # Create a dictionary from keys and values
  72.  
  73.     lineDict = dict(zip(keys, values))
  74.  
  75.  
  76.  
  77.     dataDict = {}
  78.  
  79.  
  80.  
  81.     for i, item in enumerate(headerList):
  82.  
  83.         dataDict[item] = {}
  84.  
  85.         for key in lineDict:
  86.  
  87.             dataDict[item][key] = lineDict[key][i]
  88.  
  89.  
  90.  
  91.     # Add 1.0 to every element in dataDict subdictionaries
  92.  
  93.     for keyMain in dataDict:
  94.  
  95.         for keySub in dataDict[keyMain]:
  96.  
  97.             dataDict[keyMain][keySub] += 1.0
  98.  
  99.  
  100.  
  101.     # Normalize original data (with 1 added) and update data
  102.  
  103.     valueSums = [sum(item)+4 for item in values]
  104.  
  105.     # print valueSums
  106.  
  107.  
  108.  
  109.     for keyMain in dataDict:
  110.  
  111.         for keySub in dataDict[keyMain]:
  112.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  113.  
  114.  
  115.     return dataDict
  116.  
  117.  
  118.  
  119.  
  120.  
  121. def parseData(fn, dataset=1, key='>'):
  122.  
  123.     '''
  124.  
  125.     Read a formatted data file of sequences
  126.  
  127.     Return a list of sequences
  128.  
  129.     The first element in the list is the header
  130.  
  131.     '''   
  132.  
  133.     # initialize output list
  134.  
  135.     dataList = []
  136.  
  137.  
  138.  
  139.     # open file for reading
  140.  
  141.     f = open(fn)
  142.  
  143.  
  144.  
  145.     # skip to required data set
  146.  
  147.     for _ in range(dataset):
  148.  
  149.         try:
  150.  
  151.             s = f.next()
  152.  
  153.             while not s.startswith(key):
  154.  
  155.                 s = f.next()
  156.  
  157.         except StopIteration, e:
  158.  
  159.             print 'We have reached the end of the file!'
  160.  
  161.             f.close()
  162.  
  163.             return False
  164.  
  165.  
  166.  
  167.     # initialize output list
  168.  
  169.     dataList = [s,]
  170.  
  171.  
  172.     for line in f:
  173.  
  174.         if not line.startswith(key):
  175.  
  176.             dataList.append(line.strip())
  177.  
  178.         else:
  179.  
  180.             break
  181.  
  182.  
  183.  
  184.     f.close()
  185.  
  186.     return dataList
  187.  
  188.  
  189.  
  190. if __name__ == '__main__':
  191.  
  192.  
  193.  
  194.     arraySet = 4
  195.     #print arraySet
  196.  
  197.     seqSet = 4
  198.     #print seqSet
  199.  
  200.  
  201.  
  202.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  203.  
  204.  
  205.  
  206.     fnArray = r'all_redfly.transfac.txt'
  207.  
  208.     fnSeq = r'redfly_sequence.fasta'
  209.     indxSeq=1
  210.     while True:
  211.         dataSeq=parseData(fnSeq,indxSeq)
  212.         if not dataSeq:
  213.             break
  214.         indxArray=1
  215.         while True:
  216.                 dataArray = parseArray(fnArray, arraySet)
  217.                 #dataSeq = parseData(fnSeq, seqSet)
  218.                 if not dataArray:
  219.                     break
  220.                 # This is the complete sequence
  221.                 seq = ''.join(dataSeq[1:])
  222.                 # These are the subkeys of dataArray - '01', '02', '03',.............
  223.                 subKeys = dataArray['A'].keys()
  224.                 subKeys.sort()
  225.  
  226.  
  227.  
  228.     # Calculate num/den for each slice of sequence
  229.  
  230.     # Each sequence slice length = length of subKeys
  231.  
  232.     # Example:
  233.  
  234.     # seq = 'ATCGATA'
  235.  
  236.     # subKeys length = 3
  237.  
  238.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  239.  
  240.                 numList = []
  241.  
  242.                 denList = []
  243.  
  244.                 seqList = []
  245.  
  246.                 for i in xrange(len(seq) - len(subKeys) + 1):
  247.  
  248.                     subseq = seq[0:len(subKeys)]
  249.  
  250.                     seqList.append(subseq)
  251.                     num, den = 1, 1
  252.  
  253.                     for j, s in enumerate(subseq):
  254.  
  255.                         num *= dataArray[s][subKeys[j]]
  256.  
  257.                         den *= value[s]
  258.  
  259.                         numList.append(num)
  260.  
  261.                         denList.append(den)
  262.  
  263.                         seq = seq[1:]
  264.  
  265.  
  266.  
  267.                         resultList = []
  268.  
  269.                         for i, num in enumerate(numList):
  270.  
  271.                             resultList.append(log10(num/denList[i]))
  272.                     indxArray+=1
  273.                 indxSeq +=1
  274.  
  275.                 outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
  276.                 print 'Array set # = %d\nSequence set # = %d' % (arraySet, seqSet)
  277.                 print 'Sequence Header: %s' % dataSeq[0]
  278.                 print outStr
  279.  
Jul 17 '07 #61
bvdet
2,851 Expert Mod 2GB
Let's make a new function, iterate on it, and write the results to a file:
Expand|Select|Wrap|Line Numbers
  1. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  2.     # sequence factor dictionary
  3.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  4.  
  5.     dataArray = parseArray(fnArray, arraySet)
  6.     if dataArray:
  7.         dataSeq = parseData(fnSeq, seqSet)
  8.         if not dataSeq:
  9.             return False
  10.     else:
  11.         return None
  12.  
  13.     # This is the complete sequence  
  14.     seq = ''.join(dataSeq[1:])
  15.     # These are the subkeys of dataArray - '01', '02', '03',.............
  16.     subKeys = dataArray['A'].keys()
  17.     subKeys.sort()
  18.  
  19.     # Calculate num/den for each slice of sequence
  20.     # Each sequence slice length = length of subKeys
  21.     # Example:
  22.     # seq = 'ATCGATA'
  23.     # subKeys length = 3
  24.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  25.     numList = []
  26.     denList = []
  27.     seqList = []
  28.     for i in xrange(len(seq) - len(subKeys) + 1):
  29.         subseq = seq[0:len(subKeys)]
  30.         seqList.append(subseq)
  31.         num, den = 1, 1
  32.         for j, s in enumerate(subseq):
  33.             num *= dataArray[s][subKeys[j]]
  34.             den *= value[s]
  35.         numList.append(num)
  36.         denList.append(den)
  37.         seq = seq[1:]
  38.  
  39.     resultList = []
  40.     for i, num in enumerate(numList):
  41.         resultList.append(num/denList[i])
  42.  
  43.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
  44.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  45.  
  46. if __name__ == '__main__':
  47.  
  48.     fnArray = 'array.txt'
  49.     fnSeq = 'seq.txt'
  50.  
  51.     outputfile = 'sequence_calc_data.txt'
  52.  
  53.     arraySet = 1
  54.     outList = []
  55.     calcdata = 1
  56.     while not calcdata is None:
  57.         seqSet = 1
  58.         while True:
  59.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  60.             if calcdata:
  61.                 outList.append(calcdata)
  62.                 seqSet += 1
  63.             else:
  64.                 break
  65.         arraySet += 1
  66.  
  67.     f = open(outputfile, 'w')
  68.     f.write('\n'.join(outList))
  69.     f.close()  
This resulted in a 3.1 mb file. Following are the first few lines of the first and last compilation:
Expand|Select|Wrap|Line Numbers
  1. Array set # = 1
  2. Sequence set # = 1
  3. Sequence Header: >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
  4.  
  5. Sequence = CCAGTCCACCGGCCGC Calculation = 0.000025722315
  6. Sequence = CAGTCCACCGGCCGCC Calculation = 0.000000000318
  7. Sequence = AGTCCACCGGCCGCCG Calculation = 0.000595631200
  8. Sequence = GTCCACCGGCCGCCGA Calculation = 0.000120125057
  9. Sequence = TCCACCGGCCGCCGAT Calculation = 0.000000089016
  10. ...........................
  11. Array set # = 4
  12. Sequence set # = 8
  13. Sequence Header: >Obp19b_prom|Drosophila melanogaster|Obp19b|FBgn0031110|X:20224439..20227440
  14.  
  15. Sequence = ATTGCTGACGGGTCGA Calculation = 0.000005535136
  16. Sequence = TTGCTGACGGGTCGAA Calculation = 0.000003984295
  17. Sequence = TGCTGACGGGTCGAAT Calculation = 0.000053179344
  18. Sequence = GCTGACGGGTCGAATG Calculation = 0.000031549069
  19. .............................
Jul 18 '07 #62
aboxylica
111 100+
THis is the code.my o/p is an empty array.why is this happening?

Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.     lineList = []
  42.  
  43.  
  44.  
  45.     line = f.next().strip()
  46.  
  47.     while not line.startswith(term):
  48.  
  49.         if line != '':
  50.  
  51.             lineList.append(line.strip().split())
  52.  
  53.         line = f.next().strip()
  54.  
  55.  
  56.  
  57.     f.close()
  58.  
  59.  
  60.  
  61.     # Key list
  62.  
  63.     keys = [i[0] for i in lineList]
  64.  
  65.     # Values list
  66.  
  67.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  68.  
  69.  
  70.  
  71.     # Create a dictionary from keys and values
  72.  
  73.     lineDict = dict(zip(keys, values))
  74.  
  75.  
  76.  
  77.     dataDict = {}
  78.  
  79.  
  80.  
  81.     for i, item in enumerate(headerList):
  82.  
  83.         dataDict[item] = {}
  84.  
  85.         for key in lineDict:
  86.  
  87.             dataDict[item][key] = lineDict[key][i]
  88.  
  89.  
  90.  
  91.     # Add 1.0 to every element in dataDict subdictionaries
  92.  
  93.     for keyMain in dataDict:
  94.  
  95.         for keySub in dataDict[keyMain]:
  96.  
  97.             dataDict[keyMain][keySub] += 1.0
  98.  
  99.  
  100.  
  101.     # Normalize original data (with 1 added) and update data
  102.  
  103.     valueSums = [sum(item)+4 for item in values]
  104.  
  105.     # print valueSums
  106.  
  107.  
  108.  
  109.     for keyMain in dataDict:
  110.  
  111.         for keySub in dataDict[keyMain]:
  112.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  113.  
  114.  
  115.     return dataDict
  116.  
  117.  
  118.  
  119.  
  120.  
  121. def parseData(fn, dataset=1, key='>'):
  122.  
  123.     '''
  124.  
  125.     Read a formatted data file of sequences
  126.  
  127.     Return a list of sequences
  128.  
  129.     The first element in the list is the header
  130.  
  131.     '''   
  132.  
  133.     # initialize output list
  134.  
  135.     dataList = []
  136.  
  137.  
  138.  
  139.     # open file for reading
  140.  
  141.     f = open(fn)
  142.  
  143.  
  144.  
  145.     # skip to required data set
  146.  
  147.     for _ in range(dataset):
  148.  
  149.         try:
  150.  
  151.             s = f.next()
  152.  
  153.             while not s.startswith(key):
  154.  
  155.                 s = f.next()
  156.  
  157.         except StopIteration, e:
  158.  
  159.             print 'We have reached the end of the file!'
  160.  
  161.             f.close()
  162.  
  163.             return False
  164.  
  165.  
  166.  
  167.     # initialize output list
  168.  
  169.     dataList = [s,]
  170.  
  171.  
  172.     for line in f:
  173.  
  174.         if not line.startswith(key):
  175.  
  176.             dataList.append(line.strip())
  177.  
  178.         else:
  179.  
  180.             break
  181.  
  182.  
  183.  
  184.     f.close()
  185.  
  186.     return dataList
  187.  
  188.  
  189.  
  190.  
  191. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  192.  
  193.     # sequence factor dictionary
  194.  
  195.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  196.  
  197.  
  198.  
  199.     dataArray = parseArray(fnArray, arraySet)
  200.  
  201.     if dataArray:
  202.  
  203.         dataSeq = parseData(fnSeq, seqSet)
  204.  
  205.         if not dataSeq:
  206.  
  207.             return False
  208.  
  209.         else:
  210.  
  211.             return None
  212.  
  213.  
  214.  
  215.         # This is the complete sequence 
  216.  
  217.         seq = ''.join(dataSeq[1:])
  218.  
  219.         # These are the subkeys of dataArray - '01', '02', '03',.............
  220.  
  221.         subKeys = dataArray['A'].keys()
  222.  
  223.         subKeys.sort()
  224.  
  225.  
  226.  
  227.         # Calculate num/den for each slice of sequence
  228.  
  229.           # Each sequence slice length = length of subKeys
  230.  
  231.           # Example:
  232.             # seq = 'ATCGATA'
  233.  
  234.           # subKeys length = 3
  235.  
  236.           # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  237.  
  238.         numList = []
  239.  
  240.         denList = []
  241.  
  242.         seqList = []
  243.  
  244.         for i in xrange(len(seq) - len(subKeys) + 1):
  245.  
  246.             subseq = seq[0:len(subKeys)]
  247.  
  248.             seqList.append(subseq)
  249.  
  250.             num, den = 1, 1
  251.  
  252.             for j, s in enumerate(subseq):
  253.  
  254.                 num *= dataArray[s][subKeys[j]]
  255.  
  256.                 den *= value[s]
  257.  
  258.                 numList.append(num)
  259.  
  260.                 denList.append(den)
  261.  
  262.                 seq = seq[1:]
  263.  
  264.  
  265.  
  266.         resultList = []
  267.  
  268.         for i, num in enumerate(numList):
  269.  
  270.             resultList.append(num/denList[i])
  271.  
  272.  
  273.  
  274.             outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)   for i, res in enumerate(resultList)])
  275.  
  276.             return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  277.  
  278. if __name__ == '__main__':
  279.  
  280.  
  281.     fnArray =r'all_redfly.transfac' 
  282.     fnSeq = r'redfly_sequence.fasta'
  283.  
  284.     outputfile =  "sequence_calc_data.txt"
  285.  
  286.  
  287.  
  288.     arraySet = 1
  289.  
  290.     outList = []
  291.  
  292.     calcdata = 1
  293.  
  294.     while not calcdata is None:
  295.  
  296.         seqSet = 1
  297.  
  298.         while True:
  299.  
  300.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  301.             print calcdata
  302.  
  303.             if calcdata:
  304.  
  305.                 outList.append(calcdata)
  306.  
  307.                 seqSet += 1
  308.  
  309.             else:
  310.  
  311.                 break
  312.  
  313.         arraySet += 1
  314.  
  315.  
  316.  
  317.  
  318.     f = open(outputfile, 'w')
  319.  
  320.     f.write('\n'.join(outList))
  321.  
  322.     f.close()
  323.     f=open(outputfile,"r")
  324.     file_con=f.readlines()
  325.     print file_con
  326.     for line in file_con:
  327.         print line
  328.  
  329.  
Jul 18 '07 #63
aboxylica
111 100+
I seem to get an list index out of range error:
Traceback (most recent call last):
File "newbie1.py", line 311, in <module>
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
File "newbie1.py", line 285, in compileData
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)for i, res in enumerate(resultList)])
IndexError: list index out of range
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fn, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.     if dataArray:
  206.  
  207.         dataSeq = parseData(fnSeq, seqSet)
  208.  
  209.  
  210.         if not dataSeq:
  211.  
  212.             return False
  213.  
  214.     else:
  215.  
  216.         return None
  217.  
  218.  
  219.  
  220.  
  221.     # This is the complete sequence 
  222.  
  223.     seq = ''.join(dataSeq[1:])
  224.  
  225.  
  226.  
  227.     # These are the subkeys of dataArray - '01', '02', '03',.............
  228.  
  229.     subKeys = dataArray['A'].keys()
  230.  
  231.     subKeys.sort()
  232.  
  233.  
  234.  
  235.  
  236.     # Calculate num/den for each slice of sequence
  237.  
  238.     # Each sequence slice length = length of subKeys
  239.  
  240.     # Example:
  241.     # seq = 'ATCGATA'
  242.  
  243.     # subKeys length = 3
  244.  
  245.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  246.  
  247.     numList = []
  248.  
  249.     denList = []
  250.  
  251.     seqList = []
  252.  
  253.     for i in xrange(len(seq) - len(subKeys) + 1):
  254.  
  255.         subseq = seq[0:len(subKeys)]
  256.  
  257.         seqList.append(subseq)
  258.  
  259.  
  260.         num, den = 1, 1
  261.  
  262.         for j, s in enumerate(subseq):
  263.  
  264.             num *= dataArray[s][subKeys[j]]
  265.  
  266.             den *= value[s]
  267.  
  268.             numList.append(num)
  269.  
  270.             denList.append(den)
  271.  
  272.             seq = seq[1:]
  273.  
  274.  
  275.  
  276.     resultList = []
  277.  
  278.     for i, num in enumerate(numList):
  279.  
  280.         resultList.append(log10(num/denList[i]))
  281.         print (resultList)
  282.  
  283.  
  284.  
  285.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)for i, res in enumerate(resultList)])
  286.  
  287.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  288.  
  289. if __name__ == '__main__':
  290.  
  291.  
  292.     fnArray ='all_redfly.transfac' 
  293.     fnSeq = 'redfly_sequence.fasta'
  294.  
  295.     outputfile =  "sequence_calc_data.txt"
  296.  
  297.  
  298.  
  299.     arraySet = 1
  300.  
  301.     outList = []
  302.  
  303.     calcdata = 1
  304.  
  305.     while not calcdata is None:
  306.  
  307.         seqSet = 1
  308.  
  309.         while True:
  310.  
  311.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  312.  
  313.             if calcdata:
  314.  
  315.                 outList.append(calcdata)
  316.  
  317.                 seqSet += 1
  318.  
  319.             else:
  320.  
  321.                 break
  322.  
  323.         arraySet += 1
  324.  
  325.  
  326.  
  327.  
  328.  
  329.     f = open(outputfile, 'w')
  330.  
  331.     f.write('\n'.join(outList))
  332.  
  333.     f.close()
  334.     f=open(outputfile,"r")
  335.     file_con=f.readlines()
  336.     print file_con
  337.     for line in file_con:
  338.         print line
  339.  
waiting for ur reply,
cheers!
Jul 18 '07 #64
bvdet
2,851 Expert Mod 2GB
I am not sure why you add so many spaces in between the lines of code. I personally find it unreadable. Anyway, when you were adding all the spaces, some of the code ended up at the incorrect indentation:
Expand|Select|Wrap|Line Numbers
  1. ........for j, s in enumerate(subseq):
  2.  
  3.  
  4.             num *= dataArray[s][subKeys[j]]
  5.  
  6.  
  7.             den *= value[s]
  8.  
  9.  
  10.             numList.append(num)
  11.  
  12.  
  13.             denList.append(den)
  14.  
  15.  
  16.             seq = seq[1:]
  17.  
SHOULD BE:
Expand|Select|Wrap|Line Numbers
  1. ........for j, s in enumerate(subseq):
  2.             num *= dataArray[s][subKeys[j]]
  3.             den *= value[s]
  4.         numList.append(num)
  5.         denList.append(den)
  6.         seq = seq[1:]
Jul 18 '07 #65
aboxylica
111 100+
hey,
That was the mistake.amazing!! thanks a million!!
I got some doubts about the program.
i have some doubts. first understandingand then get back to you.
THANKS A MILLION!
cheers!!
Jul 18 '07 #66
aboxylica
111 100+
hey,
here is the code where I tried removing the try catch block and couple of things which will make it easier for me to understand.but looks like there is some problem ..I will of course use them in my main program.But I was just trying to understand when I tried executing the iteration was not happening and when I said
print outList instead of storing it in a file it was not iterating.This is the code
can you tell me whats happening???
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn,dataset=1,key='PO',term='/'):
  3.     f=open(fn)
  4.     for _ in range(dataset):
  5.         line=f.next()
  6.         while not line.startswith(key):
  7.             line=f.next()
  8.     headerList=line.strip().split()[1:]
  9.     lineList=[]
  10.     line=f.next().strip()
  11.     while not line.startswith(term):
  12.         if line!='':
  13.             lineList.append(line.strip().split())
  14.         line=f.next().strip()
  15.         # f.close()
  16.     keys=[i[0] for i in lineList]
  17.     values=[[float(s) for s in item] for item in [j[1:] for j in lineList]]
  18.     lineDict=dict(zip(keys,values))
  19.     dataDict={}
  20.     for i,item in enumerate(headerList):
  21.         dataDict[item]={}
  22.         for key in lineDict:
  23.             dataDict[item][key]=lineDict[key][i]
  24.     for keyMain in dataDict:
  25.         for keySub in dataDict[keyMain]:
  26.             dataDict[keyMain][keySub]+=1.0
  27.     valueSums=[sum(item)+4 for item in values]
  28.     for keyMain in dataDict:
  29.         for keySub in dataDict[keyMain]:
  30.             dataDict[keyMain][keySub]/=valueSums[int(keySub)-1]
  31.     return dataDict
  32. #fn="weight_matrix.transfac.txt"
  33. #p=parseArray(fn)
  34. #print p
  35. def parseData(fn,dataset=1,key='>'):
  36.     dataList=[]
  37.     f=open(fn)
  38.     for _ in range(dataset):
  39.         s=f.next()
  40.     dataList=[s,]
  41.  
  42.     for line in f:
  43.         if not line.startswith(key):
  44.             dataList.append(line.strip())
  45.         else:
  46.             break
  47.     return dataList
  48. #fn="redfly_sequence.fasta"
  49. #p=parseData(fn)
  50. #print p
  51. def compileData(fnArray,fnSeq,arraySet=1,seqSet=1):
  52.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  53.     dataArray=parseArray(fnArray,arraySet)
  54.     if dataArray:
  55.         dataSeq=parseData(fnSeq,seqSet)
  56.     seq=''.join(dataSeq[1:])
  57.     subKeys=dataArray['A'].keys()
  58.     subKeys.sort()
  59.     numList=[]
  60.     denList=[]
  61.     seqList=[]
  62.     for i in xrange(len(seq)-len(subKeys)):
  63.         subseq=seq[0:len(subKeys)]
  64.         seqList.append(subseq)
  65.         num,den=1,1
  66.         for j,s in enumerate(subseq):
  67.             num*=dataArray[s][subKeys[j]]
  68.             den*=value[s]
  69.         numList.append(num)
  70.         denList.append(den)
  71.         seq=seq[1:]
  72.     resultList=[]
  73.     for i,num in enumerate(numList):
  74.         if (log10(num/denList[i]))>2:
  75.             resultList.append(log10(num/denList[i]))
  76.     outStr='\n'.join(['sequence=%s Calculation=%0.12f'%(seqList[i],res) for i,res in enumerate(resultList)])
  77.     return 'array set#= %d\nSequence set #=%d\nSequence Header: %s\n%s' %(arraySet,seqSet,dataSeq[0],outStr)
  78. fnArray='weight_matrix.transfac.txt'
  79. fnSeq='redfly_sequence.fasta'
  80. arraySet=1
  81. outList=[]
  82. calcdata=1
  83. while not calcdata is None:
  84.     seqSet=1
  85.     while True:
  86.         calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
  87.         if calcdata:
  88.             outList.append(calcdata)
  89.  
  90.  
  91.             seqSet+=1
  92.         else:
  93.             break
  94.  
  95.     arraySet+=1
  96. print outList
  97. f=open(outputfile,'w')
  98. f.write('/n'.join(outList))
  99. f.close()
  100.  
  101.  
  102.  
waiting
cheers!!
Jul 18 '07 #67
bvdet
2,851 Expert Mod 2GB
After running the script, I can do this:
Expand|Select|Wrap|Line Numbers
  1. >>> print outList[1]
  2. Array set # = 1
  3. Sequence set # = 2
  4. Sequence Header: >Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
  5.  
  6. Sequence = AGTCGACCAGCACGAG Calculation = -0.872390330485
  7. Sequence = GTCGACCAGCACGAGA Calculation = -3.287525755636
  8. Sequence = TCGACCAGCACGAGAT Calculation = -4.346213357398
  9. Sequence = CGACCAGCACGAGATC Calculation = -2.329064001005
  10. .........................
I don't want to print the entire outList because it's over 3 MB.
You may have changed something you should not have. Maybe you should copy the code again. If you need to change things, change only one thing at a time and test to make sure it still works.
Jul 18 '07 #68
aboxylica
111 100+
hello!
I hope you people remember the problem above..
i got little problems with that
that was just opening a file containing files..now il be opening a directory containing different sequence files
this is how the code looks now!
am trying to change the i/p file to folder by showing the path of the folder but its going to the exception file..can you tell me why?
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fn, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.  
  206.     if dataArray:
  207.  
  208.         dataSeq = parseData(fnSeq, seqSet)
  209.  
  210.  
  211.         if not dataSeq:
  212.  
  213.             return False
  214.  
  215.     else:
  216.  
  217.         return None
  218.  
  219.  
  220.  
  221.  
  222.     # This is the complete sequence 
  223.  
  224.     seq = ''.join(dataSeq[1:])
  225.  
  226.  
  227.  
  228.  
  229.  
  230.     # These are the subkeys of dataArray - '01', '02', '03',.............
  231.  
  232.     subKeys = dataArray['A'].keys()
  233.  
  234.     subKeys.sort()
  235.  
  236.  
  237.  
  238.  
  239.  
  240.     # Calculate num/den for each slice of sequence
  241.  
  242.     # Each sequence slice length = length of subKeys
  243.  
  244.     # Example:
  245.     # seq = 'ATCGATA'
  246.  
  247.     # subKeys length = 3
  248.  
  249.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  250.  
  251.     numList = []
  252.  
  253.     denList = []
  254.  
  255.     seqList = []
  256.  
  257.     for i in xrange(len(seq) - len(subKeys)):
  258.  
  259.         subseq = seq[0:len(subKeys)]
  260.  
  261.         seqList.append(subseq)
  262.  
  263.  
  264.         num, den = 1, 1
  265.  
  266.         for j, s in enumerate(subseq):
  267.  
  268.             num *= dataArray[s][subKeys[j]]
  269.  
  270.             den *= value[s]
  271.  
  272.         numList.append(num)
  273.  
  274.         denList.append(den)
  275.  
  276.         seq = seq[1:]
  277.  
  278.  
  279.  
  280.     resultList = []
  281.  
  282.     for i, num in enumerate(numList):
  283.  
  284.         if (log10(num/denList[i]))>=2:
  285.  
  286.         resultList.append(int(abs(1)))
  287.  
  288.  
  289.  
  290.  
  291.  
  292.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  293.  
  294.  
  295.  
  296.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  297.  
  298.  
  299. if __name__ == '__main__':
  300.  
  301.  
  302.  
  303.     fnArray ='half.txt'
  304.  
  305.     fnSeq = 'C:\\python25\ding\YAL005C.txt'
  306.  
  307.  
  308.  
  309.     outputfile =  "sequence_calc_data.txt"
  310.  
  311.  
  312.  
  313.     arraySet = 1
  314.  
  315.     outList = []
  316.  
  317.     calcdata = 1
  318.  
  319.     while not calcdata is None:
  320.  
  321.         seqSet = 1
  322.  
  323.         while True:
  324.  
  325.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  326.             print calcdata
  327.  
  328.             if calcdata:
  329.  
  330.                 outList.append(calcdata)
  331.  
  332.                 seqSet += 1
  333.  
  334.             else:
  335.  
  336.                 break
  337.  
  338.         arraySet += 1
  339.  
  340.  
  341.  
  342.  
  343.  
  344.     f = open(outputfile, 'w')
  345.  
  346.     f.write('\n'.join(outList))
  347.  
  348.     f.close()
  349.     #f=open(outputfile,"r")
  350.     #file_con=f.readlines()
  351.     #for line in file_con:
  352.      #   print line
  353.  
please tell me what can i do??
Dec 11 '07 #69
aboxylica
111 100+
here is my code which is reading a directory containing files..... it seems to go to the exception part always.. i dono why..i think it checks for the first file in the folder and then comes out..how do i check if its going to all the files..
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.             print "am here"
  25.  
  26.             while not line.startswith(key):
  27.                 print "oh yes"
  28.  
  29.                 line = f.next()
  30.  
  31.         except StopIteration, e:
  32.             print '###############################'
  33.  
  34.             print 'We have reached the end of the file!'
  35.  
  36.             f.close()
  37.  
  38.             return False
  39.  
  40.  
  41.  
  42.     headerList = line.strip().split()[1:]
  43.  
  44.  
  45.     lineList = []
  46.  
  47.  
  48.  
  49.     line = f.next().strip()
  50.  
  51.     while not line.startswith(term):
  52.  
  53.         if line != '':
  54.  
  55.             lineList.append(line.strip().split())
  56.  
  57.  
  58.         line = f.next().strip()
  59.  
  60.  
  61.  
  62.     f.close()
  63.  
  64.  
  65.  
  66.     # Key list
  67.  
  68.     keys = [i[0] for i in lineList]
  69.  
  70.     # Values list
  71.  
  72.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  73.  
  74.  
  75.  
  76.     # Create a dictionary from keys and values
  77.  
  78.     lineDict = dict(zip(keys, values))
  79.  
  80.  
  81.  
  82.     dataDict = {}
  83.  
  84.  
  85.  
  86.     for i, item in enumerate(headerList):
  87.  
  88.         dataDict[item] = {}
  89.  
  90.         for key in lineDict:
  91.  
  92.             dataDict[item][key] = lineDict[key][i]
  93.  
  94.  
  95.  
  96.     # Add 1.0 to every element in dataDict subdictionaries
  97.  
  98.     for keyMain in dataDict:
  99.  
  100.         for keySub in dataDict[keyMain]:
  101.  
  102.             dataDict[keyMain][keySub] += 1.0
  103.  
  104.  
  105.  
  106.     # Normalize original data (with 1 added) and update data
  107.  
  108.     valueSums = [sum(item)+4 for item in values]
  109.  
  110.     # print valueSums
  111.  
  112.  
  113.  
  114.     for keyMain in dataDict:
  115.  
  116.         for keySub in dataDict[keyMain]:
  117.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  118.  
  119.     return dataDict
  120.  
  121.  
  122.  
  123.  
  124.  
  125. def parseData(fn, dataset=1, key='>'):
  126.  
  127.     '''
  128.  
  129.     Read a formatted data file of sequences
  130.  
  131.     Return a list of sequences
  132.  
  133.     The first element in the list is the header
  134.  
  135.     '''   
  136.  
  137.     # initialize output list
  138.  
  139.     dataList = []
  140.  
  141.  
  142.  
  143.     # open file for reading
  144.  
  145.     f = open(fn)
  146.  
  147.  
  148.  
  149.     # skip to required data set
  150.  
  151.     for _ in range(dataset):
  152.  
  153.  
  154.         try:
  155.  
  156.             s = f.next()
  157.  
  158.             while not s.startswith(key):
  159.  
  160.  
  161.                 s = f.next()
  162.  
  163.         except StopIteration, e:
  164.  
  165.             print 'We have reached the end of the file!'
  166.             print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
  167.  
  168.             f.close()
  169.  
  170.             return False
  171.  
  172.  
  173.  
  174.     # initialize output list
  175.  
  176.     dataList = [s,]
  177.  
  178.  
  179.     for line in f:
  180.  
  181.         if not line.startswith(key):
  182.  
  183.             dataList.append(line.strip())
  184.  
  185.         else:
  186.  
  187.             break
  188.  
  189.  
  190.  
  191.     f.close()
  192.  
  193.     return dataList
  194.  
  195.  
  196.  
  197.  
  198.  
  199. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  200.  
  201.     # sequence factor dictionary
  202.  
  203.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  204.  
  205.  
  206.  
  207.     dataArray = parseArray(fnArray, arraySet)
  208.  
  209.  
  210.     if dataArray:
  211.  
  212.         dataSeq = parseData(fnSeq, seqSet)
  213.  
  214.  
  215.         if not dataSeq:
  216.  
  217.             return False
  218.  
  219.     else:
  220.  
  221.         return None
  222.  
  223.  
  224.  
  225.  
  226.     # This is the complete sequence 
  227.  
  228.     seq = ''.join(dataSeq[1:])
  229.  
  230.  
  231.  
  232.  
  233.  
  234.     # These are the subkeys of dataArray - '01', '02', '03',.............
  235.  
  236.     subKeys = dataArray['A'].keys()
  237.  
  238.     subKeys.sort()
  239.  
  240.  
  241.  
  242.  
  243.  
  244.     # Calculate num/den for each slice of sequence
  245.  
  246.     # Each sequence slice length = length of subKeys
  247.  
  248.     # Example:
  249.     # seq = 'ATCGATA'
  250.  
  251.     # subKeys length = 3
  252.  
  253.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  254.  
  255.     numList = []
  256.  
  257.     denList = []
  258.  
  259.     seqList = []
  260.  
  261.     for i in xrange(len(seq) - len(subKeys)):
  262.  
  263.         subseq = seq[0:len(subKeys)]
  264.  
  265.         seqList.append(subseq)
  266.  
  267.  
  268.         num, den = 1, 1
  269.  
  270.         for j, s in enumerate(subseq):
  271.  
  272.             num *= dataArray[s][subKeys[j]]
  273.  
  274.             den *= value[s]
  275.  
  276.         numList.append(num)
  277.  
  278.         denList.append(den)
  279.  
  280.         seq = seq[1:]
  281.  
  282.  
  283.  
  284.     resultList = []
  285.  
  286.     for i, num in enumerate(numList):
  287.         #p=log10(num/denList[i])
  288.         #if (p) >=2:
  289.             #print "#########",abs(int(p))
  290.         if (log10(num/denList[i]))>=2:
  291.             #print "i am here"
  292.         resultList.append(int(abs(1)))
  293.     #print resultList
  294.     #for i in resultList:
  295.     #mean=sum(resultList)/len(resultList)
  296.         #sub=mean-i
  297.         #queue = []
  298.         #queue = (sub)**2
  299.         #print sqrt(queue/len(resultList))
  300.  
  301.     #print mean,"@@@@@@@@@@"
  302.  
  303.  
  304.  
  305.  
  306.  
  307.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  308.     #print "this is line 294"
  309.  
  310.  
  311.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  312.  
  313.  
  314. if __name__ == '__main__':
  315.  
  316.  
  317.  
  318.     fnArray ='C:\\python25\\half.txt'
  319.     import os
  320.     seq_=os.listdir("ding")
  321.     print seq_
  322.     os.chdir("C:\\python25\\New Folder")
  323.     for file_ in seq_:
  324.         if os.path.isfile(file_):
  325.             rem=open(file_)
  326.             dingg=rem.readlines()
  327.     fnSeq = dingg
  328.  
  329.  
  330.  
  331.     outputfile =  "sequence_calc_data.txt"
  332.  
  333.  
  334.  
  335.     arraySet = 1
  336.  
  337.     outList = []
  338.  
  339.     calcdata = 1
  340.  
  341.     while not calcdata is None:
  342.  
  343.         seqSet = 1
  344.  
  345.         while True:
  346.  
  347.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  348.             print calcdata
  349.  
  350.             if calcdata:
  351.  
  352.                 outList.append(calcdata)
  353.  
  354.                 seqSet += 1
  355.  
  356.             else:
  357.  
  358.                 break
  359.  
  360.         arraySet += 1
  361.  
  362.  
  363.  
  364.  
  365.  
  366.     f = open(outputfile, 'w')
  367.  
  368.     f.write('\n'.join(outList))
  369.  
  370.     f.close()
  371.     #f=open(outputfile,"r")
  372.     #file_con=f.readlines()
  373.     #for line in file_con:
  374.      #   print line
  375.  
waiting for ur reply,
cheers!
Dec 11 '07 #70
bvdet
2,851 Expert Mod 2GB
What error message are you receiving? It appears that you are trying to read data in from the files in a directory. You should be passing the file name to your functions, not the text from one of the files. Try something like this:
Expand|Select|Wrap|Line Numbers
  1. import os
  2. # Create a list of directory entries
  3. dir_name = 'your_directory'
  4. fList = os.listdir(dir_name)
  5. # The directory entries do not include the full path
  6. # Create a list of file names with full path, eliminating subdirectory entries
  7. fList1 = [os.path.join(dir_name, f) for f in fList if os.path.isfile(os.path.join(dir_name, f))]
  8. seqSetIndex = 0
  9. fnSeq = fList1[seqSetIndex]
  10. while True:
  11.     ... do stuff ...
  12.     seqSetIndex += 1
'fnSeq' is the file name will full path that is passed to your parsing function.
Dec 12 '07 #71
aboxylica
111 100+
to show you typically how my files are.. I am storing them inside python 25 folder..the folder is called New folder which has around five thousand files..so i cant possibly give the names of all the files. for eg YAL001C,YAL002w..etc
each file has sequence like this:
>Scer_YAL001C SGDID:S0000001 5' untranslated region, Chr I 151168 - 152167 (revcom), 1000 bp
ACTTGTAAATATATCTTTTATTTTCCGAGAGGAAAAAGTTTCAAAAAAAA AAAAAAAAAAAGAAGAAAAATAACTTTCTCATGTAATAAAAGGTAACTAA TGTAGACAAAAAAGTATACATTTAGCTTTTCTTTTTTTGATGATTTTTGA GTTTCATGTTACTAATCAGAACAATTAACGACACCTTCATTATGAAAAAA TTAATTAGCTATAAGTCTTCGAAGTAGAACATGATATTTGGCAATCACTC GAATAACTATCTTAATTTACCTGCTGAAATAATTTGAAAAAACACCCGAG GCAGCAGACGAAAGGTGTTTTTGCTAAACAATGATTGATTTCTGGCGCCA TTTCTACATTCTGAACAGTTCATCTCATTTCAGTAACAGTACTTCAATGG AATATTTATTAAAGAAAGTGCTTAAAAAAGTATTATAAAACGATACATGG ACTGACTCAAGATTGAGCTAATAAGGTCCACCGCCTAGTGCTTAAGAGTT CTGTACCACTATAATAATTTATCTTGATCGTATTATGTGTAAAAAAAAGG CGCTTGAAATGAAAGCTCCGAAAATTAAAATACTTTGACTGCTTCGGAAA ACAAAAACATATAAATAAATTTAAAAAATAAACTGTAAAATATTTAAAAA CTATTAAAAATATTTTATATTTTTAAAATTATTTATTATTATGTCATGTG ACAAGACTTAAATCATTACATAAAAGGTTTTGAAGTTCAATGTCAAAGTC AATATAATAAGCATACTAAGGCACACTTATGCAAATCGAGTTATTGAAGC TGGTAAAATTATAAGATTTTTATTTTTATTTCTTTTATTTCTGCAAATCT GCATTTTCAAATACCGCTTGGTTTTTTGCATCATAAAGGGCGGCGCTTTC AGTCGCGAAAGTGAAATAAACAACCAGTCACACATATAACTTTCTTCTTG CCATAAGAGAGAAGAGGACGTTTGGTTGAAGCCAACTAGCCACAAGAAAA
>Spar YAL001C c218:24375..25375
TCAAGAGGGTTATTATATACCGATATTTGAATCCACACATGATCGAACTA ATATAATTCACTACTTAGCTGCTTAACCATTCTTTGCCATTATAATAAAT GTATCTCGATCGAATGCTGCATAGAGAAAAGCGCTTAAAAAAGTGTTCCG AAATATAACATATTTTAAACGTTTCGGAAACCAAAAACATATATTTATTA TCTTAAAGTAATCTAAAAAATGAAAGAAACTTTGATATATTTAAAACATA ATATAATTATTTTATAAAAGTAACATGTGATTAAACTCACAAAGCCTAAA AATGTTTTTAATCATTATGTTAAGCTGAATATAGTATATCTAATAATTGT TTATTTAAGCAGATTGAACTGTGAATGCTAGTAAATTTTTAAGGGTTTTT TGTTGTTCTGCTTTTCTCTAGTTTTGTAGTGTCAAATACAACTAACTGGA TTTTTTGCATCAGAAGGGGCGCTTTAACTCGCGAAAGTAAAAATAAACAA TTAATCACACATATCTTTTCCTCTTGCAGAAGCAAAGAAGAGGACGGTTG ATTGAAGCAAACCAGCCACAGAAAATATGGCGTTGACAATTTATCCTGAT GAACTGGTTAAAATAGTGTCTGATGAAATTGCATCAAATAAAGGAAGTAT GTATATGCCTCATTCTTCTATTCCATGTTCTTTTCAGGTGAGAAACGTGA TATATTGTAAGATTATTTACTAACGACTTATTAAAGAAATTACGTTAAAT CAGCTTTGGGATATATCTCGTAAATATTTTGATTTGTCTGACGAGAAAGT TAAACAATTTGTGCTTTCATGCGTGATGTTGAAAAAGGACATCGAGGTGT ACTGTGATAGCGTTATAACAACTAAAAACGTGACAAATATTATAGACGAC ACTAGTCATTCATACTCAGTAGGGATTACTGAGGACAGCCTGTGGACGTT ATTAACTGGATACACAAAGAAGGAGTCAACTATCGGGAATTCAGCATTTG A

similarly each file has such sequences my code has to go to all the sequences and do it for all the files.
when i run this code , it says

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.



C:\Python25>python this_final.py
['sequence_calc_data.txt', 'YAL001C.fasta', 'YAL002W.fasta', 'YAL003W.fasta', 'Y
AL005C.fasta', 'YAL007C.fasta']
am here
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
###############################
We have reached the end of the file!
None
why is this happening.
this is my code
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fn, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.  
  206.     if dataArray:
  207.  
  208.         dataSeq = parseData(fnSeq, seqSet)
  209.  
  210.  
  211.         if not dataSeq:
  212.  
  213.             return False
  214.  
  215.     else:
  216.  
  217.         return None
  218.  
  219.  
  220.  
  221.  
  222.     # This is the complete sequence 
  223.  
  224.     seq = ''.join(dataSeq[1:])
  225.  
  226.  
  227.  
  228.  
  229.  
  230.     # These are the subkeys of dataArray - '01', '02', '03',.............
  231.  
  232.     subKeys = dataArray['A'].keys()
  233.  
  234.     subKeys.sort()
  235.  
  236.  
  237.  
  238.  
  239.  
  240.     # Calculate num/den for each slice of sequence
  241.  
  242.     # Each sequence slice length = length of subKeys
  243.  
  244.     # Example:
  245.     # seq = 'ATCGATA'
  246.  
  247.     # subKeys length = 3
  248.  
  249.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  250.  
  251.     numList = []
  252.  
  253.     denList = []
  254.  
  255.     seqList = []
  256.  
  257.     for i in xrange(len(seq) - len(subKeys)):
  258.  
  259.         subseq = seq[0:len(subKeys)]
  260.  
  261.         seqList.append(subseq)
  262.  
  263.  
  264.         num, den = 1, 1
  265.  
  266.         for j, s in enumerate(subseq):
  267.  
  268.             num *= dataArray[s][subKeys[j]]
  269.  
  270.             den *= value[s]
  271.  
  272.         numList.append(num)
  273.  
  274.         denList.append(den)
  275.  
  276.         seq = seq[1:]
  277.  
  278.  
  279.  
  280.     resultList = []
  281.  
  282.     for i, num in enumerate(numList):
  283.         #p=log10(num/denList[i])
  284.         #if (p) >=2:
  285.             #print "#########",abs(int(p))
  286.         if (log10(num/denList[i]))>=2:
  287.             #print "i am here"
  288.         resultList.append(int(abs(1)))
  289.     #print resultList
  290.     #for i in resultList:
  291.     #mean=sum(resultList)/len(resultList)
  292.         #sub=mean-i
  293.         #queue = []
  294.         #queue = (sub)**2
  295.         #print sqrt(queue/len(resultList))
  296.  
  297.     #print mean,"@@@@@@@@@@"
  298.  
  299.  
  300.  
  301.  
  302.  
  303.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  304.     #print "this is line 294"
  305.  
  306.  
  307.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  308.  
  309.  
  310. if __name__ == '__main__':
  311.  
  312.  
  313.     fnArray ='C:\python25\half.txt'
  314.     fnSeq = 'C:\python25\New Folder'
  315.     import os
  316.     dir_name='New Folder'
  317.     fList=os.listdir(dir_name)
  318.     fList1=[os.path.join(dir_name,f) for f in fList if
  319.     os.path.isfile(os.path.join(dir_name,f))]
  320.     seqSetIndex=0
  321.     fnSeq=fList1[seqSetIndex]
  322.     while True:
  323.         seqSetIndex+=1
  324.  
  325.  
  326.  
  327.     outputfile =  "sequence_calc_data.txt"
  328.  
  329.  
  330.  
  331.     arraySet = 1
  332.  
  333.     outList = []
  334.  
  335.     calcdata = 1
  336.  
  337.     while not calcdata is None:
  338.  
  339.         seqSet = 1
  340.  
  341.         while True:
  342.  
  343.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  344.             print calcdata
  345.  
  346.             if calcdata:
  347.  
  348.                 outList.append(calcdata)
  349.  
  350.                 seqSet += 1
  351.  
  352.             else:
  353.  
  354.                 break
  355.  
  356.         arraySet += 1
  357.  
  358.  
  359.  
  360.  
  361.  
  362.     f = open(outputfile, 'w')
  363.  
  364.     f.write('\n'.join(outList))
  365.  
  366.     f.close()
  367.     #f=open(outputfile,"r")
  368.     #file_con=f.readlines()
  369.     #for line in file_con:
  370.      #   print line
  371.  
Dec 12 '07 #72
aboxylica
111 100+
when i run the code below.. its going to an infinite loop which says "reached the end of file"
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2.  
  3. def parseArray(fn, dataset=1, key='PO', term='/'):
  4.  
  5.     '''
  6.  
  7.     Read a formatted data file in matrix format and
  8.  
  9.     compile data into a dictionary
  10.  
  11.     '''
  12.  
  13.     f = open(fn)
  14.  
  15.  
  16.  
  17.     # skip to required data set
  18.  
  19.     for _ in range(dataset):
  20.  
  21.  
  22.         try:
  23.  
  24.             line = f.next()
  25.  
  26.             while not line.startswith(key):
  27.  
  28.                 line = f.next()
  29.  
  30.         except StopIteration, e:
  31.  
  32.             print 'We have reached the end of the file!'
  33.  
  34.             f.close()
  35.  
  36.             return False
  37.  
  38.  
  39.  
  40.     headerList = line.strip().split()[1:]
  41.  
  42.  
  43.     lineList = []
  44.  
  45.  
  46.  
  47.     line = f.next().strip()
  48.  
  49.     while not line.startswith(term):
  50.  
  51.         if line != '':
  52.  
  53.             lineList.append(line.strip().split())
  54.  
  55.  
  56.         line = f.next().strip()
  57.  
  58.  
  59.  
  60.     f.close()
  61.  
  62.  
  63.  
  64.     # Key list
  65.  
  66.     keys = [i[0] for i in lineList]
  67.  
  68.     # Values list
  69.  
  70.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  71.  
  72.  
  73.  
  74.     # Create a dictionary from keys and values
  75.  
  76.     lineDict = dict(zip(keys, values))
  77.  
  78.  
  79.  
  80.     dataDict = {}
  81.  
  82.  
  83.  
  84.     for i, item in enumerate(headerList):
  85.  
  86.         dataDict[item] = {}
  87.  
  88.         for key in lineDict:
  89.  
  90.             dataDict[item][key] = lineDict[key][i]
  91.  
  92.  
  93.  
  94.     # Add 1.0 to every element in dataDict subdictionaries
  95.  
  96.     for keyMain in dataDict:
  97.  
  98.         for keySub in dataDict[keyMain]:
  99.  
  100.             dataDict[keyMain][keySub] += 1.0
  101.  
  102.  
  103.  
  104.     # Normalize original data (with 1 added) and update data
  105.  
  106.     valueSums = [sum(item)+4 for item in values]
  107.  
  108.     # print valueSums
  109.  
  110.  
  111.  
  112.     for keyMain in dataDict:
  113.  
  114.         for keySub in dataDict[keyMain]:
  115.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  116.  
  117.     return dataDict
  118.  
  119.  
  120.  
  121.  
  122.  
  123. def parseData(fn, dataset=1, key='>'):
  124.  
  125.     '''
  126.  
  127.     Read a formatted data file of sequences
  128.  
  129.     Return a list of sequences
  130.  
  131.     The first element in the list is the header
  132.  
  133.     '''   
  134.  
  135.     # initialize output list
  136.  
  137.     dataList = []
  138.  
  139.  
  140.  
  141.     # open file for reading
  142.  
  143.     f = open(fn)
  144.  
  145.  
  146.  
  147.     # skip to required data set
  148.  
  149.     for _ in range(dataset):
  150.  
  151.  
  152.         try:
  153.  
  154.             s = f.next()
  155.  
  156.             while not s.startswith(key):
  157.  
  158.  
  159.                 s = f.next()
  160.  
  161.         except StopIteration, e:
  162.  
  163.             print 'We have reached the end of the file!'
  164.  
  165.             f.close()
  166.  
  167.             return False
  168.  
  169.  
  170.  
  171.     # initialize output list
  172.  
  173.     dataList = [s,]
  174.  
  175.  
  176.     for line in f:
  177.  
  178.         if not line.startswith(key):
  179.  
  180.             dataList.append(line.strip())
  181.  
  182.         else:
  183.  
  184.             break
  185.  
  186.  
  187.  
  188.     f.close()
  189.  
  190.     return dataList
  191.  
  192.  
  193.  
  194.  
  195.  
  196. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  197.  
  198.     # sequence factor dictionary
  199.  
  200.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  201.  
  202.  
  203.  
  204.     dataArray = parseArray(fnArray, arraySet)
  205.  
  206.  
  207.     if dataArray:
  208.  
  209.         dataSeq = parseData(fnSeq, seqSet)
  210.  
  211.  
  212.         if not dataSeq:
  213.  
  214.             return False
  215.  
  216.     else:
  217.  
  218.         return None
  219.  
  220.  
  221.  
  222.  
  223.     # This is the complete sequence 
  224.  
  225.     seq = ''.join(dataSeq[1:])
  226.  
  227.  
  228.  
  229.  
  230.  
  231.     # These are the subkeys of dataArray - '01', '02', '03',.............
  232.  
  233.     subKeys = dataArray['A'].keys()
  234.  
  235.     subKeys.sort()
  236.  
  237.  
  238.  
  239.  
  240.  
  241.     # Calculate num/den for each slice of sequence
  242.  
  243.     # Each sequence slice length = length of subKeys
  244.  
  245.     # Example:
  246.     # seq = 'ATCGATA'
  247.  
  248.     # subKeys length = 3
  249.  
  250.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  251.  
  252.     numList = []
  253.  
  254.     denList = []
  255.  
  256.     seqList = []
  257.  
  258.     for i in xrange(len(seq) - len(subKeys)):
  259.  
  260.         subseq = seq[0:len(subKeys)]
  261.  
  262.         seqList.append(subseq)
  263.  
  264.  
  265.         num, den = 1, 1
  266.  
  267.         for j, s in enumerate(subseq):
  268.  
  269.             num *= dataArray[s][subKeys[j]]
  270.  
  271.             den *= value[s]
  272.  
  273.         numList.append(num)
  274.  
  275.         denList.append(den)
  276.  
  277.         seq = seq[1:]
  278.  
  279.  
  280.  
  281.     resultList = []
  282.  
  283.     for i, num in enumerate(numList):
  284.         #p=log10(num/denList[i])
  285.         #if (p) >=2:
  286.             #print "#########",abs(int(p))
  287.         if (log10(num/denList[i]))>=2:
  288.             #print "i am here"
  289.         resultList.append(int(abs(1)))
  290.     #print resultList
  291.     #for i in resultList:
  292.     #mean=sum(resultList)/len(resultList)
  293.         #sub=mean-i
  294.         #queue = []
  295.         #queue = (sub)**2
  296.         #print sqrt(queue/len(resultList))
  297.  
  298.     #print mean,"@@@@@@@@@@"
  299.  
  300.  
  301.  
  302.  
  303.  
  304.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  305.     #print "this is line 294"
  306.  
  307.  
  308.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  309.  
  310.  
  311. if __name__ == '__main__':
  312.  
  313.  
  314.     fnArray ='C:\python25\half.txt'
  315.     fnSeq = 'C:\python25\New Folder'
  316.     import os
  317.     dir_name='New Folder'
  318.     fList=os.listdir(dir_name)
  319.     fList1=[os.path.join(dir_name,f) for f in fList if
  320.     os.path.isfile(os.path.join(dir_name,f))]
  321.     seqSetIndex=0
  322.     fnSeq=fList1[seqSetIndex]
  323.     while True:
  324.  
  325.  
  326.  
  327.  
  328.         outputfile =  "sequence_calc_data.txt"
  329.  
  330.  
  331.  
  332.         arraySet = 1
  333.  
  334.         outList = []
  335.  
  336.         calcdata = 1
  337.  
  338.         while not calcdata is None:
  339.  
  340.             seqSet = 1
  341.  
  342.             while True:
  343.  
  344.                 calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  345.                 print calcdata
  346.  
  347.                 if calcdata:
  348.  
  349.                     outList.append(calcdata)
  350.  
  351.                     seqSet += 1
  352.  
  353.                 else:
  354.  
  355.                     break
  356.  
  357.         arraySet += 1
  358.         seqSetIndex+=1
  359.  
  360.  
  361.  
  362.  
  363.  
  364.     f = open(outputfile, 'w')
  365.  
  366.     f.write('\n'.join(outList))
  367.  
  368.     f.close()
  369.     #f=open(outputfile,"r")
  370.     #file_con=f.readlines()
  371.     #for line in file_con:
  372.      #   print line
  373.  
Dec 12 '07 #73
bvdet
2,851 Expert Mod 2GB
You have two while loops. When calcData() returns None (when the end of file is reached), you issue one break statement. You may need another to get out of the other loop.
Dec 12 '07 #74
aboxylica
111 100+
this is my code now.. i have added a break statement and still there seems to be some prob. it is coming out of the loop and saying
we have reached the end of file
None
here is the code
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fnSeq, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.  
  206.     if dataArray:
  207.  
  208.         dataSeq = parseData(fnSeq, seqSet)
  209.  
  210.  
  211.         if not dataSeq:
  212.  
  213.             return False
  214.  
  215.     else:
  216.  
  217.         return None
  218.  
  219.  
  220.  
  221.  
  222.     # This is the complete sequence 
  223.  
  224.     seq = ''.join(dataSeq[1:])
  225.  
  226.  
  227.  
  228.  
  229.  
  230.     # These are the subkeys of dataArray - '01', '02', '03',.............
  231.  
  232.     subKeys = dataArray['A'].keys()
  233.  
  234.     subKeys.sort()
  235.  
  236.  
  237.  
  238.  
  239.  
  240.     # Calculate num/den for each slice of sequence
  241.  
  242.     # Each sequence slice length = length of subKeys
  243.  
  244.     # Example:
  245.     # seq = 'ATCGATA'
  246.  
  247.     # subKeys length = 3
  248.  
  249.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  250.  
  251.     numList = []
  252.  
  253.     denList = []
  254.  
  255.     seqList = []
  256.  
  257.     for i in xrange(len(seq) - len(subKeys)):
  258.  
  259.         subseq = seq[0:len(subKeys)]
  260.  
  261.         seqList.append(subseq)
  262.  
  263.  
  264.         num, den = 1, 1
  265.  
  266.         for j, s in enumerate(subseq):
  267.  
  268.             num *= dataArray[s][subKeys[j]]
  269.  
  270.             den *= value[s]
  271.  
  272.         numList.append(num)
  273.  
  274.         denList.append(den)
  275.  
  276.         seq = seq[1:]
  277.  
  278.  
  279.  
  280.     resultList = []
  281.  
  282.     for i, num in enumerate(numList):
  283.         #p=log10(num/denList[i])
  284.         #if (p) >=2:
  285.             #print "#########",abs(int(p))
  286.         #if (log10(num/denList[i]))>=2:
  287.             #print "i am here"
  288.         resultList.append(int(abs(1)))
  289.     #print resultList
  290.     #for i in resultList:
  291.     #mean=sum(resultList)/len(resultList)
  292.         #sub=mean-i
  293.         #queue = []
  294.         #queue = (sub)**2
  295.         #print sqrt(queue/len(resultList))
  296.  
  297.     #print mean,"@@@@@@@@@@"
  298.  
  299.  
  300.  
  301.  
  302.  
  303.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  304.     #print "this is line 294"
  305.  
  306.  
  307.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  308.  
  309.  
  310. if __name__ == '__main__':
  311.  
  312.  
  313.     fnArray ='C:\python25\half.txt'
  314.     fnSeq = 'C:\python25\New Folder'
  315.     import os
  316.     dir_name='New Folder'
  317.     fList=os.listdir(dir_name)
  318.     fList1=[os.path.join(dir_name,f) for f in fList if os.path.isfile(os.path.join(dir_name,f))]
  319.     seqSetIndex=0
  320.     fnSeq=fList1[seqSetIndex]
  321.     while True:
  322.  
  323.  
  324.  
  325.  
  326.         outputfile =  "sequence_calc_data.txt"
  327.  
  328.  
  329.  
  330.         arraySet = 1
  331.  
  332.         outList = []
  333.  
  334.         calcdata = 1
  335.  
  336.         while not calcdata is None:
  337.  
  338.             seqSet = 1
  339.  
  340.             while True:
  341.  
  342.                 calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  343.                 print calcdata
  344.  
  345.                 if calcdata:
  346.  
  347.                     outList.append(calcdata)
  348.  
  349.                     seqSet += 1
  350.  
  351.                 else:
  352.  
  353.                     break
  354.  
  355.             arraySet += 1
  356.             seqSetIndex+=1
  357.         else:
  358.             break
  359.  
  360.  
  361.  
  362.  
  363.  
  364.  
  365.     f = open(outputfile, 'w')
  366.  
  367.     f.write('\n'.join(outList))
  368.  
  369.     f.close()
  370.     #f=open(outputfile,"r")
  371.     #file_con=f.readlines()
  372.     #for line in file_con:
  373.      #   print line
  374.  
Dec 12 '07 #75
aboxylica
111 100+
hey!
That code seens to work but the problem is it is checking only for the first six files... i donno where is the looping problem!!
can you see it??
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fnSeq, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fnSeq)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.  
  206.     if dataArray:
  207.  
  208.         dataSeq = parseData(fnSeq, seqSet)
  209.  
  210.  
  211.         if not dataSeq:
  212.  
  213.             return False
  214.  
  215.     else:
  216.  
  217.         return None
  218.  
  219.  
  220.  
  221.  
  222.     # This is the complete sequence 
  223.  
  224.     seq = ''.join(dataSeq[1:])
  225.  
  226.  
  227.  
  228.  
  229.  
  230.     # These are the subkeys of dataArray - '01', '02', '03',.............
  231.  
  232.     subKeys = dataArray['A'].keys()
  233.  
  234.     subKeys.sort()
  235.  
  236.  
  237.  
  238.  
  239.  
  240.     # Calculate num/den for each slice of sequence
  241.  
  242.     # Each sequence slice length = length of subKeys
  243.  
  244.     # Example:
  245.     # seq = 'ATCGATA'
  246.  
  247.     # subKeys length = 3
  248.  
  249.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  250.  
  251.     numList = []
  252.  
  253.     denList = []
  254.  
  255.     seqList = []
  256.  
  257.     for i in xrange(len(seq) - len(subKeys)):
  258.  
  259.         subseq = seq[0:len(subKeys)]
  260.  
  261.         seqList.append(subseq)
  262.  
  263.  
  264.         num, den = 1, 1
  265.  
  266.         for j, s in enumerate(subseq):
  267.  
  268.             num *= dataArray[s][subKeys[j]]
  269.  
  270.             den *= value[s]
  271.  
  272.         numList.append(num)
  273.  
  274.         denList.append(den)
  275.  
  276.         seq = seq[1:]
  277.  
  278.  
  279.  
  280.     resultList = []
  281.  
  282.     for i, num in enumerate(numList):
  283.         #p=log10(num/denList[i])
  284.         #if (p) >=2:
  285.             #print "#########",abs(int(p))
  286.         #if (log10(num/denList[i]))>=2:
  287.             #print "i am here"
  288.         resultList.append(int(abs(1)))
  289.     #print resultList
  290.     #for i in resultList:
  291.     #mean=sum(resultList)/len(resultList)
  292.         #sub=mean-i
  293.         #queue = []
  294.         #queue = (sub)**2
  295.         #print sqrt(queue/len(resultList))
  296.  
  297.     #print mean,"@@@@@@@@@@"
  298.  
  299.  
  300.  
  301.  
  302.  
  303.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  304.     #print "this is line 294"
  305.  
  306.  
  307.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  308.  
  309.  
  310. if __name__ == '__main__':
  311.  
  312.  
  313.     fnArray ='one_redfly.transfac'
  314.     fnSeq = 'deepthi/upstream_regions'
  315.     import os
  316.     dir_name='upstream_regions'
  317.     fList=os.listdir(dir_name)
  318.     fList1=[os.path.join(dir_name,f) for f in fList if os.path.isfile(os.path.join(dir_name,f))]
  319.     seqSetIndex=0
  320.     fnSeq=fList1[seqSetIndex]
  321.  
  322.     while True:
  323.  
  324.  
  325.  
  326.  
  327.         outputfile =  "sequence_calc_data.txt"
  328.  
  329.  
  330.  
  331.         arraySet = 1
  332.  
  333.         outList = []
  334.  
  335.         calcdata = 1
  336.  
  337.         while not calcdata is None:
  338.  
  339.             seqSet = 1
  340.  
  341.             while True:
  342.  
  343.                 calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  344.                 print calcdata
  345.  
  346.                 if calcdata:
  347.  
  348.                     outList.append(calcdata)
  349.  
  350.                     seqSet += 1
  351.  
  352.                 else:
  353.  
  354.                     break
  355.  
  356.             arraySet += 1
  357.             seqSetIndex+=1
  358.         else:
  359.             break
  360.  
  361.  
  362.  
  363.  
  364.  
  365.  
  366.     f = open(outputfile, 'w')
  367.  
  368.     f.write('\n'.join(outList))
  369.  
  370.     f.close()
  371.     #f=open(outputfile,"r")
  372.     #file_con=f.readlines()
  373.     #for line in file_con:
  374.      #   print line
  375.  
waiting for ur reply!
cheers!!
Dec 13 '07 #76
aboxylica
111 100+
when the seqset is becoming 6.. it is coming out of the loop..why is this happening???
Dec 13 '07 #77
bvdet
2,851 Expert Mod 2GB
This code is untested, but I think is close to what you are trying to do:
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.     import os
  3.  
  4.     fnArray = 'array_file'
  5.     outputfile = 'output_file'
  6.     masterList = []
  7.  
  8.     # user has multiple sequence files
  9.     # and must iterate over all the files
  10.     dir_name = 'your_directory'
  11.     fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
  12.                 if os.path.isfile(os.path.join(dir_name,f))]
  13.  
  14.     for fnSeq in fileList:
  15.         outList = []
  16.         calcdata = 1
  17.         while not calcdata is None:
  18.             seqSet = 1
  19.             while True:
  20.                 calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  21.                 if calcdata:
  22.                     outList.append(calcdata)
  23.                     seqSet += 1
  24.                 else:
  25.                     break
  26.             arraySet += 1
  27.         masterList.append('\n'.join(outList))
  28.     f = open(outputfile, 'w')
  29.     f.write('\n'.join(masterList))
  30.     f.close()    
  31.  
Dec 13 '07 #78
aboxylica
111 100+
i donno the code is not executing.. sequence set is not at all incrementing..
when i print seqSet.. it is always one..it is not incrementing and the resultlist is empty.
i have been trying.. there is some problem in writing it to a file also..
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             #print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fnSeq, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             #print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.     print dataArray
  205.  
  206.  
  207.     if dataArray:
  208.  
  209.         dataSeq = parseData(fnSeq, seqSet)
  210.         #print dataSeq,"@@@@@@@@@@@@@@@@@@@"
  211.  
  212.  
  213.         if not dataSeq:
  214.  
  215.             return False
  216.  
  217.     else:
  218.  
  219.         return None
  220.  
  221.  
  222.  
  223.  
  224.     # This is the complete sequence 
  225.  
  226.     seq = ''.join(dataSeq[1:])
  227.  
  228.  
  229.  
  230.  
  231.  
  232.     # These are the subkeys of dataArray - '01', '02', '03',.............
  233.  
  234.     subKeys = dataArray['A'].keys()
  235.  
  236.     subKeys.sort()
  237.  
  238.  
  239.  
  240.  
  241.  
  242.     # Calculate num/den for each slice of sequence
  243.  
  244.     # Each sequence slice length = length of subKeys
  245.  
  246.     # Example:
  247.     # seq = 'ATCGATA'
  248.  
  249.     # subKeys length = 3
  250.  
  251.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  252.  
  253.     numList = []
  254.  
  255.     denList = []
  256.  
  257.     seqList = []
  258.  
  259.     for i in xrange(len(seq) - len(subKeys)):
  260.  
  261.         subseq = seq[0:len(subKeys)]
  262.  
  263.         seqList.append(subseq)
  264.  
  265.  
  266.         num, den = 1, 1
  267.  
  268.         for j, s in enumerate(subseq):
  269.  
  270.             num *= dataArray[s][subKeys[j]]
  271.  
  272.             den *= value[s]
  273.  
  274.         numList.append(num)
  275.  
  276.         denList.append(den)
  277.  
  278.         seq = seq[1:]
  279.  
  280.  
  281.  
  282.     resultList = []
  283.  
  284.     for i, num in enumerate(numList):
  285.         #p=log10(num/denList[i])
  286.         #if (p) >=2:
  287.             #print "#########",abs(int(p))
  288.         if (log10(num/denList[i]))>=2:
  289.             #print "i am here"
  290.         resultList.append(int(abs(1)))
  291.     #print resultList
  292.     #for i in resultList:
  293.     #mean=sum(resultList)/len(resultList)
  294.         #sub=mean-i
  295.         #queue = []
  296.         #queue = (sub)**2
  297.         #print sqrt(queue/len(resultList))
  298.  
  299.     #print mean,"@@@@@@@@@@"
  300.  
  301.  
  302.  
  303.  
  304.  
  305.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  306.     #print "this is line 294"
  307.  
  308.  
  309.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  310.  
  311.  
  312. if __name__ == '__main__':
  313.     import os
  314.  
  315.  
  316.  
  317.     fnArray ='half.txt'
  318.     outputfile='output_file'
  319.     masterList=[]
  320.     dir_name='New Folder'
  321.     fileList=[os.path.join(dir_name,f) for f in os.listdir(dir_name) if os.path.isfile(os.path.join(dir_name,f))]
  322.  
  323.  
  324.  
  325.     for fnSeq in fileList:
  326.         outList=[]
  327.         calcdata=1
  328.         arraySet=1
  329.  
  330.         while not calcdata is None:
  331.  
  332.             seqSet=1
  333.             while True:
  334.                 print "am here"
  335.  
  336.                 calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
  337.  
  338.                 if calcdata:
  339.                     outList.append(calcdata)
  340.                     #print outList
  341.                     seqSet += 1
  342.                     print seqSet,""################"
  343.                 else:
  344.                     break
  345.             arraySet+=1
  346.         masterList.append('\n'.join(outList))
  347.     #f=open(outputfile,'W')
  348.     #fwrite('\n'.join(masterList))
  349.     #f.close()
  350.  
waiting for ur reply
cheers!
Dec 13 '07 #79
bvdet
2,851 Expert Mod 2GB
There is no need to post the entire code repeatedly. Presumably the code before "if __name__ == '__main__':" works. I assume that you have one array file and multiple sequence files. You want to iterate on each array set in the array file on each sequence set for each sequence file in a separate directory. The following code works for me given the previous assumptions:
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.     import os
  3.  
  4.     fnArray = r'H:\array.txt'
  5.     outputfile = r'H:\seq_array_output.txt'
  6.     masterList = []
  7.  
  8.     # user has multiple sequence files
  9.     # and must iterate over all the files
  10.     dir_name = r'H:\TEMP\sequence_files
  11.     fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
  12.                 if os.path.isfile(os.path.join(dir_name,f))]
  13.  
  14.     for fnSeq in fileList:
  15.         outList = []
  16.         calcdata = 1
  17.         arraySet = 1
  18.         while not calcdata is None:
  19.             seqSet = 1
  20.             while True:
  21.                 calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  22.                 if calcdata:
  23.                     outList.append(calcdata)
  24.                     seqSet += 1
  25.                 else:
  26.                     break
  27.             arraySet += 1
  28.         masterList.append('\n'.join(outList))
  29.     f = open(outputfile, 'w')
  30.     f.write('\n'.join(masterList))
  31.     f.close()    
Following is a snippet from the array file:
NA bin
PO A C G T
01 0.45 8.27 0.00 11.39
02 0.00 0.00 10.02 10.09
03 5.80 1.39 0.00 12.93
//
//
NA bap
PO A C G T
01 0.00 3.67 0.00 0.00
02 0.00 0.00 3.67 0.00
03 0.00 0.00 0.00 3.67
04 0.00 3.67 0.00 0.00
//
//
NA bcd
PO A C G T
01 42.55 8.75 145.86 8.14
02 0.14 0.53 204.64 0.00
03 126.83 78.02 0.11 0.34
04 0.21 0.17 0.00 204.92
05 0.00 12.38 0.43 192.50
06 174.48 0.95 1.32 28.56
07 79.53 4.70 100.44 20.64
//
//
Following is a snippet from one of the sequence files:
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTAC TCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAAT GTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATG CTCTGGG
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
AACATTG
>Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGA AATGGAGATGGATCACGTAGCCGGCCAT
GGCGG
>Him_distal|Drosophila melanogaster|Him|FBgn0030900|X:18039896..18043470
GGTTTTCTGCGATGGCTTCCGCGCCAGCTGAAGTATCTGATTTGCTGCCT TGTTTTTGTTGATATTTCTGCGAAGGGA
CTTGTGCTTTTCAAATGGCCTTTTTTTGGGATTACGGCAAGGGCGCGTTT CCCACGCTCGATCCCCACTTACCATTGG
TGCACGCGATTGCGGCAAGCTGCTGAGGCAAGCTATTAAACGCCACACTG GGCCGGGGGGCGGTACCGGTGGGCGTGG
AATATATAGATCGGAGATATCGCAGGACCCACAGCAGAGCAGAGCCGCAG AGCCACCAACCTCG
I don't know what else I can do! Maybe you should check if your defined functions are working as expected.
Dec 13 '07 #80
aboxylica
111 100+
I sort of took away the spaces and tried executing.. the other functions are correct. please execute this code with the inputs i paste now and please paste ur o/p file
my directory containing the sequence files is called "up"

this contains 24 files.. but let me paste the first two files.
first file called YAL012w contains the sequence
>Scer_YAL012W SGDID:S0000010 5' untranslated region, Chr I 130534 - 130801, 268 bp
ATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCA TTCACGTGATCTCAGCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAA TCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTTATTTAAAGCATAAG ATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTAC AATTGAGGCCTATACACATAGACATTTGCACCTTATACATATACACACAA GACAAAACCAAAAAAAAT
>Smik YAL012W c875:991..1258
GCCACACCGTCTCTACACAGCTTCTTGTTCTTGTTATTCCTAAATATGTT GTAACGCTACGTAATTCCACCTTTCATTACTAATATTAAGCCATTCACGT GACTTTTTCCCAGCGCCTGGTGTGGCGCCACACTTTTTTTTCATAAGAAT CCTCGAGGAAAACAGAAAAATATTTCAGTTATTTAAACTATAAGATGCTT GGGGATATAATGATGGTGCCGTGTGTTTGTGTTTGAGTGTGTTTGAGTGT TCTGAAAGAGAGAGC
>Spar YAL012W c218:45412..45679 (rev)
TTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAA TTAGCCATTCACGTGATACCAGCCAGCTGTGGCGCCACACTTTTTTTCCA TAAAAATCCTCGAGGAAAAGAAAAGAAAAAATATTTCAGTTATTTAAACC AAAAGATGCCTGGGAGATAGAGCTGGTGCCGTGTGAAATCCAGTTTTGAA AGTGCAATTGAGTCCTACACACATAAGCATCCGCACCTTGTATACATATA CGCCAAACAAAACAAAAA
and the second file is called YAl013 and contains the following sequence
>Scer_YAL013W SGDID:S0000011 5' untranslated region, Chr I 129021 - 129271, 251 bp
TACCGTGTGTGCTTACGCATCTATTTGCTGTCGTGAGATCTGTCTCTATG CTTTATTCGTTTTCCATTGTAAAGTCTCAGCATTTATTTCTTGTTCTTTA CTTGTTTTTGCCCTTTCGGGTGATCAAAGTCGTGCTGGGAAATTTTATTC TTATAAAATGATTTTTAGAAATAATAAACTCATAACAGTGCAACGGCAAA GTACAAGGGAAGGAAGCACAGAAGCAAGAGGAGGCGCATCGATCGTGGCA G
>Sbay YAL013W c669:46323..46574 (rev)
ACGTCCATCCCGTGTGTGCTTGCACATCTGCTTGGCCTGCGAGCCGTCCT CTGGGCCATCCGCTCCCTATTGTGAAGTACCAGCCTTTGTTCCTGCTCTT CATAGTTTGCTCTTTCGGGTGATGAAAGTTGGGCTGGAAATTTTATCCCA AAGGACAATATGAAAATAAACTCGAAACTGTGTAGCAACCAAGAGGGAAG GGGAAAGGGAGAGACAGGAGCAAGAGGAAGCATCGGTCGTGGCAGATGAG TC
>Spar YAL013W c218:46936..47186 (rev)
ATACCGTGTGTGCTTACGCATCTATTTGTTCTCGTGAGCCCCGTCTCTGG ACTTTATCCGTTCGCTATTGTAAAGTCTTAGCCTTTATCTCTAGTTCTTT ACCTGCTTTTGCTCTTTCGGGTGATGAAAGTTGCGCTGGGAAATTTTATT CCCATAGAATGATTTTTAGAAATAATAAACTCATAACAGTGCAACGACCA AGTGCAGGGGAAAGTAGCATAGAAGCAAGAGGAGACGCATCGATCGCGGC A

the matrix file am using is called weight matrix.transfac.txt
//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
Dec 14 '07 #81
aboxylica
111 100+
and here is the code with spaces removed!
Expand|Select|Wrap|Line Numbers
  1.  
  2. from math import *
  3. print "hellllllllo"
  4. def parseArray(fn, dataset=1, key='PO', term='/'):
  5.  
  6.     '''
  7.  
  8.     Read a formatted data file in matrix format and
  9.  
  10.     compile data into a dictionary
  11.  
  12.     '''
  13.     f = open(fn)
  14.     # skip to required data set
  15.     for _ in range(dataset):
  16.         try:
  17.             line = f.next()
  18.             while not line.startswith(key):
  19.                 line = f.next()
  20.         except StopIteration, e:
  21.             print 'We have reached the end of the file!','matrix file'
  22.             f.close()
  23.             return False
  24.     headerList = line.strip().split()[1:]
  25.     lineList = []
  26.     line = f.next().strip()
  27.     while not line.startswith(term):
  28.         if line != '':
  29.             lineList.append(line.strip().split())
  30.     line = f.next().strip()
  31.     f.close()
  32.     # Key list
  33.     keys = [i[0] for i in lineList]
  34.     # Values list
  35.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  36.     # Create a dictionary from keys and values
  37.     lineDict = dict(zip(keys, values))
  38.     dataDict = {}
  39.     for i, item in enumerate(headerList):
  40.         dataDict[item] = {}
  41.         for key in lineDict:
  42.             dataDict[item][key] = lineDict[key][i]
  43.     # Add 1.0 to every element in dataDict subdictionaries
  44.     for keyMain in dataDict:
  45.         for keySub in dataDict[keyMain]:
  46.             dataDict[keyMain][keySub] += 1.0
  47.     # Normalize original data (with 1 added) and update data
  48.     valueSums = [sum(item)+4 for item in values]
  49.     # print valueSums
  50.     for keyMain in dataDict:
  51.         for keySub in dataDict[keyMain]:
  52.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  53.     return dataDict
  54.  
  55. def parseData(fnSeq, dataset=1, key='>'):
  56.  
  57.     '''
  58.  
  59.     Read a formatted data file of sequences
  60.  
  61.     Return a list of sequences
  62.  
  63.     The first element in the list is the header
  64.  
  65.     '''
  66.     # initialize output list
  67.     dataList = []
  68.     # open file for reading
  69.     f = open(fn)
  70.     # skip to required data set
  71.     for _ in range(dataset):
  72.         try:
  73.             s = f.next()
  74.             while not s.startswith(key):
  75.                 s = f.next()
  76.         except StopIteration, e:
  77.             print 'We have reached the end of the file!','sequence file'
  78.             f.close()
  79.             return False
  80.     # initialize output list
  81.     dataList = [s,]
  82.     for line in f:
  83.         if not line.startswith(key):
  84.             dataList.append(line.strip())
  85.         else:
  86.             break
  87.     f.close()
  88.     return dataList
  89.  
  90. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  91.  
  92.    # sequence factor dictionary
  93.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  94.     dataArray = parseArray(fnArray, arraySet)
  95.    #print dataArray
  96.     if dataArray:
  97.         dataSeq = parseData(fnSeq, seqSet)
  98.         #print dataSeq,"@@@@@@@@@@@@@@@@@@@"
  99.         if not dataSeq:
  100.             return False
  101.     else:
  102.         return None
  103.     # This is the complete sequence
  104.     seq = ''.join(dataSeq[1:])
  105.     # These are the subkeys of dataArray - '01', '02', '03',.............
  106.     subKeys = dataArray['A'].keys()
  107.     subKeys.sort()
  108.     # Calculate num/den for each slice of sequence
  109.     # Each sequence slice length = length of subKeys
  110.     # Example:
  111.     # seq = 'ATCGATA'
  112.     # subKeys length = 3
  113.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  114.     numList = []
  115.     denList = []
  116.     seqList = []
  117.     for i in xrange(len(seq) - len(subKeys)):
  118.         subseq = seq[0:len(subKeys)]
  119.         seqList.append(subseq)
  120.         num, den = 1, 1
  121.         for j, s in enumerate(subseq):
  122.             num *= dataArray[s][subKeys[j]]
  123.             den *= value[s]
  124.         numList.append(num)
  125.         denList.append(den)
  126.         seq = seq[1:]
  127.     resultList = []
  128.     for i, num in enumerate(numList):
  129.         #p=log10(num/denList[i])
  130.         #if (p) >=2:
  131.             #print "#########",abs(int(p))
  132.         if (log10(num/denList[i]))>=2:
  133.             #print "i am here"
  134.             resultList.append(int(abs(1)))
  135.     #print resultList
  136.     #for i in resultList:
  137.     #mean=sum(resultList)/len(resultList)
  138.         #sub=mean-i
  139.         #queue = []
  140.         #queue = (sub)**2
  141.         #print sqrt(queue/len(resultList))
  142.  
  143.     #print mean,"@@@@@@@@@@"
  144.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  145.     #print "this is line 294"
  146.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  147.  
  148.  
  149. if __name__ == '__main__':
  150.     import os
  151.     fnArray =r'c:\python25\weightmatrix.transfac.txt'
  152.     outputfile=r'c:\python25\output_file.txt'
  153.     masterList=[]
  154.     dir_name=r'c:\python25\up'
  155.     fileList=[os.path.join(dir_name,f) for f in os.listdir(dir_name) if os.path.isfile(os.path.join(dir_name,f))]
  156.     for fnSeq in fileList:
  157.         outList=[]
  158.         calcdata=1
  159.         arraySet=1
  160.         while not calcdata is None:
  161.             seqSet=1
  162.             while True:
  163.                 calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
  164.                 if calcdata:
  165.                     outList.append(calcdata)
  166.                     print calcdata
  167.                     seqSet += 1
  168.                     #print seqSet,""################"
  169.                 else:
  170.                     break
  171.             arraySet+=1
  172.         masterList.append('\n'.join(outList))
  173.     #f=open(outputfile,'W')
  174.     #fwrite('\n'.join(masterList))
  175.     #f.close()
  176.  
please paste ur output file!
i have been stuck for a longtime now. the file should give you values 1 with sequence parts corresponding to it.
waiting for your reply,
cheers!
Dec 14 '07 #82
aboxylica
111 100+
oh........ i got the problem! Thanks
Dec 16 '07 #83
aboxylica
111 100+
i am encountering a problem when i run my code for a huge folder containing many sequence files.. this is what it says
Traceback (most recent call last):
File "this_final_11.py", line 163, in <module>
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet )
File "this_final_11.py", line 121, in compileData
num *= dataArray[s][subKeys[j]]
KeyError: 'N'
Expand|Select|Wrap|Line Numbers
  1. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  2.  
  3.     # sequence factor dictionary
  4.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  5.     dataArray = parseArray(fnArray, arraySet)
  6.    #print dataArray
  7.     if dataArray:
  8.         dataSeq = parseData(fnSeq, seqSet)
  9.         #print dataSeq,"@@@@@@@@@@@@@@@@@@@"
  10.         if not dataSeq:
  11.             return False
  12.     else:
  13.         return None
  14.     # This is the complete sequence
  15.     seq = ''.join(dataSeq[1:])
  16.     # These are the subkeys of dataArray - '01', '02', '03',.............
  17.     subKeys = dataArray['A'].keys()
  18.     subKeys.sort()
  19.     # Calculate num/den for each slice of sequence
  20.     # Each sequence slice length = length of subKeys
  21.     # Example:
  22.     # seq = 'ATCGATA'
  23.     # subKeys length = 3
  24.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  25.     numList = []
  26.     denList = []
  27.     seqList = []
  28.     for i in xrange(len(seq) - len(subKeys)):
  29.         subseq = seq[0:len(subKeys)]
  30.         seqList.append(subseq)
  31.         num, den = 1, 1
  32.         for j, s in enumerate(subseq):
  33.             num *= dataArray[s][subKeys[j]]
  34.             print num,"######",dataArray[s][subKeys[j]],"#########"
  35.             den *= value[s]
  36.         numList.append(num)
  37.         denList.append(den)
  38.         seq = seq[1:]
  39.     resultList = []
  40.     for i, num in enumerate(numList):
  41.         #p=log10(num/denList[i])
  42.         #if (p) >=2:
  43.             #print "#########",abs(int(p))
  44.         if (log10(num/denList[i]))>=2:
  45.             #print "i am here"
  46.             resultList.append(int(abs(1)))
  47.     #print resultList
  48.     #for i in resultList:
  49.     #mean=sum(resultList)/len(resultList)
  50.         #sub=mean-i
  51.         #queue = []
  52.         #queue = (sub)**2
  53.         #print sqrt(queue/len(resultList))
  54.  
  55.     #print mean,"@@@@@@@@@@"
  56.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  57.     #print "this is line 294"
  58.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  59.  
can u see the problem.. i don get it
Dec 17 '07 #84
bvdet
2,851 Expert Mod 2GB
It appears that you have an 'N' in your sequence data. There is no such key in dataArray. Check your data files.
Dec 17 '07 #85
aboxylica
111 100+
okay.. i have to say that if it encounters "N" the value of this alphabet should be taken as "0"... I am able to do this for denominator.. but for the numerator being calculated.. i want to specify that the log value of "N" is zero .. and if many "NNNNN" come together.. this part of "NNNN" can be omitted.. how do i do this??
Dec 18 '07 #86
aboxylica
111 100+
this is what am doing and it doesnt seem to solve the problem.
Expand|Select|Wrap|Line Numbers
  1.  
  2. numList = []
  3.  
  4.  
  5.  
  6.     denList = []
  7.     seqList = []
  8.     for i in xrange(len(seq) - len(subKeys)):
  9.         subseq = seq[0:len(subKeys)]
  10.         seqList.append(subseq)
  11.         num, den = 1, 1
  12.         for j, s in enumerate(subseq):
  13.             if s=="N":
  14.                 value[s]=0
  15.  
  16.  
  17.                 num *= dataArray[s][subKeys[j]]
  18.                 den *= value[s]
  19.         numList.append(num)
  20.         denList.append(den)
  21.  
seq = seq[1:]
Dec 18 '07 #87
bvdet
2,851 Expert Mod 2GB
this is what am doing and it doesnt seem to solve the problem.
Expand|Select|Wrap|Line Numbers
  1.  
  2. numList = []
  3.  
  4.  
  5.  
  6.     denList = []
  7.     seqList = []
  8.     for i in xrange(len(seq) - len(subKeys)):
  9.         subseq = seq[0:len(subKeys)]
  10.         seqList.append(subseq)
  11.         num, den = 1, 1
  12.         for j, s in enumerate(subseq):
  13.             if s=="N":
  14.                 value[s]=0
  15.  
  16.  
  17.                 num *= dataArray[s][subKeys[j]]
  18.                 den *= value[s]
  19.         numList.append(num)
  20.         denList.append(den)
  21.  
seq = seq[1:]
You can test for a valid key and perform some alternative action if an invalid key is encountered:
Expand|Select|Wrap|Line Numbers
  1.         for j, s in enumerate(subseq):
  2.             if s in 'ACGT':
  3.                 num *= dataArray[s][subKeys[j]]
  4.                 den *= value[s]
  5.             else:
  6.                 ..............            
  7.         numList.append(num)
  8.         denList.append(den)
Be careful with setting the variable den to zero, because the value will be used as a divisor later and will cause a ZeroDivisionError.
Dec 18 '07 #88
aboxylica
111 100+
my code is running successffully..my code ran for five hours.results were ok


but when its writing to a file it said.


We have reached the end of the file! sequence file

We have reached the end of the file! matrix file

Traceback (most recent call last):

File "C:\Python25\this_final_1.py", line 174, in <module>

f.write('\n'.join(masterList))

MemoryError

i know its a very huge file but is there an efficient way to solve this problem??
Dec 19 '07 #89
aboxylica
111 100+
And i also think that this program should not take such a longtime to run..though my sequence folder is 18MB and array file is 69KB. though it is pretty huge the calculation is simple only.. can i compute how long it will typically take to calculate??
I think there is some serious error
waiting for ur reply,
cheers!!
Dec 19 '07 #90
aboxylica
111 100+
I have changed the indendation of this part so that only whatever is >1 is only printed is it correct?.. dont know if it is checking for the entire file!
Expand|Select|Wrap|Line Numbers
  1. for i, num in enumerate(numList):
  2.         if (log10(num/denList[i]))>=2:
  3.             outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  4.  
  5.         return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  6.  
will this check for the entire sequence and all the weight matrix??
Dec 19 '07 #91
bvdet
2,851 Expert Mod 2GB
Obviously accumulating such a large list is taxing your system. A small adjustment in the code should help, and allow you to see the results for each sequence file.
Expand|Select|Wrap|Line Numbers
  1. if __name__ == '__main__':
  2.  
  3.     import os
  4.  
  5.     fnArray = 'array.txt'
  6.     outputfile = 'seq_array_output.txt'
  7.  
  8.     # user has multiple sequence files
  9.     # and must iterate over all the files
  10.     dir_name = r'H:\TEMP\temsys\seq_array'
  11.     fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
  12.                 if os.path.isfile(os.path.join(dir_name,f))]
  13.  
  14.     for fnSeq in fileList:
  15.         outList = []
  16.         calcdata = 1
  17.         arraySet = 1
  18.         while calcdata:
  19.             seqSet = 1
  20.             while True:
  21.                 calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  22.                 if calcdata:
  23.                     outList.append(calcdata)
  24.                     seqSet += 1
  25.                 else:
  26.                     break
  27.             arraySet += 1
  28.         # write to output file for each sequence file
  29.         f = open(outputfile, 'a')
  30.         f.write('\n'.join(outList))
  31.         f.close()
  32.         print 'Data written for sequence file %s' % fnSeq
You should check the validity of the calculations on a sample array file and two or three sample sequence files. To limit the output to calculation results that are greater than 1:
Expand|Select|Wrap|Line Numbers
  1. ....resultList = []
  2.     seqList1 = []
  3.  
  4.     for i, num in enumerate(numList):
  5.         result = log10(num/denList[i])
  6.         # limit output to results greater than 1
  7.         if result > 1:
  8.             # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
  9.             resultList.append(result)
  10.             seqList1.append(seqList[i])
  11.  
  12.     outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
  13.                         (seqList1[i], res) for i, res in enumerate(resultList)])
  14.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
  15.            (arraySet, seqSet, dataSeq[0], outStr)
Dec 19 '07 #92
aboxylica
111 100+
but when i change
while not calcdata is None:
to while calcdata:


the array set is not iterating at all.. the set is one through out.why is this happening??
and what exactly is the difference between the two lines??
and also
if any log value is greater than two then am taking that value as one.. only for these sequences.. i want to print the seqset,arrayset and calculated value.

i want all these written to a file..

will it take a longtime to run??
Dec 19 '07 #93
bvdet
2,851 Expert Mod 2GB
You are correct. The code should be
Expand|Select|Wrap|Line Numbers
  1. while not calcdata is None:
I have one array file and two small sequence files for testing. It takes about 2 or 3 seconds.
Dec 19 '07 #94
aboxylica
111 100+
i want the entire result..i.e; all the log values greater than two(I am taking these values as one..the values lesser than this.. i dont care about). i want the entire thing written in a file.. will that be possible or will their be memory error??
Dec 19 '07 #95
bvdet
2,851 Expert Mod 2GB
i want the entire result..i.e; all the log values greater than two(I am taking these values as one..the values lesser than this.. i dont care about). i want the entire thing written in a file.. will that be possible or will their be memory error??
Try and see. The modified code from my earlier post writes the data to file for each sequence file. That should solve your memory error, but the only way to find for sure out is to try it.
Dec 19 '07 #96
aboxylica
111 100+
yup.. I got the o/p file for the last sequence file.. its about 87 KB. if i want all the values for all the 5000 files in my folder.. it will take 440MB..is it possible to write such a huge file?? because i need the entire output.
Dec 20 '07 #97
aboxylica
111 100+
okay.. i dont want to do that because this is not the final result of my task so.. i am not going to write it to a file.
what i should be doing is that..
i get the results which have log values of ones with four headers...like
>skud
>scer
>smik
>spar

if my result list has something like this
Array set # = 1

Sequence set # = 3

Sequence Header: >Skud YPR204W c2068:3442..4218



Sequence = AACTGTACACT Calculation = 1

Sequence = GATTATAACAA Calculation = 1

Sequence Header: >Smik YPR204W c2854:249..1235



Sequence = GAAGAGATACTAACAA Calculation = 1

Sequence = AATGACCGCGGCTCTT Calculation = 1

Array set # = 3

Sequence set # = 3

Sequence Header: >Skud YPR204W c2068:3442..4218



Sequence = AACTGTACACTGATTA Calculation = 1

Sequence = TAACAAGAACGGTTCA Calculation = 1

Sequence = TCGGAGCCTCGACTAA Calculation = 1

Sequence = AGACACTTGACGGACT Calculation = 1

Sequence = CACTTCAGATTACTTG Calculation = 1

Array set # = 3

Sequence set # = 4


this a part of the file
what i should do is... count the number of ones at every occurrence and write it to a list and find the average..
for example..skud=[2,5][at first occurrence of skud there were 2 ones then there were 5 ones] smik[2][there is only one occurence of this with two ones calculated]
now i should compute the average for skud which is 5+2/2=3(approx) similarly for smik 2/1=2

waiting for your reply,
cheers!
Dec 20 '07 #98
aboxylica
111 100+
I have done this average calculation to a part of my output_file.. but i should be doing this on resultlist directly. i dont now how to exactly do that.here is the code for calculating average for a small part of my o/p file
Expand|Select|Wrap|Line Numbers
  1.  
  2. # Open the file. File is in the same directory as source code
  3. f=file("result.txt")
  4.  
  5. # Put the Contents os the testfile in list a
  6.  
  7. a=f.readlines()
  8.  
  9. # create a 'mylist' which will contain only the data we require i.e. header and calculation line
  10.  
  11. mylist=[]
  12.  
  13.  
  14.  
  15.  
  16. for i in a:
  17.     if i.find('Header')==-1:
  18.         pass
  19.     else:
  20.         name=""
  21.         for j in i:
  22.             if j ==">":
  23.                 for k in range(1,5):
  24.                     indx = i.index(j)+k
  25.                     name+=i[indx]
  26.         mylist.append(name)
  27.  
  28.  
  29.  
  30.     if i.find('Calculation')==-1:
  31.         pass
  32.     else:
  33.         mylist.append(i)
  34.  
  35.  
  36. #name=" "
  37.  
  38.  
  39. # Make the List of all the yeast in the text file.The key values are set initially. they will be incrimented for corresponding ones
  40.  
  41. yeasts={'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}
  42.  
  43. # Dictioalry to count only those yeasts who have one (valid yeasts). It will skip blank yeasts
  44.  
  45.  
  46. valid={'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}
  47.  
  48. #Dictionary to store the values of average
  49.  
  50. average={}
  51.  
  52.  
  53. # Function to read the name of yeast .
  54. # e.g. >ABC as the first argument and index of >ABC as second argument. Returnes name ABC
  55.  
  56.  
  57. def readname(r):
  58.     name=""
  59.     for i in r:
  60.         if i ==">":
  61.             for j in range(1,5):
  62.                 indx = r.index(i)+j
  63.                 name+=r[indx]
  64.     return name
  65.  
  66.  
  67. def name(r):
  68.     name=""
  69.     for i in r:
  70.         if i ==">":
  71.             for j in range(1,5):
  72.                 indx = r.index(i)+j
  73.                 name+=r[indx]
  74.     return name
  75.  
  76.  
  77.  
  78.  
  79.  
  80.  
  81.  
  82.  
  83.  
  84.  
  85.  
  86.  
  87.  
  88.  
  89.  
  90. # Function to claculate ones for each species. Takes name of species as argument
  91.  
  92.  
  93. for v in range(len(mylist)-2):
  94.     s=mylist[v]
  95.     if '1' in s:
  96.         yeasts[name]+=1
  97.     else:
  98.         name=s
  99.         if v<len(mylist):
  100.             t=mylist[v+1]
  101.             if '1' in t:
  102.                 valid[name]+=1
  103.  
  104.  
  105.  
  106.  
  107.  
  108.  
  109.  
  110.  
  111.  
  112. # Calculations for output (Average) Number
  113.  
  114.  
  115. for keys in yeasts:
  116.     if yeasts[keys]==0:
  117.         print "Yeast %s has no occurence" %keys
  118.         pass
  119.     else:
  120.         if valid[keys]==0:
  121.             print "Yeast %s has no occurence" %keys
  122.             pass
  123.         else:
  124.             average[keys]=yeasts[keys]/valid[keys]
  125.  
  126. print yeasts
  127. print valid
  128. for keys in average:
  129.     print "%s Avareage is : %d" %(keys,average[keys])
  130.  
  131. f.close()
  132.  
i dont exactly know how to string this to my main program.. should i put this under main or seperate function as this is calculating from what the code is computing.
waiting for ur reply
cheers!
Dec 20 '07 #99
aboxylica
111 100+
I want to know if i should write a seperate function to compute average or can i do it in the main part??
Dec 20 '07 #100

Sign in to post your reply or Sign up for a free account.

Similar topics

8
by: kaptain kernel | last post by:
i've got a while loop thats iterating through a text file and pumping the contents into a database. the file is quite large (over 150mb). the looping causes my CPU load to race up to 100 per...
5
by: B-Dog | last post by:
I have an old dos program that uses dat files to store the data and I'm trying to convert to dotnet. I'd like to be able to import the data into an access database but I don't know which format...
2
by: deko | last post by:
I have a table that contains a bunch of pictures. When the user selects a particular image in a form, I need a way to extract the selected bitmap image (stored in an OLE Object table field) to the...
10
by: bienwell | last post by:
Hi, I have a question about file included in ASP.NET. I have a file that includes all the Sub functions (e.g FileFunct.vb). One of the functions in this file is : Sub TestFunct(ByVal...
3
by: Chung Leong | last post by:
Here's the rest of the tutorial I started earlier: Aside from text within a document, Indexing Service let you search on meta information stored in the files. For example, MusicArtist and...
1
by: Alex | last post by:
Hello, I have a stored procedure that processes an individual file from a directory and archives it in a subdirectory.Now, the problem is, when i execute it , it will only process one file. What...
1
by: laredotornado | last post by:
Hi, I'm using PHP 4.4.4 on Apache 2 on Fedora Core 5. PHP was installed using Apache's apxs and the php library was installed to /usr/local/php. However, when I set my "error_reporting"...
5
by: Mark | last post by:
Hi I have an application (in vb.NET 2005) which holds data in SQL Server and some of the SQL records are simply paths to related files. I would like to be able to do a text search on both the...
0
by: anthon | last post by:
Hi all - first post! anywho; I need to create a function for speeding up and down a looping clip. imagine a rotating object, triggered by an action, and slowly decreasing in speed, till it...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.