472,980 Members | 1,531 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,980 software developers and data experts.

how do i approach this problem

111 100+
i have a code which calculates a set of log values when two inputs are given..a set of sequence files and a set of matrix files.
now what i should be doing is that my i/p sequence is a folder containing many files.
each file will have two sets of sequences. in all the files i should calculate the log values of the first file seperately and second set of sequences seperately. then i should take an average of the values of the first sequence and avg of the values of the second sequence. for simplicity i will take only one weight matrix as the first step.
my code that computes the log values is here
Expand|Select|Wrap|Line Numbers
  1. from math import *
  2. def parseArray(fn, dataset=1, key='PO', term='/'):
  3.  
  4.     '''
  5.  
  6.     Read a formatted data file in matrix format and
  7.  
  8.     compile data into a dictionary
  9.  
  10.     '''
  11.  
  12.     f = open(fn)
  13.  
  14.  
  15.  
  16.     # skip to required data set
  17.  
  18.     for _ in range(dataset):
  19.  
  20.  
  21.         try:
  22.  
  23.             line = f.next()
  24.  
  25.             while not line.startswith(key):
  26.  
  27.                 line = f.next()
  28.  
  29.         except StopIteration, e:
  30.  
  31.             print 'We have reached the end of the file!'
  32.  
  33.             f.close()
  34.  
  35.             return False
  36.  
  37.  
  38.  
  39.     headerList = line.strip().split()[1:]
  40.  
  41.  
  42.     lineList = []
  43.  
  44.  
  45.  
  46.     line = f.next().strip()
  47.  
  48.     while not line.startswith(term):
  49.  
  50.         if line != '':
  51.  
  52.             lineList.append(line.strip().split())
  53.  
  54.  
  55.         line = f.next().strip()
  56.  
  57.  
  58.  
  59.     f.close()
  60.  
  61.  
  62.  
  63.     # Key list
  64.  
  65.     keys = [i[0] for i in lineList]
  66.  
  67.     # Values list
  68.  
  69.     values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
  70.  
  71.  
  72.  
  73.     # Create a dictionary from keys and values
  74.  
  75.     lineDict = dict(zip(keys, values))
  76.  
  77.  
  78.  
  79.     dataDict = {}
  80.  
  81.  
  82.  
  83.     for i, item in enumerate(headerList):
  84.  
  85.         dataDict[item] = {}
  86.  
  87.         for key in lineDict:
  88.  
  89.             dataDict[item][key] = lineDict[key][i]
  90.  
  91.  
  92.  
  93.     # Add 1.0 to every element in dataDict subdictionaries
  94.  
  95.     for keyMain in dataDict:
  96.  
  97.         for keySub in dataDict[keyMain]:
  98.  
  99.             dataDict[keyMain][keySub] += 1.0
  100.  
  101.  
  102.  
  103.     # Normalize original data (with 1 added) and update data
  104.  
  105.     valueSums = [sum(item)+4 for item in values]
  106.  
  107.     # print valueSums
  108.  
  109.  
  110.  
  111.     for keyMain in dataDict:
  112.  
  113.         for keySub in dataDict[keyMain]:
  114.             dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
  115.  
  116.     return dataDict
  117.  
  118.  
  119.  
  120.  
  121.  
  122. def parseData(fn, dataset=1, key='>'):
  123.  
  124.     '''
  125.  
  126.     Read a formatted data file of sequences
  127.  
  128.     Return a list of sequences
  129.  
  130.     The first element in the list is the header
  131.  
  132.     '''   
  133.  
  134.     # initialize output list
  135.  
  136.     dataList = []
  137.  
  138.  
  139.  
  140.     # open file for reading
  141.  
  142.     f = open(fn)
  143.  
  144.  
  145.  
  146.     # skip to required data set
  147.  
  148.     for _ in range(dataset):
  149.  
  150.  
  151.         try:
  152.  
  153.             s = f.next()
  154.  
  155.             while not s.startswith(key):
  156.  
  157.  
  158.                 s = f.next()
  159.  
  160.         except StopIteration, e:
  161.  
  162.             print 'We have reached the end of the file!'
  163.  
  164.             f.close()
  165.  
  166.             return False
  167.  
  168.  
  169.  
  170.     # initialize output list
  171.  
  172.     dataList = [s,]
  173.  
  174.  
  175.     for line in f:
  176.  
  177.         if not line.startswith(key):
  178.  
  179.             dataList.append(line.strip())
  180.  
  181.         else:
  182.  
  183.             break
  184.  
  185.  
  186.  
  187.     f.close()
  188.  
  189.     return dataList
  190.  
  191.  
  192.  
  193.  
  194.  
  195. def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
  196.  
  197.     # sequence factor dictionary
  198.  
  199.     value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
  200.  
  201.  
  202.  
  203.     dataArray = parseArray(fnArray, arraySet)
  204.  
  205.  
  206.     if dataArray:
  207.  
  208.         dataSeq = parseData(fnSeq, seqSet)
  209.  
  210.  
  211.         if not dataSeq:
  212.  
  213.             return False
  214.  
  215.     else:
  216.  
  217.         return None
  218.  
  219.  
  220.  
  221.  
  222.     # This is the complete sequence 
  223.  
  224.     seq = ''.join(dataSeq[1:])
  225.  
  226.  
  227.  
  228.  
  229.  
  230.     # These are the subkeys of dataArray - '01', '02', '03',.............
  231.  
  232.     subKeys = dataArray['A'].keys()
  233.  
  234.     subKeys.sort()
  235.  
  236.  
  237.  
  238.  
  239.  
  240.     # Calculate num/den for each slice of sequence
  241.  
  242.     # Each sequence slice length = length of subKeys
  243.  
  244.     # Example:
  245.     # seq = 'ATCGATA'
  246.  
  247.     # subKeys length = 3
  248.  
  249.     # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
  250.  
  251.     numList = []
  252.  
  253.     denList = []
  254.  
  255.     seqList = []
  256.  
  257.     for i in xrange(len(seq) - len(subKeys)):
  258.  
  259.         subseq = seq[0:len(subKeys)]
  260.  
  261.         seqList.append(subseq)
  262.  
  263.  
  264.         num, den = 1, 1
  265.  
  266.         for j, s in enumerate(subseq):
  267.  
  268.             num *= dataArray[s][subKeys[j]]
  269.  
  270.             den *= value[s]
  271.  
  272.         numList.append(num)
  273.  
  274.         denList.append(den)
  275.  
  276.         seq = seq[1:]
  277.  
  278.  
  279.  
  280.     resultList = []
  281.  
  282.     for i, num in enumerate(numList):
  283.         #p=log10(num/denList[i])
  284.         #if (p) >=2:
  285.             #print "#########",abs(int(p))
  286.         if (log10(num/denList[i]))>=2:
  287.             #print "i am here"
  288.         resultList.append(int(abs(1)))
  289.     #print resultList
  290.     #for i in resultList:
  291.     #mean=sum(resultList)/len(resultList)
  292.         #sub=mean-i
  293.         #queue = []
  294.         #queue = (sub)**2
  295.         #print sqrt(queue/len(resultList))
  296.  
  297.     #print mean,"@@@@@@@@@@"
  298.  
  299.  
  300.  
  301.  
  302.  
  303.     outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
  304.     #print "this is line 294"
  305.  
  306.  
  307.     return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
  308.  
  309.  
  310. if __name__ == '__main__':
  311.  
  312.  
  313.     fnArray ='one_redfly.transfac' 
  314.     fnSeq = 'upstream_regions'
  315.  
  316.  
  317.     outputfile =  "sequence_calc_data.txt"
  318.  
  319.  
  320.  
  321.     arraySet = 1
  322.  
  323.     outList = []
  324.  
  325.     calcdata = 1
  326.  
  327.     while not calcdata is None:
  328.  
  329.         seqSet = 1
  330.  
  331.         while True:
  332.  
  333.             calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
  334.             print calcdata
  335.  
  336.             if calcdata:
  337.  
  338.                 outList.append(calcdata)
  339.  
  340.                 seqSet += 1
  341.  
  342.             else:
  343.  
  344.                 break
  345.  
  346.         arraySet += 1
  347.  
  348.  
  349.  
  350.  
  351.  
  352.     f = open(outputfile, 'w')
  353.  
  354.     f.write('\n'.join(outList))
  355.  
  356.     f.close()
  357.     #f=open(outputfile,"r")
  358.     #file_con=f.readlines()
  359.     #for line in file_con:
  360.      #   print line
  361.  
Dec 12 '07 #1
0 893

Sign in to post your reply or Sign up for a free account.

Similar topics

7
by: Chinook | last post by:
OO approach to decision sequence? --------------------------------- In a recent thread (Cause for using objects?), Chris Smith replied with (in part): > If your table of photo data has...
16
by: Robert Zurer | last post by:
Can anyone suggest the best book or part of a book on this subject. I'm looking for an in-depth treatment with examples in C# TIA Robert Zurer robert@zurer.com
9
by: seguso | last post by:
Hello, I have a very simple problem I don't know how to approach. I need a suggestion about the general approach to take. I have a bunch of html pages on a machine, all in the same folder...
5
by: jehugaleahsa | last post by:
Hello: I am trying to find what is the very best approach to business objects in Windows Forms. Windows Forms presents an awesome opportunity to use DataTables and I would like to continue doing...
20
by: mike3 | last post by:
Hi. (Xposted to both comp.lang.c++ and comp.programming since I've got questions related to both C++ language and general programming) I've got the following C++ code. The first routine runs in...
0
by: lllomh | last post by:
Define the method first this.state = { buttonBackgroundColor: 'green', isBlinking: false, // A new status is added to identify whether the button is blinking or not } autoStart=()=>{
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 4 Oct 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
2
by: giovanniandrean | last post by:
The energy model is structured as follows and uses excel sheets to give input data: 1-Utility.py contains all the functions needed to calculate the variables and other minor things (mentions...
3
NeoPa
by: NeoPa | last post by:
Introduction For this article I'll be using a very simple database which has Form (clsForm) & Report (clsReport) classes that simply handle making the calling Form invisible until the Form, or all...
1
by: Teri B | last post by:
Hi, I have created a sub-form Roles. In my course form the user selects the roles assigned to the course. 0ne-to-many. One course many roles. Then I created a report based on the Course form and...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 1 Nov 2023 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM) Please note that the UK and Europe revert to winter time on...
3
by: nia12 | last post by:
Hi there, I am very new to Access so apologies if any of this is obvious/not clear. I am creating a data collection tool for health care employees to complete. It consists of a number of...
0
NeoPa
by: NeoPa | last post by:
Introduction For this article I'll be focusing on the Report (clsReport) class. This simply handles making the calling Form invisible until all of the Reports opened by it have been closed, when it...
4
by: GKJR | last post by:
Does anyone have a recommendation to build a standalone application to replace an Access database? I have my bookkeeping software I developed in Access that I would like to make available to other...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.