hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.
//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGG TTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCC
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
ACACGCCAACACATTACACACAACACGAACTACACAAACACTGAGATTAA GGAAATTATTAAAAAAAATAATAAAATT
AATACAAAAAAAATATATATATATA
this is my code which works(prints the log value for one sequence and one matrix) -
from math import *
-
import random
-
f=open("deeps1.txt","r")
-
line=f.next()
-
while not line.startswith('PO'):
-
line=f.next()
-
-
headerlist=line.strip().split()[1:]
-
linelist=[]
-
-
-
line=f.next().strip()
-
while not line.startswith('/'):
-
if line != '':
-
linelist.append(line.strip().split())
-
line=f.next().strip()
-
-
keys=[i[0] for i in linelist]
-
values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
-
-
array={}
-
linedict=dict(zip(keys,values))
-
keys = linedict.keys()
-
keys.sort()
-
for key in keys:
-
array=[key,linedict[key]]
-
-
datadict={}
-
datadict1={}
-
for i,item in enumerate(headerlist):
-
datadict[item]={}
-
for key_ in linedict:
-
datadict[item][key_]=linedict[key_][i]
-
-
-
for keymain in datadict:
-
for keysub in datadict[keymain]:
-
datadict[keymain][keysub]+=1.0
-
-
datadict1=datadict.copy()
-
for keysub in datadict:
-
for keysub in datadict[keymain]:
-
datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
-
-
-
-
def readfasta():
-
file1= open("chr011.py",'r')
-
file_content=file1.readlines()
-
first=1
-
list1=""
-
for line in file_content:
-
if line[0]==">":
-
if first==0:
-
print "***********"
-
list1+=sequence
-
print "***********"
-
else:
-
first=0
-
sequence=""
-
seq=""
-
for i in range(0,len(line)-1):
-
seq+=line[i]
-
else:
-
for i in range(0,len(line)-1):
-
sequence+=line[i]
-
list1+=sequence
-
return list1
-
-
-
-
p=readfasta()
-
-
-
-
-
-
res=1
-
part=""
-
q=len(p)
-
seqq=""
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
for i in range(q-16):
-
part=p[i:i+16]
-
seqq=part
-
res=1
-
score=1
-
for j in range(16):
-
key=seqq[j]
-
res=res*datadict1[key]["%02d"%(j+1)]
-
#print res
-
for key in seqq:
-
score=score * value[key]
-
#print score,"*******************",res
-
log_ratio=log10(res/score)
-
print i,log_ratio
-
what changes should i make and how?/
waiting for your reply,
cheers!
Jul 13 '07
103 5479
and please do send the full code. i tried executing mine.looks like there are some errors.so il check up with urs and try executing.
waiting,
cheers!!
bvdet 2,851
Expert Mod 2GB
and what does this line mean?? -
if __name__ == '__main__':
-
A module type is a container that holds objects loaded with the import statement. The identifier '__name__' is assigned to the module name. If the module is run as a top level program, '__name__' is assigned to the string '__main__'. - if __name__ == '__main__':
-
.... execute code ....
-
The code under the 'if' statement only executes if the module is run as a program. That way the functions defined above can be imported by other programs without executing the conditional code.
bvdet 2,851
Expert Mod 2GB
and please do send the full code. i tried executing mine.looks like there are some errors.so il check up with urs and try executing.
waiting,
cheers!!
The whole thing: - def parseArray(fn, dataset=1, key='PO', term='/'):
-
'''
-
Read a formatted data file in matrix format and
-
compile data into a dictionary
-
'''
-
f = open(fn)
-
-
# skip to required data set
-
for _ in range(dataset):
-
try:
-
line = f.next()
-
while not line.startswith(key):
-
line = f.next()
-
except StopIteration, e:
-
print 'We have reached the end of the file!'
-
f.close()
-
return False
-
-
headerList = line.strip().split()[1:]
-
lineList = []
-
-
line = f.next().strip()
-
while not line.startswith(term):
-
if line != '':
-
lineList.append(line.strip().split())
-
line = f.next().strip()
-
-
f.close()
-
-
# Key list
-
keys = [i[0] for i in lineList]
-
# Values list
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
# Create a dictionary from keys and values
-
lineDict = dict(zip(keys, values))
-
-
dataDict = {}
-
-
for i, item in enumerate(headerList):
-
dataDict[item] = {}
-
for key in lineDict:
-
dataDict[item][key] = lineDict[key][i]
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
for keyMain in dataDict:
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] += 1.0
-
-
# Normalize original data (with 1 added) and update data
-
valueSums = [sum(item)+4 for item in values]
-
# print valueSums
-
-
for keyMain in dataDict:
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
def parseData(fn, dataset=1, key='>'):
-
'''
-
Read a formatted data file of alpha sequences
-
Return a list of sequences
-
The first element in the list is the header
-
'''
-
# initialize output list
-
dataList = []
-
-
# open file for reading
-
f = open(fn)
-
-
# skip to required data set
-
for _ in range(dataset):
-
try:
-
s = f.next()
-
while not s.startswith(key):
-
s = f.next()
-
except StopIteration, e:
-
print 'We have reached the end of the file!'
-
f.close()
-
return False
-
-
# initialize output list
-
dataList = [s,]
-
-
for line in f:
-
if not line.startswith(key):
-
dataList.append(line.strip())
-
else:
-
break
-
-
f.close()
-
return dataList
-
-
if __name__ == '__main__':
-
-
arraySet = 1
-
seqSet = 3
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
fnArray = r'H:\TEMP\temsys\data9.txt'
-
fnSeq = r'H:\TEMP\temsys\data12.txt'
-
-
dataArray = parseArray(fnArray, arraySet)
-
dataSeq = parseData(fnSeq, seqSet)
-
-
# This is the complete sequence
-
seq = ''.join(dataSeq[1:])
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
subKeys = dataArray['A'].keys()
-
subKeys.sort()
-
-
# Calculate num/den for each slice of sequence
-
# Each sequence slice length = length of subKeys
-
# Example:
-
# seq = 'ATCGATA'
-
# subKeys length = 3
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
numList = []
-
denList = []
-
seqList = []
-
for i in xrange(len(seq) - len(subKeys) + 1):
-
subseq = seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num, den = 1, 1
-
for j, s in enumerate(subseq):
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
-
-
resultList = []
-
for i, num in enumerate(numList):
-
resultList.append(num/denList[i])
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
-
print 'Array set # = %d\nSequence set # = %d' % (arraySet, seqSet)
-
print 'Sequence Header: %s' % dataSeq[0]
-
print outStr
Check the values in variables dataArray and dataSeq for appropriateness.
has log(num/deno) been done??
looks like it is correct, I inserted a known sequence at a particular position which should yield a high value,it did give a hike at that position:) ..i did the log calculation.. -
for i, num in enumerate(numList):
-
-
resultList.append(log10(num/denList[i]))
-
print resultList
-
looks right to me.!only generalising this is left!!
cheers!!
hey,
I am waiting for ur reply bvdet.
I know am disturbing you too much but am really new to programming and python.so thats why I couldn find a way out.
waiting,
cheers!!
bvdet 2,851
Expert Mod 2GB
hey,
I am waiting for ur reply bvdet.
I know am disturbing you too much but am really new to programming and python.so thats why I couldn find a way out.
waiting,
cheers!!
You have taken on an ambitious project if you are new to programming. I have given you all that you need to complete the project. All you need to do is:
Iterate on each data file until you reach the end of each.
Compile the data for each iteration.
Something like this (untested): -
indxSeq = 1
-
while True:
-
dataSeq = parseData(fnSeq, indxSeq)
-
if not dataSeq:
-
break
-
indxArray = 1
-
while True:
-
dataArray = parseArray(fnArray, indxArray)
-
if not dataArray:
-
break
-
............. calculations ................
-
indxArray += 1
-
indxSeq += 1
It is not fair to me or others on this list for me to write the whole program for you. Study some tutorials, read Python documentation, practice with some simple exercises, and give it a shot. Come back with some code, and we will try to help if you need it. This project is interesting. Good luck and have fun!
hey,
Thanks a lot. I will try and come back with queries.sorry for bugging you.
bvdet 2,851
Expert Mod 2GB
hey,
Thanks a lot. I will try and come back with queries.sorry for bugging you.
You don't have to apologize. I only did what I wanted to. :)
hey,
Thats very sweet of you:)
I dont really know how to improve.as in when am reading the tutorial.I just gain theoritical knowledge.How do I put it to practical knowledge.
I have gone through the python documentation.But when it comes to programming I cant really find a way out.like this case.. I had the code to work for a single sequence and a single matrix.I could have improvised that..but i was not able to think.how do I crack it and become better?
Am getting a list index out of range error!!
but I cant add a -1 to the loop.can i??whats wrong with this?? -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
if __name__ == '__main__':
-
-
-
-
arraySet = 4
-
#print arraySet
-
-
seqSet = 4
-
#print seqSet
-
-
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
fnArray = r'all_redfly.transfac.txt'
-
-
fnSeq = r'redfly_sequence.fasta'
-
indxSeq=1
-
while True:
-
dataSeq=parseData(fnSeq,indxSeq)
-
if not dataSeq:
-
break
-
indxArray=1
-
while True:
-
dataArray = parseArray(fnArray, arraySet)
-
#dataSeq = parseData(fnSeq, seqSet)
-
if not dataArray:
-
break
-
# This is the complete sequence
-
seq = ''.join(dataSeq[1:])
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
subKeys = dataArray['A'].keys()
-
subKeys.sort()
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys) + 1):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
-
resultList.append(log10(num/denList[i]))
-
indxArray+=1
-
indxSeq +=1
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
-
print 'Array set # = %d\nSequence set # = %d' % (arraySet, seqSet)
-
print 'Sequence Header: %s' % dataSeq[0]
-
print outStr
-
bvdet 2,851
Expert Mod 2GB
Let's make a new function, iterate on it, and write the results to a file: - def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
# sequence factor dictionary
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
dataArray = parseArray(fnArray, arraySet)
-
if dataArray:
-
dataSeq = parseData(fnSeq, seqSet)
-
if not dataSeq:
-
return False
-
else:
-
return None
-
-
# This is the complete sequence
-
seq = ''.join(dataSeq[1:])
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
subKeys = dataArray['A'].keys()
-
subKeys.sort()
-
-
# Calculate num/den for each slice of sequence
-
# Each sequence slice length = length of subKeys
-
# Example:
-
# seq = 'ATCGATA'
-
# subKeys length = 3
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
numList = []
-
denList = []
-
seqList = []
-
for i in xrange(len(seq) - len(subKeys) + 1):
-
subseq = seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num, den = 1, 1
-
for j, s in enumerate(subseq):
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
-
-
resultList = []
-
for i, num in enumerate(numList):
-
resultList.append(num/denList[i])
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
if __name__ == '__main__':
-
-
fnArray = 'array.txt'
-
fnSeq = 'seq.txt'
-
-
outputfile = 'sequence_calc_data.txt'
-
-
arraySet = 1
-
outList = []
-
calcdata = 1
-
while not calcdata is None:
-
seqSet = 1
-
while True:
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
if calcdata:
-
outList.append(calcdata)
-
seqSet += 1
-
else:
-
break
-
arraySet += 1
-
-
f = open(outputfile, 'w')
-
f.write('\n'.join(outList))
-
f.close()
This resulted in a 3.1 mb file. Following are the first few lines of the first and last compilation: - Array set # = 1
-
Sequence set # = 1
-
Sequence Header: >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
-
-
Sequence = CCAGTCCACCGGCCGC Calculation = 0.000025722315
-
Sequence = CAGTCCACCGGCCGCC Calculation = 0.000000000318
-
Sequence = AGTCCACCGGCCGCCG Calculation = 0.000595631200
-
Sequence = GTCCACCGGCCGCCGA Calculation = 0.000120125057
-
Sequence = TCCACCGGCCGCCGAT Calculation = 0.000000089016
-
...........................
-
Array set # = 4
-
Sequence set # = 8
-
Sequence Header: >Obp19b_prom|Drosophila melanogaster|Obp19b|FBgn0031110|X:20224439..20227440
-
-
Sequence = ATTGCTGACGGGTCGA Calculation = 0.000005535136
-
Sequence = TTGCTGACGGGTCGAA Calculation = 0.000003984295
-
Sequence = TGCTGACGGGTCGAAT Calculation = 0.000053179344
-
Sequence = GCTGACGGGTCGAATG Calculation = 0.000031549069
-
.............................
THis is the code.my o/p is an empty array.why is this happening? -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys) + 1):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
-
resultList.append(num/denList[i])
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res) for i, res in enumerate(resultList)])
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
if __name__ == '__main__':
-
-
-
fnArray =r'all_redfly.transfac'
-
fnSeq = r'redfly_sequence.fasta'
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
f=open(outputfile,"r")
-
file_con=f.readlines()
-
print file_con
-
for line in file_con:
-
print line
-
-
I seem to get an list index out of range error:
Traceback (most recent call last):
File "newbie1.py", line 311, in <module>
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
File "newbie1.py", line 285, in compileData
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)for i, res in enumerate(resultList)])
IndexError: list index out of range -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys) + 1):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
-
resultList.append(log10(num/denList[i]))
-
print (resultList)
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % (seqList[i], res)for i, res in enumerate(resultList)])
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
if __name__ == '__main__':
-
-
-
fnArray ='all_redfly.transfac'
-
fnSeq = 'redfly_sequence.fasta'
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
f=open(outputfile,"r")
-
file_con=f.readlines()
-
print file_con
-
for line in file_con:
-
print line
-
waiting for ur reply,
cheers!
bvdet 2,851
Expert Mod 2GB
I am not sure why you add so many spaces in between the lines of code. I personally find it unreadable. Anyway, when you were adding all the spaces, some of the code ended up at the incorrect indentation: - ........for j, s in enumerate(subseq):
-
-
-
num *= dataArray[s][subKeys[j]]
-
-
-
den *= value[s]
-
-
-
numList.append(num)
-
-
-
denList.append(den)
-
-
-
seq = seq[1:]
-
SHOULD BE: - ........for j, s in enumerate(subseq):
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
hey,
That was the mistake.amazing!! thanks a million!!
I got some doubts about the program.
i have some doubts. first understandingand then get back to you.
THANKS A MILLION!
cheers!!
hey,
here is the code where I tried removing the try catch block and couple of things which will make it easier for me to understand.but looks like there is some problem ..I will of course use them in my main program.But I was just trying to understand when I tried executing the iteration was not happening and when I said
print outList instead of storing it in a file it was not iterating.This is the code
can you tell me whats happening??? -
from math import *
-
def parseArray(fn,dataset=1,key='PO',term='/'):
-
f=open(fn)
-
for _ in range(dataset):
-
line=f.next()
-
while not line.startswith(key):
-
line=f.next()
-
headerList=line.strip().split()[1:]
-
lineList=[]
-
line=f.next().strip()
-
while not line.startswith(term):
-
if line!='':
-
lineList.append(line.strip().split())
-
line=f.next().strip()
-
# f.close()
-
keys=[i[0] for i in lineList]
-
values=[[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
lineDict=dict(zip(keys,values))
-
dataDict={}
-
for i,item in enumerate(headerList):
-
dataDict[item]={}
-
for key in lineDict:
-
dataDict[item][key]=lineDict[key][i]
-
for keyMain in dataDict:
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub]+=1.0
-
valueSums=[sum(item)+4 for item in values]
-
for keyMain in dataDict:
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub]/=valueSums[int(keySub)-1]
-
return dataDict
-
#fn="weight_matrix.transfac.txt"
-
#p=parseArray(fn)
-
#print p
-
def parseData(fn,dataset=1,key='>'):
-
dataList=[]
-
f=open(fn)
-
for _ in range(dataset):
-
s=f.next()
-
dataList=[s,]
-
-
for line in f:
-
if not line.startswith(key):
-
dataList.append(line.strip())
-
else:
-
break
-
return dataList
-
#fn="redfly_sequence.fasta"
-
#p=parseData(fn)
-
#print p
-
def compileData(fnArray,fnSeq,arraySet=1,seqSet=1):
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
dataArray=parseArray(fnArray,arraySet)
-
if dataArray:
-
dataSeq=parseData(fnSeq,seqSet)
-
seq=''.join(dataSeq[1:])
-
subKeys=dataArray['A'].keys()
-
subKeys.sort()
-
numList=[]
-
denList=[]
-
seqList=[]
-
for i in xrange(len(seq)-len(subKeys)):
-
subseq=seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num,den=1,1
-
for j,s in enumerate(subseq):
-
num*=dataArray[s][subKeys[j]]
-
den*=value[s]
-
numList.append(num)
-
denList.append(den)
-
seq=seq[1:]
-
resultList=[]
-
for i,num in enumerate(numList):
-
if (log10(num/denList[i]))>2:
-
resultList.append(log10(num/denList[i]))
-
outStr='\n'.join(['sequence=%s Calculation=%0.12f'%(seqList[i],res) for i,res in enumerate(resultList)])
-
return 'array set#= %d\nSequence set #=%d\nSequence Header: %s\n%s' %(arraySet,seqSet,dataSeq[0],outStr)
-
fnArray='weight_matrix.transfac.txt'
-
fnSeq='redfly_sequence.fasta'
-
arraySet=1
-
outList=[]
-
calcdata=1
-
while not calcdata is None:
-
seqSet=1
-
while True:
-
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
-
if calcdata:
-
outList.append(calcdata)
-
-
-
seqSet+=1
-
else:
-
break
-
-
arraySet+=1
-
print outList
-
f=open(outputfile,'w')
-
f.write('/n'.join(outList))
-
f.close()
-
-
-
waiting
cheers!!
bvdet 2,851
Expert Mod 2GB
After running the script, I can do this: - >>> print outList[1]
-
Array set # = 1
-
Sequence set # = 2
-
Sequence Header: >Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
-
-
Sequence = AGTCGACCAGCACGAG Calculation = -0.872390330485
-
Sequence = GTCGACCAGCACGAGA Calculation = -3.287525755636
-
Sequence = TCGACCAGCACGAGAT Calculation = -4.346213357398
-
Sequence = CGACCAGCACGAGATC Calculation = -2.329064001005
-
.........................
I don't want to print the entire outList because it's over 3 MB.
You may have changed something you should not have. Maybe you should copy the code again. If you need to change things, change only one thing at a time and test to make sure it still works.
hello!
I hope you people remember the problem above..
i got little problems with that
that was just opening a file containing files..now il be opening a directory containing different sequence files
this is how the code looks now!
am trying to change the i/p file to folder by showing the path of the folder but its going to the exception file..can you tell me why? -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
-
if (log10(num/denList[i]))>=2:
-
-
resultList.append(int(abs(1)))
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
-
-
-
fnArray ='half.txt'
-
-
fnSeq = 'C:\\python25\ding\YAL005C.txt'
-
-
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
#f=open(outputfile,"r")
-
#file_con=f.readlines()
-
#for line in file_con:
-
# print line
-
please tell me what can i do??
here is my code which is reading a directory containing files..... it seems to go to the exception part always.. i dono why..i think it checks for the first file in the folder and then comes out..how do i check if its going to all the files.. -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
print "am here"
-
-
while not line.startswith(key):
-
print "oh yes"
-
-
line = f.next()
-
-
except StopIteration, e:
-
print '###############################'
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
-
-
-
fnArray ='C:\\python25\\half.txt'
-
import os
-
seq_=os.listdir("ding")
-
print seq_
-
os.chdir("C:\\python25\\New Folder")
-
for file_ in seq_:
-
if os.path.isfile(file_):
-
rem=open(file_)
-
dingg=rem.readlines()
-
fnSeq = dingg
-
-
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
#f=open(outputfile,"r")
-
#file_con=f.readlines()
-
#for line in file_con:
-
# print line
-
waiting for ur reply,
cheers!
bvdet 2,851
Expert Mod 2GB
What error message are you receiving? It appears that you are trying to read data in from the files in a directory. You should be passing the file name to your functions, not the text from one of the files. Try something like this: - import os
-
# Create a list of directory entries
-
dir_name = 'your_directory'
-
fList = os.listdir(dir_name)
-
# The directory entries do not include the full path
-
# Create a list of file names with full path, eliminating subdirectory entries
-
fList1 = [os.path.join(dir_name, f) for f in fList if os.path.isfile(os.path.join(dir_name, f))]
-
seqSetIndex = 0
-
fnSeq = fList1[seqSetIndex]
-
while True:
-
... do stuff ...
-
seqSetIndex += 1
'fnSeq' is the file name will full path that is passed to your parsing function.
to show you typically how my files are.. I am storing them inside python 25 folder..the folder is called New folder which has around five thousand files..so i cant possibly give the names of all the files. for eg YAL001C,YAL002w..etc
each file has sequence like this:
>Scer_YAL001C SGDID:S0000001 5' untranslated region, Chr I 151168 - 152167 (revcom), 1000 bp
ACTTGTAAATATATCTTTTATTTTCCGAGAGGAAAAAGTTTCAAAAAAAA AAAAAAAAAAAGAAGAAAAATAACTTTCTCATGTAATAAAAGGTAACTAA TGTAGACAAAAAAGTATACATTTAGCTTTTCTTTTTTTGATGATTTTTGA GTTTCATGTTACTAATCAGAACAATTAACGACACCTTCATTATGAAAAAA TTAATTAGCTATAAGTCTTCGAAGTAGAACATGATATTTGGCAATCACTC GAATAACTATCTTAATTTACCTGCTGAAATAATTTGAAAAAACACCCGAG GCAGCAGACGAAAGGTGTTTTTGCTAAACAATGATTGATTTCTGGCGCCA TTTCTACATTCTGAACAGTTCATCTCATTTCAGTAACAGTACTTCAATGG AATATTTATTAAAGAAAGTGCTTAAAAAAGTATTATAAAACGATACATGG ACTGACTCAAGATTGAGCTAATAAGGTCCACCGCCTAGTGCTTAAGAGTT CTGTACCACTATAATAATTTATCTTGATCGTATTATGTGTAAAAAAAAGG CGCTTGAAATGAAAGCTCCGAAAATTAAAATACTTTGACTGCTTCGGAAA ACAAAAACATATAAATAAATTTAAAAAATAAACTGTAAAATATTTAAAAA CTATTAAAAATATTTTATATTTTTAAAATTATTTATTATTATGTCATGTG ACAAGACTTAAATCATTACATAAAAGGTTTTGAAGTTCAATGTCAAAGTC AATATAATAAGCATACTAAGGCACACTTATGCAAATCGAGTTATTGAAGC TGGTAAAATTATAAGATTTTTATTTTTATTTCTTTTATTTCTGCAAATCT GCATTTTCAAATACCGCTTGGTTTTTTGCATCATAAAGGGCGGCGCTTTC AGTCGCGAAAGTGAAATAAACAACCAGTCACACATATAACTTTCTTCTTG CCATAAGAGAGAAGAGGACGTTTGGTTGAAGCCAACTAGCCACAAGAAAA
>Spar YAL001C c218:24375..25375
TCAAGAGGGTTATTATATACCGATATTTGAATCCACACATGATCGAACTA ATATAATTCACTACTTAGCTGCTTAACCATTCTTTGCCATTATAATAAAT GTATCTCGATCGAATGCTGCATAGAGAAAAGCGCTTAAAAAAGTGTTCCG AAATATAACATATTTTAAACGTTTCGGAAACCAAAAACATATATTTATTA TCTTAAAGTAATCTAAAAAATGAAAGAAACTTTGATATATTTAAAACATA ATATAATTATTTTATAAAAGTAACATGTGATTAAACTCACAAAGCCTAAA AATGTTTTTAATCATTATGTTAAGCTGAATATAGTATATCTAATAATTGT TTATTTAAGCAGATTGAACTGTGAATGCTAGTAAATTTTTAAGGGTTTTT TGTTGTTCTGCTTTTCTCTAGTTTTGTAGTGTCAAATACAACTAACTGGA TTTTTTGCATCAGAAGGGGCGCTTTAACTCGCGAAAGTAAAAATAAACAA TTAATCACACATATCTTTTCCTCTTGCAGAAGCAAAGAAGAGGACGGTTG ATTGAAGCAAACCAGCCACAGAAAATATGGCGTTGACAATTTATCCTGAT GAACTGGTTAAAATAGTGTCTGATGAAATTGCATCAAATAAAGGAAGTAT GTATATGCCTCATTCTTCTATTCCATGTTCTTTTCAGGTGAGAAACGTGA TATATTGTAAGATTATTTACTAACGACTTATTAAAGAAATTACGTTAAAT CAGCTTTGGGATATATCTCGTAAATATTTTGATTTGTCTGACGAGAAAGT TAAACAATTTGTGCTTTCATGCGTGATGTTGAAAAAGGACATCGAGGTGT ACTGTGATAGCGTTATAACAACTAAAAACGTGACAAATATTATAGACGAC ACTAGTCATTCATACTCAGTAGGGATTACTGAGGACAGCCTGTGGACGTT ATTAACTGGATACACAAAGAAGGAGTCAACTATCGGGAATTCAGCATTTG A
similarly each file has such sequences my code has to go to all the sequences and do it for all the files.
when i run this code , it says
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Python25>python this_final.py
['sequence_calc_data.txt', 'YAL001C.fasta', 'YAL002W.fasta', 'YAL003W.fasta', 'Y
AL005C.fasta', 'YAL007C.fasta']
am here
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
oh yes
###############################
We have reached the end of the file!
None
why is this happening.
this is my code -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
-
-
fnArray ='C:\python25\half.txt'
-
fnSeq = 'C:\python25\New Folder'
-
import os
-
dir_name='New Folder'
-
fList=os.listdir(dir_name)
-
fList1=[os.path.join(dir_name,f) for f in fList if
-
os.path.isfile(os.path.join(dir_name,f))]
-
seqSetIndex=0
-
fnSeq=fList1[seqSetIndex]
-
while True:
-
seqSetIndex+=1
-
-
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
#f=open(outputfile,"r")
-
#file_con=f.readlines()
-
#for line in file_con:
-
# print line
-
when i run the code below.. its going to an infinite loop which says "reached the end of file" -
from math import *
-
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fn, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
-
-
fnArray ='C:\python25\half.txt'
-
fnSeq = 'C:\python25\New Folder'
-
import os
-
dir_name='New Folder'
-
fList=os.listdir(dir_name)
-
fList1=[os.path.join(dir_name,f) for f in fList if
-
os.path.isfile(os.path.join(dir_name,f))]
-
seqSetIndex=0
-
fnSeq=fList1[seqSetIndex]
-
while True:
-
-
-
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
seqSetIndex+=1
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
#f=open(outputfile,"r")
-
#file_con=f.readlines()
-
#for line in file_con:
-
# print line
-
bvdet 2,851
Expert Mod 2GB
You have two while loops. When calcData() returns None (when the end of file is reached), you issue one break statement. You may need another to get out of the other loop.
this is my code now.. i have added a break statement and still there seems to be some prob. it is coming out of the loop and saying
we have reached the end of file
None
here is the code -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fnSeq, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
#if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
-
-
fnArray ='C:\python25\half.txt'
-
fnSeq = 'C:\python25\New Folder'
-
import os
-
dir_name='New Folder'
-
fList=os.listdir(dir_name)
-
fList1=[os.path.join(dir_name,f) for f in fList if os.path.isfile(os.path.join(dir_name,f))]
-
seqSetIndex=0
-
fnSeq=fList1[seqSetIndex]
-
while True:
-
-
-
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
seqSetIndex+=1
-
else:
-
break
-
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
#f=open(outputfile,"r")
-
#file_con=f.readlines()
-
#for line in file_con:
-
# print line
-
hey!
That code seens to work but the problem is it is checking only for the first six files... i donno where is the looping problem!!
can you see it?? -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fnSeq, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fnSeq)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
#if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
-
-
fnArray ='one_redfly.transfac'
-
fnSeq = 'deepthi/upstream_regions'
-
import os
-
dir_name='upstream_regions'
-
fList=os.listdir(dir_name)
-
fList1=[os.path.join(dir_name,f) for f in fList if os.path.isfile(os.path.join(dir_name,f))]
-
seqSetIndex=0
-
fnSeq=fList1[seqSetIndex]
-
-
while True:
-
-
-
-
-
outputfile = "sequence_calc_data.txt"
-
-
-
-
arraySet = 1
-
-
outList = []
-
-
calcdata = 1
-
-
while not calcdata is None:
-
-
seqSet = 1
-
-
while True:
-
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
print calcdata
-
-
if calcdata:
-
-
outList.append(calcdata)
-
-
seqSet += 1
-
-
else:
-
-
break
-
-
arraySet += 1
-
seqSetIndex+=1
-
else:
-
break
-
-
-
-
-
-
-
f = open(outputfile, 'w')
-
-
f.write('\n'.join(outList))
-
-
f.close()
-
#f=open(outputfile,"r")
-
#file_con=f.readlines()
-
#for line in file_con:
-
# print line
-
waiting for ur reply!
cheers!!
when the seqset is becoming 6.. it is coming out of the loop..why is this happening???
bvdet 2,851
Expert Mod 2GB
This code is untested, but I think is close to what you are trying to do: - if __name__ == '__main__':
-
import os
-
-
fnArray = 'array_file'
-
outputfile = 'output_file'
-
masterList = []
-
-
# user has multiple sequence files
-
# and must iterate over all the files
-
dir_name = 'your_directory'
-
fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
-
if os.path.isfile(os.path.join(dir_name,f))]
-
-
for fnSeq in fileList:
-
outList = []
-
calcdata = 1
-
while not calcdata is None:
-
seqSet = 1
-
while True:
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
if calcdata:
-
outList.append(calcdata)
-
seqSet += 1
-
else:
-
break
-
arraySet += 1
-
masterList.append('\n'.join(outList))
-
f = open(outputfile, 'w')
-
f.write('\n'.join(masterList))
-
f.close()
-
i donno the code is not executing.. sequence set is not at all incrementing..
when i print seqSet.. it is always one..it is not incrementing and the resultlist is empty.
i have been trying.. there is some problem in writing it to a file also.. -
from math import *
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
line = f.next()
-
-
while not line.startswith(key):
-
-
line = f.next()
-
-
except StopIteration, e:
-
-
#print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
headerList = line.strip().split()[1:]
-
-
-
lineList = []
-
-
-
-
line = f.next().strip()
-
-
while not line.startswith(term):
-
-
if line != '':
-
-
lineList.append(line.strip().split())
-
-
-
line = f.next().strip()
-
-
-
-
f.close()
-
-
-
-
# Key list
-
-
keys = [i[0] for i in lineList]
-
-
# Values list
-
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
-
-
-
# Create a dictionary from keys and values
-
-
lineDict = dict(zip(keys, values))
-
-
-
-
dataDict = {}
-
-
-
-
for i, item in enumerate(headerList):
-
-
dataDict[item] = {}
-
-
for key in lineDict:
-
-
dataDict[item][key] = lineDict[key][i]
-
-
-
-
# Add 1.0 to every element in dataDict subdictionaries
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
-
dataDict[keyMain][keySub] += 1.0
-
-
-
-
# Normalize original data (with 1 added) and update data
-
-
valueSums = [sum(item)+4 for item in values]
-
-
# print valueSums
-
-
-
-
for keyMain in dataDict:
-
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
-
return dataDict
-
-
-
-
-
-
def parseData(fnSeq, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
-
# initialize output list
-
-
dataList = []
-
-
-
-
# open file for reading
-
-
f = open(fn)
-
-
-
-
# skip to required data set
-
-
for _ in range(dataset):
-
-
-
try:
-
-
s = f.next()
-
-
while not s.startswith(key):
-
-
-
s = f.next()
-
-
except StopIteration, e:
-
-
#print 'We have reached the end of the file!'
-
-
f.close()
-
-
return False
-
-
-
-
# initialize output list
-
-
dataList = [s,]
-
-
-
for line in f:
-
-
if not line.startswith(key):
-
-
dataList.append(line.strip())
-
-
else:
-
-
break
-
-
-
-
f.close()
-
-
return dataList
-
-
-
-
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
-
-
-
dataArray = parseArray(fnArray, arraySet)
-
print dataArray
-
-
-
if dataArray:
-
-
dataSeq = parseData(fnSeq, seqSet)
-
#print dataSeq,"@@@@@@@@@@@@@@@@@@@"
-
-
-
if not dataSeq:
-
-
return False
-
-
else:
-
-
return None
-
-
-
-
-
# This is the complete sequence
-
-
seq = ''.join(dataSeq[1:])
-
-
-
-
-
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
-
subKeys = dataArray['A'].keys()
-
-
subKeys.sort()
-
-
-
-
-
-
# Calculate num/den for each slice of sequence
-
-
# Each sequence slice length = length of subKeys
-
-
# Example:
-
# seq = 'ATCGATA'
-
-
# subKeys length = 3
-
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
-
numList = []
-
-
denList = []
-
-
seqList = []
-
-
for i in xrange(len(seq) - len(subKeys)):
-
-
subseq = seq[0:len(subKeys)]
-
-
seqList.append(subseq)
-
-
-
num, den = 1, 1
-
-
for j, s in enumerate(subseq):
-
-
num *= dataArray[s][subKeys[j]]
-
-
den *= value[s]
-
-
numList.append(num)
-
-
denList.append(den)
-
-
seq = seq[1:]
-
-
-
-
resultList = []
-
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
-
-
-
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
import os
-
-
-
-
fnArray ='half.txt'
-
outputfile='output_file'
-
masterList=[]
-
dir_name='New Folder'
-
fileList=[os.path.join(dir_name,f) for f in os.listdir(dir_name) if os.path.isfile(os.path.join(dir_name,f))]
-
-
-
-
for fnSeq in fileList:
-
outList=[]
-
calcdata=1
-
arraySet=1
-
-
while not calcdata is None:
-
-
seqSet=1
-
while True:
-
print "am here"
-
-
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
-
-
if calcdata:
-
outList.append(calcdata)
-
#print outList
-
seqSet += 1
-
print seqSet,""################"
-
else:
-
break
-
arraySet+=1
-
masterList.append('\n'.join(outList))
-
#f=open(outputfile,'W')
-
#fwrite('\n'.join(masterList))
-
#f.close()
-
waiting for ur reply
cheers!
bvdet 2,851
Expert Mod 2GB
There is no need to post the entire code repeatedly. Presumably the code before "if __name__ == '__main__':" works. I assume that you have one array file and multiple sequence files. You want to iterate on each array set in the array file on each sequence set for each sequence file in a separate directory. The following code works for me given the previous assumptions: - if __name__ == '__main__':
-
import os
-
-
fnArray = r'H:\array.txt'
-
outputfile = r'H:\seq_array_output.txt'
-
masterList = []
-
-
# user has multiple sequence files
-
# and must iterate over all the files
-
dir_name = r'H:\TEMP\sequence_files
-
fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
-
if os.path.isfile(os.path.join(dir_name,f))]
-
-
for fnSeq in fileList:
-
outList = []
-
calcdata = 1
-
arraySet = 1
-
while not calcdata is None:
-
seqSet = 1
-
while True:
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
if calcdata:
-
outList.append(calcdata)
-
seqSet += 1
-
else:
-
break
-
arraySet += 1
-
masterList.append('\n'.join(outList))
-
f = open(outputfile, 'w')
-
f.write('\n'.join(masterList))
-
f.close()
Following is a snippet from the array file: NA bin
PO A C G T
01 0.45 8.27 0.00 11.39
02 0.00 0.00 10.02 10.09
03 5.80 1.39 0.00 12.93
//
//
NA bap
PO A C G T
01 0.00 3.67 0.00 0.00
02 0.00 0.00 3.67 0.00
03 0.00 0.00 0.00 3.67
04 0.00 3.67 0.00 0.00
//
//
NA bcd
PO A C G T
01 42.55 8.75 145.86 8.14
02 0.14 0.53 204.64 0.00
03 126.83 78.02 0.11 0.34
04 0.21 0.17 0.00 204.92
05 0.00 12.38 0.43 192.50
06 174.48 0.95 1.32 28.56
07 79.53 4.70 100.44 20.64
//
// Following is a snippet from one of the sequence files: >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..199271 33
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTC GAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAA GCCTTCGGATTCATTCATTGATCCACAT
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTAC TCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAAT GTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATG CTCTGGG
>Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTA GAAGCTAAATCCATGTCCACGTCAAACC
AAAGACTTGCGGTCTCCAGACCATTGAGTTCTATAAATGGGACTGAGCCA CACCATACACCACACACCACACATACAC
AACATTG
>Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGA AATGGAGATGGATCACGTAGCCGGCCAT
GGCGG
>Him_distal|Drosophila melanogaster|Him|FBgn0030900|X:18039896..18043470
GGTTTTCTGCGATGGCTTCCGCGCCAGCTGAAGTATCTGATTTGCTGCCT TGTTTTTGTTGATATTTCTGCGAAGGGA
CTTGTGCTTTTCAAATGGCCTTTTTTTGGGATTACGGCAAGGGCGCGTTT CCCACGCTCGATCCCCACTTACCATTGG
TGCACGCGATTGCGGCAAGCTGCTGAGGCAAGCTATTAAACGCCACACTG GGCCGGGGGGCGGTACCGGTGGGCGTGG
AATATATAGATCGGAGATATCGCAGGACCCACAGCAGAGCAGAGCCGCAG AGCCACCAACCTCG I don't know what else I can do! Maybe you should check if your defined functions are working as expected.
I sort of took away the spaces and tried executing.. the other functions are correct. please execute this code with the inputs i paste now and please paste ur o/p file
my directory containing the sequence files is called "up"
this contains 24 files.. but let me paste the first two files.
first file called YAL012w contains the sequence
>Scer_YAL012W SGDID:S0000010 5' untranslated region, Chr I 130534 - 130801, 268 bp
ATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCA TTCACGTGATCTCAGCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAA TCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTTATTTAAAGCATAAG ATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTAC AATTGAGGCCTATACACATAGACATTTGCACCTTATACATATACACACAA GACAAAACCAAAAAAAAT
>Smik YAL012W c875:991..1258
GCCACACCGTCTCTACACAGCTTCTTGTTCTTGTTATTCCTAAATATGTT GTAACGCTACGTAATTCCACCTTTCATTACTAATATTAAGCCATTCACGT GACTTTTTCCCAGCGCCTGGTGTGGCGCCACACTTTTTTTTCATAAGAAT CCTCGAGGAAAACAGAAAAATATTTCAGTTATTTAAACTATAAGATGCTT GGGGATATAATGATGGTGCCGTGTGTTTGTGTTTGAGTGTGTTTGAGTGT TCTGAAAGAGAGAGC
>Spar YAL012W c218:45412..45679 (rev)
TTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAA TTAGCCATTCACGTGATACCAGCCAGCTGTGGCGCCACACTTTTTTTCCA TAAAAATCCTCGAGGAAAAGAAAAGAAAAAATATTTCAGTTATTTAAACC AAAAGATGCCTGGGAGATAGAGCTGGTGCCGTGTGAAATCCAGTTTTGAA AGTGCAATTGAGTCCTACACACATAAGCATCCGCACCTTGTATACATATA CGCCAAACAAAACAAAAA
and the second file is called YAl013 and contains the following sequence
>Scer_YAL013W SGDID:S0000011 5' untranslated region, Chr I 129021 - 129271, 251 bp
TACCGTGTGTGCTTACGCATCTATTTGCTGTCGTGAGATCTGTCTCTATG CTTTATTCGTTTTCCATTGTAAAGTCTCAGCATTTATTTCTTGTTCTTTA CTTGTTTTTGCCCTTTCGGGTGATCAAAGTCGTGCTGGGAAATTTTATTC TTATAAAATGATTTTTAGAAATAATAAACTCATAACAGTGCAACGGCAAA GTACAAGGGAAGGAAGCACAGAAGCAAGAGGAGGCGCATCGATCGTGGCA G
>Sbay YAL013W c669:46323..46574 (rev)
ACGTCCATCCCGTGTGTGCTTGCACATCTGCTTGGCCTGCGAGCCGTCCT CTGGGCCATCCGCTCCCTATTGTGAAGTACCAGCCTTTGTTCCTGCTCTT CATAGTTTGCTCTTTCGGGTGATGAAAGTTGGGCTGGAAATTTTATCCCA AAGGACAATATGAAAATAAACTCGAAACTGTGTAGCAACCAAGAGGGAAG GGGAAAGGGAGAGACAGGAGCAAGAGGAAGCATCGGTCGTGGCAGATGAG TC
>Spar YAL013W c218:46936..47186 (rev)
ATACCGTGTGTGCTTACGCATCTATTTGTTCTCGTGAGCCCCGTCTCTGG ACTTTATCCGTTCGCTATTGTAAAGTCTTAGCCTTTATCTCTAGTTCTTT ACCTGCTTTTGCTCTTTCGGGTGATGAAAGTTGCGCTGGGAAATTTTATT CCCATAGAATGATTTTTAGAAATAATAAACTCATAACAGTGCAACGACCA AGTGCAGGGGAAAGTAGCATAGAAGCAAGAGGAGACGCATCGATCGCGGC A
the matrix file am using is called weight matrix.transfac.txt
//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
and here is the code with spaces removed! -
-
from math import *
-
print "hellllllllo"
-
def parseArray(fn, dataset=1, key='PO', term='/'):
-
-
'''
-
-
Read a formatted data file in matrix format and
-
-
compile data into a dictionary
-
-
'''
-
f = open(fn)
-
# skip to required data set
-
for _ in range(dataset):
-
try:
-
line = f.next()
-
while not line.startswith(key):
-
line = f.next()
-
except StopIteration, e:
-
print 'We have reached the end of the file!','matrix file'
-
f.close()
-
return False
-
headerList = line.strip().split()[1:]
-
lineList = []
-
line = f.next().strip()
-
while not line.startswith(term):
-
if line != '':
-
lineList.append(line.strip().split())
-
line = f.next().strip()
-
f.close()
-
# Key list
-
keys = [i[0] for i in lineList]
-
# Values list
-
values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
-
# Create a dictionary from keys and values
-
lineDict = dict(zip(keys, values))
-
dataDict = {}
-
for i, item in enumerate(headerList):
-
dataDict[item] = {}
-
for key in lineDict:
-
dataDict[item][key] = lineDict[key][i]
-
# Add 1.0 to every element in dataDict subdictionaries
-
for keyMain in dataDict:
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] += 1.0
-
# Normalize original data (with 1 added) and update data
-
valueSums = [sum(item)+4 for item in values]
-
# print valueSums
-
for keyMain in dataDict:
-
for keySub in dataDict[keyMain]:
-
dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]
-
return dataDict
-
-
def parseData(fnSeq, dataset=1, key='>'):
-
-
'''
-
-
Read a formatted data file of sequences
-
-
Return a list of sequences
-
-
The first element in the list is the header
-
-
'''
-
# initialize output list
-
dataList = []
-
# open file for reading
-
f = open(fn)
-
# skip to required data set
-
for _ in range(dataset):
-
try:
-
s = f.next()
-
while not s.startswith(key):
-
s = f.next()
-
except StopIteration, e:
-
print 'We have reached the end of the file!','sequence file'
-
f.close()
-
return False
-
# initialize output list
-
dataList = [s,]
-
for line in f:
-
if not line.startswith(key):
-
dataList.append(line.strip())
-
else:
-
break
-
f.close()
-
return dataList
-
-
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
dataArray = parseArray(fnArray, arraySet)
-
#print dataArray
-
if dataArray:
-
dataSeq = parseData(fnSeq, seqSet)
-
#print dataSeq,"@@@@@@@@@@@@@@@@@@@"
-
if not dataSeq:
-
return False
-
else:
-
return None
-
# This is the complete sequence
-
seq = ''.join(dataSeq[1:])
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
subKeys = dataArray['A'].keys()
-
subKeys.sort()
-
# Calculate num/den for each slice of sequence
-
# Each sequence slice length = length of subKeys
-
# Example:
-
# seq = 'ATCGATA'
-
# subKeys length = 3
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
numList = []
-
denList = []
-
seqList = []
-
for i in xrange(len(seq) - len(subKeys)):
-
subseq = seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num, den = 1, 1
-
for j, s in enumerate(subseq):
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
-
resultList = []
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
-
-
if __name__ == '__main__':
-
import os
-
fnArray =r'c:\python25\weightmatrix.transfac.txt'
-
outputfile=r'c:\python25\output_file.txt'
-
masterList=[]
-
dir_name=r'c:\python25\up'
-
fileList=[os.path.join(dir_name,f) for f in os.listdir(dir_name) if os.path.isfile(os.path.join(dir_name,f))]
-
for fnSeq in fileList:
-
outList=[]
-
calcdata=1
-
arraySet=1
-
while not calcdata is None:
-
seqSet=1
-
while True:
-
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
-
if calcdata:
-
outList.append(calcdata)
-
print calcdata
-
seqSet += 1
-
#print seqSet,""################"
-
else:
-
break
-
arraySet+=1
-
masterList.append('\n'.join(outList))
-
#f=open(outputfile,'W')
-
#fwrite('\n'.join(masterList))
-
#f.close()
-
please paste ur output file!
i have been stuck for a longtime now. the file should give you values 1 with sequence parts corresponding to it.
waiting for your reply,
cheers!
oh........ i got the problem! Thanks
i am encountering a problem when i run my code for a huge folder containing many sequence files.. this is what it says
Traceback (most recent call last):
File "this_final_11.py", line 163, in <module>
calcdata=compileData(fnArray,fnSeq,arraySet,seqSet )
File "this_final_11.py", line 121, in compileData
num *= dataArray[s][subKeys[j]]
KeyError: 'N' -
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
-
-
# sequence factor dictionary
-
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
-
dataArray = parseArray(fnArray, arraySet)
-
#print dataArray
-
if dataArray:
-
dataSeq = parseData(fnSeq, seqSet)
-
#print dataSeq,"@@@@@@@@@@@@@@@@@@@"
-
if not dataSeq:
-
return False
-
else:
-
return None
-
# This is the complete sequence
-
seq = ''.join(dataSeq[1:])
-
# These are the subkeys of dataArray - '01', '02', '03',.............
-
subKeys = dataArray['A'].keys()
-
subKeys.sort()
-
# Calculate num/den for each slice of sequence
-
# Each sequence slice length = length of subKeys
-
# Example:
-
# seq = 'ATCGATA'
-
# subKeys length = 3
-
# 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
-
numList = []
-
denList = []
-
seqList = []
-
for i in xrange(len(seq) - len(subKeys)):
-
subseq = seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num, den = 1, 1
-
for j, s in enumerate(subseq):
-
num *= dataArray[s][subKeys[j]]
-
print num,"######",dataArray[s][subKeys[j]],"#########"
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
-
resultList = []
-
for i, num in enumerate(numList):
-
#p=log10(num/denList[i])
-
#if (p) >=2:
-
#print "#########",abs(int(p))
-
if (log10(num/denList[i]))>=2:
-
#print "i am here"
-
resultList.append(int(abs(1)))
-
#print resultList
-
#for i in resultList:
-
#mean=sum(resultList)/len(resultList)
-
#sub=mean-i
-
#queue = []
-
#queue = (sub)**2
-
#print sqrt(queue/len(resultList))
-
-
#print mean,"@@@@@@@@@@"
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
#print "this is line 294"
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
can u see the problem.. i don get it
bvdet 2,851
Expert Mod 2GB
It appears that you have an 'N' in your sequence data. There is no such key in dataArray. Check your data files.
okay.. i have to say that if it encounters "N" the value of this alphabet should be taken as "0"... I am able to do this for denominator.. but for the numerator being calculated.. i want to specify that the log value of "N" is zero .. and if many "NNNNN" come together.. this part of "NNNN" can be omitted.. how do i do this??
this is what am doing and it doesnt seem to solve the problem. -
-
numList = []
-
-
-
-
denList = []
-
seqList = []
-
for i in xrange(len(seq) - len(subKeys)):
-
subseq = seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num, den = 1, 1
-
for j, s in enumerate(subseq):
-
if s=="N":
-
value[s]=0
-
-
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
bvdet 2,851
Expert Mod 2GB
this is what am doing and it doesnt seem to solve the problem. -
-
numList = []
-
-
-
-
denList = []
-
seqList = []
-
for i in xrange(len(seq) - len(subKeys)):
-
subseq = seq[0:len(subKeys)]
-
seqList.append(subseq)
-
num, den = 1, 1
-
for j, s in enumerate(subseq):
-
if s=="N":
-
value[s]=0
-
-
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
numList.append(num)
-
denList.append(den)
-
seq = seq[1:]
You can test for a valid key and perform some alternative action if an invalid key is encountered: - for j, s in enumerate(subseq):
-
if s in 'ACGT':
-
num *= dataArray[s][subKeys[j]]
-
den *= value[s]
-
else:
-
..............
-
numList.append(num)
-
denList.append(den)
Be careful with setting the variable den to zero, because the value will be used as a divisor later and will cause a ZeroDivisionError.
my code is running successffully..my code ran for five hours.results were ok
but when its writing to a file it said.
We have reached the end of the file! sequence file
We have reached the end of the file! matrix file
Traceback (most recent call last):
File "C:\Python25\this_final_1.py", line 174, in <module>
f.write('\n'.join(masterList))
MemoryError
i know its a very huge file but is there an efficient way to solve this problem??
And i also think that this program should not take such a longtime to run..though my sequence folder is 18MB and array file is 69KB. though it is pretty huge the calculation is simple only.. can i compute how long it will typically take to calculate??
I think there is some serious error
waiting for ur reply,
cheers!!
I have changed the indendation of this part so that only whatever is >1 is only printed is it correct?.. dont know if it is checking for the entire file! -
for i, num in enumerate(numList):
-
if (log10(num/denList[i]))>=2:
-
outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
-
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
-
will this check for the entire sequence and all the weight matrix??
bvdet 2,851
Expert Mod 2GB
Obviously accumulating such a large list is taxing your system. A small adjustment in the code should help, and allow you to see the results for each sequence file. - if __name__ == '__main__':
-
-
import os
-
-
fnArray = 'array.txt'
-
outputfile = 'seq_array_output.txt'
-
-
# user has multiple sequence files
-
# and must iterate over all the files
-
dir_name = r'H:\TEMP\temsys\seq_array'
-
fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
-
if os.path.isfile(os.path.join(dir_name,f))]
-
-
for fnSeq in fileList:
-
outList = []
-
calcdata = 1
-
arraySet = 1
-
while calcdata:
-
seqSet = 1
-
while True:
-
calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
-
if calcdata:
-
outList.append(calcdata)
-
seqSet += 1
-
else:
-
break
-
arraySet += 1
-
# write to output file for each sequence file
-
f = open(outputfile, 'a')
-
f.write('\n'.join(outList))
-
f.close()
-
print 'Data written for sequence file %s' % fnSeq
You should check the validity of the calculations on a sample array file and two or three sample sequence files. To limit the output to calculation results that are greater than 1: - ....resultList = []
-
seqList1 = []
-
-
for i, num in enumerate(numList):
-
result = log10(num/denList[i])
-
# limit output to results greater than 1
-
if result > 1:
-
# print 'Seq = %s Result = %0.12f' % (seqList[i], result)
-
resultList.append(result)
-
seqList1.append(seqList[i])
-
-
outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
-
(seqList1[i], res) for i, res in enumerate(resultList)])
-
return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
-
(arraySet, seqSet, dataSeq[0], outStr)
but when i change
while not calcdata is None:
to while calcdata:
the array set is not iterating at all.. the set is one through out.why is this happening??
and what exactly is the difference between the two lines??
and also
if any log value is greater than two then am taking that value as one.. only for these sequences.. i want to print the seqset,arrayset and calculated value.
i want all these written to a file..
will it take a longtime to run??
bvdet 2,851
Expert Mod 2GB
You are correct. The code should be - while not calcdata is None:
I have one array file and two small sequence files for testing. It takes about 2 or 3 seconds.
i want the entire result..i.e; all the log values greater than two(I am taking these values as one..the values lesser than this.. i dont care about). i want the entire thing written in a file.. will that be possible or will their be memory error??
bvdet 2,851
Expert Mod 2GB
i want the entire result..i.e; all the log values greater than two(I am taking these values as one..the values lesser than this.. i dont care about). i want the entire thing written in a file.. will that be possible or will their be memory error??
Try and see. The modified code from my earlier post writes the data to file for each sequence file. That should solve your memory error, but the only way to find for sure out is to try it.
yup.. I got the o/p file for the last sequence file.. its about 87 KB. if i want all the values for all the 5000 files in my folder.. it will take 440MB..is it possible to write such a huge file?? because i need the entire output.
okay.. i dont want to do that because this is not the final result of my task so.. i am not going to write it to a file.
what i should be doing is that..
i get the results which have log values of ones with four headers...like
>skud
>scer
>smik
>spar
if my result list has something like this
Array set # = 1
Sequence set # = 3
Sequence Header: >Skud YPR204W c2068:3442..4218
Sequence = AACTGTACACT Calculation = 1
Sequence = GATTATAACAA Calculation = 1
Sequence Header: >Smik YPR204W c2854:249..1235
Sequence = GAAGAGATACTAACAA Calculation = 1
Sequence = AATGACCGCGGCTCTT Calculation = 1
Array set # = 3
Sequence set # = 3
Sequence Header: >Skud YPR204W c2068:3442..4218
Sequence = AACTGTACACTGATTA Calculation = 1
Sequence = TAACAAGAACGGTTCA Calculation = 1
Sequence = TCGGAGCCTCGACTAA Calculation = 1
Sequence = AGACACTTGACGGACT Calculation = 1
Sequence = CACTTCAGATTACTTG Calculation = 1
Array set # = 3
Sequence set # = 4
this a part of the file
what i should do is... count the number of ones at every occurrence and write it to a list and find the average..
for example..skud=[2,5][at first occurrence of skud there were 2 ones then there were 5 ones] smik[2][there is only one occurence of this with two ones calculated]
now i should compute the average for skud which is 5+2/2=3(approx) similarly for smik 2/1=2
waiting for your reply,
cheers!
I have done this average calculation to a part of my output_file.. but i should be doing this on resultlist directly. i dont now how to exactly do that.here is the code for calculating average for a small part of my o/p file -
-
# Open the file. File is in the same directory as source code
-
f=file("result.txt")
-
-
# Put the Contents os the testfile in list a
-
-
a=f.readlines()
-
-
# create a 'mylist' which will contain only the data we require i.e. header and calculation line
-
-
mylist=[]
-
-
-
-
-
for i in a:
-
if i.find('Header')==-1:
-
pass
-
else:
-
name=""
-
for j in i:
-
if j ==">":
-
for k in range(1,5):
-
indx = i.index(j)+k
-
name+=i[indx]
-
mylist.append(name)
-
-
-
-
if i.find('Calculation')==-1:
-
pass
-
else:
-
mylist.append(i)
-
-
-
#name=" "
-
-
-
# Make the List of all the yeast in the text file.The key values are set initially. they will be incrimented for corresponding ones
-
-
yeasts={'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}
-
-
# Dictioalry to count only those yeasts who have one (valid yeasts). It will skip blank yeasts
-
-
-
valid={'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}
-
-
#Dictionary to store the values of average
-
-
average={}
-
-
-
# Function to read the name of yeast .
-
# e.g. >ABC as the first argument and index of >ABC as second argument. Returnes name ABC
-
-
-
def readname(r):
-
name=""
-
for i in r:
-
if i ==">":
-
for j in range(1,5):
-
indx = r.index(i)+j
-
name+=r[indx]
-
return name
-
-
-
def name(r):
-
name=""
-
for i in r:
-
if i ==">":
-
for j in range(1,5):
-
indx = r.index(i)+j
-
name+=r[indx]
-
return name
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
# Function to claculate ones for each species. Takes name of species as argument
-
-
-
for v in range(len(mylist)-2):
-
s=mylist[v]
-
if '1' in s:
-
yeasts[name]+=1
-
else:
-
name=s
-
if v<len(mylist):
-
t=mylist[v+1]
-
if '1' in t:
-
valid[name]+=1
-
-
-
-
-
-
-
-
-
-
# Calculations for output (Average) Number
-
-
-
for keys in yeasts:
-
if yeasts[keys]==0:
-
print "Yeast %s has no occurence" %keys
-
pass
-
else:
-
if valid[keys]==0:
-
print "Yeast %s has no occurence" %keys
-
pass
-
else:
-
average[keys]=yeasts[keys]/valid[keys]
-
-
print yeasts
-
print valid
-
for keys in average:
-
print "%s Avareage is : %d" %(keys,average[keys])
-
-
f.close()
-
i dont exactly know how to string this to my main program.. should i put this under main or seperate function as this is calculating from what the code is computing.
waiting for ur reply
cheers!
I want to know if i should write a seperate function to compute average or can i do it in the main part??
Post your reply Sign in to post your reply or Sign up for a free account.
Similar topics
8 posts
views
Thread by kaptain kernel |
last post: by
|
5 posts
views
Thread by B-Dog |
last post: by
|
2 posts
views
Thread by deko |
last post: by
|
10 posts
views
Thread by bienwell |
last post: by
|
3 posts
views
Thread by Chung Leong |
last post: by
|
1 post
views
Thread by Alex |
last post: by
|
1 post
views
Thread by laredotornado |
last post: by
|
5 posts
views
Thread by Mark |
last post: by
| | | | | | | | | | | |