
mapping fasta files into dictionary (to create non-redundant fasta file)


I am new to Python. I have to map a FASTA file into a dictionary. There are around 1000 sequences in my FASTA file. The problem is that some identical sequences appear under different sequence IDs. I can sort them out by accession number, which is unique. The first line of my FASTA file looks as follows:
>seqId|GeneName|AccessionNumber|taxaNumber|OrganismName|AdditionalInfo

The next lines consist of amino acids.

I need to make a non-redundant FASTA file for these sequences on the basis of the unique AccessionNumber. It was suggested that I create a dictionary, but I am not sure how to do that for this problem. Can someone help me, please?

Many thanks,
Feb 2 '10 #1
2 Replies


Formatted data can be very simple to convert to a dictionary. Is your data delimited by the "|" character? It could be as simple as:
    # Assumes "fasta.txt" holds one "|"-delimited record per line, header row first.
    with open("fasta.txt") as f:
        headerList = f.readline().strip().split("|")
        dd = {}
        for line in f:
            lineList = line.strip().split("|")
            # The third field (AccessionNumber) becomes the key; the rest is the value.
            dd[lineList.pop(2)] = lineList
Using the code above, this data:
    seqId|GeneName|AccessionNumber|taxaNumber|OrganismName|AdditionalInfo
    AAA|XYZ|0001|23658876|Bill|line 1
    CCC|D&HFREE|0002|99999931|John|line 2
is converted to this dictionary:
    >>> for key in dd:
    ...     print(key, dd[key])
    ...
    0001 ['AAA', 'XYZ', '23658876', 'Bill', 'line 1']
    0002 ['CCC', 'D&HFREE', '99999931', 'John', 'line 2']
    >>>
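If the file is in real FASTA layout (a ">" header line followed by one or more sequence lines), the same idea can be sketched as below. This is only a minimal sketch under some assumptions: every header has at least three "|"-separated fields, the third one is the AccessionNumber, and the first record seen for an accession is the one to keep. The function names `read_nonredundant` and `write_fasta` and the sample records are made up for illustration.

```python
def read_nonredundant(lines):
    """Map AccessionNumber -> (header, sequence), keeping the first record seen."""
    records = {}
    accession = None  # key of the record being read; None means "skip these lines"
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            key = line[1:].split("|")[2]  # third field = AccessionNumber (assumed)
            if key in records:
                accession = None          # duplicate accession: ignore its sequence
            else:
                accession = key
                records[key] = [line, []]
        elif accession is not None:
            records[accession][1].append(line)
    # Join the wrapped sequence lines into one string per record.
    return {k: (h, "".join(parts)) for k, (h, parts) in records.items()}


def write_fasta(records, out):
    """Write each kept record as a header line followed by its sequence."""
    for header, seq in records.values():
        out.write(header + "\n" + seq + "\n")


# Made-up sample data: s1 and s2 share the accession ACC1.
example = """>s1|geneA|ACC1|9606|Homo sapiens|x
MKV
LLA
>s2|geneA|ACC1|9606|Homo sapiens|y
MKVLLA
>s3|geneB|ACC2|10090|Mus musculus|z
MTT
""".splitlines()

unique = read_nonredundant(example)
```

With a file on disk, an open file object can be passed instead of the list of lines, and `write_fasta(unique, out_file)` writes the non-redundant result.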
Feb 2 '10 #2

Hi Elniunia

It's possible that I don't understand your situation precisely, but perhaps it's similar to mine. I often have data files which have a header row, and then many lines of data.

Temperature, Voltage, Current, etc
5.002, 1.32, 0.00032, etc
6.003, 1.42, 0.00042, etc

I then find it very convenient to make a dictionary of numpy arrays.
I have this function which I use to create this dictionary of arrays:
    from numpy import array

    def isNumber(s):
        # Helper not shown in the original post; assumed here to mean
        # "can be converted to a float".
        try:
            float(s)
            return True
        except ValueError:
            return False

    def MyOpen(myFile, textRow=0, dataStarts=1, hasHeadings=True, separater=None,
               appendWhenNotDigit=True, returnArray=True):
        """Opens txt file (myFile), which has a standard format of
        text headings (with no space) separated by white space, followed
        by numbers separated in the same way.
        Output is a dictionary based on the first row, with lists.
        textRow is the row containing the headings.
        dataStarts is the first row containing the data, and must be bigger
        than textRow.
        If there are no text headings then set hasHeadings to
        False, and they'll be labelled in the dictionary by 'Col0' etc
        If appendWhenNotDigit=True (default), then all rows will be appended.
        Setting it to False, will mean that rows containing non-numeric values
        will not be appended"""
        with open(myFile, 'r') as f:
            g = f.readlines()
        ### change to lists ###
        h = []
        for n, i in enumerate(g):
            # skip everything before the data except the heading row
            if n < dataStarts and n != textRow: continue
            if separater is None:
                temp1 = i.split()
            else:
                temp1 = i.split(separater)
            temp2 = []
            myAppend = True
            for j in temp1:
                #if j.isdigit():
                #    temp2.append(int(j))
                if isNumber(j.strip()):
                    temp2.append(float(j.strip()))
                else:
                    temp2.append(j.strip())
                    if n != textRow and not appendWhenNotDigit:
                        myAppend = False
                        break
            if myAppend: h.append(temp2)
        ### create dictionary
        d = dict([])
        if hasHeadings:
            for hi in h[0]:
                d[hi] = []
        else:
            for i in range(len(h[0])):
                d["Col" + str(i)] = []
        # hasHeadings (True/False) doubles as the index of the first data row
        for i in range(hasHeadings, len(h)):
            for j in range(len(h[0])):
                if hasHeadings:
                    d[h[0][j]].append(h[i][j])
                else:
                    d["Col" + str(j)].append(h[i][j])
        if returnArray:
            e = dict([])
            for k in d.keys():
                e[k] = array(d[k])
            return e
        return d
There are several advantages to doing it this way.
Firstly, if you need to calculate another set of results from the data you've stored, it can be done like this:
    def calc(a,d,A):
        """a is the array based dictionary from the raw data & it will return
        a dictionary where additional variables have been calculated"""
        T=a["T/K"]
        q=a["Theta"]
        Z=a["Z"]
        a["10/T"]=10/T
        a["T-0.5"]=T**(-0.5)
        return a
But the other thing you can do is first sort your data by AccessionNumber with this function:
    from bisect import bisect_left

    def sort(a,sortName="T/K"):
        """a is an array dictionary.  Sorts all arrays by one of them"""
        #use  list.insert(bisect_left(list,element),element) to create
        #a mask and apply it to all the elements
        mask=[]
        vals=[]
        for n,t in enumerate(a[sortName]):
            ins=bisect_left(vals,t)
            mask.insert(ins,n)
            vals.insert(ins,t)
        a2=dict()
        for k in a.keys():
            a2[k]=a[k][mask]
        return a2
You just need to pass the dictionary you created to it and the name of the field you want to sort by.
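As a possible shortcut, numpy's own `argsort` can produce the same kind of index mask in one call. The sketch below is not the function from this post; `sort_by` and the sample data are hypothetical names used only to illustrate the idea on a small dictionary of arrays.

```python
import numpy as np

def sort_by(a, sort_name):
    """Return a new dictionary with every array reordered by a[sort_name]."""
    # argsort gives the indices that would sort the chosen array;
    # kind="stable" keeps equal values in their original relative order.
    order = np.argsort(a[sort_name], kind="stable")
    return {k: v[order] for k, v in a.items()}

# Made-up sample data.
data = {
    "AccessionNumber": np.array(["0003", "0001", "0002"]),
    "GeneName": np.array(["C", "A", "B"]),
}
sorted_data = sort_by(data, "AccessionNumber")
```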

Then I guess you want to remove duplicates. I haven't got a function for it, but something like this will do the job:
    from numpy import concatenate

    def removeDuplicates(a,sortName):
        """a is an array dictionary.  Sorts all arrays by sortName, then keeps
        only the first row for each distinct value of a[sortName]"""
        a=sort(a,sortName)
        #True wherever a value differs from the one before it
        mask=a[sortName][:-1]!=a[sortName][1:]
        #the first row has no predecessor, so it is always kept
        mask=concatenate(([True],mask))
        a2=dict()
        for k in a.keys():
            a2[k]=a[k][mask]
        return a2
I'm afraid I haven't had a chance to test this code.
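A variant that skips the separate sort step might use `numpy.unique` with `return_index=True`, which reports the index of the first occurrence of each distinct value. This is only a sketch; `unique_by` and the sample data are hypothetical and not part of the code above.

```python
import numpy as np

def unique_by(a, key_name):
    """Keep the first row for each distinct value of a[key_name]."""
    # return_index=True yields the index of the first occurrence of each value.
    _, first = np.unique(a[key_name], return_index=True)
    first.sort()  # put the kept rows back into their original order
    return {k: v[first] for k, v in a.items()}

# Made-up sample data: accession "0002" appears twice.
data = {
    "AccessionNumber": np.array(["0002", "0001", "0002", "0003"]),
    "seqId": np.array(["s1", "s2", "s3", "s4"]),
}
deduped = unique_by(data, "AccessionNumber")
```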
Feb 3 '10 #3
