473,385 Members | 1,351 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

mapping fasta files into dictionary (to create non-redundant fasta file)

Hi,

I am new to python. I have to mapp fasta file into dictionary. There are around 1000 sequences in my fasta file. The problem is that there are some the same sequences under different sequence id. I can sorted them out by accession number which is unique. The first line of my fasta file looks as follows:
>seqId|GeneName|AccessionNumber|taxaNumber|Organiz mName|AdditionalInfo

the next lines consist of amino acids.


I need to make non-redundant fasta file for these sequences on the base of unique AccessionNumber. I was sugessted to create dictionary but I am not sure how to do it for that problem. Can someone help me please.

Many tanks,
E.
Feb 2 '10 #1
2 5043
bvdet
2,851 Expert Mod 2GB
Elniunia,

Formatted data can be very simple to convert to a dictionary. Is your data delimited by the "|" character? It could be as simple as:
Expand|Select|Wrap|Line Numbers
  1. f = open("fasta.txt")
  2. headerList = f.readline().strip().split("|")
  3. dd = {}
  4. for line in f:
  5.     lineList = line.strip().split("|")
  6.     dd[lineList.pop(2)] = lineList
  7. f.close()
Using the code above, this data:
Expand|Select|Wrap|Line Numbers
  1. seqId|GeneName|AccessionNumber|taxaNumber|Organiz mName|AdditionalInfo
  2. AAA|XYZ|0001|23658876|Bill|line 1
  3. CCC|D&HFREE|0002|99999931|John|line 2
is converted to this dictionary:
Expand|Select|Wrap|Line Numbers
  1. >>> for key in dd:
  2. ...     print key, dd[key]
  3. ...     
  4. 0001 ['AAA', 'XYZ', '23658876', 'Bill', 'line 1']
  5. 0002 ['CCC', 'D&HFREE', '99999931', 'John', 'line 2']
  6. >>> 
Feb 2 '10 #2
Glenton
391 Expert 256MB
Hi Elniunia

It's possible that I don't understand your situation precisely, but perhaps it's similar to mine. I often have data files which have a header row, and then many lines of data.

Eg
Temperature, Voltage, Current, etc
5.002, 1.32, 0.00032, etc
6.003, 1.42, 0.00042, etc
etc

I then find it very convenient to make a dictionary of numpy arrays.
I have this function which I use to create this dictionary of arrays:
Expand|Select|Wrap|Line Numbers
  1. from numpy import *
  2.  
  3. def MyOpen(myFile,textRow=0,dataStarts=1,hasHeadings=True,separater=NoneappendWhenNotDigit=True,returnArray=True):
  4.     """Opens txt file (myFile), which has a standard format of
  5.     text headings (with no space) separated by white space, followed
  6.     by numbers separated in the same way.
  7.     Output is a dictionary based the first row, with lists.
  8.     textRow is the row containing the headings.
  9.     dataStarts is the first row containing the data, and must be bigger
  10.     that textRow.
  11.     If there are no text headings then set hasHeadings to
  12.     False, and they'll be labelled in the dictionary by 'Col0' etc
  13.     If appendWhenNotDigit=True (default), then all rows will be appended.
  14.     Setting it to False, will mean that rows containing non-numeric values
  15.     will not be appended"""
  16.     f=open(myFile,'r')
  17.     g=f.readlines()
  18.     f.close()
  19.     ###change to lists###
  20.     h=[]
  21.     for n,i in enumerate(g):
  22.         if n<dataStarts and n<>textRow: continue
  23.  
  24.         if separater==None:
  25.             temp1=i.split()
  26.         else:
  27.             temp1=i.split(separater)
  28.         temp2=[]
  29.         myAppend=True
  30.         for j in temp1:
  31.             #if j.isdigit():
  32.             #    temp2.append(int(j))
  33.             if isNumber(j.strip()):
  34.                 temp2.append(float(j.strip()))
  35.             else:
  36.                 temp2.append(j.strip())
  37.                 if n<>textRow and not appendWhenNotDigit:
  38.                     myAppend=False
  39.                     break
  40.         if myAppend: h.append(temp2)
  41.     ###create dictionary
  42.     d=dict([])
  43.     if hasHeadings:
  44.         for hi in h[0]:
  45.             d[hi]=[]
  46.     else:
  47.         for i in range(len(h[0])):
  48.             d["Col"+str(i)]=[]
  49.     for i in range(hasHeadings,len(h)):
  50.         for j in range(len(h[0])):
  51.             if hasHeadings:
  52.                 d[h[0][j]].append(h[i][j])
  53.             else:
  54.                 d["Col"+str(j)].append(h[i][j])
  55.     if returnArray==True:
  56.         e=dict([])
  57.         for k in d.keys():
  58.             e[k]=array(d[k])
  59.         return e
  60.     return d
There are several advantages to doing it this way.
Firstly if you need to calculate another set of results based on the data you've stored, it can be done like this:
Expand|Select|Wrap|Line Numbers
  1. def calc(a,d,A):
  2.     """a is the array based dictionary from the raw data & it will return
  3.     a dictionary where additional variables have been calculated"""
  4.     T=a["T/K"]
  5.     q=a["Theta"]
  6.     Z=a["Z"]
  7.     a["10/T"]=10/T
  8.     a["T-0.5"]=T**(-0.5)
  9.     return a
But the other thing you can do is first sort your data by AccessionNumber with this function:
Expand|Select|Wrap|Line Numbers
  1. def sort(a,sortName="T/K"):
  2.  
  3.     """a is an array dictionary.  Sorts all arrays by one of them"""
  4.  
  5.     #use  list.insert(bisect_left(list,element),elemnt) to create
  6.  
  7.     #a mask and apply it to all the elements
  8.  
  9.     mask=[]
  10.  
  11.     vals=[]
  12.  
  13.     for n,t in enumerate(a[sortName]):
  14.  
  15.         ins=bisect_left(vals,t)
  16.  
  17.         mask.insert(ins,n)
  18.  
  19.         vals.insert(ins,t)
  20.  
  21.     a2=dict()
  22.  
  23.     for k in a.keys():
  24.  
  25.         a2[k]=a[k][mask]
  26.  
  27.     return a2
  28.  
You just need to pass the dictionary you created to it and the name of the field you want to sort by.

Then I guess you want to remove duplicates. I haven't got a function for it, but something like this will do the job:
Expand|Select|Wrap|Line Numbers
  1. def removeDuplicates(a,sortName):
  2.     """a is an array dictionary.  Sorts all arrays by one of them"""
  3.     #use  list.insert(bisect_left(list,element),elemnt) to create
  4.     #a mask and apply it to all the elements
  5.     a=sort(a,sortName)    
  6.     mask=a[sortName][:-1]==a[sortName][1:]
  7.     mask=concatenate(array(True),mask)
  8.     for k in a.keys():
  9.         a2[k]=a[k][mask]
  10.     return a2
  11.  
I'm afraid I haven't had a chance to test this code.
Feb 3 '10 #3

Sign in to post your reply or Sign up for a free account.

Similar topics

4
by: Thomas Jespersen | last post by:
Hello I want to create a MSI file programmatically. Do you know of any third party .NET component which can help me with that? I'm going to use it like a self extracting zip. So it is not...
7
by: Martin | last post by:
I have a situation where I'm displaying some information in a table on a web page. I've given the user the ability to make several different "queries" and show different sub-sets of the data. I...
2
by: sudheervemana | last post by:
Dear all, In my main directory there are some source files and i have another directory which includes several folders,each contains the make files.Now i want to debug my source code in either...
0
by: amartos | last post by:
Hi, I am trying to create a dll file from several resources files. I am using visual studio .net and I have a c++ project which has included a file called "Resource.es-ES.resources", in the...
27
by: jeniffer | last post by:
I need to create an excel file through a C program and then to populate it.How can it be done?
8
by: psbasha | last post by:
Hi, I would like to write the dictionary data to the file. The following stuff I am doing : sampleDict = { 100:,200:, 300:} list1 = sampleDict.keys() sf = ...
1
by: ujjwaltrivedi | last post by:
Hey guys, Can anyone tell me how to create a text file with Unicode Encoding. In am using FileStream Finalfile = new FileStream("finalfile.txt", FileMode.Append, FileAccess.Write); ...
0
by: nettynet | last post by:
I'm really new in java. I'm trying to read 8000 URL files and write to a file. It's a kind of combining all files together...not overwritten. This is the URL of my all files...
3
crazy4perl
by: crazy4perl | last post by:
Hi All, I have some doubt related to xml. Actually I want to update a file which is in some format. So I am converting that file using Tap3edit perl module in a hash. Now I m trying to create a...
1
by: PavelRumberg | last post by:
Hello! I'm building a system which allow users to upload files for storage. There are many handled file types in my system but I can't figure out how to upload "Solid Works" projects, which are the...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.