Mapping FASTA files into a dictionary (to create a non-redundant FASTA file)

Hi,

I am new to Python. I have to map a FASTA file into a dictionary. There are around 1000 sequences in my FASTA file. The problem is that some identical sequences appear under different sequence IDs. I can sort them out by accession number, which is unique. The first line of my FASTA file looks as follows:
>seqId|GeneName|AccessionNumber|taxaNumber|OrganizmName|AdditionalInfo

The next lines consist of amino acids.

I need to make a non-redundant FASTA file for these sequences on the basis of the unique AccessionNumber. I was advised to create a dictionary, but I am not sure how to do that for this problem. Can someone help me, please?

Many thanks,
E.

Feb 2 '10 #1

bvdet

Elniunia,

Formatted data can be very simple to convert to a dictionary. Is your data delimited by the "|" character? It could be as simple as:

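# Assumes "fasta.txt" holds a "|"-delimited header line followed by data lines
with open("fasta.txt") as f:
    # The first line holds the field names
    headerList = f.readline().strip().split("|")
    dd = {}
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        lineList = line.strip().split("|")
        # The third field (AccessionNumber) becomes the key, the rest the value
        dd[lineList.pop(2)] = lineList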

Using the code above, this data:

seqId|GeneName|AccessionNumber|taxaNumber|OrganizmName|AdditionalInfo
AAA|XYZ|0001|23658876|Bill|line 1
CCC|D&HFREE|0002|99999931|John|line 2

is converted to this dictionary:

>>> for key in dd:
...     print(key, dd[key])
...
0001 ['AAA', 'XYZ', '23658876', 'Bill', 'line 1']
0002 ['CCC', 'D&HFREE', '99999931', 'John', 'line 2']
>>>
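
For a real FASTA file, where each record is a ">" header line followed by one or more lines of amino acids, the same dictionary idea might be sketched as below. This is only a minimal sketch under stated assumptions: the function names parse_fasta, unique_by_accession and write_fasta are made up for illustration, the accession number is assumed to be the third "|"-separated field of the header, the first record seen for each accession number is the one kept, and the sample records are invented.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_by_accession(records, field=2):
    """Return {accession: (header, sequence)}, keeping the first record seen.

    Assumption: the accession number is the `field`-th "|"-separated item.
    """
    unique = {}
    for header, sequence in records:
        accession = header[1:].split("|")[field]
        unique.setdefault(accession, (header, sequence))
    return unique


def write_fasta(unique, path, width=60):
    """Write the kept records to `path`, wrapping sequences at `width` letters."""
    with open(path, "w") as out:
        for header, sequence in unique.values():
            out.write(header + "\n")
            for i in range(0, len(sequence), width):
                out.write(sequence[i:i + width] + "\n")


# Invented sample records: the first two share the accession number ACC1
sample = [
    ">s1|geneA|ACC1|9606|Homo sapiens|first copy",
    "MKV",
    "LLA",
    ">s2|geneA|ACC1|9606|Homo sapiens|second copy",
    "MKVLLA",
    ">s3|geneB|ACC2|10090|Mus musculus|other",
    "MAA",
]
unique = unique_by_accession(parse_fasta(sample))
```

With a file on disk, an open file object can be passed to parse_fasta in place of the sample list, and write_fasta(unique, "nonredundant.fasta") would then write the reduced set (the output file name is just an example).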

Feb 2 '10 #2

Glenton

Hi Elniunia

It's possible that I don't understand your situation precisely, but perhaps it's similar to mine. I often have data files which have a header row, and then many lines of data.

For example:
Temperature, Voltage, Current, etc
5.002, 1.32, 0.00032, etc
6.003, 1.42, 0.00042, etc
etc

I then find it very convenient to make a dictionary of numpy arrays.
I have this function which I use to create this dictionary of arrays:

import numpy as np

def isNumber(s):
    """Return True if the string s can be converted to a float.
    (isNumber was not shown in the original post; this is an assumed
    minimal version.)"""
    try:
        float(s)
        return True
    except ValueError:
        return False

def MyOpen(myFile,textRow=0,dataStarts=1,hasHeadings=True,separater=None,
           appendWhenNotDigit=True,returnArray=True):
    """Opens txt file (myFile), which has a standard format of
    text headings (with no space) separated by white space, followed
    by numbers separated in the same way.
    Output is a dictionary based on the first row, with lists
    (or numpy arrays if returnArray is True).
    textRow is the row containing the headings.
    dataStarts is the first row containing the data, and must be bigger
    than textRow.
    If there are no text headings then set hasHeadings to
    False, and they'll be labelled in the dictionary by 'Col0' etc
    If appendWhenNotDigit=True (default), then all rows will be appended.
    Setting it to False will mean that rows containing non-numeric values
    will not be appended"""
    with open(myFile,'r') as f:
        g=f.readlines()
    ###change to lists###
    h=[]
    for n,i in enumerate(g):
        #skip everything before the data except the heading row
        if n<dataStarts and n!=textRow: continue
        if separater is None:
            temp1=i.split()
        else:
            temp1=i.split(separater)
        temp2=[]
        myAppend=True
        for j in temp1:
            if isNumber(j.strip()):
                temp2.append(float(j.strip()))
            else:
                temp2.append(j.strip())
                if n!=textRow and not appendWhenNotDigit:
                    myAppend=False
                    break
        if myAppend: h.append(temp2)
    ###create dictionary###
    d={}
    if hasHeadings:
        for hi in h[0]:
            d[hi]=[]
    else:
        for i in range(len(h[0])):
            d["Col"+str(i)]=[]
    #hasHeadings (True/False) doubles as the index of the first data row
    for i in range(hasHeadings,len(h)):
        for j in range(len(h[0])):
            if hasHeadings:
                d[h[0][j]].append(h[i][j])
            else:
                d["Col"+str(j)].append(h[i][j])
    if returnArray:
        #convert each column list to a numpy array
        e={}
        for k in d.keys():
            e[k]=np.array(d[k])
        return e
    return d

There are several advantages to doing it this way.
Firstly, if you need to calculate another set of results based on the data you've stored, it can be done like this:

def calc(a,d,A):
    """a is the array-based dictionary from the raw data; it will return
    a dictionary where additional variables have been calculated"""
    #d, A, q and Z are not used in this shortened example
    T=a["T/K"]
    q=a["Theta"]
    Z=a["Z"]
    #arithmetic on numpy arrays is applied element by element
    a["10/T"]=10/T
    a["T-0.5"]=T**(-0.5)
    return a
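
To make the dictionary-of-arrays idea concrete with the Temperature/Voltage/Current example from earlier in this post, here is a minimal, self-contained sketch. It is not the MyOpen function: the helper name columns_from_lines and the derived "Power" column are made up for illustration, and the data is read from an in-memory list of comma-separated lines rather than from a file.

```python
import numpy as np

def columns_from_lines(lines, separator=","):
    """Build {heading: numpy array} from a heading line followed by numeric rows."""
    headings = [name.strip() for name in lines[0].split(separator)]
    rows = [[float(value) for value in line.split(separator)] for line in lines[1:]]
    data = np.array(rows)
    # Column i of the 2-D array becomes the array stored under heading i
    return {name: data[:, i] for i, name in enumerate(headings)}

lines = [
    "Temperature, Voltage, Current",
    "5.002, 1.32, 0.00032",
    "6.003, 1.42, 0.00042",
]
table = columns_from_lines(lines)

# A derived column is a single element-wise expression on whole arrays
table["Power"] = table["Voltage"] * table["Current"]
```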

But the other thing you can do is first sort your data by AccessionNumber with this function:

from bisect import bisect_left

def sort(a,sortName="T/K"):
    """a is an array dictionary.  Sorts all arrays by one of them"""
    #use list.insert(bisect_left(list,element),element) to create
    #a mask and apply it to all the elements
    mask=[]
    vals=[]
    for n,t in enumerate(a[sortName]):
        ins=bisect_left(vals,t)
        mask.insert(ins,n)
        vals.insert(ins,t)
    a2=dict()
    for k in a.keys():
        #indexing with a list of positions needs numpy arrays, not plain lists
        a2[k]=a[k][mask]
    return a2

You just need to pass the dictionary you created to it and the name of the field you want to sort by.
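
As an aside, numpy can also produce the sorting mask directly: numpy.argsort returns the indices that would sort an array. The sketch below is an alternative to the bisect-based function, not a copy of it; the name sort_by is made up, kind="stable" is chosen so that rows with equal keys keep their original relative order, and the small sample dictionary is invented.

```python
import numpy as np

def sort_by(a, sortName):
    """Return a new array dictionary with every array reordered by a[sortName]."""
    order = np.argsort(a[sortName], kind="stable")
    return {k: v[order] for k, v in a.items()}

# Invented sample: two rows share the accession number "0002"
a = {
    "AccessionNumber": np.array(["0002", "0001", "0002"]),
    "seqId": np.array(["CCC", "AAA", "DDD"]),
}
b = sort_by(a, "AccessionNumber")
```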

Then I guess you want to remove duplicates. I haven't got a function for it, but something like this will do the job:

import numpy as np

def removeDuplicates(a,sortName):
    """a is an array dictionary.  Sorts all arrays by sortName, then keeps
    only the first row of each run of equal values"""
    a=sort(a,sortName)
    #True where an element differs from the one before it
    mask=a[sortName][1:]!=a[sortName][:-1]
    #the first element has no predecessor, so it is always kept
    mask=np.concatenate(([True],mask))
    a2=dict()
    for k in a.keys():
        a2[k]=a[k][mask]
    return a2

I'm afraid I haven't had a chance to test this code.
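
If sorting is not needed for its own sake, a shorter route might be numpy.unique with return_index=True, which reports the index of the first occurrence of each distinct value. This is a sketch under assumptions rather than a tested replacement: the name remove_duplicates and the sample data are invented, and the kept rows are put back into their original order.

```python
import numpy as np

def remove_duplicates(a, keyName):
    """Keep the first row for each distinct value of a[keyName]."""
    _, first = np.unique(a[keyName], return_index=True)
    keep = np.sort(first)  # restore the original row order
    return {k: v[keep] for k, v in a.items()}

# Invented sample: the third row repeats the accession number "0002"
a = {
    "AccessionNumber": np.array(["0002", "0001", "0002"]),
    "seqId": np.array(["CCC", "AAA", "DDD"]),
}
nr = remove_duplicates(a, "AccessionNumber")
```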

Feb 3 '10 #3
