By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
429,587 Members | 1,168 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 429,587 IT Pros & Developers. It's quick & easy.

transfer genome text file into excel file and select each cell to blast in NCBI

P: 4
I would like to know how to select each item of data from text file transfer to each column in excel file.
Expand|Select|Wrap|Line Numbers
  1.                      source          1..159662
  2.                      /organism="Candidatus Carsonella ruddii PV"
  3.                      /mol_type="genomic DNA"
  4.                      /strain="PV"
  5.                      /specific_host="Pachypsylla venusta"
  6.                      /db_xref="taxon:387662"
  7.      gene            1..1317
  8.                      /locus_tag="CRP_001"
  9.                      /db_xref="GeneID:4414829"
  10.      CDS             1..1317
  11.                      /locus_tag="CRP_001"
  12.                      /codon_start=1
  13.                      /transl_table=11
  14.                      /product="tRNA modification GTPase"
  15.                      /protein_id="YP_802398.1"
  16.                      /db_xref="GI:116334903"
  17.                      /db_xref="GeneID:4414829"
  18.  
After that, how can I select each gene (in each cell) from excel file to blast (align) in NCBI database.
Feb 21 '07 #1
Share this Question
Share on Google+
7 Replies


bartonc
Expert 5K+
P: 6,596
I'm not sure if this is a python question. Are you using Visual Basic or Python?

I would like to know how to select each item of data from text file transfer to each column in excel file.
Expand|Select|Wrap|Line Numbers
  1.                      source          1..159662
  2.                      /organism="Candidatus Carsonella ruddii PV"
  3.                      /mol_type="genomic DNA"
  4.                      /strain="PV"
  5.                      /specific_host="Pachypsylla venusta"
  6.                      /db_xref="taxon:387662"
  7.      gene            1..1317
  8.                      /locus_tag="CRP_001"
  9.                      /db_xref="GeneID:4414829"
  10.      CDS             1..1317
  11.                      /locus_tag="CRP_001"
  12.                      /codon_start=1
  13.                      /transl_table=11
  14.                      /product="tRNA modification GTPase"
  15.                      /protein_id="YP_802398.1"
  16.                      /db_xref="GI:116334903"
  17.                      /db_xref="GeneID:4414829"
  18.  
After that, how can I select each gene (in each cell) from excel file to blast (align) in NCBI database.
Feb 21 '07 #2

P: 4
I use python however I'm just the beginner of programing. Ok, I think my first problem is how can I select only each word that say "gene locus = xxx" from the data set in text file and make a new data set like this pattern => "gene locus1 =xxx", "gene locus2 =xxx", "gene locus3 =xxx", ......

I'm not sure if this is a python question. Are you using Visual Basic or Python?
Feb 21 '07 #3

bartonc
Expert 5K+
P: 6,596
I use python however I'm just the beginner of programing. Ok, I think my first problem is how can I select only each word that say "gene locus = xxx" from the data set in text file and make a new data set like this pattern => "gene locus1 =xxx", "gene locus2 =xxx", "gene locus3 =xxx", ......
Did your post copy you text file format accurately? It looks a bit odd to me.
Feb 21 '07 #4

bvdet
Expert Mod 2.5K+
P: 2,851
I use python however I'm just the beginner of programing. Ok, I think my first problem is how can I select only each word that say "gene locus = xxx" from the data set in text file and make a new data set like this pattern => "gene locus1 =xxx", "gene locus2 =xxx", "gene locus3 =xxx", ......
One option that should be straightforward is to read each line in your data file, extract the text you need, and write the rusults to a tab delimited file formatted into rows and columns. This file can be imported directly into Excel. If you want us to help you with this, you should post a portion of a representative file and the format you require for the output (rows and columns of the spreadsheet).
Feb 21 '07 #5

P: 4
Thanks for the suggestion, here is part of my raw data from NCBI database.


gene 1..1317
/locus_tag="CRP_001"
/db_xref="GeneID:4414829"
CDS 1..1317
/locus_tag="CRP_001"
/codon_start=1
/transl_table=11
/product="tRNA modification GTPase"
/protein_id="YP_802398.1"
/db_xref="GI:116334903"
/db_xref="GeneID:4414829"
/translation="KNLKCFINKIVDNKDFSKNNYSDVKILFNKFSF"

gene 1314..2816
/locus_tag="CRP_002"
/db_xref="GeneID:4414830"
CDS 1314..2816
/locus_tag="CRP_002"
/codon_start=1
/transl_table=11
/product="glucose inhibited division protein A"
/protein_id="YP_802399.1"
/db_xref="GI:116334904"
/db_xref="GeneID:4414830"
/translation="KIKLFDNFYLFKLIIIMSKYYGYIKKKYFK"

gene 2785..3477
/locus_tag="CRP_003"
/db_xref="GeneID:4414831"
CDS 2785..3477
/locus_tag="CRP_003"
/codon_start=1
/transl_table=11
/product="F0F1-type ATP synthase A subunit"
/protein_id="YP_802400.1"
/db_xref="GI:116334905"
/db_xref="GeneID:4414831"
/translation="MVILKKNILNNFLNFKIIDLNLIILL"


And I need to transform it into the pattern as seen below.

No., gene, locus_tag, protein_id, GeneID,
1, 1..1317, CRP_001, YP_802398.1, 4414829,
2, 1314..2816, CRP_002, YP_802399.1, 4414830,
3, 2785..3477, CRP_003, YP_802400.1, 4414831,
4, ................ .............. .................. ...........

If it becomes like this pattern, I will be able to convert it into excel file.
Feb 21 '07 #6

bvdet
Expert Mod 2.5K+
P: 2,851
Thanks for the suggestion, here is part of my raw data from NCBI database.


gene 1..1317
/locus_tag="CRP_001"
/db_xref="GeneID:4414829"
CDS 1..1317
/locus_tag="CRP_001"
/codon_start=1
/transl_table=11
/product="tRNA modification GTPase"
/protein_id="YP_802398.1"
/db_xref="GI:116334903"
/db_xref="GeneID:4414829"
/translation="KNLKCFINKIVDNKDFSKNNYSDVKILFNKFSF"

gene 1314..2816
/locus_tag="CRP_002"
/db_xref="GeneID:4414830"
CDS 1314..2816
/locus_tag="CRP_002"
/codon_start=1
/transl_table=11
/product="glucose inhibited division protein A"
/protein_id="YP_802399.1"
/db_xref="GI:116334904"
/db_xref="GeneID:4414830"
/translation="KIKLFDNFYLFKLIIIMSKYYGYIKKKYFK"

gene 2785..3477
/locus_tag="CRP_003"
/db_xref="GeneID:4414831"
CDS 2785..3477
/locus_tag="CRP_003"
/codon_start=1
/transl_table=11
/product="F0F1-type ATP synthase A subunit"
/protein_id="YP_802400.1"
/db_xref="GI:116334905"
/db_xref="GeneID:4414831"
/translation="MVILKKNILNNFLNFKIIDLNLIILL"


And I need to transform it into the pattern as seen below.

No., gene, locus_tag, protein_id, GeneID,
1, 1..1317, CRP_001, YP_802398.1, 4414829,
2, 1314..2816, CRP_002, YP_802399.1, 4414830,
3, 2785..3477, CRP_003, YP_802400.1, 4414831,
4, ................ .............. .................. ...........

If it becomes like this pattern, I will be able to convert it into excel file.
This is not elegant, but see if it works for you:
Expand|Select|Wrap|Line Numbers
  1. import os
  2.  
  3. def geneData(fn, fOut):
  4.     f = open(fn, 'r')
  5.     line_no = 1
  6.     lineLst = []
  7.     itemLst = []
  8.     labelLst = ['No.', 'gene', 'locus_tag', 'protein_id', 'GeneID']
  9.     protien_id = False
  10.     for line in f:
  11.         if 'gene' in line:
  12.             itemLst += [str(line_no), line.split()[1]]
  13.         elif 'locus_tag' in line:
  14.             itemLst.append(line.split('=')[1].strip('"\n'))
  15.         elif 'protein_id' in  line:
  16.             itemLst.append(line.split('=')[1].strip('"\n'))
  17.             protien_id = True
  18.         elif 'GeneID' in line and protien_id == True:
  19.             itemLst.append(line.split(':')[1].strip('"\n'))
  20.             lineLst.append(itemLst)
  21.             itemLst = []
  22.             line_no += 1
  23.             protien_id = False
  24.     f.close()
  25.     f = open(fOut, 'w')
  26.     f.write(','.join(labelLst)+'\n')
  27.     for line in lineLst:
  28.         f.write(','.join(line)+'\n')
  29.     f.close()           
  30.     return labelLst+lineLst
  31.  
  32.  
  33. if __name__ == '__main__':
  34.     geneData('your_in_file', 'your_out_file')
Feb 21 '07 #7

P: 4
Thank you very much. I will try and learn it from your script.


This is not elegant, but see if it works for you:
Expand|Select|Wrap|Line Numbers
  1. import os
  2.  
  3. def geneData(fn, fOut):
  4.     f = open(fn, 'r')
  5.     line_no = 1
  6.     lineLst = []
  7.     itemLst = []
  8.     labelLst = ['No.', 'gene', 'locus_tag', 'protein_id', 'GeneID']
  9.     protien_id = False
  10.     for line in f:
  11.         if 'gene' in line:
  12.             itemLst += [str(line_no), line.split()[1]]
  13.         elif 'locus_tag' in line:
  14.             itemLst.append(line.split('=')[1].strip('"\n'))
  15.         elif 'protein_id' in  line:
  16.             itemLst.append(line.split('=')[1].strip('"\n'))
  17.             protien_id = True
  18.         elif 'GeneID' in line and protien_id == True:
  19.             itemLst.append(line.split(':')[1].strip('"\n'))
  20.             lineLst.append(itemLst)
  21.             itemLst = []
  22.             line_no += 1
  23.             protien_id = False
  24.     f.close()
  25.     f = open(fOut, 'w')
  26.     f.write(','.join(labelLst)+'\n')
  27.     for line in lineLst:
  28.         f.write(','.join(line)+'\n')
  29.     f.close()           
  30.     return labelLst+lineLst
  31.  
  32.  
  33. if __name__ == '__main__':
  34.     geneData('your_in_file', 'your_out_file')
Feb 21 '07 #8

Post your reply

Sign in to post your reply or Sign up for a free account.