Bytes IT Community

How do I split a text file without stripping the character I'm splitting at?

P: 3
Hi everyone, I'm new to Python and would like to split a FASTA (text) file into its separate genes (each one starts with a ">"), randomly sample a certain number of the sequences, and print the result. My program almost works, but text.split('>') removes every ">" from the text. Is there some way to either keep the ">" when splitting or add it back afterwards? That would be amazing. Here's my program so far. I've added the ">" character to the print line, but that only puts it at the beginning of the result; I want it at the beginning of every split.

My Program:
    import random
    fileobj = open("MyFile")
    ignore  = fileobj.read(1)
    text    = fileobj.read()
    records = text.split('>')
    NewLines = random.sample(records, 3)
    print ">" + '\n'.join(NewLines)
Result:

>FLP3FBN01A85QC length=268 xy=0397_0946 region=1 run=R_2008_12_09_13_51_01_
ACAGACCACTCACATGCTGCCTCCCGTAGGAGTTTGGGCCGTGTCTCAGT CCCAATGTGG
CCGTTCACCCTCTCAGGCCGGCTACTGATCGTCGCCTTGGTAGGCCGTTA CCCTACCAAC
AAGCTAATCAGACGCGGAGCCATCTTACACCACCTCAGTTTTTCACACCG GACCATGCGG
TCCTGTGCGCTTATGCGGTATTAGCACCTATTTCTAAGTGTTATCCCCCT GTGTAAGGCA
GGTCCTCCACGCGTTACTCACCCGTCCG

FLP3FBN01DH3NR length=257 xy=1319_0885 region=1 run=R_2008_12_09_13_51_01_
ACAGACCACTCACATGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGT CCCAATGTGG
CCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTGAGCCGTTA CCTCACCAAC
TACCTAATCAGACGCGGGTCCATCTTGCACCACCGGAGTTTTTCACACTG TCCCATGCAG
GACCGTGCGCTTATGCGGTATTGCACCTATTTCTAAGTGTTATCCCCCAG TGCAAGGCAG
GTTACCCACGCGTTACT

FLP3FBN01D0219 length=268 xy=1535_1839 region=1 run=R_2008_12_09_13_51_01_
ACAGACCACTCACATGCTGCCTCCCGTAGGAGTTTGGGCCGTGTCTCAGT CCCAATGTGG
CCGTCCACCCTCTCAGGCCGGCTACTGATCGTCGCCTTGGTGGGCCTTTA CCCCGCCAAC
CAGCTAATCAGACGCGGGTCCATCTTGCACCACCGGAGTTTTTCACACTG TCCCATGCAG
GACCGTGCGCTTATGCGGTATTAGCACCTATTTCTAAGTGTTATCCCCCA GTGCAAGGCA
GGTTACCCACGCGTTACTCACCCGTCCG

So, basically all I need is the ">" at the beginning of each of the three paragraphs.

Thanks so much in advance!
Nov 5 '10 #1

4 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
Use a list comprehension to add the ">" character to each record. Example:
    >>> records = ["123", "456", "789"]
    >>> new_records = [">%s" % (s) for s in records]
    >>> new_records
    ['>123', '>456', '>789']
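For context, str.split() always discards the separator it splits on, which is why the ">" goes missing. Here is a minimal Python 3 sketch of that behaviour and of putting the ">" back. The sample text and the names seq1, seq2, seq3 are made up for illustration. It also assumes ">" only ever appears at the start of a record.

```python
# Toy FASTA-like text; the record names are invented for this example.
text = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"

# str.split() drops the separator. Because the text starts with ">",
# the first piece is an empty string.
pieces = text.split(">")

# Put the ">" back on every non-empty piece.
records = [">" + piece for piece in pieces if piece]

print(records)
```

Filtering out the empty piece has the same effect as reading and discarding the first character before splitting.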
Nov 5 '10 #2

P: 3
Would I have to convert my list to a string for this? Also, for records = ["123", "456", "789"], how would I get Python to fill in the "123", "456", and "789" from your example automatically? The text file I'm using is hundreds of thousands of characters long, so I can't possibly do this manually.
Nov 5 '10 #3

bvdet
Expert Mod 2.5K+
P: 2,851
In your case it would be:
    NewLines = [">%s" % (s) for s in random.sample(records, 3)]
OR:
    records = [">%s" % (s) for s in text.split('>')]
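Putting the whole thing together, a rough Python 3 sketch might look like the following. The function name sample_fasta_records, its rng parameter, and the demo text are hypothetical and only there for illustration. It assumes ">" never occurs inside a header or sequence line, and it strips surrounding whitespace from each record.

```python
import random

def sample_fasta_records(text, k, rng=random):
    """Return k randomly chosen records from FASTA text, each starting with '>'."""
    # Split on ">" and drop empty pieces (such as the one before the first ">").
    bodies = [piece.strip() for piece in text.split(">") if piece.strip()]
    # Re-attach the ">" that split() removed.
    records = [">" + body for body in bodies]
    # random.sample raises ValueError if k is larger than the number of records.
    return rng.sample(records, k)

# Made-up demo data standing in for the contents of the real file.
demo = ">a desc\nACGT\nTTGA\n>b desc\nGGCC\n>c desc\nTTAA\n>d desc\nCCGG\n"

# Print three random records, separated by blank lines.
print("\n\n".join(sample_fasta_records(demo, 3)))
```

With a real file you would read the text first, for example with open("MyFile") as f: text = f.read(), and pass text in place of demo.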
Nov 5 '10 #4

P: 3
Thanks! That works perfectly!
Nov 5 '10 #5
