By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
464,767 Members | 1,033 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 464,767 IT Pros & Developers. It's quick & easy.

Parsing a file, and matching numbers of a sequence (python)

P: 3
I currently have a list of genes in a file. Each line has a chromosome with it's information. Such an entry appears as:

NM_198212 chr7 + **115926679** **115935830** 115927071 11593344 2 *115926679*,'115933260', *115927221*,'115935830',

The sequence for the chromosome starts at base **115926679** and continues up to(but not including) base **115935830**

If we want the spliced sequence, we use the exons.The first extends from
*115926679* to *155927221*, and the second goes from '115933260' to '115935830'

However, I have run across a problem when on a complementary sequence such as:

NM_001005286 chr1 - **245941755** ***245942680*** 245941755 245942680 1 *245941755*, '245942680'

Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.

That said, I have only been programming for about half a year and and not sure how to starts going about doing this.

I have written a regular expression:

'(NM_\d+)\s+(chr\d+)([(\+)|(-)])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+), (\d+),s+(\d+),(\d+),'

but am now unsure as to how to start this function...
If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.

OK: suppose this is chromsome 25:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG

(there are 10 of each character).

Now: if I am looking for an unspliced gene on:
<blah> chr25 + 10 20 <blah>

Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:

CCCCCCCCCC

This is easy. It matches python string slicing really well.


Its more confusing if I give you:

<blah> chr25 - 10 20 <blah>

What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a souble-strand:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG
TTTTTTTTTTGGGGGGGGGGAAAAAAAAAACCCCCCCCCC

We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.

The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand but given how large it is, I advise against that.)

So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up every time.) So base 10 is against base 29, and base 19 is against base 20.

So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).


I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing

I have tried:


fields = line.split(' \t')
geneID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
start,end = end,start



which is on the right track, yet but its not enough. This would take the 10 and 20, and turn it into a 20 and 10.

And I know I can reverse complement the string by doing this:

r = s[::-1]
bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
l = list(r)
o = [bc[base] for base in l]
return ''.join(o)


But, from here I am lost.
Apr 22 '12 #1
Share this Question
Share on Google+
2 Replies

Expert 100+
P: 626
We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right.
You can start from either the left or the right, assuming you already know how to extract the "+" or "-" and the begin and end postitions-->if not post back. Note that the smallest number is always first in the slice-->[:]
Expand|Select|Wrap|Line Numbers
  1. forward="AAAAAAAAAAXXXXXXXXXXTTTTTTTTTTGGGGGGGGGG"
  2. print forward[10:20]  ## prints "X"'s
  3.  
  4. backward="AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGGTTTTTTTTTTGGGGGGGGGGYYYYYYYYYYCCCCCCCCCC"
  5. print backward[-20:-10]  ## prints "Y"'s 
You can also split the string if you know that each sequence is always 10, otherwise you have to test for some break indicator.
Expand|Select|Wrap|Line Numbers
  1. backward="AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGGTTTTTTTTTTGGGGGGGGGGAAAAAAAAAACCCCCCCCCC"
  2.  
  3. list_of_seq=[backward[start:start+10] for start in range(0, len(backward), 10)]
  4. print list_of_seq
  5.  
  6. for start in [10, -20, 30, -30, 70]:
  7.     print list_of_seq[start/10] 
Apr 22 '12 #2

Expert 100+
P: 626
fields = line.split(' \t')

This may give you and extra field because of the newline=\n, or retain the newline in the last field which could also create problems. Instead strip the whitespace first:
fields = line.strip().split()

Note that split() receives the return of strip() and the default will split on whitespace (any spaces, tabs, or newlines combination).
Apr 22 '12 #3

Post your reply

Sign in to post your reply or Sign up for a free account.