I currently have a list of genes in a file. Each line gives a gene, its chromosome, and its coordinate information. Such an entry appears as:
NM_198212 chr7 + 115926679 115935830 115927071 11593344 2 115926679,'115933260', 115927221,'115935830',
The gene's sequence on the chromosome starts at base 115926679 and continues up to (but not including) base 115935830.
If we want the spliced sequence, we use the exons. The first extends from 115926679 to 115927221, and the second goes from '115933260' to '115935830'.
However, I have run into a problem with entries on the complementary strand, such as:
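As a rough sketch of the splicing step, assuming the coordinates are zero-based and half-open (start included, end excluded) and that the chromosome is already loaded as a string: the function name `splice_exons` and the `offset` parameter are made up for illustration and are not part of the file format.

```python
def splice_exons(chrom_seq, exon_starts, exon_ends, offset=0):
    """Join the exon slices of chrom_seq into one spliced sequence.

    Assumes half-open coordinates [start, end). `offset` is the chromosome
    coordinate of chrom_seq[0], which is useful when only part of the
    chromosome has been loaded.
    """
    pieces = []
    for start, end in zip(exon_starts, exon_ends):
        pieces.append(chrom_seq[start - offset:end - offset])
    return "".join(pieces)


# Toy example: two exons, AAA and GGG, separated by an intron CCC.
toy = "AAACCCGGGTTT"
print(splice_exons(toy, [0, 6], [3, 9]))  # AAAGGG
```

For the real entry the call would take the two exon starts and the two exon ends from the line, with the chromosome 7 sequence as `chrom_seq`.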
NM_001005286 chr1 - 245941755 245942680 245941755 245942680 1 245941755, '245942680'
Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement of the sense strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense strand, I need to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence, and then reverse-complement it.
That said, I have only been programming for about half a year and am not sure how to start going about this.
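A minimal sketch of that translation, under the assumption stated above that '-' coordinates are counted from the opposite end of the chromosome, so a half-open interval [start, end) on the anti-sense strand maps to [length - end, length - start) on the sense strand. The helper names `flip_to_sense` and `fetch` are hypothetical. If the file's coordinates turn out to be given on the sense strand already, the flip would be skipped and only the reverse complement applied.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_to_sense(start, end, chrom_length):
    """Map a half-open anti-sense interval onto the sense strand.

    Assumption: anti-sense coordinates count from the far end of the
    chromosome, so position 0 there is the last base of the sense strand.
    """
    return chrom_length - end, chrom_length - start


def fetch(sense_seq, strand, start, end):
    """Return the sequence for [start, end) on the given strand."""
    if strand == "+":
        return sense_seq[start:end]
    s, e = flip_to_sense(start, end, len(sense_seq))
    return reverse_complement(sense_seq[s:e])


sense = "AAACCCGT"
print(fetch(sense, "+", 0, 3))  # AAA
print(fetch(sense, "-", 0, 3))  # ACG
```

In the toy example the anti-sense strand read from its own start is `ACGGGTTT`, so its first three bases are `ACG`, which is what the '-' lookup returns.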
I have written a regular expression:
'(NM_\d+)\s+(chr\d+)\s+([+-])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+),\s*(\d+),\s+(\d+),\s*(\d+),'
(I made some numbers bold, some italic, and put some in quotes only to show the different parts I was trying to use.)
However, I am now unsure how to start this function. If anyone can help me get started on this, or help me see how to do it, I would very much appreciate it.
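One possible starting point, sketched under some assumptions: because the number of exons varies from line to line, splitting on whitespace and commas may be easier than a fixed regular expression. The field names (`tx_start`, `cds_start`, and so on) are guesses at what the columns mean, and the quote marks are stripped because they were only added here for highlighting.

```python
from collections import namedtuple

# Field names are illustrative guesses at the column meanings.
Gene = namedtuple(
    "Gene",
    "name chrom strand tx_start tx_end cds_start cds_end "
    "exon_count exon_starts exon_ends",
)


def parse_gene_line(line):
    """Parse one gene line into a Gene record."""
    # Drop highlighting quotes and treat commas as separators.
    tokens = line.replace("'", "").replace(",", " ").split()
    name, chrom, strand = tokens[:3]
    tx_start, tx_end, cds_start, cds_end, exon_count = map(int, tokens[3:8])
    coords = [int(t) for t in tokens[8:]]
    if len(coords) != 2 * exon_count:
        raise ValueError("expected %d exon coordinates" % (2 * exon_count))
    # The exon starts are listed first, followed by the exon ends.
    return Gene(name, chrom, strand, tx_start, tx_end, cds_start, cds_end,
                exon_count, coords[:exon_count], coords[exon_count:])


line = ("NM_198212 chr7 + 115926679 115935830 115927071 11593344 2 "
        "115926679,'115933260', 115927221,'115935830',")
gene = parse_gene_line(line)
print(gene.strand, gene.exon_starts, gene.exon_ends)
```

The `strand` field can then decide whether the exon coordinates are used directly or translated and reverse-complemented first.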