By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
455,848 Members | 1,290 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 455,848 IT Pros & Developers. It's quick & easy.

Loop for application I wrote crashes

P: 4
Hi i need to find the GC content in a number of files i have a code which works but would like to shorten it, at the moment i have the same piece of script repeated 8 times but would like to add a loop instead, however every time i try it crashes the program.


Expand|Select|Wrap|Line Numbers
  1. @Name;
  2. @Seq;
  3.  
  4. open FILE,"YBL091C.data";
  5.  
  6. ($NoSeq,$size) = split(/ /,<FILE>);
  7.  
  8. print "Starter NumSeq:$NoSeq Length:$size";
  9.  
  10. foreach (1..$NoSeq) {
  11.         $line = <FILE>;
  12.                 push(@Name,$line);
  13.                         $line = <FILE>;
  14.                                 push(@Seq,$line);
  15.                                 }
  16.  
  17.                                 print "Names @Name\n";
  18. #                               print "Seq @Seq\n";
  19.  
  20. # I would like to loop the following segment of my program another 7 times but #every time i try to add a loop it crashes
  21.  
  22. $str = $Seq[0];
  23. print "this yeast is @Name[0]";
  24. $countG = $str =~ s/(g)/$1/gi;
  25. print "There are $countG G's. \n ";
  26. $countA = $str =~ s/(a)/$1/gi;
  27. print "There are $countA A's. \n";
  28. $countT = $str =~ s/(t)/$1/gi;
  29. print "There are $countT T's. \n";
  30. $countC = $str =~ s/(c)/$1/gi;
  31. print "There are $countC C's. \n";
Feb 26 '08 #1
Share this Question
Share on Google+
7 Replies


nithinpes
Expert 100+
P: 410
This is my understanding of your problem from the code provided:
- first line of the file contains number of seq and size.
- the succeeding lines contain yeast name and the sequence alternatively.
- You need to count A, T, G and C in all the sequences.

For this, I would prefer a hash with yeast name as key and the sequence as value, though it can be achieved using arrays as you have done.

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3.  
  4. my @data;
  5. my %yeasts;
  6.  
  7. open FILE,"YBL091C.data" or die "sorry:$!";
  8. my @file=<FILE>; 
  9. my ($NoSeq,$size) = split(/ /,shift @file);  # remove first element and split
  10. push @data,$file[$_] foreach(0..($NoSeq*2-1));  # take only $NoSeq sequences
  11. %yeasts = @data;   ##convert remaining array to hash with name as key
  12.  
  13. print "Name: $_\n" foreach(keys %yeasts);
  14.  
  15. foreach (sort keys %yeasts) {
  16. my $str = $yeasts{$_};
  17. print "this yeast is $_";
  18. my $countG = $str =~ s/(g)/$1/gi;
  19. print "There are $countG G's. \n ";
  20. my $countA = $str =~ s/(a)/$1/gi;
  21. print "There are $countA A's. \n";
  22. my $countT = $str =~ s/(t)/$1/gi;
  23. print "There are $countT T's. \n";
  24. my $countC = $str =~ s/(c)/$1/gi;
  25. print "There are $countC C's. \n";
  26. }
  27.  
  28.  
Feb 27 '08 #2

P: 12
I may be wrong, but from what I've seen from other folks with genetic sequence data they tend to have very large files with huge numbers of lines that they are processing. If that is the case for you, then you may well wish to avoid the previous code's suggestion of pulling the entire file into an array, as it will potentially consume a huge amount of memory.

Also, the act of putting all this into a hash will ruin any chance of preserving the original order of the sequences unless you preserve it in an additional array... and if you need the additional array for that, then we might as well keep the array in the first place. On the other hand, it may well be that the original order was in fact the order produced by sort, or that you would prefer to sort them rather than preserving the original order. Still, I hate to make assumptions if I don't have to.

So with that in mind, your original approach of reading in a single line at a time in your loop seems reasonable enough to start. However, we could count the Gs, As, Ts, and Cs along the way and avoid having to store the whole sequence in an array at all. This should avoid holding a huge array in memory and generally be more efficient. If it were necessary due to inordinately long lines, you could even take the approach of pulling in a single character at a time.

Unfortunately, you seem to be printing out both @Name and @Seq just before printing the counts. It is my hope that this is for debugging purposes, but if it is not, then you're doomed to store it all in these arrays anyhow. If they're not needed, you'll be able to remove them from the code bellow as commented.

In the searches that you are using to do the counts, I've removed the capture and replace, since they are unneeded. Once you've stored the line in the array, you can be as destructive with it as you like.

I've further taken the liberty of storing your counts in an array of hashes rather than 4 scalars. This makes things look a bit more complicated, but it lets you loop over the 4 bases rather than having multiple prints or searches that look almost exactly the same. Besides, by moving the counting up into the first loop, we'd have needed to store them in four arrays at the very least.

The downside to one of these changes though is that I'm using $base inside the regular expression used fro counting. This causes the regular expression to be recompiled every time we pass through it so that $base can be interpolated. That could be solved be reverting to individual regexes for each of the four bases, or it could be solved cleverly by using pre-compiled regular expressions defined before the start of the main loop. I'm not feeling clever enough at the moment to attempt it. If things run too slowly, try it with four individual regexes first, and only go to the trouble to do pre-compiled regexes if it gets you a significant savings.

I've also marked some points where you could conceivably place an outer loop should you intend to process files that contain more than just one pass through this format.

Expand|Select|Wrap|Line Numbers
  1. my @bases = ('G', 'A', 'T', 'C');
  2.  
  3. open FILE,"YBL091C.data";
  4.  
  5. # Start of potential looping point
  6. my ($NoSeq,$size) = split(/ /,<FILE>);
  7. print "Starter NumSeq:$NoSeq Length:$size";
  8. my @names = ();
  9. my @counts = ();
  10. for (my $index = 0; $index < $NoSeq; $index++) {
  11.         my $line = <FILE>;
  12.         push @names,$line;
  13.         $line = <FILE>;
  14.         push @seq,$line; # Remove if the print of @seq below are not needed.
  15.         foreach my $base (@bases) {
  16.                 $counts[$index]->{$base} = ($line =~ s/$base//gi);
  17.         }
  18. }
  19. print "Names @names\n"; # Remove if not needed.
  20. print "Seq @seq\n"; # Remove if not needed.
  21. # If the above two print calls are not needed, then this loop could be made a
  22. # part of the first loop.
  23. for (my $index = 0; $index < $NoSeq; $index++) {
  24.         print "This yeast is @names[$index]";
  25.         foreach my $base (@bases) {
  26.                 print "There are $counts[$index]->{$base} ${base}'s.\n";
  27.         }
  28.         print "\n";
  29. }
  30. # End of potential looping point
  31.  
Feb 27 '08 #3

KevinADC
Expert 2.5K+
P: 4,059
I would use tr/// to count the occurances. It should be considerably more effcient than s///ig. I hope the OP comes back and reads this thread because some good effort went into posting replies to it.
Feb 28 '08 #4

P: 12
I would use tr/// to count the occurances. It should be considerably more effcient than s///ig. I hope the OP comes back and reads this thread because some good effort went into posting replies to it.
tr/// works well with single letter search counting like this, but does not interpolate $base. So again, you'd have to split back out each count. It would however, be more efficient in the long run.

Expand|Select|Wrap|Line Numbers
  1. foreach my $base (@bases) {
  2.      $counts[$index]->{$base} = ($line =~ s/$base//gi);
  3. }
  4.  
becomes

Expand|Select|Wrap|Line Numbers
  1. $counts[$index]->{G} = ($line =~ tr/G//gi);
  2. $counts[$index]->{A} = ($line =~ tr/A//gi);
  3. $counts[$index]->{T} = ($line =~ tr/T//gi);
  4. $counts[$index]->{C} = ($line =~ tr/C//gi);
  5.  
For multi-character matching, you'd need to use s/// though (if you wanted to count all the occurrences of 'CA' in the sequence for instance). Though I suppose that regex becomes more complicated anyhow since you'd probably be processing the string as two-character base pairs rather than simply a stream of characters.
Mar 3 '08 #5

KevinADC
Expert 2.5K+
P: 4,059
just for clarification sake, tr/// does not recognize the g and i options. And yes, it would not work for counting pairs, just the single characters.
Mar 3 '08 #6

P: 12
just for clarification sake, tr/// does not recognize the g and i options. And yes, it would not work for counting pairs, just the single characters.
Correct... I was overzealous with cut and past and too sparing with editing thereafter.
Mar 4 '08 #7

KevinADC
Expert 2.5K+
P: 4,059
Correct... I was overzealous with cut and past and too sparing with editing thereafter.

I figured as much, we all do that. I know I appreciate if another member takes the time to correct any little errors I make so I assumed you would too. But if not just let me know and I won't do that in the future if that is what you prefer.
Mar 5 '08 #8

Post your reply

Sign in to post your reply or Sign up for a free account.