I may be wrong, but from what I've seen, other folks with genetic sequence data tend to have very large files with huge numbers of lines to process. If that's the case for you, then you may well wish to avoid the previous code's suggestion of pulling the entire file into an array, as it could consume a huge amount of memory.
Also, the act of putting all this into a hash will ruin any chance of preserving the original order of the sequences unless you preserve it in an additional array... and if you need the additional array for that, then you might as well keep the array in the first place. On the other hand, it may well be that the original order was in fact the order produced by sort, or that you would prefer to sort them rather than preserve the original order. Still, I hate to make assumptions if I don't have to.
So with that in mind, your original approach of reading in a single line at a time in your loop seems reasonable enough to start. However, we could count the Gs, As, Ts, and Cs along the way and avoid having to store the whole sequence in an array at all. This should avoid holding a huge array in memory and generally be more efficient. If it were necessary due to inordinately long lines, you could even take the approach of pulling in a single character at a time.
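To make that concrete, here's a minimal sketch of the streaming idea: count each line's bases as it is read, and never keep the sequences around. The in-memory filehandle and the sample data are just stand-ins for your real file.

```perl
use strict;
use warnings;

# Stand-in for the real sequence file: an in-memory filehandle.
open my $fh, '<', \"GATTACA\nCCGG\n" or die "open: $!";

my %count;
while (my $line = <$fh>) {
    chomp $line;
    foreach my $base ('G', 'A', 'T', 'C') {
        # $line is our own copy, so destroying it with s/// is harmless;
        # s///gi returns the number of substitutions, i.e. occurrences.
        $count{$base} += ($line =~ s/$base//gi) || 0;
    }
}
print "$_: $count{$_}\n" for ('G', 'A', 'T', 'C');
# prints G: 3, A: 3, T: 2, C: 3 for the sample data above
```

The running totals in %count are all that stays in memory, no matter how long the file is.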
Unfortunately, you seem to be printing out both @names and @seq just before printing the counts. It is my hope that this is for debugging purposes, but if it is not, then you're doomed to store it all in these arrays anyhow. If they're not needed, you'll be able to remove them from the code below as commented.
In the searches that you are using to do the counts, I've removed the capture and the replacement text, since they are unneeded. Once you've stored the line in the array, you can be as destructive with it as you like.
I've further taken the liberty of storing your counts in an array of hashes rather than 4 scalars. This makes things look a bit more complicated, but it lets you loop over the 4 bases rather than having multiple prints or searches that look almost exactly the same. Besides, by moving the counting up into the first loop, we'd have needed to store them in four arrays at the very least.
The downside to one of these changes though is that I'm using $base inside the regular expression used for counting. This causes the regular expression to be recompiled every time we pass through it so that $base can be interpolated. That could be solved by reverting to individual regexes for each of the four bases, or it could be solved cleverly by using pre-compiled regular expressions defined before the start of the main loop. I'm not feeling clever enough at the moment to attempt it. If things run too slowly, try it with four individual regexes first, and only go to the trouble of pre-compiled regexes if it gets you a significant savings.
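For what it's worth, the pre-compiled version isn't actually much code. Something like this sketch would do it, building one qr// pattern per base once, before any looping (the sample string here is made up):

```perl
use strict;
use warnings;

my @bases = ('G', 'A', 'T', 'C');

# Compile one pattern per base, once, up front; qr//i bakes the
# case-insensitivity into the pattern itself.
my %re = map { $_ => qr/$_/i } @bases;

my $line = "GgAtTc";    # made-up sample sequence
my %count;
foreach my $base (@bases) {
    $count{$base} = ($line =~ s/$re{$base}//g) || 0;
}
print "$_:$count{$_} " for @bases;
print "\n";
# prints G:2 A:1 T:2 C:1
```

As an aside, if all you need is a count, tr/// will get you one without any regex machinery and without modifying the string at all: my $gs = ($line =~ tr/Gg//);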
I've also marked some points where you could conceivably place an outer loop should you intend to process files that contain more than just one pass through this format.
use strict;
use warnings;

my @bases = ('G', 'A', 'T', 'C');

open FILE, '<', 'YBL091C.data' or die "Cannot open YBL091C.data: $!";

# Start of potential looping point
my ($NoSeq, $size) = split(/ /, <FILE>);
print "Starter NumSeq:$NoSeq Length:$size";

my @names  = ();
my @seq    = ();    # Remove if the print of @seq below is not needed.
my @counts = ();

for (my $index = 0; $index < $NoSeq; $index++) {
    my $line = <FILE>;
    push @names, $line;

    $line = <FILE>;
    push @seq, $line;    # Remove if the print of @seq below is not needed.

    foreach my $base (@bases) {
        # s/// returns the number of substitutions made, or the empty
        # string if there were none -- hence the || 0.
        $counts[$index]->{$base} = ($line =~ s/$base//gi) || 0;
    }
}

print "Names @names\n";    # Remove if not needed.
print "Seq @seq\n";        # Remove if not needed.

# If the above two print calls are not needed, then this loop could be made
# part of the first loop.
for (my $index = 0; $index < $NoSeq; $index++) {
    print "This yeast is $names[$index]";

    foreach my $base (@bases) {
        print "There are $counts[$index]->{$base} ${base}'s.\n";
    }

    print "\n";
}
# End of potential looping point

close FILE;