473,385 Members | 1,764 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

Loop for application I wrote crashes

4
Hi i need to find the GC content in a number of files i have a code which works but would like to shorten it, at the moment i have the same piece of script repeated 8 times but would like to add a loop instead, however every time i try it crashes the program.


Expand|Select|Wrap|Line Numbers
  1. @Name;
  2. @Seq;
  3.  
  4. open FILE,"YBL091C.data";
  5.  
  6. ($NoSeq,$size) = split(/ /,<FILE>);
  7.  
  8. print "Starter NumSeq:$NoSeq Length:$size";
  9.  
  10. foreach (1..$NoSeq) {
  11.         $line = <FILE>;
  12.                 push(@Name,$line);
  13.                         $line = <FILE>;
  14.                                 push(@Seq,$line);
  15.                                 }
  16.  
  17.                                 print "Names @Name\n";
  18. #                               print "Seq @Seq\n";
  19.  
  20. # I would like to loop the following segment of my program another 7 times but #every time i try to add a loop it crashes
  21.  
  22. $str = $Seq[0];
  23. print "this yeast is @Name[0]";
  24. $countG = $str =~ s/(g)/$1/gi;
  25. print "There are $countG G's. \n ";
  26. $countA = $str =~ s/(a)/$1/gi;
  27. print "There are $countA A's. \n";
  28. $countT = $str =~ s/(t)/$1/gi;
  29. print "There are $countT T's. \n";
  30. $countC = $str =~ s/(c)/$1/gi;
  31. print "There are $countC C's. \n";
Feb 26 '08 #1
7 1425
nithinpes
410 Expert 256MB
This is my understanding of your problem from the code provided:
- first line of the file contains number of seq and size.
- the succeeding lines contain yeast name and the sequence alternatively.
- You need to count A, T, G and C in all the sequences.

For this, I would prefer a hash with yeast name as key and the sequence as value, though it can be achieved using arrays as you have done.

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3.  
  4. my @data;
  5. my %yeasts;
  6.  
  7. open FILE,"YBL091C.data" or die "sorry:$!";
  8. my @file=<FILE>; 
  9. my ($NoSeq,$size) = split(/ /,shift @file);  # remove first element and split
  10. push @data,$file[$_] foreach(0..($NoSeq*2-1));  # take only $NoSeq sequences
  11. %yeasts = @data;   ##convert remaining array to hash with name as key
  12.  
  13. print "Name: $_\n" foreach(keys %yeasts);
  14.  
  15. foreach (sort keys %yeasts) {
  16. my $str = $yeasts{$_};
  17. print "this yeast is $_";
  18. my $countG = $str =~ s/(g)/$1/gi;
  19. print "There are $countG G's. \n ";
  20. my $countA = $str =~ s/(a)/$1/gi;
  21. print "There are $countA A's. \n";
  22. my $countT = $str =~ s/(t)/$1/gi;
  23. print "There are $countT T's. \n";
  24. my $countC = $str =~ s/(c)/$1/gi;
  25. print "There are $countC C's. \n";
  26. }
  27.  
  28.  
Feb 27 '08 #2
I may be wrong, but from what I've seen from other folks with genetic sequence data they tend to have very large files with huge numbers of lines that they are processing. If that is the case for you, then you may well wish to avoid the previous code's suggestion of pulling the entire file into an array, as it will potentially consume a huge amount of memory.

Also, the act of putting all this into a hash will ruin any chance of preserving the original order of the sequences unless you preserve it in an additional array... and if you need the additional array for that, then we might as well keep the array in the first place. On the other hand, it may well be that the original order was in fact the order produced by sort, or that you would prefer to sort them rather than preserving the original order. Still, I hate to make assumptions if I don't have to.

So with that in mind, your original approach of reading in a single line at a time in your loop seems reasonable enough to start. However, we could count the Gs, As, Ts, and Cs along the way and avoid having to store the whole sequence in an array at all. This should avoid holding a huge array in memory and generally be more efficient. If it were necessary due to inordinately long lines, you could even take the approach of pulling in a single character at a time.

Unfortunately, you seem to be printing out both @Name and @Seq just before printing the counts. It is my hope that this is for debugging purposes, but if it is not, then you're doomed to store it all in these arrays anyhow. If they're not needed, you'll be able to remove them from the code bellow as commented.

In the searches that you are using to do the counts, I've removed the capture and replace, since they are unneeded. Once you've stored the line in the array, you can be as destructive with it as you like.

I've further taken the liberty of storing your counts in an array of hashes rather than 4 scalars. This makes things look a bit more complicated, but it lets you loop over the 4 bases rather than having multiple prints or searches that look almost exactly the same. Besides, by moving the counting up into the first loop, we'd have needed to store them in four arrays at the very least.

The downside to one of these changes though is that I'm using $base inside the regular expression used fro counting. This causes the regular expression to be recompiled every time we pass through it so that $base can be interpolated. That could be solved be reverting to individual regexes for each of the four bases, or it could be solved cleverly by using pre-compiled regular expressions defined before the start of the main loop. I'm not feeling clever enough at the moment to attempt it. If things run too slowly, try it with four individual regexes first, and only go to the trouble to do pre-compiled regexes if it gets you a significant savings.

I've also marked some points where you could conceivably place an outer loop should you intend to process files that contain more than just one pass through this format.

Expand|Select|Wrap|Line Numbers
  1. my @bases = ('G', 'A', 'T', 'C');
  2.  
  3. open FILE,"YBL091C.data";
  4.  
  5. # Start of potential looping point
  6. my ($NoSeq,$size) = split(/ /,<FILE>);
  7. print "Starter NumSeq:$NoSeq Length:$size";
  8. my @names = ();
  9. my @counts = ();
  10. for (my $index = 0; $index < $NoSeq; $index++) {
  11.         my $line = <FILE>;
  12.         push @names,$line;
  13.         $line = <FILE>;
  14.         push @seq,$line; # Remove if the print of @seq below are not needed.
  15.         foreach my $base (@bases) {
  16.                 $counts[$index]->{$base} = ($line =~ s/$base//gi);
  17.         }
  18. }
  19. print "Names @names\n"; # Remove if not needed.
  20. print "Seq @seq\n"; # Remove if not needed.
  21. # If the above two print calls are not needed, then this loop could be made a
  22. # part of the first loop.
  23. for (my $index = 0; $index < $NoSeq; $index++) {
  24.         print "This yeast is @names[$index]";
  25.         foreach my $base (@bases) {
  26.                 print "There are $counts[$index]->{$base} ${base}'s.\n";
  27.         }
  28.         print "\n";
  29. }
  30. # End of potential looping point
  31.  
Feb 27 '08 #3
KevinADC
4,059 Expert 2GB
I would use tr/// to count the occurances. It should be considerably more effcient than s///ig. I hope the OP comes back and reads this thread because some good effort went into posting replies to it.
Feb 28 '08 #4
I would use tr/// to count the occurances. It should be considerably more effcient than s///ig. I hope the OP comes back and reads this thread because some good effort went into posting replies to it.
tr/// works well with single letter search counting like this, but does not interpolate $base. So again, you'd have to split back out each count. It would however, be more efficient in the long run.

Expand|Select|Wrap|Line Numbers
  1. foreach my $base (@bases) {
  2.      $counts[$index]->{$base} = ($line =~ s/$base//gi);
  3. }
  4.  
becomes

Expand|Select|Wrap|Line Numbers
  1. $counts[$index]->{G} = ($line =~ tr/G//gi);
  2. $counts[$index]->{A} = ($line =~ tr/A//gi);
  3. $counts[$index]->{T} = ($line =~ tr/T//gi);
  4. $counts[$index]->{C} = ($line =~ tr/C//gi);
  5.  
For multi-character matching, you'd need to use s/// though (if you wanted to count all the occurrences of 'CA' in the sequence for instance). Though I suppose that regex becomes more complicated anyhow since you'd probably be processing the string as two-character base pairs rather than simply a stream of characters.
Mar 3 '08 #5
KevinADC
4,059 Expert 2GB
just for clarification sake, tr/// does not recognize the g and i options. And yes, it would not work for counting pairs, just the single characters.
Mar 3 '08 #6
just for clarification sake, tr/// does not recognize the g and i options. And yes, it would not work for counting pairs, just the single characters.
Correct... I was overzealous with cut and past and too sparing with editing thereafter.
Mar 4 '08 #7
KevinADC
4,059 Expert 2GB
Correct... I was overzealous with cut and past and too sparing with editing thereafter.

I figured as much, we all do that. I know I appreciate if another member takes the time to correct any little errors I make so I assumed you would too. But if not just let me know and I won't do that in the future if that is what you prefer.
Mar 5 '08 #8

Sign in to post your reply or Sign up for a free account.

Similar topics

1
by: Bob Helber | last post by:
I've got a perl script with a system command, within a for loop, that calls a FORTRAN program. For one of the iterations of the for loop the FORTRAN program quits with an error due to bad input. ...
13
by: Tosch | last post by:
I have a Win2003 server where I installed my VB.NET application, built with VS.NET 2002. My main application crashes before the first line of my code is executed with this error: "Memory at...
5
by: VinnieT | last post by:
I have a load balanced system that consists of 3 production servers. There are about 20 different applications that are used on these boxes. One of my applications in particular is used more than...
8
by: mike2036 | last post by:
I have an application (that has unmanaged code) and when I launch it without 'FullTrust' permissions (LocalIntranet_Zone), it crashes. When I set 'FullTrust' permissions, it launches fine. Is...
0
by: Tom | last post by:
When I try to connect to a MS SQL Server 2000 my application always crashes (without message). In the event log I find the following error: Faulting application testapp.exe, version...
29
by: garyusenet | last post by:
I'm trying to investigate the maximum size of different variable types. I'm using INT as my starting variable for exploration. I know that the maximum number that the int variable can take is:...
4
by: yinglcs | last post by:
Hi, I have a c++ application which crashes in this line (from the debugger, I have a segmentation fault here): void *object = dynamic_cast<void>(aObject); I have stepped thru the code in...
2
by: rlntemp-gng | last post by:
re:Access 2003 TransferSpreadsheet has worked in my app for weeks now. Now, the day I was to put in production (today) it crashes the app, and users are livid...but not more than me...
1
by: teju | last post by:
I am using the OTL library for database interaction(otl.sourceforge.net) This is the sample code from my application: strm_ << parm.corp_id(); strm_ << parm.prod_id(); write_to_log(("Before...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.