Bytes IT Community

memory problem?

P: 1
Hello


In the input file, each entry is represented by 4 lines: line 2 is the read and line 1 is its ID.
I would like the output file to be as follows:
1. Remove lines 3 and 4.
2. To remove redundancy, reads (line 2) with an identical sequence are represented by a single entry.
3. Each read ID has the form name_n_xp, where 'p' is an integer giving the number of times that exact read was detected in the input file and 'n' is a running number that ensures all of the IDs are unique.
'name' should be specified on the command line.
4. Add '>' to the ID, so it looks like this: >name_n_xp
I have included an example input file and its output file below.
A friend wrote a script for me. It works on small files, but when I tried to process big files I got this error:
perl(87251) malloc: *** mmap(size=89133056) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!



Can you modify the script to make it memory efficient?




script:
[code]
#!/usr/bin/env perl

use POSIX qw(ceil floor);

if($#ARGV>0){
    $name=pop;
}else{
    $name="read";
}

$file=pop;

open(fr, "<$file");

print "Reading in file...\n";
$j=0;
$n=1;
@reads=();
foreach $line (<fr>){
    next if($line eq "\n");
    if($n%4==2){
        chomp($line);
        $reads[$j][0]=$line;
    }elsif($n%4==0){
        chomp($line);
        $reads[$j][1]=$line;
        $j++;
    }
    $n++;
}
close(fr);

$length=length($reads[0][0]);

print "Identifying unique reads...\n";
@reads=sort {$a->[0] cmp $b->[0]} @reads;

print "Creating new file ${file}_unique...\n";

$k=1;
$rep=0;
open(fw, ">${file}_unique");
for($i=0; $i<$j; $i++){
    $rep++;
    $seq=$reads[$i][0];
    @temp=split(//, $reads[$i][1]);
    for($z=0; $z<$length; $z++){
        $qual[$z]+=ord($temp[$z]);
    }

    if($seq ne $reads[$i+1][0]){
        $quality='';
        for($z=0; $z<$length; $z++){
            $quality.=chr(round($qual[$z]/$rep));
            $qual[$z]=0;
        }

        print fw ">", $name, $k, "x", "$rep\n";
        print fw "$seq\n";
#        print fw "+\n$quality\n";
        $k++;
        $rep=0;
    }
}
close(fw);

sub round{
    $c=shift;
    if($c-floor($c)>=0.5){
        return floor($c)+1;
    }else{
        return floor($c);
    }
}
[/code]
inputfile:
[code]
@7:100:62:869:Y
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCTGCTGATGGTGGTACANNNNNNNNNNN
+
BB?B>3B>>AABBAA@5=7?@?5<78955/989;23<<:?<<8<-42.2/4;;4468=776+2+0%%%%%%%%%%%
@7:100:62:593:N
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTGAGAACTCCCAGGATGNNNNNNNNNNN
+
BB?C?BCBA@94>??(>@B:<??@=?<@?@;;;>@?@?;A<B<@A<<;;?<@1<==:48;A=>:3%%%%%%%%%%%
@7:100:62:823:Y
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
+
?CCA@2A@BBBBBABABBA<?=/;9/:1??31::.**:;)3/:154.//4'33093/:);/20-.%%%%%%%%%%%
@7:100:62:716:N
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
+
BBABAB>B>98/<>:9=;=?(<@8=848684866861667=181847(48.81841,++5866(*%%%%%%%%%%%
@7:100:62:488:Y
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
=8;=A:>=6,<=?AA99>179=;,11;4'441,4490,7334&.1'911/1,*,4320,3'0'/,%%%%%%%%%%%
@7:100:62:162:Y
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
BABC>BB=A@A@>B@B===A68:456=57477878886;37==77=46<(662'27/**3=477.%%%%%%%%%%%
@7:100:62:1550:N
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
(;*(/(=2@7)3(?58@C=:(.5)('=,%33@.'6).::%8(.+*3;>?@6,(&?'7A?:':.&6%%%%%%%%%%%
@7:100:62:439:N
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
1@?B?@@;25@@>;6.>=93;>>4;.9378/>0,7119=56/93431/122822:=;>3>,333(%%%%%%%%%%%
@7:100:62:154:Y
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
+
B?AB<BB@A@<?6A>A@?5@<>8=55;84?8/=846@===6?5:=8?6856:884:88:688461%%%%%%%%%%%
@7:100:62:1819:Y
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
+
B=A3>A96<?7=>=>6=?;1==444144,441.11/41==,391&%24432351,'8'345451,%%%%%%%%%%%
[/code]

output file:
[code]
>name_1_x1
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCTGCTGATGGTGGTACANNNNNNNNNNN
>name_2_x1
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTGAGAACTCCCAGGATGNNNNNNNNNNN
>name_3_x2
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
>name_4_x4
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
>name_5_x2
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
[/code]
Oct 7 '09 #1
2 Replies


toolic
Expert
P: 70
If each set of 4 lines is unrelated to every other set of 4 lines, then
there is no need to read the whole large input file into memory at once.
This should help to avoid memory issues. You could read four lines in,
extract and re-format as needed, then output immediately, all within
the same while loop.
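
One caveat for this particular task: the records are not completely unrelated, because identical reads have to be counted across the whole file, so nothing can be written until the input has been read once. A minimal sketch of a line-by-line approach follows, assuming strict 4-line records. The sub name `collapse_reads` and its variables are hypothetical, not taken from the original script. It keeps one hash entry per distinct sequence instead of every read plus its quality string, so memory should grow with the number of unique reads, not with the size of the file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): reads 4-line
# records from $in and writes one ">NAME_n_xp" entry per distinct
# sequence to $out. Returns the number of unique reads written.
sub collapse_reads {
    my ($in, $out, $name) = @_;
    my (%count, @order);

    # while() fetches one line per iteration; foreach (<$in>) would build
    # a list of every line in the file before the loop starts.
    while (defined(my $id = <$in>)) {
        next if $id =~ /^\s*$/;      # tolerate blank lines between records
        my $seq = <$in>;             # line 2: the read
        last unless defined $seq;    # truncated final record
        my $plus = <$in>;            # line 3: read and discarded
        my $qual = <$in>;            # line 4: read and discarded
        $seq =~ s/\s+\z//;           # strip newline and trailing whitespace
        # Remember first-seen order so the running number follows the input.
        push @order, $seq unless $count{$seq}++;
    }

    my $n = 0;
    for my $seq (@order) {
        $n++;
        print {$out} ">${name}_${n}_x$count{$seq}\n$seq\n";
    }
    return $n;
}

# Assumed usage, mirroring the original: perl script.pl inputfile [name]
if (@ARGV) {
    my ($file, $name) = @ARGV;
    $name = 'read' unless defined $name;
    open my $in,  '<', $file            or die "Cannot read $file: $!";
    open my $out, '>', "${file}_unique" or die "Cannot write ${file}_unique: $!";
    my $n = collapse_reads($in, $out, $name);
    close $out or die "Cannot close ${file}_unique: $!";
    print "Wrote $n unique reads to ${file}_unique\n";
}
```

The IDs are numbered in order of first appearance, as in the example output above, whereas the original script sorts the reads first. If that order does not matter, dropping `@order` and looping over `keys %count` avoids storing each distinct sequence twice. If even the unique reads do not fit in memory, this approach will not be enough on its own.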

I find it difficult to understand your code and sample data because
you did not include them inside code tags.
Oct 8 '09 #2

numberwhun
Expert Mod 2.5K+
P: 3,503
toolic is right: please use the required code tags when posting code in the forums. I have fixed this for you, but please remember to do so next time. The button with the hash symbol in the post editor inserts the code tags for you; just click it and put your code between them.

It's also a good idea to surround other text, such as the contents of your input and output files, in code tags to make it stand out.

Regards,

Jeff
Oct 8 '09 #3
