memory problem?

Hello


In the input file, each entry is represented by 4 lines: line 2 is the read and line 1 is its id.
I would like the output file to be as follows:
1. Remove lines 3 and 4.
2. To remove redundancy, reads (line 2) with identical sequences are represented by a single entry.
3. Each read id will have the form "name_n_xp", where 'p' is an integer indicating the number of times the exact read was detected in the input file and 'n' is a running number in the id to ensure that all of the ids are unique.
'name' should be specified on the command line.
4. Add '>' to the id, so it looks like this: >name_n_xp
I included below an example of an input file and its output file.
A friend wrote a script for me; it works on small files, but when I tried to process big files I got an error:
perl(87251) malloc: *** mmap(size=89133056) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!



Can you modify the script to make it memory efficient?




script:
[code]
#!/usr/bin/env perl

use POSIX qw(ceil floor);

# Optional second argument: the name used as the id prefix (default "read").
if($#ARGV>0){
    $name=pop;
}else{
    $name="read";
}

$file=pop;

open(fr, "<$file");

print "Reading in file...\n";
$j=0;
$n=1;
@reads=();
# Store line 2 (sequence) and line 4 (quality) of every 4-line entry.
foreach $line (<fr>){
    next if($line eq "\n");
    if($n%4==2){
        chomp($line);
        $reads[$j][0]=$line;
    }elsif($n%4==0){
        chomp($line);
        $reads[$j][1]=$line;
        $j++;
    }
    $n++;
}
close(fr);

$length=length($reads[0][0]);

print "Identifying unique reads...\n";
# Sort by sequence so that identical reads end up next to each other.
@reads=sort {$a->[0] cmp $b->[0]} @reads;

print "Creating new file ${file}_unique...\n";

$k=1;
$rep=0;
open(fw, ">${file}_unique");
for($i=0; $i<$j; $i++){
    $rep++;
    $seq=$reads[$i][0];
    # Accumulate per-position quality values for the current run of identical reads.
    @temp=split(//, $reads[$i][1]);
    for($z=0; $z<$length; $z++){
        $qual[$z]+=ord($temp[$z]);
    }

    # End of a run: write one entry with the repeat count.
    if($seq ne $reads[$i+1][0]){
        $quality='';
        for($z=0; $z<$length; $z++){
            $quality.=chr(round($qual[$z]/$rep));
            $qual[$z]=0;
        }

        print fw ">", $name, $k, "x", "$rep\n";
        print fw "$seq\n";
#        print fw "+\n$quality\n";
        $k++;
        $rep=0;
    }
}
close(fw);

# Round half up to the nearest integer.
sub round{
    $c=shift;
    if($c-floor($c)>=0.5){
        return floor($c)+1;
    }else{
        return floor($c);
    }
}
[/code]
inputfile:
[code]
@7:100:62:869:Y
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCTGCTGATGGTGGTACANNNNNNNNNNN
+
BB?B>3B>>AABBAA@5=7?@?5<78955/989;23<<:?<<8<-42.2/4;;4468=776+2+0%%%%%%%%%%%
@7:100:62:593:N
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTGAGAACTCCCAGGATGNNNNNNNNNNN
+
BB?C?BCBA@94>??(>@B:<??@=?<@?@;;;>@?@?;A<B<@A<<;;?<@1<==:48;A=>:3%%%%%%%%%%%
@7:100:62:823:Y
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
+
?CCA@2A@BBBBBABABBA<?=/;9/:1??31::.**:;)3/:154.//4'33093/:);/20-.%%%%%%%%%%%
@7:100:62:716:N
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
+
BBABAB>B>98/<>:9=;=?(<@8=848684866861667=181847(48.81841,++5866(*%%%%%%%%%%%
@7:100:62:488:Y
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
=8;=A:>=6,<=?AA99>179=;,11;4'441,4490,7334&.1'911/1,*,4320,3'0'/,%%%%%%%%%%%
@7:100:62:162:Y
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
BABC>BB=A@A@>B@B===A68:456=57477878886;37==77=46<(662'27/**3=477.%%%%%%%%%%%
@7:100:62:1550:N
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
(;*(/(=2@7)3(?58@C=:(.5)('=,%33@.'6).::%8(.+*3;>?@6,(&?'7A?:':.&6%%%%%%%%%%%
@7:100:62:439:N
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
+
1@?B?@@;25@@>;6.>=93;>>4;.9378/>0,7119=56/93431/122822:=;>3>,333(%%%%%%%%%%%
@7:100:62:154:Y
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
+
B?AB<BB@A@<?6A>A@?5@<>8=55;84?8/=846@===6?5:=8?6856:884:88:688461%%%%%%%%%%%
@7:100:62:1819:Y
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
+
B=A3>A96<?7=>=>6=?;1==444144,441.11/41==,391&%24432351,'8'345451,%%%%%%%%%%%
[/code]

output file:
[code]
>name_1_x1
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCTGCTGATGGTGGTACANNNNNNNNNNN
>name_2_x1
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTGAGAACTCCCAGGATGNNNNNNNNNNN
>name_3_x2
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
>name_4_x4
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
>name_5_x2
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
[/code]
Oct 7 '09 #1
toolic
70 Expert
If each set of 4 lines is unrelated to every other set of 4 lines, then
there is no need to read the whole large input file into memory at once.
This should help to avoid memory issues. You could read four lines in,
extract and re-format as needed, then output immediately, all within
the same while loop.
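
One caveat: the sets are not fully unrelated here, because identical reads have to be counted across the whole file. A minimal sketch that still avoids slurping the file could read one 4-line record at a time in a while loop and keep only a per-sequence counter in a hash, so memory grows with the number of distinct reads rather than with the file size. The sub name `collapse_reads` and the script name `collapse.pl` are made up for illustration. The sketch assumes each record is exactly 4 lines and that ids should be numbered in order of first appearance, as in the example output above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Collapse identical reads from a 4-line-per-record file.
# Hypothetical helper: reads one record at a time and keeps only
# one counter per DISTINCT sequence, never the whole file.
sub collapse_reads {
    my ($in, $out, $name) = @_;
    my (%count, @order);

    while (defined(my $id = <$in>)) {
        next if $id =~ /^\s*$/;      # skip stray blank lines between records
        my $seq  = <$in>;            # line 2: the read
        my $plus = <$in>;            # line 3: '+' separator (discarded)
        my $qual = <$in>;            # line 4: qualities (discarded)
        last unless defined $seq;    # truncated final record
        chomp $seq;
        # Remember the order of first appearance, then count the read.
        push @order, $seq unless $count{$seq}++;
    }

    my $n = 0;
    for my $seq (@order) {
        $n++;
        print {$out} ">${name}_${n}_x$count{$seq}\n$seq\n";
    }
    return $n;                       # number of unique reads written
}

# Assumed command line: perl collapse.pl inputfile [name]
if (@ARGV) {
    my ($file, $name) = @ARGV;
    $name = 'read' unless defined $name;
    open(my $in,  '<', $file)            or die "Cannot read $file: $!";
    open(my $out, '>', "${file}_unique") or die "Cannot write ${file}_unique: $!";
    my $n = collapse_reads($in, $out, $name);
    close($out) or die "Cannot close ${file}_unique: $!";
    close($in);
    print "Wrote $n unique reads to ${file}_unique\n";
}
```

If even the distinct reads do not fit in memory, sorting the sequences on disk with an external sort and then counting adjacent duplicates would be another option; that is not shown here.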

I find it difficult to understand your code and sample data because
you did not include them inside code tags.
Oct 8 '09 #2
numberwhun
3,509 Expert Mod 2GB
toolic is right: please use the required code tags when posting code in the forums. I have fixed this for you, but please remember to do so next time. The button with the hash symbol in the editor provides the code tags for you; just click it and put your code between them.

It's also a good idea to surround other text, such as your input and output file contents, with code tags to make it stand out.

Regards,

Jeff
Oct 8 '09 #3