In the inputfile each entry is represented by 4 lines: line 2 is read and line 1 is its id.
I would like the outputfile to be as follows:
1. remove lines 3 and 4.
2. To remove redundancy, the reads (line 2) with identical sequence are represented with a single entry.
3. Each read id will be a: name_n_xp" where 'p' is an integer indicating the number of times the exact read was detected in the inputfile and 'n' is a running number in the id to ensure that all of the ids are unique.
'name' should be specified in the command.
4. Add '>' to the id, so it looks like this >name_n_xp
I included below an example of inputfile and its outputfile.
A friend wrote a script for me; it works on small files but when I tried to process big files I got an error:
perl(87251) malloc: *** mmap(size=89133056) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!
Can you modify the script to make it memory efficient?
script:
Expand|Select|Wrap|Line Numbers
- #!/usr/bin/env perl
- use POSIX qw(ceil floor);
- if($#ARGV>0){
- $name=pop;
- }else{
- $name="read";
- }
- $file=pop;
- open(fr, "<$file");
- print "Reading in file...\n";
- $j=0;
- $n=1;
- @reads=();
- foreach $line (<fr>){
- next if($line eq "\n");
- if($n%4==2){
- chomp($line);
- $reads[$j][0]=$line;
- }elsif($n%4==0){
- chomp($line);
- $reads[$j][1]=$line;
- $j++;
- }
- $n++;
- }
- close(fr);
- $length=length($reads[0][0]);
- print "Identifying unique reads...\n";
- @reads=sort {$a->[0] cmp $b->[0]} @reads;
- print "Creating new file ${file}_unique...\n";
- $k=1;
- $rep=0;
- open(fw, ">${file}_unique");
- for($i=0; $i<$j; $i++){
- $rep++;
- $seq=$reads[$i][0];
- @temp=split(//, $reads[$i][1]);
- for($z=0; $z<$length; $z++){
- $qual[$z]+=ord($temp[$z]);
- }
- if($seq ne $reads[$i+1][0]){
- $quality='';
- for($z=0; $z<$length; $z++){
- $quality.=chr(round($qual[$z]/$rep));
- $qual[$z]=0;
- }
- print fw ">", $name, $k, "x", "$rep\n";
- print fw "$seq\n";
- # print fw "+\n$quality\n";
- $k++;
- $rep=0;
- }
- }
- close(fw);
- sub round{
- $c=shift;
- if($c-floor($c)>=0.5){
- return floor($c)+1;
- }else{
- return floor($c);
- }
- }
Expand|Select|Wrap|Line Numbers
- @7:100:62:869:Y
- CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCTGCTGATGGTGGTACANNNNNNNNNNN
- +
- BB?B>3B>>AABBAA@5=7?@?5<78955/989;23<<:?<<8<-42.2/4;;4468=776+2+0%%%%%%%%%%%
- @7:100:62:593:N
- GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTGAGAACTCCCAGGATGNNNNNNNNNNN
- +
- BB?C?BCBA@94>??(>@B:<??@=?<@?@;;;>@?@?;A<B<@A<<;;?<@1<==:48;A=>:3%%%%%%%%%%%
- @7:100:62:823:Y
- TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
- +
- ?CCA@2A@BBBBBABABBA<?=/;9/:1??31::.**:;)3/:154.//4'33093/:);/20-.%%%%%%%%%%%
- @7:100:62:716:N
- TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
- +
- BBABAB>B>98/<>:9=;=?(<@8=848684866861667=181847(48.81841,++5866(*%%%%%%%%%%%
- @7:100:62:488:Y
- AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
- +
- =8;=A:>=6,<=?AA99>179=;,11;4'441,4490,7334&.1'911/1,*,4320,3'0'/,%%%%%%%%%%%
- @7:100:62:162:Y
- AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
- +
- BABC>BB=A@A@>B@B===A68:456=57477878886;37==77=46<(662'27/**3=477.%%%%%%%%%%%
- @7:100:62:1550:N
- AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
- +
- (;*(/(=2@7)3(?58@C=:(.5)('=,%33@.'6).::%8(.+*3;>?@6,(&?'7A?:':.&6%%%%%%%%%%%
- @7:100:62:439:N
- AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
- +
- 1@?B?@@;25@@>;6.>=93;>>4;.9378/>0,7119=56/93431/122822:=;>3>,333(%%%%%%%%%%%
- @7:100:62:154:Y
- ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
- +
- B?AB<BB@A@<?6A>A@?5@<>8=55;84?8/=846@===6?5:=8?6856:884:88:688461%%%%%%%%%%%
- @7:100:62:1819:Y
- ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
- +
- B=A3>A96<?7=>=>6=?;1==444144,441.11/41==,391&%24432351,'8'345451,%%%%%%%%%%%
output file:
[/code]
>name_1_x1
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCT GCTGATGGTGGTACANNNNNNNNNNN
>name_2_x1
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTG AGAACTCCCAGGATGNNNNNNNNNNN
>name_3_x2
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGG TCACTGCCAGATACCNNNNNNNNNNN
>name_4_x4
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGAC GGCGCCCAGGCCGCTNNNNNNNNNNN
>name_5_x2
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGA CACGGTTCAAAGACANNNNNNNNNNN
[/code]