Hello
In the inputfile each entry is represented by 4 lines: line 2 is read and line 1 is its id.
I would like the outputfile to be as follows:
1. remove lines 3 and 4.
2. To remove redundancy, the reads (line 2) with identical sequence are represented with a single entry.
3. Each read id will be a: name_n_xp" where 'p' is an integer indicating the number of times the exact read was detected in the inputfile and 'n' is a running number in the id to ensure that all of the ids are unique.
'name' should be specified in the command.
4. Add '>' to the id, so it looks like this >name_n_xp
I included below an example of inputfile and its outputfile.
A friend wrote a script for me; it works on small files but when I tried to process big files I got an error:
perl(87251) malloc: *** mmap(size=89133056) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!
Can you modify the script to make it memory efficient?
script:
inputfile:
-
@7:100:62:869:Y
-
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCTGCTGATGGTGGTACANNNNNNNNNNN
-
+
-
BB?B>3B>>AABBAA@5=7?@?5<78955/989;23<<:?<<8<-42.2/4;;4468=776+2+0%%%%%%%%%%%
-
@7:100:62:593:N
-
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTGAGAACTCCCAGGATGNNNNNNNNNNN
-
+
-
BB?C?BCBA@94>??(>@B:<??@=?<@?@;;;>@?@?;A<B<@A<<;;?<@1<==:48;A=>:3%%%%%%%%%%%
-
@7:100:62:823:Y
-
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
-
+
-
?CCA@2A@BBBBBABABBA<?=/;9/:1??31::.**:;)3/:154.//4'33093/:);/20-.%%%%%%%%%%%
-
@7:100:62:716:N
-
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGGTCACTGCCAGATACCNNNNNNNNNNN
-
+
-
BBABAB>B>98/<>:9=;=?(<@8=848684866861667=181847(48.81841,++5866(*%%%%%%%%%%%
-
@7:100:62:488:Y
-
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
-
+
-
=8;=A:>=6,<=?AA99>179=;,11;4'441,4490,7334&.1'911/1,*,4320,3'0'/,%%%%%%%%%%%
-
@7:100:62:162:Y
-
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
-
+
-
BABC>BB=A@A@>B@B===A68:456=57477878886;37==77=46<(662'27/**3=477.%%%%%%%%%%%
-
@7:100:62:1550:N
-
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
-
+
-
(;*(/(=2@7)3(?58@C=:(.5)('=,%33@.'6).::%8(.+*3;>?@6,(&?'7A?:':.&6%%%%%%%%%%%
-
@7:100:62:439:N
-
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGACGGCGCCCAGGCCGCTNNNNNNNNNNN
-
+
-
1@?B?@@;25@@>;6.>=93;>>4;.9378/>0,7119=56/93431/122822:=;>3>,333(%%%%%%%%%%%
-
@7:100:62:154:Y
-
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
-
+
-
B?AB<BB@A@<?6A>A@?5@<>8=55;84?8/=846@===6?5:=8?6856:884:88:688461%%%%%%%%%%%
-
@7:100:62:1819:Y
-
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGACACGGTTCAAAGACANNNNNNNNNNN
-
+
-
B=A3>A96<?7=>=>6=?;1==444144,441.11/41==,391&%24432351,'8'345451,%%%%%%%%%%%
-
-
output file:
[/code]
>name_1_x1
CAGCCATACCACCAGGATATTGGCTTGACATTTTGTCTGCTCTTGGGGCT GCTGATGGTGGTACANNNNNNNNNNN
>name_2_x1
GCCCATTTCAAGGTCAGTGTAAGATGCCTGTAGCTTTCAGAGTACTCCTG AGAACTCCCAGGATGNNNNNNNNNNN
>name_3_x2
TGAGGATCCCTTCCTCTCTGTGACTGGCTTGTTTGATGGGGAGAGTTTGG TCACTGCCAGATACCNNNNNNNNNNN
>name_4_x4
AAGACAAGCAGTCCCGGCTACGCTACCAGAACCTGGAAAATGTTGAGGAC GGCGCCCAGGCCGCTNNNNNNNNNNN
>name_5_x2
ACAGGGCACAAGGGCTGGTTACTTCCTTCTTTCGTCTTCTGGATCTGTGA CACGGTTCAAAGACANNNNNNNNNNN
[/code]