Dear Perl users,
I am a biologist with limited programming experience in the statistical programming language R. Recently I have been faced with the task of extracting relevant information from a 500 MB csv file and rewriting it as another csv with the data organized differently. Although it is very easy to write an R program that does the desired manipulations, it is desperately slow. Some 5 years ago I used Perl for very simple manipulation of DNA and protein sequences (a one-shot job, with a book on Perl and sequences in hand), so I have a feeling that Perl could be the right choice. I would like to kindly ask for help with my problem. I am not asking for complete code, but rather for advice on how to design the most efficient program (searching the internet shows that in Perl the task can be accomplished in various ways, but some of them are faster than others).
The problem is as follows:
For each level of Key1 and its sub-key Key2, get an array of MeasuredValues across CellLineA - CellLineZ. Key1 is (should be) unique; Key2 can have several levels for each Key1. Not all Key1 values have all levels of Key2. The file should be rather regular: all the cell lines A-Z should always be present (with MeasuredValue NA if the measurement is missing) and ordered in the same way A-Z, but this must be checked.
ORIGINAL csv
1 ,A , CellLineA, 0.1
1 ,A , CellLineB, 0.2
1 ,A , CellLineC, 0.3
1, B , CellLineA, 0.4
1 ,B , CellLineB, 0.5
1 ,B , CellLineC, NA
2 ,A, CellLineA, NA
2 ,A , CellLineB, 0.6
2 ,A , CellLineC, 0.7
3,....................
The Output Should Look Like
Key1,Key2,CellLineA,CellLineB,CellLineC
1,A,0.1,0.2,0.3
1,B,0.4,0.5,NA
2,A,NA,0.6,0.7
3...............
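To make the intended transformation concrete, here is a minimal single-pass sketch. It is written in Python purely for illustration and is not a proposed solution; the same idea (read line by line, group consecutive rows with the same Key1 and Key2, write one output row per group) should carry over to Perl. The function name `pivot_regular` and the `cell_lines` list are hypothetical, and the sketch assumes that all rows of one Key1/Key2 pair are adjacent in the file, so only one small group is held in memory at a time. It stops with an error as soon as a group does not list the cell lines in the expected order.

```python
import csv
import io
from itertools import groupby

def pivot_regular(lines, cell_lines):
    """Yield one wide row per (Key1, Key2); raise if a group is irregular."""
    reader = csv.reader(lines)
    # Strip the stray blanks around fields seen in the sample data.
    rows = ([field.strip() for field in row] for row in reader if row)
    # groupby only merges *consecutive* rows that share the same key pair.
    for (key1, key2), group in groupby(rows, key=lambda r: (r[0], r[1])):
        group = list(group)
        seen = [r[2] for r in group]
        if seen != cell_lines:
            raise ValueError(f"irregular group {key1},{key2}: {seen}")
        yield [key1, key2] + [r[3] for r in group]

# The regular sample from above, read from memory instead of a file.
sample = io.StringIO(
    "1 ,A , CellLineA, 0.1\n"
    "1 ,A , CellLineB, 0.2\n"
    "1 ,A , CellLineC, 0.3\n"
    "1, B , CellLineA, 0.4\n"
    "1 ,B , CellLineB, 0.5\n"
    "1 ,B , CellLineC, NA\n"
    "2 ,A, CellLineA, NA\n"
    "2 ,A , CellLineB, 0.6\n"
    "2 ,A , CellLineC, 0.7\n"
)
cell_lines = ["CellLineA", "CellLineB", "CellLineC"]
wide = list(pivot_regular(sample, cell_lines))
for row in wide:
    print(",".join(row))
```

With a real file, an open file handle would be passed in place of `sample`, and each yielded row could be written out immediately rather than collected in a list.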
ORIGINAL csv - POSSIBLE IRREGULARITIES (rows out of order, cell lines missing):
Key1,Key2,CellLine, MeasuredValue
1 ,A , CellLineA, 0.1
1 ,A , CellLineB, 0.2
1 ,A , CellLineC, 0.3
1 ,B , CellLineB, 0.5
1, B , CellLineA, 0.4
2 ,A , CellLineC, 0.7
2 ,A , CellLineB, 0.6
3,...................
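For irregular input like this, one possible approach is to collect the values in a nested lookup table (Key1/Key2 pair, then cell line) and fill in NA for anything missing when writing the output. The sketch below is in Python only to pin down the logic; in Perl the same structure would presumably be a hash of hashes. The function name `pivot_tolerant` is hypothetical, and the sketch assumes the whole table fits in memory, which may or may not hold comfortably for a 500 MB file.

```python
import csv
import io

def pivot_tolerant(lines, missing="NA"):
    """Collect values per (Key1, Key2), then emit them in sorted cell-line order."""
    table = {}          # (key1, key2) -> {cell_line: value}; keeps first-seen key order
    cell_lines = set()  # every cell line name encountered anywhere in the input
    for row in csv.reader(lines):
        fields = [f.strip() for f in row]
        # Skip the header and any line that does not have exactly four fields.
        if len(fields) != 4 or fields[0] == "Key1":
            continue
        key1, key2, cell_line, value = fields
        table.setdefault((key1, key2), {})[cell_line] = value
        cell_lines.add(cell_line)
    columns = sorted(cell_lines)
    header = ["Key1", "Key2"] + columns
    body = [[k1, k2] + [vals.get(c, missing) for c in columns]
            for (k1, k2), vals in table.items()]
    return [header] + body

# The irregular sample from above: rows out of order, some cell lines absent.
irregular = io.StringIO(
    "Key1,Key2,CellLine, MeasuredValue\n"
    "1 ,A , CellLineA, 0.1\n"
    "1 ,A , CellLineB, 0.2\n"
    "1 ,A , CellLineC, 0.3\n"
    "1 ,B , CellLineB, 0.5\n"
    "1, B , CellLineA, 0.4\n"
    "2 ,A , CellLineC, 0.7\n"
    "2 ,A , CellLineB, 0.6\n"
)
result = pivot_tolerant(irregular)
for row in result:
    print(",".join(row))
```

Sorting the cell line names gives the A-Z column order regardless of the order in which rows arrive. Silently skipping malformed lines is a simplification; a real program would probably report them instead.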
Thanks for your help,
Karpatov