Reformatting a csv table

Dear Perl users,
I am a biologist with limited programming experience in the statistical programming language R. Recently I have been faced with the task of extracting the relevant information from a csv file with a size of 500 MB and rewriting it as another csv with the data organized in a different way. Although it is very easy to write an R program that does the desired manipulations, it is desperately slow. Some 5 years ago I used Perl for very simple manipulation of DNA and protein sequences (a one-shot job, with a book on Perl and sequences in hand), so I have a feeling that Perl could be the right choice. I would like to kindly ask for help with my problem. I am not asking for complete code, but rather for advice on how to design the most efficient program (searching the internet shows that in Perl the task can be accomplished in various ways, but some of them are faster than others).

The problem is as follows:
For each level of Key1 and its sub-key Key2, get an array of MeasuredValues across CellLineA - CellLineZ. Key1 is (should be) unique; Key2 can have various levels for each Key1. Not all Key1 values have all levels of Key2. The file should be rather regular: all the cell lines A-Z should always be present (with MeasuredValue NA if the measurement is missing) and ordered in the same way A-Z, but this must be checked.

ORIGINAL csv
1 ,A , CellLineA, 0.1
1 ,A , CellLineB, 0.2
1 ,A , CellLineC, 0.3
1, B , CellLineA, 0.4
1 ,B , CellLineB, 0.5
1 ,B , CellLineC, NA
2 ,A, CellLineA, NA
2 ,A , CellLineB, 0.6
2 ,A , CellLineC, 0.7
3,....................
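
The sample rows above carry stray spaces around the commas, so the fields need trimming before they can be compared. A minimal sketch of that step, written in Python purely for illustration (the same split-and-trim logic carries over to Perl); the function name `parse_rows` and the choice to reject lines that do not have exactly four fields are assumptions, not part of the original post:

```python
import csv
import io

def parse_rows(lines):
    """Yield (key1, key2, cell_line, value) with surrounding whitespace removed."""
    for line_no, fields in enumerate(csv.reader(lines), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            # Assumption: a line with the wrong field count is an error.
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(fields)}")
        yield tuple(field.strip() for field in fields)

# Two rows copied from the sample, including the inconsistent spacing.
sample = io.StringIO("1 ,A , CellLineA, 0.1\n1, B , CellLineA, 0.4\n")
rows = list(parse_rows(sample))
print(rows)
```

Note that a header line such as `Key1,Key2,CellLine, MeasuredValue` also has four fields, so it would pass through this function and has to be skipped separately if the real file has one.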


The Output Should Look Like
Key1,Key2,CellLineA,CellLineB,CellLineC
1,A,0.1,0.2,0.3
1,B,0.4,0.5,NA
2,A,NA,0.6,0.7
3...............


ORIGINAL csv - POSSIBLE IRREGULARITIES:
Key1,Key2,CellLine, MeasuredValue
1 ,A , CellLineA, 0.1
1 ,A , CellLineB, 0.2
1 ,A , CellLineC, 0.3
1 ,B , CellLineB, 0.5
1, B , CellLineA, 0.4
2 ,A , CellLineC, 0.7
2 ,A , CellLineB, 0.6
3,...................
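
The irregularities above (rows out of A-Z order, a cell line missing entirely) can be absorbed by collecting each Key1/Key2 group into a lookup keyed by cell line, then reading it back in a fixed column order. A hedged sketch in Python for illustration; the names `CELL_LINES` and `pivot_group` are hypothetical, and treating an unknown or repeated cell line as an error is an assumption about what "must be checked" means:

```python
# Assumed fixed output order; the real file would list CellLineA .. CellLineZ.
CELL_LINES = ["CellLineA", "CellLineB", "CellLineC"]

def pivot_group(pairs, cell_lines=CELL_LINES):
    """Turn (cell_line, value) pairs of one Key1/Key2 group into one output row."""
    values = {}
    for cell_line, value in pairs:
        if cell_line not in cell_lines:
            raise ValueError(f"unknown cell line: {cell_line}")
        if cell_line in values:
            raise ValueError(f"duplicate cell line: {cell_line}")
        values[cell_line] = value
    # Missing cell lines are filled with NA; input order no longer matters.
    return [values.get(name, "NA") for name in cell_lines]

# Group 1,B from the irregular sample: B before A, and C missing.
row = pivot_group([("CellLineB", "0.5"), ("CellLineA", "0.4")])
print(row)
```

In Perl the same role would be played by a hash per group, with the fixed column list held in an array.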


Thanks for help
Karpatov
Jan 5 '08 #1
1 Reply


KevinADC
You will need to build a data set to parse and organize the data, then print it to a file. Building that data set from a 500 MB file is probably going to consume a lot of system memory, possibly gigabytes. Will the computer you run the program on have enough memory? And it will still take a while for Perl to process a 500 MB file and do what you want. How long depends on too many things to attempt an estimate. Have you tried to write any code yet?
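
One way the memory concern might be sidestepped, assuming all rows for a given Key1/Key2 pair sit next to each other in the file (as they do in the posted samples, but this is an assumption about the real data): process the file one group at a time and write each output row as soon as its group ends. The sketch below is in Python for illustration only; `reshape` and `CELL_LINES` are hypothetical names, and the same streaming loop can be written in Perl with a hash that is emptied whenever the key pair changes.

```python
import csv
import io
from itertools import groupby

CELL_LINES = ["CellLineA", "CellLineB", "CellLineC"]  # assumed column order

def reshape(src, dst, cell_lines=CELL_LINES):
    """Stream long-format rows from src and write one wide row per key pair."""
    # Lazily trim whitespace from every field; blank lines are dropped.
    rows = (tuple(f.strip() for f in row) for row in csv.reader(src) if row)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["Key1", "Key2", *cell_lines])
    # groupby only merges adjacent rows, so just one group is held in memory.
    for (key1, key2), group in groupby(rows, key=lambda r: (r[0], r[1])):
        values = {cell_line: value for _, _, cell_line, value in group}
        writer.writerow([key1, key2, *(values.get(c, "NA") for c in cell_lines)])

src = io.StringIO(
    "1 ,A , CellLineA, 0.1\n1 ,A , CellLineB, 0.2\n1 ,A , CellLineC, 0.3\n"
    "1 ,B , CellLineB, 0.5\n1, B , CellLineA, 0.4\n"
    "2 ,A , CellLineC, 0.7\n2 ,A , CellLineB, 0.6\n"
)
dst = io.StringIO()
reshape(src, dst)
output = dst.getvalue()
print(output)
```

For a real file, `src` and `dst` would be file handles opened with `newline=""`. If the key pairs turn out not to be contiguous, this approach would emit the same pair more than once, and the whole table would have to be held in memory (or the file sorted first) instead.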
Jan 5 '08 #2
