By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
455,852 Members | 1,426 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 455,852 IT Pros & Developers. It's quick & easy.

Speed problem

P: 9
Hi, I am newbie to perl, below is the script i have worked on which gives me exactly what i want. The only problem i am facing is speed. This script is mapping the AB,BA,IS,RT,ET from file1.csv and putting them into columns. file2.txt contains first column of file1.csv which i have cut using unix command (this i would like to avoid and map it from the script itself). my question- Is there any way I can make this script faster as file1.csv has 1.6 millions of lines and grep command goes through them 6 times which is making the script real slow (takes 6 hours to map). but does gives the output what i want.
Please help me

Thanks in advance.
John

Expand|Select|Wrap|Line Numbers
  1. file1.csv 
  2. 112311,AB,312342
  3. 112311,BA,LAN321
  4. 112311,IS,LA3423
  5. 321211,AB,342324
  6. 432342,BA,LAN322
  7. 432342,IS,LA3453
  8. 432342,RT,324343
  9. 432342,ET,ERTF
file2.txt #has unique file1.csv column 1 data
Expand|Select|Wrap|Line Numbers
  1. 112311
  2. 321211
  3. 432342
output.txt
Expand|Select|Wrap|Line Numbers
  1. sc,ab_no bar_a,is_in,erto,ex_tk
  2. 112311,312342,LAN321,LA3423,    ,
  3. 321211,342324,    ,   ,   ,   ,
  4. 432342,   ,LAN322,LA3453,324343,ERTF
This is what i have,

Expand|Select|Wrap|Line Numbers
  1. $file1 = "file1.csv";
  2. $file2 = "file2.txt";
  3.  
  4. open(FILE2,"<$file2");
  5.  
  6. while(<FILE2>) {
  7.   chomp $_;
  8.   $sc = $_;
  9.   $ab = `grep "$sc,AB," $file1`;
  10.     if($ab ne ""){
  11.     @array = split(/,/, $ab);
  12.     $ab_no = $array[2];
  13.    } else { 
  14.       $ab_no = "\t";
  15.     }
  16.   $ba = `grep "$sc,BA,LAN" $file1`;
  17.     if($ba ne ""){
  18.     @array1 = split(/,/, $ba);
  19.     $barra = $array1[2];
  20.     } else {
  21.       $bar_a = "\t";
  22.     }
  23.   $is = `grep "$sc,IS,LA" $file1`;
  24.      if($is ne ""){
  25.      @array2 = split(/,/, $is);
  26.      $is_in = $array2[2];
  27.      } else {
  28.        $is_in = "\t";
  29.      }
  30.   $rt = `grep "$sc,RT,.*TO," $file1`;
  31.      if($rt ne ""){
  32.      @array3 = split(/,/, $rt);
  33.      $erto = $array3[2];
  34.      } else {
  35.        $erto = "\t";
  36.      } 
  37.   $et = `grep "$sc,ET," $file1`;
  38.      if($et ne ""){
  39.      @array4 = split(/,/, $et);
  40.      $ex_tk = $array4[2];
  41.      } else {
  42.        $ex_tk = "\t";
  43.      }
  44. print "$sc,$ab_no,$bar_a,$is_in,$erto,$ex_tk\n";
  45. }
  46.  
  47. close(FILE1)
Feb 15 '08 #1
Share this Question
Share on Google+
15 Replies


KevinADC
Expert 2.5K+
P: 4,059
Hi, I am newbie to perl, below is the script i have worked on which gives me exactly what i want. The only problem i am facing is speed. This script is mapping the AB,BA,IS,RT,ET from file1.csv and putting them into columns. file2.txt contains first column of file1.csv which i have cut using unix command (this i would like to avoid and map it from the script itself). my question- Is there any way I can make this script faster as file1.csv has 1.6 millions of lines and grep command goes through them 6 times which is making the script real slow (takes 6 hours to map). but does gives the output what i want.
Please help me

Thanks in advance.
John
It looks like file1 and file2 have the same order:

112311
321211
432342

Is file1 already in the order you want the output to be in or does the first column need to be sorted somehow? I assume file2 is not important, you just use it temporarily to get the first column value unless file1 is not in the correct order to begin with.

Do you have this question posted on other forums and do you already have a solution? Because if you do I don't want to waste my time figuring it out.
Feb 15 '08 #2

P: 9
It looks like file1 and file2 have the same order:

112311
321211
432342

Is file1 already in the order you want the output to be in or does the first column need to be sorted somehow? I assume file2 is not important, you just use it temporarily to get the first column value unless file1 is not in the correct order to begin with.

Do you have this question posted on other forums and do you already have a solution? Because if you do I don't want to waste my time figuring it out.
Kevin Thanks for fast response, No i have not posted this question anywhere else, coz I saw in my search that this the only forum which is active and fast.
About file2.txt, it contains unique columns from file1.csv and I am getting file2 using unix command manually ( I want avoid manual step and do it in perl). The output file as above has new columns which comes out of file1 with column as match. so in the output i dont want any duplicates plus I want to create new columns from rows in column 2.

Yes this script does gives solution, but as i mention it takes 6 to 7 hours to generate results, as it goes through 1.6 millions lines. thats why i need help, coz i am stuck. Any help will be appreciated.

Thank you in advancs.
Feb 16 '08 #3

eWish
Expert 100+
P: 971
What are the relationships between the three columns? If you are only storing the unique data from column1, what are you doing with the data from columns 2 and 3?

--Kevin
Feb 16 '08 #4

P: 9
What are the relationships between the three columns? If you are only storing the unique data from column1, what are you doing with the data from columns 2 and 3?

--Kevin
Hi, thanx for response.
In file1
col1 is "ID" (which could be duplicate), col2 has rows which includes "AB,BA,IS,RT,ET" for these rows i want to create new columns in output as shown in output.txt, and col3 has values of col2. So basically i want output with unique values for all columns and place tab for the null as shown in output.txt,
Please see the output file "output.txt",
output.txt,
sc, ab_no, bar_a, is_in, erto,ex_tk
112311,312342,LAN321,LA3423, ,
321211,342324, , , , ,
432342, ,LAN322,LA3453,324343,ERTF

The output file has output with column 1 is "sc" (unique), column 2(which is one row 'AB' from file1) is "ab_no", column 3 (which is another row 'BA' in file1) is "bar_a", column 4 (which is 'IS' in file1 ) is "is_in" and so on..., So new columns are created from rows in file1 based on first column.

Thank you in advance.
Feb 16 '08 #5

KevinADC
Expert 2.5K+
P: 4,059
You didn't answer my question concerning the order of column1 in file1. Assuming it is in order, this should work and should be faster than 6 hours. Change the path to the files to match your system.

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. my %fields = (
  4.    AB => "\t",
  5.    BA => "\t",
  6.    IS => "\t",
  7.    RT => "\t",
  8.    ET => "\t"
  9. );
  10.  
  11. my $file1 = "c:/perl_test/file1.csv";
  12. open(FILE1,"<$file1") or die "$!";
  13. open(OUT, '>>', 'path/to/output.csv') or die "$!";
  14. #get the first line of the file so we can start
  15. my ($sc_old,$v1,$v2) = split(/,/, <FILE1>);
  16. chomp $v2;
  17. $fields{$v1}=$v2;
  18.  
  19. while (<FILE1>) {
  20.    chomp;
  21.    my ($sc_new,$v1,$v2) = split(/,/);
  22.    if ($sc_new eq $sc_old) {
  23.       $fields{$v1}=$v2;
  24.    }
  25.    elsif (($sc_new ne $sc_old)) {
  26.       print OUT "$sc_old,$fields{AB},$fields{BA},$fields{IS},$fields{RT},$fields{ET}\n";
  27.       %fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");     
  28.       $sc_old = $sc_new;
  29.       $fields{$v1}=$v2;
  30.    }
  31. }
  32. close FILE1;
  33. # prints last set of records
  34. print OUT "$sc_old,$fields{AB},$fields{BA},$fields{IS},$fields{RT},$fields{ET}\n";
It maybe could be done even faster by someone that is more experienced with perl.
Feb 16 '08 #6

P: 9
You didn't answer my question concerning the order of column1 in file1. Assuming it is in order, this should work and should be faster than 6 hours. Change the path to the files to match your system.

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. my %fields = (
  4.    AB => "\t",
  5.    BA => "\t",
  6.    IS => "\t",
  7.    RT => "\t",
  8.    ET => "\t"
  9. );
  10.  
  11. my $file1 = "c:/perl_test/file1.csv";
  12. open(FILE1,"<$file1") or die "$!";
  13. open(OUT, '>>', 'path/to/output.csv') or die "$!";
  14. #get the first line of the file so we can start
  15. my ($sc_old,$v1,$v2) = split(/,/, <FILE1>);
  16. chomp $v2;
  17. $fields{$v1}=$v2;
  18.  
  19. while (<FILE1>) {
  20.    chomp;
  21.    my ($sc_new,$v1,$v2) = split(/,/);
  22.    if ($sc_new eq $sc_old) {
  23.       $fields{$v1}=$v2;
  24.    }
  25.    elsif (($sc_new ne $sc_old)) {
  26.       print OUT "$sc_old,$fields{AB},$fields{BA},$fields{IS},$fields{RT},$fields{ET}\n";
  27.       %fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");     
  28.       $sc_old = $sc_new;
  29.       $fields{$v1}=$v2;
  30.    }
  31. }
  32. close FILE1;
  33. # prints last set of records
  34. print OUT "$sc_old,$fields{AB},$fields{BA},$fields{IS},$fields{RT},$fields{ET}\n";
It maybe could be done even faster by someone that is more experienced with perl.
Kevin
Thankx for the hard work, really appreciate your time and help, works like beauty. Thnx again Kevin. One more thing I have to apply some rules to get the data from file1.csv as shown in the script i posted above for exampl,
$ba = `grep "$sc,BA,LAN" $file1`; # Here i am grabbing those rows which has BA,LAN only. Is there any way i can do this in this same script.

Thankx in advance


John
Feb 17 '08 #7

KevinADC
Expert 2.5K+
P: 4,059
Kevin
Thankx for the hard work, really appreciate your time and help, works like beauty. Thnx again Kevin. One more thing I have to apply some rules to get the data from file1.csv as shown in the script i posted above for exampl,
$ba = `grep "$sc,BA,LAN" $file1`; # Here i am grabbing those rows which has BA,LAN only. Is there any way i can do this in this same script.

Thankx in advance


John
Do you mean grep all the lines in the file that satisfy the condition and only those lines?
Feb 17 '08 #8

P: 9
Do you mean grep all the lines in the file that satisfy the condition and only those lines?
Kevin
I really appreciate your attention. Thanks.
yes thats right has to grep lines that satisfy the conditions for example,
$ba = `grep "$sc,BA,LAN" $file1`;
so first grep all the lines that has ",BA,LAN" which gives all the unique id's (ie columns1 from file1) so based on this id's satisfy rest of the information.
$is = `grep "$sc,IS,LA" $file1`;
$rt = `grep "$sc,RT,.*TO," $file1`;
output will look like the same as your script gives.

Thank you in advance Kevin
John
Feb 17 '08 #9

KevinADC
Expert 2.5K+
P: 4,059
Maybe if there were just a couple of rules you could do it all in one script, but if you want to get all the possible combinations of the various fields for each id it would just get crazy. Thats the time to start looking at using a real database to store and process your data.
Feb 17 '08 #10

P: 9
Maybe if there were just a couple of rules you could do it all in one script, but if you want to get all the possible combinations of the various fields for each id it would just get crazy. Thats the time to start looking at using a real database to store and process your data.
Thanks for the quick response Kevin, Yaa just because of this rules my script is too slow but your script is amazingly fast. Could you please help me with just one rule say for $ba = `grep "$sc,BA,LAN" $file1`;
Then i will try to work on it.

Thanks in advance
John
Feb 17 '08 #11

KevinADC
Expert 2.5K+
P: 4,059
let me see if I understand, $sc is a unique set of numbers, an ID of some kind. $sc can have up to five different fields: AB, BA, IS, RT, ET. That seems clear enough.

The question is can $sc only ever have one set of five fields in the file? Some of the fields might have no value and some might have a value but $sc can never have more than one set of five fields associated with it? For example, you can't have this scenario:

112311,AB,312342
112311,AB,554325

$sc 112311 will only ever have one AB field or no AB field but not more than zero or one. Same for all the other ID and fields.
Feb 17 '08 #12

P: 9
let me see if I understand, $sc is a unique set of numbers, an ID of some kind. $sc can have up to five different fields: AB, BA, IS, RT, ET. That seems clear enough.

The question is can $sc only ever have one set of five fields in the file? Some of the fields might have no value and some might have a value but $sc can never have more than one set of five fields associated with it? For example, you can't have this scenario:

112311,AB,312342
112311,AB,554325

$sc 112311 will only ever have one AB field or no AB field but not more than zero or one. Same for all the other ID and fields.
Kevin thanx for response, Regarding your example your right. $sc can never have more than one set of five fields associated with it. So $sc can have some thing like as below,
[HTML]112311,AB,343534
112311,BA,LAN435
112311,IS,CA34533
232323,AB,543345
232323,BA,MAN654
232323,ET,RTER[/HTML]So what I am looking is grep all "ID(col1),BA,LAN*" first (I need only those IDs that begins with "LAN" in column 3 not "MAN", then use this ID(col1) to get result) rest is same. your script is very close to want i am looking for, specialy after open(FILE), the only thing i am really not good with hashes. so Is there any way I can get the ID based on the rule. I appreciate your help Kevin.

Thank you
John
Feb 17 '08 #13

KevinADC
Expert 2.5K+
P: 4,059
I will see what I can do later today.
Feb 18 '08 #14

P: 9
I will see what I can do later today.
Kevin,
Thankx again for your response, I appreciate your help. Thanks.
John
Feb 19 '08 #15

KevinADC
Expert 2.5K+
P: 4,059
Well, just going by the one example you posted (finding all $sc with AB = LAN):

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. my %fields = (
  4.    AB => "\t",
  5.    BA => "\t",
  6.    IS => "\t",
  7.    RT => "\t",
  8.    ET => "\t"
  9. );
  10.  
  11. my $file1 = "c:/perl_test/file1.csv";
  12. open(FILE1,"<$file1") or die "$!";
  13. open(OUT, '>>', 'c:/perl_test/output.csv') or die "$!";
  14. #get the first line of the file so we can start
  15. my ($sc_old,$v1,$v2) = split(/,/, <FILE1>);
  16. chomp $v2;
  17. $fields{$v1}=$v2;
  18.  
  19. while (<FILE1>) {
  20.    chomp;
  21.    my ($sc_new,$v1,$v2) = split(/,/);
  22.    if ($sc_new eq $sc_old) {
  23.       $fields{$v1}=$v2;
  24.    }
  25.    elsif ($sc_new ne $sc_old) {
  26.       if ($fields{'BA'} =~ /^LAN/) {
  27.          print OUT "$sc_old,$fields{AB},$fields{BA},$fields{IS},$fields{RT},$fields{ET}\n";
  28.       }
  29.       %fields = (AB => "\t",BA => "\t",IS => "\t",RT => "\t",ET => "\t");     
  30.       $sc_old = $sc_new;
  31.       $fields{$v1}=$v2;
  32.    }
  33. }
  34. close FILE1;
  35. # prints last set of records
  36. if ($fields{'BA'} =~ /^LAN/) {
  37.    print OUT "$sc_old,$fields{AB},$fields{BA},$fields{IS},$fields{RT},$fields{ET}\n";
  38. }
  39.  
This is not very robust however. But if you test it and it does what you want as far as finding all $sc with AB=LAN maybe it can be expanded upon. Make sure to change the file paths.
Feb 19 '08 #16

Post your reply

Sign in to post your reply or Sign up for a free account.