Average and Standard deviation

P: 55
Hi All,
I am trying to get an average value for my data. Here is my data file:
DATA FILE
EP1934.PDB 250 250 11.27
EP1934.PDB 251 251 12.7332
EP1934.PDB 252 252 6.38341
EP1934.PDB 253 253 8.04318
EP1934.PDB 254 254 13.7123
EP1934.PDB 255 255 10.5251
EP1934.PDB 256 256 6.0811
EP1934.PDB 257 257 13.317
EP1934.PDB 258 258 14.1105
EP1934.PDB 259 259 6.98834
EP1934.PDB 260 260 9.93146
EP1934.PDB 261 261 15.0784
EP1934.PDB 262 262 11.2232
EP1934.PDB 263 263 5.8835
EP1934.PDB 264 264 12.9708
EP1934.PDB 265 265 14.6467
EP1934.PDB 266 266 7.85166
EP1934.PDB 267 267 8.95534
EP1934.PDB 268 268 14.5541
EP1934.PDB 269 269 11.5805
EP1934.PDB 270 270 5.62243
EP1934.PDB 271 271 12.6822
EP1934.PDB 272 272 14.9681
EP1934.PDB 273 273 8.78424
EP1934.PDB 274 274 9.98951
EP1935.PDB 250 250 11.793
EP1935.PDB 251 251 13.2081
EP1935.PDB 252 252 6.3147
EP1935.PDB 253 253 8.55546
EP1935.PDB 254 254 13.8497
EP1935.PDB 255 255 10.091
EP1935.PDB 256 256 5.70243
EP1935.PDB 257 257 12.8827
EP1935.PDB 258 258 13.4507
EP1935.PDB 259 259 6.39756
EP1935.PDB 260 260 9.43181
EP1935.PDB 261 261 14.7167
EP1935.PDB 262 262 10.9966
EP1935.PDB 263 263 5.71955
EP1935.PDB 264 264 13.135
EP1935.PDB 265 265 14.4682
EP1935.PDB 266 266 7.93579
EP1935.PDB 267 267 9.48097
EP1935.PDB 268 268 15.5227
EP1935.PDB 269 269 12.5595
EP1935.PDB 270 270 6.47589
EP1935.PDB 271 271 13.1677
EP1935.PDB 272 272 15.9816
EP1935.PDB 273 273 10.2107
EP1935.PDB 274 274 10.7019
EP1936.PDB 250 250 12.0315
EP1936.PDB 251 251 13.6144
EP1936.PDB 252 252 6.44758
EP1936.PDB 253 253 8.70471
EP1936.PDB 254 254 13.9884
EP1936.PDB 255 255 10.4086
EP1936.PDB 256 256 5.42416
EP1936.PDB 257 257 12.5661
EP1936.PDB 258 258 13.497
EP1936.PDB 259 259 6.49391
EP1936.PDB 260 260 9.43865
EP1936.PDB 261 261 14.9835
EP1936.PDB 262 262 11.4903
EP1936.PDB 263 263 6.2322
EP1936.PDB 264 264 13.3191
EP1936.PDB 265 265 15.0674
EP1936.PDB 266 266 8.56444
EP1936.PDB 267 267 9.8656
EP1936.PDB 268 268 16.3347
EP1936.PDB 269 269 13.6462
EP1936.PDB 270 270 7.47648
EP1936.PDB 271 271 13.8738
EP1936.PDB 272 272 16.8272
EP1936.PDB 273 273 11.1519
EP1936.PDB 274 274 9.61694
EP1937.PDB 250 250 11.2767
EP1937.PDB 251 251 12.8564
EP1937.PDB 252 252 6.13925
EP1937.PDB 253 253 8.30244
EP1937.PDB 254 254 14.1491
EP1937.PDB 255 255 10.6535
EP1937.PDB 256 256 5.36572
EP1937.PDB 257 257 12.1148
EP1937.PDB 258 258 13.3093
EP1937.PDB 259 259 6.15769
EP1937.PDB 260 260 9.39202
EP1937.PDB 261 261 14.6329
EP1937.PDB 262 262 11.1803
EP1937.PDB 263 263 6.40411
EP1937.PDB 264 264 13.6729
EP1937.PDB 265 265 14.5391
EP1937.PDB 266 266 8.22699
EP1937.PDB 267 267 8.98709
EP1937.PDB 268 268 15.2712
EP1937.PDB 269 269 13.2764
EP1937.PDB 270 270 6.57068
EP1937.PDB 271 271 11.7033
EP1937.PDB 272 272 16.2944
EP1937.PDB 273 273 11.7734
EP1937.PDB 274 274 8.73714
EP1940.PDB 250 250 11.7256
EP1940.PDB 251 251 13.3999
EP1940.PDB 252 252 6.52818
EP1940.PDB 253 253 8.41266
EP1940.PDB 254 254 14.1372
EP1940.PDB 255 255 10.5523
EP1940.PDB 256 256 5.54926
EP1940.PDB 257 257 12.544
EP1940.PDB 258 258 13.0304
EP1940.PDB 259 259 6.3614
EP1940.PDB 260 260 9.26743
EP1940.PDB 261 261 14.8251
EP1940.PDB 262 262 11.0243
EP1940.PDB 263 263 6.09589
EP1940.PDB 264 264 13.2229
EP1940.PDB 265 265 14.4447
EP1940.PDB 266 266 7.83723
EP1940.PDB 267 267 10.0536
EP1940.PDB 268 268 16.3468
EP1940.PDB 269 269 13.4618
EP1940.PDB 270 270 7.98931
EP1940.PDB 271 271 14.8577
EP1940.PDB 272 272 17.9952
EP1940.PDB 273 273 12.2682
EP1940.PDB 274 274 10.2391

where the first column is the PDB ID, the second and third are the residue position, and the fourth is the distance.
What I am trying to do is calculate the average distance for each residue position, along with the standard deviation (SD).
For example, for residue position 250 the program should select all the distance values at residue number 250, calculate their average, and then calculate the SD.
Finally, it should print the residue number, the average value and the SD.

I have written some code, but it is not able to select the specified residue and do the calculations.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (%hash,$respos1,$respos2,$dist,$val,$line,@temp);
    my ($count,$dis) = 0;


    open (FH,"caca.dat") or die "Check the file";
    while (<FH>)
    {
        $line = $_;
        chomp $_;
        @temp = split (/\s/,$line);
        $respos1 = $temp[1];
        $respos2 = $temp[2];
        $dist    = $temp[3];
        $hash{$respos1} = $dist;
    }

        for ($respos1=250;$respos1<=274;$respos1++)
        {
            if ($respos1 == $respos2)
            {
                $dis = $dis + $dist;
                $count++;
            }
        }

Since the average value is not calculating correctly, I have not tried the SD part.

Any directions will be helpful.

Thanks
Kumar
Oct 16 '08 #1
6 Replies


KevinADC
Expert 2.5K+
P: 4,059
I'm not sure I got this right, but it should help, or it can be fixed easily enough (I think).

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %hash;

    # Columns: ID         r1  r2  distance
    # e.g.     EP1935.PDB 267 267 9.48097

    open(my $fh, '<', 'caca.dat') or die "Cannot open caca.dat: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # Take the residue position (column 2) and the distance (column 4).
        my ($r1, $dist) = (split ' ', $line)[1, 3];
        next unless defined $dist;    # skip blank or short lines
        $hash{$r1}{sum} += $dist;     # running total of distances for this residue
        $hash{$r1}{count}++;          # number of values seen for this residue
    }
    close $fh;

    # Print one average per residue, in numeric order.
    foreach my $key (sort { $a <=> $b } keys %hash) {
        my $avg = sprintf "%.3f", $hash{$key}{sum} / $hash{$key}{count};
        print "The average distance for $key is $avg\n";
    }

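For the SD part, the running-total approach above can probably be extended by also keeping a sum of squares per residue. The following is only a sketch, not something posted in the thread: the sub names `accumulate` and `mean_sd_from_sums` are made up for illustration, the four sample rows are copied from the data file in the question, and the SD uses the n - 1 (sample) divisor. The choice to return 0 for a single value is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Accumulate count, sum and sum of squares per residue in one pass.
# Takes a list of "ID pos1 pos2 distance" lines and returns a hash ref
# keyed by residue position. (Hypothetical helper, for illustration.)
sub accumulate {
    my @lines = @_;
    my %acc;
    for my $line (@lines) {
        my ($pos, $dist) = (split ' ', $line)[1, 3];
        next unless defined $dist;          # skip blank or short lines
        $acc{$pos}{n}++;
        $acc{$pos}{sum}   += $dist;
        $acc{$pos}{sumsq} += $dist ** 2;
    }
    return \%acc;
}

# Mean and sample standard deviation (n - 1 divisor) from the accumulators.
sub mean_sd_from_sums {
    my ($s) = @_;
    my $mean = $s->{sum} / $s->{n};
    return ($mean, 0) if $s->{n} < 2;       # assumption: report 0 when only one value
    my $var = ($s->{sumsq} - $s->{n} * $mean ** 2) / ($s->{n} - 1);
    $var = 0 if $var < 0;                   # guard against tiny negative rounding error
    return ($mean, sqrt $var);
}

# Four rows copied from the data file in the question (residues 250 and 251).
my @sample = (
    'EP1934.PDB 250 250 11.27',
    'EP1934.PDB 251 251 12.7332',
    'EP1935.PDB 250 250 11.793',
    'EP1935.PDB 251 251 13.2081',
);

my $acc = accumulate(@sample);
for my $pos (sort { $a <=> $b } keys %$acc) {
    my ($mean, $sd) = mean_sd_from_sums($acc->{$pos});
    printf "%d\t%.3f\t%.3f\n", $pos, $mean, $sd;
}
```

The sum-of-squares shortcut avoids storing every value, but it can lose precision when the values are large relative to their spread. Distances of this size should be fine, though a two-pass calculation over the stored values is the safer choice in general.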
Oct 16 '08 #2

P: 55
Hi All,
Thanks for the reply. I tried to calculate the average and the SD, but something is wrong and I am not sure what.
Here is my code:
  1. #!/usr/bin/perl
  2.  
  3. use strict;
  4. use warnings;
  5.  
  6. my (@pos,@cadist,$mean,%hash,$respos1,$respos2,$cadist,$val,$line,@temp,@respos2);
  7. my ($cnt,$dis,$sum) = 0;
  8.  
  9. open (FH,"caca.dat") or die "Check the file";
  10. while (<FH>)
  11. {
  12.     $line = $_;chomp $_;
  13.     @temp = split (/\s/,$line);
  14.     $respos1   = $temp[1];
  15.     $respos2   = $temp[2];
  16.     $cadist    = $temp[3];
  17.     for(my $i=250;$i<=274;$i++)
  18.     {
  19.         if ($i == $respos2)
  20.         {
  21.             push (@cadist,$cadist);
  22.             push (@pos,$respos2);
  23.             $sum +=$cadist;
  24.             $cnt++;
  25.         }
  26.     }
  27. }
  28.  
  29. $mean = $sum/$cnt;
  30. @cadist = ();
  31.  
  32. my $summ = 0;
  33. my $deviation;
  34.  
  35. foreach my $val(@cadist)
  36. {
  37.     my $abar = (($val-$mean)**2);
  38.     $summ   += $abar;
  39. }
  40. $deviation = sprintf "%.5f",sqrt($summ/($cnt-1));
  41. print "$pos[0] $mean $deviation\n";
  42.  
If I remove the for loop and simply put a fixed value in the if statement for comparison, everything works fine, but when I put in the condition for every residue position the calculation goes wrong.
Thanks
Kumar
Oct 17 '08 #3

KevinADC
Expert 2.5K+
P: 4,059
Well, if everything else is correct in your code, this line needs to be removed:

@cadist = (); (line 30)

That empties the array, discarding every value pushed into it earlier, so the loop that sums the squared deviations has nothing to iterate over.
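For reference, the quantities the posted code appears to be computing are the mean and the sample standard deviation. This is written out here as a reading of the code, not something stated in the thread; $n$ corresponds to `$cnt` and the $x_i$ to the values in `@cadist`:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```

With `@cadist` emptied before the loop, the sum under the square root has no terms, so `$summ` stays 0 and the script prints a deviation of 0.00000 whatever the data.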
Oct 17 '08 #4

P: 55
Thanks for the reply.
I finally succeeded in calculating the values, but one thing still remains: because of the for loop in the code, each result is printed again for every row with the same residue number, which makes the output redundant.
I am posting the code, which runs on the data file I posted earlier; you can see the results by running it.
    #!/usr/bin/perl

    use strict;
    use warnings;

    my ($line,@temp,$respos1,$respos2,@respos1,@respos2,$cadis,@lstdist,$i,$j,@cadist,@result,$length,$ele,$val,$abar,$deviation);
    my ($cnt,$dis,$sum,$summ,$mean) = 0;

    open (FH,"caca.dat") or die "Check the file";
    while (<FH>)
    {
        $line = $_;chomp $_;
        @temp = split (/\s/,$line);
        $respos1   = $temp[1];
        $respos2   = $temp[2];
        $cadis     = $temp[3];
        push(@respos1,$respos1);push(@respos2,$respos2);push(@cadist,$cadis);
    }

    for($i=0;$i<@respos1;$i++)
    {
        @lstdist=();
       for($j=0;$j<@respos2;$j++)
       {
           if ($respos1[$i] == $respos2[$j])
           {
           push (@lstdist,$cadist[$j]);
           }
       }
        @result=&mean(@lstdist);
        print "$respos1[$i]\t$result[0]\t$result[1]\n";
    }
    sub mean
    {
        (@lstdist)=@_;
        $length=scalar(@lstdist);
        $sum=0;$mean=0;$summ=0;
        foreach $ele(@lstdist)
        {
        $sum +=$ele;
        }
        $mean=$sum/$length;
        foreach $val(@lstdist)
        {
        $abar=0;
        $abar = (($val-$mean)**2);
        $summ   += $abar;
        }
        $deviation = sqrt($summ/($length-1));
        return($mean,$deviation);
    }

How can I print the values only once for each residue number?


Thanks
Kumar
Oct 17 '08 #5

nithinpes
Expert 100+
P: 410
You can store the results in a hash of arrays and print them outside the loop to avoid duplicate output:
    my %result = ();  # result hash: residue position => [mean, SD]
    for ($i = 0; $i < @respos1; $i++)
    {
        @lstdist = ();
        for ($j = 0; $j < @respos2; $j++)
        {
            if ($respos1[$i] == $respos2[$j])
            {
                push(@lstdist, $cadist[$j]);
            }
        }
        @result = &mean(@lstdist);
        $result{$respos1[$i]} = [$result[0], $result[1]];  # create hash of arrays
    }

    # Print once per residue, in numeric order
    foreach (sort { $a <=> $b } keys %result) {
        print "$_\t$result{$_}[0]\t$result{$_}[1]\n";  # display result
    }

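The nested loop above still recomputes the statistics once per input row. A possible alternative, sketched here under assumptions and not taken from the thread, is to group the distances by residue position while reading, then compute each mean and SD once. The names `group_by_residue` and `mean_sd` are hypothetical, the sample rows are copied from the data file in the question, and `sum` comes from the core `List::Util` module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Group distances by residue position: position => [ distances ].
# (Hypothetical helper, for illustration.)
sub group_by_residue {
    my @lines = @_;
    my %by_pos;
    for my $line (@lines) {
        my ($pos, $dist) = (split ' ', $line)[1, 3];
        next unless defined $dist;          # skip blank or short lines
        push @{ $by_pos{$pos} }, $dist;
    }
    return \%by_pos;
}

# Mean and sample standard deviation (n - 1 divisor) of a list, in two passes.
# Returns an empty list for no input, and an undefined SD for a single value.
sub mean_sd {
    my @x = @_;
    my $n = @x;
    return unless $n;
    my $mean = sum(@x) / $n;
    return ($mean, undef) if $n < 2;
    my $ss = sum(map { ($_ - $mean) ** 2 } @x);
    return ($mean, sqrt($ss / ($n - 1)));
}

# Six rows copied from the data file in the question (residues 250 and 251).
my @sample = (
    'EP1934.PDB 250 250 11.27',
    'EP1934.PDB 251 251 12.7332',
    'EP1935.PDB 250 250 11.793',
    'EP1935.PDB 251 251 13.2081',
    'EP1936.PDB 250 250 12.0315',
    'EP1936.PDB 251 251 13.6144',
);

my $groups = group_by_residue(@sample);
for my $pos (sort { $a <=> $b } keys %$groups) {
    my ($mean, $sd) = mean_sd(@{ $groups->{$pos} });
    printf "%s\t%.4f\t%s\n", $pos, $mean,
        defined $sd ? sprintf('%.4f', $sd) : 'n/a';
}
```

Because each residue is a single hash key, every position is printed exactly once, and the work grows with the number of rows, not with its square.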
Oct 17 '08 #6

P: 55
Thanks, Nithinpes and all, for the suggestions; the program now works perfectly.

Thanks
Kumar
Oct 17 '08 #7
