By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
437,809 Members | 1,999 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 437,809 IT Pros & Developers. It's quick & easy.

error in calculation of standard deviation

P: 5
Hi,
My data set is followed like this... (as tab delimited input.csv)
First is column is gene name, second column is bn count, third colums is sh count.


ENSRNOG00000000417 42 102
ENSRNOG00000000417 26 52
ENSRNOG00000000417 152 284
ENSRNOG00000000417 270 372
ENSRNOG00000000417 418 540
ENSRNOG00000000417 1796 2110
ENSRNOG00000000417 2116 2464
ENSRNOG00000000417 450 518
ENSRNOG00000000417 90 102
ENSRNOG00000000417 6472 7076
ENSRNOG00000000417 724 784
ENSRNOG00000000417 724 780
ENSRNOG00000000417 406 430
ENSRNOG00000000417 536 564
ENSRNOG00000000417 236 242
ENSRNOG00000000417 112 114
ENSRNOG00000000417 1272 1224
ENSRNOG00000000417 290 264
ENSRNOG00000000417 688 624
ENSRNOG00000000417 90 78
ENSRNOG00000000417 180 148
ENSRNOG00000000417 96 78
ENSRNOG00000000417 210 164
ENSRNOG00000000417 64 38
ENSRNOG00000000417 166 78
ENSRNOG00000001487 96 172
ENSRNOG00000001487 114 168
ENSRNOG00000001487 1190 1654
ENSRNOG00000001487 86 104
ENSRNOG00000001487 86 102
ENSRNOG00000001487 1536 1679
ENSRNOG00000001487 1518 1659
ENSRNOG00000001487 188 146
ENSRNOG00000001487 100 72
ENSRNOG00000001487 84 56
ENSRNOG00000001487 46 26
ENSRNOG00000001487 0 0
ENSRNOG00000001487 0 0
ENSRNOG00000001487 0 0
ENSRNOG00000001488 814 2714
ENSRNOG00000001488 314 780
ENSRNOG00000001488 1110 2156
ENSRNOG00000001488 956 1652
ENSRNOG00000001488 4334 6926
ENSRNOG00000001488 2450 3202
ENSRNOG00000001489 50 114
ENSRNOG00000001489 100 186
ENSRNOG00000001489 72 118
ENSRNOG00000001489 356 516
ENSRNOG00000001489 42 50
ENSRNOG00000001489 32 32
ENSRNOG00000001489 72 66
ENSRNOG00000001489 14 10
ENSRNOG00000001489 1040 352
ENSRNOG00000001489 0 0


I want to calculate for each gene names of total count, sum, mean and standard deviation..
I could calculate for the count, sum and mean but the Standard deviation is not calculating properly with my code..
This is my perl code
Expand|Select|Wrap|Line Numbers
  1. #usr/bin/perl -w
  2. use strict;
  3.  
  4. my ($input) = @ARGV;
  5.  
  6. open (INPUT, "$input") or die "Can't open the file";
  7. my %hash = ();
  8.  
  9.  
  10. while (<INPUT>) { 
  11.     chomp;
  12.     my @data = split " ", $_;
  13.     for my $i (1,2) {
  14.         $hash {$data[0]} [$i-1]+=$data[$i];
  15.     }
  16.     $hash{$data[0]}[2]++;
  17. }
  18. my $sumofsq=0;
  19.     foreach my $key (keys %hash) {
  20.         print "Number, Sum, Mean, Stdev  for BN & SH  $key : ";
  21.         foreach my $i (0..$#{$hash{$key}}-1) {
  22.         if ($hash{$key}[$i] == 0 ) {
  23.                 my $count = sprintf $hash{$key}[2];            
  24.                 print "$count \t 0 \t 0 \t 0 \t";
  25.             }
  26.             else {
  27.         my $count1 = sprintf $hash{$key}[2];
  28.         my $sum = sprintf "%.2f", $hash{$key}[$i];    
  29.         my $avg = sprintf  "%.2f", $hash{$key}[$i] / $hash{$key}[2];
  30.               my $sumofsq += ($hash{$key}[$i] - ($hash{$key}[$i] / $hash{$key}[2]))**2;
  31.         my $stdev = sprintf "%.2f", sqrt($sumofsq / ($hash{$key}[2]));
  32.             print "$count1 \t $sum \t $avg \t $stdev \t";
  33.             }
  34.         }
  35.     print "\n";
  36. }
  37.  
  38. close (INPUT);
  39. exit;
  40.  
The output is
Number, Sum, Mean, Stdev for BN & SHR ENSRNOG00000000417 : 26 17660.00 679.23 3330.20 26 19242.00 740.08 3628.53
Number, Sum, Mean, Stdev for BN & SH ENSRNOG00000001487 : 14 5044.00 360.29 1251.78 14 5838.00 417.00 1448.82
Number, Sum, Mean, Stdev for BN & SH ENSRNOG00000001488 : 6 9978.00 1663.00 3394.58 6 17430.00 2905.00 5929.81
Number, Sum, Mean, Stdev for BN & SH ENSRNOG00000001488 : 6 9978.00 1663.00 3394.58 6 17430.00 2905.00 5929.81


So can you help me to find out the problem of standard deviation?

For example the output should be like this for
Number, Sum, Mean, Stdev for BN & SH ENSRNOG00000000417 : 26 17660.00 679.23 1295.36 26 19242.00 740.08 1427.99
Sep 17 '09 #1
Share this Question
Share on Google+
6 Replies


Expert
P: 70
Since I could not determine what you problem was by looking at your code,
and because there is a discrepancy between your posted data and
your actual output (I only see 25 lines for gene 417, not 26), I decided
to code something else up myself. I also used very simple data,
which I just copied from the wikipedia standard deviation page.

Expand|Select|Wrap|Line Numbers
  1. use warnings;
  2. use strict;
  3. use List::Util qw(sum);
  4.  
  5. my %data;
  6. while (<DATA>) {
  7.     my ($gene, $bn, $sh) = split;
  8.     push @{ $data{$gene}{bn} }, $bn;
  9.     push @{ $data{$gene}{sh} }, $sh;
  10. }
  11.  
  12. for my $gene (keys %data) {
  13.     print "gene=$gene\n";
  14.     for my $type (qw(bn sh)) {
  15.         my @vals = @{ $data{$gene}{$type} };
  16.         my $tot = sum(@vals);
  17.         my $avg = $tot / @vals;
  18.         my $sos = 0;
  19.         for (@vals) {
  20.             $sos += ($_ - $avg)**2;
  21.         }
  22.         my $sd = sqrt($sos/@vals);
  23.         print "type=$type tot=$tot avg=$avg sd=$sd\n";
  24.     }
  25. }
  26. __DATA__
  27. foo 2 0
  28. foo 4 0
  29. foo 4 0
  30. foo 4 0
  31. foo 5 0
  32. foo 5 0
  33. foo 7 0
  34. foo 9 0
  35.  
This prints out:

Expand|Select|Wrap|Line Numbers
  1. gene=foo
  2. type=bn tot=40 avg=5 sd=2
  3. type=sh tot=0 avg=0 sd=0
  4.  
Perhaps you could use this simplified version of the code
to build upon.
Sep 18 '09 #2

P: 5
Thanks for the help. It was my mistake that i missed a line in the gene 417. But your code is correct.. Its working fine.. Can you help me on this problem also..

The data in this format
_DATA__
foo 2 0 ens1
foo 4 0 ens1
foo 4 0 ens1
foo 4 0 ens1
foo 5 0 ens1
foo 5 0 ens1
foo 7 0 ens1
foo 9 0 ens1
fo1 1 3 ens2
fo1 4 2 ens2
fo3 2 2 ens3
fo3 3 2 ens3

i want to keep the 4th column geneid also with this output..
like

gene=foo
type=bn tot=40 avg=5 sd=2 geneid =ens1
type=sh tot=0 avg=0 sd=0 geneid =ens1

gene=fo1
type=bn tot=5 avg=2.5 sd=0.5 geneid=ens2
type=sh tot=4 avg=2 sd=0 geneid=ens2

gene=fo2
type=bn tot=5 avg=2.5 sd=1.5 geneid=ens3
type=sh tot=5 avg=2.5 sd=0.5 geneid=ens3

How i can include the geneid also with the output?
Sep 18 '09 #3

P: 5
@toolic

Sorry there is mistake in the sd calculation. The value should be 2.14
can you please check this..
Sep 18 '09 #4

P: 5
Sorry there is mistake in the sd calculation. The value should be 2.14
can you please check this..
Sep 18 '09 #5

P: 5
Hi toolic,
I found the mistake in SD calculation.. I had assigned $count = @vals;
Then $sd = sqrt($sos/ ($count -1))
Now its correct the SD...
Thanks..
Sep 18 '09 #6

Expert
P: 70
i want to keep the 4th column geneid also with this output..
like

gene=foo
type=bn tot=40 avg=5 sd=2 geneid =ens1
Assuming there is only one geneid per gene,
one way to add 'geneid' to your data structure is:
Expand|Select|Wrap|Line Numbers
  1. while (<DATA>) {
  2.     my ($gene, $bn, $sh, $geneid) = split;
  3.     push @{ $data{$gene}{bn} }, $bn;
  4.     push @{ $data{$gene}{sh} }, $sh;
  5.     $data{$gene}{sh}{geneid} = $geneid;
  6. }
  7.  
A handy reference for Perl data structures is:
perldsc

Sorry there is mistake in the sd calculation. The value should be 2.14
can you please check this..
The Wiki Standard_deviation Basic_example shows a result of 2. When I check this by hand, I get 2. When I run my Perl code, I get 2. Perhaps you are interested in a different formula from the wiki page. It's your code; you can use whatever formula you want.
Sep 18 '09 #7

Post your reply

Sign in to post your reply or Sign up for a free account.