
# numpy/scipy: correlation

Is there a ready-made function in numpy/scipy to compute the correlation y=m*x+o of an X and Y fast: m, m-err, o, o-err, r-coef, r-coef-err? Or a formula to compute the 3 error ranges?

-robert

PS: numpy.corrcoef computes only the bare coefficient:

```python
>>> numpy.corrcoef((0,1,2,3.0),(2,5,6,7.0),)
array([[ 1.        ,  0.95618289],
       [ 0.95618289,  1.        ]])
```

with ints it computes wrong results:

```python
>>> numpy.corrcoef((0,1,2,3),(2,5,6,7),)
array([[ 1.        ,  0.94491118],
       [ 0.94491118,  1.        ]])
```

Nov 11 '06 #1
robert wrote:
> Is there a ready-made function in numpy/scipy to compute the correlation y=m*x+o of an X and Y fast: m, m-err, o, o-err, r-coef, r-coef-err?

numpy and scipy questions are best asked on their lists, not here. There are a number of people who know the capabilities of numpy and scipy through and through, but most of them don't hang out on comp.lang.python.

http://www.scipy.org/Mailing_Lists

scipy.optimize.leastsq() can be told to return the covariance matrix of the estimated parameters (m and o in your example; I have no idea what you think r-coef is).

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
  -- Umberto Eco

Nov 11 '06 #2
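A minimal sketch of the leastsq() approach, using the data from the first post. The residual function and starting guess are illustrative; note that the `cov_x` returned by leastsq must still be scaled by the residual variance to give the covariance matrix of the estimated parameters.

```python
import numpy as np
from scipy.optimize import leastsq

# data from the original post
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 6.0, 7.0])

def residuals(p, x, y):
    m, o = p
    return y - (m * x + o)

p, cov_x, infodict, mesg, ier = leastsq(
    residuals, [1.0, 0.0], args=(x, y), full_output=True)
m, o = p

# scale cov_x by the residual variance to get the parameter covariance
dof = len(x) - len(p)
s_sq = (residuals(p, x, y) ** 2).sum() / dof
cov = cov_x * s_sq
m_err, o_err = np.sqrt(np.diag(cov))
```

The square roots of the diagonal of `cov` are the 1-sigma errors on m and o (m-err and o-err in the original question's terms).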
Robert Kern wrote:
> robert wrote:
>> Is there a ready-made function in numpy/scipy to compute the correlation y=m*x+o of an X and Y fast: m, m-err, o, o-err, r-coef, r-coef-err?
>
> scipy.optimize.leastsq() can be told to return the covariance matrix of the estimated parameters (m and o in your example; I have no idea what you think r-coef is).

Ah, the correlation coefficient itself. Since correlation coefficients are weird beasts constrained to [-1, 1], standard gaussian errors like you are expecting for m-err and o-err don't apply. No, there's currently no function in numpy or scipy that will do something sophisticated enough to be reliable. Here's an option:

http://www.pubmedcentral.nih.gov/art...i?artid=155684

--
Robert Kern

Nov 11 '06 #3
robert wrote:
> Is there a ready-made function in numpy/scipy to compute the correlation y=m*x+o of an X and Y fast: m, m-err, o, o-err, r-coef, r-coef-err?

And of course, those three parameters are not particularly meaningful together. If your model is truly "y is a linear response given x with normal noise", then "y=m*x+o" is correct, and all of the information that you can get from the data will be found in the estimates of m and o and the covariance matrix of the estimates.

On the other hand, if your model is that "(x, y) is distributed as a bivariate normal distribution", then "y=m*x+o" is not a particularly good representation of the model. You should instead estimate the mean vector and covariance matrix of (x, y). Your correlation coefficient will be the off-diagonal term after dividing out the marginal standard deviations.

The difference between the two models is that the first places no restrictions on the distribution of x. The second does; both the x and y marginal distributions need to be normal. Under the first model, the correlation coefficient has no meaning.

--
Robert Kern

Nov 11 '06 #4
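The second model can be sketched in a few lines: estimate the mean vector and covariance matrix of (x, y) and read the correlation off the covariance matrix. Using the data from the first post:

```python
import numpy as np

# data from the original post
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 6.0, 7.0])

# estimate the mean vector and covariance matrix of (x, y)
mu = np.array([x.mean(), y.mean()])
C = np.cov(x, y)  # 2x2 sample covariance matrix

# correlation coefficient: off-diagonal term divided out by
# the marginal standard deviations
r = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
```

This reproduces the 0.95618289 that numpy.corrcoef printed in the first post; corrcoef is exactly this normalization of the covariance matrix.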
Robert Kern wrote:
> The difference between the two models is that the first places no restrictions on the distribution of x. The second does; both the x and y marginal distributions need to be normal. Under the first model, the correlation coefficient has no meaning.

That is not correct. The correlation coefficient is meaningful in both models, but must be interpreted differently. However, in both cases a correlation coefficient of 1 or -1 indicates an exact linear relationship between x and y.

Under the first model ("linear regression"), the squared correlation coefficient is the "explained variance", i.e. the proportion of y's variance accounted for by the model y = m*x + o.

Under the second model ("multivariate normal distribution"), the correlation coefficient is the covariance of y and x divided by the product of the standard deviations, cov(x,y)/(std(x)*std(y)).

Nov 12 '06 #6
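The "explained variance" reading can be checked numerically: fit the least-squares line and compare the fraction of variance it accounts for against r**2 (a sketch, again using the sample data from the first post):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 6.0, 7.0])

r = np.corrcoef(x, y)[0, 1]

# least-squares line y = m*x + o
m, o = np.polyfit(x, y, 1)
resid = y - (m * x + o)

# proportion of y's variance accounted for by the line:
# 1 - residual sum of squares / total sum of squares
explained = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
# explained equals r**2 up to floating point
```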
robert wrote:

http://links.jstor.org/sici?sici=016...3E2.0.CO%3B2-Y tells:

    probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - quite what I expected above, roughly, for the N-only dependence. But thus it is not sensitive to the above considerations about 'boring' data. With the above example it would predict a decrease of this probable coef-err from 0.0115628571429 to 0.00548453410954!

This 1929 formula for estimating the error of the correlation coefficient seems to make some sense for r=0. I do Monte Carlo on correlating random series:

```python
>>> X = numpy.random.random(10000)
>>> l = []
>>> for i in range(200):
...     Z = numpy.random.random(10000)
...     l.append( sd.correlation((X,Z)) )   # collect coef's
...
>>> mean(l)
0.000327657082234
>>> std(l)
0.0109120766158        # that's how the coef jitters
>>> std(l)/sqrt(len(l))
0.000771600337185
>>> len(l)
200
```

    # now:
    # 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
    # vs M.C. 0.0109120766158 ± 0.000771600337185

but the fancy factor of 0.6745 is significantly just fancy for r=0.

Then for a higher (0.5) correlation:

```python
>>> l = []
>>> for i in range(200):
...     Z = numpy.random.random(10000) + array(range(10000))/10000.0
...     l.append( sd.correlation((X + array(range(10000))/10000.0, Z)) )
...
>>> mean(l)
0.498905642552
>>> std(l)
0.00546979583163
>>> std(l)/sqrt(len(l))
0.000386772972425
```

    # now:
    # 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849
    # vs M.C. 0.00546979583163 ± 0.000386772972425

=> there the 0.6745 factor and (1-r**2) seem to get the main effect! There is something in it.

Now adding boring data:

```python
>>> boring = ones(10001)*0.5
>>> X = numpy.random.random(10000)
>>> l = []
>>> for i in range(200):
...     Z = concatenate((numpy.random.random(10000) + array(range(10000))/10000.0, boring))
...     l.append( sd.correlation((concatenate((X + array(range(10000))/10000.0, boring)), Z)) )
...
>>> mean(l)
0.712753628489         # r
>>> std(l)
0.00316163649888       # r_err
>>> std(l)/sqrt(len(l))
0.0002235614608
```

    # now:
    # 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461   # N=20000
    # vs M.C. spread 0.00316163649888 ± 0.0002235614608

=> the boring data has an effect on coef-err which is significantly not reflected by the formula 0.6745*(1-r**2)/sqrt(N)

=> I'll use this formula to get a downside error estimate for the correlation coefficient:

    ------------------------------------------
    |  r_err_down ~= 1.0 * (1-r**2)/sqrt(N)  |
    ------------------------------------------

(until I find a better one respecting the actual distribution of data)

Would be interesting what MATLAB & Octave say ...

-robert

Nov 12 '06 #7
First, are you talking about rounding error (due to floating point arithmetic) or statistical sampling error?

If you are talking about the latter, I suggest you look it up in a statistics textbook. E.g. if x and y are normally distributed, then

    t = r * sqrt( (n-2)/(1-r**2) )

has a Student t-distribution with n-2 degrees of freedom. And if you don't know how to get the p-value from that, you should not be messing with statistics anyway.

Nov 12 '06 #8
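The quoted relation, and its inverse for turning a t quantile back into a bound on r, can be sketched directly (function names are mine):

```python
from math import sqrt

def r_to_t(r, n):
    # t statistic for a sample correlation r from n points;
    # Student t-distributed with n-2 degrees of freedom under
    # the null hypothesis of zero correlation
    return r * sqrt((n - 2) / (1.0 - r * r))

def t_to_r(t, n):
    # inverse: solve t = r*sqrt((n-2)/(1-r**2)) for r
    return t / sqrt(n - 2 + t * t)
```

Given t, a two-sided p-value comes from the t distribution's tail with n-2 degrees of freedom (e.g. scipy.stats.t.sf), and a confidence bound on r follows by pushing the corresponding t quantile through `t_to_r`.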
sturlamolden wrote:
> First, are you talking about rounding error (due to floating point arithmetic) or statistical sampling error?

About measured data. Rounding and sampling errors with special distributions are negligible. Thus by default assuming gaussian noise in x and y. (This may explain that factor of ~0.7 in the rectangle M.C. test.) The (x,y) points may not distribute "nicely" along the assumed regression diagonal.

> If you are talking about the latter, I suggest you look it up in a statistics textbook. E.g. if x and y are normally distributed, then t = r * sqrt( (n-2)/(1-r**2) ) has a Student t-distribution with n-2 degrees of freedom. And if you don't know how to get the p-value from that, you should not be messing with statistics anyway.

Yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big)? That same "const. * (1-r**2)/sqrt(n)" which I found in that other document?

The const. ~1 is less the problem. My main concern is how to respect the fact that the (x,y) points may not distribute well along the regression line. E.g. due to the nature of the experiment more points are around (0,0), but only few distribute along the interesting part of the diagonal, and thus few points have great effect on m & r. The above formulas will possibly not respect that. I could try a weighting technique, but maybe there is a (commonly used) speedy formula for r/r_err respecting that directly?

Robert

Nov 12 '06 #9
sturlamolden wrote:
> Robert Kern wrote:
>> The difference between the two models is that the first places no restrictions on the distribution of x. The second does; both the x and y marginal distributions need to be normal. Under the first model, the correlation coefficient has no meaning.
>
> That is not correct. The correlation coefficient is meaningful in both models, but must be interpreted differently. However, in both cases a correlation coefficient of 1 or -1 indicates an exact linear relationship between x and y. Under the first model ("linear regression"), the squared correlation coefficient is the "explained variance", i.e. the proportion of y's variance accounted for by the model y = m*x + o.

Point.

--
Robert Kern

Nov 12 '06 #11
robert wrote:
> I remember once I saw somewhere a formula for an error range of the corrcoef, but cannot find it anymore.

There is no such thing as "a formula for an error range" in a vacuum like that. Each formula has a model attached to it. If your data does not follow that model, then any such estimate of the error of your estimate is meaningless.

sturlamolden pointed out to me that I was wrong in thinking that the correlation coefficient was meaningless in the linear regression case. However, the error estimate that is used for bivariate normal correlation coefficients will almost certainly not apply to "correlation coefficient qua square root of the fraction of y's variance explained by the linear model". In that context, the "correlation coefficient" is not an estimate of some parameter. It has no uncertainty attached to it.

> http://en.wikipedia.org/wiki/Pearson...ficient#Trivia says: In MATLAB, corr(X) calculates Pearson's correlation coefficient along with p-value. Does anybody know how this prob.-value is computed/motivated? Such a thing would be very helpful for numpy/scipy too.

scipy.stats.pearsonr()

As with all such frequentist hypothesis testing nonsense, one takes the null hypothesis (in this case, a bivariate normal distribution of points with 0 correlation), finds the distribution of the test statistic given the number of points sampled, and then finds the probability of getting a test statistic "at least as extreme" as the test statistic you actually got.

> my old func is simply hands-on based on
> n, sum_x, sum_y, sum_xy, sum_xx, sum_yy = len(vx), vx.sum(), vy.sum(), (vx*vy).sum(), (vx*vx).sum(), (vy*vy).sum()
> Guess it's already fast for large data?

You really want to use a 2-pass algorithm to avoid numerical problems.

> Note: numpy.corrcoef strikes on 2 points:
> >>> numpy.corrcoef(([1,2],[1,2]))
> array([[ -1.#IND, -1.#IND],
>        [ -1.#IND, -1.#IND]])

This was fixed in SVN.

> PPS: A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.

Well, go ahead and poke the people that said they would have a Windows binary ready. Or provide one yourself.

--
Robert Kern

Nov 12 '06 #12
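The numerical point can be made concrete. A sketch contrasting the single-pass formula built from the running sums robert quotes with a 2-pass version (function names are mine): both agree on well-scaled data, but when the data carries a large offset relative to its spread, the single-pass sums cancel catastrophically while the 2-pass version holds up.

```python
import numpy as np

def corr_one_pass(x, y):
    # textbook single-pass formula from the running sums quoted above;
    # prone to catastrophic cancellation for offset data
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxy, sxx, syy = (x * y).sum(), (x * x).sum(), (y * y).sum()
    num = n * sxy - sx * sy
    den = np.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

def corr_two_pass(x, y):
    # pass 1: means; pass 2: products of centered values
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx * dx).sum() * (dy * dy).sum())

x = np.linspace(0.0, 1.0, 100)
y = x * x
r_ref = np.corrcoef(x, y)[0, 1]
# shift the data far from zero: the two-pass result barely moves,
# the one-pass result degrades badly
r_shifted = corr_two_pass(x + 1e8, y)
```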
robert wrote:
>> t = r * sqrt( (n-2)/(1-r**2) )
>
> Yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big)? That same "const. * (1-r**2)/sqrt(n)" which I found in that other document?

I gave you the formula. Solve for r and you get the confidence interval. You will need to use the inverse cumulative Student t distribution.

Another quick-and-dirty solution is to use bootstrapping:

```python
from numpy import mean, std, sum, sqrt, sort
from numpy.random import randint

def bootstrap_correlation(x, y):
    idx = randint(len(x), size=(1000, len(x)))
    bx = x[idx]  # resamples x with replacement
    by = y[idx]  # resamples y with replacement
    mx = mean(bx, 1)
    my = mean(by, 1)
    sx = std(bx, 1)
    sy = std(by, 1)
    r = sort(sum((bx - mx.repeat(len(x), 0).reshape(bx.shape))
               * (by - my.repeat(len(y), 0).reshape(by.shape)), 1)
             / ((len(x) - 1) * sx * sy))
    # bootstrap confidence interval (NB! biased)
    return (r[25], r[975])
```

> My main concern is how to respect the fact that the (x,y) points may not distribute well along the regression line.

The bootstrap is "non-parametric" in the sense that it is distribution free.

Nov 12 '06 #13
While I am at it, let's add the bootstrap estimate of the standard error as well:

```python
from numpy import mean, std, sum, sqrt, sort, corrcoef, tanh, arctanh
from numpy.random import randint

def bootstrap_correlation(x, y):
    idx = randint(len(x), size=(1000, len(x)))
    bx = x[idx]
    by = y[idx]
    mx = mean(bx, 1)
    my = mean(by, 1)
    sx = std(bx, 1)
    sy = std(by, 1)
    r = sort(sum((bx - mx.repeat(len(x), 0).reshape(bx.shape))
               * (by - my.repeat(len(y), 0).reshape(by.shape)), 1)
             / ((len(x) - 1) * sx * sy))
    # bootstrap confidence interval (NB! biased)
    conf_interval = (r[25], r[975])
    # bootstrap standard error using Fisher's z-transform (NB! biased)
    std_err = tanh(std(arctanh(r)) * (len(r) / (len(r) - 1.0)))
    return (std_err, conf_interval)
```

Since we are not using the bias-corrected and accelerated bootstrap, the standard error is the more meaningful estimate, although both estimates are biased.

I said that the bootstrap is "distribution free". That is not quite the case either. The assumed distribution is the provided sample distribution, i.e. a sum of Dirac delta functions. Often this is not a very sound assumption, and the bootstrap must then be improved, e.g. using the BCa procedure. That is particularly important for confidence intervals, but not so much for standard errors. However, for a quick-and-dirty standard error estimate, we can just ignore the small bias.

Now, if you wonder why I used the standard deviation to obtain the standard error (without dividing by sqrt(n)), the answer is that the standard error is the standard deviation of an estimator. As we have 1000 samples from the correlation's sampling distribution in r, we get the "standard error of the correlation" by taking the standard deviation of r. The z-transform is required before we can compute means and standard deviations of correlations.

Nov 12 '06 #14
sturlamolden wrote:
> I gave you the formula. Solve for r and you get the confidence interval. You will need to use the inverse cumulative Student t distribution.
>
> Another quick-and-dirty solution is to use bootstrapping.
> [bootstrap_correlation code snipped]
>
> The bootstrap is "non-parametric" in the sense that it is distribution free.

Thanks for the bootstrap tester. It confirms mainly the "r_stderr = (1-r**2)/sqrt(n)" formula. The asymmetry of r (-1..+1) is less of a problem. Yet my main problem, how to respect clumpy distribution in the data points, is still the same.

In practice, think of a situation where data out of an experiment has an unknown damping/filter (or whatever unknown data clumper) on it, thus lots of redundancy in effect. An extreme example is to just duplicate data:

```python
>>> x,  y  = [0.,0,0,0,1]*10,  [0.,1,1,1,1]*10
>>> xx, yy = [0.,0,0,0,1]*100, [0.,1,1,1,1]*100
>>> correlation(x, y)
(0.25, 0.132582521472, 0.25, 0.75)
>>> correlation(xx, yy)
(0.25, 0.0419262745781, 0.25, 0.75)
>>> bootstrap_correlation(array(x), array(y))
(0.148447544378, 0.375391432338)
>>> bootstrap_correlation(array(xx), array(yy))
(0.215668822617, 0.285633303438)
```

Here the bootstrap test will as well tell us that the confidence interval narrows down by a factor ~sqrt(10) - just the same as if there were 10-fold more well-distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not very good.

The interesting task is probably this: to check for linear correlation but "weight clumping of data" somehow for the error estimation. So far I can only think of a kind of geometric density approach... Or is there a commonly known straightforward approach/formula for this problem? In that formula which I remember weakly somehow - I think there were other basic sum terms like sum_xxy, sum_xyy, ... in it (which are not needed for the formula for r itself).

Robert

Nov 15 '06 #15
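The duplication effect above is easy to reproduce with plain numpy.corrcoef and the (1-r**2)/sqrt(N) rule of thumb from earlier in the thread, without the poster's `correlation()` helper:

```python
import numpy as np

x = np.array([0., 0, 0, 0, 1] * 10)
y = np.array([0., 1, 1, 1, 1] * 10)
xx = np.tile(x, 10)   # exactly the same data, duplicated 10-fold
yy = np.tile(y, 10)

r1 = np.corrcoef(x, y)[0, 1]
r2 = np.corrcoef(xx, yy)[0, 1]   # duplication leaves r unchanged

# the rule-of-thumb error estimate, however, shrinks by sqrt(10),
# even though the duplicates carry no new information
err1 = (1 - r1 ** 2) / np.sqrt(len(x))
err2 = (1 - r2 ** 2) / np.sqrt(len(xx))
```

This is exactly the problem described above: any estimator that only sees N and r treats redundant points as independent evidence.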