473,721 Members | 1,810 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

numpy/scipy: correlation

Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, m-err, o, o-err, r-coef,r-coef-err ?

Or a formula to to compute the 3 error ranges?

-robert

PS:

numpy.corrcoef computes only the bare coeff:
>>numpy.corrcoe f((0,1,2,3.0),( 2,5,6,7.0),)
array([[ 1. , 0.95618289],
[ 0.95618289, 1. ]])

with ints it goes computes wrong:
>>numpy.corrcoe f((0,1,2,3),(2, 5,6,7),)
array([[ 1. , 0.94491118],
[ 0.94491118, 1. ]])

Nov 11 '06 #1
18 22395
robert wrote:
Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, m-err, o, o-err, r-coef,r-coef-err ?
numpy and scipy questions are best asked on their lists, not here. There are a
number of people who know the capabilities of numpy and scipy through and
through, but most of them don't hang out on comp.lang.pytho n.

http://www.scipy.org/Mailing_Lists

scipy.optimize. leastsq() can be told to return the covariance matrix of the
estimated parameters (m and o in your example; I have no idea what you think
r-coeff is).

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 11 '06 #2
Robert Kern wrote:
robert wrote:
>Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, m-err, o, o-err, r-coef,r-coef-err ?
scipy.optimize. leastsq() can be told to return the covariance matrix of the
estimated parameters (m and o in your example; I have no idea what you think
r-coeff is).
Ah, the correlation coefficient itself. Since correlation coefficients are weird
beasts constrained to [-1, 1], standard gaussian errors like you are expecting
for m-err and o-err don't apply. No, there's currently no function in numpy or
scipy that will do something sophisticated enough to be reliable. Here's an option:

http://www.pubmedcentral.nih.gov/art...i?artid=155684

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 11 '06 #3
robert wrote:
Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, m-err, o, o-err, r-coef,r-coef-err ?
And of course, those three parameters are not particularly meaningful together.
If your model is truly "y is a linear response given x with normal noise" then
"y=m*x+o" is correct, and all of the information that you can get from the data
will be found in the estimates of m and o and the covariance matrix of the
estimates.

On the other hand, if your model is that "(x, y) is distributed as a bivariate
normal distribution" then "y=m*x+o" is not a particularly good representation of
the model. You should instead estimate the mean vector and covariance matrix of
(x, y). Your correlation coefficient will be the off-diagonal term after
dividing out the marginal standard deviations.

The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 11 '06 #4
Robert Kern wrote:
robert wrote:
>Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, m-err, o, o-err, r-coef,r-coef-err ?

And of course, those three parameters are not particularly meaningful together.
If your model is truly "y is a linear response given x with normal noise" then
"y=m*x+o" is correct, and all of the information that you can get from the data
will be found in the estimates of m and o and the covariance matrix of the
estimates.

On the other hand, if your model is that "(x, y) is distributed as a bivariate
normal distribution" then "y=m*x+o" is not a particularly good representation of
the model. You should instead estimate the mean vector and covariance matrix of
(x, y). Your correlation coefficient will be the off-diagonal term after
dividing out the marginal standard deviations.

The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.
Think the difference is little in practice - when you head for usable diagonals.
Looking at the bivar. coef first before going on to any models, seems to be a more stable approach for the first step in data mining. ( before you proceed to a model or to class-learning .. )

Basically the first need is to analyse lots of x,y data and check for linear dependencies. No real model so far. I'd need a quality measure (coef**2) and to know how much I can rely on it (coef-err). coef alone is not enough. You get a perfect 1.0 with 2 ( or 3 - see below ) points.
With big coef's and lots of distributed data the coef is very good by itself - its error range err(N) only approx ~ 1/sqrt(N)

One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not? Thus lower coefs require naturally a coef-err to be useful in practice.

Now think of adding 'boring data':
>>X=[1.,2,3,4]
Y=[1.,2,3,5]
sd.correlatio n((X,Y)) # my old func
(1.3, -0.5, 0.982707629824) # m,o,coef
>>numpy.corrcoe f((X,Y))
array([[ 1. , 0.98270763],
[ 0.98270763, 1. ]])
>>XX=[1.,1,1,1,1,2,3, 4]
YY=[1.,1,1,1,1,2,3, 5]
sd.correlatio n((XX,YY))
(1.23684210526, -0.289473684211, 0.988433774639)
>>>
I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...

I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.

http://en.wikipedia.org/wiki/Pearson...ficient#Trivia
says:
In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.

Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.

http://links.jstor.org/sici?sici=016...3E2.0.CO%3B2-Y
tells:

probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from

0.0115628571429 to 0.0054845341095 4 !

And the absolute size of this error measure seems to be too low for just 4 points of data!

The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
Robert

PS:

my old func is simply hands-on based on
n,sum_x,sum_y,s um_xy,sum_xx,su m_yy=len(vx),vx .sum(),vy.sum() ,(vx*vy).sum(), (vx*vx).sum(),( vy*vy).sum()
Guess its already fast for large data?
Note: numpy.corrcoef strikes on 2 points:
>>numpy.corrcoe f(([1,2],[1,2]))
array([[ -1.#IND, -1.#IND],
[ -1.#IND, -1.#IND]])
>>sd.correlatio n(([1,2],[1,2]))
(1, 0, 1.0)
>>>
numpy.corrcoe f(([1,2,3],[1,2,3]))
array([[ 1., 1.],
[ 1., 1.]])
>>sd.correlatio n(([1,2,3],[1,2,3]))
(1, 0, 1.0)
PPS:

A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
Nov 12 '06 #5

Robert Kern wrote:
The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.
That is not correct. The correlation coefficient is meaningful in both
models, but must be interpreted differently. However, in both cases a
correlation coefficient of 1 or -1 indicates an exact linear
relationship between x and y.

Under the first model ("linear regression"), the squared correlation
coefficient is the "explained variance", i.e. the the proportion of y's
variance accounted for by the model y = m*x + o.

Under the second model ("multivaria te normal distribution"), the
correlation coefficient is the covariance of y and x divided by the
product of the standard deviations, cov(x,y)/(std(x)*std(y)) .

Nov 12 '06 #6
robert wrote:
Robert Kern wrote:
http://links.jstor.org/sici?sici=016...3E2.0.CO%3B2-Y

tells:
probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - quite what I expected above roughly for
the N-only dep.. But thus it is not sensitive to above considerations
about 'boring' data. With above example it would spit a decrease of this
probable coef-err from
0.0115628571429 to 0.0054845341095 4 !
This 1929 formula for estimating the error of correlation coefficient seems to make some sense for r=0 .
I do monte carlo on correlating random series:
>>X=numpy.rando m.random(10000)
l=[]
for i in range(200):
.... Z=numpy.random. random(10000)
.... l.append( sd.correlation( (X,Z))[2] ) #collect coef's
....
>>mean(l)
0.0003276570822 34
>>std(l)
0.0109120766158 # thats how the coef jitters
>>std(l)/sqrt(len(l))
0.0007716003371 85
>>len(l)
200

# now
# 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
# vs M.C. 0.0109120766158 ± 0.0007716003371 85
but the fancy factor of 0.6745 is significantly just fancy for r=0.

then for a higher (0.5) correlation:
>>l=[]
for i in range(200):
.... Z=numpy.random. random(10000)+a rray(range(1000 0))/10000.0
.... l.append( sd.correlation( (X+array(range( 10000))/10000.0,Z))[2] )
....
>>mean(l)
0.498905642552
>>std(l)
0.0054697958316 3
>>std(l)/sqrt(len(l))
0.0003867729724 25

#now:
# 0.6745*(1-r**2)/sqrt(N) = 0.0051217322484 9)
# vs M.C. 0.0054697958316 3 ± 0.0003867729724 25

=there the 0.6745 factor and (1-r**2) seem to get the main effect ! There is something in it.

--

Now adding boring data:
>>boring=ones(1 0001)*0.5
X=numpy.rando m.random(10000)
l=[]
for i in range(200):
.... Z=concatenate(( numpy.random.ra ndom(10000)+arr ay(range(10000) )/10000.0,boring) )
.... l.append( sd.correlation( (concatenate((X +array(range(10 000))/10000.0,boring) ),Z))[2] )
....
>>mean(l)
0.712753628489 # r
>>std(l)
0.0031616364988 8 # r_err
>>std(l)/sqrt(len(l))
0.0002235614608

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.0023445997146 1 #N=20000
# vs M.C. streuung 0.0031616364988 8 ± 0.0002235614608

=the boring data has an effect on coef-err which is significantly not reflected by the formula 0.6745*(1-r**2)/sqrt(N)

=I'll use this formula to get a downside error estimate for the correlation coefficient:

------------------------------------------
| r_err_down ~= 1.0 * (1-r**2)/sqrt(N) |
------------------------------------------

(until I find a better one respecting the actual distribution of data)

Would be interesting what MATLAB & Octave say ...
-robert
Nov 12 '06 #7

First, are you talking about rounding error (due to floating point
arithmetics) or statistical sampling error?

If you are talking about the latter, I suggest you look it up in a
statistics text book. E.g. if x and y are normally distributed, then

t = r * sqrt( (n-2)/(1-r**2) )

has a Student t-distribution with n-2 degrees of freedom. And if you
don't know how to get the p-value from that, you should not be messing
with statistics anyway.

Nov 12 '06 #8
sturlamolden wrote:
First, are you talking about rounding error (due to floating point
arithmetics) or statistical sampling error?
About measured data. rounding and sampling errors with special distrutions are neglegible. Thus by default assuming gaussian noise in x and y.
(This may explain that factor of ~0.7 in the rectangle M.C. test)
The (x,y) points may not distribute "nicely" along the assumed regression diagonale.
If you are talking about the latter, I suggest you look it up in a
statistics text book. E.g. if x and y are normally distributed, then

t = r * sqrt( (n-2)/(1-r**2) )

has a Student t-distribution with n-2 degrees of freedom. And if you
don't know how to get the p-value from that, you should not be messing
with statistics anyway.
yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big) ?
that same "const. * (1-r**2)/sqrt(n)" which I found in that other document ?

The const. ~1 is less the problem.

My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line. E.g. due to the nature of the experiment more points are around (0,0) but only few distribute along the interesting part of the diagonale and thus few points have great effect on m & r. above formulas will possibly not respect that.
I could try a weighting technique, but maybe there is a (commonly used) speedy formula for r/r_err respecting that directly?
Robert
Nov 12 '06 #9
On 11/12/06, robert <no*****@no-spam-no-spam.invalidwro te:
Robert Kern wrote:
robert wrote:
(...)
One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not? Thus lower coefs require naturally a coef-err to be useful in practice.
(...)
I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.
(...)
>
Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.

There is a transformation of the correlation coefficient that is
distributed as a t-statistic under the null. This was derived a long
way back, and is the usual, standard, way to test for significance of
(attach a p-value to) correlation coefficients. I do not recall the
formula from top of my head, but I am sure any google search will find
it. And of course, any decent intro stats textbook will also provide
it. You can look in your nearby library for popular textbooks such as
Snedecor & Cochran, or Sokal & Rohlf, or the wonderful "Beyond Anova"
by Miller, which do have the expression.

Since this is such a common procedure all stats programs do provide
for tests of significance of correlation coefficients. A whole bunch
of them are free, and one is widely used: R (check
http://cran.r-project.org). This is an easy way for you to test the
output of your stats code in Python.

Now, since you can do a significance test, you can invert the
procedure and obtain a confidence interval (of whichever width you
want) for your correlation coefficients. This CI will (almost always)
be assymmetric. (And recall that CI do have a not-obvious
interpretation) . CI of correlation coefficients are not as common as
p-values, but this is doable too.

The expression (and "improved versions" of it, I think Sokal & Rohlf
show several) is easy to compute. And I bet obtaining the density of a
t-distribution with k d.f. is also available already for Python. So
this part is definitely doable.
Best,

R.

(...)
Now think of adding 'boring data':
>X=[1.,2,3,4]
Y=[1.,2,3,5]
sd.correlation ((X,Y)) # my old func
(1.3, -0.5, 0.982707629824) # m,o,coef
>numpy.corrcoef ((X,Y))
array([[ 1. , 0.98270763],
[ 0.98270763, 1. ]])
>XX=[1.,1,1,1,1,2,3, 4]
YY=[1.,1,1,1,1,2,3, 5]
sd.correlation ((XX,YY))
(1.23684210526, -0.289473684211, 0.988433774639)
>>

I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...

http://en.wikipedia.org/wiki/Pearson...ficient#Trivia
says:
In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.
http://links.jstor.org/sici?sici=016...3E2.0.CO%3B2-Y
tells:

probable error of r = 0.6745*(1-r**2)/sqrt(N)

A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from

0.0115628571429 to 0.0054845341095 4 !

And the absolute size of this error measure seems to be too low for just 4 points of data!

The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
Robert

PS:

my old func is simply hands-on based on
n,sum_x,sum_y,s um_xy,sum_xx,su m_yy=len(vx),vx .sum(),vy.sum() ,(vx*vy).sum(), (vx*vx).sum(),( vy*vy).sum()
Guess its already fast for large data?
Note: numpy.corrcoef strikes on 2 points:
>numpy.corrcoef (([1,2],[1,2]))
array([[ -1.#IND, -1.#IND],
[ -1.#IND, -1.#IND]])
>sd.correlation (([1,2],[1,2]))
(1, 0, 1.0)
>>
numpy.corrcoef (([1,2,3],[1,2,3]))
array([[ 1., 1.],
[ 1., 1.]])
>sd.correlation (([1,2,3],[1,2,3]))
(1, 0, 1.0)
PPS:

A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
--
http://mail.python.org/mailman/listinfo/python-list

--
Ramon Diaz-Uriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Nov 12 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
2589
by: jason | last post by:
Hello: I am using the following versions of Python and packages on Windows XP (SP2): Python 2.4.2 NumPy 0.9.4.win32-py2.4 SciPy 0.4.4 for Python 2.4 and Pentium 4/SSE2 In the Python Shell I am running the following:
20
2582
by: mclaugb | last post by:
Has anyone recompiled the Scientific Computing package using NumPy instead of Numeric? I need a least squares algorithm and a Newton Rhaphson algorithm which is contained in Numeric but all the documentation out there says that Numeric is crap and all code should be using NumPy. Thanks, Bryan
1
1524
by: drife | last post by:
Hello, I use the Python Numeric package extensively, and had been an avid user of the "old" scipy. In my view, both pieces of software are truly first rate, and have greatly improved my productivity in the area of scientific analysis. Thus, I was excited to make the transition to the new scipy core (numpy). Unfortunately, I am experiencing a problem that I cannot sort out. I am running Python 2.4.2 on a Debian box (V3.1), using
3
3610
by: Iljya | last post by:
Hello, I need to pickle the type numpy.float32 but encounter an error when I try to do so. I am able to pickle the array itself, it is specifically the type that I cannot pickle. I am using: Numpy version: 0.9.4 Python version: 2.4.3 Windows XP
15
2522
by: greg.landrum | last post by:
After using numeric for almost ten years, I decided to attempt to switch a large codebase (python and C++) to using numpy. Here's are some comments about how that went. - The code to automatically switch python stuff over just kind of works. But it was a 90% solution, I could do the rest by hand. Of course, the problem is that then the code is still using the old numeric API, so it's not a long term solution. Unfortunately, to switch to...
2
3940
by: robert | last post by:
I'm using latest numpy & scipy. What is this problem ? : RuntimeError: module compiled against version 1000002 of C-API but this version of numpy is 1000009 Traceback (most recent call last): File "<interactive input>", line 1, in ? File "C:\PYTHON23\Lib\site-packages\scipy\stats\__init__.py", line 7, in ? from stats import * File "C:\PYTHON23\Lib\site-packages\scipy\stats\stats.py", line 191, in ? import scipy.special as special File...
1
1613
by: 1960_j | last post by:
I have tried to install numpy and scipy on python 5.2. Using gcc 2.95.3, lapack 3.1.1 and ATLAS 3.6.0. When installin numpy it seems to work but when I try to run test get error no test for numpy. When I try to Install scipy only get error. Any ideas on how to install would be appreciated. Thanks
4
2129
by: Richard_Martineau | last post by:
Hello All I wonder if anyone can advise me or has done similar to the following? Basically I've downloaded the Python 2.5.2 source code that builds with Visual Studio 6.0. I've built Python for windows. This was easy (it even came with the pcbuild.dsw workspace file). Great! Now comes the troubled bit...I now look for similar source code for Python extensions Numpy and Scipy but the source code and directories
3
4218
by: Slaunger | last post by:
I know there must be a simple method to do this. I have implemented this function for calculating a checksum based on a ones complement addition: def complement_ones_checksum(ints): """ Returns a complements one checksum based on a specified numpy.array of dtype=uint16 """
0
8728
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
1
9130
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9059
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8005
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development projectplanning, coding, testing, and deploymentwithout human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6668
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupr who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5977
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4484
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4751
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3187
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.