Is there a ready-made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, merr, o, oerr, rcoef, rcoeferr ?
Or a formula to compute the 3 error ranges?
robert
PS:
numpy.corrcoef computes only the bare coefficient:
>>> numpy.corrcoef((0,1,2,3.0),(2,5,6,7.0),)
array([[ 1.        ,  0.95618289],
       [ 0.95618289,  1.        ]])
with ints it computes wrong:
>>> numpy.corrcoef((0,1,2,3),(2,5,6,7),)
array([[ 1.        ,  0.94491118],
       [ 0.94491118,  1.        ]])
robert wrote:
Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, merr, o, oerr, rcoef,rcoeferr ?
numpy and scipy questions are best asked on their lists, not here. There are a
number of people who know the capabilities of numpy and scipy through and
through, but most of them don't hang out on comp.lang.python. http://www.scipy.org/Mailing_Lists
scipy.optimize.leastsq() can be told to return the covariance matrix of the
estimated parameters (m and o in your example; I have no idea what you think
rcoeff is).
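(For the m, merr, o, oerr part, here is a minimal numpy-only sketch for illustration; it uses np.polyfit instead of leastsq, but the idea is the same: read the parameter errors off the diagonal of the covariance matrix. Note that numpy applies its own residual-based scaling to that matrix.)

```python
import numpy as np

# the example data from the original post
x = np.array([0., 1, 2, 3])
y = np.array([2., 5, 6, 7])

# np.polyfit with cov=True returns the coefficients of y = m*x + o
# and a covariance matrix for them; the square roots of its diagonal
# give the standard errors merr and oerr
(m, o), V = np.polyfit(x, y, 1, cov=True)
merr, oerr = np.sqrt(np.diag(V))

r = np.corrcoef(x, y)[0, 1]
```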

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
Robert Kern wrote:
robert wrote:
>Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast: m, merr, o, oerr, rcoef,rcoeferr ?
scipy.optimize.leastsq() can be told to return the covariance matrix of the
estimated parameters (m and o in your example; I have no idea what you think
rcoeff is).
Ah, the correlation coefficient itself. Since correlation coefficients are weird
beasts constrained to [-1, 1], standard gaussian errors like you are expecting
for merr and oerr don't apply. No, there's currently no function in numpy or
scipy that will do something sophisticated enough to be reliable. Here's an option: http://www.pubmedcentral.nih.gov/art...i?artid=155684

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
robert wrote:
Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
m, merr, o, oerr, rcoef,rcoeferr ?
And of course, those three parameters are not particularly meaningful together.
If your model is truly "y is a linear response given x with normal noise" then
"y=m*x+o" is correct, and all of the information that you can get from the data
will be found in the estimates of m and o and the covariance matrix of the
estimates.
On the other hand, if your model is that "(x, y) is distributed as a bivariate
normal distribution" then "y=m*x+o" is not a particularly good representation of
the model. You should instead estimate the mean vector and covariance matrix of
(x, y). Your correlation coefficient will be the off-diagonal term after
dividing out the marginal standard deviations.
The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.

Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
Robert Kern wrote:
robert wrote:
>Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast: m, merr, o, oerr, rcoef,rcoeferr ?
And of course, those three parameters are not particularly meaningful together.
If your model is truly "y is a linear response given x with normal noise" then
"y=m*x+o" is correct, and all of the information that you can get from the data
will be found in the estimates of m and o and the covariance matrix of the
estimates.
On the other hand, if your model is that "(x, y) is distributed as a bivariate
normal distribution" then "y=m*x+o" is not a particularly good representation of
the model. You should instead estimate the mean vector and covariance matrix of
(x, y). Your correlation coefficient will be the off-diagonal term after
dividing out the marginal standard deviations.
The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.
I think the difference is little in practice - when you head for usable diagonals.
Looking at the bivariate coef first, before going on to any model, seems to be a more stable approach for the first step in data mining (before you proceed to a model or to class-learning ...).
Basically the first need is to analyse lots of x,y data and check for linear dependencies. No real model so far. I'd need a quality measure (coef**2) and to know how much I can rely on it (coeferr). coef alone is not enough: you get a perfect 1.0 with 2 (or 3 - see below) points.
With big coefs and lots of distributed data the coef is very good by itself - its error range err(N) drops only approx. ~ 1/sqrt(N).
One would expect the error range to drop simply with the # of points. Yet it depends more complexly on the mean value of the coef and on the distribution as a whole.
More interesting real-world cases: for example I see a lower correlation on lots of points - maybe coef=0.05. Got it - or not? Thus lower coefs naturally require a coeferr to be useful in practice.
Now think of adding 'boring data':
>>> X=[1.,2,3,4]; Y=[1.,2,3,5]
>>> sd.correlation((X,Y))   # my old func
(1.3, -0.5, 0.982707629824)   # m, o, coef
>>> numpy.corrcoef((X,Y))
array([[ 1.        ,  0.98270763],
       [ 0.98270763,  1.        ]])
>>> XX=[1.,1,1,1,1,2,3,4]; YY=[1.,1,1,1,1,2,3,5]
>>> sd.correlation((XX,YY))
(1.23684210526, -0.289473684211, 0.988433774639)
>>>
I'd expect: the little increase of r is ok. But this 'boring data' should not make the error go down simply ~1/sqrt(N) ...
I remember once seeing somewhere a formula for an error range of the corrcoef, but cannot find it anymore. http://en.wikipedia.org/wiki/Pearson...ficient#Trivia
says:
In MATLAB, corr(X) calculates Pearson's correlation coefficient along with the p-value.
Does anybody know how this p-value is computed/motivated? Such a thing would be very helpful for numpy/scipy too. http://links.jstor.org/sici?sici=016...3E2.0.CO%3B2Y
tells:
probable error of r = 0.6745*(1-r**2)/sqrt(N)
A simple function of r and N - quite what I expected above, roughly, for the N-only dependence. But thus it is not sensitive to the above considerations about 'boring' data. With the above example it would spit out a decrease of this probable coeferr from
0.0115628571429 to 0.00548453410954 !
And the absolute size of this error measure seems to be too low for just 4 points of data!
The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
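(That probable-error formula fits in a couple of lines; for illustration, here it reproduces both numbers quoted above from the two data sets:)

```python
import numpy as np

def probable_error(r, n):
    # the classic "probable error of r" formula quoted above
    return 0.6745*(1 - r**2)/np.sqrt(n)

r1 = np.corrcoef([1., 2, 3, 4], [1., 2, 3, 5])[0, 1]
r2 = np.corrcoef([1., 1, 1, 1, 1, 2, 3, 4], [1., 1, 1, 1, 1, 2, 3, 5])[0, 1]
pe1 = probable_error(r1, 4)
pe2 = probable_error(r2, 8)
```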
Robert
PS:
my old func is simply hands-on, based on
n,sum_x,sum_y,sum_xy,sum_xx,sum_yy = len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
Guess it's already fast for large data?
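(sd.correlation is robert's own module, so the following is only a guess at its internals - a sum-based sketch that reproduces the (m, o, coef) tuples shown in this thread:)

```python
import numpy as np

def correlation(xy):
    """Return (m, o, r) for y = m*x + o from one pass of sums.

    A reconstruction of what sd.correlation((X, Y)) presumably does;
    the sum-based formulas themselves are standard."""
    vx, vy = np.asarray(xy[0], float), np.asarray(xy[1], float)
    n = len(vx)
    sum_x, sum_y = vx.sum(), vy.sum()
    sum_xy, sum_xx, sum_yy = (vx*vy).sum(), (vx*vx).sum(), (vy*vy).sum()
    sxx = sum_xx - sum_x*sum_x/n      # n * var(x)
    syy = sum_yy - sum_y*sum_y/n      # n * var(y)
    sxy = sum_xy - sum_x*sum_y/n      # n * cov(x, y)
    m = sxy/sxx
    o = (sum_y - m*sum_x)/n
    r = sxy/np.sqrt(sxx*syy)
    return m, o, r
```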
Note: numpy.corrcoef strikes on 2 points:
>>> numpy.corrcoef(([1,2],[1,2]))
array([[ 1.#IND,  1.#IND],
       [ 1.#IND,  1.#IND]])
>>> sd.correlation(([1,2],[1,2]))
(1, 0, 1.0)
>>> numpy.corrcoef(([1,2,3],[1,2,3]))
array([[ 1.,  1.],
       [ 1.,  1.]])
>>> sd.correlation(([1,2,3],[1,2,3]))
(1, 0, 1.0)
PPS:
A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. I think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
Robert Kern wrote:
The difference between the two models is that the first places no restrictions
on the distribution of x. The second does; both the x and y marginal
distributions need to be normal. Under the first model, the correlation
coefficient has no meaning.
That is not correct. The correlation coefficient is meaningful in both
models, but must be interpreted differently. However, in both cases a
correlation coefficient of 1 or -1 indicates an exact linear
relationship between x and y.
Under the first model ("linear regression"), the squared correlation
coefficient is the "explained variance", i.e. the proportion of y's
variance accounted for by the model y = m*x + o.
Under the second model ("multivariate normal distribution"), the
correlation coefficient is the covariance of y and x divided by the
product of the standard deviations, cov(x,y)/(std(x)*std(y)).
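(Both readings give the same number, which a few lines of numpy can illustrate on made-up data: r**2 from corrcoef equals the explained variance of the least-squares fit, and r equals cov(x,y)/(std(x)*std(y)).)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.3*x - 0.5 + rng.normal(scale=0.8, size=200)

r = np.corrcoef(x, y)[0, 1]

# regression view: r**2 is the proportion of y's variance
# accounted for by the fitted line
m, o = np.polyfit(x, y, 1)
ss_res = ((y - (m*x + o))**2).sum()
ss_tot = ((y - y.mean())**2).sum()
explained = 1 - ss_res/ss_tot

# bivariate view: r is cov(x, y) over the product of std devs
r_biv = np.cov(x, y)[0, 1]/(x.std(ddof=1)*y.std(ddof=1))
```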
robert wrote:
Robert Kern wrote: http://links.jstor.org/sici?sici=016...3E2.0.CO%3B2Y
tells:
probable error of r = 0.6745*(1-r**2)/sqrt(N)
A simple function of r and N - quite what I expected above roughly for
the N-only dependence. But thus it is not sensitive to the above considerations
about 'boring' data. With the above example it would spit out a decrease of this
probable coeferr from
0.0115628571429 to 0.00548453410954 !
This 1929 formula for estimating the error of the correlation coefficient seems to make some sense for r=0.
I do Monte Carlo on correlating random series:
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):
....    Z=numpy.random.random(10000)
....    l.append( sd.correlation((X,Z))[2] )   # collect coefs
....
>>> mean(l)
0.000327657082234
>>> std(l)
0.0109120766158   # that's how the coef jitters
>>> std(l)/sqrt(len(l))
0.000771600337185
>>> len(l)
200
# now
# 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
# vs M.C. 0.0109120766158 ± 0.000771600337185
but the fancy factor of 0.6745 is significantly just fancy for r=0.
then for a higher (0.5) correlation:
>>> l=[]
>>> for i in range(200):
....    Z=numpy.random.random(10000)+array(range(10000))/10000.0
....    l.append( sd.correlation((X+array(range(10000))/10000.0,Z))[2] )
....
>>> mean(l)
0.498905642552
>>> std(l)
0.00546979583163
>>> std(l)/sqrt(len(l))
0.000386772972425
# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849
# vs M.C. 0.00546979583163 ± 0.000386772972425
=> there the 0.6745 factor and (1-r**2) seem to get the main effect! There is something in it.

Now adding boring data:
>>> boring=ones(10001)*0.5
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):
....    Z=concatenate((numpy.random.random(10000)+array(range(10000))/10000.0,boring))
....    l.append( sd.correlation((concatenate((X+array(range(10000))/10000.0,boring)),Z))[2] )
....
>>> mean(l)
0.712753628489   # r
>>> std(l)
0.00316163649888   # r_err
>>> std(l)/sqrt(len(l))
0.0002235614608
# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461   # N=20000
# vs M.C. spread 0.00316163649888 ± 0.0002235614608
=> the boring data has an effect on coeferr which is significantly not reflected by the formula 0.6745*(1-r**2)/sqrt(N)
=> I'll use this formula to get a downside error estimate for the correlation coefficient:

    r_err_down ~= 1.0 * (1-r**2)/sqrt(N)

(until I find a better one respecting the actual distribution of the data)
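(That downside estimate, plus a smaller self-contained rerun of the r=0 Monte Carlo above - n=2000 and 100 repetitions instead of 10000 and 200, so the numbers differ, but the spread of r should again come out near 1/sqrt(n):)

```python
import numpy as np

def r_err_down(r, n):
    # the downside error estimate proposed above
    return (1 - r**2)/np.sqrt(n)

# Monte Carlo check on independent (r ~ 0) uniform data:
# the observed spread of r should be roughly 1/sqrt(n) ~ 0.022
rng = np.random.default_rng(42)
n, runs = 2000, 100
x = rng.random(n)
rs = [np.corrcoef(x, rng.random(n))[0, 1] for _ in range(runs)]
spread = np.std(rs)
```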
Would be interesting what MATLAB & Octave say ...
robert
First, are you talking about rounding error (due to floating point
arithmetic) or statistical sampling error?
If you are talking about the latter, I suggest you look it up in a
statistics textbook. E.g. if x and y are normally distributed, then
t = r * sqrt( (n-2)/(1-r**2) )
has a Student t-distribution with n-2 degrees of freedom. And if you
don't know how to get the p-value from that, you should not be messing
with statistics anyway.
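(For reference, that test in a few lines of numpy, on the data from the opening post; the p-value line in the comment assumes scipy is installed:)

```python
import numpy as np

x = np.array([0., 1, 2, 3])
y = np.array([2., 5, 6, 7])
n = len(x)
r = np.corrcoef(x, y)[0, 1]

# t-statistic with n-2 degrees of freedom, for H0: rho = 0
t = r*np.sqrt((n - 2)/(1 - r**2))
# two-sided p-value, if scipy is available:
#   p = 2*scipy.stats.t.sf(abs(t), n - 2)
```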
sturlamolden wrote:
First, are you talking about rounding error (due to floating point
arithmetic) or statistical sampling error?
About measured data. Rounding and sampling errors with special distributions are negligible. Thus by default assuming gaussian noise in x and y.
(This may explain that factor of ~0.7 in the rectangle M.C. test.)
The (x,y) points may not distribute "nicely" along the assumed regression diagonal.
If you are talking about the latter, I suggest you look it up in a
statistics textbook. E.g. if x and y are normally distributed, then
t = r * sqrt( (n-2)/(1-r**2) )
has a Student t-distribution with n-2 degrees of freedom. And if you
don't know how to get the p-value from that, you should not be messing
with statistics anyway.
Yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big)?
That same "const. * (1-r**2)/sqrt(n)" which I found in that other document?
The const. ~1 is less the problem.
My main concern is how to respect the fact that the (x,y) points may not distribute well along the regression line. E.g. due to the nature of the experiment more points are around (0,0), but only few distribute along the interesting part of the diagonal, and thus few points have a great effect on m & r. The above formulas will possibly not respect that.
I could try a weighting technique, but maybe there is a (commonly used) speedy formula for r/r_err respecting that directly?
Robert
On 11/12/06, robert <no*****@nospamnospam.invalid> wrote:
Robert Kern wrote:
robert wrote:
(...)
One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
More interesting realworld cases: For example I see a lower correlation on lots of points  maybe coef=0.05 . Got it  or not? Thus lower coefs require naturally a coeferr to be useful in practice.
(...)
I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.
(...)
>
Does anybody know how this prob.value is computed/motivated? Such thing would be very helpful for numpy/scipy too.
There is a transformation of the correlation coefficient that is
distributed as a t-statistic under the null. This was derived a long
way back, and is the usual, standard way to test for significance of
(attach a p-value to) correlation coefficients. I do not recall the
formula off the top of my head, but I am sure any google search will find
it. And of course, any decent intro stats textbook will also provide
it. You can look in your nearby library for popular textbooks such as
Snedecor & Cochran, or Sokal & Rohlf, or the wonderful "Beyond ANOVA"
by Miller, which do have the expression.
Since this is such a common procedure, all stats programs provide
for tests of significance of correlation coefficients. A whole bunch
of them are free, and one is widely used: R (check http://cran.r-project.org). This is an easy way for you to test the
output of your stats code in Python.
Now, since you can do a significance test, you can invert the
procedure and obtain a confidence interval (of whichever width you
want) for your correlation coefficients. This CI will (almost always)
be asymmetric. (And recall that CIs do have a not-obvious
interpretation.) CIs of correlation coefficients are not as common as
p-values, but this is doable too.
The expression (and "improved versions" of it, I think Sokal & Rohlf
show several) is easy to compute. And I bet obtaining the density of a
t-distribution with k d.f. is also available already for Python. So
this part is definitely doable.
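(One standard route for such a confidence interval - not named explicitly in the post - is the Fisher z-transform; a sketch, and note how the interval indeed comes out asymmetric around r:)

```python
import numpy as np

def r_confint(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation
    coefficient via the Fisher z-transform: atanh(r) is roughly
    normal with standard error 1/sqrt(n-3)."""
    z = np.arctanh(r)
    se = 1.0/np.sqrt(n - 3)
    return np.tanh(z - z_crit*se), np.tanh(z + z_crit*se)

lo, hi = r_confint(0.5, 100)
```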
Best,
R.

Ramon DiazUriarte
Statistical Computing Team
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO) http://ligarto.org/rdiaz