Hi there,
First of all apologies for the long post. Hope someone can offer some advice.
I have about 200 columns of time series data that I need to perform a correlation analysis on in terms calculating lagged cross correlations between all the variables.
Currently the data is stored in Excel. Variable 1 is in Column A, variable 2 in Column B etc, the data ends in column GR which contains the last variable. I need to calculate the cross correlations between variable 1 in Column A and all other variables in the other remaining columns. The key thing here is that I need to calculate lagged as well as contemperaous correlations between each of the variables.
Then I want to do the same for the next variable in Column B, i.e. calculate contemperaous and lagged correlations between variable 2 and all other variables.
Is there anyway to automate this so that it work out the correlations up to say 10 lags of for each variable against all other variables and paste the results in some kind of readable grid in an excel sheet for each variable:
For example, for Column A - Variable 1 the results could be shown like this (for the first five variables):
B C D E F
Lag0 X X X X X
Lag1 X X X X X
Lag2 X X X X X
Lag3 X X X X X
Lag4 X X X X X
Lag5 X X X X X
Lag6 X X X X X
Lag7 X X X X X
Lag8 X X X X X
Lag9 X X X X X
Lag10 X X X X X
Where X represents the correlation coefficient between variable 1 and all the other variables (col B, C, D, E, F to Col GR) at different lags. The output doesn't have to be exactly like this though.
I can do this manually in excel using the Correl function, adjusting the series range in the formula every time I want to work out the lagged correlation.
Background: The correlation the function takes two arguments:
Array1, Array2
So to calculate correlations for Column A against all other columns I would input:
correl (A2:A100, B2:B100) to give the contemperous correlation between Col A and Col B, correl (A2:A100, c2:c100) to give correlation for col C vs. Col A etc.
Then to work out the correlation at 1 Lag I change the formula to (A3:A100, B2:B99), at lag 2 the formula becomes (A4:A100, B2:B98) etc. Then I do this for all the other columns/Lags but to do this for 200 variables, would take a huge amount of time (probably several days)! Is there a way of doing it using VBA perhaps? Even if it takes a few hours that's OK, as I can leave it to run in the background.
The way I'm thinking of doing this is setting up a workbook with the following sheets:
1) Source Sheet - this would contain all the raw data (Columns 1-200)as described in the top of the post.
2) Calculation sheet - Here I would have a duplicate of the above sheet but with one extra column (say column A) and so the dataset would end in column 201). Beneath this matrix I would set up formulas using the Excel’s correl function to work out the lagged correlation between the variable in column 1 and all other variables to say ten lags. Effectively column one would be variable column whereas all the data in the other columns would be fixed. I could then copy each variable from the source sheet and paste into column 1, work the correlations and then move to the next variable
3) Output Sheet - this would store the results of the correlations for each variable (effectively replicating the output grid I described earlier).
Would this be the best way of doing what I need? If so how easy is it to write the code? Basically I need code that will take the first variable from the source sheet, paste it into column A in the calculation sheet. Then copy the values from the correlation grid and paste it into the output sheet.
Then it needs to repeat the process for next column in the source sheet, and paste the values beneath the results for the previous column in the output sheet. I'm guess I need some kind of loop function that will loop through each variable in the source sheet and apply the code.
Any help much appreciated!
Thanks
Lucas
19 9371
Ok, i think there's a chance, but only a small one that VB can make it a lot faster, it depends on the algorithm you're using.
First i recomend you to find a nice quick algorithm in the internet (just google it)
then, dont work it on excel, VBA writes data on excel very slow, just put the data into an array, work it with arrays and then throw the results in another sheet.
the fastes way to add your excel worksheet to an array is something like:
dim a
with worksheets(1)
a = range(.cells(1,1), .cells(100,200))
Now, you'll have to make some DO for each correlation you want, and for the lagged ones, I think this could take some time as well, but if you dont find a faster method, i will be glad to help you (it seems somekind of fun)
Don't mind me. Just registering an interest in this thread, as I'm curious to see where it goes.
Hi,
Thanks for your advice.
When you say download a quick algorithm from Google - do you mean to work out just a correlation or something more iterative that will work all the lagged correlations as well?
Is it not possible just to use Excel's correl function or are you saying that it would be faster to have the raw VBA code for that function?
I'm sure I can get hold of some VBA code that lets you calculate the correlation between two variables, if you think that will speed up the process.
Sorry for what may seem like naive questions but I'm not a programmer, just an analyst that uses tools to do analysis!
Hi,
Thanks for your advice.
When you say download a quick algorithm from Google - do you mean to work out just a correlation or something more iterative that will work all the lagged correlations as well?
Is it not possible just to use Excel's correl function or are you saying that it would be faster to have the raw VBA code for that function?
I'm sure I can get hold of some VBA code that lets you calculate the correlation between two variables, if you think that will speed up the process.
Sorry for what may seem like naive questions but I'm not a programmer, just an analyst that uses tools to do analysis!
Yeap, i was suggesting you to use som raw VBA code for that function. Nothing to be afraid of, since it will be all about IF, DO, FOR, WHILE and basic stuff. Im an analyst too, an actuary. That doesnt mean you're not allowed to create your own codes to make your own solutions. And if you came to a developers network, i think that's the help we can give to you.
what say you?
Yeap, actually it was fun making it :)
There you go, just make sure that the first row of each column is the title, the info must start in the second row, and it doesnt matter where it ends.
I made a litlle test for 180 columns * 70 rows, lag from 0 to 5 and it took 55 seconds, so i think it's quite fast.
Hope it helps.
(if you dont know how to run it, just go to the worksheet where the data is and press Alt+F11, when you're in the VB editor, add a new Module, and paste the code into it, then move your cursor to the second part of the code, any line of the sub called GimmeTheCorrelation, and press F5 to run it, it will show a Message Box with the time elapsed when it finishes, if you get tired of wainting press Ctrl+Pause to stop it at any moment) - Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double
-
Dim a1 As Long
-
Dim a2 As Long
-
Dim Sx As Double
-
Dim Sx2 As Double
-
Dim Sy As Double
-
Dim Sy2 As Double
-
Dim Sxy As Double
-
Dim i As Long
-
a1 = LBound(a) + Lag
-
a2 = UBound(a)
-
For i = a1 To a2
-
Sx = Sx + a(i, 1)
-
Sx2 = Sx2 + a(i, 1) ^ 2
-
Sy = Sy + b(i - Lag, 1)
-
Sy2 = Sy2 + b(i - Lag, 1) ^ 2
-
Sxy = Sxy + (a(i, 1) * b(i - Lag, 1))
-
Next
-
i = a2 - a1 + 1
-
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
-
End Function
-
-
-
Sub GimmeTheCorrelation()
-
Dim a()
-
Dim b()
-
Dim names()
-
Dim names2()
-
Dim j As Byte
-
Dim k1 As Long
-
Dim k2 As Long
-
Dim n As Byte
-
Dim t As Single
-
t = Timer
-
n = 5 '<--- this will be the number of lags (from 0 to n)
-
-
Do
-
j = j + 1
-
Loop Until Cells(2, j) = ""
-
ReDim a(1 To j - 1)
-
ReDim b(1 To j - 1, 1 To j - 1)
-
For j = LBound(a) To UBound(a)
-
a(j) = Range(Cells(2, j), Cells(2, j).End(-4121))
-
Next
-
names = Range(Cells(1, 1), Cells(1, 1).End(-4161))
-
ReDim names2(LBound(names, 2) To UBound(names, 2), 1 To 1)
-
For j = LBound(names, 2) To UBound(names, 2)
-
names2(j, 1) = names(1, j)
-
Next
-
-
For j = 0 To n
-
For k1 = LBound(a) To UBound(a)
-
For k2 = LBound(a) To UBound(a)
-
b(k1, k2) = Correlation(a(k1), a(k2), j)
-
Next
-
Next
-
Worksheets.Add
-
With Worksheets(1)
-
Range(.Cells(2, 2), .Cells(UBound(b) + 1, UBound(b) + 1)) = b
-
Range(.Cells(1, 2), .Cells(1, UBound(b) + 1)) = names
-
Range(.Cells(2, 1), .Cells(UBound(b) + 1, 1)) = names2
-
.Name = "LAG " & j
-
End With
-
Next
-
MsgBox "It took " & Timer - t & " seconds."
-
-
End Sub
Hello Kadghar,
Thanks very much for the code. It's amazing, I thought it would take an hour or two to calculate than just seconds.
I have many questions about the actual code in terms of how it works but really there is one issue. As per my original post, I'm my interested in analysing the correlation at all lags between a variable and other variables rather than analysing correlations at specific lags if you know what I mean.
So the way the current output is formulated is good at answering the question; at a given lag (e.g. five) which variables have the highest and lowest correlation. But instead, I want to answer the question; for a given variable, which variables offer the highest correlations and at what lag.
So instead of the current correlation matrix output at different lags (per sheet):
Numbers denote different variables
1 2 3 4 5 6
1 X X X X X
2 X X etc
3
4
5
6
I need:
For Variable 1
1 2 3 4 5 6
Lag 0 X X X X X X
Lag 1 X X X etc
Lag 2
Lag 3
Lag 4
Lag 5
Variable 2
1 2 3 4 5 6
Lag 0 X X X X X X
Lag 1 X X X X X
Lag 2
Lag 3
Lag 4
Lag 5
I realise all the data I need is calculated by the procedure you kindly created for me, but just in a different format. I guess my question is what is the best way to get the format I need? Would it easiest to just to amend the existing code to get the desired output (I've had a quick look but wouldn't know where to start!) or would it be better to create a secondary macro to specifically take the current output format and re-arrange into the desired format.
Many Thanks
Lucas
Hi,
I see the problem. I think it's not that hard to fix. We'll have to change some FOR in the main procedure. I'll give it a try when i have some free time.
Kad
Thanks a lot.
That's very kind.
Lucas
There you go.
this might be what we need. - Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double
-
Dim a1 As Long
-
Dim a2 As Long
-
Dim Sx As Double
-
Dim Sx2 As Double
-
Dim Sy As Double
-
Dim Sy2 As Double
-
Dim Sxy As Double
-
Dim i As Long
-
a1 = LBound(a) + Lag
-
a2 = UBound(a)
-
For i = a1 To a2
-
Sx = Sx + a(i, 1)
-
Sx2 = Sx2 + a(i, 1) ^ 2
-
Sy = Sy + b(i - Lag, 1)
-
Sy2 = Sy2 + b(i - Lag, 1) ^ 2
-
Sxy = Sxy + (a(i, 1) * b(i - Lag, 1))
-
Next
-
i = a2 - a1 + 1
-
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
-
End Function
-
Sub GimmeTheCorrelation()
-
Dim a()
-
Dim b()
-
Dim names()
-
Dim names2()
-
Dim j As Byte
-
Dim k1 As Long
-
Dim k2 As Long
-
Dim n As Byte
-
Dim t As Single
-
Dim f As Long
-
-
t = Timer
-
n = 5 '<--- this will be the number of lags (from 0 to n)
-
-
Do
-
j = j + 1
-
Loop Until Cells(2, j) = ""
-
ReDim a(1 To j - 1)
-
ReDim b(0 To n, 1 To j - 1)
-
For j = LBound(a) To UBound(a)
-
a(j) = Range(Cells(2, j), Cells(2, j).End(-4121))
-
Next
-
names = Range(Cells(1, 1), Cells(1, 1).End(-4161))
-
ReDim names2(0 To n, 1 To 1)
-
For j = 0 To n
-
names2(j, 1) = "Lag " & j
-
Next
-
Worksheets.Add
-
Worksheets(1).Name = "Correlations"
-
f = 1
-
For k1 = LBound(a) To UBound(a)
-
For j = 0 To n
-
For k2 = LBound(a) To UBound(a)
-
b(j, k2) = Correlation(a(k1), a(k2), j)
-
Next
-
Next
-
With Worksheets(1)
-
.Cells(f, 1).Value = names(1, k1)
-
Range(.Cells(f + 2, 2), .Cells(n + f + 1, UBound(b, 2) + 1)) = b
-
Range(.Cells(f + 1, 2), .Cells(f + 1, UBound(names, 2) + 1)) = names
-
Range(.Cells(f + 2, 1), .Cells(f + n + 1, 1)) = names2
-
End With
-
f = f + n + 3
-
Next
-
MsgBox "It took " & Timer - t & " seconds."
-
End Sub
Hi kadghar,
Thanks for the code. I tried it out, works great.
Can I ask where you got the function to calculate the correlations. Is there some website I can go to? I came accross a couple that provide code but this is in c++, matlab and other. Do you know any good vba ones?
Thanks,
Lucas
Hi kadghar,
Thanks for the code. I tried it out, works great.
Can I ask where you got the function to calculate the correlations. Is there some website I can go to? I came accross a couple that provide code but this is in c++, matlab and other. Do you know any good vba ones?
Thanks,
Lucas
The function??
From wikipedia... but is the one i almost-remembered from college, just transform the sums into FORs with a variable that saves the sum.
Good to know it worked fine.
Hi,
I'm trying to run the code created on this site but it returns the following run-time error 11: Divsion By Zero. Debuging takes you to the following line in the custom function:
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
This code worked fine on a previous dataset with much fewer rows, the current data set covers the range A1:CE2114. Does anyone know what might be causing this?
Thanks,
Lucas -
Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double
-
Dim a1 As Long
-
Dim a2 As Long
-
Dim Sx As Double
-
Dim Sx2 As Double
-
Dim Sy As Double
-
Dim Sy2 As Double
-
Dim Sxy As Double
-
Dim i As Long
-
a1 = LBound(a) + Lag
-
a2 = UBound(a)
-
For i = a1 To a2
-
Sx = Sx + a(i, 1)
-
Sx2 = Sx2 + a(i, 1) ^ 2
-
Sy = Sy + b(i - Lag, 1)
-
Sy2 = Sy2 + b(i - Lag, 1) ^ 2
-
Sxy = Sxy + (a(i, 1) * b(i - Lag, 1))
-
Next
-
i = a2 - a1 + 1
-
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
-
End Function
-
-
Sub GimmeTheCorrelation()
-
Dim a()
-
Dim b()
-
Dim names()
-
Dim names2()
-
Dim j As Byte
-
Dim k1 As Long
-
Dim k2 As Long
-
Dim n As Byte
-
Dim t As Single
-
Dim f As Long
-
-
t = Timer
-
n = 5 '<--- this will be the number of lags (from 0 to n)
-
-
Do
-
j = j + 1
-
Loop Until Cells(2, j) = ""
-
ReDim a(1 To j - 1)
-
ReDim b(0 To n, 1 To j - 1)
-
For j = LBound(a) To UBound(a)
-
a(j) = Range(Cells(2, j), Cells(2, j).End(-4121))
-
Next
-
names = Range(Cells(1, 1), Cells(1, 1).End(-4161))
-
ReDim names2(0 To n, 1 To 1)
-
For j = 0 To n
-
names2(j, 1) = "Lag " & j
-
Next
-
Worksheets.Add
-
Worksheets(1).Name = "Correlations"
-
f = 1
-
For k1 = LBound(a) To UBound(a)
-
For j = 0 To n
-
For k2 = LBound(a) To UBound(a)
-
b(j, k2) = Correlation(a(k1), a(k2), j)
-
Next
-
Next
-
With Worksheets(1)
-
.Cells(f, 1).Value = names(1, k1)
-
Range(.Cells(f + 2, 2), .Cells(n + f + 1, UBound(b, 2) + 1)) = b
-
Range(.Cells(f + 1, 2), .Cells(f + 1, UBound(names, 2) + 1)) = names
-
Range(.Cells(f + 2, 1), .Cells(f + n + 1, 1)) = names2
-
End With
-
f = f + n + 3
-
Next
-
MsgBox "It took " & Timer - t & " seconds."
-
End Sub
Hi,
I'm trying to run the code created on this site but it returns the following run-time error 11: Divsion By Zero. Debuging takes you to the following line in the custom function:
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
This code worked fine on a previous dataset with much fewer rows, the current data set covers the range A1:CE2114. Does anyone know what might be causing this?
Thanks,
Lucas
I dont think its because you're using many cells, It might be because:
i*Sx2 - Sx^2 = zero
or
i*Sy2 - Sy^2 = zero
this can be because of many things, one, is because you have some empty cells in your range, or maybe just bad luck and that's how te calculation is done. I'll recomend you to Click on the Debug button when the error ocurs, and check in the locals window the values of the indexes and of the variables (i, Sy, Sy2, etc) this will help you finding out where the zero comes from: a cell or a previous calculation.
HTH
Hi,
I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of -1#NF.
Lucas
Hi,
I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of -1#NF.
Lucas
are you sure?
the only thing that comes to my mid is that you have really huge matrices, so a1 and a2 are huge huge ranges, and then i takes the value of zero... so:
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
when i = 0 then
Correlation =
-Sx*Sy / (Sx2 * Sy2)
and if you say that Sx2 is not zero and Sy2 is not zero, then i just cant imagine how the division is by zero... only if Sx2 and Sy2 are too small, that their product is smaller than 1x10^-390 (then the number is so small that it'll be taken as a zero)
but... are they that small?, is like the joke (a really bad joke) :
2 + 2 = 5, for very big values of 2.
But it can also be that i takes a huge huge value... then, i dont know what to do
perhaps we need an arbitrary precision algorithm
Anyway, why dont you post the Sx, Sy, Sx1, Sx2 and i values when the error ocurs? so we can find out where the problem is.
Hi,
I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of -1#NF.
Lucas
are you sure?
the only thing that comes to my mid is that you have really huge matrices, so a1 and a2 are huge huge ranges, and then i takes the value of zero... so:
Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5))
when i = 0 then
Correlation =
-Sx*Sy / (Sx2 * Sy2)
and if you say that Sx2 is not zero and Sy2 is not zero, then i just cant imagine how the division is by zero... only if Sx2 and Sy2 are too small, that their product is smaller than 1x10^-390 (then the number is so small that it'll be taken as a zero)
but... are they that small?, is like the joke (a really bad joke) :
2 + 2 = 5, for very big values of 2.
But it can also be that i takes a huge huge value... then, i dont know what to do
perhaps we need an arbitrary precision algorithm
Anyway, why dont you post the Sx, Sy, Sx1, Sx2 and i values when the error ocurs? so we can find out where the problem is.
OK, I've investigated the raw data further and seemed to have indentify the problem, at least the code is running now. One column of data did not vary, i.e it had the same values (of 11) in every row. I deleted this column and the code ran!
So it looks like that's is what was causing the error. Does that make sense from a programing perspective?
Thanks for your help.
Lucas
OK, I've investigated the raw data further and seemed to have indentify the problem, at least the code is running now. One column of data did not vary, i.e it had the same values (of 11) in every row. I deleted this column and the code ran!
So it looks like that's is what was causing the error. Does that make sense from a programing perspective?
Thanks for your help.
Lucas
From a programing perspective?? I dont think so.
Well, i have to admit im a little bit rusted, but i'd say it's because of the correlation properties.
Recently found the site and kadghar's seemed like an excellent solution to the same problem I am facing. However, though the module runs, I receive a blank output sheet in Excel. What could be going wrong?
Thanks for everyone's patience.
Sign in to post your reply or Sign up for a free account.
Similar topics
by: Limey Drink |
last post by:
Hi all,
Firstly :-) , is there any where I can search through archived newsgroup
posts so I am not cluttering up the newsgroup with repeated queries ?
And secondly :-), I know this has...
|
by: Martin MOKREJ© |
last post by:
Hi,
I need to build 3 classes for my work, but each have dozens or hundreds of
variables, self bound. I have about 250 columns in about 10 mysql tables.
I can squeeze everything into 10 classes,...
|
by: xzzy |
last post by:
I need to automate a report in a different database (and thank yous to Terry
Kreft for pointing me in the right direction).
below is the code with the one line that does not work, marked: 'Does...
|
by: Randy |
last post by:
Hello,
I'm having a strange problem. I've got a .NET web app which uses Session
variables. Sometime, not all the time, they get cross threaded...that
is...one user will have another user's Session...
|
by: erick-flores |
last post by:
Hello all
I have a linked table using ODBC. When I tried to do Count() and
Sum() for this linked table (onw of the field in the table is
name "total") the form takes forever to gives me the...
|
by: Michiel van der Blonk |
last post by:
It's probably something with rounding off, or floating point problems.
I have a timer which is used when solving the Rubik's cube. It has to
measure an exact time, but the clock display I made...
|
by: beach.dk |
last post by:
Hi,
I'm trying to implement a simple hash algorith called rs_hash in
javascript,
but I cannot get a correct result.
In c the code looks like this:
|
by: deshg |
last post by:
Hey everyone,
I am a php programmer originally and am just helping a friend of mine update their website that they paid a designer (well that's what he called himself!) to do ages ago. I have...
|
by: =?Utf-8?B?ZG9uZGlnaXRlY2g=?= |
last post by:
I having problems getting this function to return the correct value.
Basically I have a curveDataCalc method that loops as it should, but just
doesn't seem to be working right. I've stepped through...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome former...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: aa123db |
last post by:
Variable and constants
Use var or let for variables and const fror constants.
Var foo ='bar';
Let foo ='bar';const baz ='bar';
Functions
function $name$ ($parameters$) {
}
...
|
by: ryjfgjl |
last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
|
by: ryjfgjl |
last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
|
by: emmanuelkatto |
last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud.
Please let me know.
Thanks!
Emmanuel
|
by: BarryA |
last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
| |