P: 16

Hi there,
First of all apologies for the long post. Hope someone can offer some advice.
I have about 200 columns of time series data that I need to perform a correlation analysis on in terms calculating lagged cross correlations between all the variables.
Currently the data is stored in Excel. Variable 1 is in Column A, variable 2 in Column B etc, the data ends in column GR which contains the last variable. I need to calculate the cross correlations between variable 1 in Column A and all other variables in the other remaining columns. The key thing here is that I need to calculate lagged as well as contemperaous correlations between each of the variables.
Then I want to do the same for the next variable in Column B, i.e. calculate contemperaous and lagged correlations between variable 2 and all other variables.
Is there anyway to automate this so that it work out the correlations up to say 10 lags of for each variable against all other variables and paste the results in some kind of readable grid in an excel sheet for each variable:
For example, for Column A  Variable 1 the results could be shown like this (for the first five variables):
B C D E F
Lag0 X X X X X
Lag1 X X X X X
Lag2 X X X X X
Lag3 X X X X X
Lag4 X X X X X
Lag5 X X X X X
Lag6 X X X X X
Lag7 X X X X X
Lag8 X X X X X
Lag9 X X X X X
Lag10 X X X X X
Where X represents the correlation coefficient between variable 1 and all the other variables (col B, C, D, E, F to Col GR) at different lags. The output doesn't have to be exactly like this though.
I can do this manually in excel using the Correl function, adjusting the series range in the formula every time I want to work out the lagged correlation.
Background: The correlation the function takes two arguments:
Array1, Array2
So to calculate correlations for Column A against all other columns I would input:
correl (A2:A100, B2:B100) to give the contemperous correlation between Col A and Col B, correl (A2:A100, c2:c100) to give correlation for col C vs. Col A etc.
Then to work out the correlation at 1 Lag I change the formula to (A3:A100, B2:B99), at lag 2 the formula becomes (A4:A100, B2:B98) etc. Then I do this for all the other columns/Lags but to do this for 200 variables, would take a huge amount of time (probably several days)! Is there a way of doing it using VBA perhaps? Even if it takes a few hours that's OK, as I can leave it to run in the background.
The way I'm thinking of doing this is setting up a workbook with the following sheets:
1) Source Sheet  this would contain all the raw data (Columns 1200)as described in the top of the post.
2) Calculation sheet  Here I would have a duplicate of the above sheet but with one extra column (say column A) and so the dataset would end in column 201). Beneath this matrix I would set up formulas using the Excel’s correl function to work out the lagged correlation between the variable in column 1 and all other variables to say ten lags. Effectively column one would be variable column whereas all the data in the other columns would be fixed. I could then copy each variable from the source sheet and paste into column 1, work the correlations and then move to the next variable
3) Output Sheet  this would store the results of the correlations for each variable (effectively replicating the output grid I described earlier).
Would this be the best way of doing what I need? If so how easy is it to write the code? Basically I need code that will take the first variable from the source sheet, paste it into column A in the calculation sheet. Then copy the values from the correlation grid and paste it into the output sheet.
Then it needs to repeat the process for next column in the source sheet, and paste the values beneath the results for the previous column in the output sheet. I'm guess I need some kind of loop function that will loop through each variable in the source sheet and apply the code.
Any help much appreciated!
Thanks
Lucas
 
Share this Question
Expert 100+
P: 1,295

Ok, i think there's a chance, but only a small one that VB can make it a lot faster, it depends on the algorithm you're using.
First i recomend you to find a nice quick algorithm in the internet (just google it)
then, dont work it on excel, VBA writes data on excel very slow, just put the data into an array, work it with arrays and then throw the results in another sheet.
the fastes way to add your excel worksheet to an array is something like:
dim a
with worksheets(1)
a = range(.cells(1,1), .cells(100,200))
Now, you'll have to make some DO for each correlation you want, and for the lagged ones, I think this could take some time as well, but if you dont find a faster method, i will be glad to help you (it seems somekind of fun)
  Expert 5K+
P: 8,434

Don't mind me. Just registering an interest in this thread, as I'm curious to see where it goes.
 
P: 16

Hi,
Thanks for your advice.
When you say download a quick algorithm from Google  do you mean to work out just a correlation or something more iterative that will work all the lagged correlations as well?
Is it not possible just to use Excel's correl function or are you saying that it would be faster to have the raw VBA code for that function?
I'm sure I can get hold of some VBA code that lets you calculate the correlation between two variables, if you think that will speed up the process.
Sorry for what may seem like naive questions but I'm not a programmer, just an analyst that uses tools to do analysis!
  Expert 100+
P: 1,295

Hi,
Thanks for your advice.
When you say download a quick algorithm from Google  do you mean to work out just a correlation or something more iterative that will work all the lagged correlations as well?
Is it not possible just to use Excel's correl function or are you saying that it would be faster to have the raw VBA code for that function?
I'm sure I can get hold of some VBA code that lets you calculate the correlation between two variables, if you think that will speed up the process.
Sorry for what may seem like naive questions but I'm not a programmer, just an analyst that uses tools to do analysis!
Yeap, i was suggesting you to use som raw VBA code for that function. Nothing to be afraid of, since it will be all about IF, DO, FOR, WHILE and basic stuff. Im an analyst too, an actuary. That doesnt mean you're not allowed to create your own codes to make your own solutions. And if you came to a developers network, i think that's the help we can give to you.
what say you?
  Expert 100+
P: 1,295

Yeap, actually it was fun making it :)
There you go, just make sure that the first row of each column is the title, the info must start in the second row, and it doesnt matter where it ends.
I made a litlle test for 180 columns * 70 rows, lag from 0 to 5 and it took 55 seconds, so i think it's quite fast.
Hope it helps.
(if you dont know how to run it, just go to the worksheet where the data is and press Alt+F11, when you're in the VB editor, add a new Module, and paste the code into it, then move your cursor to the second part of the code, any line of the sub called GimmeTheCorrelation, and press F5 to run it, it will show a Message Box with the time elapsed when it finishes, if you get tired of wainting press Ctrl+Pause to stop it at any moment)  Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double

Dim a1 As Long

Dim a2 As Long

Dim Sx As Double

Dim Sx2 As Double

Dim Sy As Double

Dim Sy2 As Double

Dim Sxy As Double

Dim i As Long

a1 = LBound(a) + Lag

a2 = UBound(a)

For i = a1 To a2

Sx = Sx + a(i, 1)

Sx2 = Sx2 + a(i, 1) ^ 2

Sy = Sy + b(i  Lag, 1)

Sy2 = Sy2 + b(i  Lag, 1) ^ 2

Sxy = Sxy + (a(i, 1) * b(i  Lag, 1))

Next

i = a2  a1 + 1

Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))

End Function



Sub GimmeTheCorrelation()

Dim a()

Dim b()

Dim names()

Dim names2()

Dim j As Byte

Dim k1 As Long

Dim k2 As Long

Dim n As Byte

Dim t As Single

t = Timer

n = 5 '< this will be the number of lags (from 0 to n)


Do

j = j + 1

Loop Until Cells(2, j) = ""

ReDim a(1 To j  1)

ReDim b(1 To j  1, 1 To j  1)

For j = LBound(a) To UBound(a)

a(j) = Range(Cells(2, j), Cells(2, j).End(4121))

Next

names = Range(Cells(1, 1), Cells(1, 1).End(4161))

ReDim names2(LBound(names, 2) To UBound(names, 2), 1 To 1)

For j = LBound(names, 2) To UBound(names, 2)

names2(j, 1) = names(1, j)

Next


For j = 0 To n

For k1 = LBound(a) To UBound(a)

For k2 = LBound(a) To UBound(a)

b(k1, k2) = Correlation(a(k1), a(k2), j)

Next

Next

Worksheets.Add

With Worksheets(1)

Range(.Cells(2, 2), .Cells(UBound(b) + 1, UBound(b) + 1)) = b

Range(.Cells(1, 2), .Cells(1, UBound(b) + 1)) = names

Range(.Cells(2, 1), .Cells(UBound(b) + 1, 1)) = names2

.Name = "LAG " & j

End With

Next

MsgBox "It took " & Timer  t & " seconds."


End Sub
 
P: 16

Hello Kadghar,
Thanks very much for the code. It's amazing, I thought it would take an hour or two to calculate than just seconds.
I have many questions about the actual code in terms of how it works but really there is one issue. As per my original post, I'm my interested in analysing the correlation at all lags between a variable and other variables rather than analysing correlations at specific lags if you know what I mean.
So the way the current output is formulated is good at answering the question; at a given lag (e.g. five) which variables have the highest and lowest correlation. But instead, I want to answer the question; for a given variable, which variables offer the highest correlations and at what lag.
So instead of the current correlation matrix output at different lags (per sheet):
Numbers denote different variables
1 2 3 4 5 6
1 X X X X X
2 X X etc
3
4
5
6
I need:
For Variable 1
1 2 3 4 5 6
Lag 0 X X X X X X
Lag 1 X X X etc
Lag 2
Lag 3
Lag 4
Lag 5
Variable 2
1 2 3 4 5 6
Lag 0 X X X X X X
Lag 1 X X X X X
Lag 2
Lag 3
Lag 4
Lag 5
I realise all the data I need is calculated by the procedure you kindly created for me, but just in a different format. I guess my question is what is the best way to get the format I need? Would it easiest to just to amend the existing code to get the desired output (I've had a quick look but wouldn't know where to start!) or would it be better to create a secondary macro to specifically take the current output format and rearrange into the desired format.
Many Thanks
Lucas
  Expert 100+
P: 1,295

Hi,
I see the problem. I think it's not that hard to fix. We'll have to change some FOR in the main procedure. I'll give it a try when i have some free time.
Kad
 
P: 16

Thanks a lot.
That's very kind.
Lucas
  Expert 100+
P: 1,295

There you go.
this might be what we need.  Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double

Dim a1 As Long

Dim a2 As Long

Dim Sx As Double

Dim Sx2 As Double

Dim Sy As Double

Dim Sy2 As Double

Dim Sxy As Double

Dim i As Long

a1 = LBound(a) + Lag

a2 = UBound(a)

For i = a1 To a2

Sx = Sx + a(i, 1)

Sx2 = Sx2 + a(i, 1) ^ 2

Sy = Sy + b(i  Lag, 1)

Sy2 = Sy2 + b(i  Lag, 1) ^ 2

Sxy = Sxy + (a(i, 1) * b(i  Lag, 1))

Next

i = a2  a1 + 1

Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))

End Function

Sub GimmeTheCorrelation()

Dim a()

Dim b()

Dim names()

Dim names2()

Dim j As Byte

Dim k1 As Long

Dim k2 As Long

Dim n As Byte

Dim t As Single

Dim f As Long


t = Timer

n = 5 '< this will be the number of lags (from 0 to n)


Do

j = j + 1

Loop Until Cells(2, j) = ""

ReDim a(1 To j  1)

ReDim b(0 To n, 1 To j  1)

For j = LBound(a) To UBound(a)

a(j) = Range(Cells(2, j), Cells(2, j).End(4121))

Next

names = Range(Cells(1, 1), Cells(1, 1).End(4161))

ReDim names2(0 To n, 1 To 1)

For j = 0 To n

names2(j, 1) = "Lag " & j

Next

Worksheets.Add

Worksheets(1).Name = "Correlations"

f = 1

For k1 = LBound(a) To UBound(a)

For j = 0 To n

For k2 = LBound(a) To UBound(a)

b(j, k2) = Correlation(a(k1), a(k2), j)

Next

Next

With Worksheets(1)

.Cells(f, 1).Value = names(1, k1)

Range(.Cells(f + 2, 2), .Cells(n + f + 1, UBound(b, 2) + 1)) = b

Range(.Cells(f + 1, 2), .Cells(f + 1, UBound(names, 2) + 1)) = names

Range(.Cells(f + 2, 1), .Cells(f + n + 1, 1)) = names2

End With

f = f + n + 3

Next

MsgBox "It took " & Timer  t & " seconds."

End Sub
 
P: 16

Hi kadghar,
Thanks for the code. I tried it out, works great.
Can I ask where you got the function to calculate the correlations. Is there some website I can go to? I came accross a couple that provide code but this is in c++, matlab and other. Do you know any good vba ones?
Thanks,
Lucas
  Expert 100+
P: 1,295

Hi kadghar,
Thanks for the code. I tried it out, works great.
Can I ask where you got the function to calculate the correlations. Is there some website I can go to? I came accross a couple that provide code but this is in c++, matlab and other. Do you know any good vba ones?
Thanks,
Lucas
The function??
From wikipedia... but is the one i almostremembered from college, just transform the sums into FORs with a variable that saves the sum.
Good to know it worked fine.
 
P: 16

Hi,
I'm trying to run the code created on this site but it returns the following runtime error 11: Divsion By Zero. Debuging takes you to the following line in the custom function:
Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))
This code worked fine on a previous dataset with much fewer rows, the current data set covers the range A1:CE2114. Does anyone know what might be causing this?
Thanks,
Lucas 
Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double

Dim a1 As Long

Dim a2 As Long

Dim Sx As Double

Dim Sx2 As Double

Dim Sy As Double

Dim Sy2 As Double

Dim Sxy As Double

Dim i As Long

a1 = LBound(a) + Lag

a2 = UBound(a)

For i = a1 To a2

Sx = Sx + a(i, 1)

Sx2 = Sx2 + a(i, 1) ^ 2

Sy = Sy + b(i  Lag, 1)

Sy2 = Sy2 + b(i  Lag, 1) ^ 2

Sxy = Sxy + (a(i, 1) * b(i  Lag, 1))

Next

i = a2  a1 + 1

Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))

End Function


Sub GimmeTheCorrelation()

Dim a()

Dim b()

Dim names()

Dim names2()

Dim j As Byte

Dim k1 As Long

Dim k2 As Long

Dim n As Byte

Dim t As Single

Dim f As Long


t = Timer

n = 5 '< this will be the number of lags (from 0 to n)


Do

j = j + 1

Loop Until Cells(2, j) = ""

ReDim a(1 To j  1)

ReDim b(0 To n, 1 To j  1)

For j = LBound(a) To UBound(a)

a(j) = Range(Cells(2, j), Cells(2, j).End(4121))

Next

names = Range(Cells(1, 1), Cells(1, 1).End(4161))

ReDim names2(0 To n, 1 To 1)

For j = 0 To n

names2(j, 1) = "Lag " & j

Next

Worksheets.Add

Worksheets(1).Name = "Correlations"

f = 1

For k1 = LBound(a) To UBound(a)

For j = 0 To n

For k2 = LBound(a) To UBound(a)

b(j, k2) = Correlation(a(k1), a(k2), j)

Next

Next

With Worksheets(1)

.Cells(f, 1).Value = names(1, k1)

Range(.Cells(f + 2, 2), .Cells(n + f + 1, UBound(b, 2) + 1)) = b

Range(.Cells(f + 1, 2), .Cells(f + 1, UBound(names, 2) + 1)) = names

Range(.Cells(f + 2, 1), .Cells(f + n + 1, 1)) = names2

End With

f = f + n + 3

Next

MsgBox "It took " & Timer  t & " seconds."

End Sub
  Expert 100+
P: 1,295

Hi,
I'm trying to run the code created on this site but it returns the following runtime error 11: Divsion By Zero. Debuging takes you to the following line in the custom function:
Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))
This code worked fine on a previous dataset with much fewer rows, the current data set covers the range A1:CE2114. Does anyone know what might be causing this?
Thanks,
Lucas
I dont think its because you're using many cells, It might be because:
i*Sx2  Sx^2 = zero
or
i*Sy2  Sy^2 = zero
this can be because of many things, one, is because you have some empty cells in your range, or maybe just bad luck and that's how te calculation is done. I'll recomend you to Click on the Debug button when the error ocurs, and check in the locals window the values of the indexes and of the variables (i, Sy, Sy2, etc) this will help you finding out where the zero comes from: a cell or a previous calculation.
HTH
 
P: 16

Hi,
I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of 1#NF.
Lucas
  Expert 100+
P: 1,295

Hi,
I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of 1#NF.
Lucas
are you sure?
the only thing that comes to my mid is that you have really huge matrices, so a1 and a2 are huge huge ranges, and then i takes the value of zero... so:
Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))
when i = 0 then
Correlation =
Sx*Sy / (Sx2 * Sy2)
and if you say that Sx2 is not zero and Sy2 is not zero, then i just cant imagine how the division is by zero... only if Sx2 and Sy2 are too small, that their product is smaller than 1x10^390 (then the number is so small that it'll be taken as a zero)
but... are they that small?, is like the joke (a really bad joke) :
2 + 2 = 5, for very big values of 2.
But it can also be that i takes a huge huge value... then, i dont know what to do
perhaps we need an arbitrary precision algorithm
Anyway, why dont you post the Sx, Sy, Sx1, Sx2 and i values when the error ocurs? so we can find out where the problem is.
  Expert 100+
P: 1,295

Hi,
I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of 1#NF.
Lucas
are you sure?
the only thing that comes to my mid is that you have really huge matrices, so a1 and a2 are huge huge ranges, and then i takes the value of zero... so:
Correlation = ((i * Sxy)  (Sx * Sy)) / (((i * Sx2  Sx ^ 2) ^ 0.5) * ((i * Sy2  Sy ^ 2) ^ 0.5))
when i = 0 then
Correlation =
Sx*Sy / (Sx2 * Sy2)
and if you say that Sx2 is not zero and Sy2 is not zero, then i just cant imagine how the division is by zero... only if Sx2 and Sy2 are too small, that their product is smaller than 1x10^390 (then the number is so small that it'll be taken as a zero)
but... are they that small?, is like the joke (a really bad joke) :
2 + 2 = 5, for very big values of 2.
But it can also be that i takes a huge huge value... then, i dont know what to do
perhaps we need an arbitrary precision algorithm
Anyway, why dont you post the Sx, Sy, Sx1, Sx2 and i values when the error ocurs? so we can find out where the problem is.
 
P: 16

OK, I've investigated the raw data further and seemed to have indentify the problem, at least the code is running now. One column of data did not vary, i.e it had the same values (of 11) in every row. I deleted this column and the code ran!
So it looks like that's is what was causing the error. Does that make sense from a programing perspective?
Thanks for your help.
Lucas
  Expert 100+
P: 1,295

OK, I've investigated the raw data further and seemed to have indentify the problem, at least the code is running now. One column of data did not vary, i.e it had the same values (of 11) in every row. I deleted this column and the code ran!
So it looks like that's is what was causing the error. Does that make sense from a programing perspective?
Thanks for your help.
Lucas
From a programing perspective?? I dont think so.
Well, i have to admit im a little bit rusted, but i'd say it's because of the correlation properties.
 
P: 1

Recently found the site and kadghar's seemed like an excellent solution to the same problem I am facing. However, though the module runs, I receive a blank output sheet in Excel. What could be going wrong?
Thanks for everyone's patience.
    Question stats  viewed: 7835
 replies: 19
 date asked: Sep 3 '07
