424,988 Members | 1,341 Online
Need help? Post your question and get tips & solutions from a community of 424,988 IT Pros & Developers. It's quick & easy.

# Automating Calculation of Lagged Cross Correlations between Variables

 P: 16 Hi there, First of all apologies for the long post. Hope someone can offer some advice. I have about 200 columns of time series data that I need to perform a correlation analysis on in terms calculating lagged cross correlations between all the variables. Currently the data is stored in Excel. Variable 1 is in Column A, variable 2 in Column B etc, the data ends in column GR which contains the last variable. I need to calculate the cross correlations between variable 1 in Column A and all other variables in the other remaining columns. The key thing here is that I need to calculate lagged as well as contemperaous correlations between each of the variables. Then I want to do the same for the next variable in Column B, i.e. calculate contemperaous and lagged correlations between variable 2 and all other variables. Is there anyway to automate this so that it work out the correlations up to say 10 lags of for each variable against all other variables and paste the results in some kind of readable grid in an excel sheet for each variable: For example, for Column A - Variable 1 the results could be shown like this (for the first five variables): B C D E F Lag0 X X X X X Lag1 X X X X X Lag2 X X X X X Lag3 X X X X X Lag4 X X X X X Lag5 X X X X X Lag6 X X X X X Lag7 X X X X X Lag8 X X X X X Lag9 X X X X X Lag10 X X X X X Where X represents the correlation coefficient between variable 1 and all the other variables (col B, C, D, E, F to Col GR) at different lags. The output doesn't have to be exactly like this though. I can do this manually in excel using the Correl function, adjusting the series range in the formula every time I want to work out the lagged correlation. Background: The correlation the function takes two arguments: Array1, Array2 So to calculate correlations for Column A against all other columns I would input: correl (A2:A100, B2:B100) to give the contemperous correlation between Col A and Col B, correl (A2:A100, c2:c100) to give correlation for col C vs. Col A etc. Then to work out the correlation at 1 Lag I change the formula to (A3:A100, B2:B99), at lag 2 the formula becomes (A4:A100, B2:B98) etc. Then I do this for all the other columns/Lags but to do this for 200 variables, would take a huge amount of time (probably several days)! Is there a way of doing it using VBA perhaps? Even if it takes a few hours that's OK, as I can leave it to run in the background. The way I'm thinking of doing this is setting up a workbook with the following sheets: 1) Source Sheet - this would contain all the raw data (Columns 1-200)as described in the top of the post. 2) Calculation sheet - Here I would have a duplicate of the above sheet but with one extra column (say column A) and so the dataset would end in column 201). Beneath this matrix I would set up formulas using the Excel’s correl function to work out the lagged correlation between the variable in column 1 and all other variables to say ten lags. Effectively column one would be variable column whereas all the data in the other columns would be fixed. I could then copy each variable from the source sheet and paste into column 1, work the correlations and then move to the next variable 3) Output Sheet - this would store the results of the correlations for each variable (effectively replicating the output grid I described earlier). Would this be the best way of doing what I need? If so how easy is it to write the code? Basically I need code that will take the first variable from the source sheet, paste it into column A in the calculation sheet. Then copy the values from the correlation grid and paste it into the output sheet. Then it needs to repeat the process for next column in the source sheet, and paste the values beneath the results for the previous column in the output sheet. I'm guess I need some kind of loop function that will loop through each variable in the source sheet and apply the code. Any help much appreciated! Thanks Lucas Sep 3 '07 #1
19 Replies

 Expert 100+ P: 1,295 Ok, i think there's a chance, but only a small one that VB can make it a lot faster, it depends on the algorithm you're using. First i recomend you to find a nice quick algorithm in the internet (just google it) then, dont work it on excel, VBA writes data on excel very slow, just put the data into an array, work it with arrays and then throw the results in another sheet. the fastes way to add your excel worksheet to an array is something like: dim a with worksheets(1) a = range(.cells(1,1), .cells(100,200)) Now, you'll have to make some DO for each correlation you want, and for the lagged ones, I think this could take some time as well, but if you dont find a faster method, i will be glad to help you (it seems somekind of fun) Sep 4 '07 #2

 Expert 5K+ P: 8,434 Don't mind me. Just registering an interest in this thread, as I'm curious to see where it goes. Sep 4 '07 #3

 P: 16 Hi, Thanks for your advice. When you say download a quick algorithm from Google - do you mean to work out just a correlation or something more iterative that will work all the lagged correlations as well? Is it not possible just to use Excel's correl function or are you saying that it would be faster to have the raw VBA code for that function? I'm sure I can get hold of some VBA code that lets you calculate the correlation between two variables, if you think that will speed up the process. Sorry for what may seem like naive questions but I'm not a programmer, just an analyst that uses tools to do analysis! Sep 4 '07 #4

 Expert 100+ P: 1,295 Hi, Thanks for your advice. When you say download a quick algorithm from Google - do you mean to work out just a correlation or something more iterative that will work all the lagged correlations as well? Is it not possible just to use Excel's correl function or are you saying that it would be faster to have the raw VBA code for that function? I'm sure I can get hold of some VBA code that lets you calculate the correlation between two variables, if you think that will speed up the process. Sorry for what may seem like naive questions but I'm not a programmer, just an analyst that uses tools to do analysis! Yeap, i was suggesting you to use som raw VBA code for that function. Nothing to be afraid of, since it will be all about IF, DO, FOR, WHILE and basic stuff. Im an analyst too, an actuary. That doesnt mean you're not allowed to create your own codes to make your own solutions. And if you came to a developers network, i think that's the help we can give to you. what say you? Sep 4 '07 #5

 Expert 100+ P: 1,295 Yeap, actually it was fun making it :) There you go, just make sure that the first row of each column is the title, the info must start in the second row, and it doesnt matter where it ends. I made a litlle test for 180 columns * 70 rows, lag from 0 to 5 and it took 55 seconds, so i think it's quite fast. Hope it helps. (if you dont know how to run it, just go to the worksheet where the data is and press Alt+F11, when you're in the VB editor, add a new Module, and paste the code into it, then move your cursor to the second part of the code, any line of the sub called GimmeTheCorrelation, and press F5 to run it, it will show a Message Box with the time elapsed when it finishes, if you get tired of wainting press Ctrl+Pause to stop it at any moment) Expand|Select|Wrap|Line Numbers Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double Dim a1 As Long Dim a2 As Long Dim Sx As Double Dim Sx2 As Double Dim Sy As Double Dim Sy2 As Double Dim Sxy As Double Dim i As Long a1 = LBound(a) + Lag a2 = UBound(a) For i = a1 To a2     Sx = Sx + a(i, 1)     Sx2 = Sx2 + a(i, 1) ^ 2     Sy = Sy + b(i - Lag, 1)     Sy2 = Sy2 + b(i - Lag, 1) ^ 2     Sxy = Sxy + (a(i, 1) * b(i - Lag, 1)) Next i = a2 - a1 + 1 Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) End Function     Sub GimmeTheCorrelation() Dim a() Dim b() Dim names() Dim names2() Dim j As Byte Dim k1 As Long Dim k2 As Long Dim n As Byte Dim t As Single t = Timer n = 5 '<--- this will be the number of lags (from 0 to n)   Do     j = j + 1 Loop Until Cells(2, j) = "" ReDim a(1 To j - 1) ReDim b(1 To j - 1, 1 To j - 1) For j = LBound(a) To UBound(a)     a(j) = Range(Cells(2, j), Cells(2, j).End(-4121)) Next names = Range(Cells(1, 1), Cells(1, 1).End(-4161)) ReDim names2(LBound(names, 2) To UBound(names, 2), 1 To 1) For j = LBound(names, 2) To UBound(names, 2)     names2(j, 1) = names(1, j) Next   For j = 0 To n     For k1 = LBound(a) To UBound(a)         For k2 = LBound(a) To UBound(a)             b(k1, k2) = Correlation(a(k1), a(k2), j)         Next     Next     Worksheets.Add     With Worksheets(1)         Range(.Cells(2, 2), .Cells(UBound(b) + 1, UBound(b) + 1)) = b         Range(.Cells(1, 2), .Cells(1, UBound(b) + 1)) = names         Range(.Cells(2, 1), .Cells(UBound(b) + 1, 1)) = names2         .Name = "LAG " & j     End With Next MsgBox "It took " & Timer - t & " seconds."   End Sub Sep 4 '07 #6

 P: 16 Hello Kadghar, Thanks very much for the code. It's amazing, I thought it would take an hour or two to calculate than just seconds. I have many questions about the actual code in terms of how it works but really there is one issue. As per my original post, I'm my interested in analysing the correlation at all lags between a variable and other variables rather than analysing correlations at specific lags if you know what I mean. So the way the current output is formulated is good at answering the question; at a given lag (e.g. five) which variables have the highest and lowest correlation. But instead, I want to answer the question; for a given variable, which variables offer the highest correlations and at what lag. So instead of the current correlation matrix output at different lags (per sheet): Numbers denote different variables 1 2 3 4 5 6 1 X X X X X 2 X X etc 3 4 5 6 I need: For Variable 1 1 2 3 4 5 6 Lag 0 X X X X X X Lag 1 X X X etc Lag 2 Lag 3 Lag 4 Lag 5 Variable 2 1 2 3 4 5 6 Lag 0 X X X X X X Lag 1 X X X X X Lag 2 Lag 3 Lag 4 Lag 5 I realise all the data I need is calculated by the procedure you kindly created for me, but just in a different format. I guess my question is what is the best way to get the format I need? Would it easiest to just to amend the existing code to get the desired output (I've had a quick look but wouldn't know where to start!) or would it be better to create a secondary macro to specifically take the current output format and re-arrange into the desired format. Many Thanks Lucas Sep 5 '07 #7

 Expert 100+ P: 1,295 Hi, I see the problem. I think it's not that hard to fix. We'll have to change some FOR in the main procedure. I'll give it a try when i have some free time. Kad Sep 5 '07 #8

 P: 16 Thanks a lot. That's very kind. Lucas Sep 5 '07 #9

 Expert 100+ P: 1,295 There you go. this might be what we need. Expand|Select|Wrap|Line Numbers Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double Dim a1 As Long Dim a2 As Long Dim Sx As Double Dim Sx2 As Double Dim Sy As Double Dim Sy2 As Double Dim Sxy As Double Dim i As Long a1 = LBound(a) + Lag a2 = UBound(a) For i = a1 To a2     Sx = Sx + a(i, 1)     Sx2 = Sx2 + a(i, 1) ^ 2     Sy = Sy + b(i - Lag, 1)     Sy2 = Sy2 + b(i - Lag, 1) ^ 2     Sxy = Sxy + (a(i, 1) * b(i - Lag, 1)) Next i = a2 - a1 + 1 Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) End Function Sub GimmeTheCorrelation() Dim a() Dim b() Dim names() Dim names2() Dim j As Byte Dim k1 As Long Dim k2 As Long Dim n As Byte Dim t As Single Dim f As Long   t = Timer n = 5 '<--- this will be the number of lags (from 0 to n)   Do     j = j + 1 Loop Until Cells(2, j) = "" ReDim a(1 To j - 1) ReDim b(0 To n, 1 To j - 1) For j = LBound(a) To UBound(a)     a(j) = Range(Cells(2, j), Cells(2, j).End(-4121)) Next names = Range(Cells(1, 1), Cells(1, 1).End(-4161)) ReDim names2(0 To n, 1 To 1) For j = 0 To n     names2(j, 1) = "Lag " & j Next Worksheets.Add Worksheets(1).Name = "Correlations" f = 1 For k1 = LBound(a) To UBound(a)     For j = 0 To n         For k2 = LBound(a) To UBound(a)             b(j, k2) = Correlation(a(k1), a(k2), j)         Next     Next     With Worksheets(1)         .Cells(f, 1).Value = names(1, k1)         Range(.Cells(f + 2, 2), .Cells(n + f + 1, UBound(b, 2) + 1)) = b         Range(.Cells(f + 1, 2), .Cells(f + 1, UBound(names, 2) + 1)) = names         Range(.Cells(f + 2, 1), .Cells(f + n + 1, 1)) = names2     End With     f = f + n + 3 Next MsgBox "It took " & Timer - t & " seconds." End Sub Sep 6 '07 #10

 P: 16 Hi kadghar, Thanks for the code. I tried it out, works great. Can I ask where you got the function to calculate the correlations. Is there some website I can go to? I came accross a couple that provide code but this is in c++, matlab and other. Do you know any good vba ones? Thanks, Lucas Sep 11 '07 #11

 Expert 100+ P: 1,295 Hi kadghar, Thanks for the code. I tried it out, works great. Can I ask where you got the function to calculate the correlations. Is there some website I can go to? I came accross a couple that provide code but this is in c++, matlab and other. Do you know any good vba ones? Thanks, Lucas The function?? From wikipedia... but is the one i almost-remembered from college, just transform the sums into FORs with a variable that saves the sum. Good to know it worked fine. Sep 11 '07 #12

 P: 16 Hi, I'm trying to run the code created on this site but it returns the following run-time error 11: Divsion By Zero. Debuging takes you to the following line in the custom function: Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) This code worked fine on a previous dataset with much fewer rows, the current data set covers the range A1:CE2114. Does anyone know what might be causing this? Thanks, Lucas Expand|Select|Wrap|Line Numbers Public Function Correlation(ByVal a, ByVal b, ByVal Lag As Integer) As Double Dim a1 As Long Dim a2 As Long Dim Sx As Double Dim Sx2 As Double Dim Sy As Double Dim Sy2 As Double Dim Sxy As Double Dim i As Long a1 = LBound(a) + Lag a2 = UBound(a) For i = a1 To a2     Sx = Sx + a(i, 1)     Sx2 = Sx2 + a(i, 1) ^ 2     Sy = Sy + b(i - Lag, 1)     Sy2 = Sy2 + b(i - Lag, 1) ^ 2     Sxy = Sxy + (a(i, 1) * b(i - Lag, 1)) Next i = a2 - a1 + 1 Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) End Function   Sub GimmeTheCorrelation() Dim a() Dim b() Dim names() Dim names2() Dim j As Byte Dim k1 As Long Dim k2 As Long Dim n As Byte Dim t As Single Dim f As Long   t = Timer n = 5 '<--- this will be the number of lags (from 0 to n)   Do     j = j + 1 Loop Until Cells(2, j) = "" ReDim a(1 To j - 1) ReDim b(0 To n, 1 To j - 1) For j = LBound(a) To UBound(a)     a(j) = Range(Cells(2, j), Cells(2, j).End(-4121)) Next names = Range(Cells(1, 1), Cells(1, 1).End(-4161)) ReDim names2(0 To n, 1 To 1) For j = 0 To n     names2(j, 1) = "Lag " & j Next Worksheets.Add Worksheets(1).Name = "Correlations" f = 1 For k1 = LBound(a) To UBound(a)     For j = 0 To n         For k2 = LBound(a) To UBound(a)             b(j, k2) = Correlation(a(k1), a(k2), j)         Next     Next     With Worksheets(1)         .Cells(f, 1).Value = names(1, k1)         Range(.Cells(f + 2, 2), .Cells(n + f + 1, UBound(b, 2) + 1)) = b         Range(.Cells(f + 1, 2), .Cells(f + 1, UBound(names, 2) + 1)) = names         Range(.Cells(f + 2, 1), .Cells(f + n + 1, 1)) = names2     End With     f = f + n + 3 Next MsgBox "It took " & Timer - t & " seconds." End Sub Apr 23 '08 #13

 Expert 100+ P: 1,295 Hi, I'm trying to run the code created on this site but it returns the following run-time error 11: Divsion By Zero. Debuging takes you to the following line in the custom function: Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) This code worked fine on a previous dataset with much fewer rows, the current data set covers the range A1:CE2114. Does anyone know what might be causing this? Thanks, Lucas I dont think its because you're using many cells, It might be because: i*Sx2 - Sx^2 = zero or i*Sy2 - Sy^2 = zero this can be because of many things, one, is because you have some empty cells in your range, or maybe just bad luck and that's how te calculation is done. I'll recomend you to Click on the Debug button when the error ocurs, and check in the locals window the values of the indexes and of the variables (i, Sy, Sy2, etc) this will help you finding out where the zero comes from: a cell or a previous calculation. HTH Apr 23 '08 #14

 P: 16 Hi, I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of -1#NF. Lucas Apr 24 '08 #15

 Expert 100+ P: 1,295 Hi, I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of -1#NF. Lucas are you sure? the only thing that comes to my mid is that you have really huge matrices, so a1 and a2 are huge huge ranges, and then i takes the value of zero... so: Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) when i = 0 then Correlation = -Sx*Sy / (Sx2 * Sy2) and if you say that Sx2 is not zero and Sy2 is not zero, then i just cant imagine how the division is by zero... only if Sx2 and Sy2 are too small, that their product is smaller than 1x10^-390 (then the number is so small that it'll be taken as a zero) but... are they that small?, is like the joke (a really bad joke) : 2 + 2 = 5, for very big values of 2. But it can also be that i takes a huge huge value... then, i dont know what to do perhaps we need an arbitrary precision algorithm Anyway, why dont you post the Sx, Sy, Sx1, Sx2 and i values when the error ocurs? so we can find out where the problem is. Apr 24 '08 #16

 Expert 100+ P: 1,295 Hi, I did a search for empty cells in my range but couldn't find any. Viewing the window show only the Lag expression having a value of 0. All other numerical expressions do not have any zeros. The correlation expression has the following value of -1#NF. Lucas are you sure? the only thing that comes to my mid is that you have really huge matrices, so a1 and a2 are huge huge ranges, and then i takes the value of zero... so: Correlation = ((i * Sxy) - (Sx * Sy)) / (((i * Sx2 - Sx ^ 2) ^ 0.5) * ((i * Sy2 - Sy ^ 2) ^ 0.5)) when i = 0 then Correlation = -Sx*Sy / (Sx2 * Sy2) and if you say that Sx2 is not zero and Sy2 is not zero, then i just cant imagine how the division is by zero... only if Sx2 and Sy2 are too small, that their product is smaller than 1x10^-390 (then the number is so small that it'll be taken as a zero) but... are they that small?, is like the joke (a really bad joke) : 2 + 2 = 5, for very big values of 2. But it can also be that i takes a huge huge value... then, i dont know what to do perhaps we need an arbitrary precision algorithm Anyway, why dont you post the Sx, Sy, Sx1, Sx2 and i values when the error ocurs? so we can find out where the problem is. Apr 24 '08 #17

 P: 16 OK, I've investigated the raw data further and seemed to have indentify the problem, at least the code is running now. One column of data did not vary, i.e it had the same values (of 11) in every row. I deleted this column and the code ran! So it looks like that's is what was causing the error. Does that make sense from a programing perspective? Thanks for your help. Lucas Apr 25 '08 #18

 Expert 100+ P: 1,295 OK, I've investigated the raw data further and seemed to have indentify the problem, at least the code is running now. One column of data did not vary, i.e it had the same values (of 11) in every row. I deleted this column and the code ran! So it looks like that's is what was causing the error. Does that make sense from a programing perspective? Thanks for your help. Lucas From a programing perspective?? I dont think so. Well, i have to admit im a little bit rusted, but i'd say it's because of the correlation properties. Apr 25 '08 #19

 P: 1 Recently found the site and kadghar's seemed like an excellent solution to the same problem I am facing. However, though the module runs, I receive a blank output sheet in Excel. What could be going wrong? Thanks for everyone's patience. May 5 '14 #20