Ordinary [[Correlation Coefficient|correlation]] between variables $X,Y$ is defined to be
$$\mathrm{corr}(X,Y):= \frac{\mathrm{Cov}(X,Y)}{\sigma_{X}\sigma_{Y}}= \frac{\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]}{\sigma_{X}\sigma_{Y}},$$
and for a sample $\{ (x_{1},y_{1}),\dots,(x_{n},y_{n}) \}$, the sample correlation is
$$r:=\frac{\sum_{i}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{ \sum_{i}(x_{i}-\bar{x})^{2}\sum_{i}(y_{i}-\bar{y})^{2} }}.$$

The same idea extends to time series, where the two variables are the same series observed a fixed number of time intervals apart. The (population) autocorrelation of lag $k$ is
$$\rho_{k}:= \frac{\mathrm{Cov}(X_{t},X_{t+k})}{\sqrt{ \mathrm{Var}(X_{t})\cdot \mathrm{Var}(X_{t+k}) }}.$$
The sample estimate uses the observed pairs $(x_{1},x_{1+k}),\dots,(x_{N-k},x_{N})$, giving the **autocorrelation coefficient** (of lag $k$):
$$r_{k}:=\frac{\sum\limits_{t=1}^{N-k}(x_{t}-\bar{x}_{(1)})(x_{t+k}-\bar{x}_{(2)})}{\sqrt{ \sum\limits_{t=1}^{N-k}(x_{t}-\bar{x}_{(1)})^{2}\sum\limits_{t=1}^{N-k}(x_{t+k}-\bar{x}_{(2)})^{2} }}=\frac{\left< x_{(1)}-\bar{x}_{(1)}, x_{(2)}-\bar{x}_{(2)} \right> }{ \| x_{(1)}-\bar{x}_{(1)} \|\cdot \| x_{(2)}-\bar{x}_{(2)} \| },$$
where $\bar{x}_{(1)}$ and $\bar{x}_{(2)}$ are the means of $x_{(1)}:=(x_{1},\dots, x_{N-k})$ and $x_{(2)}:=(x_{1+k},\dots,x_{N})$ respectively.

For lag one, approximating $\bar{x}_{(1)} \approx \bar{x}_{(2)} \approx \bar{x}$ and $N / (N-1)\approx 1$ gives the simplification
$$r_{1} \approx \frac{\sum\limits_{t=1}^{N-1}(x_{t}-\bar{x})(x_{t+1}-\bar{x})}{\sum\limits_{t=1}^{N}(x_{t}-\bar{x})^{2}}=\frac{\left< x_{(1)}-\bar{x},x_{(2)}-\bar{x} \right> }{\| x-\bar{x} \|^{2} }.$$
If we define the **autocovariance coefficient** at lag $k$ to be
$$c_{k}:=\frac{1}{N}\sum_{t=1}^{N-k}(x_{t}-\bar{x})(x_{t+k}-\bar{x}),$$
then the approximated autocorrelation coefficients are $r_{k} \approx c_{k} / c_{0}$.

### Correlograms
The correlogram plots the autocorrelation coefficients as a function of their lag $k$.
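
To make the plotted quantities concrete, here is a minimal sketch that computes $c_{k}$ and $r_{k} \approx c_{k}/c_{0}$ directly from the definitions above and compares them with R's built-in `acf`. The simulated data, the helper name `autocov`, and the choice of `max_lag` are illustrative assumptions, not part of the original notes.

```R
set.seed(1)                        # fixed seed so the sketch is reproducible
x <- rnorm(400)                    # illustrative data (assumption)

# Autocovariance coefficient c_k = (1/N) * sum_{t=1}^{N-k} (x_t - xbar)(x_{t+k} - xbar)
autocov <- function(x, k) {        # hypothetical helper name
  N <- length(x)
  xbar <- mean(x)
  sum((x[1:(N - k)] - xbar) * (x[(1 + k):N] - xbar)) / N
}

max_lag <- 10
c_k <- sapply(0:max_lag, function(k) autocov(x, k))
r_k <- c_k / c_k[1]                # c_k[1] holds c_0, so r_k[1] is r_0 = 1

# acf() with default settings uses the same divisor N and the overall mean,
# so the two sets of coefficients should agree up to floating-point error
r_acf <- as.numeric(acf(x, lag.max = max_lag, plot = FALSE)$acf)
print(all.equal(r_k, r_acf))
```

Note that `r_k[k + 1]` holds the coefficient at lag $k$, since R vectors are 1-indexed and lag $0$ comes first.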
As a baseline, for a [[Purely Random Processes|purely random process]] with $N$ iid observations, the autocorrelation is approximately distributed as $r_{k} \approx N(0, N^{-1})$ for $k \ne 0$. This provides a cut-off for significance: coefficients outside roughly $\pm 2 / \sqrt{ N }$ are significant at about the $5\%$ level.

```R fold
data <- rnorm(400)  # iid standard normal observations
par(mfrow=c(2,1), mar=c(3,4,3,4), bg=NA)
acf(data, ylab="Autocorr.", main="Autocorrelation in independent normal data")
```

![[IIDACF.png#invert]]

A sequence with short-term correlation usually has a large value of $r_{1}$, followed by a few more significant coefficients that decay to near $0$.

```R fold
x <- rnorm(398)
noise <- rnorm(400, sd=0.1)
# Sum of three shifted copies of x: neighbouring observations share terms
data <- c(x,0,0) + c(0,x,0) + c(0,0,x) + noise
par(mfrow=c(2,1), mar=c(3,4,3,4), bg=NA)
acf(data, ylab="Autocorr.", main="Data with short-term correlation")
```

![[ShortTermACF.png#invert]]

An alternating series has successive observations on different sides of the overall mean. Its correlogram usually alternates in sign as well, with $r_{1}<0$, $r_{2}>0$, etc.

```R fold
x <- rnorm(399)
noise <- rnorm(400, sd=0.1)
# Each observation is x_t - x_{t-1}, so neighbours are negatively correlated
data <- c(x,0) + c(0,-x) + noise
par(mfrow=c(2,1), mar=c(3,4,3,4), bg=NA)
acf(data, ylab="Autocorr.", main="Alternating data")
```

![[AlternatingACF.png#invert]]

If the series is non-stationary, the correlogram is of little use, because the trend dominates the correlations at every lag.

```R fold
library(tseries)  # provides the bev dataset
data(bev)
par(mfrow=c(2,1), mar=c(3,4,3,4), bg=NA)
acf(bev, ylab="Autocorr.", main="Non-stationary data")
```

![[NonStationaryACF.png#invert]]

### Autocorrelation in Seasonal Data
The correlogram of seasonal data oscillates at the same frequency as the data: for example, $r_{12}$ will be large and positive for monthly data.
- In those cases the correlogram should cover three cycles of data, e.g. $r_{0}$ through $r_{36}$ for monthly data.

However, seasonality can be easily identified with other measures, so the correlogram is best used on data with its seasonality removed.
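
As an illustration of the seasonal case, the sketch below simulates a hypothetical monthly series, computes the correlogram over three cycles, and then repeats it after subtracting each calendar month's mean. The simulated series, the amplitude, and the month-mean adjustment are assumptions chosen for the example; month-mean subtraction is only one simple way to remove seasonality.

```R
set.seed(2)                          # fixed seed so the sketch is reproducible
n_years <- 10
month <- rep(1:12, times = n_years)

# Hypothetical monthly series: sinusoidal yearly cycle plus unit-variance noise
y <- ts(10 * sin(2 * pi * month / 12) + rnorm(12 * n_years), frequency = 12)

# lag.max = 36 covers three seasonal cycles (lags 0 to 36)
r <- acf(y, lag.max = 36, plot = FALSE)
r_vals <- as.numeric(r$acf)          # r_vals[k + 1] holds r_k
print(r_vals[c(6, 12, 24, 36) + 1])  # r_6, r_12, r_24, r_36

# Remove the seasonal effect by subtracting the mean of each calendar month
y_adj <- y - ave(as.numeric(y), as.integer(cycle(y)))
r_adj <- acf(y_adj, lag.max = 36, plot = FALSE)
plot(r_adj, ylab = "Autocorr.", main = "Seasonally adjusted series")
```

In the raw series one would expect $r_{12}$, $r_{24}$ and $r_{36}$ to be clearly positive and $r_{6}$ clearly negative, while the adjusted series should show no such oscillation.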