Apologies if the question has been asked or is too vague -- feel free to remove it if so. I tried searching but only found [this post] that mentioned an identity (repeated below) without giving an interpretation. At the very least I want to share what I found and provide a (hopefully) interesting question to think about.

### TLDR

> In OLS, positively correlated noise terms seem to increase the SE of intercept estimates and decrease that of slope estimates. I can prove the result with linear algebra and simulation (for the simple 2D case of $Y=aX+b+\text{noise}$), but can anyone provide an intuitive answer?

### Setup

Start with the Gauss-Markov model
$$Y = X\beta + \epsilon,$$
where $X$ takes values in $\mathbb{R}^{p \times 1}$ (the first entry is always $1$, corresponding to the intercept) and $\epsilon \sim N(0, \sigma^2)$ is additive noise. Now we have a sample $(y_1, \dots, y_n), (x_1, \dots, x_n)$, where the $x$s are treated as fixed and the $y$s are realizations of $Y$. Collect them into the matrix $\mathbf{X}$ and vector $\mathbf{y}$ respectively. The standard OLS procedure now gives the coefficient estimate
$$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$

But instead of i.i.d. noise, say $\epsilon_1, \dots, \epsilon_n$ have covariance matrix $\Omega \ne \sigma^2 I_n$ ($I_n$ being the identity matrix). In particular, it has the form
$$\Omega = (1-\rho)\sigma^2 I_n + \rho\sigma^2 J_n,$$
where $J_n$ is the $n \times n$ matrix full of $1$s. So $\Omega$ has $\sigma^2$ on the diagonal and $\rho\sigma^2$ off the diagonal. I want to answer the question

> Does a larger $\rho$ increase or decrease the variance of $\hat{\beta}$?

### My Initial Guess & Simulations

> Feel free to skip to the next section to see the results!

In a very hand-wavy way, I thought a lower (or even negative) $\rho$ would decrease the variance, since there is "more information" in the sample. This also sounds similar in spirit to the [antithetic variates] technique used in simulations.

[this post]: https://stats.stackexchange.com/questions/114564/why-autocorrelation-affects-ols-coefficient-standard-errors
[antithetic variates]: https://en.wikipedia.org/wiki/Antithetic_variates

For simplicity I used $\sigma^2 = 1$, $p = 2$ (so the model has 1 covariate and an intercept term), $\beta = (0, 2)$ (so the true intercept is $0$ and $Y = 2X$), and $X$ equally spaced over $[10, 30]$.
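As a quick sanity check before simulating (this snippet is my own addition, not part of the original pipeline, and the helper name `is_valid_covariance` is just illustrative), $\Omega$ as defined above is a valid covariance matrix only when $\rho \ge -\frac{1}{n-1}$, which is the lower bound used in the sweep further below. A small eigenvalue check makes this concrete:

```R
# Sketch: Omega = (1 - rho) * sigma^2 * I_n + rho * sigma^2 * J_n is positive
# semi-definite exactly when rho >= -1 / (n - 1).
is_valid_covariance = function(n = 10, variance = 1, correlation = 0) {
  covariance = matrix(variance * correlation, nrow = n, ncol = n)
  diag(covariance) = variance
  # all eigenvalues must be (numerically) non-negative
  all(eigen(covariance, symmetric = TRUE, only.values = TRUE)$values >= -1e-12)
}

is_valid_covariance(n = 10, correlation = -1 / 9)         # TRUE: exactly at the bound
is_valid_covariance(n = 10, correlation = -1 / 9 - 0.01)  # FALSE: just below the bound
```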
First, given some value of $\rho$, simulate a dataset and solve for the OLS coefficient estimates:

```R
one_iteration = function(n_data_points = 100, variance = 10, correlation = 1, beta = 2) {
  # covariance matrix of the epsilon noise terms:
  # `variance` on the diagonal, `variance * correlation` off the diagonal
  covariance = matrix(variance * correlation, nrow = n_data_points, ncol = n_data_points)
  diag(covariance) = variance
  noise = MASS::mvrnorm(n = 1, mu = rep(0, n_data_points), Sigma = covariance)

  X = seq(10, 30, length.out = n_data_points)
  Y = beta * X + noise
  beta_hat = lm(Y ~ X)$coefficients
  return(beta_hat)
}
```

Now, for that value of $\rho$, repeat the above a number of times (100 in my example) to estimate the variance of the OLS coefficients:

```R
library(magrittr)  # for the %>% pipe

estimate_OLS_coef_variance = function(n_data_points = 100, n_iterations = 100, correlation = 1) {
  result = matrix(0, nrow = n_iterations, ncol = 2) %>% data.frame()
  for (i in 1:n_iterations) {
    result[i, ] = one_iteration(n_data_points = n_data_points, correlation = correlation)
  }
  return(result %>% var() %>% diag())
}
```

Lastly, repeat the above for different values of $\rho \in [-\frac{1}{n-1}, 1]$ ($\rho$ cannot be smaller than the lower bound, else the covariance matrix won't be positive semi-definite):

```R
compare_variance_between_correlations = function(n_iterations = 100, n_data_points_OLS = 10, n_correlations = 100) {
  eligible_correlations = seq(-1 / (n_data_points_OLS - 1), 1 - 0.001, length.out = n_correlations)
  variance_of_estimates = matrix(0, nrow = n_correlations, ncol = 3) %>% data.frame()
  colnames(variance_of_estimates) = c("correlation", "Var(Intercept)", "Var(Slope)")

  insertion_index = 1
  for (correlation in eligible_correlations) {
    result = estimate_OLS_coef_variance(n_data_points_OLS, n_iterations, correlation)
    variance_of_estimates[insertion_index, ] = c(correlation, result)
    insertion_index = insertion_index + 1
  }
  return(variance_of_estimates)
}
```

### Results

Plot the results:

```R
library(ggplot2)

result = compare_variance_between_correlations()  # data frame produced by the function above

ggplot(data = result) +
  geom_line(aes(x = correlation, y = log(`Var(Slope)`), color = "Slope")) +
  geom_line(aes(x = correlation, y = log(`Var(Intercept)`), color = "Intercept")) +
  scale_color_manual(name = "", values = c("Slope" = "#233F7D", "Intercept" = "#7D233F")) +
  labs(y = "log(variance)", x = "Correlation z") +
  my_theme # remove this line if you are reproducing the plot!
```

[![Plot of OLS coefficient SE vs. correlation between noise terms][1]][1]

So apparently a positive correlation decreases the variance of the slope estimate while increasing the variance of the intercept estimate?

### Linear Algebra Derivation

As mentioned in the post linked at the start,
$$\begin{align*} \mathrm{Cov}(\hat{\beta}) &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{Cov}(\mathbf{y})\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\Omega\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}. \end{align*}$$
With the special assumption $\Omega = (1-\rho)\sigma^2 I_n + \rho\sigma^2 J_n$, and writing $S := (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$,
$$\begin{align*} \mathrm{Cov}(\hat{\beta}) = S\Omega S' &= \sigma^{2}S\bigl((1-\rho)I_n + \rho J_{n}\bigr)S'\\ &= \sigma^{2}(1-\rho)(\mathbf{X}'\mathbf{X})^{-1} + \sigma^{2}\rho\, SJ_{n}S'. \end{align*}$$

But the OLS slope remains the same if we center the (non-intercept) covariates, so we can assume $J_n\mathbf{X}$ is zero everywhere except in its first column (the one corresponding to the intercept column of $\mathbf{X}$). Therefore, the second term $\sigma^{2}\rho\, SJ_{n}S'$ contributes only to the variance of the intercept estimate, while the slope's variance comes entirely from the first term $\sigma^{2}(1-\rho)(\mathbf{X}'\mathbf{X})^{-1}$, which shrinks as $\rho$ grows.

[1]: https://i.sstatic.net/4aAYY3hL.png
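To make that last step concrete, here is the algebra worked out under the centering assumption (this part is my own addition rather than something from the linked post, so please double-check it): with the covariate centered, $\mathbf{X}'\mathbf{X} = \operatorname{diag}(n, S_{xx})$ where $S_{xx} = \sum_i (x_i - \bar{x})^2$, and $\mathbf{X}'J_n\mathbf{X} = n^2 e_1 e_1'$, so $SJ_nS' = e_1 e_1'$ and
$$\mathrm{Var}(\hat{\beta}_0) = \frac{(1-\rho)\sigma^{2}}{n} + \rho\sigma^{2}, \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{(1-\rho)\sigma^{2}}{S_{xx}},$$
where $\hat{\beta}_0$ denotes the intercept of the centered regression (i.e. $\bar{y}$). The first expression increases in $\rho$ (for $n > 1$) while the second decreases, which lines up with the asymmetry described above.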