In [[Linear Regression Methods#Least Squares Linear Regression|OLS]], classic [[Inference in OLS|inference techniques]] rely on a few modeling assumptions: a linear model $Y=X\beta+\epsilon$, where the error $\epsilon$ satisfies two strong assumptions:
- Homoscedasticity: $\epsilon$ has the same variance $\sigma^{2}$ across observations;
- Normality: $\epsilon$ follows a Gaussian distribution.
If those assumptions fail, the exact finite-sample distributional results for $\hat{\beta},\hat{\mathbf{y}}$ do not necessarily hold. Instead, we need asymptotic distributions.
### The Sandwich Variance Estimator
Without assuming normality, we can still write $\begin{align*}
\hat{\beta}-\beta&= (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}-\beta\\
&= (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\pmb{\epsilon}\\
&= B_{n}^{-1}\pmb{\xi}_{n}
\end{align*}$where $B_{n}:=\mathbf{X}^{T}\mathbf{X} / n$, and $\pmb{\xi}_{n}:=\mathbf{X}^{T}\pmb{\epsilon} / n$, and we wish to know the (asymptotic) distribution of $\pmb{\xi}_{n}$. Standard results give
$\mathrm{Cov}(\pmb{\xi}_{n})=n^{-2}\mathbf{X}^{T}\mathrm{Cov}(\pmb{\epsilon})\mathbf{X}=M_{n} / n,$where $M_{n}:= \mathbf{X}^{T}\mathrm{Cov}(\pmb{\epsilon})\mathbf{X} / n$, and plugging it back into the first equation gives $\begin{align*}
\mathrm{Cov}(\hat{\beta})&= B_{n}^{-1}\mathrm{Cov}(\pmb{\xi}_{n})B_{n}^{-1}\\
&= {B_{n}^{-1}M_{n}B_{n}^{-1}} / n\\
&\propto \text{bread-meat-bread},
\end{align*}$hence the name **sandwich**.
Furthermore, if $B_{n},M_{n}$ (being some sort of "average") have finite limits $B,M$ (i.e. they converge entry-wise to finite limits), we still have $\mathrm{Cov}(\hat{\beta})\to \mathbf{0}$, so $\hat{\beta} \overset{p}{\to} \beta$, i.e. $\hat{\beta}$ is consistent.
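As a quick sanity check, the sandwich formula can be compared against a Monte Carlo estimate of $\mathrm{Cov}(\hat{\beta})$ under heteroscedastic errors. The numpy sketch below is illustrative only (the simulation setup and all variable names are my own choices), and uses the true $\mathrm{Cov}(\pmb{\epsilon})$ in the meat term:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.5 + np.abs(X[:, 1])        # heteroscedastic error scale

# Monte Carlo covariance of beta-hat over repeated draws of epsilon
beta_hats = []
for _ in range(2000):
    eps = rng.normal(scale=sigma)
    y = X @ beta + eps
    beta_hats.append(np.linalg.solve(X.T @ X, X.T @ y))
mc_cov = np.cov(np.array(beta_hats), rowvar=False)

# Sandwich formula B_n^{-1} M_n B_n^{-1} / n with the true Cov(eps) = diag(sigma^2)
Bn = X.T @ X / n
Mn = X.T @ (sigma[:, None] ** 2 * X) / n
sandwich = np.linalg.inv(Bn) @ Mn @ np.linalg.inv(Bn) / n

print(np.round(mc_cov, 4))
print(np.round(sandwich, 4))         # should be close to mc_cov
```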
Now for variance estimation, simply replace $B,M$ with their finite-sample estimators: $\begin{align*}
\hat{B}&= B_{n},\\
\hat{M}&= \mathbf{X}^{T}\hat{\Omega}\mathbf{X} / n,\\
&\text{where } \hat{\Omega}= \mathrm{diag}(e_{1}^{2},\dots,e_{n}^{2}),
\end{align*}$i.e. approximating $\mathrm{Cov}(\pmb{\epsilon})$ with $\mathrm{diag}(\pmb{\epsilon}^{2})$ (using independence) and estimating it with $\mathrm{diag}(\mathbf{e}^{2})$. Plugging in these estimators gives the **Eicker-Huber-White covariance matrix** $\hat{\Sigma}_{\mathrm{EHW}}=\hat{B}^{-1}\hat{M}\hat{B}^{-1} / n=(\mathbf{X}^{T}\mathbf{X})^{-1}(\mathbf{X}^{T}\hat{\Omega}\mathbf{X})(\mathbf{X}^{T}\mathbf{X})^{-1}.$
> [!connection] This can also be derived from a [[Weighed OLS]] perspective.
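As a sketch, $\hat{\Sigma}_{\mathrm{EHW}}$ is straightforward to compute directly from the design matrix and the residuals; the helper below is illustrative, not a library API (in practice one would typically reach for the HC-type robust covariance options in a package such as statsmodels):

```python
import numpy as np

def ehw_covariance(X, y):
    """EHW (HC0) sandwich covariance of the OLS estimate -- a minimal sketch."""
    XtX_inv = np.linalg.inv(X.T @ X)           # "bread"
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat                       # residuals
    meat = X.T @ (e[:, None] ** 2 * X)         # X^T diag(e^2) X
    return XtX_inv @ meat @ XtX_inv, beta_hat
```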
Although $\mathbf{e}$ (usually) underestimates $\pmb{\epsilon}$ in magnitude, we still have $\hat{M} \to M$ as $n \to \infty$. For finite samples, however, there are a few adjusted residuals (commonly known as the HC1, HC2, and HC3 corrections) that give better results: $\hat{\epsilon}_{i}=\begin{dcases}
e_{i}\sqrt{ \frac{n}{n-p} }, \\[0.4em]
{e_{i}} / {\sqrt{ 1-h_{ii} }}, \\[0.4em]
e_{i} / (1-h_{ii}),
\end{dcases}$where $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $\mathbf{H}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$.
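A sketch of these residual adjustments (the `kind` labels follow the common HC1/HC2/HC3 naming; the helper itself is illustrative):

```python
import numpy as np

def adjusted_residuals(X, e, kind="HC3"):
    """Finite-sample residual adjustments for the sandwich meat (sketch)."""
    n, p = X.shape
    # leverages h_ii: diagonal of the hat matrix X (X^T X)^{-1} X^T
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    if kind == "HC1":
        return e * np.sqrt(n / (n - p))
    if kind == "HC2":
        return e / np.sqrt(1 - h)
    if kind == "HC3":
        return e / (1 - h)
    raise ValueError(f"unknown kind: {kind}")
```

Plugging $\hat{\epsilon}_{i}^{2}$ in place of $e_{i}^{2}$ in $\hat{\Omega}$ then gives the corresponding adjusted covariance estimate.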
### Asymptotic Normality of $\hat{\beta}$
Furthermore, under mild assumptions (e.g. Lyapunov's condition that $d_{n}:=\sum_{i}\| x_{i}\epsilon_{i} \|^{2+\delta} / n$ is bounded over $n$ for some $\delta>0$) that allow the Lindeberg-Feller CLT to apply, we have $\pmb{\xi}_{n}\overset{D}{\approx} N(\mathbf{0}, M / n),$and by extension $\hat{\beta}\overset{D}{\approx}N(\beta,B^{-1}MB^{-1} / n).$
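In practice, this normal approximation is combined with the EHW covariance to form Wald-type intervals and tests. A minimal sketch, reusing `X`, `y`, and `ehw_covariance` from the snippets above (all names illustrative):

```python
import numpy as np
from scipy.stats import norm

cov_hat, beta_hat = ehw_covariance(X, y)       # sandwich covariance and estimate
se = np.sqrt(np.diag(cov_hat))                 # robust (EHW) standard errors
z = norm.ppf(0.975)
ci = np.column_stack([beta_hat - z * se, beta_hat + z * se])  # 95% Wald CIs
z_stats = beta_hat / se                        # Wald statistics for H0: beta_j = 0
```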