## OLS as Orthogonal Projection
Similar to [[Least-Squares Functional Approximations#Least Squares Approximation as Projection]], the OLS fit can be thought of as projecting $\mathbf{y}$ onto the column space of $\mathbf{X}$ using the projection/hat matrix $\mathbf{H}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$, so that $\hat{\mathbf{y}}=\mathbf{H}\mathbf{y}$.
Naturally, if $\mathbf{X}_{1},\mathbf{X}_{2}$ are orthogonal blocks (i.e. both have $n$ rows, and $\mathbf{X}_{1}^{T}\mathbf{X}_{2}=\mathbf{0}$, so every column of $\mathbf{X}_{1}$ is orthogonal to every column of $\mathbf{X}_{2}$), then their projection matrices annihilate each other: $H_{1}H_{2}=H_{2}H_{1}=\mathbf{0}_{n \times n},$i.e. nothing is left after projecting a vector onto one space and then onto the other.
Moreover, if we concatenate $\mathbf{X}:=(\mathbf{X}_{1} ~~ \mathbf{X}_{2})$, projecting a vector $\mathbf{y}$ onto $\mathrm{col}(\mathbf{X})$ is equivalent to adding its projections onto $\mathrm{col}(\mathbf{X}_{1})$ and $\mathrm{col}(\mathbf{X}_{2})$ separately: $H=H_{1}+H_{2},$where $H$ is the projection matrix of $\mathbf{X}$.
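A quick numerical check of both identities (a minimal sketch on hypothetical data, with the two blocks made orthogonal by construction via QR):
```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two blocks with mutually orthogonal columns, obtained by splitting an orthonormal basis
Q, _ = np.linalg.qr(rng.normal(size=(n, 5)))
X1, X2 = Q[:, :2], Q[:, 2:]                  # X1^T X2 = 0 by construction

def proj(X):
    """Projection (hat) matrix onto col(X)."""
    return X @ np.linalg.solve(X.T @ X, X.T)

H1, H2 = proj(X1), proj(X2)
H = proj(np.hstack([X1, X2]))

print(np.allclose(H1 @ H2, 0))               # True: orthogonal spaces annihilate
print(np.allclose(H, H1 + H2))               # True: projections onto orthogonal blocks add up
```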
### Two-Step Projections
Suppose $\hat{\beta}_{1,2}$ are the coefficients each block gets in regressing $\mathbf{y}\sim \mathbf{X}$, which we can interpret as the coefficients $\mathbf{y}$ has in the column spaces of $\mathbf{X}_{1,2}$, *modulo the contribution of the other block*.
If instead $\tilde{\beta}_{1,2}$ are the coefficients from the OLS models $\mathbf{y}\sim \mathbf{X}_{1}$ and $\mathbf{y} \sim \mathbf{X}_{2}$ (i.e. $\mathbf{y}$ regressed on each block separately), then we don't expect $\hat{\beta}=\tilde{\beta}$ in general. *How do we quantify this impact of including the other block?*
- If $\mathbf{X}_{1,2}$ are orthogonal, it is reasonable to expect $\hat{\beta}_{1,2}=\tilde{\beta}_{1,2}$ -- *orthogonal blocks should not affect the projection of $\mathbf{y}$ onto each other's column space.*
> [!theorem|*] The Frisch–Waugh–Lovell (FWL) Theorem
> The FWL theorem states that *the full-model OLS coefficient $\hat{\beta}_{2}$ equals the coefficient obtained by first orthogonalizing $\mathbf{X}_{2}$ wrt. $\mathbf{X}_{1}$ and then running the OLS on the orthogonalized block*:
>
> That is, writing $H_{1}:= \mathbf{X}_{1}(\mathbf{X}_{1}^{T}\mathbf{X}_{1})^{-1}\mathbf{X}_{1}^{T}$ as the projection matrix onto $\mathrm{col}(\mathbf{X}_{1})$, and $\tilde{\mathbf{X}}_{2}:= (I-H_{1})\mathbf{X}_{2}$ as the component of $\mathbf{X}_{2}$ orthogonal to $\mathbf{X}_{1}$, we have $\hat{\beta}_{2}=(\tilde{\mathbf{X}}_{2}^{T}\tilde{\mathbf{X}}_{2})^{-1}\tilde{\mathbf{X}}_{2}^{T}\mathbf{y}.$
> Furthermore, we can replace $\mathbf{y}$ with the residual $\mathbf{e}_{1}=(I-H_{1})\mathbf{y}$ of regressing $\mathbf{y} \sim\mathbf{X}_{1}$ without changing the result, since $\tilde{\mathbf{X}}_{2}^{T}H_{1}=\mathbf{0}$.
^4760b7
- This does not say that we can recover the full OLS coefficients $\hat{\beta}_{1,2}$ by doing a two-step regression $\mathbf{y}\sim \mathbf{X}_{1}$ then $\mathbf{e}_{1}\sim \mathbf{X}_{2}$ -- the second step indeed gives $\hat{\beta}_{2}$, but the first step gives $\tilde{\beta}_{1} \ne \hat{\beta}_{1}$ in general (see the check below).
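A minimal numerical check of the theorem and of the remark above, on hypothetical random data:
```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 100, 3, 2
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))
y = rng.normal(size=n)

ols = lambda Z, t: np.linalg.lstsq(Z, t, rcond=None)[0]

# Full regression y ~ (X1, X2)
beta_full = ols(np.hstack([X1, X2]), y)
beta1_hat, beta2_hat = beta_full[:p1], beta_full[p1:]

# FWL: orthogonalize X2 (and optionally y) against X1, then run OLS
H1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2_tilde = X2 - H1 @ X2
e1 = y - H1 @ y

print(np.allclose(beta2_hat, ols(X2_tilde, y)))    # True
print(np.allclose(beta2_hat, ols(X2_tilde, e1)))   # True: y can be replaced by e1
print(np.allclose(beta1_hat, ols(X1, y)))          # False in general: tilde-beta_1 != hat-beta_1
```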
### Variance Estimation in Two-Step Projections
If $\hat{V},\tilde{V}$ are the estimated covariance matrices of $\hat{\beta}_{2}$ from the full regression and from the two-step regression respectively (i.e. $\hat{V}$ is the $(2,2)$ block of the full-model OLS covariance estimator, and $\tilde{V}$ is the covariance estimator from $\mathbf{e}_{1} \sim \mathbf{\tilde{X}}_{2}$), then $(n-p_{1}-p_{2})\hat{V}=(n-p_{2})\tilde{V},$where $p_{1},p_{2}$ are the numbers of columns (predictors) of $\mathbf{X}_{1,2}$: the two fits share the same residuals and the same $(\tilde{\mathbf{X}}_{2}^{T}\tilde{\mathbf{X}}_{2})^{-1}$ factor, differing only in the degrees-of-freedom divisor.
In contrast, the [[OLS with Heteroscedastic noise#The Sandwich Variance Estimator|sandwich variance estimator]] is invariant: $\hat{V}_{\mathrm{EHW}}=\tilde{V}_{\mathrm{EHW}}$.
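A sketch verifying both claims on hypothetical data; the sandwich estimators below use the plain HC0 form (no degrees-of-freedom correction), under which the invariance is exact:
```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 200, 3, 2
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))
y = rng.normal(size=n)

X = np.hstack([X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta                                    # full-model residuals

# Classical covariance of beta_2_hat: (2,2) block, df = n - p1 - p2
V_hat = (e @ e) / (n - p1 - p2) * XtX_inv[p1:, p1:]

# Two-step (FWL) regression e1 ~ X2_tilde, df = n - p2
H1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2t = X2 - H1 @ X2
e1 = y - H1 @ y
B = np.linalg.inv(X2t.T @ X2t)
r = e1 - X2t @ (B @ X2t.T @ e1)                     # same residuals as the full model
V_tilde = (r @ r) / (n - p2) * B

print(np.allclose((n - p1 - p2) * V_hat, (n - p2) * V_tilde))    # True

# HC0 sandwich estimators agree exactly
meat = lambda Z, res: Z.T @ (Z * res[:, None] ** 2)
V_ehw_full = (XtX_inv @ meat(X, e) @ XtX_inv)[p1:, p1:]
V_ehw_fwl = B @ meat(X2t, r) @ B
print(np.allclose(V_ehw_full, V_ehw_fwl))                        # True
```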
### Special Cases of Projections
If $\mathbf{X}_{1}=(1,\dots,1)^{T} \in \mathbb{R}^{n\times 1}$ is the intercept column, then $H_{1}=\frac{1}{n}E_{n}$ (where $E_{n}$ is the $n \times n$ matrix filled with $1$), and *orthogonalizing $\mathbf{X}_{2}$ with respect to $\mathbf{X}_{1}$ is centering each column* (i.e. $x_{ij} \leftarrow x_{ij}-\bar{x}_{j}$, where $\bar{x}_{j}$ is the mean of the $j$th column).
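A minimal check on hypothetical data that this projection residual matches column-wise centering:
```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X2 = rng.normal(size=(n, 4))

ones = np.ones((n, 1))
H1 = ones @ ones.T / n                                # = E_n / n
X2_tilde = (np.eye(n) - H1) @ X2

print(np.allclose(X2_tilde, X2 - X2.mean(axis=0)))    # True: column-wise centering
```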
More generally, if $\mathbf{X}_{1}\in \mathbb{R}^{n \times K}$ holds the **one-hot encoded classes** of a categorical variable (with observations sorted by class, so that the $n_{k}$ observations of class $k$ are contiguous), i.e.
$\mathbf{X}_{1}=\begin{pmatrix}
\mathbf{1}_{n_{1}} & & \\
& \ddots & \\
& & \mathbf{1}_{n_{K}}
\end{pmatrix},~H_{1}=\begin{pmatrix}
\frac{1}{n_{1}}E_{n_{1}} & & \\
& \ddots & \\
& & \frac{1}{n_{K}}E_{n_{K}}
\end{pmatrix},$
where $\mathbf{1}_{n_{k}}$ is the all-ones column of length $n_{k}$, so $H_{1}$ is block diagonal with $k$th block $\frac{1}{n_{k}}E_{n_{k}}$ (the $n_{k}\times n_{k}$ matrix with all entries $\frac{1}{n_{k}}$).
Then $\tilde{\mathbf{X}}_{2}$ is computed by within-class centering: if the $i$th observation is in class $k$, then $x_{ij} \leftarrow x_{ij}-\bar{x}_{jk}$, where $\bar{x}_{jk}$ is the mean of the $j$th column among observations in class $k$.
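A minimal sketch of this within-class centering on hypothetical data, with the one-hot block built explicitly:
```python
import numpy as np

rng = np.random.default_rng(4)
K = 3
labels = np.repeat(np.arange(K), [10, 12, 8])         # sorted class labels with sizes n_k
n = labels.size
X1 = np.eye(K)[labels]                                # one-hot block
X2 = rng.normal(size=(n, 4))

H1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2_tilde = X2 - H1 @ X2

# Direct within-class centering for comparison
centered = X2.copy()
for k in range(K):
    mask = labels == k
    centered[mask] -= X2[mask].mean(axis=0)

print(np.allclose(X2_tilde, centered))                # True
```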
## Confounding: Simpson's Paradox in OLS
As shown in the section of [[Simpson's Paradox]] embedded below, the color/cluster variable $C$ confounds the relationship between $X$ and $Y$.
![[Simpson's Paradox#Theoretical Example Linear Regression]]
Suppose $X,W$ are both scalar-valued; write the design matrix as $\mathbf{Z}=(\mathbf{W},\mathbf{X})$. Then the FWL theorem guarantees that $\hat{\beta}_{Y\sim X~|~W}=\hat{\beta}_{E_{Y}\sim E_{X}},$where $E_{Y},E_{X}$ are the residuals of the two OLS regressions $Y\sim W$ and $X\sim W$ respectively.
Analogous to the [[Simpson's Paradox#Theoretical Example Linearly Correlated RVs|result about RVs]], we may expect the adjusted coefficient $\hat{\beta}_{Y \sim X ~|~ W}$ to have the opposite sign of the unadjusted coefficient $\hat{\beta}_{Y\sim X}$ when the confounding effect of $W$ is strong.
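A minimal simulation sketch of such a sign flip; the data-generating process below is an assumption chosen to create strong confounding (it is not the example from the linked note), and an intercept is included in each regression:
```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
W = rng.normal(size=n)                        # confounder
X = W + 0.5 * rng.normal(size=n)              # exposure, strongly driven by W
Y = X - 4.0 * W + rng.normal(size=n)          # adjusted effect of X on Y is +1

ols = lambda Z, t: np.linalg.lstsq(Z, t, rcond=None)[0]
ones = np.ones(n)

b_marg = ols(np.column_stack([ones, X]), Y)[1]            # unadjusted slope: negative
b_adj = ols(np.column_stack([ones, W, X]), Y)[2]          # adjusted slope: close to +1

# FWL route: residualize Y and X against (1, W), then regress residuals on residuals
Z1 = np.column_stack([ones, W])
H1 = Z1 @ np.linalg.solve(Z1.T @ Z1, Z1.T)
E_Y, E_X = Y - H1 @ Y, X - H1 @ X
b_fwl = (E_X @ E_Y) / (E_X @ E_X)

print(round(b_marg, 2), round(b_adj, 2))      # opposite signs
print(np.isclose(b_adj, b_fwl))               # True: FWL recovers the adjusted slope
```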
## Missing Variable Bias in OLS
Given a design matrix $\mathbf{X}=(\mathbf{X}_{1},\mathbf{X}_{2})$, we can fit OLS models $\begin{align*}
\mathbf{y}&= \mathbf{X}_{1}\hat{\beta}_{1}+\mathbf{X}_{2}\hat{\beta}_{2}+\mathbf{e}\\
\mathbf{y}&= \mathbf{X}_{2}\tilde{\beta}_{2}+\mathbf{\tilde{e}}\\
\mathbf{X}_{1}&= \mathbf{X}_{2}\delta+\mathbf{E},
\end{align*}$where $\mathbf{e},\mathbf{\tilde{e}},\mathbf{E}$ are the respective OLS residuals (the last regression is run column-wise, so $\delta$ is a $p_{2}\times p_{1}$ coefficient matrix and $\mathbf{E}$ an $n\times p_{1}$ residual matrix).
> [!theorem|*] Cochran's Theorem
> Analogous to the chain rule, $\tilde{\beta}_{2}=\hat{\beta}_{2}+\delta\hat{\beta}_{1}.$ That is, the coefficient of $\mathbf{X}_{2}$ in the short regression picks up, on top of $\hat{\beta}_{2}$, the part of $\mathbf{X}_{1}$'s contribution that $\mathbf{X}_{2}$ can explain.
^7f6521
- Note that this is a purely algebraic identity relating the three OLS fits; it assumes no deterministic or stochastic model for the data (see the check below).
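A numerical check of the identity on hypothetical data:
```python
import numpy as np

rng = np.random.default_rng(6)
n, p1, p2 = 100, 2, 3
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))
y = rng.normal(size=n)                   # arbitrary response: the identity is purely algebraic

ols = lambda Z, t: np.linalg.lstsq(Z, t, rcond=None)[0]

beta_full = ols(np.hstack([X1, X2]), y)
beta1_hat, beta2_hat = beta_full[:p1], beta_full[p1:]
beta2_tilde = ols(X2, y)                 # short regression, X1 omitted
delta = ols(X2, X1)                      # (p2 x p1) coefficients of the column-wise fit X1 ~ X2

print(np.allclose(beta2_tilde, beta2_hat + delta @ beta1_hat))   # True
```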