> [!info]
> For a general treatment of statistical tests, refer to the notes on [[Hypothesis Testing|hypothesis testing]] and [[Confidence Intervals|confidence intervals]] and [[Confidence Intervals, Tests, P-Values]].
Consider modelling a numerical response $Y$ with a linear combination of predictors, $X\hat{\beta}$, where $X$ is the predictor (either scalar or vector-valued) and $\hat{\beta}$ its [[Linear Regression Methods#Least Squares Linear Regression|OLS coefficients]] in the linear combination.
## Distribution of OLS Coefficients
To study the distribution of the coefficients, we need to model the distribution of $X,Y$ (at least $Y ~|~ X$). The most basic model is to *treat the predictors $\mathbf{X}$ as constants and assume the responses $y_{1},\dots,y_{N}$ are all independent with shared variance $\sigma^{2}$,* so their covariance matrix is just
$\mathrm{Cov}(\mathbf{y})=\sigma^2 I=\mathrm{diag}(\sigma^{2},\dots,\sigma^{2}).$
One way to achieve this is with the model
$Y_{i}=f(\mathbf{x}_{i})+\epsilon_{i}$
where $\epsilon_{i}\overset{\mathrm{iid.}}{\sim} N(0,\sigma^{2})$ are **additive Gaussian errors**, and $f$ is some deterministic relationship.
- In this case the OLS coefficients $\hat{\beta}$ are also the MLE of $\beta$, since the log-likelihood of the Gaussian model is proportional to the RSS.
Then the identity $\mathrm{Cov}(\mathbf{A}\mathbf{y})=\mathbf{A}\,\mathrm{Cov}(\mathbf{y})\mathbf{A}^{T}$, applied to $\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$, gives the covariance of the least squares coefficients:
$\mathrm{Cov}(\hat{\beta})=[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}] (\sigma^{2} I) [(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]^{T}= \sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}.$
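As a quick sanity check, the covariance formula can be verified by simulation. A minimal `R` sketch (the design, $\sigma$ and $\beta$ below are made up purely for illustration):

```r
# Simulate repeated responses for a fixed design and compare the empirical
# covariance of the OLS coefficients with sigma^2 (X'X)^{-1}.
set.seed(1)
n <- 100; sigma <- 2
X <- cbind(1, matrix(rnorm(n * 2), n, 2))   # fixed design with intercept
beta <- c(1, -2, 0.5)

beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(solve(t(X) %*% X, t(X) %*% y))       # OLS coefficients (X'X)^{-1} X'y
})

cov(t(beta_hats))            # empirical covariance across simulations
sigma^2 * solve(t(X) %*% X)  # theoretical covariance
```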
### The Gauss-Markov Model
Deriving the full distribution of $\hat{\beta}$, however, requires further assumptions on $f$:
> [!definition|*] Gauss-Markov Model
> The **Gauss-Markov model** assumes that:
> - The true relationship is $Y=X\beta+\epsilon$;
> - **Homoscedastic** and independent errors: i.e. $\epsilon_{i} \overset{\mathrm{iid.}}{\sim}N(0, \sigma^{2})$ for some *constant, shared $\sigma$*.

> [!theorem|*] Distribution of the OLS RSS and coefficient estimates
> With $n$ independent samples of $(\mathbf{X}, \mathbf{y})$ and additive Gaussian errors $\epsilon_{i}\overset{\mathrm{iid.}}{\sim}N(0, \sigma^{2})$, in the OLS $Y \sim X$,
> - The residuals $\mathbf{e}$ (and by extension the RSS) and fitted coefficients $\hat{\beta}$ are independent,
> - Their distributions are $\begin{align*}
\mathrm{RSS}&\sim \sigma^{2}\chi^{2}_{n-p},\\ \hat{\beta}&\sim N(\beta, \sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}).
\end{align*}$
> ---
> > [!info]- Corollary for one-variable t-statistics
> > For a sample $\mathbf{y} \overset{\mathrm{iid.}}{\sim} N(\mu, \sigma^{2})$, the t-statistic follows the t-distribution: $T=\frac{\bar{y}-\mu}{S / \sqrt{ n }} \sim t_{n-1},$ by applying the result to the null model $\mathbf{y} \sim (1,\dots,1)=: \mathbf{X}$ and $\mu=:\beta$.
> >
> > > [!proof]- Proof (OG lecture notes proof)
> > > Proof: WLOG assume the samples are $Z_{1,\dots,n} \overset{iid.}{\sim} N(0,1)$: if the standardized $Z_{i}=\frac{X_{i}-\mu}{\sigma}$ have independent sample mean and sample variance, then so do $X_{1,\dots,n}$.
> > >
> > > Now linearly map $Z_{1,\dots,n}$ to $Y_{1,\dots ,n} \overset{iid.}{\sim} N(0,1)$ so that: (1) $Y_1=\sqrt{ n }\bar{Z}$ records the sample mean, and (2) the sample variance $S^{2}_{Z}$ is only dependent on $Y_{2,\dots,n}$ (hence independent of the mean).
> > >
> > > This transformation is done by $\mathbf{Y}=A\mathbf{Z}$, where (1) the first row of $A$ is $\mathbf{a}_{1}=\left( \frac{1}{\sqrt{ n }},\dots, \frac{1}{\sqrt{ n }} \right)$ so that $Y_{1}=\sqrt{ n }\,\bar{Z}$, and (2) $A$ is orthogonal, so that $Y_{1,\dots ,n} \overset{iid.}{\sim} N(0,1)$: their covariance matrix is $AA^{T}=I$ (and a diagonal covariance matrix is equivalent to independence for multivariate Gaussians).
> > >
> > > This is possible by first taking $\mathbf{a}_{1}$ as required, and filling the other rows by extending $\mathbf{a}_{1}$ to an orthonormal basis of $\mathbb{R}^{n}$.
> > >
> > > Then the sample variance is only dependent on $Y_{2,\dots,n}$: $\begin{align*}
> > (n-1)S_{Z}^{2}&=\sum_{i}(Z_{i}-\bar{Z})^{2}\\
> > &= \sum_{i}Z_{i}^{2}-n\bar{Z}^{2}\\
> > &= \sum_{i}Y_{i}^{2}-Y_{1}^{2} \\
> > &= \sum_{i \ge 2} Y_{i}^{2}
> > \end{align*}$where $\sum_{i}Y_{i}^{2}=\mathbf{Y}^{T}\mathbf{Y}=\mathbf{Z}^{T}A^{T}A\mathbf{Z}=\mathbf{Z}^{T}\mathbf{Z}=\sum_{i}Z_{i}^{2}$.
> > >
> > > Hence $\bar{Z}=Y_{1} / \sqrt{ n }$ and $(n-1)S_{Z}^{2}=\sum_{i \ge 2}Y_{i}^{2} \sim \chi^{2}_{n-1}$ are independent. Then the t-statistic follows the t-distribution by definition.
> >
>
> > [!proof]-
> > Consider an orthonormal basis $\{ u_{1},\dots,u_{p} \}$ of the column space of $\mathbf{X}$, and extend it to an orthonormal basis $\{ u_{1},\dots,u_{n} \}$ of $\mathbb{R}^{n}$. Collect the basis vectors into a matrix $\mathbf{U}=[\mathbf{U}_{p}, \mathbf{U}_{n-p}]$.
> >
> > Since $\mathbf{H}$ is the (symmetric) projection onto the column space of $\mathbf{X}$, $\mathbf{H}u_{j} = u_{j}\,\mathbf{1}_{j \le p}$. In matrix form,
> > $\mathbf{U}^{T}\mathbf{H}=\begin{bmatrix}
> \mathbf{U}^{T}_{p} \\ \mathbf{0}
> \end{bmatrix}.$
> >
> > Therefore, $\begin{align*}\mathbf{U}^{T}\hat{\mathbf{y}}&= \mathbf{U}^{T}\mathbf{Hy}=\begin{bmatrix} U^{T}_{p} \\ \mathbf{0} \end{bmatrix}\mathbf{y},\\[0.4em]
> \mathbf{U}^{T} \mathbf{e} &= \mathbf{U}^{T} (I-\mathbf{H})\mathbf{y}\\
> &= \left(\mathbf{U}^{T}-\begin{bmatrix}
> U^{T}_{p} \\ \mathbf{0}
> \end{bmatrix}\right)\mathbf{y} \\
> &= \begin{bmatrix}
> \mathbf{0} \\ U_{n-p}^{T}
> \end{bmatrix}\mathbf{y},
> \end{align*}$ so the two are independent, being functions of disjoint entries of $\mathbf{U}^{T}\mathbf{y}$, whose entries are independent Gaussians ($\mathrm{Cov}(\mathbf{U}^{T}\mathbf{y})=\sigma^{2}\mathbf{U}^{T}\mathbf{U}=\sigma^{2}I$). Furthermore, because $\mathbf{U}^{T}$ is invertible, $\hat{\mathbf{y}}$ and $\mathbf{e}$ themselves are also independent.
> >
> > Furthermore, $\mathbf{U}_{n-p}^{T}\mathbf{e}=\mathbf{U}_{n-p}^{T}\mathbf{y}$ has distribution $N(\mathbf{U}_{n-p}^{T}\mathbf{X}\beta,\sigma^{2}I_{n-p})=N(\mathbf{0}, \sigma^{2}I_{n-p})$, as the columns of $\mathbf{U}_{n-p}$ are orthogonal to the column space of $\mathbf{X}$. Therefore $\| \mathbf{e} \| ^{2}=\| \mathbf{U}^{T}\mathbf{e} \|^{2}=\| \mathbf{U}_{n-p}^{T}\mathbf{y} \|^{2}\sim \sigma^{2}\chi^{2}_{n-p}.$
>
^038a42
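The theorem is easy to probe by simulation. A hedged `R` sketch (an arbitrary one-predictor model with made-up values): the scaled RSS should line up with the $\chi^{2}_{n-p}$ quantiles, and the RSS should be uncorrelated with the fitted slope, consistent with independence.

```r
# Check that RSS / sigma^2 ~ chi^2_{n-p}, and that RSS is empirically
# uncorrelated with the fitted coefficient.
set.seed(2)
n <- 50; sigma <- 1.5
x <- runif(n)

sims <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(rss = sum(resid(fit)^2), slope = unname(coef(fit)["x"]))
})

# QQ-plot of RSS / sigma^2 against chi^2_{n-p} quantiles (p = 2 here)
qqplot(qchisq(ppoints(2000), df = n - 2), sims["rss", ] / sigma^2)
abline(0, 1)

cor(sims["rss", ], sims["slope", ])   # should be close to 0
```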
### Optimality of Least Squares
> [!bigidea]
> **Gauss-Markov theorem** guarantees that the OLS estimators are "the best" among all estimators of $\beta$ that are linear in $\mathbf{y}$ (under the Gauss-Markov model).
>
> However, biased or non-linear estimators might have lower [[Statistical Inference#^cf7800|MSE]].
As far as unbiased linear estimators are concerned, the OLS estimators are the best for the linear model.
> [!theorem|*] Gauss-Markov
>
> Under the Gauss-Markov model, the OLS coefficients $\hat{\beta}$ are the **BLUE (best linear unbiased estimator)** of $\beta$.
>
> For vector-valued $\beta$, this means that any other linear unbiased estimator $\tilde{\beta}=B\mathbf{y}$ satisfies $\mathrm{Cov}(\hat{\beta}) \preceq \mathrm{Cov}(\tilde{\beta}),$ where $\preceq$ is the Loewner order on positive semi-definite matrices.
In general, the MSE can be decomposed as
$\begin{align*}
\mathrm{MSE}(\hat{\beta})&= \mathbb{E}[(\beta-\hat{\beta})^{2}]\\
&= \underbrace{\mathrm{Var}(\hat{\beta})}_{\text{variance}} + (\underbrace{\mathbb{E}[\beta-\hat{\beta}]}_{\text{bias}})^{2}
\end{align*}$
which represents the [[Bias-Variance Tradeoff]].
Hence, if a biased estimator $\tilde{\beta}$ can sacrifice unbiasedness for a large reduction in variance, it can have a better overall $\text{MSE}$ than the unbiased OLS estimator.
Examples include [[Linear Regression Methods#Shrinkage Methods|shrinkage methods]] (e.g. ridge regression), variable selection (e.g. forward selection), and dimension reduction methods (e.g. [[Principal Component Analysis|PCA]]).
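To see the tradeoff concretely, here is a hedged `R` sketch (a made-up, deliberately ill-conditioned design; the ridge penalty $\lambda$ is picked by hand rather than tuned) comparing the average squared estimation error of OLS against a manually computed ridge estimator $(\mathbf{X}^{T}\mathbf{X}+\lambda I)^{-1}\mathbf{X}^{T}\mathbf{y}$:

```r
# With strongly correlated predictors, the biased ridge estimator can have a
# lower mean squared error than unbiased OLS.
set.seed(3)
n <- 40; sigma <- 3; lambda <- 5
beta <- c(2, 0, 1)

errs <- replicate(2000, {
  z <- rnorm(n)
  X <- cbind(z + rnorm(n, sd = 0.1),   # two nearly collinear columns
             z + rnorm(n, sd = 0.1),
             rnorm(n))
  y <- X %*% beta + rnorm(n, sd = sigma)
  ols   <- solve(t(X) %*% X, t(X) %*% y)
  ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)
  c(ols = sum((ols - beta)^2), ridge = sum((ridge - beta)^2))
})
rowMeans(errs)   # average squared estimation error of each estimator
```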
## Testing Coefficient Significance
> [!tldr]
> Assuming additive Gaussian noise, the OLS RSS is $\chi^{2}_{n-p}$, so along with the normality of the coefficient estimates, we can construct:
> - **t-statistics** to test the significance of one of the coefficients;
> - **F-statistics** to test multiple coefficients in one go;
### T-Tests
Since $\mathrm{RSS}\sim \sigma^{2}\chi^{2}_{n-p}$, the unbiased estimator $s^{2}$ of $\sigma^{2}$ is
$s^{2}:=\frac{\mathrm{RSS}}{\mathbb{E}[\chi^{2}_{n-p}]}=\frac{\mathrm{RSS}}{n-p},$
and $s / \sigma \sim \sqrt{ \chi^{2}_{n-p} / (n-p)}$, independently of $\hat{\beta}$.
The above lets us turn the distribution of $\hat{\beta}$ into a usable pivot: the Gaussian distribution of $\hat{\beta}_{j}$ involves $\sigma$, giving the z-statistic $z:= \frac{\hat{\beta}_{j}-\beta_{j}}{\sigma \sqrt{ (\mathbf{X}^{T}\mathbf{X})^{-1}_{jj} }} \sim N(0,1),$ which is not directly usable because $\sigma$ is unknown. However, dividing it by the independent $s / \sigma$ (so $\sigma$ is replaced by $s$) gives the t-statistic:
> [!definition|*] The t-Statistic
> The **t-statistic** for testing $H_{0}:\beta_{j}=b$ (for example, $b=0$ for testing the absence of an effect) is $t:= \frac{\hat{\beta}_{j}-b}{s\sqrt{ (X^{T}X)^{-1}_{jj} }}.$ Under the null hypothesis, it follows the [[T Distribution|$t_{n-p}$ distribution]].
This leads to the following definition:
> [!definition|*] Standard Error of OLS Coefficients
> The **standard error** of a coefficient $\hat{\beta}_{j}$ is $\mathrm{se}(\hat{\beta}_{j}):=s\sqrt{ (\mathbf{X}^{T}\mathbf{X})^{-1}_{jj} }.$This is the value reported in the `summary` call on an OLS object in `R`.
We can then construct tests with standard tools like the p-value and CIs with quantiles of the t-distribution.
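For instance, the table printed by `summary` can be reproduced from these formulas directly. A sketch using the built-in `mtcars` data (the particular formula `mpg ~ wt + hp` is just an example):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

X    <- model.matrix(fit)
s2   <- sum(resid(fit)^2) / df.residual(fit)    # s^2 = RSS / (n - p)
se   <- sqrt(s2 * diag(solve(t(X) %*% X)))      # se(beta_j)
tval <- coef(fit) / se                          # t-statistics for H0: beta_j = 0
pval <- 2 * pt(-abs(tval), df = df.residual(fit))

cbind(se, tval, pval)       # should match columns 2-4 of coef(summary(fit))
coef(summary(fit))
```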
Another use of the t-statistic is to *check whether the effects of two different classes $c_{1},c_{2}$ of a categorical variable $X_{j}$ differ significantly.*
Suppose $c_{1},c_{2}$ get corresponding coefficients estimated as $\hat{\beta}_{i},\hat{\beta}_{j}$. Reading off the estimated covariance matrix of $\hat{\beta}$ gives $\widehat{\mathrm{Cov}}(\hat{\beta}_{i},\hat{\beta}_{j})=s^{2}(X^{T}X)_{ij}^{-1}$, so $\begin{align*}
\widehat{\mathrm{Var}}(\hat{\beta}_{i}-\hat{\beta}_{j})&= \widehat{\mathrm{Var}}(\hat{\beta}_{i})+\widehat{\mathrm{Var}}(\hat{\beta}_{j})-2\widehat{\mathrm{Cov}}(\hat{\beta}_{i},\hat{\beta}_{j})\\
&= s^{2}\left[ (X^{T}X)^{-1}_{ii}+(X^{T}X)^{-1}_{jj} - 2(X^{T}X)^{-1}_{ij} \right].
\end{align*}$
> [!theorem|*] T-Statistic of Difference in Effects
> Under the null hypothesis $H_{0}:\beta_{i}=\beta_{j}$, the difference $\hat{\beta}_{i}-\hat{\beta}_{j} \sim N(0, \mathrm{Var}(\hat{\beta}_{i}-\hat{\beta}_{j}))$ and is independent of $s^{2}$; therefore its t-statistic $t:= \frac{\hat{\beta}_{i}-\hat{\beta}_{j}}{s\sqrt{ (X^{T}X)^{-1}_{ii}+(X^{T}X)^{-1}_{jj}-2(X^{T}X)^{-1}_{ij} }}$ follows $t_{n-p}$ under $H_{0}$.
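In `R`, the estimated covariance matrix $s^{2}(X^{T}X)^{-1}$ is available via `vcov`, so this test can be run without forming $X$ by hand. A sketch on the built-in `chickwts` data (the coefficient names below assume the default treatment contrasts):

```r
fit <- lm(weight ~ feed, data = chickwts)
V   <- vcov(fit)                         # estimated covariance s^2 (X'X)^{-1}

i <- "feedlinseed"; j <- "feedsoybean"   # two levels of the factor `feed`
diff <- coef(fit)[i] - coef(fit)[j]
se   <- sqrt(V[i, i] + V[j, j] - 2 * V[i, j])
tval <- diff / se
2 * pt(-abs(tval), df = df.residual(fit))   # p-value for H0: beta_i = beta_j
```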
### The F-Statistic
Suppose we wish to test for the exclusion of $k$ coefficients, WLOG ${\beta}_{p-k+1},\dots,{\beta}_{p}$: $\begin{align*}
H_{0}&: {\beta}_{p-k+1}=\cdots={\beta}_{p}=0;\\
H_{1}&: \lnot~ H_{0}.
\end{align*}$
The t-statistic developed in the previous section fails to generalize to testing multiple coefficients in one go.
> [!definition|*] F-Distribution
> The **F-distribution** $F_{m,n}$ is the distribution of the ratio $\frac{\chi^{2}_{m} / m}{\chi^{2}_{n} / n},$ where the two $\chi^{2}$ variables are independent.
This is very handy: *we can compare the $\chi^{2}$-distributed RSS (or quantities on that scale) of two models* via their ratio, which is $\sim F$ when the two are independent.
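A quick simulation check of the definition (in `R`, with arbitrarily chosen degrees of freedom):

```r
# A ratio of independent chi-squared variables, each divided by its degrees
# of freedom, should follow the F distribution.
set.seed(4)
m <- 3; n <- 20
f <- (rchisq(1e5, df = m) / m) / (rchisq(1e5, df = n) / n)
ks.test(f, pf, df1 = m, df2 = n)   # a large p-value is expected, since the null holds
```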
> [!connection] F-statistic and Likelihood Ratio
> Assuming Gaussian noises, the F-statistic is on the same scale as the [[Likelihood Ratios#The Likelihood Ratio Statistic|log-likelihood ratio]] (up to proportionality).
> - For example, the likelihood ratio statistic of two nested Gaussian models (indexed $0,1$) is a monotone function of $\mathrm{RSS}_{0} / \mathrm{RSS}_{1}$: with $\sigma^{2}$ profiled out, the log-likelihood ratio is $\frac{n}{2}\log(\mathrm{RSS}_{0}/\mathrm{RSS}_{1})$.
>
> However, likelihood ratio tests based on [[Likelihood Ratios#^cd7692|Wilks' theorem]] are only asymptotically correct, whereas the F-test below is exact under the Gaussian model.
To compare two models $\mathcal{M}_{0,1}$ with numbers of coefficients $p_{0} < p_{1}$, we might hope to directly compute $\frac{\mathrm{RSS}_{0}/(n-p_{0})}{\mathrm{RSS}_{1}/(n-p_{1})} \overset{?}{\sim}F_{n-p_{0},\,n-p_{1}},$ but this does not hold in general, especially when $\mathcal{M}_{0,1}$ are nested, because *there is no guarantee that the two RSS are independent $\chi^{2}$ variables*.
Luckily, if $\mathcal{M}_{0,1}$ are nested, the RSS have "orthogonal increments", i.e. $\mathrm{RSS}_{1}$ and $\mathrm{RSS}_{0}-\mathrm{RSS}_{1}$ are independent, assuming $H_{0}$ is true:
> [!theorem|*] Independence of RSS increments
> Assuming $H_{0}: {\beta}_{p-k+1}=\cdots={\beta}_{p}=0$ holds, $\mathrm{RSS}_{1} \sim \sigma^{2}\chi^{2}_{n-p}$ (from the full model $\mathcal{M}_{1}$) and $(\mathrm{RSS}_{0}-\mathrm{RSS}_{1})\sim \sigma^{2}\chi^{2}_{k}$ (the increment over the reduced model $\mathcal{M}_{0}$) are independent.
>
> Therefore, $F:= \frac{(\mathrm{RSS}_{0}-\mathrm{RSS}_{1})/k}{\mathrm{RSS}_{1}/(n-p)}\sim F_{k, n-p}.$
>
> > [!proof]-
> > Consider the truncated design matrix $\tilde{\mathbf{X}} \in \mathbb{R}^{n \times (p-k)}$, obtained by dropping the last $k$ columns of $\mathbf{X}$ (those corresponding to $\beta_{p-k+1},\dots,\beta_{p}$).
> >
> > Now, following the proof of [[#^038a42]], find an orthonormal basis $\{ u_{1},\dots ,u_{p-k} \}$ of the column space of $\tilde{\mathbf{X}}$, collected into a matrix $\tilde{\mathbf{U}}$. Extend it to $\{ u_{1},\dots,u_{p} \}$ spanning the column space of $\mathbf{X}$, and lastly to $\{ u_{1},\dots, u_{n} \}$ spanning $\mathbb{R}^{n}$, giving $\mathbf{U}$.
> >
> > Writing $\begin{align*}
> \mathbf{z}:= \mathbf{U}^{T}\mathbf{y} &\sim N(\mathbf{U}^{T}\mathbf{X}\beta, \sigma^{2}I),\\
> \mathbf{U}^{T}\mathbf{X}\beta&= (\ast,\dots,\ast,\underbrace{0,\dots,0}_{n-p\ \text{zeros}})^{T}\\
> &= (\ast,\dots,\ast,\underbrace{0,\dots,0}_{n-p+k\ \text{zeros}})^{T},
> \end{align*}$where the entries beyond the $p$-th vanish because the columns of $\mathbf{X}$ lie in the span of $u_{1},\dots,u_{p}$, and the last equality follows from the $H_{0}$ assumption that the last $k$ entries of $\beta$ are $0$ (so $\mathbf{X}\beta=\tilde{\mathbf{X}}\tilde{\beta}$ lies in the span of $u_{1},\dots,u_{p-k}$). Therefore, the entries of $\mathbf{z}$ are mutually independent (a diagonal covariance matrix is sufficient for independence between Gaussians), with $z_{i}\sim N(0,\sigma^{2})$ for $i>p-k$ under $H_{0}$.
> >
> > The same argument as before shows that the residuals $\mathbf{e}_{p}$ of the full model and $\mathbf{e}_{p-k}$ of the reduced model satisfy $\begin{align*}
> \mathbf{U}^{T}\mathbf{e}_{p}&= \begin{bmatrix}
> \mathbf{0}_{p \times n} \\ \mathbf{U}_{n-p}^{T}
> \end{bmatrix}\mathbf{y}=(0,\dots,0,z_{p+1},\dots,z_{n})^{T},\\
> \mathbf{U}^{T}\mathbf{e}_{p-k}&= \begin{bmatrix}
> \mathbf{0}_{(p-k) \times n} \\ \mathbf{U}_{n-p+k}^{T}
> \end{bmatrix}\mathbf{y}=(0,\dots,0,z_{p-k+1},\dots,z_{n})^{T},
> \end{align*}$so that $\mathrm{RSS}_{1}=\| \mathbf{e}_{p} \|^{2}=\sum_{i>p}z_{i}^{2}$ and $\mathrm{RSS}_{0}-\mathrm{RSS}_{1}=\| \mathbf{e}_{p-k} \|^{2}-\| \mathbf{e}_{p} \|^{2}=\sum_{i=p-k+1}^{p}z_{i}^{2}$. These depend on disjoint sets of the independent $z_{i}$, so they are independent; and since each $z_{i}$ involved is $N(0,\sigma^{2})$ under $H_{0}$, we get $\mathrm{RSS}_{1}\sim \sigma^{2}\chi^{2}_{n-p}$ and $\mathrm{RSS}_{0}-\mathrm{RSS}_{1}\sim \sigma^{2}\chi^{2}_{k}$.
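In practice, the whole nested-model test is a single `anova` call on two fitted models, and the F-statistic can also be assembled by hand from the RSS increment as a check. A sketch on the built-in `mtcars` data (the particular pair of nested models is arbitrary):

```r
fit0 <- lm(mpg ~ wt, data = mtcars)              # reduced model M0
fit1 <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # full model M1, k = 2 extra terms

anova(fit0, fit1)                                # nested-model F-test

# The same F-statistic from the RSS increment:
rss0 <- sum(resid(fit0)^2); rss1 <- sum(resid(fit1)^2)
k <- 2; df1 <- df.residual(fit1)                 # n - p of the full model
F_stat <- ((rss0 - rss1) / k) / (rss1 / df1)
pf(F_stat, k, df1, lower.tail = FALSE)
```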