> [!info]
> For a general treatment of statistical tests, refer to the notes on [[Hypothesis Testing|hypothesis testing]] and [[Confidence Intervals|confidence intervals]] and [[Confidence Intervals, Tests, P-Values]].
Consider modelling a numerical response $Y$ with a linear combination of predictors, $X\hat{\beta}$, where $X$ is the predictor (either scalar or vector-valued) and $\hat{\beta}$ its [[Linear Regression Methods#Least Squares Linear Regression|OLS coefficients]] in the linear combination.
## Distribution of OLS Coefficients
To study the distribution of the coefficients, we need to model the distribution of $X,Y$ (at least $Y ~|~ X$). The most basic model is to *treat the predictors $\mathbf{X}$ as constants and assume the responses $y_{1},\dots,y_{N}$ are all independent with shared variance $\sigma^{2}$,* so their covariance matrix is just
$\mathrm{Cov}(\mathbf{y})=\sigma^2 I=\mathrm{diag}(\sigma^{2},\dots,\sigma^{2}).$
One way to achieve this is with the model
$Y_{i}=f(\mathbf{x}_{i})+\epsilon_{i}$
where $\epsilon_{i}\overset{\mathrm{iid.}}{\sim} N(0,\sigma^{2})$ are **additive Gaussian errors**, and $f$ is some deterministic relationship.
- In this case the OLS coefficients $\hat{\beta}$ are also the MLE of $\beta$, since the log-likelihood of the Gaussian model is proportional to the RSS.
Then the identity $\mathrm{Cov}(\mathbf{A}\mathbf{y})=\mathbf{A}\,\mathrm{Cov}(\mathbf{y})\mathbf{A}^{T}$, applied to $\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$, gives the covariance of the least squares coefficients:
$\mathrm{Cov}(\hat{\beta})=[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}] (\sigma^{2} I) [(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}]^{T}= \sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}.$
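As a quick sanity check, the covariance formula can be verified by simulation. A minimal `R` sketch (the design, $\sigma$ and $\beta$ below are made up purely for illustration):

```r
# Simulate repeated responses for a fixed design and compare the empirical
# covariance of the OLS coefficients with sigma^2 (X'X)^{-1}.
set.seed(1)
n <- 100; sigma <- 2
X <- cbind(1, matrix(rnorm(n * 2), n, 2))   # fixed design with intercept
beta <- c(1, -2, 0.5)

beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(solve(t(X) %*% X, t(X) %*% y))       # OLS coefficients (X'X)^{-1} X'y
})

cov(t(beta_hats))            # empirical covariance across simulations
sigma^2 * solve(t(X) %*% X)  # theoretical covariance
```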
### The Gauss-Markov Model
Deriving the full distribution of $\hat{\beta}$, however, requires further assumptions on $f$:
> [!definition|*] Gauss-Markov Model
> The **Gauss-Markov model** assumes that:
> - The true relationship is $Y=X\beta+\epsilon$;
> - **Homoscedastic** and independent errors: i.e. $\epsilon_{i} \overset{\mathrm{iid.}}{\sim}N(0, \sigma^{2})$ for some *constant, shared $\sigma$*.

> [!theorem|*] Distribution of the OLS RSS and coefficient estimates
> With $n$ independent samples of $(\mathbf{X}, \mathbf{y})$ and additive Gaussian errors $\epsilon_{i}\overset{\mathrm{iid.}}{\sim}N(0, \sigma^{2})$, in the OLS $Y \sim X$,
> - The residuals $\mathbf{e}$ (and by extension the RSS) and fitted coefficients $\hat{\beta}$ are independent,
> - Their distributions are $\begin{align*}
\mathrm{RSS}&\sim \sigma^{2}\chi^{2}_{n-p},\\ \hat{\beta}&\sim N(\beta, \sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}).
\end{align*}$
> ---
> > [!info]- Corollary for one-variable t-statistics
> > For a sample $\mathbf{y} \overset{\mathrm{iid.}}{\sim} N(\mu, \sigma^{2})$, the t-statistic follows the t-distribution: $T=\frac{\bar{y}-\mu}{S / \sqrt{ n }} \sim t_{n-1},$ by applying the result to the null model $\mathbf{y} \sim (1,\dots,1)=: \mathbf{X}$ and $\mu=:\beta$.
> >
> > > [!proof]- Proof (OG lecture notes proof)
> > > Proof: WLOG assume the samples are $Z_{1,\dots,n} \overset{iid.}{\sim} N(0,1)$: if the standardized $Z_{i}=\frac{X_{i}-\mu}{\sigma}$ have independent sample mean and sample variance, then so do $X_{1,\dots,n}$.
> > >
> > > Now linearly map $Z_{1,\dots,n}$ to $Y_{1,\dots ,n} \overset{iid.}{\sim} N(0,1)$ so that: (1) $Y_1=\sqrt{ n }\bar{Z}$ records the sample mean, and (2) the sample variance $S^{2}_{Z}$ is only dependent on $Y_{2,\dots,n}$ (hence independent of the mean).
> > >
> > > This transformation is done by $\mathbf{Y}=A\mathbf{Z}$, where (1) the first row of $A$ is $\mathbf{a}_{1}=\left( \frac{1}{\sqrt{ n }},\dots, \frac{1}{\sqrt{ n }} \right)$ so that $Y_{1}=\sqrt{ n }\,\bar{Z}$, and (2) $A$ is orthogonal, so that $Y_{1,\dots ,n} \overset{iid.}{\sim} N(0,1)$: their covariance matrix is $AA^{T}=I$ (and a diagonal covariance matrix is equivalent to independence for multivariate Gaussians).
> > >
> > > This is possible by first taking $\mathbf{a}_{1}$ as required, and filling the other rows by extending $\mathbf{a}_{1}$ to an orthonormal basis of $\mathbb{R}^{n}$.
> > >
> > > Then the sample variance is only dependent on $Y_{2,\dots,n}$: $\begin{align*}
> > (n-1)S_{Z}^{2}&=\sum_{i}(Z_{i}-\bar{Z})^{2}\\
> > &= \sum_{i}Z_{i}^{2}-n\bar{Z}^{2}\\
> > &= \sum_{i}Y_{i}^{2}-Y_{1}^{2} \\
> > &= \sum_{i \ge 2} Y_{i}^{2}
> > \end{align*}$where $\sum_{i}Y_{i}^{2}=\mathbf{Y}^{T}\mathbf{Y}=\mathbf{Z}^{T}A^{T}A\mathbf{Z}=\mathbf{Z}^{T}\mathbf{Z}=\sum_{i}Z_{i}^{2}$.
> > >
> > > Hence $\bar{Z}=Y_{1} / \sqrt{ n }$ and $(n-1)S_{Z}^{2}=\sum_{i \ge 2}Y_{i}^{2} \sim \chi^{2}_{n-1}$ are independent. Then the t-statistic follows the t-distribution by definition.
> >
>
> > [!proof]-
> > Consider an orthonormal basis $\{ u_{1},\dots,u_{p} \}$ of the column space of $\mathbf{X}$, and extend it to an orthonormal basis $\{ u_{1},\dots,u_{n} \}$ of $\mathbb{R}^{n}$. Collect the basis vectors into a matrix $\mathbf{U}=[\mathbf{U}_{p}, \mathbf{U}_{n-p}]$.
> >
> > Since $\mathbf{H}$ is the (symmetric) projection onto the column space of $\mathbf{X}$, $\mathbf{H}u_{j} = u_{j}\,\mathbf{1}_{j \le p}$. In matrix form,
> > $\mathbf{U}^{T}\mathbf{H}=\begin{bmatrix}
> \mathbf{U}^{T}_{p} \\ \mathbf{0}
> \end{bmatrix}.$
> >
> > Therefore, $\begin{align*}\mathbf{U}^{T}\hat{\mathbf{y}}&= \mathbf{U}^{T}\mathbf{Hy}=\begin{bmatrix} U^{T}_{p} \\ \mathbf{0} \end{bmatrix}\mathbf{y},\\[0.4em]
> \mathbf{U}^{T} \mathbf{e} &= \mathbf{U}^{T} (I-\mathbf{H})\mathbf{y}\\
> &= \left(\mathbf{U}^{T}-\begin{bmatrix}
> U^{T}_{p} \\ \mathbf{0}
> \end{bmatrix}\right)\mathbf{y} \\
> &= \begin{bmatrix}
> \mathbf{0} \\ U_{n-p}^{T}
> \end{bmatrix}\mathbf{y},
> \end{align*}$ so the two are independent, being functions of disjoint entries of $\mathbf{U}^{T}\mathbf{y}$, whose entries are independent Gaussians ($\mathrm{Cov}(\mathbf{U}^{T}\mathbf{y})=\sigma^{2}\mathbf{U}^{T}\mathbf{U}=\sigma^{2}I$). Furthermore, because $\mathbf{U}^{T}$ is invertible, $\hat{\mathbf{y}}$ and $\mathbf{e}$ themselves are also independent.
> >
> > Furthermore, $\mathbf{U}_{n-p}^{T}\mathbf{e}=\mathbf{U}_{n-p}^{T}\mathbf{y}$ has distribution $N(\mathbf{U}_{n-p}^{T}\mathbf{X}\beta,\sigma^{2}I_{n-p})=N(\mathbf{0}, \sigma^{2}I_{n-p})$, as the columns of $\mathbf{U}_{n-p}$ are orthogonal to the column space of $\mathbf{X}$. Therefore $\| \mathbf{e} \| ^{2}=\| \mathbf{U}^{T}\mathbf{e} \|^{2}=\| \mathbf{U}_{n-p}^{T}\mathbf{y} \|^{2}\sim \sigma^{2}\chi^{2}_{n-p}.$
>
^038a42
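The theorem is easy to probe by simulation. A hedged `R` sketch (an arbitrary one-predictor model with made-up values): the scaled RSS should line up with the $\chi^{2}_{n-p}$ quantiles, and the RSS should be uncorrelated with the fitted slope, consistent with independence.

```r
# Check that RSS / sigma^2 ~ chi^2_{n-p}, and that RSS is empirically
# uncorrelated with the fitted coefficient.
set.seed(2)
n <- 50; sigma <- 1.5
x <- runif(n)

sims <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(rss = sum(resid(fit)^2), slope = unname(coef(fit)["x"]))
})

# QQ-plot of RSS / sigma^2 against chi^2_{n-p} quantiles (p = 2 here)
qqplot(qchisq(ppoints(2000), df = n - 2), sims["rss", ] / sigma^2)
abline(0, 1)

cor(sims["rss", ], sims["slope", ])   # should be close to 0
```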
### Optimality of Least Squares
> [!bigidea]
> **Gauss-Markov theorem** guarantees that the OLS estimators are "the best" among all estimators of $\beta$ that are linear in $\mathbf{y}$ (under the Gauss-Markov model).
>
> However, biased or non-linear estimators might have lower [[Statistical Inference#^cf7800|MSE]].
As far as unbiased linear estimators are concerned, the OLS estimators are the best for the linear model.
> [!theorem|*] Gauss-Markov
>
> Under the Gauss-Markov model, the OLS coefficients $\hat{\beta}$ are the **BLUE (best linear unbiased estimator)** of $\beta$.
>
> For vector-valued $\beta$, this means that any other linear unbiased estimator $\tilde{\beta}=B\mathbf{y}$ satisfies $\mathrm{Cov}(\hat{\beta}) \preceq \mathrm{Cov}(\tilde{\beta}),$ where $\preceq$ is the Loewner order on positive semi-definite matrices.
In general, the MSE can be decomposed as
$\begin{align*}
\mathrm{MSE}(\hat{\beta})&= \mathbb{E}[(\beta-\hat{\beta})^{2}]\\
&= \underbrace{\mathrm{Var}(\hat{\beta})}_{\text{variance}} + (\underbrace{\mathbb{E}[\beta-\hat{\beta}]}_{\text{bias}})^{2}
\end{align*}$
which represents the [[Bias-Variance Tradeoff]].
Hence, if a biased estimator $\tilde{\beta}$ can sacrifice unbiasedness for a large reduction in variance, it can have a better overall $\text{MSE}$ than the unbiased OLS estimator.
Examples include [[Linear Regression Methods#Shrinkage Methods|shrinkage methods]] (e.g. ridge regression), variable selection (e.g. forward selection), and dimension reduction methods (e.g. [[Principal Component Analysis|PCA]]).
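To see the tradeoff concretely, here is a hedged `R` sketch (a made-up, deliberately ill-conditioned design; the ridge penalty $\lambda$ is picked by hand rather than tuned) comparing the average squared estimation error of OLS against a manually computed ridge estimator $(\mathbf{X}^{T}\mathbf{X}+\lambda I)^{-1}\mathbf{X}^{T}\mathbf{y}$:

```r
# With strongly correlated predictors, the biased ridge estimator can have a
# lower mean squared error than unbiased OLS.
set.seed(3)
n <- 40; sigma <- 3; lambda <- 5
beta <- c(2, 0, 1)

errs <- replicate(2000, {
  z <- rnorm(n)
  X <- cbind(z + rnorm(n, sd = 0.1),   # two nearly collinear columns
             z + rnorm(n, sd = 0.1),
             rnorm(n))
  y <- X %*% beta + rnorm(n, sd = sigma)
  ols   <- solve(t(X) %*% X, t(X) %*% y)
  ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)
  c(ols = sum((ols - beta)^2), ridge = sum((ridge - beta)^2))
})
rowMeans(errs)   # average squared estimation error of each estimator
```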
## Testing Coefficient Significance
> [!tldr]
> Assuming additive Gaussian noise, the OLS RSS is $\chi^{2}_{n-p}$, so along with the normality of the coefficient estimates, we can construct:
> - **t-statistics** to test the significance of one of the coefficients;
> - **F-statistics** to test multiple coefficients in one go;
### T-Tests
Since $\mathrm{RSS}\sim \sigma^{2}\chi^{2}_{n-p}$, the unbiased estimator $s^{2}$ of $\sigma^{2}$ is
$s^{2}:=\frac{\mathrm{RSS}}{\mathbb{E}[\chi^{2}_{n-p}]}=\frac{\mathrm{RSS}}{n-p},$
and $s / \sigma \sim \sqrt{ \chi^{2}_{n-p} / (n-p)}$, independently of $\hat{\beta}$.
The above lets us turn the distribution of $\hat{\beta}$ into a usable pivot: the Gaussian distribution of $\hat{\beta}_{j}$ involves $\sigma$, giving the z-statistic $z:= \frac{\hat{\beta}_{j}-\beta_{j}}{\sigma \sqrt{ (\mathbf{X}^{T}\mathbf{X})^{-1}_{jj} }} \sim N(0,1),$ which is not directly usable because $\sigma$ is unknown. However, dividing it by the independent $s / \sigma$ (so $\sigma$ is replaced by $s$) gives the t-statistic:
> [!definition|*] The t-Statistic
> The **t-statistic** for testing $H_{0}:\beta_{j}=b$ (for example, $b=0$ for testing the absence of an effect) is $t:= \frac{\hat{\beta}_{j}-b}{s\sqrt{ (X^{T}X)^{-1}_{jj} }}.$ Under the null hypothesis, it follows the [[T Distribution|$t_{n-p}$ distribution]].
This leads to the following definition:
> [!definition|*] Standard Error of OLS Coefficients
> The **standard error** of a coefficient $\hat{\beta}_{j}$ is $\mathrm{se}(\hat{\beta}_{j}):=s\sqrt{ (\mathbf{X}^{T}\mathbf{X})^{-1}_{jj} }.$This is the value reported in the `summary` call on an OLS object in `R`.
We can then construct tests with standard tools like the p-value and CIs with quantiles of the t-distribution.
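For instance, the table printed by `summary` can be reproduced from these formulas directly. A sketch using the built-in `mtcars` data (the particular formula `mpg ~ wt + hp` is just an example):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

X    <- model.matrix(fit)
s2   <- sum(resid(fit)^2) / df.residual(fit)    # s^2 = RSS / (n - p)
se   <- sqrt(s2 * diag(solve(t(X) %*% X)))      # se(beta_j)
tval <- coef(fit) / se                          # t-statistics for H0: beta_j = 0
pval <- 2 * pt(-abs(tval), df = df.residual(fit))

cbind(se, tval, pval)       # should match columns 2-4 of coef(summary(fit))
coef(summary(fit))
```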
Another use of the t-statistic is to *check whether the effects of two different classes $c_{1},c_{2}$ of a categorical variable $X_{j}$ differ significantly.*
Suppose $c_{1},c_{2}$ get corresponding coefficients estimated as $\hat{\beta}_{i},\hat{\beta}_{j}$. Reading off the estimated covariance matrix of $\hat{\beta}$ gives $\widehat{\mathrm{Cov}}(\hat{\beta}_{i},\hat{\beta}_{j})=s^{2}(X^{T}X)_{ij}^{-1}$, so $\begin{align*}
\widehat{\mathrm{Var}}(\hat{\beta}_{i}-\hat{\beta}_{j})&= \widehat{\mathrm{Var}}(\hat{\beta}_{i})+\widehat{\mathrm{Var}}(\hat{\beta}_{j})-2\widehat{\mathrm{Cov}}(\hat{\beta}_{i},\hat{\beta}_{j})\\
&= s^{2}\left[ (X^{T}X)^{-1}_{ii}+(X^{T}X)^{-1}_{jj} - 2(X^{T}X)^{-1}_{ij} \right].
\end{align*}$
> [!theorem|*] T-Statistic of Difference in Effects
> Under the null hypothesis $H_{0}:\beta_{i}=\beta_{j}$, the difference $\hat{\beta}_{i}-\hat{\beta}_{j} \sim N(0, \mathrm{Var}(\hat{\beta}_{i}-\hat{\beta}_{j}))$ and is independent of $s^{2}$; therefore its t-statistic $t:= \frac{\hat{\beta}_{i}-\hat{\beta}_{j}}{s\sqrt{ (X^{T}X)^{-1}_{ii}+(X^{T}X)^{-1}_{jj}-2(X^{T}X)^{-1}_{ij} }}$ follows $t_{n-p}$ under $H_{0}$.
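In `R`, the estimated covariance matrix $s^{2}(X^{T}X)^{-1}$ is available via `vcov`, so this test can be run without forming $X$ by hand. A sketch on the built-in `chickwts` data (the coefficient names below assume the default treatment contrasts):

```r
fit <- lm(weight ~ feed, data = chickwts)
V   <- vcov(fit)                         # estimated covariance s^2 (X'X)^{-1}

i <- "feedlinseed"; j <- "feedsoybean"   # two levels of the factor `feed`
diff <- coef(fit)[i] - coef(fit)[j]
se   <- sqrt(V[i, i] + V[j, j] - 2 * V[i, j])
tval <- diff / se
2 * pt(-abs(tval), df = df.residual(fit))   # p-value for H0: beta_i = beta_j
```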
### The F-Statistic
Suppose we wish to test for the exclusion of $k$ coefficients, WLOG ${\beta}_{p-k+1},\dots,{\beta}_{p}$: $\begin{align*}
H_{0}&: {\beta}_{p-k+1}=\cdots={\beta}_{p}=0;\\
H_{1}&: \lnot~ H_{0}.
\end{align*}$
The t-statistic developed in the previous section fails to generalize to testing multiple coefficients in one go.
> [!definition|*] F-Distribution
> The **F-distribution** $F_{m,n}$ is the distribution of the ratio $\frac{\chi^{2}_{m} / m}{\chi^{2}_{n} / n},$ where the two $\chi^{2}$ variables are independent.
This is very handy: *we can compare the $\chi^{2}$-distributed RSS (or quantities on that scale) of two models* via their ratio, which is $\sim F$ when the two are independent.
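A quick simulation check of the definition (in `R`, with arbitrarily chosen degrees of freedom):

```r
# A ratio of independent chi-squared variables, each divided by its degrees
# of freedom, should follow the F distribution.
set.seed(4)
m <- 3; n <- 20
f <- (rchisq(1e5, df = m) / m) / (rchisq(1e5, df = n) / n)
ks.test(f, pf, df1 = m, df2 = n)   # a large p-value is expected, since the null holds
```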
> [!connection] F-statistic and Likelihood Ratio
> Assuming Gaussian noises, the F-statistic is on the same scale as the [[Likelihood Ratios#The Likelihood Ratio Statistic|log-likelihood ratio]] (up to proportionality).
> - For example, the likelihood ratio statistic of two nested Gaussian models (indexed $0,1$) is a monotone function of $\mathrm{RSS}_{0} / \mathrm{RSS}_{1}$: with $\sigma^{2}$ profiled out, the log-likelihood ratio is $\frac{n}{2}\log(\mathrm{RSS}_{0}/\mathrm{RSS}_{1})$.
>
> However, likelihood ratio tests based on [[Likelihood Ratios#^cd7692|Wilks' theorem]] are only asymptotically correct, whereas the F-test below is exact under the Gaussian model.
To compare two models $\mathcal{M}_{0,1}$ with numbers of coefficients $p_{0} < p_{1}$, we might hope to directly compute $\frac{\mathrm{RSS}_{0}/(n-p_{0})}{\mathrm{RSS}_{1}/(n-p_{1})} \overset{?}{\sim}F_{n-p_{0},\,n-p_{1}},$ but this does not hold in general, especially when $\mathcal{M}_{0,1}$ are nested, because *there is no guarantee that the two RSS are independent $\chi^{2}$ variables*.
Luckily, if $\mathcal{M}_{0,1}$ are nested, the RSS have "orthogonal increments", i.e. $\mathrm{RSS}_{1}$ and $\mathrm{RSS}_{0}-\mathrm{RSS}_{1}$ are independent, assuming $H_{0}$ is true:
> [!theorem|*] Independence of RSS increments
> Assuming $H_{0}: {\beta}_{p-k+1}=\cdots={\beta}_{p}=0$ holds, $\mathrm{RSS}_{1} \sim \sigma^{2}\chi^{2}_{n-p}$ (from the full model $\mathcal{M}_{1}$) and $(\mathrm{RSS}_{0}-\mathrm{RSS}_{1})\sim \sigma^{2}\chi^{2}_{k}$ (the increment over the reduced model $\mathcal{M}_{0}$) are independent.
>
> Therefore, $F:= \frac{(\mathrm{RSS}_{0}-\mathrm{RSS}_{1})/k}{\mathrm{RSS}_{1}/(n-p)}\sim F_{k, n-p}.$
>
> > [!proof]-
> > Consider the truncated design matrix $\tilde{\mathbf{X}} \in \mathbb{R}^{n \times (p-k)}$, obtained by dropping the last $k$ columns of $\mathbf{X}$ (those corresponding to $\beta_{p-k+1},\dots,\beta_{p}$).
> >
> > Now, following the proof of [[#^038a42]], find an orthonormal basis $\{ u_{1},\dots ,u_{p-k} \}$ of the column space of $\tilde{\mathbf{X}}$, collected into a matrix $\tilde{\mathbf{U}}$. Extend it to $\{ u_{1},\dots,u_{p} \}$ spanning the column space of $\mathbf{X}$, and lastly to $\{ u_{1},\dots, u_{n} \}$ spanning $\mathbb{R}^{n}$, giving $\mathbf{U}$.
> >
> > Writing $\begin{align*}
> \mathbf{z}:= \mathbf{U}^{T}\mathbf{y} &\sim N(\mathbf{U}^{T}\mathbf{X}\beta, \sigma^{2}I),\\
> \mathbf{U}^{T}\mathbf{X}\beta&= (\ast,\dots,\ast,\underbrace{0,\dots,0}_{n-p\ \text{zeros}})^{T}\\
> &= (\ast,\dots,\ast,\underbrace{0,\dots,0}_{n-p+k\ \text{zeros}})^{T},
> \end{align*}$where the entries beyond the $p$-th vanish because the columns of $\mathbf{X}$ lie in the span of $u_{1},\dots,u_{p}$, and the last equality follows from the $H_{0}$ assumption that the last $k$ entries of $\beta$ are $0$ (so $\mathbf{X}\beta=\tilde{\mathbf{X}}\tilde{\beta}$ lies in the span of $u_{1},\dots,u_{p-k}$). Therefore, the entries of $\mathbf{z}$ are mutually independent (a diagonal covariance matrix is sufficient for independence between Gaussians), with $z_{i}\sim N(0,\sigma^{2})$ for $i>p-k$ under $H_{0}$.
> >
> > The same argument as before shows that the residuals $\mathbf{e}_{p}$ of the full model and $\mathbf{e}_{p-k}$ of the reduced model satisfy $\begin{align*}
> \mathbf{U}^{T}\mathbf{e}_{p}&= \begin{bmatrix}
> \mathbf{0}_{p \times n} \\ \mathbf{U}_{n-p}^{T}
> \end{bmatrix}\mathbf{y}=(0,\dots,0,z_{p+1},\dots,z_{n})^{T},\\
> \mathbf{U}^{T}\mathbf{e}_{p-k}&= \begin{bmatrix}
> \mathbf{0}_{(p-k) \times n} \\ \mathbf{U}_{n-p+k}^{T}
> \end{bmatrix}\mathbf{y}=(0,\dots,0,z_{p-k+1},\dots,z_{n})^{T},
> \end{align*}$so that $\mathrm{RSS}_{1}=\| \mathbf{e}_{p} \|^{2}=\sum_{i>p}z_{i}^{2}$ and $\mathrm{RSS}_{0}-\mathrm{RSS}_{1}=\| \mathbf{e}_{p-k} \|^{2}-\| \mathbf{e}_{p} \|^{2}=\sum_{i=p-k+1}^{p}z_{i}^{2}$. These depend on disjoint sets of the independent $z_{i}$, so they are independent; and since each $z_{i}$ involved is $N(0,\sigma^{2})$ under $H_{0}$, we get $\mathrm{RSS}_{1}\sim \sigma^{2}\chi^{2}_{n-p}$ and $\mathrm{RSS}_{0}-\mathrm{RSS}_{1}\sim \sigma^{2}\chi^{2}_{k}$.
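In practice, the whole nested-model test is a single `anova` call on two fitted models, and the F-statistic can also be assembled by hand from the RSS increment as a check. A sketch on the built-in `mtcars` data (the particular pair of nested models is arbitrary):

```r
fit0 <- lm(mpg ~ wt, data = mtcars)              # reduced model M0
fit1 <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # full model M1, k = 2 extra terms

anova(fit0, fit1)                                # nested-model F-test

# The same F-statistic from the RSS increment:
rss0 <- sum(resid(fit0)^2); rss1 <- sum(resid(fit1)^2)
k <- 2; df1 <- df.residual(fit1)                 # n - p of the full model
F_stat <- ((rss0 - rss1) / k) / (rss1 / df1)
pf(F_stat, k, df1, lower.tail = FALSE)
```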