> [!info] This material corresponds to chapter 7 in the ESL.
> There are three main styles of model assessment:
> - Statistical inference, using the distribution of the model coefficients to conduct tests and compute confidence intervals.
> - Simple metrics, like the AIC and Mallows' $C_{p}$.
> - Computationally intensive methods, like estimating modeling errors using [[Bootstraps|bootstrapping]] or [[Cross Validation|cross validation]].
>
> This note focuses on the second style. [[Pearson's R2|Pearson's $R^2$]] (and adjusted $R^{2}$) is a simple metric for the same purpose, but does not fit into the present discussion. See the linked note for its dedicated page.

### Generalizability of Models

The same concepts can be framed in the language of [[Decision Theory]]; here we take a more hands-on approach.

Given a training set $\mathcal{T}=(\mathbf{X},\mathbf{y})$, the **test error** of a model $\hat{f}(X)$ is
$$\mathrm{Err}_{\mathcal{T}}=\mathbb{E}_{X,Y}[L(Y,\hat{f}(X))\,|\,\mathcal{T}],$$
that is, the expected loss on a randomly drawn new sample. ^220a49

More generally, the **expected test error** is the test error averaged over all possible training sets:
$$\mathrm{Err}=\mathbb{E}_{X,Y,\mathcal{T}}[L(Y, \hat{f}(X))].$$
This value is easier to analyze, but is less relevant to the problem at hand than the test error itself.

In contrast, the **training error** is the average loss over the data points in the training set:
$$\overline{\mathrm{err}}=\frac{1}{|\mathcal{T}|}\sum_{(X_{i},Y_{i}) \in \mathcal{T}}L(Y_{i},\hat{f}(X_{i})),$$
and many supervised methods aim to minimize this error.
- However, a low training error does not guarantee a low test error: a very flexible model will **overfit** the training set and generalize poorly to new observations.

### In-Sample Prediction Errors

The **in-sample error** of a model $\hat{f}(X)$ is the average expected loss when new observations $Y_{i}^\text{new}$ are drawn at the same predictor values $X_{i} \in \mathcal{T}$:
$$\mathrm{Err}^{\mathrm{in}}:=\frac{1}{|\mathcal{T}|}\sum_{X_{i} \in \mathcal{T}}\mathbb{E}_{Y_{i}^{\text{new}}}[L(Y_{i}^{\text{new}},\hat{f}(X_{i}))\,|\,\mathcal{T}].$$

The **optimism** of the model $\hat{f}(X)$ is then the difference between its in-sample error and its training error:
$$\mathrm{op}_{\mathcal{T}}:=\mathrm{Err}^{\mathrm{in}}-\overline{\mathrm{err}},$$
which measures how much the training error overestimates the model's in-sample generalizability. The **average optimism** is its expectation,
$$\omega :=\mathbb{E}_{\mathbf{y}}[\mathrm{op}_{(\mathbf{X},\mathbf{y})}],$$
where the expectation is over all sample responses while holding the predictor values constant.
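To make these definitions concrete, here is a minimal Monte Carlo sketch (assuming NumPy; the design matrix, sample size, and noise level are arbitrary illustrative choices) that estimates the training error, the in-sample error, and hence the average optimism of an OLS fit by repeatedly redrawing the responses while holding $\mathbf{X}$ fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 1.0            # sample size, predictors, noise sd
X = rng.normal(size=(N, d))         # fixed design: the X_i never change
beta = rng.normal(size=d)
f = X @ beta                        # true regression function at the X_i

def ols_predictions(y):
    """Fit OLS on (X, y) and return predictions at the training inputs."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

train_errs, in_errs = [], []
for _ in range(5000):               # expectation over y, with X held fixed
    y = f + sigma * rng.normal(size=N)
    y_hat = ols_predictions(y)
    train_errs.append(np.mean((y - y_hat) ** 2))
    # in-sample error: fresh responses Y_i^new at the very same X_i
    y_new = f + sigma * rng.normal(size=N)
    in_errs.append(np.mean((y_new - y_hat) ** 2))

omega_hat = np.mean(in_errs) - np.mean(train_errs)
print(f"estimated average optimism: {omega_hat:.3f}")
```

For this fit the estimate lands near $2d\sigma^{2}_{\epsilon}/|\mathcal{T}|=0.2$, which is exactly what the covariance formula derived next predicts.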
Optimism is closely linked to the **flexibility** of the model, as measured by its [[Degree of Freedom|degrees of freedom]] $\mathrm{df}$.
- For $l_{2}$ loss, the average optimism is determined by $\mathrm{Cov}(y_{i},\hat{y}_{i})$, i.e. the impact each $y_{i}$ has on its own prediction: $\omega =\frac{2}{|\mathcal{T}|}\sum_{y_{i} \in \mathcal{T}}\mathrm{Cov}(y_{i},\hat{y}_{i})$. Therefore, it is equivalent to the [[Degree of Freedom|degrees of freedom]] of the model, up to the factor $2\sigma^{2}_{\epsilon}/|\mathcal{T}|$.

> [!proof]-
> By linearity it suffices to prove the result for each $i$, as the average optimism is just the average of the per-sample optimisms:
> $$\omega_{i}:=\mathbb{E}_{\mathbf{y}}[\mathrm{Err}^\mathrm{in}_{i}- \mathrm{err}_{i}]\overset{?}{=}2\,\mathrm{Cov}(y_{i}, \hat{y}_{i}).$$
> Let $\mathbb{E}$ be over both $Y_{i}^\mathrm{new}$ and $\mathbf{y}$, so $\mathbb{E}_{\mathbf{y}}\,\mathrm{Err}_{i}^\mathrm{in}=\mathbb{E}[ (Y^{\mathrm{new}}_{i} - \hat{y}_{i})^{2}]$ and $\mathbb{E}_{\mathbf{y}}\,\mathrm{err}_{i}=\mathbb{E}[( y_{i}-\hat{y}_{i} )^{2}]$. Hence
> $$\begin{align*}
> \omega_{i}&= \mathbb{E}[ (Y^{\mathrm{new}}_{i} - \hat{y}_{i})^{2} - ( y_{i}-\hat{y}_{i} )^{2}]\\
> &= \mathbb{E}[ \cancel{{Y^{\mathrm{new}}_{i}}^{2}}-2\hat{y}_{i}Y^{\mathrm{new}}_{i} -\cancel{y_{i}^{2}}+2y_{i}\hat{y}_{i}]& [Y_{i}^{\mathrm{new}},y_{i}\text{ are i.i.d.}]\\
> &= 2\mathbb{E}[y_{i}\hat{y}_{i}]-2\mathbb{E}[\hat{y}_{i}] \cdot \mathbb{E}[Y^{\mathrm{new}}_{i}]& [Y_{i}^{\mathrm{new}}\perp \hat{y}_{i},\ \mathbb{E}[Y_{i}^{\mathrm{new}}]=\mathbb{E}[y_{i}]]\\
> &= 2\,\mathrm{Cov}(y_{i},\hat{y}_{i}).
> \end{align*}$$

- In linear models $\mathbf{y} \mapsto \mathbf{H}\mathbf{y}=: \hat{\mathbf{y}}$, this further simplifies to $\omega=\frac{2\,\mathrm{tr}(\mathbf{H})}{|\mathcal{T}|}\sigma^{2}_{\epsilon}$, where $\sigma^{2}_{\epsilon}$ is the variance of the additive error in $Y=f(X)+\epsilon$.

### Estimating In-Sample Errors

*A low in-sample prediction error is an indicator of the generalizability of a model*, so an estimate of it gives a metric for model selection:
$$\widehat{\mathrm{Err}^{\mathrm{in}}}=\overline{\mathrm{err}}+\hat{\omega}.$$
Since the training error is readily available, it remains to estimate the average optimism $\omega$ of different models.

**Mallows' $C_{p}$** is a direct adaptation of the optimism formula:
$$C_{p}:= \overline{\mathrm{err}}+\frac{2d}{|\mathcal{T}|}\hat{\sigma}^{2}_{\epsilon},$$
where $d$ is the number of predictors in the model, and $\hat{\sigma}_{\epsilon}^{2}$ can be estimated with the mean squared error of a low-bias model.
- It comes from ordinary least squares with $l_{2}$ loss: the OLS hat matrix is $\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$, whose trace is $d$, the number of predictors.
- This metric discourages complex models with a penalty linear in the number of predictors.

**Akaike's information criterion (AIC)** of a probabilistic model is
$$\mathrm{AIC}:=2d- 2l(\mathbf{y};\hat{\theta}),$$
where $l(\mathbf{y};\theta)$ is the log-likelihood of the model and $\hat{\theta}$ is the MLE that maximizes $l$.
- The AIC approximates the log-likelihood loss ([[Deviance|deviance]]) through the asymptotic identity $\frac{1}{N}\mathbb{E}[\mathrm{AIC}]\approx-2\, \mathbb{E}[\log p(Y;\hat{\theta})]$, where $p(Y;\theta)$ is the density of $Y$ under parameter $\theta$.
- For a Gaussian OLS model, $-2l(\mathbf{y};\hat{\theta})$ reduces to $N\log \frac{\mathrm{RSS}}{N} +\mathrm{const.}$, where the $\log \frac{\mathrm{RSS}}{N}$ term comes from estimating $\sigma^{2}$ with its MLE $\frac{\mathrm{RSS}}{N}$. Therefore, its AIC is (up to an additive constant) $\mathrm{AIC}=2d + N\log \frac{\mathrm{RSS}}{N}.$

The **Bayesian information criterion (BIC)** of a model $\mathcal{M}$ is
$$\mathrm{BIC}:= d\log N-2l(\mathbf{y};\hat{\theta}),$$
which appears in the Laplace approximation $\log\mathbb{P}[\mathbf{X},\mathbf{y}\,|\,\mathcal{M}]\approx-\frac{1}{2}\mathrm{BIC}+\mathrm{const.}$, so a small BIC corresponds to a high marginal likelihood.
- Unlike the AIC, the BIC is asymptotically consistent: as $N \to \infty$, it selects the true model with probability tending to $1$. It does, however, penalize large models more severely.
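As a concrete illustration, here is a short sketch (assuming NumPy; the data-generating setup and the helper `ols_selection_scores` are hypothetical) that scores a sequence of nested OLS models with all three criteria, using the Gaussian reductions above and estimating $\hat{\sigma}^{2}_{\epsilon}$ from the largest, low-bias model:

```python
import numpy as np

def ols_selection_scores(X, y, sigma2_hat):
    """Mallows' C_p, AIC, and BIC of a Gaussian OLS model,
    using the reductions above (additive constants dropped)."""
    N, d = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    cp  = rss / N + (2 * d / N) * sigma2_hat  # training error + estimated optimism
    aic = 2 * d + N * np.log(rss / N)         # 2d - 2*loglik, up to constants
    bic = d * np.log(N) + N * np.log(rss / N)
    return cp, aic, bic

rng = np.random.default_rng(1)
N, p = 200, 10
X_full = rng.normal(size=(N, p))
y = X_full[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)  # 3 true predictors

# sigma^2 estimated from the residuals of the largest (low-bias) model
coef, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sigma2_hat = np.sum((y - X_full @ coef) ** 2) / (N - p)

for d in (1, 3, 5, 10):             # nested candidate models
    print(d, ols_selection_scores(X_full[:, :d], y, sigma2_hat))
```

With the signal living on the first three predictors, all three criteria are typically minimized at $d=3$; the BIC's $\log N$ penalty ($\log 200 \approx 5.3 > 2$) makes it the most reluctant to grow the model.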
For penalized or otherwise complex models, we cannot simply read off the number of parameters:
- For penalized models like [[Linear Regression Methods#Shrinkage Methods|ridge regression and the lasso]], having $p$ predictors does not mean that all of them are active or free to take arbitrary coefficients.
- For complex models like neural networks, the number of inputs fails to capture the complexity of the network.

Therefore, in those cases we resort to the [[Degree of Freedom|effective number of parameters]] (for restricted models).
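For ridge regression specifically, the effective number of parameters has the closed form $\mathrm{df}(\lambda)=\mathrm{tr}\big(\mathbf{X}(\mathbf{X}^{T}\mathbf{X}+\lambda \mathbf{I})^{-1}\mathbf{X}^{T}\big)$, the trace of its linear smoother, in direct analogy with $\mathrm{tr}(\mathbf{H})$ above. A minimal sketch (assuming NumPy):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective degrees of freedom of ridge regression: the trace of
    the smoother H_lam = X (X^T X + lam*I)^{-1} X^T mapping y to y_hat."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(H)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_effective_df(X, lam), 2))  # shrinks from 10 toward 0
```

As $\lambda$ grows, $\mathrm{df}(\lambda)$ shrinks monotonically from $p$ toward $0$, quantifying how strongly the penalty restricts the model even though all $p$ coefficients remain nominally present.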