> [!info] This material corresponds to chapter 7 of ESL.
> There are three main styles of model assessment:
> - Statistical inference, using the distribution of the model coefficients to conduct tests and compute confidence intervals.
> - Simple metrics, like the AIC and Mallows's $C_{p}$.
> - Computationally intensive methods, like estimating modeling errors using [[Bootstraps|bootstrapping]] or [[Cross Validation|cross validation]].
>
> This note will focus on the second style.
[[Pearson's R2|Pearson's $R^2$]] (and adjusted $R^{2}$) is a simple metric of this kind, but it does not fit into the present discussion; see the linked note for its dedicated page.
### Generalizability of Models
The same concepts can be framed in the language of [[Decision Theory]]; here we take a more hands-on approach.
Given a training set $\mathcal{T}=(\mathbf{X},\mathbf{y})$, the **test error** of a model $\hat{f}(X)$ is $$\mathrm{Err}_{\mathcal{T}}=\mathbb{E}_{X,Y}[L(Y,\hat{f}(X))\,|\,\mathcal{T}],$$ that is, the expected loss on a randomly drawn new sample. ^220a49
More generally, the **expected test error** is the test error averaged over all possible training sets: $$\mathrm{Err}=\mathbb{E}_{X,Y,\mathcal{T}}[L(Y, \hat{f}(X))].$$ This value is easier to analyze, but is less relevant to the problem at hand than the test error itself.
In contrast, the **training error** is the average loss at the data points in the training set: $$\overline{\mathrm{err}}=\frac{1}{|\mathcal{T}|}\sum_{(X_{i},Y_{i}) \in \mathcal{T}}L(Y_{i},\hat{f}(X_{i})),$$ and many supervised methods aim to minimize this error.
- However, a low training error does not guarantee a low test error: a very flexible model will **overfit** the training set and generalize poorly to new observations, as the sketch below illustrates.
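A minimal simulation sketch of this gap (the setup and all constants here are illustrative, not from ESL): polynomial fits of growing degree drive the training error down monotonically, while a Monte Carlo estimate of $\mathrm{Err}_{\mathcal{T}}$ on a large fresh sample eventually rises.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

# Illustrative setup: small training set, large test set for Monte Carlo Err_T.
n_train, n_test, sigma = 30, 10_000, 0.3
x_tr = rng.uniform(0, 1, n_train)
y_tr = true_f(x_tr) + sigma * rng.normal(size=n_train)
x_te = rng.uniform(0, 1, n_test)
y_te = true_f(x_te) + sigma * rng.normal(size=n_test)

for degree in (1, 3, 9, 12):
    coef = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    err_train = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)  # training error
    err_test = np.mean((y_te - np.polyval(coef, x_te)) ** 2)   # ~ Err_T
    print(f"degree {degree:2d}: train {err_train:.3f}  test {err_test:.3f}")
```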
### In-Sample Prediction Errors
The **in-sample error** of a model $\hat{f}(X)$ is the average expected loss when new observations $Y_{i}^\text{new}$ are drawn at the same predictor values $X_{i} \in \mathcal{T}$: $$\mathrm{Err}^{\mathrm{in}}:=\frac{1}{|\mathcal{T}|}\sum_{X_{i} \in \mathcal{T}}\mathbb{E}_{Y_{i}^{\text{new}}}[L(Y_{i}^{\text{new}},\,\hat{f}(X_{i}))\,|\,\mathcal{T}].$$
The **optimism** of the model $\hat{f}(X)$ is then the difference between its in-sample error and its training error: $$\mathrm{op}_{\mathcal{T}}:=\mathrm{Err}^{\mathrm{in}}-\overline{\mathrm{err}},$$ which measures how much the training error overstates the model's in-sample generalizability. The **average optimism** is its expectation: $$\omega :=\mathbb{E}_{\mathbf{y}}[\mathrm{op}_{(\mathbf{X},\mathbf{y})}],$$ where the expectation is over all sample responses while holding the predictor values constant.
Optimism is closely linked to the **flexibility** of the model, as measured by its [[Degree of Freedom|degrees of freedom]] $\mathrm{df}$.
- For $l_{2}$ loss, the average optimism is determined by $\mathrm{Cov}(y_{i},\hat{y}_{i})$, i.e. the impact each $y_{i}$ has on its own prediction: $$\omega =\frac{2}{|\mathcal{T}|}\sum_{y_{i} \in \mathcal{T}}\mathrm{Cov}(y_{i},\hat{y}_{i}).$$ Since this sum (scaled by $1/\sigma^{2}_{\epsilon}$) defines the [[Degree of Freedom|degrees of freedom]] of the model, the average optimism is proportional to $\mathrm{df}$.
> [!proof]-
> By linearity it suffices to prove the result for each $i$, as the average optimism is the average of the per-sample optimisms:
> $\omega_{i}:=\mathbb{E}_{\mathbf{y}}[\mathrm{Err}^\mathrm{in}_{i}- \mathrm{err}_{i}]\overset{?}{=}2\,\mathrm{Cov}(y_{i}, \hat{y}_{i}).$
> Let $\mathbb{E}$ be over both $Y_{i}^{\mathrm{new}}$ and $\mathbf{y}$, so $\mathbb{E}[\mathrm{Err}_{i}^\mathrm{in}]=\mathbb{E}[ (Y^{\mathrm{new}}_{i} - \hat{y}_{i})^{2}]$ and $\mathbb{E}[\mathrm{err}_{i}]=\mathbb{E}[( y_{i}-\hat{y}_{i} )^{2}]$. Hence $$\begin{align*}
> \omega_{i}&= \mathbb{E}[ (Y^{\mathrm{new}}_{i} - \hat{y}_{i})^{2} - ( y_{i}-\hat{y}_{i} )^{2}]\\
> &= \mathbb{E}[ \cancel{{Y^{\mathrm{new}}_{i}}^{2}}-2\hat{y}_{i}Y^{\mathrm{new}}_{i} -\cancel{y_{i}^{2}}+2y_{i}\hat{y}_{i}]& [Y_{i}^{\mathrm{new}},y_{i}\text{ are i.i.d.}]\\
> &= 2\mathbb{E}[y_{i}\hat{y}_{i}]-2\mathbb{E}[\hat{y}_{i}] \cdot \mathbb{E}[y_{i}]& [Y_{i}^{\mathrm{new}}\text{ is independent of }\hat{y}_{i}]\\
> &= 2\mathrm{Cov}(\hat{y}_{i}, y_{i}).
> \end{align*}$$
- In linear models $\mathbf{y} \mapsto \mathbf{H}\mathbf{y}=: \hat{\mathbf{y}}$, we have $\sum_{i}\mathrm{Cov}(y_{i},\hat{y}_{i})=\mathrm{tr}(\mathrm{Cov}(\mathbf{y},\mathbf{H}\mathbf{y}))=\sigma^{2}_{\epsilon}\,\mathrm{tr}(\mathbf{H})$, so the average optimism further simplifies to $$\omega=\frac{2\,\mathrm{tr}(\mathbf{H})}{|\mathcal{T}|}\sigma^{2}_{\epsilon},$$ where $\sigma^{2}_{\epsilon}$ is the variance of the additive error in $Y=f(X)+\epsilon$.
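The identity above is easy to check numerically. A quick sketch (illustrative dimensions and seed, not from the source) that holds $\mathbf{X}$ fixed, redraws $\mathbf{y}$ many times, and compares the simulated optimism with $2\,\mathrm{tr}(\mathbf{H})\sigma^{2}_{\epsilon}/|\mathcal{T}|$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed-X setup: verify omega = 2 * tr(H) * sigma^2 / N for an OLS smoother
# by simulating many response vectors y at the same predictor values X.
N, d, sigma = 50, 5, 1.0
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix, tr(H) = d

n_rep = 20_000
gap = np.empty(n_rep)  # Err_in - err_bar for each response draw
for r in range(n_rep):
    y = X @ beta + sigma * rng.normal(size=N)
    y_new = X @ beta + sigma * rng.normal(size=N)  # fresh responses at same X
    y_hat = H @ y
    gap[r] = np.mean((y_new - y_hat) ** 2) - np.mean((y - y_hat) ** 2)

print("simulated omega:", gap.mean())  # should be close to the theory value
print("theory:         ", 2 * np.trace(H) * sigma**2 / N)
```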
### Estimating In-Sample Errors
*A low in-sample prediction error is an indicator of the generalizability of a model*, so an estimate of it gives a metric for model selection: $$\widehat{\mathrm{Err}^{\mathrm{in}}}=\overline{\mathrm{err}}+\hat{\omega}.$$ Since the training error is readily available, it remains to estimate the average optimism $\omega$ of each candidate model.
**Mallows's $C_{p}$** is a direct adaptation of the optimism formula: $$C_{p}:= \overline{\mathrm{err}}+\frac{2d}{|\mathcal{T}|}\hat{\sigma}^{2}_{\epsilon},$$ where $d$ is the number of predictors in the model, and $\hat{\sigma}_{\epsilon}^{2}$ can be estimated with the mean squared error of a low-bias model.
- It comes from ordinary least squares with $l_{2}$ loss: the OLS hat matrix is $\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$, whose trace is $d$, the number of predictors.
- This metric discourages complex models with a penalty linear in the number of predictors; see the sketch below.
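A minimal sketch of $C_{p}$ for subset selection (the data-generating setup is illustrative; $\hat{\sigma}^{2}_{\epsilon}$ is taken from the full model, the lowest-bias one available):

```python
import numpy as np

def mallows_cp(X, y, sigma2_hat):
    """Cp = err_bar + (2 d / N) * sigma2_hat for an OLS fit on X."""
    N, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    err_bar = np.mean((y - X @ beta) ** 2)
    return err_bar + 2 * d / N * sigma2_hat

rng = np.random.default_rng(2)
N = 100
X_full = rng.normal(size=(N, 10))
y = X_full[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)

# sigma^2 estimated from the residuals of the full (lowest-bias) model.
resid = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
sigma2_hat = resid @ resid / (N - 10)

for d in range(1, 11):
    print(f"d = {d:2d}  Cp = {mallows_cp(X_full[:, :d], y, sigma2_hat):.4f}")
```

With three truly active predictors, $C_{p}$ typically bottoms out around $d=3$: the training error keeps falling, but the $2d\hat{\sigma}^{2}_{\epsilon}/|\mathcal{T}|$ penalty outweighs the gains.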
**Akaike's information criterion (AIC)** of a probabilistic model is $$\mathrm{AIC}:=2d- 2l(\mathbf{y};\hat{\theta}),$$ where $l(\mathbf{y};\theta)$ is the log-likelihood of the model and $\hat{\theta}$ is the MLE that maximizes $l$.
- AIC approximates the log-likelihood loss ([[Deviance|deviance]]) through the asymptotic identity $\mathbb{E}[\mathrm{AIC}]\approx-2N\,\mathbb{E}[\log p(Y;\hat{\theta})]$, where $p(Y;\theta)$ is the density of $Y$ under parameter $\theta$ and $N=|\mathcal{T}|$.
- For a Gaussian OLS model with $\sigma^{2}$ estimated by its MLE $\frac{\mathrm{RSS}}{N}$, the maximized log-likelihood is $-\frac{N}{2}\log\left( \frac{\mathrm{RSS}}{N} \right)+\mathrm{const.}$ Therefore, its AIC is (up to an additive constant) $\mathrm{AIC}=N\log \frac{\mathrm{RSS}}{N}+2d$, as sketched below.
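In code, the Gaussian-OLS form is a one-liner (a sketch; the additive constant is dropped since only differences of AIC between models matter):

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of a Gaussian OLS model, up to an additive constant:
    N * log(RSS / N) + 2 * d, with sigma^2 profiled out by its MLE RSS / N."""
    N, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return N * np.log(rss / N) + 2 * d
```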
The **Bayesian information criterion (BIC)** of a model $\mathcal{M}$ is $$\mathrm{BIC}:= d\log N-2l(\mathbf{y};\hat{\theta}),$$ which appears in the Laplace approximation of the model evidence, $$\log\mathbb{P}[\mathbf{X},\mathbf{y}\,|\,\mathcal{M}]\approx-\frac{1}{2}\mathrm{BIC}+\mathrm{const.},$$ so a small BIC corresponds to a high marginal likelihood.
- Unlike the AIC, the BIC is asymptotically consistent (as $N\to \infty$ it selects the true model with probability tending to one), but it punishes large models more severely; see the comparison below.
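A side-by-side sketch (same illustrative subset-selection setup as above; the exact picks depend on the random draw) showing that swapping AIC's penalty $2$ for BIC's $\log N$ pushes the selection toward sparser models:

```python
import numpy as np

def gaussian_ic(X, y, penalty):
    # N*log(RSS/N) + penalty*d; penalty=2 gives AIC, penalty=log(N) gives BIC.
    N, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return N * np.log(rss / N) + penalty * d

rng = np.random.default_rng(3)
N = 500
X = rng.normal(size=(N, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)  # 3 active predictors

aic = [gaussian_ic(X[:, :d], y, 2) for d in range(1, 11)]
bic = [gaussian_ic(X[:, :d], y, np.log(N)) for d in range(1, 11)]
print("AIC picks d =", 1 + int(np.argmin(aic)))
print("BIC picks d =", 1 + int(np.argmin(bic)))
```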
For penalized or more complex models, we cannot simply read off the number of parameters:
- For penalized models like [[Linear Regression Methods#Shrinkage Methods|ridge regression and the lasso]], having $p$ predictors does not mean that all of them are active or that their coefficients are unconstrained.
- For complex models like neural networks, the number of inputs fails to capture the complexity of the network.
Therefore, in those cases we resort to the [[Degree of Freedom|effective number of parameters]] of the restricted model, as in the sketch below.
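For instance, ridge regression with penalty $\lambda$ is still a linear smoother, with hat matrix $\mathbf{H}_{\lambda}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X}+\lambda \mathbf{I})^{-1}\mathbf{X}^{T}$, so $\mathrm{tr}(\mathbf{H}_{\lambda})$ serves as its effective number of parameters. A minimal sketch (illustrative data):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of ridge regression:
    tr(X (X^T X + lam*I)^{-1} X^T), decreasing from d at lam=0 toward 0."""
    d = X.shape[1]
    return np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda = {lam:6.1f}  df = {ridge_df(X, lam):.3f}")
```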