**Cross validation (CV)** is a method for estimating out-of-sample prediction error $\mathrm{Err}$ of a model $\hat{f}$ trained on a training set $\mathcal{T}$.
- It is not an effective estimator of the training set-specific $\mathrm{Err}_{\mathcal{T}}$.
- Let $\hat{f}(\cdot\,;\, \mathcal{C})$ denote the model fitted on some (sub-)dataset $\mathcal{C}$.
It sets aside part of $\mathcal{T}$ as a validation set $\mathcal{V}$ and fits the same model on $\mathcal{T}-\mathcal{V}$ to get $\hat{f}(\cdot \,;\,\mathcal{T}-\mathcal{V})$. The **CV error** is simply the prediction loss of $\hat{f}(\cdot \,;\,\mathcal{T}-\mathcal{V})$ on $\mathcal{V}$: $\mathrm{CV}_{\mathcal{V}}:=\frac{1}{|\mathcal{V}|}\sum_{(x_{i},y_{i}) \in \mathcal{V}}L(y_{i},\,\hat{f}(x_{i} \,;\,\mathcal{T}-\mathcal{V})),$ which is then used as an estimate of the out-of-sample prediction error of the model fitted on the full training set $\mathcal{T}$.
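A minimal sketch of the hold-out CV error, using a hypothetical least-squares toy model and squared-error loss (the data, split fraction, and model are illustrative assumptions, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set T: y = 2x + noise
N = 100
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)

# Hold out 20% of T as the validation set V
idx = rng.permutation(N)
val, train = idx[:N // 5], idx[N // 5:]

# Fit the same model on T - V (here: a least-squares line)
beta = np.polyfit(x[train], y[train], deg=1)

# CV_V = average squared-error loss of f_hat(. ; T - V) on V
cv_error = np.mean((y[val] - np.polyval(beta, x[val])) ** 2)
print(cv_error)
```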
### $K$-fold Cross Validation
$K$-fold cross validation separates $\mathcal{T}$ into $K$ (equally sized) subsets $\mathcal{V}_{1},\dots,\mathcal{V}_{K}$ and fits $\hat{f}$ on $\mathcal{T}-\mathcal{V}_{k}$, giving the $k$th CV error: $\mathrm{CV}_{k}:=\frac{1}{|\mathcal{V}_{k}|}\sum_{(x_{i},y_{i}) \in \mathcal{V}_{k}}L(y_{i},\hat{f}(x_{i};\,\mathcal{T}-\mathcal{V}_{k})),$ and the out-of-sample error estimate is the average $\mathrm{CV}:=\frac{1}{K}\sum_{k=1}^{K}\mathrm{CV}_{k}.$
Typical choices are $K=5$ or $10$: larger values are computationally expensive, while smaller values make each validation set $\mathcal{V}_{k}$ too large, so each model is fit on much less data than $\mathcal{T}$ and the error is overestimated.
- For the same $K$, the upward bias is more significant when the sample size $N=|\mathcal{T}|$ is small, so $K=10$ is preferred for small training sets.
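A minimal sketch of $K$-fold CV under the same hypothetical squared-error setup (the fold assignment and least-squares model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 5
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)

# Split T into K (roughly) equally sized folds V_1, ..., V_K
folds = np.array_split(rng.permutation(N), K)

cv_k = []
for val in folds:
    train = np.setdiff1d(np.arange(N), val)
    beta = np.polyfit(x[train], y[train], deg=1)                    # fit on T - V_k
    cv_k.append(np.mean((y[val] - np.polyval(beta, x[val])) ** 2))  # CV_k

cv = np.mean(cv_k)  # CV = (1/K) * sum_k CV_k
print(cv)
```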
When $K=N$ (the sample size), it is called **leave-one-out CV (LOOCV)** and is relatively unbiased but computationally expensive. It is given by $\mathrm{LOOCV}:= \frac{1}{N}\sum_{i}L(y_{i}, \hat{f}(x_{i} ;\, \mathcal{T}_{-i})),$ where $\mathcal{T}_{-i}$ is the dataset with the $i$th observation removed.
It has expectation $\mathbb{E}[\mathrm{LOOCV}]=\frac{1}{N}\sum_{i}R(\hat{f}(\cdot\,;\, \mathcal{T}_{-i})),$ where $R(\hat{f}(\cdot\,;\,\mathcal{C}))$ denotes the risk (out-of-sample prediction error) of the model fitted on $\mathcal{C}$. Assuming $\hat{f}(\cdot\,;\,\mathcal{T}_{-i})\approx \hat{f}(\cdot\,;\,\mathcal{T})$ when the sample size is large, $\mathbb{E}[\mathrm{LOOCV}]\approx R(\hat{f}(\cdot\,;\, \mathcal{T}))$, so LOOCV is a good estimate of the out-of-sample prediction error of the full-data fit.
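A sketch of the naive LOOCV loop (hypothetical data and least-squares model as before; note the $N$ separate refits):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)

# Naive LOOCV: refit the model N times, each time leaving out one observation
errors = []
for i in range(N):
    mask = np.arange(N) != i                    # indices of T_{-i}
    beta = np.polyfit(x[mask], y[mask], deg=1)  # f_hat(. ; T_{-i})
    errors.append((y[i] - np.polyval(beta, x[i])) ** 2)

loocv = np.mean(errors)
print(loocv)
```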
### LOOCV for Linear Smoothers
For many [[linear smoothers]] that produce predictions $\hat{\mathbf{y}}=\mathbf{S}\mathbf{y},$ where $\mathbf{S}$ is allowed to depend on $\mathbf{X}$ (and potentially other parameters) but not on $\mathbf{y}$, LOOCV can be computed cheaply without refitting the model once per observation.
> [!theorem|*] Simple LOOCV Error for Linear Smoothers
> If a linear smoother produces weights such that $l_{j}(x_{i};\mathbf{x}_{-i})=\frac{l_{j}(x_{i};\mathbf{x})}{\sum_{k \ne i}l_{k}(x_{i};\mathbf{x})}=\frac{s_{ij}}{\sum_{k \ne i}s_{ik}}.$
> (that is, removing the $i$th observation rescales the weights put on every other observation by the same factor), then its LOOCV error has the form
> $\mathrm{LOOCV}=\frac{1}{N}\sum_{i=1}^{N}\left[ \frac{y_{i}-\hat{m}(x_{i})}{1-s_{ii}} \right]^{2}$where $\hat{m}$ is the model fit on the entire dataset, and $s_{ii}$ is the $i$th diagonal element of $\mathbf{S}$, the matrix where $\hat{m}(\mathbf{x})=\mathbf{Sy}$.
> > [!proof]-
> > The LOOCV error of the $i$th observation is $\begin{align*} y_{i} - \hat{m}^{(-i)}(x_{i})&= y_{i}- \left( \sum_{j \ne i}l_{j}(x_{i};\mathbf{x}_{-i})\cdot y_{j} \right)\\ &= y_{i}-\Bigg(\frac{1}{\sum_{j \ne i}s_{ij}} \underbrace{\sum_{j \ne i}s_{ij} \cdot y_{j}}_{\hat{m}(x_{i})-s_{ii}y_{i}}\Bigg)\\ &= \frac{1}{1-s_{ii}}(y_{i}-\hat{m}(x_{i})), \end{align*}$ where the last step assumes the smoothing weights sum to one, so that $\sum_{j \ne i}s_{ij}=1-s_{ii}$. Averaging the squared errors gives the desired form.
^61f6c1
- This inflates each sample's residual by a factor of $1/(1-s_{ii})\ge 1$ to account for the optimism of in-sample residuals (overfitting).
- The weights condition holds for [[linear regression methods]] such as OLS and ridge regression; see the sketch below. The lasso fit is not linear in the response $\mathbf{y}$, so the shortcut does not apply.
- Smoothers like the Nadaraya-Watson weighted average satisfy this condition exactly; the shortcut is also a good approximation for higher-order [[Localized Methods#Regression with Kernels|local regression methods]] that do not satisfy it exactly.
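A sketch verifying the shortcut for a hypothetical ridge smoother (synthetic data; the penalty $\lambda$ is applied to all coefficients, including the intercept, purely for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 60, 1.0
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)
X = np.column_stack([np.ones(N), x])  # design matrix with intercept

# Ridge smoother matrix S: y_hat = S y, with S = X (X'X + lam*I)^{-1} X'
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T)
y_hat = S @ y

# Shortcut LOOCV: no refitting, just inflate each residual by 1/(1 - s_ii)
s_ii = np.diag(S)
loocv_shortcut = np.mean(((y - y_hat) / (1 - s_ii)) ** 2)

# Naive LOOCV for comparison: N explicit refits with the same penalty
errors = []
for i in range(N):
    m = np.arange(N) != i
    beta = np.linalg.solve(X[m].T @ X[m] + lam * np.eye(2), X[m].T @ y[m])
    errors.append((y[i] - X[i] @ beta) ** 2)

print(loocv_shortcut, np.mean(errors))  # agree up to floating-point error
```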
To further simplify the computation,
> [!definition|*] Generalized CV Error for Kernel Regression
> The **GCV** estimate of out-of-sample prediction MSE is $\mathrm{GCV}:=\frac{1}{N}\sum_{i=1}^{N}\left[ \frac{y_{i}-\hat{m}(x_{i})}{1-\mathrm{tr}(\mathbf{S}) / N} \right]^{2},$ where the average leverage $\mathrm{tr}(\mathbf{S}) / N$ replaces each $s_{ii}$ in the LOOCV formula above, as it is usually cheaper to compute.
^e66897
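Continuing with a hypothetical ridge smoother (same synthetic setup as above), GCV only needs the residuals of the full fit and the trace of $\mathbf{S}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 60, 1.0
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)
X = np.column_stack([np.ones(N), x])

# Smoother matrix and full-data fit
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T)
y_hat = S @ y

# GCV: replace each s_ii by the average leverage tr(S)/N
gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / N)) ** 2)
print(gcv)
```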