**Cross validation (CV)** is a method for estimating out-of-sample prediction error $\mathrm{Err}$ of a model $\hat{f}$ trained on a training set $\mathcal{T}$.
- It is not an effective estimator of the training set-specific $\mathrm{Err}_{\mathcal{T}}$.
- Let $\hat{f}(\cdot\,;\, \mathcal{C})$ denote the model fitted on some (sub-)dataset $\mathcal{C}$.
It sets aside part of $\mathcal{T}$ as a validation set $\mathcal{V}$ and fits the same model on $\mathcal{T}-\mathcal{V}$ to get $\hat{f}(\cdot \,;\,\mathcal{T}-\mathcal{V})$. The **CV error** is simply the prediction loss of $\hat{f}(\cdot \,;\,\mathcal{T}-\mathcal{V})$ on $\mathcal{V}$: $\mathrm{CV}_{\mathcal{V}}:=\frac{1}{|\mathcal{V}|}\sum_{(x_{i},y_{i}) \in \mathcal{V}}L(y_{i},\,\hat{f}(x_{i} \,;\,\mathcal{T}-\mathcal{V})),$ which is then used as an estimate of the out-of-sample prediction error of the model fitted on the full training set $\mathcal{T}$.
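A minimal sketch of the hold-out CV error, using a hypothetical least-squares toy model and squared-error loss (the data, split fraction, and model are illustrative assumptions, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set T: y = 2x + noise
N = 100
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)

# Hold out 20% of T as the validation set V
idx = rng.permutation(N)
val, train = idx[:N // 5], idx[N // 5:]

# Fit the same model on T - V (here: a least-squares line)
beta = np.polyfit(x[train], y[train], deg=1)

# CV_V = average squared-error loss of f_hat(. ; T - V) on V
cv_error = np.mean((y[val] - np.polyval(beta, x[val])) ** 2)
print(cv_error)
```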
### $K$-fold Cross Validation
$K$-fold cross validation separates $\mathcal{T}$ into $K$ (equally sized) subsets $\mathcal{V}_{1},\dots,\mathcal{V}_{K}$ and fits $\hat{f}$ on $\mathcal{T}-\mathcal{V}_{k}$, giving the $k$th CV error: $\mathrm{CV}_{k}:=\frac{1}{|\mathcal{V}_{k}|}\sum_{(x_{i},y_{i}) \in \mathcal{V}_{k}}L(y_{i},\hat{f}(x_{i};\,\mathcal{T}-\mathcal{V}_{k})),$ and the out-of-sample error estimate is the average $\mathrm{CV}:=\frac{1}{K}\sum_{k=1}^{K}\mathrm{CV}_{k}.$
Typical choices are $K=5$ or $10$: larger values are computationally expensive, while smaller values make each validation set $\mathcal{V}_{k}$ too large, so each model is fit on much less data than $\mathcal{T}$ and the error is overestimated.
- For the same $K$, the upward bias is more significant when the sample size $N=|\mathcal{T}|$ is small, so $K=10$ is preferred for small training sets.
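A minimal sketch of $K$-fold CV under the same hypothetical squared-error setup (the fold assignment and least-squares model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 5
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)

# Split T into K (roughly) equally sized folds V_1, ..., V_K
folds = np.array_split(rng.permutation(N), K)

cv_k = []
for val in folds:
    train = np.setdiff1d(np.arange(N), val)
    beta = np.polyfit(x[train], y[train], deg=1)                    # fit on T - V_k
    cv_k.append(np.mean((y[val] - np.polyval(beta, x[val])) ** 2))  # CV_k

cv = np.mean(cv_k)  # CV = (1/K) * sum_k CV_k
print(cv)
```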
When $K=N$ (the sample size), it is called **leave-one-out CV (LOOCV)** and is relatively unbiased but computationally expensive. It is given by $\mathrm{LOOCV}:= \frac{1}{N}\sum_{i}L(y_{i}, \hat{f}(x_{i} ;\, \mathcal{T}_{-i})),$ where $\mathcal{T}_{-i}$ is the dataset with the $i$th observation removed.
It has expectation $\mathbb{E}[\mathrm{LOOCV}]=\frac{1}{N}\sum_{i}R(\hat{f}(\cdot\,;\, \mathcal{T}_{-i})),$ where $R(\hat{f}(\cdot\,;\,\mathcal{C}))$ denotes the risk (out-of-sample prediction error) of the model fitted on $\mathcal{C}$. Assuming $\hat{f}(\cdot\,;\,\mathcal{T}_{-i})\approx \hat{f}(\cdot\,;\,\mathcal{T})$ when the sample size is large, $\mathbb{E}[\mathrm{LOOCV}]\approx R(\hat{f}(\cdot\,;\, \mathcal{T}))$, so LOOCV is a good estimate of the out-of-sample prediction error of the full-data fit.
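A sketch of the naive LOOCV loop (hypothetical data and least-squares model as before; note the $N$ separate refits):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)

# Naive LOOCV: refit the model N times, each time leaving out one observation
errors = []
for i in range(N):
    mask = np.arange(N) != i                    # indices of T_{-i}
    beta = np.polyfit(x[mask], y[mask], deg=1)  # f_hat(. ; T_{-i})
    errors.append((y[i] - np.polyval(beta, x[i])) ** 2)

loocv = np.mean(errors)
print(loocv)
```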
### LOOCV for Linear Smoothers
For many [[linear smoothers]] that produce predictions $\hat{\mathbf{y}}=\mathbf{S}\mathbf{y},$ where $\mathbf{S}$ is allowed to depend on $\mathbf{X}$ (and potentially other parameters) but not on $\mathbf{y}$, LOOCV can be computed cheaply without refitting the model once per observation.
> [!theorem|*] Simple LOOCV Error for Linear Smoothers
> If a linear smoother produces weights such that $l_{j}(x_{i};\mathbf{x}_{-i})=\frac{l_{j}(x_{i};\mathbf{x})}{\sum_{k \ne i}l_{k}(x_{i};\mathbf{x})}=\frac{s_{ij}}{\sum_{k \ne i}s_{ik}}.$
> (that is, removing the $i$th observation rescales the weights put on every other observation by the same factor), then its LOOCV error has the form
> $\mathrm{LOOCV}=\frac{1}{N}\sum_{i=1}^{N}\left[ \frac{y_{i}-\hat{m}(x_{i})}{1-s_{ii}} \right]^{2}$where $\hat{m}$ is the model fit on the entire dataset, and $s_{ii}$ is the $i$th diagonal element of $\mathbf{S}$, the matrix where $\hat{m}(\mathbf{x})=\mathbf{Sy}$.
> > [!proof]-
> > The LOOCV error of the $i$th observation is $\begin{align*} y_{i} - \hat{m}^{(-i)}(x_{i})&= y_{i}- \left( \sum_{j \ne i}l_{j}(x_{i};\mathbf{x}_{-i})\cdot y_{j} \right)\\ &= y_{i}-\Bigg(\frac{1}{\sum_{j \ne i}s_{ij}} \underbrace{\sum_{j \ne i}s_{ij} \cdot y_{j}}_{\hat{m}(x_{i})-s_{ii}y_{i}}\Bigg)\\ &= \frac{1}{1-s_{ii}}(y_{i}-\hat{m}(x_{i})), \end{align*}$ where the last step assumes the smoothing weights sum to one, so that $\sum_{j \ne i}s_{ij}=1-s_{ii}$. Averaging the squared errors gives the desired form.
^61f6c1
- This inflates each sample's residual by a factor of $1/(1-s_{ii})\ge 1$ to account for the optimism of in-sample residuals (overfitting).
- The weights condition holds for [[linear regression methods]] such as OLS and ridge regression; see the sketch below. The lasso fit is not linear in the response $\mathbf{y}$, so the shortcut does not apply.
- Smoothers like the Nadaraya-Watson weighted average satisfy this condition exactly; the shortcut is also a good approximation for higher-order [[Localized Methods#Regression with Kernels|local regression methods]] that do not satisfy it exactly.
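A sketch verifying the shortcut for a hypothetical ridge smoother (synthetic data; the penalty $\lambda$ is applied to all coefficients, including the intercept, purely for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 60, 1.0
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)
X = np.column_stack([np.ones(N), x])  # design matrix with intercept

# Ridge smoother matrix S: y_hat = S y, with S = X (X'X + lam*I)^{-1} X'
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T)
y_hat = S @ y

# Shortcut LOOCV: no refitting, just inflate each residual by 1/(1 - s_ii)
s_ii = np.diag(S)
loocv_shortcut = np.mean(((y - y_hat) / (1 - s_ii)) ** 2)

# Naive LOOCV for comparison: N explicit refits with the same penalty
errors = []
for i in range(N):
    m = np.arange(N) != i
    beta = np.linalg.solve(X[m].T @ X[m] + lam * np.eye(2), X[m].T @ y[m])
    errors.append((y[i] - X[i] @ beta) ** 2)

print(loocv_shortcut, np.mean(errors))  # agree up to floating-point error
```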
To further simplify the computation,
> [!definition|*] Generalized CV Error for Kernel Regression
> The **GCV** estimate of out-of-sample prediction MSE is $\mathrm{GCV}:=\frac{1}{N}\sum_{i=1}^{N}\left[ \frac{y_{i}-\hat{m}(x_{i})}{1-\mathrm{tr}(\mathbf{S}) / N} \right]^{2},$ where the average leverage $\mathrm{tr}(\mathbf{S}) / N$ replaces each $s_{ii}$ in the LOOCV formula above, as it is usually cheaper to compute.
^e66897
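Continuing with a hypothetical ridge smoother (same synthetic setup as above), GCV only needs the residuals of the full fit and the trace of $\mathbf{S}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 60, 1.0
x = rng.uniform(-1, 1, N)
y = 2 * x + rng.normal(scale=0.5, size=N)
X = np.column_stack([np.ones(N), x])

# Smoother matrix and full-data fit
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T)
y_hat = S @ y

# GCV: replace each s_ii by the average leverage tr(S)/N
gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / N)) ** 2)
print(gcv)
```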