### Bias-Variance Decomposition (Inference)

From [[Point Estimators#^cf7800]]: Analogous to the Pythagorean theorem, bias and variance are like orthogonal components of the MSE:

> [!theorem|*] Bias-Variance Decomposition
> Suppose $T$ is an estimator of some value $g(\theta)$. Then
> $\mathbb{E}[(T-g(\theta))^{2}]=:\mathrm{MSE}_{\theta}(T)=\underbrace{\mathrm{bias}(T;\theta)^{2}}_{\substack{\text{consistent over/}\\ \text{underestimations}}}+\underbrace{\mathrm{Var}_{\theta}(T)}_{\text{imprecision}}$
> so in particular for unbiased estimators, the MSE equals the variance.
>
> > [!proof]- Proof
> > Note that for the error $e(\mathbf{X}):= T(\mathbf{X})-g(\theta)$, $\mathrm{MSE}_{\theta}(T)=\mathbb{E}[e(\mathbf{X})^{2}]$ and $\mathrm{bias}(T;\theta)^{2}=\mathbb{E}[e(\mathbf{X})]^{2}$, so
> > $\mathrm{MSE}_{\theta}(T)-\mathrm{bias}(T;\theta)^{2} = \mathrm{Var}_{\theta}(e(\mathbf{X})) = \mathrm{Var}_{\theta}(T),$
> > where the last equality is because $e(\mathbf{X})$ is just $T(\mathbf{X})$ shifted by a constant.

### Bias-Variance Decomposition (Prediction)

Suppose the response has the relation $Y=f(X)+\epsilon$ where $\epsilon$ is an independent error term with $0$ mean and $\sigma^{2}_{\epsilon}$ variance. The expected squared prediction error at $X=x_{0}$ is
$\begin{align*} \text{Err}(x_{0})&= \mathbb{E}_{Y,\hat{f}}[(Y-\hat{f}(X))^{2}\,|\,X=x_{0}]\\ &= \mathbb{E}[(Y-f(x_{0}))^{2}~|~X=x_{0}]+ \mathbb{E}[(\hat{f}(x_{0})-f(x_{0}))^{2}]\\ &= \sigma^{2}_{\epsilon}+(f(x_{0})-\mathbb{E}\hat{f}(x_{0}))^{2}+\mathbb{E}[(\hat{f}(x_{0})-\mathbb{E} \hat{f}(x_{0}))^{2}]\\ &= \sigma_{\epsilon}^{2}+\text{bias}(\hat{f},f;x_{0})^{2}+\mathrm{Var}(\hat{f}(x_{0})) \end{align*}$
where the cross term in the second line vanishes because the noise $\epsilon=Y-f(x_{0})$ is independent of $\hat{f}$ and has mean $0$. This is the same bias-variance decomposition, but with *an extra noise term $\sigma^{2}_{\epsilon}$ that is an irreducible error inherent to $Y$.*

### Excess $l_{2}$ Risk as Bias-Variance

Framed in the language of [[decision theory]], the MSE is the frequentist $l_{2}$ risk, and *an estimated decision rule $\hat{h}$ is an approximation of the Bayes predictor $h^{\ast}$*, with
$\begin{align*} \text{excess risk}&= R(\hat{h})-R(h^{\ast})\\ &= \mathbb{E}_{X}[(\hat{h}(X)-h^{\ast}(X))^{2}]\\ &= \underbrace{\mathbb{E}_{X}[\hat{h}(X)-h^{\ast}(X)]^{2}}_{\text{bias}^{2}}+{\mathrm{Var}_{X}(\hat{h}(X)-h^{\ast}(X))}, \end{align*}$
which is an identity in the input space $\mathcal{X} \ni X$. However, we can further average over all possible datasets ${D}=(\mathbf{X}, \mathbf{Y})$ and write $\hat{h}_{D}$ for the rule estimated from $D$: by treating $\hat{h}_{D}$ as random, we obtain the *decomposition in dataset space $\mathcal{D} \ni D$*:
$\begin{align*} \mathbb{E}_{{D}}[\text{excess risk}]&= \mathbb{E}_{{D}}\mathbb{E}_{X}[(\hat{h}_{{D}}(X)-h^{\ast}(X))^{2}]\\ &= \mathbb{E}_{X}\mathbb{E}_{{D}}[(\hat{h}_{{D}}(X)-h^{\ast}(X))^{2}]\\ &= \mathbb{E}_{X}[\underbrace{(\bar{h}(X)-h^{\ast}(X))}_{\text{bias}\,=:\, b(X)}{}^{2} + \underbrace{\mathrm{Var}_{D}(\hat{h}_{D}(X))}_{\text{variance}\,=:\, v(X)}]\\ &= \mathbb{E}_{X}[b(X)^{2} + v(X)]. \end{align*}$
Here $\bar{h}(x):= \mathbb{E}_{D}[\hat{h}_{D}(x)]$ is the fitted value at $x$ averaged over all possible datasets.
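
These identities can be checked numerically. Below is a minimal simulation sketch (not from the original note; the sine target, the polynomial fit, and all numerical settings are illustrative choices): a fixed-degree polynomial fit over many simulated datasets plays the role of $\hat{h}_{D}$ (equivalently $\hat{f}$), $h^{\ast}=f$ since the noise is centered, and the point-wise identity $\mathbb{E}_{D}[(\hat{h}_{D}(x)-h^{\ast}(x))^{2}]=b(x)^{2}+v(x)$ is verified before averaging over $X$; adding $\sigma^{2}_{\epsilon}$ recovers $\mathrm{Err}(x_{0})$ from the prediction section.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- setup: Y = f(X) + eps (target f and all settings are illustrative) ---
f = np.sin                       # "true" regression function, so h* = f
sigma_eps = 0.3                  # irreducible noise level sigma_eps
n_train, n_datasets = 30, 2000   # size of each dataset D, number of simulated datasets
degree = 3                       # fixed-degree polynomial fit plays the role of h_D

x0 = np.linspace(-np.pi, np.pi, 50)    # evaluation grid (the average over X)

# --- fit the same procedure on many independent datasets D ---
fits = np.empty((n_datasets, x0.size))
for d in range(n_datasets):
    x = rng.uniform(-np.pi, np.pi, n_train)
    y = f(x) + rng.normal(0.0, sigma_eps, n_train)
    fits[d] = np.polyval(np.polyfit(x, y, degree), x0)

h_bar = fits.mean(axis=0)              # \bar{h}(x): dataset-averaged fit
b2 = (h_bar - f(x0)) ** 2              # squared bias b(x)^2
v = fits.var(axis=0)                   # variance v(x) over datasets (ddof = 0)

# point-wise identity: E_D[(h_D(x) - h*(x))^2] = b(x)^2 + v(x)
mse_over_D = ((fits - f(x0)) ** 2).mean(axis=0)
assert np.allclose(mse_over_D, b2 + v)

# average over X, then add the irreducible term for the prediction error Err(x0)
print(f"E_X[bias^2] = {b2.mean():.4f}   E_X[var] = {v.mean():.4f}")
print(f"E_D[excess risk] = {(b2 + v).mean():.4f}")
print(f"mean Err(x0)     = {(sigma_eps**2 + b2 + v).mean():.4f}")
```

Note that the variance over datasets is the population variance (ddof $=0$), which is what makes the point-wise identity exact rather than approximate.
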
## Classification-Based Decomposition

Consider a variable-pair $(X,Y)$ where $Y \in \{ 1,\dots,K \}$ is the **class**, and we wish to study the variance of $X$.

### Within-and-Between Class Variations

Treating $X$ as a column vector,
$\begin{align*} \mathrm{Cov}(X)&= \mathbb{E}[XX^{T}]-\mu_{X}\mu_{X}^{T}\\ &= \mathbb{E}_{Y}[\mathrm{Cov}(X ~|~ Y)] + \mathbb{E}_{Y}[\mu_{X~|~Y}\mu_{X~|~Y}^{T}]-\mu_{X}\mu_{X}^{T}\\ &= \underbrace{\mathbb{E}_{Y}[\mathrm{Cov}(X ~|~ Y)] }_{\text{in-class variance }=:W} + \underbrace{\mathrm{Cov}(\mathbb{E}[X ~|~ Y])}_{\substack{\text{between-class}\\ \text{variance }=:B\vphantom{\frac{1}{2}}}}. \end{align*}$

The two matrices can be broken down class-wise:
$\begin{align*} W&= \sum_{k} \pi_{k}\Sigma_{k};\\[0.8em] B&= \sum_{k} \pi_{k}\cdot(\mu_{k}-\mu)(\mu_{k}-\mu)^{T}; \end{align*}$
where $\pi_{k}=\mathbb{P}[Y=k]$, $\mu_{k}=\mathbb{E}[X ~|~ Y=k]$, and $\Sigma_{k}=\mathrm{Cov}(X~|~Y=k)$.

For one-dimensional $X \in \mathbb{R}$, the ratio
$\frac{B}{W}=\frac{\mathrm{Cov}_{Y}(\mathbb{E}_{X}[X ~|~ Y])}{\mathbb{E}_{Y}[\mathrm{Cov}_{X}(X ~|~ Y)]}$
is the **separability** of the classes.
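
As a quick numerical sanity check (a sketch with hypothetical parameters; the Gaussian class-conditionals are only a convenient sampling choice, since the identity itself is distribution-free), the following builds $W=\sum_{k}\pi_{k}\Sigma_{k}$ and $B=\sum_{k}\pi_{k}(\mu_{k}-\mu)(\mu_{k}-\mu)^{T}$ directly from the class parameters and compares $W+B$ against the empirical $\mathrm{Cov}(X)$ of a sample drawn from the mixture.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- a toy 2-D, K = 3 class population (hypothetical parameters) ---
pi = np.array([0.5, 0.3, 0.2])                          # class probabilities pi_k
mus = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])   # class means mu_k
Sigmas = np.stack([np.eye(2),
                   np.array([[1.0, 0.3], [0.3, 0.5]]),
                   np.array([[2.0, -0.4], [-0.4, 1.0]])])  # class covariances Sigma_k

mu = pi @ mus                                  # overall mean: sum_k pi_k mu_k

W = np.einsum("k,kij->ij", pi, Sigmas)         # within-class: sum_k pi_k Sigma_k
dev = mus - mu                                 # rows are mu_k - mu
B = np.einsum("k,ki,kj->ij", pi, dev, dev)     # between-class: sum_k pi_k (mu_k-mu)(mu_k-mu)^T

# --- Monte Carlo check that Cov(X) = W + B ---
n = 200_000
counts = rng.multinomial(n, pi)                # number of draws per class
X = np.vstack([rng.multivariate_normal(mus[k], Sigmas[k], size=counts[k])
               for k in range(len(pi))])
print("W + B =\n", W + B)
print("empirical Cov(X) =\n", np.cov(X, rowvar=False, ddof=0))
```

The two printed matrices should agree up to Monte Carlo error.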