### Bias-Variance Decomposition (Inference)
From [[Point Estimators#^cf7800]]:
Analogous to the Pythagorean theorem, bias and variance are like orthogonal components of the MSE:
> [!theorem|*] Bias-Variance Decomposition
> Suppose $T$ is an estimator of some value $g(\theta)$. Then
> $\mathbb{E}[(T-g(\theta))^{2}]=:\mathrm{MSE}_{\theta}(T)=\underbrace{\mathrm{bias}(T;\theta)^{2}}_{\substack{\text{consistent over/}\\ \text{underestimations}}}+\underbrace{\mathrm{Var}_{\theta}(T)}_{\text{imprecision}}$so in particular for unbiased estimators, their MSE equals their variance.
> > [!proof]- Proof
> > Note that for the error $e(\mathbf{X}):= T(\mathbf{X})-g(\theta)$, $\mathrm{MSE}_{\theta}(T)=\mathbb{E}[e(\mathbf{X})^{2}]$ and $\mathrm{bias}(T;\theta)^{2}=\mathbb{E}[e(\mathbf{X})]^{2}$, so $
\mathrm{MSE}_{\theta}(T)-\mathrm{bias}(T;\theta)^{2} = \mathrm{Var}_{\theta}(e(\mathbf{X})) = \mathrm{Var}_{\theta}(T),$where the last equality is because $e(\mathbf{X})$ is just $T(\mathbf{X})$ shifted by a constant.
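A quick Monte Carlo sketch of the identity (the normal model, sample size, and the choice of the biased variance MLE are illustrative assumptions, not from the text above): simulate many samples, compute the MLE $\hat{\sigma}^{2}$ of the variance, and compare the empirical MSE against $\mathrm{bias}^{2}+\mathrm{Var}$.
```python
# Monte Carlo sketch (hypothetical setup): estimate the variance of an
# N(0, sigma^2) sample with the biased MLE (divides by n) and check that
# MSE = bias^2 + Var across many simulated samples.
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 10, 200_000            # true g(theta), sample size, replications

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
T = samples.var(axis=1)                       # MLE of sigma^2: divides by n, hence biased

mse = np.mean((T - sigma2) ** 2)
bias2 = (T.mean() - sigma2) ** 2
var = T.var()
print(f"MSE = {mse:.4f}, bias^2 + Var = {bias2 + var:.4f}")   # the two coincide
```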
### Bias-Variance Decomposition (Prediction)
Suppose the response has the relation $Y=f(X)+\epsilon$ where $\epsilon$ is an independent error term with $0$ mean and $\sigma^{2}_{\epsilon}$ variance.
The expected squared prediction error at $X=x_{0}$ is $\begin{align*}
\text{Err}(x_{0})&= \mathbb{E}_{Y,\hat{f}}[(Y-\hat{f}(X))^{2}\,|\,X=x_{0}]\\
&= \mathbb{E}[(Y-f(x_{0}))^{2}~|~X=x_{0}]+ \mathbb{E}[(\hat{f}(x_{0})-f(x_{0}))^{2}]\\
&= \sigma^{2}_{\epsilon}+(f(x_{0})-\mathbb{E}\hat{f}(x_{0}))^{2}+\mathbb{E}[(\hat{f}(x_{0})-\mathbb{E} \hat{f}(x_{0}))^{2}]\\
&= \sigma_{\epsilon}^{2}+\text{bias}(\hat{f},f;x_{0})^{2}+\mathrm{Var}(\hat{f}(x_{0}))
\end{align*}$which is the same bias-variance decomposition, but with *an extra noise term $\sigma^{2}_{\epsilon}$ that is an irreducible error inherent to $Y$* (the cross term in the second line vanishes because $\epsilon$ has zero mean and is independent of $\hat{f}$).
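A simulation sketch of this decomposition (the sine signal, the underfit linear model, and all constants are illustrative assumptions): many training sets are drawn, the model is refit each time, and the error at $x_{0}$ is compared against $\sigma^{2}_{\epsilon}+\mathrm{bias}^{2}+\mathrm{Var}$.
```python
# Sketch (hypothetical setup): Y = f(X) + eps with f = sin, a deliberately
# underfit degree-1 polynomial refit on fresh training sets, and the error at
# x0 decomposed into sigma_eps^2 + bias^2 + variance.
import numpy as np

rng = np.random.default_rng(1)
f = np.sin
sigma_eps, n, reps, x0 = 0.5, 30, 20_000, 1.0

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-3, 3, n)
    y = f(x) + rng.normal(0, sigma_eps, n)
    coef = np.polyfit(x, y, deg=1)            # underfit: linear model for a sine
    preds[r] = np.polyval(coef, x0)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
err = np.mean((f(x0) + rng.normal(0, sigma_eps, reps) - preds) ** 2)   # fresh Y at x0

print(f"Err(x0)                = {err:.4f}")
print(f"sigma^2 + bias^2 + Var = {sigma_eps**2 + bias2 + var:.4f}")
```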
### Excess $l_{2}$ Risk as Bias-Variance
Framed in the language of [[decision theory]], the MSE is the frequentist $l_{2}$ risk, and *an estimated decision rule $\hat{h}$ is viewed as an approximation of the Bayes predictor $h^{\ast}$*, with $\begin{align*}
\text{excess risk}&= R(\hat{h})-R(h^{\ast})\\
&= \mathbb{E}_{X}[(\hat{h}(X)-h^{\ast}(X))^{2}]\\
&= \underbrace{\mathbb{E}_{X}[\hat{h}(X)-h^{\ast}(X)]^{2}}_{\text{bias}^{2}}+{\mathrm{Var}_{X}(\hat{h}(X)-h^{\ast}(X))},
\end{align*}$which is an identity in the input space $\mathcal{X} \ni X$.
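A short Monte Carlo sketch of the input-space identity (the quadratic Bayes predictor, the fixed plug-in rule $\hat{h}$, and the uniform design are illustrative assumptions): it checks that $R(\hat{h})-R(h^{\ast})$ equals $\mathbb{E}_{X}[(\hat{h}(X)-h^{\ast}(X))^{2}]$ and that this splits into the squared mean difference plus the variance of the difference over $X$.
```python
# Sketch (hypothetical setup): squared loss, Bayes predictor
# h_star(x) = E[Y|X=x] = x**2, a fixed plug-in rule h_hat, X ~ Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
X = rng.uniform(0, 1, N)
Y = X**2 + rng.normal(0, 0.5, N)             # so h_star(x) = x**2

h_star = X**2
h_hat = 1.2 * X**2 + 0.3                     # some fixed, slightly-off decision rule
diff = h_hat - h_star

excess = np.mean((Y - h_hat) ** 2) - np.mean((Y - h_star) ** 2)   # R(h_hat) - R(h*)
print(f"excess risk     = {excess:.4f}")
print(f"E_X[diff^2]     = {np.mean(diff**2):.4f}")
print(f"bias^2 + Var_X  = {diff.mean()**2 + diff.var():.4f}")
```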
However, we can further average over all possible datasets ${D}=(\mathbf{X}, \mathbf{Y})$ and denote $\hat{h}_{D}$ to be the estimated relationship using $D$: by treating $\hat{h}_{D}$ as random, we obtain the *decomposition in dataset space $\mathcal{D} \ni D$*: $\begin{align*}
\mathbb{E}_{{D}}[\text{excess risk}]&= \mathbb{E}_{{D}}\mathbb{E}_{X}[(\hat{h}_{{D}}(X)-h^{\ast}(X))^{2}]\\
&= \mathbb{E}_{X}\mathbb{E}_{{D}}[(\hat{h}_{{D}}(X)-h^{\ast}(X))^{2}]\\
&= \mathbb{E}_{X}[\underbrace{(\bar{h}(X)-h^{\ast}(X))}_{\text{bias}\,=:\, b(X)}{}^{2} + \underbrace{\mathrm{Var}_{D}(\hat{h}_{D}(X))}_{\text{variance}\,=:\, v(X)}]\\
&= \mathbb{E}_{X}[b(X)^{2} + v(X)].
\end{align*}$Here $\bar{h}(x):= \mathbb{E}_{D}[\hat{h}_{D}(x)]$ is the fitted value at $x$ averaged over all possible datasets.
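The dataset-space version can be checked the same way; in this sketch (again with an illustrative quadratic $h^{\ast}$ and a misspecified linear $\hat{h}_{D}$) the fit is repeated over many datasets, $b(x)$ and $v(x)$ are estimated pointwise, and the two sides of the last display are compared.
```python
# Sketch (hypothetical setup): refit a misspecified linear rule h_hat_D on many
# datasets D drawn from h_star(x) = x**2 plus noise, estimate b(x) and v(x)
# pointwise, and compare both sides for X ~ Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 2_000
grid = rng.uniform(0, 1, 2_000)               # draws of X to average over

fits = np.empty((reps, grid.size))
for r in range(reps):
    xD = rng.uniform(0, 1, n)
    yD = xD**2 + rng.normal(0, 0.3, n)
    coef = np.polyfit(xD, yD, deg=1)           # h_hat_D: misspecified linear rule
    fits[r] = np.polyval(coef, grid)

h_star = grid**2
b = fits.mean(axis=0) - h_star                 # bias b(x) over datasets
v = fits.var(axis=0)                           # variance v(x) over datasets

lhs = np.mean((fits - h_star) ** 2)            # E_D E_X[(h_hat_D(X) - h*(X))^2]
rhs = np.mean(b**2 + v)                        # E_X[b(X)^2 + v(X)]
print(f"{lhs:.5f}  {rhs:.5f}")                 # identical up to floating-point error
```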
## Classification-Based Decomposition
Consider a variable-pair $(X,Y)$ where $Y \in \{ 1,\dots,K \}$ is the **class**, and we wish to study the variation of $X$.
### Within-and-Between Class Variations
Treating $X$ as a column-vector,
$\begin{align*}
\mathrm{Cov}(X)&= \mathbb{E}[XX^{T}]-\mu_{X}\mu_{X}^{T}\\
&= \mathbb{E}_{Y}[\mathrm{Cov}(X ~|~ Y)] + \mathbb{E}_{Y}[\mu_{X~|~Y}\mu_{X~|~Y}^{T}]-\mu_{X}\mu_{X}^{T}\\
&= \underbrace{\mathbb{E}_{Y}[\mathrm{Cov}(X ~|~ Y)] }_{\substack{\text{within-class}\\ \text{variance }=:W}} + \underbrace{\mathrm{Cov}(\mathbb{E}[X ~|~ Y])}_{\substack{\text{between-class}\\ \text{variance }=:B\vphantom{\frac{1}{2}}}}.
\end{align*}$
The two matrices can be broken down class-wise: $\begin{align*}
W&= \sum_{k} \pi_{k}\Sigma_{k};\\[0.8em]
B&= \sum_{k} \pi_{k}\cdot(\mu_{k}-\mu)(\mu_{k}-\mu)^{T};
\end{align*}$where $\pi_{k}=\mathbb{P}[Y=k]$, $\mu_{k}=\mathbb{E}[X ~|~ Y=k]$, $\mu=\mathbb{E}[X]$, and $\Sigma_{k}=\mathrm{Cov}(X~|~Y=k)$.
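A small numerical check of the decomposition and the class-wise formulas (the Gaussian class-conditionals, priors, and dimension are illustrative assumptions): compare the empirical $\mathrm{Cov}(X)$ against $W+B$ computed from $\pi_{k}$, $\mu_{k}$, $\Sigma_{k}$.
```python
# Sketch (hypothetical setup): X | Y=k ~ N(mu_k, Sigma_k) in R^2 with K = 3
# classes, comparing the empirical Cov(X) against W + B from the class-wise formulas.
import numpy as np

rng = np.random.default_rng(4)
pis = np.array([0.5, 0.3, 0.2])                           # pi_k = P[Y = k]
mus = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])     # mu_k = E[X | Y = k]
Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2),
                   np.array([[1.0, 0.3], [0.3, 2.0]])])   # Sigma_k = Cov(X | Y = k)

N = 500_000
Y = rng.choice(3, size=N, p=pis)
X = np.empty((N, 2))
for k in range(3):
    idx = np.where(Y == k)[0]
    X[idx] = rng.multivariate_normal(mus[k], Sigmas[k], size=idx.size)

mu = pis @ mus                                             # overall mean E[X]
W = sum(p * S for p, S in zip(pis, Sigmas))                # within-class: sum_k pi_k Sigma_k
B = sum(p * np.outer(m - mu, m - mu) for p, m in zip(pis, mus))  # between-class

print(np.cov(X, rowvar=False))    # empirical Cov(X)
print(W + B)                      # agrees up to sampling error
```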
For one-dimensional $X \in \mathbb{R}$, the ratio $\frac{B}{W}=\frac{\mathrm{Var}(\mathbb{E}[X ~|~ Y])}{\mathbb{E}[\mathrm{Var}(X ~|~ Y)]}$is the **separability** of the classes.
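For instance, in a one-dimensional sketch with two unit-variance Gaussian classes and equal priors (illustrative assumptions), $B/W$ is small when the class means are close and grows as they separate.
```python
# One-dimensional sketch (hypothetical setup): two Gaussian classes with unit
# variance and equal priors; B/W grows as the class means move apart.
import numpy as np

def separability(mu0, mu1, sd=1.0, pi0=0.5):
    pis = np.array([pi0, 1 - pi0])
    mus = np.array([mu0, mu1])
    mu = pis @ mus
    B = np.sum(pis * (mus - mu) ** 2)        # Var(E[X | Y])
    W = np.sum(pis * sd**2)                  # E[Var(X | Y)]
    return B / W

print(separability(0.0, 1.0))   # overlapping classes    -> 0.25
print(separability(0.0, 4.0))   # well-separated classes -> 4.0
```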