> [!definition|*] Ensemble
> An ensemble $f_{M}$ of rules $h_{1}, \dots, h_{M}$ (with weights $w=(w_{1},\dots,w_{M})$) is a weighted combination of those rules.
Most commonly, if the $h_{m}$ are regressors, $f_{M}(x)=\sum_{m=1}^{M}w_{m}h_{m}(x),$ and if the $h_{m}$ are classifiers taking values $k \in \{ 1,\dots,K \}$, then $f_{M}$ is the weighted vote $f_{M}(x)=\underset{k}{\arg\max}~\sum_{m=1}^{M}w_{m}\,\mathbb{1}\{h_{m}(x)=k\}.$
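A minimal sketch of the two combination schemes (plain Python; the rules are passed as callables, and all example rules below are illustrative placeholders):

```python
def ensemble_regress(rules, weights, x):
    """Regression: f_M(x) = sum_m w_m * h_m(x)."""
    return sum(w * h(x) for h, w in zip(rules, weights))

def ensemble_classify(rules, weights, x, K):
    """Classification: argmax_k sum_m w_m * 1{h_m(x) = k} (weighted vote)."""
    votes = {k: 0.0 for k in range(1, K + 1)}  # classes are 1, ..., K
    for h, w in zip(rules, weights):
        votes[h(x)] += w
    return max(votes, key=votes.get)

# hypothetical example rules
regressors = [lambda x: x, lambda x: 2 * x, lambda x: -x]
print(ensemble_regress(regressors, [0.5, 0.3, 0.2], 1.0))  # 0.5 + 0.6 - 0.2
```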
## Motivation: Reduction of Variance
For a regression task, let $h^{\ast}$ denote the (unknown) Bayes rule, and $\hat{h}_{d}$ be the estimated rule from a realized dataset $D=d$. The excess risk is $\text{excess risk}(d):= R(\hat{h}_{d})-R(h^{\ast}).$
The bias–variance decomposition gives $\mathbb{E}_{D}[\text{excess risk}(D)]=\mathbb{E}_{X}\Big[b(X)^2+ v(X) \Big],$where $b,v$ are the bias and variance of $\hat{h}_{D}$ over $D$: writing $\bar{h}(X):=\mathbb{E}_{D}[\hat{h}_{D}(X)]$ for the mean rule, $\begin{align*}
b(X)&:=h^{\ast}(X)-\bar{h}(X),\\
v(X)&:=\mathbb{E}_{D}[(\hat{h}_{D}(X)-\bar{h}(X))^{2}].
\end{align*}$
Therefore, if we could obtain many realizations of $\hat{h}_{D}$, averaging them (with suitable weights) would reduce the variance term. Ensemble methods thus rest on the assumption that
> [!assumption|*] Ensembling Weak Learners Reduces Variance
> If $\hat{h}_{D}$ is a **weak learner**, i.e. one with high variance (and ideally with low bias and computational costs), then averaging its realizations $\hat{h}_{d_{1}},\dots ,\hat{h}_{d_{M}}$ with suitable weights produces an ensemble with lower variance.
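A Monte Carlo sketch of the idealized setting (truly independent dataset realizations; the data-generating process $y = x^2 + \text{noise}$ and the deliberately high-variance 1-nearest-neighbour learner are illustrative assumptions):

```python
import random

rng = random.Random(0)

def draw_dataset(n=10):
    # one fresh realization of D: y = x^2 + Gaussian noise (illustrative choice)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, x * x + rng.gauss(0, 0.5)) for x in xs]

def fit_1nn(d):
    # high-variance weak learner: 1-nearest-neighbour regression
    return lambda x: min(d, key=lambda p: abs(p[0] - x))[1]

def variance_of(predict, trials=1000):
    # Monte Carlo estimate of Var_D[prediction] at a fixed query point
    preds = [predict() for _ in range(trials)]
    mean = sum(preds) / trials
    return sum((p - mean) ** 2 for p in preds) / trials

x0 = 0.3  # query point at which we measure prediction variance
v_single = variance_of(lambda: fit_1nn(draw_dataset())(x0))
v_avg10 = variance_of(
    lambda: sum(fit_1nn(draw_dataset())(x0) for _ in range(10)) / 10
)
print(v_single, v_avg10)  # averaging 10 independent fits cuts variance ~10x
```

Since the ten fits are genuinely independent here, the variance drops by a factor of about $M = 10$; the practical difficulty, addressed next, is that we never actually have $M$ independent datasets.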
## Issue: Obtaining New Realizations
One issue is that we typically cannot obtain new, independent datasets: if we could, we might as well fit $\hat{h}$ on the concatenated set $d_{1} \cup \cdots \cup d_{M}$ to reduce variance directly.
There are several workarounds, each starting from a single realization $d$.
First, we can weaken the requirement that $\hat{h}_{d_{1}},\dots,\hat{h}_{d_{M}}$ be actual independent realizations of $\hat{h}_{D}$.
- [[Bootstraps]] offer a way of approximating the data distribution $F_{D}$ (from which we would like to sample new datasets) by $\hat{F}_{D}=\mathrm{Unif}(d)$, i.e. redrawing samples uniformly with replacement from $d$.
- This leads to the [[Bootstrap Ensemble Methods]] like bagging and random forests.
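A minimal bagging sketch along these lines (the weak learner `fit_mean`, which ignores $x$ and predicts the mean response of its sample, is a placeholder; any fitting procedure could be plugged in):

```python
import random

def bootstrap(d, rng):
    # approximate a fresh draw from F_D by Unif(d): resample with replacement
    return [rng.choice(d) for _ in d]

def bag(d, fit, M=25, seed=0):
    # fit the same procedure on M bootstrap resamples, then average (w_m = 1/M)
    rng = random.Random(seed)
    rules = [fit(bootstrap(d, rng)) for _ in range(M)]
    return lambda x: sum(h(x) for h in rules) / M

# placeholder weak learner for illustration
def fit_mean(d):
    m = sum(y for _, y in d) / len(d)
    return lambda x: m

d = [(float(x), 2.0 * x) for x in range(5)]
f_bagged = bag(d, fit_mean)
print(f_bagged(0.0))  # close to the overall mean response, 4.0
```

Random forests follow the same template, additionally decorrelating the trees by randomizing the features considered at each split.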
Alternatively, we can drop the requirement entirely and instead fit many modified copies of $\hat{h}_{d}$, where each fit is adjusted (in either the procedure $\hat{h}$ or the data $d$) so that each learner corrects the previous learners' mistakes, rather than expecting independent mistakes to cancel out.
- This leads to [[Boosting]] methods.
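One concrete instance is $L_{2}$ boosting (gradient boosting with squared loss): each new learner is fit to the *residuals* of the current ensemble, so it concentrates on what the earlier learners got wrong. A sketch with depth-1 regression trees ("stumps") as weak learners; all names and parameters below are illustrative:

```python
def fit_stump(d):
    # weak learner: best single threshold split on x, predicting the
    # mean response on each side
    best = None
    for t, _ in d:
        left = [y for x, y in d if x <= t]
        right = [y for x, y in d if x > t]
        if not right:
            continue  # degenerate split (t is the largest x)
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in d)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def l2_boost(d, M=20, lr=0.5):
    # each stump is fit to the residuals left by the current ensemble
    xs = [x for x, _ in d]
    res = [y for _, y in d]  # residuals start at the raw targets
    rules = []
    for _ in range(M):
        h = fit_stump(list(zip(xs, res)))
        rules.append(h)
        res = [r - lr * h(x) for x, r in zip(xs, res)]
    return lambda x: lr * sum(h(x) for h in rules)

d = [(x / 10, (x / 10) ** 2) for x in range(11)]
f_boost = l2_boost(d)  # piecewise-constant approximation of x^2 on [0, 1]
```

Note the contrast with bagging: here the fits are built sequentially and are deliberately dependent, with the learning rate `lr` shrinking each correction.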