> [!definition|*] Ensemble
> An ensemble $f_{M}$ of rules $h_{1}, \dots, h_{M}$ (with weights $w=(w_{1},\dots,w_{M})$) is a weighted combination of those rules.
Most commonly, if the $h_{m}$ are regressors, $f_{M}(x)=\sum_{m=1}^{M}w_{m}h_{m}(x),$ and if the $h_{m}$ are classifiers taking values $k \in \{ 1,\dots,K \}$, then $f_{M}$ is the weighted vote $f_{M}(x)=\underset{k}{\arg\max}~\sum_{m=1}^{M}w_{m}\,\mathbb{1}\{h_{m}(x)=k\}.$
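A minimal sketch of the two combination schemes (plain Python; the rules are passed as callables, and all example rules below are illustrative placeholders):

```python
def ensemble_regress(rules, weights, x):
    """Regression: f_M(x) = sum_m w_m * h_m(x)."""
    return sum(w * h(x) for h, w in zip(rules, weights))

def ensemble_classify(rules, weights, x, K):
    """Classification: argmax_k sum_m w_m * 1{h_m(x) = k} (weighted vote)."""
    votes = {k: 0.0 for k in range(1, K + 1)}  # classes are 1, ..., K
    for h, w in zip(rules, weights):
        votes[h(x)] += w
    return max(votes, key=votes.get)

# hypothetical example rules
regressors = [lambda x: x, lambda x: 2 * x, lambda x: -x]
print(ensemble_regress(regressors, [0.5, 0.3, 0.2], 1.0))  # 0.5 + 0.6 - 0.2
```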
## Motivation: Reduction of Variance
For a regression task, let $h^{\ast}$ denote the (unknown) Bayes rule, and $\hat{h}_{d}$ be the estimated rule from a realized dataset $D=d$. The excess risk is $\text{excess risk}(d):= R(\hat{h}_{d})-R(h^{\ast}).$
The bias–variance decomposition gives $\mathbb{E}_{D}[\text{excess risk}(D)]=\mathbb{E}_{X}\Big[b(X)^2+ v(X) \Big],$where $b,v$ are the bias and variance of $\hat{h}_{D}$ over $D$: writing $\bar{h}(X):=\mathbb{E}_{D}[\hat{h}_{D}(X)]$ for the mean rule, $\begin{align*}
b(X)&:=h^{\ast}(X)-\bar{h}(X),\\
v(X)&:=\mathbb{E}_{D}[(\hat{h}_{D}(X)-\bar{h}(X))^{2}].
\end{align*}$
Therefore, if we could obtain many realizations of $\hat{h}_{D}$, averaging them (with suitable weights) would reduce the variance term. Ensemble methods thus rest on the assumption that
> [!assumption|*] Ensembling Weak Learners Reduces Variance
> If $\hat{h}_{D}$ is a **weak learner**, i.e. one with high variance (and ideally with low bias and computational costs), then averaging its realizations $\hat{h}_{d_{1}},\dots ,\hat{h}_{d_{M}}$ with suitable weights produces an ensemble with lower variance.
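A Monte Carlo sketch of the idealized setting (truly independent dataset realizations; the data-generating process $y = x^2 + \text{noise}$ and the deliberately high-variance 1-nearest-neighbour learner are illustrative assumptions):

```python
import random

rng = random.Random(0)

def draw_dataset(n=10):
    # one fresh realization of D: y = x^2 + Gaussian noise (illustrative choice)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, x * x + rng.gauss(0, 0.5)) for x in xs]

def fit_1nn(d):
    # high-variance weak learner: 1-nearest-neighbour regression
    return lambda x: min(d, key=lambda p: abs(p[0] - x))[1]

def variance_of(predict, trials=1000):
    # Monte Carlo estimate of Var_D[prediction] at a fixed query point
    preds = [predict() for _ in range(trials)]
    mean = sum(preds) / trials
    return sum((p - mean) ** 2 for p in preds) / trials

x0 = 0.3  # query point at which we measure prediction variance
v_single = variance_of(lambda: fit_1nn(draw_dataset())(x0))
v_avg10 = variance_of(
    lambda: sum(fit_1nn(draw_dataset())(x0) for _ in range(10)) / 10
)
print(v_single, v_avg10)  # averaging 10 independent fits cuts variance ~10x
```

Since the ten fits are genuinely independent here, the variance drops by a factor of about $M = 10$; the practical difficulty, addressed next, is that we never actually have $M$ independent datasets.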
## Issue: Obtaining New Realizations
One issue is that we typically cannot obtain new, independent datasets: if we could, we might as well fit $\hat{h}$ on the concatenated set $d_{1} \cup \cdots \cup d_{M}$ to reduce variance directly.
There are several workarounds, each starting from a single realization $d$.
First, we can weaken the requirement that $\hat{h}_{d_{1}},\dots,\hat{h}_{d_{M}}$ be actual independent realizations of $\hat{h}_{D}$.
- [[Bootstraps]] offer a way of approximating the data distribution $F_{D}$ (from which we would like to sample new datasets) by $\hat{F}_{D}=\mathrm{Unif}(d)$, i.e. redrawing samples uniformly with replacement from $d$.
- This leads to the [[Bootstrap Ensemble Methods]] like bagging and random forests.
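A minimal bagging sketch along these lines (the weak learner `fit_mean`, which ignores $x$ and predicts the mean response of its sample, is a placeholder; any fitting procedure could be plugged in):

```python
import random

def bootstrap(d, rng):
    # approximate a fresh draw from F_D by Unif(d): resample with replacement
    return [rng.choice(d) for _ in d]

def bag(d, fit, M=25, seed=0):
    # fit the same procedure on M bootstrap resamples, then average (w_m = 1/M)
    rng = random.Random(seed)
    rules = [fit(bootstrap(d, rng)) for _ in range(M)]
    return lambda x: sum(h(x) for h in rules) / M

# placeholder weak learner for illustration
def fit_mean(d):
    m = sum(y for _, y in d) / len(d)
    return lambda x: m

d = [(float(x), 2.0 * x) for x in range(5)]
f_bagged = bag(d, fit_mean)
print(f_bagged(0.0))  # close to the overall mean response, 4.0
```

Random forests follow the same template, additionally decorrelating the trees by randomizing the features considered at each split.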
Alternatively, we can drop the requirement entirely and instead fit many modified copies of $\hat{h}_{d}$, where each fit is adjusted (in either the procedure $\hat{h}$ or the data $d$) so that each learner corrects the previous learners' mistakes, rather than expecting independent mistakes to cancel out.
- This leads to [[Boosting]] methods.
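One concrete instance is $L_{2}$ boosting (gradient boosting with squared loss): each new learner is fit to the *residuals* of the current ensemble, so it concentrates on what the earlier learners got wrong. A sketch with depth-1 regression trees ("stumps") as weak learners; all names and parameters below are illustrative:

```python
def fit_stump(d):
    # weak learner: best single threshold split on x, predicting the
    # mean response on each side
    best = None
    for t, _ in d:
        left = [y for x, y in d if x <= t]
        right = [y for x, y in d if x > t]
        if not right:
            continue  # degenerate split (t is the largest x)
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in d)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def l2_boost(d, M=20, lr=0.5):
    # each stump is fit to the residuals left by the current ensemble
    xs = [x for x, _ in d]
    res = [y for _, y in d]  # residuals start at the raw targets
    rules = []
    for _ in range(M):
        h = fit_stump(list(zip(xs, res)))
        rules.append(h)
        res = [r - lr * h(x) for x, r in zip(xs, res)]
    return lambda x: lr * sum(h(x) for h in rules)

d = [(x / 10, (x / 10) ** 2) for x in range(11)]
f_boost = l2_boost(d)  # piecewise-constant approximation of x^2 on [0, 1]
```

Note the contrast with bagging: here the fits are built sequentially and are deliberately dependent, with the learning rate `lr` shrinking each correction.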