Suppose we have a [[Bootstraps|bootstrap]] procedure that generates bootstrapped datasets $d_{b},~b=1,\dots,B$, and we have predictors $\hat{f}_{b}:=\hat{f}^{(d_{b})}$ (usually **weak learners**) trained on each dataset. [[Ensemble Methods]] combine predictors trained on each dataset to harness their collective wisdom.
> [!definition|*] Bagging
> Given the bootstrapped models $\hat{f}_{b}$, combining them yields the **bagged (Bootstrap AGGregated) model**: $\hat{f}^\mathrm{bag}(x):=\begin{cases}
> \frac{1}{B}\sum_{b}\hat{f}_{b}(x) &\text{for regression}, \\
> \underset{k}{\arg\max}~|\{ b :\hat{f}_{b}(x)=k \}| &\text{for classification}.
> \end{cases}$For classification, $k$ ranges over the possible categories, and this operation is usually called **majority voting**.
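As a concrete illustration, here is a minimal sketch of both aggregation rules, assuming scikit-learn-style weak learners with `fit`/`predict` (the helper names and the integer-label assumption are mine, not a standard API):
```python
import numpy as np
from sklearn.base import clone

def fit_bagged(base_estimator, X, y, B=100, seed=None):
    """Fit B clones of base_estimator, each on a bootstrap resample d_b of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                # sample n indices with replacement
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
    return models

def predict_bagged_regression(models, X):
    """Regression: average the B individual predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)

def predict_bagged_classification(models, X):
    """Classification: majority vote (assumes integer class labels 0, ..., K-1)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # shape (B, n_points)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```
Any estimator exposing `fit`/`predict` can serve as the weak learner here, e.g. a shallow decision tree.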
For a bagged model, we can easily estimate its [[Model Error Metrics#Generalizability of Models|generalizability]] using **out-of-bag samples**:
> [!definition|*] Out-of-Bag Samples and Error
> For the $b$th bagged predictor $\hat{f}^{(b)}$ trained on the bootstrapped dataset $d^{(b)}$, its **out-of-bag (OOB) samples** are $O^{(b)}:=\{ i ~|~ (\mathbf{x}_{i},y_{i}) \not \in d^{(b)}\}.$
>
> The bagged model's **out-of-bag prediction** (at $(\mathbf{x}_{i}, y_{i})$) is the averaged prediction from the bagged predictors that have not seen $(\mathbf{x}_{i},y_{i})$: $\hat{y}_{i}^{\mathrm{OOB}}:= \mathrm{avg}\{ \hat{f}^{(b)}(\mathbf{x}_{i}) ~|~ b: i \in O^{(b)} \}.$
> The model's overall **OOB error estimate** is $\mathrm{err}_{\mathrm{OOB}}:= \mathrm{avg}\{ L(y_{i}, \hat{y}_{i}^{\mathrm{OOB}}) ~|~ i=1,\dots,n \}.$
>
- Since the model $\hat{f}^{(b)}$ has never seen the points indexed by $O^{(b)}$, $\hat{y}_{i}^{\mathrm{OOB}}$ provides an estimate of the bagged model's performance on unseen data.
- However, since for any $i$, only $\sim 1 / e$ of the bagged predictors have $i$ as OOB, *the OOB error effectively evaluates ensembles only about a third the size of the full $\hat{f}^\mathrm{bag}$* (the sketch below makes this sub-ensembling explicit).
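A sketch of the OOB error for regression under squared loss, continuing the hypothetical `fit_bagged`-style setup above (only the bootstrap indices need to be remembered alongside each fitted model):
```python
import numpy as np
from sklearn.base import clone

def oob_error(base_estimator, X, y, B=200, seed=None):
    """OOB estimate of the squared-error risk of the bagged regressor."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models, in_bag = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                      # bootstrap sample d^(b)
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
        in_bag.append(np.bincount(idx, minlength=n) > 0)      # which i appear in d^(b)
    losses = []
    for i in range(n):
        # predictors for which i is out-of-bag, i.e. i in O^(b)
        preds = [m.predict(X[i:i + 1])[0]
                 for m, used in zip(models, in_bag) if not used[i]]
        if preds:                                             # i is OOB for ~B/e predictors
            losses.append((y[i] - np.mean(preds)) ** 2)       # L(y_i, yhat_i^OOB)
    return float(np.mean(losses))
```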
### Random Forests
> [!definition|*] Random Forest
> If the weak learners are [[Decision Trees|decision trees]], we get a **random forest** model. For each split, however, the trees are only allowed to use $p_{\max}$ randomly selected features (instead of all $p$ of them).
- Usually $p_{\max}=\sqrt{ p }$, but random forests' performance usually does not degrade even with small $p_{\max}$ (see the sketch after this list).
- Doing so *reduces the similarity between trees by preventing the strongest predictors from always being selected first*; this de-correlates the trees and increases the benefit from averaging.
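In scikit-learn this is the `max_features` argument of `RandomForestClassifier`; a minimal usage sketch (dataset and hyperparameters chosen arbitrarily), which also reports the OOB estimate from the previous section for free:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# max_features="sqrt": each split considers only ~sqrt(p) randomly chosen features,
# which de-correlates the trees; oob_score=True computes the OOB accuracy during fit.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)    # OOB estimate of the generalization accuracy
```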
### Averaging Predictors Reduces Variance
Bootstrapping can be seen as sampling from an approximation $\hat{\mathcal{D}}$ of the *space of datasets* $\mathcal{D}$, and we can perform [[Decompositions of L2 Risks#Excess $l2$ Risk as Bias-Variance|bias-variance decomposition of the excess risk in $\mathcal{D}$]].
- Let $h^{\ast}$ denote the (unknown) Bayes rule, and $\hat{h}^{(d)}$ be the estimated rule from a realized dataset $D=d$. Recall that the excess risk is $\text{excess risk}(d):= R(\hat{h}^{(d)})-R(h^{\ast}).$
The decomposition gives $\mathbb{E}_{D}[\text{excess risk}(D)]=\mathbb{E}_{X}\Big[b(X)^2+ v(X) \Big],$where $b,v$ are the bias and variance of $\hat{h}^{(D)}$ over $\mathcal{D}$, with $\bar{h}(X):=\mathbb{E}_{D}[\hat{h}^{(D)}(X)]$ denoting the dataset-averaged prediction: $\begin{align*}
b(X)&:=h^{\ast}(X)-\bar{h}(X),\\
v(X)&:=\mathbb{E}_{D}[(\hat{h}^{(D)}(X)-\bar{h}(X))^{2}].
\end{align*}$
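As a sanity check, $b$ and $v$ can be estimated by Monte Carlo when the data-generating process is known; a sketch on a synthetic squared-loss regression problem (the specific $h^{\ast}$, noise level, and tree depth are arbitrary choices):
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
h_star = np.sin                          # Bayes rule for squared loss: E[Y | X = x]
x_grid = np.linspace(0, 6, 200)          # fixed query points standing in for X

def draw_dataset(n=50):
    """One realization d of D, drawn from the data-generating process."""
    x = rng.uniform(0, 6, n)
    return x.reshape(-1, 1), h_star(x) + rng.normal(0, 0.3, n)

# hat{h}^{(d)}(x) for many simulated datasets d
preds = np.array([
    DecisionTreeRegressor(max_depth=3).fit(*draw_dataset())
                                      .predict(x_grid.reshape(-1, 1))
    for _ in range(500)
])                                                # shape (500, 200)

h_bar  = preds.mean(axis=0)                       # \bar{h}(x)
bias2  = (h_star(x_grid) - h_bar) ** 2            # b(x)^2
var    = preds.var(axis=0)                        # v(x)
excess = ((preds - h_star(x_grid)) ** 2).mean(axis=0)   # average excess risk at x
print(np.allclose(excess, bias2 + var))           # True: the decomposition holds pointwise
```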
If the averaging were done over independent datasets $d_{b}$ drawn from $\mathcal{D}$ (rather than over bootstrap replicates from $\hat{\mathcal{D}}$), then *averaging $\hat{h}(X):= \frac{1}{B}\sum_{b}\hat{h}^{(d_{b})}(X)$ would reduce $v$ while leaving $b$ unchanged*.
By replacing the unknown $\mathcal{D}$ with the attainable $\hat{\mathcal{D}}$, we hope that the same approximately holds.
- *Using $\hat{\mathcal{D}}$ causes a small increase in bias when averaging, so the technique works best with low-bias, high-variance methods like trees* (but not low-dimensional OLS, say).
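A sketch of this last point, reusing the synthetic setup above: each replication draws a fresh dataset from the true process, and the bagged model bootstraps within it (i.e. works in $\hat{\mathcal{D}}$), yet the variance over datasets still drops while the bias stays roughly put (exact numbers depend on the arbitrary settings):
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(1)
h_star = np.sin
x_grid = np.linspace(0, 6, 200).reshape(-1, 1)

def draw_dataset(n=50):
    x = rng.uniform(0, 6, n)
    return x.reshape(-1, 1), h_star(x) + rng.normal(0, 0.3, n)

def avg_bias2_and_var(make_model, n_sims=300):
    """Average of b(x)^2 and v(x) over the grid, estimated over n_sims datasets."""
    preds = np.array([make_model().fit(*draw_dataset()).predict(x_grid)
                      for _ in range(n_sims)])
    bias2 = ((preds.mean(0) - h_star(x_grid.ravel())) ** 2).mean()
    return bias2, preds.var(0).mean()

print(avg_bias2_and_var(lambda: DecisionTreeRegressor(max_depth=6)))
print(avg_bias2_and_var(lambda: BaggingRegressor(                    # bagging within each d
    DecisionTreeRegressor(max_depth=6), n_estimators=50)))
# expected: similar (slightly larger) bias^2, substantially smaller variance
```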