Suppose we have a [[Bootstraps|bootstrap]] procedure that generates bootstrapped datasets $d_{b},~b=1,\dots,B$, and we have predictors $\hat{f}_{b}:=\hat{f}^{(d_{b})}$ (usually **weak learners**) trained on each dataset. [[Ensemble Methods]] combine the predictors trained on these datasets to harness their collective wisdom.

> [!definition|*] Bagging
> Given the bootstrapped models $\hat{f}_{b}$, combining them yields the **bagged (Bootstrap AGGregated) model**: $\hat{f}^\mathrm{bag}(x):=\begin{cases} \frac{1}{B}\sum_{b}\hat{f}_{b}(x) &\text{for regression}, \\ \underset{k}{\arg\max}~|\{ b :\hat{f}_{b}(x)=k \}| &\text{for classification}. \end{cases}$
> For classification, $k$ ranges over the possible categories, and the combining operation is usually called **majority voting**.

For bagged models, we can easily estimate their [[Model Error Metrics#Generalizability of Models|generalizability]] using **out-of-bag samples**:

> [!definition|*] Out-of-Bag Samples and Error
> For the $b$th bagged predictor $\hat{f}_{b}$ trained on the bootstrapped dataset $d_{b}$, its **out-of-bag (OOB) samples** are $O_{b}:=\{ i ~|~ (\mathbf{x}_{i},y_{i}) \not\in d_{b}\}.$
>
> The bagged model's **out-of-bag prediction** at $(\mathbf{x}_{i}, y_{i})$ is the averaged prediction from the bagged predictors that haven't seen $(\mathbf{x}_{i},y_{i})$: $\hat{y}_{i}^{\mathrm{OOB}}:= \mathrm{avg}\{ \hat{f}_{b}(\mathbf{x}_{i}) ~|~ b: i \in O_{b} \}.$
> The model's overall **OOB error estimate** is $\mathrm{err}_{\mathrm{OOB}}:= \mathrm{avg}\{ L(y_{i}, \hat{y}_{i}^{\mathrm{OOB}}) ~|~ i=1,\dots,n \}.$

- Since the predictors contributing to $\hat{y}_{i}^{\mathrm{OOB}}$ have not seen $(\mathbf{x}_{i},y_{i})$, the OOB error provides an estimate of the model's performance on unseen data.
- However, since for any $i$ only $\sim 1/e$ of the bagged predictors have $i$ as OOB, *the OOB error uses ensembles far smaller than the original $\hat{f}^\mathrm{bag}$*.

### Random Forests

> [!definition|*] Random Forest
> If the weak learners are [[Decision Trees|decision trees]], we get a **random forest** model. For each split, however, the trees are only allowed to use $p_{\max}$ randomly selected features (instead of all $p$ of them).

- Usually $p_{\max}=\sqrt{ p }$, but the performance of random forests usually doesn't degrade even with small $p_{\max}$.
- Doing so *reduces the similarity between trees by preventing the strongest predictors from always being selected first*; this de-correlates the trees and increases the benefit from averaging.

### Averaging Predictors Reduces Variance

Bootstrapping can be seen as sampling from an approximation $\hat{\mathcal{D}}$ of the *distribution of datasets* $\mathcal{D}$, and we can perform [[Decompositions of L2 Risks#Excess $l2$ Risk as Bias-Variance|bias-variance decomposition of the excess risk in $\mathcal{D}$]].

- Let $h^{\ast}$ denote the (unknown) Bayes rule, and $\hat{h}^{(d)}$ be the estimated rule from a realized dataset $D=d$. Recall that the excess risk is $\text{excess risk}(d):= R(\hat{h}^{(d)})-R(h^{\ast}).$

The decomposition gives $\mathbb{E}_{D}[\text{excess risk}(D)]=\mathbb{E}_{X}\Big[b(X)^2+ v(X) \Big],$ where $b,v$ are the bias and variance of $\hat{h}^{(D)}$ over $\mathcal{D}$, i.e. $\begin{align*} b(X)&:=h^{\ast}(X)-\bar{h}(X),\\ v(X)&:=\mathbb{E}_{D}[(\hat{h}^{(D)}(X)-\bar{h}(X))^{2}], \end{align*}$ with $\bar{h}(X):=\mathbb{E}_{D}[\hat{h}^{(D)}(X)]$ the average prediction over datasets.

If the bagging happened not in $\hat{\mathcal{D}}$ but in $\mathcal{D}$ itself (i.e. if the $d_{b}$ were independent datasets drawn from $\mathcal{D}$), then *averaging $\hat{h}(X):= \frac{1}{B}\sum_{b}\hat{h}^{(d_{b})}(X)$ reduces $v$ while leaving $b$ unchanged*.
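To see why, fix $x$ and suppose the $\hat{h}^{(d_{b})}(x)$ are identically distributed with variance $\sigma^{2}$ and pairwise correlation $\rho$ (symbols introduced here only for this illustration). A standard calculation gives $\mathrm{Var}\Big(\tfrac{1}{B}\sum_{b}\hat{h}^{(d_{b})}(x)\Big)=\frac{1}{B^{2}}\Big(B\sigma^{2}+B(B-1)\rho\sigma^{2}\Big)=\rho\sigma^{2}+\frac{1-\rho}{B}\sigma^{2}.$ With independent datasets ($\rho=0$) the variance shrinks to $\sigma^{2}/B$, while the mean, and hence the bias, is untouched; with bootstrapped datasets $\rho>0$, which is exactly why the feature subsampling in random forests (which lowers $\rho$) pays off.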
By replacing the unknown $\mathcal{D}$ with the attainable $\hat{\mathcal{D}}$, we hope that the same holds approximately.
- *Using $\hat{\mathcal{D}}$ causes a small increase in bias when averaging, so the technique works best with low-bias, high-variance methods like trees* (but not low-dimensional OLS, say).
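Below is a minimal sketch of bagging with an OOB error estimate, assuming a regression setting with squared-error loss, numpy arrays `X, y`, and scikit-learn trees; the function name `bagged_trees_oob` and its defaults are illustrative, and passing `max_features="sqrt"` mimics the random-forest restriction to $p_{\max}=\sqrt{ p }$ features per split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_oob(X, y, B=200, max_features=None, seed=0):
    """Fit B trees on bootstrap samples; return the bagged predictor and its OOB error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, oob_masks = [], []

    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample d_b
        tree = DecisionTreeRegressor(
            max_features=max_features,                   # "sqrt" -> random-forest-style splits
            random_state=int(rng.integers(2**31)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                                 # O_b: indices not drawn into d_b
        oob_masks.append(oob)

    # OOB prediction: for each i, average only the trees whose bootstrap
    # sample excluded i (~B/e trees on average).
    pred_sum, pred_cnt = np.zeros(n), np.zeros(n)
    for tree, oob in zip(trees, oob_masks):
        pred_sum[oob] += tree.predict(X[oob])
        pred_cnt[oob] += 1

    seen = pred_cnt > 0                                  # skip the rare i that is never OOB
    y_oob = pred_sum[seen] / pred_cnt[seen]
    err_oob = np.mean((y[seen] - y_oob) ** 2)            # squared-error loss

    def f_bag(X_new):
        """Bagged prediction: average over all B trees."""
        return np.mean([t.predict(X_new) for t in trees], axis=0)

    return f_bag, err_oob
```

scikit-learn's `RandomForestRegressor(oob_score=True)` packages the same ideas and exposes the resulting estimate as `oob_score_`.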