> [!info] This material corresponds to chapter 10 in the ESL.

*Boosting is an [[Ensemble Methods|ensemble method]] that combines many "weak learners" into a committee.* Instead of having each learner work on the same data, boosting trains the learners on iteratively modified and reweighted data.
- For example, forward stagewise boosting (with $l_{2}$ loss) trains them on the residuals of the previous learners.
- Gradient boosting trains them on the gradients of the loss.

Boosting is essentially an additive model on a set of generic **basis functions** $\{ b(\cdot\,;\gamma)\,|\, \gamma \}$ indexed by parameters $\gamma$: $G(\{ \beta_{m}, \gamma_{m} \}_{m=1}^{M}):= \sum_{m=1}^{M}\beta_{m}b(x;\,\gamma_{m}),$ where $\beta=(\beta_{1},\dots, \beta_{M})$ are the weights of each learner.
- For general basis expansions, $\gamma_{m}$ can be the degree of polynomial terms $x^{\gamma_{m}}$.
- If the learners are trees, $\gamma_{m}$ can encode their nodes and the trees' predictions for each node.

## AdaBoost

AdaBoost is one of the earliest boosting algorithms. It produces a sequence of weak learners $(g_{t})$, where *successive learners focus on the data points that confused their predecessors*.

> [!algorithm] AdaBoost
> $[0]$ Initialize with uniform weights $w^{(1)}$ given by $w_{1}^{(1)}=\cdots=w_{N}^{(1)}=\frac{1}{N}$.
>
> For $[t=1,\dots,T]$:
> - Fit the $t$-th learner $g_{t}$ on the data weighted by $w^{(t)}$ by minimizing the weighted error rate $\mathrm{err}_{t}:=\sum_{i=1}^{N}w_{i}^{(t)}{\mathbb{1}(y_{i} \ne g_{t}(x_{i}))}.$
> - The weight that learner $g_{t}$ receives in the committee is $\alpha_{t}:=\log\left( \frac{1-\mathrm{err}_{t}}{\mathrm{err}_{t}} \right).$
> - Reweight the data with $w^{(t+1)}_{i} \propto w^{(t)}_{i} \cdot \exp[\alpha_{t}\cdot\mathbb{1}(y_{i} \ne g_{t}(x_{i}))],$ i.e. the points that $g_{t}$ got wrong are assigned larger weights, and the larger the weight $g_{t}$ has in the committee, the stronger this effect. The weights are normalized by some constant $Z_{t}$ to sum to $1$.
>
> $[\text{END}]$ Return the ensemble $\hat{f}:= \mathrm{sign}\left( \sum_{t=1}^{T}\alpha_{t}g_{t} \right),$ where the classification is done via weighted majority voting.

Alternatively, for binary data $y \in \{ \pm 1 \}$, the learner weights $\beta_{t}=\alpha_{t} / 2$ and data weights $v^{(t+1)}_{i} \propto v^{(t)}_{i} \exp[-\beta_{t}\cdot y_{i}\cdot g_{t}(x_{i})]$ give the same result. In this formulation, the recursive definition of the data weights unrolls to $\begin{align*} v^{(s+1)}_{i}&\propto \exp\left[ -y_{i}\sum_{t=1}^{s}\beta_{t}g_{t}(x_{i}) \right]\\ &= \exp[-y_{i}f_{s}(x_{i})], \end{align*}$ where $f_{s}:=\sum_{t=1}^{s}\beta_{t}g_{t}$. The equivalence rests solely on the fact that $\mathbf{1}(y_{i}\ne g(x_{i})) =\frac{1-y_{i}\cdot g(x_{i})}{2}.$

> [!exposition] Closed form of weights
> If $Z_{t}$ is the normalizing constant of $v^{(t+1)}$, i.e. $Z_{t}= \sum_{i}v^{(t)}_{i}\exp(-\beta_{t}\, y_{i}\, g_{t}(\mathbf{x}_{i})),$ unrolling the recursion shows that $v^{(t+1)}_{i}=\frac{v^{(1)}_{i}}{\prod_{s=1}^{t}Z_{s}} \exp(-y_{i}\,f_{t}(\mathbf{x}_{i}))=\frac{1}{N\prod_{s=1}^{t}Z_{s}} \exp(-y_{i}\,f_{t}(\mathbf{x}_{i})),$ as the initialized weights are just $1 / N$.
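As a concrete illustration of the algorithm (not part of the ESL presentation), here is a minimal Python sketch using decision stumps from scikit-learn as the weak learners; the function name `adaboost` and the choice of stumps are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """Minimal AdaBoost sketch; assumes y takes values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1 / N)                     # uniform initial weights w^(1)
    learners, alphas = [], []
    for _ in range(T):
        # Fit the t-th weak learner (a stump) on the weighted data.
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = g.predict(X) != y
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)  # weighted err_t
        alpha = np.log((1 - err) / err)       # committee weight alpha_t
        w = w * np.exp(alpha * miss)          # upweight misclassified points
        w = w / w.sum()                       # normalize by Z_t
        learners.append(g)
        alphas.append(alpha)

    def predict(X_new):
        # Weighted majority vote: sign of the alpha-weighted sum.
        votes = sum(a * g.predict(X_new) for a, g in zip(alphas, learners))
        return np.sign(votes)

    return predict
```

The clipping of $\mathrm{err}_{t}$ is a practical guard against a perfect (or useless) weak learner, where $\alpha_{t}$ would diverge.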
## Forward Stagewise Additive Modeling

The globally optimal boosting model is determined by $\min_{(\beta_{t}, \gamma_{t})_{t=1}^{T}}\sum_{i=1}^{N}L(y_{i}, G(x_{i}) ),$ where $L$ is some loss function (e.g. squared error), and $G=G((\beta_{t}, \gamma_{t})_{t=1}^{T})$ is the boosting model. However, this *global optimum is computationally difficult to find for most losses and bases*.

In contrast, the one-basis problem is usually manageable: $\min_{\beta, \gamma}\sum_{i=1}^{N}L(y_{i},\beta \cdot b(x_{i}\,;\gamma) ).$ This motivates the greedy approach of **Forward Stagewise Additive Modeling**: at each step, add the single term $\beta_{m}\cdot b(\cdot\,;\gamma_{m})$ that reduces the loss the most.

> [!algorithm] Forward Stagewise Additive Modeling
> $[0]$ Initialize with $f_{0}\equiv 0$.
>
> For $[t=1,\dots,T]$:
> - Compute the best new term: $\beta_{t},\gamma_{t}:=\underset{\beta, \gamma}{\arg\min}\sum_{i=1}^{N}L\Big(y_{i},\,\,f_{t-1}(x_{i})+\beta \cdot b(x_{i}\,;\gamma) \Big).$
> - Add it to the model: $f_{t}\leftarrow f_{t-1}+\beta_{t}\cdot b(\cdot\,; \gamma_{t}).$

Although forward stagewise additive modeling is defined for a generic loss $L$, *it simplifies for a number of special loss functions*:
- For $l_{2}$ loss, each step is essentially fitting the new term to the residuals $r_{t-1,i}:=y_{i}-f_{t-1}(x_{i})$, since $\begin{align*} L\Big(y_{i},\,\,f_{t-1}(x_{i})&+ \text{new term} \Big) \\ &= \| y _{i} - f_{t-1}(x_{i}) - \text{new term}\, \|_{2} \\[0.4em] &= L\Big(y_{i}-f_{t-1}(x_{i}),\,\,\text{new term} \Big)\\ &= L(r_{t-1,i}, ~\text{new term}), \end{align*}$ so the two minimizations are equivalent (see the sketch after this list).
- More [[Loss Functions and Robustness|robust loss functions]], however, do not lead to this kind of simplification.
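A minimal Python sketch of the $l_{2}$ case, assuming regression stumps from scikit-learn as the basis functions; the name `forward_stagewise_l2` is illustrative, and the coefficient $\beta_{t}$ is absorbed into the stump's leaf values rather than searched separately.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_l2(X, y, T=100):
    """Forward stagewise additive modeling with squared-error loss:
    each new basis function is simply fit to the current residuals."""
    f = np.zeros(len(y))                     # f_0 = 0
    terms = []
    for _ in range(T):
        r = y - f                            # residuals r_{t-1}
        # Least-squares fit of a stump to the residuals solves the
        # one-basis problem; beta_t is absorbed into the leaf values.
        b = DecisionTreeRegressor(max_depth=1).fit(X, r)
        f = f + b.predict(X)                 # f_t = f_{t-1} + new term
        terms.append(b)

    def predict(X_new):
        return sum(b.predict(X_new) for b in terms)

    return predict
```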
### AdaBoost as a Forward Stagewise Algorithm

*AdaBoost is equivalent to forward stagewise fitting with the* **exponential loss** $L(y, f(x))=\exp(-y \cdot f(x)),$ assuming the response $y$ takes values $\pm{1}$.

> [!theorem|*] AdaBoost is Forward Stagewise with Exponential Loss
> With $L$ being the exponential loss, the *unweighted* stagewise minimization problem $\beta_{t},g_{t}=\underset{\beta,g}{\arg\min}\,\sum_{i} L(y_{i}, f_{t-1}(x_{i})+\beta \cdot g(x_{i}))$ gives the same minimizer as the one prescribed by the AdaBoost algorithm: $\begin{align*} g_{t}&= \underset{g}{\arg\min}\,\,\sum_{i}v^{(t)}_{i} \mathbf{1}(y_{i} \ne g(x_{i})), \\ \beta_{t}&= \frac{\alpha_{t}}{2}=\frac{1}{2}\log\left( \frac{1-\mathrm{err}_{t}}{\mathrm{err}_{t}} \right). \end{align*}$
>
> > [!proof]
> > To begin, the total loss at the $t$-th step is $\begin{align*} \sum_{i}L(&y_{i},\, f_{t-1}(x_{i})+ \text{new term})\\ &= \sum_{i}\exp[-y_{i}(f_{t-1}(x_{i})+\text{new term})]\\ &= \sum_{i}\underbrace{e^{-y_{i}f_{t-1}(x_{i})}}_{= \,v^{(t)}_{i}}\cdot L(y_{i}, \text{ new term}), \end{align*}$ which is fitting the same response $y_{i}$ reweighted by the past losses $v^{(t)}_{i}=\exp[-y_{i}f_{t-1}(x_{i})]=L(y_{i}, f_{t-1}(x_{i}))$ (up to normalization), exactly as in AdaBoost.
> >
> > Now we can optimize over $g$ and $\beta$ separately because $\min_{g,\beta}J(g,\beta)=\min_{g}[\min_{\beta}J(g,\beta)]$ for any function $J$ with a minimum.
> >
> > > [!proof] Proof of equivalence in $g$
> > > Note that $L(y_{i},\,\beta \cdot g(x_{i})) = \begin{cases} e^{\beta}& \text{if $y_{i}\ne g(x_{i})$,}\\ e^{-\beta} & \text{if $y_{i}= g(x_{i})$.} \end{cases}$
> > > Therefore, writing the weighted exponential loss as a vector product, $\begin{align*} v^{(t)}\cdot L(\mathbf{y}, \beta \cdot g(\mathbf{x})) &= \underbrace{e^{-\beta} }_{\mathrm{const.}} + (e^{\beta}- e^{-\beta})\underbrace{v^{(t)}\cdot \mathbf{1}(\mathbf{y}\ne g(\mathbf{x}))}_{\mathrm{err}_{t}}, \end{align*}$ and since $e^{\beta}-e^{-\beta}>0$ for $\beta>0$, minimizing either expression gives the same $g$.
> >
> > > [!proof] Proof of equivalence in $\beta$
> > > Minimize the above expression in $\beta$ by setting the derivative to $0$: $-e^{-\beta}+(e^{\beta}+e^{-\beta})\mathrm{err_{t}} =0,$ and solving gives $e^{2\beta_{t}}=\frac{1-\mathrm{err}_{t}}{\mathrm{err}_{t}}.$ This is exactly the $\beta_{t}=\alpha_{t} / 2$ prescribed by AdaBoost. In particular, *the learner weights are the log-odds of the weighted success rate $1-\mathrm{err}_{t}$*.

> [!exposition] Bounding the training error of AdaBoost
> The training classification error of the AdaBoost model $\hat{f}=\mathrm{sign}(f_{T})$ (or of any $\pm 1$-valued classifier) is $\overline{\mathrm{err}}:=\frac{1}{N}\sum_{i}\mathbf{1}(\hat{f}(\mathbf{x}_{i})\ne y_{i}),$ assuming $y_{i} \in \{ \pm 1 \}$. A misclassification means $y_{i}f_{T}(\mathbf{x}_{i})\le 0$, so each indicator is bounded by the exponential loss, $\mathbf{1}(\hat{f}(\mathbf{x}_{i})\ne y_{i}) \le \exp(-y_{i}f_{T}(\mathbf{x}_{i})),$ producing the inequality $\overline{\mathrm{err}} \le \frac{1}{N} \sum_{i}\exp(-y_{i}f_{T}(\mathbf{x}_{i}))=\hat{R}(f_{T})\dots$ where $\hat{R}$ is the [[Decision Theory#^2aa114|empirical risk]]. Now substitute each term with the closed form of the observation weights $v^{(T+1)}$: $\begin{align*}\dots&= \frac{1}{N}\sum_{i}{ \left( N\prod_{t=1}^{T}Z_{t} \right)~ v_{i}^{(T+1)}}\\ &= \prod_{t=1}^{T}Z_{t},\end{align*}$ since the weights sum to $1$.

> [!connection]
> On a global scale, the expected loss ([[Decision Theory#^913a85|frequentist risk]]) $R(f)= \mathbb{E}_{X,Y}[L(Y, f(X))]= \mathbb{E}_{X,Y}[e^{-Yf(X)}]$ is minimized by (half) the **log-odds** $f(x)= \frac{1}{2}\log\left( \frac{\mathbb{P}[Y=1 \,|\, X=x]}{\mathbb{P}[Y = -1 \,|\, X=x]} \right).$ Therefore, AdaBoost is essentially [[Generalized Linear Models#Choice of Link Functions|logistic regression]], but modeling the log-odds with boosting instead of a linear model.

> [!proof]- Derivation of the optimal classifier
> We can optimize the expectation in $R(f)$ point-wise in $X$, i.e. $f^{\ast}(x):= \underset{\hat{y}}{\arg\min}~ \mathbb{E}_{Y ~|~ X}[L(Y, \hat{y}) ~|~ X=x].$ Since $Y$ is binary $\pm 1$, the loss reduces to $\mathbf{1}_{Y=1}e^{-\hat{y}}+\mathbf{1}_{Y=-1}e^{\hat{y}}$. Taking the expectation gives $\mathbb{E}_{Y~|~ X}[L(Y, \hat{y}) ~|~ X=x]=p_{1}(x)e^{-\hat{y}}+p_{-1}(x)e^{\hat{y}},$ where $p_{\pm 1}(x):= \mathbb{P}[Y=\pm 1 ~|~ X=x]$. Now optimize by taking the derivative wrt. $\hat{y}$: $-p_{1}e^{-\hat{y}}+p_{-1}e^{\hat{y} }=0~ ~\Longrightarrow~ ~\hat{y}=\frac{1}{2}\log\left( \frac{p_{1}}{p_{-1}} \right).$
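A quick numerical sanity check of this pointwise minimization (purely illustrative; the grid search stands in for the calculus above, and the value $p_{1}=0.8$ is arbitrary):

```python
import numpy as np

# Expected exponential loss at a fixed x, with p1 = P[Y = 1 | X = x].
p1, p_neg1 = 0.8, 0.2
y_hat = np.linspace(-3, 3, 100001)           # candidate predictions
risk = p1 * np.exp(-y_hat) + p_neg1 * np.exp(y_hat)

print(y_hat[np.argmin(risk)])                # numerical minimizer: ~0.6931
print(0.5 * np.log(p1 / p_neg1))             # half the log-odds:  ~0.6931
```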
## Gradient Boosting

Using the notation in the [[Decision Trees#Fitting the Tree|decision tree page]], a tree $T$ is defined by its parameters $\Theta :=\{ {R}_{j}, {\gamma}_{j} \}_{j=1}^{J}$, where $\{ R_{j} \}_{j=1}^{J}$ is a partition of the sample space $\mathcal{X}$, and $\{ \gamma_{j} \}_{j=1}^{J}$ are the predicted values in each region.

Each step of forward stagewise additive modeling finds the new tree $T_{m}=T(\cdot\,;\Theta_{m})$ by solving the optimization problem $\hat{\Theta}_{m}=\underset{\Theta_{m}}{\arg\min}\sum_{i=1}^{N}L\Big(y_{i},f_{m-1}(x_{i})+T(x_{i};\Theta_{m})\Big).$

However, the problem can also be seen as a numerical optimization problem in $\mathbb{R}^{N}$, $\hat{\mathbf{f}}=\underset{\mathbf{f}}{\arg\min}\,L(\mathbf{y}, \mathbf{f}), \qquad L(\mathbf{y},\mathbf{f}):=\sum_{i=1}^{N}L(y_{i},\mathbf{f}_{i}),$ but with the extra constraint that $\mathbf{\hat{f}}$ has to arise from a sum of trees.

Without the constraint, **steepest descent** gives a greedy approach to the optimization problem: $\begin{align*} \mathrm{gradient}=\mathbf{g}_{m}&= (g_{mi})_{i=1}^{N}\\[0.8em] \text{where } g_{mi}&= \left[ \frac{ \partial L(y_{i}, \mathbf{f}_{i}) }{ \partial \mathbf{f}_{i} } \right]_{\mathbf{f}_{i}=f_{m-1}(x_{i})}. \end{align*}$ The algorithm then moves along the direction $-\mathbf{g}_{m}$ by some step length $\epsilon_{m}$, giving the new model $\begin{align*} \mathbf{f}_{m}&= \mathbf{f}_{m-1} -\epsilon_{m}\mathbf{g}_{m}\\ &= \mathbf{f}_{0}+\sum_{s=1}^{m}\underbrace{(-\epsilon_{s}\mathbf{g}_{s})}_{=:\, \mathbf{h}_{s}}, \end{align*}$ where $\mathbf{h}_{s}$ can be interpreted as the update added at the $s$-th step.

Imitating the purely numerical approach, **gradient boosting** fits the update $-\mathbf{g}_{m}$ with a tree $T_{m}=T(\cdot\,;\Theta_{m})$ via least squares optimization: $\begin{align*} &\hat{\Theta}_{m}= \underset{\Theta}{\arg\min}\,\| \mathbf{g}_{m} + T(\mathbf{\mathbf{x}}\,;\Theta) \|_{2}\\[0.4em] &\{ \hat{R}_{m,j} \}_{j=1}^{J},\,\ \{ \hat{g}_{m,j} \}_{j=1}^{J} \leftarrow \hat{\Theta}_{m}, \end{align*}$ where $\mathbf{g}_{m}$ is the $N$-vector-valued gradient, and $T(\mathbf{x};\Theta)$ is the vector of predictions of $T(\cdot\,;\Theta)$ evaluated on the training dataset.
- Since adding the tree $T$ to the model should subtract $\mathbf{g}_{m}$, the sign inside the norm is $+$ instead of $-$.
- Note that $\{ \hat{g}_{m,j} \}_{j=1}^{J}$ here are fitted values of the negative gradient, and are irrelevant for the algorithm. Only the partition regions $\{ \hat{R}_{m,j} \}_{j=1}^{J}$ are of interest.

Then gradient boosting uses the partition $\{ \hat{R}_{m,j} \}_{j=1}^{J}$ to find the appropriate $\{ \hat{\gamma}_{m,j} \}_{j=1}^{J}$ in each region with the much more manageable optimization $\hat{\gamma}_{m,j}=\underset{\gamma}{\arg\min}\sum_{x_{i}\in R_{m,j}}L\left(y_{i},f_{m-1}(x_{i})+\gamma\right).$

In summary, the gradient boosting algorithm is

> [!algorithm] Gradient Tree Boosting
> $[\text{Input}]$ learning speed $\nu$, and if necessary a constant tree size $J_{m}=J$.
>
> $[0]$ Initialize with the constant fit $f_{0}:=\underset{c}{\arg\min} \sum_{i=1}^{N}L(y_{i}, c).$
>
> For $[m=1,\dots,M]$:
> - Compute the gradient $\mathbf{g}_{m}$, where $g_{mi}= \left[ \frac{ \partial L(y_{i}, t_{i}) }{ \partial t_{i} } \right]_{t_{i}=f_{m-1}(x_{i})}.$
> - Fit a regression tree with $J_{m}$ terminal nodes to the negative gradient $-\mathbf{g}_{m}$; record its partition $\{ R_{m,j} \}_{j=1}^{J_{m}}$.
> - For each region $R_{m,j}$, fit the best prediction: $\hat{\gamma}_{m,j}=\underset{\gamma}{\arg\min}\sum_{x_{i}\in R_{m,j}}L\left(y_{i},f_{m-1}(x_{i})+\gamma\right).$
> - Define the new tree $T_{m}(x):= \sum_{j=1}^{J_{m}}\hat{\gamma}_{m,j} \cdot \mathbb{1}_{R_{m,j}}(x),$ and add it to the model: $f_{m}\leftarrow f_{m-1} + \nu \cdot T_{m}.$
>
> Set the final estimator $\hat{f} \leftarrow f_{M}$ and return.
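Below is a minimal Python sketch of the algorithm for one concrete loss, the absolute loss $L(y,f)=|y-f|$, where the negative gradient is $\mathrm{sign}(y-f)$ and the per-region optimizer $\hat{\gamma}_{m,j}$ is the median of the residuals in the leaf. The name `gradient_boost_l1` and the use of scikit-learn's `DecisionTreeRegressor` for the partition step are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_l1(X, y, M=100, J=4, nu=0.1):
    """Gradient tree boosting sketch for absolute loss L(y, f) = |y - f|.
    The negative gradient is sign(y - f); the optimal leaf value is the
    median of the residuals in that leaf."""
    f0 = np.median(y)                        # constant fit minimizing L1 loss
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        neg_grad = np.sign(y - f)            # -g_m, the "pseudo-residuals"
        # Least-squares tree fit: only its partition {R_mj} is kept.
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, neg_grad)
        leaves = tree.apply(X)               # leaf index of each point
        # Per-region line search: gamma_mj = median residual in leaf j.
        gamma = {j: np.median((y - f)[leaves == j]) for j in np.unique(leaves)}
        f = f + nu * np.array([gamma[j] for j in leaves])
        trees.append((tree, gamma))

    def predict(X_new):
        pred = np.full(len(X_new), f0)
        for tree, gamma in trees:
            pred += nu * np.array([gamma[j] for j in tree.apply(X_new)])
        return pred

    return predict
```

Note how the tree's own leaf values (fitted negative gradients) are discarded: only `tree.apply`, i.e. the partition, is used, and the leaf predictions are recomputed under the actual loss.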
## Hyperparameters and Overfitting

The greedy and flexible nature of gradient boosting naturally makes overfitting an issue. It can be tackled by:
- Fitting smaller trees,
- Reducing the learning speed and/or the number of trees,
- De-correlating trees with stochastic gradient boosting.

### Size $J$ of Trees

In the gradient boosting algorithm above, each tree is allowed to have its own size $J_{m}$, while in practice the size is mostly held constant: $J_{m}=J\,\, \forall m \in \{1,\dots,M\}$.

The choice of $J$ determines the depth of interactions between variables: a tree with $J$ terminal nodes can capture interactions of order at most $J-1$. For example, $J=2$ (stumps) gives a purely additive model with no interactions, while $J=3$ allows interactions between predictor pairs $(X_{i}, X_{j})$, but not among three or more predictors. In practice, $J \le 10$ provides enough flexibility without easily overfitting.

### Number of Iterations $M$ and Learning Speed $\nu$

A large number of iterations allows more trees to be added to the model, with the potential of a better fit, and the risk of overfitting.

One approach to choosing $M$ is monitoring the validation error over $m=1,\dots,M$, and choosing the minimizer. This is similar to **early stopping** in fitting neural networks.

Another approach against overfitting is **shrinkage**, i.e. turning down the learning speed $\nu$.
- A smaller learning speed will cause gradient boosting to move along the gradient more slowly.
- It will also require a larger $M$ to reach the same level of detail in fitting the training set.

In practice, the best approach seems to be using a small $\nu<0.1$ and optimizing $M$ with early stopping.

### Stochastic Gradient Boosting

As seen in [[Bootstraps#Better Point Estimate with Bootstrap Bagging|bagging (bootstrap averaging)]] methods and random forests, reducing the correlation within the ensemble reduces the overall variance of the predictor.

**Stochastic gradient boosting** reduces correlation by fitting each new tree $T_{m}$ on a fraction $\eta$ of the whole training set, resampled at each iteration. The choice of $\eta$ depends on the sample size: $\eta=1 / 2$ is common, but larger data sets can afford a far smaller $\eta$.

## Interpretation of Gradient Boosting

Because gradient boosting is an ensemble method, the interpretability of the individual trees is lost.

### Relative Importance of Predictors

With the variable importances $I_{j}(T)$ defined for [[Decision Trees#Relative Importance of Predictors|a single decision tree]], we can compute the relative importance of a predictor $X_{j}$ by simply averaging its importance over the $M$ trees: $I_{j}=\frac{1}{M}\sum_{m=1}^{M} I_{j}(T_{m}) .$

To compare values, it is common to rescale them so that the largest importance is $100$ (or $100\%$).

### Partial Dependence Plots

The [[Partial Dependence]] of the gradient boosting model $\hat{f}$ on one or two predictors $\mathbf{X}_{\mathcal{S}}\subset \mathbf{X}$ can be estimated by averaging over the training-set values of $\mathbf{X}_{-\mathcal{S}}=\mathbf{X}\backslash\mathbf{X}_{\mathcal{S}}$: $\hat{f}_{\mathcal{S}}(\mathbf{X}_{\mathcal{S}})=\frac{1}{N}\sum_{i=1}^{N} \hat{f}(\mathbf{X}_{\mathcal{S}}, \mathbf{x}_{i, -\mathcal{S}}),$ where $\mathbf{x}_{i,-\mathcal{S}}$ is the $i$-th observed value of $\mathbf{X}_{-\mathcal{S}}$.
- This is an estimation of an estimation (estimated partial dependence of an estimated model), but two hats is one too many.

This estimated partial dependence function only has one or two inputs, hence it can be visualized in **partial dependence plots** (a sketch of the computation follows).
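A minimal sketch of this averaging for a single predictor, assuming a fitted `model` with a scikit-learn-style `predict` method and `X_train` stored as a NumPy array; the function name `partial_dependence_1d` is illustrative.

```python
import numpy as np

def partial_dependence_1d(model, X_train, j, grid):
    """Estimate the partial dependence of `model` on feature j by averaging
    predictions over the training values of the remaining features."""
    pd_values = []
    for v in grid:
        X_mod = X_train.copy()
        X_mod[:, j] = v                  # fix X_S = v, keep observed X_{-S}
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)

# Usage sketch: evaluate on a grid spanning the j-th feature's range.
# grid = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 50)
# pd = partial_dependence_1d(model, X_train, 0, grid)
```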