> [!summary]
> A **decision tree** $T$ with $J$ terminal nodes partitions the sample space $\mathcal{X}$ into $J$ regions $\{ R_{j} \}_{j=1}^{J}$. It can then assign a value $\gamma_{j}$ to each region $R_{j}$.
>
> When used for modeling, it is usually the building block of more complex models like [[Boosting|boosting]] and [[Bootstrap Ensemble Methods#Bagging and Random Forests|random forests]].

If $T$ is used to model the relationship between $x \in \mathcal{X}$ and $y$, it maps $x \mapsto \hat{y}:=T(x)$. It assigns the same value $\gamma_{j}$ to every $x \in R_{j}$, so the tree can be written as
$$T(\underbrace{\{ R_{j}, \gamma_{j} \}_{j=1}^{J}}_{=: \Theta}): x \mapsto \sum_{j=1}^{J}\gamma_{j} \cdot \mathbb{1}_{R_{j}}(x),$$
where the partition regions and their values $\{ R_{j}, \gamma_{j} \}_{j=1}^{J}$ are treated as the parameters $\Theta$.

### Fitting a Decision Tree

Fitting the optimal tree then means solving the (almost impossible) problem
$$\hat{\Theta}=\{ \hat{R}_{j}, \hat{\gamma}_{j} \}_{j=1}^{J}=\underset{{\{ R_{j},\gamma_{j} \}_{j=1}^{J}}}{\arg\min}\,\,\sum_{j=1}^{J}\sum_{i:x_{i} \in R_{j}}L(y_{i},\gamma_{j}),$$
i.e. accumulating the [[Loss Functions#Per-Group Losses in Classification|loss]] $L$ of predicting $y_{i}$ with the constant $\gamma_{j}$ for each $x_{i}$ that falls into $R_{j}$, then summing over all regions $R_{1},\dots,R_{J}$.

- Essentially, the problem has two steps: finding $\{ R_{j} \}_{j=1}^{J}$, then finding the best $\gamma_{j}$ for each $R_{j}$.
- The second step is rather trivial: usually $\hat{\gamma}_{j}=\mathrm{avg}_{i}(y_{i} \,|\, x_{i} \in R_{j})$ for regression ($l_{2}$ loss), and $\hat{\gamma}_{j}=\mathrm{mode}(y_{i} \,|\, x_{i} \in R_{j})$, i.e. **majority voting**, for classification.

However, *the first step of finding the best $R_{j}$ is too computationally complex to solve for the global optimum*, so most tree-fitting algorithms simplify the problem and settle for suboptimal solutions.

- Usually the $R_{j}$ are restricted to regions that can be formed by successively partitioning along a single predictor $x_{k}$.
- Many algorithms also greedily find the best partition that reduces the loss, without looking ahead.

### Losses and Impurity Measures

Fitting trees has one major difference from other methods, in that *trees do not involve optimizing a differentiable function*.

- We can't use gradients to fit a single tree (it is possible for a forest, though -- see applications in [[Boosting]]).
- While regression tasks can make use of differentiable losses like the $l_{2}$ loss, it makes no sense to "differentiate the loss with respect to a tree".
- Classification tasks no longer need surrogate losses: those are introduced to replace the non-differentiable 0-1 loss and allow gradient-based fitting, but here the predictor is not differentiable either.

Instead, we directly work with a node-wise analogue of the 0-1 loss: class proportions. In a node $R$ with indices $I_{R}$, the proportion of class $k$ is
$$\eta_{k}(R)=\frac{| \{ i \in I_{R} ~|~ y_{i}=k \} |}{| I_{R} |}.$$
Many **impurity measures** then use these proportions to quantify the quality of a node; [[Loss Functions#Per-Group Losses in Classification|examples]] include the misclassification rate $1-\max_{k}\eta_{k}$, the Gini index, and entropy.
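As a concrete illustration of these node-wise quantities, here is a minimal Python sketch (assuming NumPy; the function names are just for illustration) that computes $\eta_{k}(R)$ from a node's labels and evaluates the three impurity measures mentioned above.

```python
import numpy as np

def class_proportions(y):
    """eta_k(R): proportion of each class among the labels y in a node R."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification(y):
    """1 - max_k eta_k: error rate of predicting the majority class."""
    return 1.0 - class_proportions(y).max()

def gini(y):
    """Gini index: sum_k eta_k * (1 - eta_k)."""
    eta = class_proportions(y)
    return float(np.sum(eta * (1.0 - eta)))

def entropy(y):
    """Entropy: -sum_k eta_k * log(eta_k)."""
    eta = class_proportions(y)
    return float(-np.sum(eta * np.log(eta)))

# Example: a node with 8 samples of class 0 and 2 of class 1.
y_node = np.array([0] * 8 + [1] * 2)
print(misclassification(y_node))  # 0.2
print(gini(y_node))               # 2 * 0.8 * 0.2 = 0.32
print(entropy(y_node))            # ~0.50
```

Note that `np.unique` only returns classes actually observed in the node, so zero proportions never enter the entropy term.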
### Greedy Growth of Decision Trees

To circumvent the combinatorially impossible task of choosing the globally optimal $\{ R_{j} \}$, most algorithms grow the tree greedily, i.e. they repeatedly make the single split that decreases some objective the most.

For classification trees with impurity function $i(R)$ (mapping a region $R$ to an impurity value in $\mathbb{R}$):

> [!algorithm] Growth of a Classification Tree
> Consider the indices $I$ assigned to the current node $R$ that needs to be split further.
>
> For $X_{j}$ among the possible variables $[X_{1},\dots, X_{p}]$:
> &emsp;&emsp;&emsp;For observed values $x_{j}^{\ast} \in [x_{1j},\dots,x_{nj}]$ of $X_{j}$:
> &emsp;&emsp;&emsp;&emsp;&emsp;Split the node into two smaller nodes $$\begin{align*}R_{-}(x_{j}^{\ast})&:= \{ \mathbf{x} \in R~|~ x_{j} < x_{j}^{\ast} \},\\R_{+}(x_{j}^{\ast})&:= \{ \mathbf{x} \in R~|~ x_{j} \ge x_{j}^{\ast} \}. \end{align*}$$
> &emsp;&emsp;&emsp;&emsp;&emsp;Similarly split the indices $I_{\pm}(x_{j}^{\ast})$ by the two nodes.
> &emsp;&emsp;&emsp;&emsp;&emsp;Compute the impurity decrease $\Delta i:=| I |\,i(R)-| I_{+} |\,i(R_{+})-| I_{-} |\,i(R_{-}).$
>
> Return the optimal split $X_{j}=x_{j}^{\ast}$ with the largest decrease in impurity.

Note that if the impurity measure $i$ is a concave function of $\eta$ (with slight abuse of notation, $i(R):=i(\eta(R))$), then *the tree's total impurity always decreases*: for each split $R \to R_{\pm}$, each $\eta_{k}(R)$ is the weighted average $\frac{| I_{+} |}{| I |}\eta_{k}(R_{+})+\frac{| I_{-} |}{| I |}\eta_{k}(R_{-})$; therefore,
$$i(\eta(R)) \ge \frac{| I_{+} |}{| I |}i(\eta(R_{+})) + \frac{| I_{-} |}{| I |}i(\eta(R_{-}))$$
by concavity of $i$ (Jensen's inequality). The measures above (misclassification rate, Gini index, entropy) are all concave in $\eta$.

Regression trees are fit similarly, but require the prediction $\hat{\beta}_{R}$ of a node to be determined before its loss can be computed: for a loss $L$, define the node-wise loss $L(\hat{\beta}_{R},R):= \sum_{i \in I_{R}}L(\hat{\beta}_{R}, y_{i}).$

> [!algorithm] Growth of a Regression Tree
> Consider the indices $I$ assigned to the current node $R$ (with prediction $\hat{\beta}$) that needs to be split further.
>
> For $X_{j}$ among the possible variables $[X_{1},\dots, X_{p}]$:
> &emsp;&emsp;&emsp;For observed values $x_{j}^{\ast} \in [x_{1j},\dots,x_{nj}]$ of $X_{j}$:
> &emsp;&emsp;&emsp;&emsp;&emsp;Split the indices into $$\begin{align*}I_{-}&:= \{ i \in I~|~ x_{ij} < x_{j}^{\ast} \},\\I_{+}&:= \{ i \in I~|~ x_{ij} \ge x_{j}^{\ast} \}. \end{align*}$$
> &emsp;&emsp;&emsp;&emsp;&emsp;Let the new regions be $R_{\pm}$.
> &emsp;&emsp;&emsp;&emsp;&emsp;Compute predictions $\hat{\beta}_{\pm}$ for each region.
> &emsp;&emsp;&emsp;&emsp;&emsp;Compute the improvement $\Delta L:= L(\hat{\beta},R) - L(\hat{\beta}_{+},R_{+}) - L(\hat{\beta}_{-},R_{-}).$
>
> Return the optimal split $X_{j}=x_{j}^{\ast}$ with the largest improvement.

## Relative Importance of Predictors

If the predictor $X_{j}$ is chosen to make the $t$th split, it causes an improvement in impurity or risk:
$$\begin{align*} &[\text{Node }R, \text{ impurity}=i(R)]\\[0.4em] &\longrightarrow \begin{cases} [\text{Node }R_{-},\text{ impurity}=i(R_{-})] \\[0.4em] [\text{Node }R_{+},\text{ impurity}=i(R_{+})] \end{cases}\\[0.6em] &\Delta i_{t}:=\text{improvement}=| I |\,i(R) -| I_{+} |\,i(R_{+})-| I_{-} |\,i(R_{-}). \end{align*}$$

Then, the relative importance of a predictor $X_{j}$ in a tree $T$ with $J-1$ splits is the sum of improvements at the splits where it is chosen:
$$\mathcal{I}_{j}(T)=\sum_{t=1}^{J-1}\Delta i_{t} \cdot \mathbb{1}(X_{j} \text{ is chosen in split }t).$$
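To make the greedy split search and the importance bookkeeping concrete, here is a minimal Python sketch (assuming NumPy; the `best_split` helper and its return format are illustrative, not any particular library's API). It scans every predictor and every observed threshold, scores each candidate by the weighted impurity decrease $\Delta i = |I|\,i(R) - |I_{+}|\,i(R_{+}) - |I_{-}|\,i(R_{-})$, and shows how each split's $\Delta i_{t}$ would be accumulated into a per-predictor importance.

```python
import numpy as np

def gini(y):
    """Gini impurity i(R), computed from the class proportions eta_k(R)."""
    _, counts = np.unique(y, return_counts=True)
    eta = counts / counts.sum()
    return float(np.sum(eta * (1.0 - eta)))

def best_split(X, y, impurity=gini):
    """Greedy search over predictors j and observed thresholds x_j*,
    maximizing delta_i = |I| i(R) - |I+| i(R+) - |I-| i(R-)."""
    n, p = X.shape
    parent = n * impurity(y)
    best = {"delta_i": 0.0, "feature": None, "threshold": None}
    for j in range(p):                     # loop over predictors X_j
        for x_star in np.unique(X[:, j]):  # observed values x_j* of X_j
            left = X[:, j] < x_star        # indices I_-
            right = ~left                  # indices I_+
            if left.sum() == 0 or right.sum() == 0:
                continue                   # skip degenerate splits
            delta_i = (parent
                       - left.sum() * impurity(y[left])
                       - right.sum() * impurity(y[right]))
            if delta_i > best["delta_i"]:
                best = {"delta_i": delta_i, "feature": j, "threshold": x_star}
    return best

# Toy usage: column 0 determines the label, column 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
split = best_split(X, y)
print(split)  # expect feature 0 with a threshold near 0

# Relative importance bookkeeping: every time a split is made, add its
# delta_i to the running total of the predictor that was chosen.
importance = np.zeros(X.shape[1])
importance[split["feature"]] += split["delta_i"]
print(importance)
```

A full tree would apply this search recursively to $R_{\pm}$ until a stopping rule (e.g. a minimum node size) is met, accumulating each split's $\Delta i_{t}$ into the chosen predictor's total along the way.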