> [!tldr]
> A **loss function** $L(y, f(x))$ represents the error incurred by modeling $y$ with $f(x)$.
## Loss Functions for Regression
The most common loss functions in regression are the $l_{p}$ norms of the residual:
$l_{p}:=\| y-f(x) \|_{p} $
where $p=2$ gives the Euclidean distance if $y,f(x)$ are vector-valued.
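A minimal `numpy` sketch of these losses (the helper name `lp_loss` is just for illustration):

```python
import numpy as np

def lp_loss(y, f, p=2):
    """l_p loss ||y - f||_p between observations y and fitted values f."""
    return np.linalg.norm(np.asarray(y) - np.asarray(f), ord=p)

y = np.array([1.0, 2.0, 3.0])
f = np.array([1.1, 1.8, 3.5])
print(lp_loss(y, f, p=1))  # l_1: sum of absolute residuals
print(lp_loss(y, f, p=2))  # l_2: Euclidean norm of the residual vector
```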
### Interpolating Behaviour of $l_{1}$ Loss
Consider minimising the $l_{1}$ norm of the residuals $\mathbf{y}-\mathbf{f}$:
$\min_{\mathbf{f}} \| \mathbf{y}-\mathbf{f} \|_{1} .$
A standard technique from [[Constrained Optimisation]] gives the equivalent problem of minimising the (unsigned) residual bounds $\mathbf{r}$:
$\begin{align*}
\min_{\mathbf{f},\,\mathbf{r}} \| \mathbf{r} \|_{1}, ~~~ \text{subject to }&\mathbf{r} \succeq 0,\\
&\mathbf{y}-\mathbf{f}-\mathbf{r} \preceq 0,\\
&\mathbf{y}-\mathbf{f}+\mathbf{r} \succeq 0
\end{align*} $
Complementary slackness (applied to the first inequality) shows that at the optimum there typically exist indices $i$ with $r_{i}=0$, i.e. *the fit will exactly interpolate some of the data points.*
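As a concrete check, here is a sketch (assuming a linear model $\mathbf{f}=X\beta$, synthetic data, and `scipy.optimize.linprog` as the LP solver) that solves the problem above and counts how many residuals come out exactly zero:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.standard_t(df=3, size=n)                # heavy-tailed noise

# Variables are (beta, r); minimise sum(r) subject to -r <= y - X beta <= r, r >= 0.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[ X, -np.eye(n)],    #  X beta - r <= y   (i.e. y - X beta + r >= 0)
                 [-X, -np.eye(n)]])   # -X beta - r <= -y  (i.e. y - X beta - r <= 0)
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = res.x[:p]
residuals = y - X @ beta_hat
print("residuals that are (numerically) zero:", np.sum(np.abs(residuals) < 1e-8))
```

For a generic dataset the optimal fit passes through roughly as many points as there are free parameters, in line with the complementary-slackness argument.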
## Loss Functions for Classification
### Per-Observation Losses in Classification
For binary responses $y\in \{ -1,1 \}$ modelled by a discriminant function $f$ and a decision boundary $\{ x\,|\,f(x)=0 \}$: the loss functions largely act on the **margin** $yf(x)$, which satisfies $yf(x)\begin{cases}
>0 &\text{if }\operatorname{sign}(f(x))=y \text{ (correctly classified)}; \\
<0 &\text{otherwise (incorrectly classified)}.
\end{cases}$
![[LossFunctions.png#invert|center]]
Misclassification error is just the 0-1 loss $\mathbf{1}_{\text{incorrectly classified}}$. However, it is neither convex nor differentiable, so most optimisation techniques do not work on it.
The other losses are the **surrogate loss functions**, which satisfy conditions amenable to optimisation:
- Convexity and differentiability.
- Always larger than (or equal to) the 0-1 loss.
One issue is that if the data are separable, i.e. we have found $\hat{f}$ such that $\hat{f}(x_{i})y_{i}>0$ for all $i$ in the dataset, then $c\hat{f}$ is also a perfect classifier for any $c>0$. However, for most of the losses above, $\lim_{c \to \infty}L(cy\hat{f})= 0$, so *the training loss can be made arbitrarily small by increasing $c$, while resulting in a wildly confident discriminant function $c\hat{f}$* (see the sketch below).
- A solution is to add a penalty on the size of the discriminant function.
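A small `numpy` sketch of the scaling issue (the exponential and logistic formulas below are the standard surrogates written as functions of the margin $yf(x)$; the margins themselves are made up):

```python
import numpy as np

# Standard surrogate losses as functions of the margin m = y * f(x).
def zero_one(m):     return (m <= 0).astype(float)
def hinge(m):        return np.maximum(0.0, 1.0 - m)
def exponential(m):  return np.exp(-m)
def logistic(m):     return np.log1p(np.exp(-m)) / np.log(2)  # scaled so the loss is 1 at m = 0

# A separable toy fit: every margin y_i * f_hat(x_i) is positive.
margins = np.array([0.2, 0.5, 1.5, 3.0])

for c in [1, 10, 100]:
    print(c, exponential(c * margins).mean(), logistic(c * margins).mean())
# Both surrogate training losses shrink towards 0 as c grows,
# even though the classifier sign(c * f_hat) is unchanged.
```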
### Per-Group Losses in Classification
For classifiers that predict the same response for a group $G$ of inputs, e.g. [[Decision Trees|decision trees]], a number of loss functions concern the **purity** of the group (a node in decision trees). Define the proportion of the group belonging to class $k$ as $\eta_{k}:= \frac{| \{ i ~:~ x_{i} \in G, y_{i}=k \} |}{| G |}.$ Then the following are loss functions: $\begin{align*}
\text{classification error} &= 1-\max_{k}\eta_{k}\\
\text{entropy} &= -\sum_{k}\eta_{k}\log \eta_{k}\\
\text{Gini impurity} &= \sum_{k}\eta_{k}(1-\eta_{k}).
\end{align*}$
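A minimal sketch (assuming the group's labels are given as an array and using the natural logarithm for the entropy) that computes the three impurities:

```python
import numpy as np

def node_impurities(labels):
    """Classification error, entropy, and Gini impurity of one group/node."""
    _, counts = np.unique(labels, return_counts=True)
    eta = counts / counts.sum()                      # class proportions eta_k
    return {
        "classification_error": 1.0 - eta.max(),
        "entropy": -np.sum(eta * np.log(eta)),       # 0 when the node is pure
        "gini": np.sum(eta * (1.0 - eta)),
    }

print(node_impurities(["a", "a", "a", "b"]))   # impure node
print(node_impurities(["a", "a", "a", "a"]))   # pure node: all three are 0
```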