> [!tldr]
> A **loss function** $L(y, f(x))$ represents the error incurred by modeling $y$ with $f(x)$.
## Loss Functions for Regression
The most common loss functions in regression are the $l_{p}$ norms of the residual:
$l_{p}:=\| y-f(x) \|_{p} $
where $p=2$ gives the Euclidean distance if $y,f(x)$ are vector-valued.
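A minimal `numpy` sketch of these losses (the helper name `lp_loss` is just for illustration):

```python
import numpy as np

def lp_loss(y, f, p=2):
    """l_p loss ||y - f||_p between observations y and fitted values f."""
    return np.linalg.norm(np.asarray(y) - np.asarray(f), ord=p)

y = np.array([1.0, 2.0, 3.0])
f = np.array([1.1, 1.8, 3.5])
print(lp_loss(y, f, p=1))  # l_1: sum of absolute residuals
print(lp_loss(y, f, p=2))  # l_2: Euclidean norm of the residual vector
```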
### Interpolating Behaviour of $l_{1}$ Loss
Consider minimising the $l_{1}$ norm of the residuals $\mathbf{y}-\mathbf{f}$:
$\min_{\mathbf{f}} \| \mathbf{y}-\mathbf{f} \|_{1} .$
A standard technique from [[Constrained Optimisation]] gives the equivalent problem of minimising the (unsigned) residual bounds $\mathbf{r}$:
$\begin{align*}
\min_{\mathbf{f},\,\mathbf{r}} \| \mathbf{r} \|_{1}, ~~~ \text{subject to }&\mathbf{r} \succeq 0,\\
&\mathbf{y}-\mathbf{f}-\mathbf{r} \preceq 0,\\
&\mathbf{y}-\mathbf{f}+\mathbf{r} \succeq 0
\end{align*} $
Complementary slackness (applied to the first inequality) shows that at the optimum there typically exist indices $i$ with $r_{i}=0$, i.e. *the fit will exactly interpolate some of the data points.*
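As a concrete check, here is a sketch (assuming a linear model $\mathbf{f}=X\beta$, synthetic data, and `scipy.optimize.linprog` as the LP solver) that solves the problem above and counts how many residuals come out exactly zero:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.standard_t(df=3, size=n)                # heavy-tailed noise

# Variables are (beta, r); minimise sum(r) subject to -r <= y - X beta <= r, r >= 0.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[ X, -np.eye(n)],    #  X beta - r <= y   (i.e. y - X beta + r >= 0)
                 [-X, -np.eye(n)]])   # -X beta - r <= -y  (i.e. y - X beta - r <= 0)
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = res.x[:p]
residuals = y - X @ beta_hat
print("residuals that are (numerically) zero:", np.sum(np.abs(residuals) < 1e-8))
```

For a generic dataset the optimal fit passes through roughly as many points as there are free parameters, in line with the complementary-slackness argument.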
## Loss Functions for Classification
### Per-Observation Losses in Classification
For binary responses $y\in \{ -1,1 \}$ modelled by a discriminant function $f$ and a decision boundary $\{ x\,|\,f(x)=0 \}$: the loss functions largely act on the **margin** $yf(x)$, which satisfies $yf(x)\begin{cases}
>0 &\text{if }\operatorname{sign}(f(x))=y \text{ (correctly classified)}; \\
<0 &\text{otherwise (incorrectly classified)}.
\end{cases}$
![[LossFunctions.png#invert|center]]
Misclassification error is just the 0-1 loss $\mathbf{1}_{\text{incorrectly classified}}$. However, it is neither convex nor differentiable, so most optimisation techniques do not work on it.
The other losses are the **surrogate loss functions**, which satisfy conditions amenable to optimisation:
- Convexity and differentiability.
- Always larger than (or equal to) the 0-1 loss.
One issue is that if the data are separable, i.e. we have found $\hat{f}$ such that $\hat{f}(x_{i})y_{i}>0$ for all $i$ in the dataset, then $c\hat{f}$ is also a perfect classifier for any $c>0$. However, for most of the losses above, $\lim_{c \to \infty}L(cy\hat{f})= 0$, so *the training loss can be made arbitrarily small by increasing $c$, while resulting in a wildly confident discriminant function $c\hat{f}$* (see the sketch below).
- A solution is to add a penalty on the size of the discriminant function.
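A small `numpy` sketch of the scaling issue (the exponential and logistic formulas below are the standard surrogates written as functions of the margin $yf(x)$; the margins themselves are made up):

```python
import numpy as np

# Standard surrogate losses as functions of the margin m = y * f(x).
def zero_one(m):     return (m <= 0).astype(float)
def hinge(m):        return np.maximum(0.0, 1.0 - m)
def exponential(m):  return np.exp(-m)
def logistic(m):     return np.log1p(np.exp(-m)) / np.log(2)  # scaled so the loss is 1 at m = 0

# A separable toy fit: every margin y_i * f_hat(x_i) is positive.
margins = np.array([0.2, 0.5, 1.5, 3.0])

for c in [1, 10, 100]:
    print(c, exponential(c * margins).mean(), logistic(c * margins).mean())
# Both surrogate training losses shrink towards 0 as c grows,
# even though the classifier sign(c * f_hat) is unchanged.
```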
### Per-Group Losses in Classification
For classifiers that predict the same response for a group $G$ of inputs, e.g. [[Decision Trees|decision trees]], a number of loss functions concern the **purity** of the group (a node in decision trees). Define the proportion of the group belonging to class $k$ as $\eta_{k}:= \frac{| \{ i ~:~ x_{i} \in G, y_{i}=k \} |}{| G |}.$ Then the following are loss functions: $\begin{align*}
\text{classification error} &= 1-\max_{k}\eta_{k}\\
\text{entropy} &= -\sum_{k}\eta_{k}\log \eta_{k}\\
\text{Gini impurity} &= \sum_{k}\eta_{k}(1-\eta_{k}).
\end{align*}$
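A minimal sketch (assuming the group's labels are given as an array and using the natural logarithm for the entropy) that computes the three impurities:

```python
import numpy as np

def node_impurities(labels):
    """Classification error, entropy, and Gini impurity of one group/node."""
    _, counts = np.unique(labels, return_counts=True)
    eta = counts / counts.sum()                      # class proportions eta_k
    return {
        "classification_error": 1.0 - eta.max(),
        "entropy": -np.sum(eta * np.log(eta)),       # 0 when the node is pure
        "gini": np.sum(eta * (1.0 - eta)),
    }

print(node_impurities(["a", "a", "a", "b"]))   # impure node
print(node_impurities(["a", "a", "a", "a"]))   # pure node: all three are 0
```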