> [!tldr]
> A **loss function** $L(y, f(x))$ represents the error incurred by modeling $y$ with $f(x)$.
>
> The most common loss function is the $l_{p}$ norm:
> $l_{p}:=\| y-f(x) \|_{p}$, where $p=2$ gives the Euclidean distance when $y$ and $f(x)$ are vector-valued.
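As a quick numerical sketch (the helper name `lp_loss` is illustrative, not from any particular library), the $l_{p}$ loss can be computed directly with NumPy:

```python
import numpy as np

def lp_loss(y, f_x, p=2):
    """l_p loss ||y - f(x)||_p between a target y and a prediction f(x)."""
    return np.linalg.norm(np.asarray(y, dtype=float) - np.asarray(f_x, dtype=float), ord=p)

# p = 2 recovers the Euclidean distance for vector-valued y and f(x).
print(lp_loss([1.0, 2.0], [0.0, 0.0], p=2))  # sqrt(5) ≈ 2.236
print(lp_loss([1.0, 2.0], [0.0, 0.0], p=1))  # 3.0
```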
### Per-Observation Losses in Classification
For binary responses $y\in \{ -1,1 \}$ modeled by a discriminant function $f$ and a decision boundary $\{ x\,|\,f(x)=0 \}$, the loss functions largely act on the **margin** $yf(x)$, whose sign indicates whether $x$ is correctly classified: $\operatorname{sign}\big(yf(x)\big)=\begin{cases}
+1 &\text{if }x \text{ is correctly classified}; \\
-1 &\text{if }x \text{ is incorrectly classified}.
\end{cases}$
![[LossFunctions.png#invert]]
Misclassification error is just the 0-1 loss $\mathbf{1}_{\text{incorrectly classified}}=\mathbf{1}\{ yf(x)\leq 0 \}$. However, it is neither convex nor differentiable, so most optimization techniques do not work on it.
The other losses are the **surrogate loss functions**, which satisfy conditions amenable to optimization:
- Convexity and differentiability.
- Always larger than (or equal to) the 0-1 loss.
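As a minimal sketch (the function names are illustrative; the hinge, logistic, and exponential losses are standard examples of such surrogates), each of these can be written as a function of the margin $m = yf(x)$:

```python
import numpy as np

def zero_one(m):    # 0-1 loss: 1 when the margin is non-positive
    return (m <= 0).astype(float)

def hinge(m):       # hinge loss: max(0, 1 - m)
    return np.maximum(0.0, 1.0 - m)

def logistic(m):    # logistic loss, in base 2 so it equals the 0-1 loss at m = 0
    return np.log2(1.0 + np.exp(-m))

def exponential(m): # exponential loss
    return np.exp(-m)

m = np.linspace(-2, 2, 5)
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("logistic", logistic), ("exponential", exponential)]:
    print(name, loss(m))  # each surrogate stays at or above the 0-1 loss
```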
One issue is that if the data are separable, i.e. we have found $\hat{f}$ such that $y_{i}\hat{f}(x_{i})>0$ for all $i$ in the dataset, then $c\hat{f}$ is also a perfect classifier for any $c>0$. However, for most of the losses above, $\lim_{c \to \infty}L\big(c\,y_{i}\hat{f}(x_{i})\big)= 0$, so *the training loss can be made arbitrarily small by taking $c$ large, while resulting in a wildly overconfident discriminant function $c\hat{f}$*.
- A solution is to add a penalty on the size of the discriminant function.
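A small numerical sketch of this effect, using toy separable data and the exponential loss for illustration (the penalty weight `lam` is an arbitrary choice):

```python
import numpy as np

# Toy separable data: f_hat(x) = x classifies y = sign(x) perfectly.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.sign(x)

def train_exp_loss(c):
    # Mean exponential loss of the rescaled discriminant c * f_hat.
    return np.mean(np.exp(-y * (c * x)))

lam = 0.1  # weight on a penalty for the size of the discriminant
for c in [1.0, 10.0, 100.0]:
    print(c, train_exp_loss(c),               # -> 0 as c grows
          train_exp_loss(c) + lam * c**2)     # penalized loss grows instead
```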
### Per-Group Losses in Classification
For classifiers that predict the same response for a group $G$ of inputs, e.g. [[Decision Trees|decision trees]], a number of loss functions concern the **purity** of the group (a node in a decision tree): define the proportion of the group belonging to class $k$ as $\eta_{k}:= \frac{| \{ i ~:~ x_{i} \in G,\ y_{i}=k \} |}{| G |}.$ Then the following are loss functions: $\begin{align*}
\text{classification error} &= 1-\max_{k}\eta_{k}\\
\text{entropy} &= -\sum_{k}\eta_{k}\log \eta_{k}\\
\text{Gini impurity} &= \sum_{k}\eta_{k}(1-\eta_{k}).
\end{align*}$
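A small sketch computing the three purity measures from a group's labels (the function name `purity_losses` is made up for illustration):

```python
import numpy as np

def purity_losses(labels):
    """Classification error, entropy, and Gini impurity of a group's labels."""
    labels = np.asarray(labels)
    # Class proportions eta_k, over the classes actually present in the group.
    eta = np.array([np.mean(labels == k) for k in np.unique(labels)])
    error = 1.0 - eta.max()
    entropy = -np.sum(eta * np.log(eta))   # note the minus sign
    gini = np.sum(eta * (1.0 - eta))
    return error, entropy, gini

print(purity_losses([0, 0, 0, 1]))  # impure group: all three measures positive
print(purity_losses([1, 1, 1, 1]))  # pure group: all three measures are 0
```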