Many parametric models are fit by minimizing a [[Loss Functions|loss function]], so the choice of loss function determines the **robustness** of the model. As a general rule, the more weight a loss puts on noisy data points (e.g. outliers), the less robust the algorithm is to such points.
- For example, the $l_{2}$ loss (squared error) penalizes large residuals harder than the $l_{1}$ loss (absolute error), so it is less robust.
- As a consequence, the sample mean (the best constant, i.e. intercept-only, estimate under the $l_{2}$ loss) is less robust than the sample median (the best under the $l_{1}$ loss); a small numerical sketch is given at the end of this note.
- More extreme loss functions like the exponential loss $\exp(-y \cdot f(x))$ for a binary $\{ -1,1 \}$ response are very non-robust, and as a result algorithms like [[Boosting#AdaBoost|AdaBoost]] degrade quickly on noisy data.

### Alternative Loss Functions

Traditional $l_{p}$ losses (especially $l_{1}$ and $l_{2}$) each have one of two drawbacks:
- The squared $l_{2}$ loss is easy to optimize, but not robust.
- The $l_{1}$ loss is robust, but its discontinuous derivative makes it harder to optimize.

The **Huber loss** splices the two together and gets the best of both: quadratic for small residuals, linear for large ones (see the code sketch at the end of this note):

$$L(y,f(x))=\begin{cases} [y-f(x)]^{2} & \text{for } |y-f(x)|\leq\delta, \\[0.4em] 2\delta\,|y-f(x)|-\delta^{2} & \text{otherwise.} \end{cases}$$

![[HuberLoss.png#invert|center]]

However, using alternative loss functions like the Huber loss would sacrifice the interpretability and elegance of algorithms like AdaBoost.
- Lol who cares about interpretability when we are using ensemble models.

> [!idea] Man I love Trees
> Although robust losses are usually harder to differentiate, this is not an issue for non-gradient methods like [[Decision Trees]] (and by extension their [[Ensemble Methods|ensembles]] like [[Bootstrap Ensemble Methods#Bagging and Random Forests|random forests]]).
>
> [[Boosting|Gradient boosting]] is slightly inconvenienced by the $l_{1}$ loss's discontinuous derivative (which complicates the gradient step), but other differentiable losses are still OK.
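
A minimal numerical sketch of the mean-versus-median point above, in NumPy. The data values are made up for illustration: adding a single gross outlier shifts the $l_{2}$-optimal estimate (the mean) a lot, while the $l_{1}$-optimal estimate (the median) barely moves.

```python
import numpy as np

# Made-up data: a clean sample and the same sample with one gross outlier.
clean = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
noisy = np.append(clean, 100.0)

# The sample mean minimizes the sum of squared errors (l2);
# the sample median minimizes the sum of absolute errors (l1).
print("mean:   clean =", clean.mean(), " noisy =", noisy.mean())          # 3.0 -> ~19.2
print("median: clean =", np.median(clean), " noisy =", np.median(noisy))  # 3.0 -> 3.25
```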
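
A small sketch of the Huber loss formula above, again in NumPy; `delta` and the example residuals are arbitrary choices for illustration. For residuals inside $[-\delta, \delta]$ it matches the squared error, and outside it grows only linearly, which is what limits the influence of outliers.

```python
import numpy as np

def huber_loss(residual: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Huber loss of the residual y - f(x): quadratic inside [-delta, delta], linear outside."""
    r = np.abs(residual)
    return np.where(r <= delta, r**2, 2.0 * delta * r - delta**2)

# Arbitrary residuals for illustration: note the linear growth beyond |r| = delta.
residuals = np.array([-5.0, -1.0, -0.3, 0.0, 0.3, 1.0, 5.0])
print(huber_loss(residuals, delta=1.0))
# At r = ±5 the Huber loss is 9, whereas the squared loss would be 25.
```

The two branches agree at $|r|=\delta$ (both equal $\delta^{2}$), so the loss and its derivative are continuous there, which is exactly why it is easier to optimize than the $l_{1}$ loss while staying far more robust than the $l_{2}$ loss.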