> [!tldr]
> **Deviance** measures the difference between likelihoods of models.
>
> For nested models $\mathcal{M}_{0} \subset \mathcal{M}_{1}$, with log-likelihoods $l(\mathbf{X};\mathcal{M}_{i})$ ($i=0,1$), the deviance is $$2(l(\mathbf{X};\mathcal{M}_{1})-l(\mathbf{X};\mathcal{M}_{0})).$$
>
> More specifically, to compare a model $\mathcal{M}$ to a [[Saturated and Null Models|saturated model]] $\mathcal{M}_{s}$, the **scaled/residual deviance** is $$2(l(\mathbf{X};\mathcal{M}_{s})-l(\mathbf{X};\mathcal{M})),$$ which measures the likelihood lost by imposing the structure of model $\mathcal{M}$.
>
> Comparing the saturated model to a [[Saturated and Null Models|null model]] $\mathcal{M}_{0}$ gives the **null deviance**: $$2(l(\mathbf{X}; \mathcal{M}_{s})-l(\mathbf{X};\mathcal{M}_{0})).$$

### Wilks' Theorem and F-tests

Wilks' theorem states that, under $H_{0}: \theta \in \Theta_{0}$, the deviance $D$ between nested parameter spaces $\Theta_{0} \subseteq \Theta_{1}$ asymptotically follows $$D \overset{d}{\approx} \chi^{2}(\dim \Theta_{1} - \dim \Theta_{0}).$$ This allows for hypothesis testing using one-sided percentiles of the $\chi^{2}$ distribution; a numerical sketch appears at the end of this note.

- In particular, if $\Theta_{1}$ is the saturated model, then $\dim \Theta_{1}=n$ (one parameter for each observation).

Another way of comparing two (possibly non-nested) models $\mathcal{M}_{1,2}$ (corresponding to parameter spaces $\Theta_{1,2}$) is the **F-test** applied to the *ratio of their residual deviances*: $$\frac{D_{1} / d_{1}}{D_{2} / d_{2}} \overset{d}{\approx} F_{d_{1}, d_{2}},$$ where $D_{1,2}$ are the residual deviances and $d_{1,2} := n - \dim \Theta_{1,2}$ are their residual degrees of freedom.

- One application is *testing the exclusion of some predictors*, where $\mathcal{M}_{1}$ is the simplified model (so $\Theta_{1} \subset \Theta_{2}$). If the test fails to reject $H_{0}: \theta \in \Theta_{1}$, then there is no evidence to include the extra predictors of $\mathcal{M}_{2}$.
- An example in OLS can be found [[Inference in OLS|here]]. Under [[Inference in OLS#The Gauss-Markov Model|the Gauss-Markov model]], the F-distribution is exact.

> [!warning]
> Wilks' theorem is asymptotic in $n$ with $\dim \Theta_{0,1}$ held constant. If the dimensions also grow with $n$ (e.g. in a [[Saturated and Null Models|saturated model]]), it does not necessarily hold.

### Deviance and Regression Losses

Consider the stochastic model of a response $Y \in \mathbb{R}$ with additive noise: $$Y=f(X)+\epsilon.$$ Many models of $\epsilon$ translate into deviances that resemble common [[Loss Functions]]:

$$\begin{array}{c|c|c|c}
\text{noise distribution}& D \propto \cdots & \text{equiv. loss} & \text{``robust''}\\ \hline
\text{Gaussian } N(0, \sigma^{2})& \mathrm{const.}+ \frac{1}{\sigma^{2}} \| \mathbf{y}-\hat{\mathbf{y}} \|_{2}^{2} & \mathrm{MSE} & \\
\text{Bilateral } \mathrm{Exp} (\lambda) & \mathrm{const.}+\lambda \| \mathbf{y}-\hat{\mathbf{y}} \|_{1} & \mathrm{MAE} & \checkmark \\
t_{\nu}\text{-distribution, scale } \sigma & \mathrm{const.}+(\nu+1)\sum_{i} \log\left( 1+\tfrac{(y_{i}-\hat{y}_{i})^{2}}{\nu \sigma^{2}} \right) & / & \checkmark \\
\text{Gaussian center, } \mathrm{Exp} \text{ tails} & \mathrm{const.}+\sum_{i} \rho_{\delta}(y_{i}-\hat{y}_{i}) & \text{Huber } \rho_{\delta} & \checkmark
\end{array}$$

- Here the bilateral exponential distribution (aka Laplace distribution) can be thought of as two $\mathrm{Exp}(\lambda)$ distributions glued together, or equivalently $$\epsilon=S\cdot \tilde{\epsilon},\qquad S\sim \mathrm{Unif}\{ -1, 1 \},\quad \tilde{\epsilon}\sim \mathrm{Exp}(\lambda),$$ so $S$ flips the sign of a regular exponential variable $\tilde{\epsilon}$.
- Note that the bilateral exponential has fatter tails than the Gaussian; this reflects the fact that fat-tailed noise distributions generate deviances that act as relatively [[Loss Functions and Robustness|robust losses]].
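As a concrete illustration of the quantities above, here is a minimal sketch assuming a Poisson GLM fit with `statsmodels` (the simulated data, variable names, and family choice are illustrative, not from this note): it reads off the residual and null deviances and uses Wilks' theorem to test whether a superfluous predictor can be dropped.

```python
# Minimal sketch: residual/null deviance of a GLM and a Wilks (chi-square)
# test for dropping a predictor.  Data, names, and the Poisson family are
# illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # predictor with no real effect
y = rng.poisson(np.exp(0.5 + 0.8 * x1))  # response depends on x1 only

# Full model M2 (intercept + x1 + x2) vs. reduced model M1 (intercept + x1).
X2 = sm.add_constant(np.column_stack([x1, x2]))
X1 = sm.add_constant(x1)
fit2 = sm.GLM(y, X2, family=sm.families.Poisson()).fit()
fit1 = sm.GLM(y, X1, family=sm.families.Poisson()).fit()

# Residual deviance: 2(l(saturated) - l(model)); null deviance compares the
# saturated model with the intercept-only (null) model.
print("residual deviance (M2):", fit2.deviance)
print("null deviance:         ", fit2.null_deviance)

# Deviance between the nested models M1 and M2, referred to chi^2 with
# dim(Theta_2) - dim(Theta_1) degrees of freedom (Wilks' theorem).
D = 2 * (fit2.llf - fit1.llf)
df = X2.shape[1] - X1.shape[1]
print(f"deviance = {D:.3f}, df = {df}, p = {chi2.sf(D, df):.3f}")
```

If the p-value is large, the test fails to reject the reduced model, i.e. there is no evidence to include `x2`, matching the predictor-exclusion logic above.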
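Similarly, a small sketch of the Laplace row of the table (the data, seed, and search grid below are made-up illustrations): the deviance-style objective reduces to an $\ell_{1}$ loss, so the best constant fit is essentially the sample median, which, unlike the mean, barely reacts to an outlier.

```python
# Sketch of the Laplace-noise -> MAE correspondence; the data and the grid
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
y = 3.0 + rng.laplace(scale=1.0, size=500)  # Laplace noise around a constant
y[0] = 50.0                                 # one gross outlier

def laplace_nll(y, m, lam=1.0):
    """Laplace negative log-likelihood up to a constant: lam * ||y - m||_1."""
    return lam * np.abs(y - m).sum()

# Minimizing the L1 objective over a constant fit recovers (approximately)
# the sample median; the mean is dragged toward the outlier.
grid = np.linspace(2.0, 4.0, 401)
m_hat = grid[np.argmin([laplace_nll(y, m) for m in grid])]
print("argmin of L1 objective:", m_hat)
print("sample median:         ", np.median(y))
print("sample mean:           ", np.mean(y))
```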