Gaussian distribution: $l_{2}$ distance
- Least-squares regression (LSR) assumes Gaussian errors, hence uses the $l_{2}$ loss.
Generalized: deviance (i.e. $2\times$ (loglik of saturated model - loglik of current model))
- Equals $0$ if the current model is saturated, i.e. has one separate parameter estimate $\hat{\theta}_{i}^{\text{sat}}$ for each observation.
- Same scale as Gaussian $l_{2}$: for a Gaussian model with constant error variance $\sigma_{\epsilon}^{2}$, $\text{deviance}=\sum_{i}\frac{(x_{i}-\hat{\mu}_{i})^{2}}{\sigma_{\epsilon}^{2}}$, i.e. the $l_{2}$ loss normalized by the variance (see the sketch after this list).
- The Gaussian case generalizes to the squared Mahalanobis distance in $\mathbb{R}^{p}$: $\mathrm{dist}(x):= (x-\mu)^{T}\Sigma^{-1}(x-\mu)$
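A minimal numerical sketch of the two bullets above (assuming NumPy/SciPy; the data, `sigma`, and the one-mean model are made-up illustration choices, not from the text): it checks that $2\times(\text{loglik}_{\text{sat}}-\text{loglik}_{\text{model}})$ equals the variance-normalized $l_{2}$ loss for a Gaussian model, and computes a squared Mahalanobis distance in the multivariate case.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.5                                       # assumed known, constant error s.d.
x = rng.normal(loc=2.0, scale=sigma, size=100)    # observations (illustrative)
mu_hat = np.full_like(x, x.mean())                # fitted means under a one-mean model

# Saturated model: one parameter per observation, mu_i^sat = x_i
loglik_sat = norm.logpdf(x, loc=x, scale=sigma).sum()
loglik_fit = norm.logpdf(x, loc=mu_hat, scale=sigma).sum()

deviance = 2 * (loglik_sat - loglik_fit)
l2_normalized = np.sum((x - mu_hat) ** 2) / sigma**2
print(np.isclose(deviance, l2_normalized))        # True: same quantity

# Multivariate analogue: squared Mahalanobis distance (z - mu)^T Sigma^{-1} (z - mu)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu = np.array([0.0, 1.0])
z = np.array([1.0, 2.0])
print((z - mu) @ np.linalg.solve(Sigma, z - mu))
```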
The MLE $\hat{\alpha}$ is the choice of $\alpha$ that minimizes the total deviance. GLM maximum-likelihood fitting is “least total deviance” in the same way that ordinary linear regression is least sum of squares.
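To illustrate the “least total deviance” view, here is a hedged sketch for a Poisson GLM with log link (assuming SciPy; `X`, `y`, `beta`, and the simulated coefficients are hypothetical illustration values). Maximizing the log-likelihood and minimizing the total deviance should give the same fit, since the saturated log-likelihood is a constant in $\beta$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, xlogy

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

def neg_loglik(beta):
    mu = np.exp(X @ beta)
    return -np.sum(y * np.log(mu) - mu - gammaln(y + 1))

def total_deviance(beta):
    mu = np.exp(X @ beta)
    # Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ]; xlogy handles y == 0
    return 2 * np.sum(xlogy(y, y / mu) - (y - mu))

b_mle = minimize(neg_loglik, np.zeros(2)).x
b_dev = minimize(total_deviance, np.zeros(2)).x
print(np.round(b_mle, 4), np.round(b_dev, 4))   # same coefficients (up to tolerance)
print(np.allclose(b_mle, b_dev, atol=1e-3))
```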
In general, for continuous densities $f_{1},f_{2}$, their deviance is $d(f_{1},f_{2}):= 2\,\mathbb{E}_{f_{1}}[\log f_{1}(X) - \log f_{2}(X)]$, i.e. twice the expected log-likelihood ratio when $X \sim f_{1}$ (twice the Kullback-Leibler divergence $D_{\mathrm{KL}}(f_{1}\,\|\,f_{2})$). Hence in general $d(f_{1},f_{2}) \ne d(f_{2},f_{1})$.
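A quick Monte Carlo sketch of the asymmetry (assuming SciPy; the two Gaussians are arbitrary illustration choices), estimating $d(f_{1},f_{2})$ and $d(f_{2},f_{1})$ from samples:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
f1 = norm(loc=0.0, scale=1.0)
f2 = norm(loc=1.0, scale=2.0)

def deviance(p, q, n=200_000):
    # d(p, q) = 2 * E_p[log p(X) - log q(X)], estimated with samples drawn from p
    x = p.rvs(size=n, random_state=rng)
    return 2 * np.mean(p.logpdf(x) - q.logpdf(x))

print(deviance(f1, f2), deviance(f2, f1))   # clearly different: deviance is not symmetric
```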