> [!tldr]
> **Generalized linear models (GLM)** model a random variable $Y$ following a distribution $f_{Y}(\cdot;\beta)$ from a certain [[Exponential Dispersion Family|exponential dispersion family]], parametrized through coefficients $\beta$.
>
> Instead of modeling $Y=\beta^{T} X$ directly as in OLS, GLMs model its mean $\mu=\mathbb{E}[Y]$ via $g(\mu)=\beta^{T} X=: \eta$ for a suitable [[Link Function|link function]] $g$. The coefficients $\beta$ can then be fitted analytically for some distributions or numerically for others.
For inference from a sample $(\mathbf{y},\mathbf{X})$, each $y_{i}$ is assumed to follow the same family of distributions $\mathcal{P}_{Y}$, but with a different mean $\mu_{i}:= \mathbb{E}[y_{i}]$, modeled as $g(\mu_{i})=\beta^{T}X_{i}$ with the same $g$ and $\beta$ for all observations.
## Solving for GLM Coefficients
In matrix notation, $g(\hat{\pmb{\mu}})=\pmb{\eta}:= \mathbf{X}\beta$ (with $g$ applied elementwise), and the joint likelihood is (assuming $g$ is the canonical link) $f(\mathbf{y} ;\beta)=\exp\left[ \frac{1}{\phi}\left( \pmb{\eta}^{T}\mathbf{y}-\sum_{i=1}^{N}\kappa(\eta_{i}) \right) \right]h(\mathbf{y};\phi),$ so the MLE only needs to maximize $\pmb{\eta}^{T} \mathbf{y}-\sum_{i}\kappa(\eta_{i})=\sum_{i}(\eta_{i}y_{i}-\kappa(\eta_{i})).$
Equivalently, *the MLE coefficients of the GLM minimize the [[Deviance|deviance]]*: $\underbrace{\sum_{i}\left( \hat{\theta}^{\mathrm{sat}}_{i}y_{i}-\kappa(\hat{\theta}^{\mathrm{sat}}_{i}) \right)}_{\phi\, \cdot\,\text{saturated loglik}}- \underbrace{\sum_{i}(\eta_{i}y_{i}-\kappa(\eta_{i}))}_{\phi\, \cdot\,\text{loglik}(\pmb{\eta})}=\frac{1}{2}D(\pmb{\eta}),$ where $D(\pmb{\eta})$ is the deviance of the GLM with linear predictors $\pmb{\eta}$ (compared to the saturated model with parameters $\hat{\theta}^\mathrm{sat}_{i}$).
Taking the derivative and using the [[Exponential Dispersion Family#Moments of EDFs|identity]] $\kappa'(\eta)=\mu(\eta)=g^{-1}(\eta)$, the MLE solves $\mathbf{X}^{T}(\mathbf{y}-g^{-1}(\mathbf{X}\beta))=0.$ This is solvable analytically for distributions like the Gaussian where $g^{-1}$ is linear; other distributions require numerical solutions. [[Iteratively Reweighted Least Squares|IRLS]] is the most common numerical method, equivalent to applying the Newton-Raphson method to find the MLE.
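As a quick illustration (a minimal sketch, not from any particular library), the score equation can be solved by Newton-Raphson/IRLS for a Bernoulli GLM with the canonical logit link; the data `X, y`, the function name, and the simulated toy example are placeholders.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS for a Bernoulli GLM with the canonical (logit) link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # g^{-1}(eta), the fitted means
        w = mu * (1.0 - mu)                 # V(mu): diagonal of the weight matrix W
        score = X.T @ (y - mu)              # gradient of the log-likelihood
        info = X.T @ (w[:, None] * X)       # Fisher information X^T W X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy usage with simulated data (purely illustrative)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))
print(fit_logistic_irls(X, y))
```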
## Choice of Link Functions
![[GLMRelation.png#invert|center]]
In exponential dispersion families, $\mu=\mu(\theta)$ is invertible, so choosing the [[Link Function#Canonical Link Function|canonical link function]] $g=\mu^{-1}$ allows us to directly model $\theta=\eta$.
- For example, **logistic regression** models $Y_{i} \overset{\mathrm{indep.}}{\sim}\mathrm{Binom}(n_{i}, p_{i})$ variables ($n_{i}$ is known) with the canonical link $p_{i} \mapsto \log(p_{i} / (1-p_{i}))$, aka the **log-odds** function. ^84f6f6
- This is equivalent to modeling the log-odds $\eta \in \mathbb{R}$ linearly, then passing it through the function $\eta \mapsto \exp(\eta) / (1+\exp(\eta))$ to squash it into $(0,1)$ (a small sketch follows below).
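A tiny sketch of the logit link and its inverse (function names are my own):

```python
import numpy as np

def logit(p):
    """Canonical link for the Bernoulli/binomial mean: p -> log-odds."""
    return np.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: squashes the linear predictor eta into (0, 1)."""
    return np.exp(eta) / (1 + np.exp(eta))

p = np.array([0.1, 0.5, 0.9])
eta = logit(p)
print(eta)             # approx [-2.197, 0., 2.197]
print(inv_logit(eta))  # recovers [0.1, 0.5, 0.9]
```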
## GLM Residuals
Since a GLM is inherently heteroscedastic (each observation has variance $\mathrm{Var}(Y_{i})=\phi V(\mu_{i})$, estimated by $\hat{\phi}V(\hat{\mu}_{i})$), directly comparing the raw residuals $y_{i}-\hat{\mu}_{i}$ across observations is unsound.
> [!definition|*] Pearson Residuals
> Normalizing by $\sqrt{ V(\hat{\mu}_{i}) }$ gives the **Pearson residuals**: $r_{P_{i}}:= \frac{y_{i}-\hat{\mu}_{i}}{\sqrt{ V(\hat{\mu}_{i}) }}.$
> Their sum-of-squares is the **Pearson goodness-of-fit (GOF)** statistic with an asymptotic $\chi^{2}$ distribution $P(\mathbf{y}):=\| \mathbf{r}_{P} \|^{2}=\sum_{i} \frac{(y_{i}-\hat{\mu}_{i})^{2}}{V(\hat{\mu}_{i})} \overset{D}{\approx} \chi^{2}_{n-p}.$
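For example, for a Poisson GLM ($V(\mu)=\mu$, $\phi=1$), the Pearson residuals and GOF statistic could be computed as below; `y`, `mu_hat`, and `p` (the number of fitted coefficients) are assumed inputs from an earlier fit.

```python
import numpy as np
from scipy import stats

def pearson_gof(y, mu_hat, p):
    """Pearson residuals and GOF p-value for a Poisson GLM (V(mu) = mu, phi = 1)."""
    r_p = (y - mu_hat) / np.sqrt(mu_hat)   # Pearson residuals
    P = np.sum(r_p**2)                     # Pearson goodness-of-fit statistic
    df = len(y) - p                        # n - p degrees of freedom
    return r_p, P, stats.chi2.sf(P, df)    # right-tail p-value
```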
An analogous concept is the deviance residual:
> [!definition|*] Deviance Residual
> The amount of [[Deviance|(saturated) deviance]] contributed by the $i$th observation is $d_{i}:= 2(\underbrace{l(\hat{\theta}_{i}^{\mathrm{sat}}; y_{i})}_{\substack{\text{saturated}\\ \text{loglik}}}-\underbrace{l(\hat{\theta}_{i}; y_{i})}_{\substack{\text{model}\\ \text{loglik}}}).$ Note that for the Gaussian GLM (i.e. OLS with Gaussian additive errors), $d_{i}=(y_{i} - \hat{\theta}_{i})^{2} / \sigma^{2}$ is just the squared loss.
>
> Taking a signed square root gives the **deviance residual**: $r_{D_{i}}:= \mathrm{sign}(y_{i}-\hat{\mu}_{i})\cdot\sqrt{ d_{i} }.$ Now the Gaussian GLM has $r_{D_{i}}=(y_{i}-\hat{\theta}_{i}) / \sigma$ exactly.
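A sketch for the Poisson case, where the per-observation deviance simplifies to $d_{i}=2[\,y_{i}\log(y_{i}/\hat{\mu}_{i})-(y_{i}-\hat{\mu}_{i})\,]$ (with the convention $0\log 0=0$); names are placeholders.

```python
import numpy as np

def poisson_deviance_residuals(y, mu_hat):
    """Signed square roots of the per-observation Poisson deviance d_i."""
    ratio = np.where(y > 0, y / mu_hat, 1.0)        # guard so that 0*log(0) = 0
    d = 2.0 * (y * np.log(ratio) - (y - mu_hat))    # per-observation deviance
    return np.sign(y - mu_hat) * np.sqrt(d)
```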
Replacing RSS with deviance, the $R^{2}$ goodness-of-fit statistic can also be generalized to GLMs using the deviance contributions $d_{i}$:
> [!definition|*] Goodness-of-Fit in GLM
> Let $D:= \sum_{i} d_{i}$ be the total (saturated) deviance of the fitted model, and $D^{(0)}:= \sum_{i}d_{i}^{(0)}$ the [[Deviance|null deviance]]; then the GOF measure is $R^{2}_{D}:= 1-\frac{D}{D^{(0)}}.$
> This is analogous to [[Pearson's R2]], where $D^{(0)}$ plays the role of the $\mathrm{TSS}$ and $D$ that of the $\mathrm{RSS}$ of an OLS model.
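Continuing the Poisson sketch above, $R^{2}_{D}$ can be computed by comparing the fitted deviance against the null (intercept-only) deviance, whose fitted mean is $\bar{y}$:

```python
import numpy as np

def deviance_r2(y, mu_hat):
    """R^2_D = 1 - D / D^(0) for a Poisson GLM, where the null model fits mean(y)."""
    def dev(y, mu):
        ratio = np.where(y > 0, y / mu, 1.0)   # convention 0*log(0) = 0
        return 2.0 * np.sum(y * np.log(ratio) - (y - mu))
    D = dev(y, mu_hat)                                       # fitted-model deviance
    D0 = dev(y, np.full_like(y, y.mean(), dtype=float))      # null deviance
    return 1.0 - D / D0
```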
Standardized residuals analogous to the [[Inference in OLS#Residual Diagnostics|OLS ones]] can be defined using the leverages $h_{ii}$ of the hat matrix from the IRLS weighted least squares fit, $\mathbf{H}:= \tilde{\mathbf{X}}(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^{T}=\mathbf{W}^{1 / 2}\mathbf{X}(\mathbf{X}^{T}\mathbf{WX})^{-1}\mathbf{X}^{T}\mathbf{W}^{1 / 2},$ with $\tilde{\mathbf{X}}=\mathbf{W}^{1 / 2}\mathbf{X}$ being the reweighted data. The Gaussian variance estimate $s^{2}$ is replaced by the dispersion $\phi$ (which equals $\sigma^{2}$ in the Gaussian GLM) or its estimate $\hat{\phi}$.
> [!definition|*] Standardized GLM Residuals
> Normalizing the Pearson residuals gives the **standardized Pearson residuals** $\mathbf{r}'_{P}$: $r'_{P_{i}}:= \frac{r_{P_{i}}}{\sqrt{\phi(1-h_{ii}) }}.$
>
> Normalizing the deviance residuals gives the **standardized deviance residuals** $\mathbf{r}'_{D}$: $r'_{D_{i}}:= \frac{r_{D_{i}}}{\sqrt{1-h_{ii} }}.$ Note that the dispersion $\phi$ is already included in the deviance (i.e. the log-likelihood), so there is no need to divide by it.
> [!definition|*] Cook's Distance in GLM
> Replacing $\hat{y}$ with $\hat{\mu}$ in the OLS Cook's distance gives the **GLM Cook's distance**: $\mathrm{cd}_{i}:= \frac{\| \mathbf{W}^{1 / 2}\mathbf{X}(\hat{\beta}-\hat{\beta}_{-i}) \|^{2} }{\phi},$ where again $\hat{\beta}_{-i}$ are the coefficients fitted with all but the $i$th data point.
Similar rules of thumb from OLS also apply:
- The standardized residuals are asymptotically $N(0,1)$, so about $95\%$ of them should be within $\pm 2$ if the model is adequate.
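A sketch of how the leverages $h_{ii}$ and standardized residuals might be computed from the final IRLS weights; `X`, the diagonal weights `w`, the residuals `r_p`, `r_d`, and `phi_hat` are assumed to come from an earlier fit.

```python
import numpy as np

def glm_hat_values(X, w):
    """Diagonal of H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2} for diagonal weights w."""
    Xt = np.sqrt(w)[:, None] * X                   # reweighted data W^{1/2} X
    H = Xt @ np.linalg.solve(Xt.T @ Xt, Xt.T)      # full n x n hat matrix (fine for a sketch)
    return np.diag(H)

def standardized_residuals(r_p, r_d, h, phi_hat):
    """Standardized Pearson and deviance residuals given leverages h."""
    r_p_std = r_p / np.sqrt(phi_hat * (1 - h))     # Pearson: also divide by the dispersion
    r_d_std = r_d / np.sqrt(1 - h)                 # deviance: phi already in the loglik
    return r_p_std, r_d_std
```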
## Hypothesis Testing in GLM
Suppose the fitted coefficients are the MLE $\hat{\beta}$, and we wish to test for the significance of a few of them.
The most rudimentary test is whether all of them (usually excluding the intercept) are $0$, i.e. whether the null model is true; in this case we can resort to the likelihood ratio (aka the [[Deviance|null deviance]]):
> [!definition|*] Null-Deviance Test in GLM
> Asymptotically, if $H_{0}:\beta_{2}=\dots=\beta_{p}=0$ holds ($\beta_{1}$ assumed to be the intercept), the likelihood ratio statistic $\Lambda :=2(\log\mathrm{lik}(\hat{\beta})-\log\mathrm{lik}(\hat{\beta}^{(0)})) \overset{D}{\approx} \chi^{2}_{p-1}.$
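A minimal sketch of this test, assuming the fitted and null log-likelihoods are already available from an earlier fit and `p` counts the coefficients including the intercept:

```python
from scipy import stats

def null_deviance_test(loglik_full, loglik_null, p):
    """Likelihood-ratio test of H0: beta_2 = ... = beta_p = 0."""
    lam = 2.0 * (loglik_full - loglik_null)   # asymptotically chi^2_{p-1} under H0
    return lam, stats.chi2.sf(lam, p - 1)     # statistic and right-tail p-value
```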
For testing a single coefficient, we can resort to the asymptotic normality of the MLE, and use
> [!definition|*] Asymptotic Normality for Testing in GLM
> Using the asymptotic normality of the MLE $\hat{\beta}$, we have $\hat{\beta}_{j}\overset{D}{\approx}N(\beta_{j}, (I^{-1})_{jj}),$ where $I$ is the [[Information and Bounding Errors#Fisher and Observed Information|Fisher information matrix]] of $\hat{\beta}$; for the IRLS algorithm, it equals $X^{T}WX$ from the weighted least squares problem.
>
> Therefore, under $H_{0}:\beta_{j}=0$, we have the asymptotic test statistic $z:= \frac{\hat{\beta}_{j}}{\sqrt{ (I^{-1})_{jj} }} \overset{D}{\approx} N(0,1).$
More generally, the **Wald statistic** tests multiple coefficients at once:
> [!definition|*] Wald Statistic of GLM
> For testing the significance of a subset of the estimated coefficients $\hat{\beta}$, WLOG $\hat{\beta}_{1:k}$, let $\hat{\Sigma}_{1:k}$ be their estimated covariance matrix (e.g. obtained by subsetting the full $\hat{\Sigma}\approx I_{\beta}(X)^{-1}= (X^{T}WX)^{-1}$ from the IRLS algorithm).
>
> The **Wald statistic** is then given by $z^{2}:= \hat{\beta}_{1:k}^{T}\hat{\Sigma}_{1: k}^{-1} \hat{\beta}_{1: k}.$
> Under $H_{0}:\beta_{1: k}=0$, $z^{2}$ is asymptotically $\chi^{2}_{k}$.
- Therefore, with $k=1$, the above asymptotic normal test is a special case of the Wald statistic.
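A sketch of the Wald test, assuming `beta_hat` and the Fisher information `info` $=X^{T}WX$ come from the IRLS fit and `idx` indexes the coefficients under test; note the subset covariance is obtained by subsetting the full inverse information, not by inverting a subset of $X^{T}WX$.

```python
import numpy as np
from scipy import stats

def wald_test(beta_hat, info, idx):
    """Wald chi^2 test of H0: beta[idx] = 0, given the Fisher information X^T W X."""
    cov = np.linalg.inv(info)                 # asymptotic covariance of beta_hat
    b = beta_hat[idx]
    sub_cov = cov[np.ix_(idx, idx)]           # covariance of the tested subset
    z2 = b @ np.linalg.solve(sub_cov, b)      # beta^T Sigma^{-1} beta
    return z2, stats.chi2.sf(z2, len(idx))    # asymptotically chi^2_k under H0

# with idx of length 1 this reduces to the square of the z statistic above
```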
## GLM vs GAM
[[Generalized Additive Models]] (GAMs) have the form $g[\mu_{Y}(X)]=\alpha +\sum_{i} f_{i}(X_{i}),$ which allows non-linear functions $f_{i}(X_{i})$ when estimating $g(\mu)$; GLMs only use linear combinations of the original predictors.
## Resources
MIT OCW

[Cross Validated Post](https://stats.stackexchange.com/questions/40876/what-is-the-difference-between-a-link-function-and-a-canonical-link-function)