> [!tldr]
> **Generalized linear models (GLM)** model a random variable $Y$ following a distribution $f_{Y}(\cdot;\beta)$ from a certain [[Exponential Dispersion Family|exponential dispersion family]], parametrized through coefficients $\beta$.
>
> Instead of modeling $Y=\beta^{T} X$ directly as in OLS, GLMs model its mean $\mu=\mathbb{E}[Y]$ via $g(\mu)=\beta^{T} X=: \eta$ for a suitable [[Link Function|link function]] $g$. The coefficients $\beta$ can then be fitted analytically for some distributions or numerically for others.
For inference from a sample $(\mathbf{y},\mathbf{X})$, each $y_{i}$ is assumed to follow the same family of distributions $\mathcal{P}_{Y}$, but with a different mean $\mu_{i}:= \mathbb{E}[y_{i}]$, modeled as $g(\mu_{i})=\beta^{T}X_{i}$ with the same $g$ and $\beta$ for all observations.
## Solving for GLM Coefficients
In matrix notation, $g(\hat{\pmb{\mu}})=\pmb{\eta}:= \mathbf{X}\beta$ (with $g$ applied elementwise), and the joint likelihood is (assuming $g$ is the canonical link) $f(\mathbf{y} ;\beta)=\exp\left[ \frac{1}{\phi}\left( \pmb{\eta}^{T}\mathbf{y}-\sum_{i=1}^{N}\kappa(\eta_{i}) \right) \right]h(\mathbf{y};\phi),$ so the MLE only needs to maximize $\pmb{\eta}^{T} \mathbf{y}-\sum_{i}\kappa(\eta_{i})=\sum_{i}(\eta_{i}y_{i}-\kappa(\eta_{i})).$
Equivalently, *the MLE coefficients of the GLM minimize the [[Deviance|deviance]]*: $\underbrace{\sum_{i}\left( \hat{\theta}^{\mathrm{sat}}_{i}y_{i}-\kappa(\hat{\theta}^{\mathrm{sat}}_{i}) \right)}_{\phi\, \cdot\,\text{saturated loglik}}- \underbrace{\sum_{i}(\eta_{i}y_{i}-\kappa(\eta_{i}))}_{\phi\, \cdot\,\text{loglik}(\pmb{\eta})}=\frac{1}{2}D(\pmb{\eta}),$ where $D(\pmb{\eta})$ is the deviance of the GLM with linear predictors $\pmb{\eta}$ (compared to the saturated model with parameters $\hat{\theta}^\mathrm{sat}_{i}$).
Taking the derivative and using the [[Exponential Dispersion Family#Moments of EDFs|identity]] $\kappa'(\eta)=\mu(\eta)=g^{-1}(\eta)$, the MLE solves $\mathbf{X}^{T}(\mathbf{y}-g^{-1}(\mathbf{X}\beta))=0.$ This is solvable analytically for distributions like the Gaussian where $g^{-1}$ is linear; other distributions require numerical solutions. [[Iteratively Reweighted Least Squares|IRLS]] is the most common numerical method, equivalent to applying the Newton-Raphson method to find the MLE.
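As a quick illustration (a minimal sketch, not from any particular library), the score equation can be solved by Newton-Raphson/IRLS for a Bernoulli GLM with the canonical logit link; the data `X, y`, the function name, and the simulated toy example are placeholders.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS for a Bernoulli GLM with the canonical (logit) link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # g^{-1}(eta), the fitted means
        w = mu * (1.0 - mu)                 # V(mu): diagonal of the weight matrix W
        score = X.T @ (y - mu)              # gradient of the log-likelihood
        info = X.T @ (w[:, None] * X)       # Fisher information X^T W X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy usage with simulated data (purely illustrative)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))
print(fit_logistic_irls(X, y))
```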
## Choice of Link Functions
![[GLMRelation.png#invert|center]]
In exponential dispersion families, $\mu=\mu(\theta)$ is invertible, so choosing the [[Link Function#Canonical Link Function|canonical link function]] $g=\mu^{-1}$ allows us to directly model $\theta=\eta$.
- For example, **logistic regression** models $Y_{i} \overset{\mathrm{indep.}}{\sim}\mathrm{Binom}(n_{i}, p_{i})$ variables ($n_{i}$ is known) with the canonical link $p_{i} \mapsto \log(p_{i} / (1-p_{i}))$, aka the **log-odds** function. ^84f6f6
- This is equivalent to modeling the log-odds $\eta \in \mathbb{R}$ linearly, then passing it through the function $\eta \mapsto \exp(\eta) / (1+\exp(\eta))$ to squash it into $(0,1)$ (a small sketch follows below).
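A tiny sketch of the logit link and its inverse (function names are my own):

```python
import numpy as np

def logit(p):
    """Canonical link for the Bernoulli/binomial mean: p -> log-odds."""
    return np.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: squashes the linear predictor eta into (0, 1)."""
    return np.exp(eta) / (1 + np.exp(eta))

p = np.array([0.1, 0.5, 0.9])
eta = logit(p)
print(eta)             # approx [-2.197, 0., 2.197]
print(inv_logit(eta))  # recovers [0.1, 0.5, 0.9]
```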
## GLM Residuals
Since a GLM is inherently heteroscedastic (each observation has variance $\mathrm{Var}(Y_{i})=\phi V(\mu_{i})$, estimated by $\hat{\phi}V(\hat{\mu}_{i})$), directly comparing the raw residuals $y_{i}-\hat{\mu}_{i}$ across observations is unsound.
> [!definition|*] Pearson Residuals
> Normalizing by $\sqrt{ V(\hat{\mu}_{i}) }$ gives the **Pearson residuals**: $r_{P_{i}}:= \frac{y_{i}-\hat{\mu}_{i}}{\sqrt{ V(\hat{\mu}_{i}) }}.$
> Their sum-of-squares is the **Pearson goodness-of-fit (GOF)** statistic with an asymptotic $\chi^{2}$ distribution $P(\mathbf{y}):=\| \mathbf{r}_{P} \|^{2}=\sum_{i} \frac{(y_{i}-\hat{\mu}_{i})^{2}}{V(\hat{\mu}_{i})} \overset{D}{\approx} \chi^{2}_{n-p}.$
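For example, for a Poisson GLM ($V(\mu)=\mu$, $\phi=1$), the Pearson residuals and GOF statistic could be computed as below; `y`, `mu_hat`, and `p` (the number of fitted coefficients) are assumed inputs from an earlier fit.

```python
import numpy as np
from scipy import stats

def pearson_gof(y, mu_hat, p):
    """Pearson residuals and GOF p-value for a Poisson GLM (V(mu) = mu, phi = 1)."""
    r_p = (y - mu_hat) / np.sqrt(mu_hat)   # Pearson residuals
    P = np.sum(r_p**2)                     # Pearson goodness-of-fit statistic
    df = len(y) - p                        # n - p degrees of freedom
    return r_p, P, stats.chi2.sf(P, df)    # right-tail p-value
```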
An analogous concept is the deviance residual:
> [!definition|*] Deviance Residual
> The amount of [[Deviance|(saturated) deviance]] contributed by the $i$th observation is $d_{i}:= 2(\underbrace{l(\hat{\theta}_{i}^{\mathrm{sat}}; y_{i})}_{\substack{\text{saturated}\\ \text{loglik}}}-\underbrace{l(\hat{\theta}_{i}; y_{i})}_{\substack{\text{model}\\ \text{loglik}}}).$ Note that for the Gaussian GLM (i.e. OLS with Gaussian additive errors), $d_{i}=(y_{i} - \hat{\theta}_{i})^{2} / \sigma^{2}$ is just the squared loss.
>
> Taking a signed square root gives the **deviance residual**: $r_{D_{i}}:= \mathrm{sign}(y_{i}-\hat{\mu}_{i})\cdot\sqrt{ d_{i} }.$ Now the Gaussian GLM has $r_{D_{i}}=(y_{i}-\hat{\theta}_{i}) / \sigma$ exactly.
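A sketch for the Poisson case, where the per-observation deviance simplifies to $d_{i}=2[\,y_{i}\log(y_{i}/\hat{\mu}_{i})-(y_{i}-\hat{\mu}_{i})\,]$ (with the convention $0\log 0=0$); names are placeholders.

```python
import numpy as np

def poisson_deviance_residuals(y, mu_hat):
    """Signed square roots of the per-observation Poisson deviance d_i."""
    ratio = np.where(y > 0, y / mu_hat, 1.0)        # guard so that 0*log(0) = 0
    d = 2.0 * (y * np.log(ratio) - (y - mu_hat))    # per-observation deviance
    return np.sign(y - mu_hat) * np.sqrt(d)
```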
Replacing RSS with deviance, the $R^{2}$ goodness-of-fit statistic can also be generalized to GLMs using the deviance contributions $d_{i}$:
> [!definition|*] Goodness-of-Fit in GLM
> Let $D:= \sum_{i} d_{i}$ be the total (saturated) deviance of the fitted model, and $D^{(0)}:= \sum_{i}d_{i}^{(0)}$ the [[Deviance|null deviance]]; then the GOF measure is $R^{2}_{D}:= 1-\frac{D}{D^{(0)}}.$
> This is analogous to [[Pearson's R2]], where $D^{(0)}$ plays the role of the $\mathrm{TSS}$ and $D$ that of the $\mathrm{RSS}$ of an OLS model.
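Continuing the Poisson sketch above, $R^{2}_{D}$ can be computed by comparing the fitted deviance against the null (intercept-only) deviance, whose fitted mean is $\bar{y}$:

```python
import numpy as np

def deviance_r2(y, mu_hat):
    """R^2_D = 1 - D / D^(0) for a Poisson GLM, where the null model fits mean(y)."""
    def dev(y, mu):
        ratio = np.where(y > 0, y / mu, 1.0)   # convention 0*log(0) = 0
        return 2.0 * np.sum(y * np.log(ratio) - (y - mu))
    D = dev(y, mu_hat)                                       # fitted-model deviance
    D0 = dev(y, np.full_like(y, y.mean(), dtype=float))      # null deviance
    return 1.0 - D / D0
```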
Standardized residuals analogous to the [[Inference in OLS#Residual Diagnostics|OLS ones]] can be defined using the leverages $h_{ii}$ of the hat matrix from the IRLS weighted least squares fit, $\mathbf{H}:= \tilde{\mathbf{X}}(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^{T}=\mathbf{W}^{1 / 2}\mathbf{X}(\mathbf{X}^{T}\mathbf{WX})^{-1}\mathbf{X}^{T}\mathbf{W}^{1 / 2},$ with $\tilde{\mathbf{X}}=\mathbf{W}^{1 / 2}\mathbf{X}$ being the reweighted data. The Gaussian variance estimate $s^{2}$ is replaced by the dispersion $\phi$ (which equals $\sigma^{2}$ in the Gaussian GLM) or its estimate $\hat{\phi}$.
> [!definition|*] Standardized GLM Residuals
> Normalizing the Pearson residuals gives the **standardized Pearson residuals** $\mathbf{r}'_{P}$: $r'_{P_{i}}:= \frac{r_{P_{i}}}{\sqrt{\phi(1-h_{ii}) }}.$
>
> Normalizing the deviance residuals gives the **standardized deviance residuals** $\mathbf{r}'_{D}$: $r'_{D_{i}}:= \frac{r_{D_{i}}}{\sqrt{1-h_{ii} }}.$ Note that the dispersion $\phi$ is already included in the deviance (i.e. the log-likelihood), so there is no need to divide by it.
> [!definition|*] Cook's Distance in GLM
> Replacing $\hat{y}$ with $\hat{\mu}$ in the OLS Cook's distance gives the **GLM Cook's distance**: $\mathrm{cd}_{i}:= \frac{\| \mathbf{W}^{1 / 2}\mathbf{X}(\hat{\beta}-\hat{\beta}_{-i}) \|^{2} }{\phi},$ where again $\hat{\beta}_{-i}$ are the coefficients fitted with all but the $i$th data point.
Similar rules of thumb from OLS also apply:
- The standardized residuals are asymptotically $N(0,1)$, so about $95\%$ of them should be within $\pm 2$ if the model is adequate.
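A sketch of how the leverages $h_{ii}$ and standardized residuals might be computed from the final IRLS weights; `X`, the diagonal weights `w`, the residuals `r_p`, `r_d`, and `phi_hat` are assumed to come from an earlier fit.

```python
import numpy as np

def glm_hat_values(X, w):
    """Diagonal of H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2} for diagonal weights w."""
    Xt = np.sqrt(w)[:, None] * X                   # reweighted data W^{1/2} X
    H = Xt @ np.linalg.solve(Xt.T @ Xt, Xt.T)      # full n x n hat matrix (fine for a sketch)
    return np.diag(H)

def standardized_residuals(r_p, r_d, h, phi_hat):
    """Standardized Pearson and deviance residuals given leverages h."""
    r_p_std = r_p / np.sqrt(phi_hat * (1 - h))     # Pearson: also divide by the dispersion
    r_d_std = r_d / np.sqrt(1 - h)                 # deviance: phi already in the loglik
    return r_p_std, r_d_std
```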
## Hypothesis Testing in GLM
Suppose the fitted coefficients are the MLE $\hat{\beta}$, and we wish to test for the significance of a few of them.
The most rudimentary test is whether all of them (usually excluding the intercept) are $0$, i.e. whether the null model is true; in this case we can resort to the likelihood ratio (aka the [[Deviance|null deviance]]):
> [!definition|*] Null-Deviance Test in GLM
> Asymptotically, if $H_{0}:\beta_{2}=\dots=\beta_{p}=0$ holds ($\beta_{1}$ assumed to be the intercept), the likelihood ratio statistic $\Lambda :=2(\log\mathrm{lik}(\hat{\beta})-\log\mathrm{lik}(\hat{\beta}^{(0)})) \overset{D}{\approx} \chi^{2}_{p-1}.$
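A minimal sketch of this test, assuming the fitted and null log-likelihoods are already available from an earlier fit and `p` counts the coefficients including the intercept:

```python
from scipy import stats

def null_deviance_test(loglik_full, loglik_null, p):
    """Likelihood-ratio test of H0: beta_2 = ... = beta_p = 0."""
    lam = 2.0 * (loglik_full - loglik_null)   # asymptotically chi^2_{p-1} under H0
    return lam, stats.chi2.sf(lam, p - 1)     # statistic and right-tail p-value
```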
For testing a single coefficient, we can resort to the asymptotic normality of the MLE, and use
> [!definition|*] Asymptotic Normality for Testing in GLM
> Using the asymptotic normality of the MLE $\hat{\beta}$, we have $\hat{\beta}_{j}\overset{D}{\approx}N(\beta_{j}, (I^{-1})_{jj}),$ where $I$ is the [[Information and Bounding Errors#Fisher and Observed Information|Fisher information matrix]] of $\hat{\beta}$; for the IRLS algorithm, it equals $X^{T}WX$ from the weighted least squares problem.
>
> Therefore, under $H_{0}:\beta_{j}=0$, we have the asymptotic test statistic $z:= \frac{\hat{\beta}_{j}}{\sqrt{ (I^{-1})_{jj} }} \overset{D}{\approx} N(0,1).$
More generally, the **Wald statistic** tests multiple coefficients at once:
> [!definition|*] Wald Statistic of GLM
> For testing the significance of a subset of the estimated coefficients $\hat{\beta}$, WLOG $\hat{\beta}_{1:k}$, let $\hat{\Sigma}_{1:k}$ be their estimated covariance matrix (e.g. obtained by subsetting the full $\hat{\Sigma}\approx I_{\beta}(X)^{-1}= (X^{T}WX)^{-1}$ from the IRLS algorithm).
>
> The **Wald statistic** is then given by $z^{2}:= \hat{\beta}_{1:k}^{T}\hat{\Sigma}_{1: k}^{-1} \hat{\beta}_{1: k}.$
> Under $H_{0}:\beta_{1: k}=0$, $z^{2}$ is asymptotically $\chi^{2}_{k}$.
- Therefore, with $k=1$, the above asymptotic normal test is a special case of the Wald statistic.
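A sketch of the Wald test, assuming `beta_hat` and the Fisher information `info` $=X^{T}WX$ come from the IRLS fit and `idx` indexes the coefficients under test; note the subset covariance is obtained by subsetting the full inverse information, not by inverting a subset of $X^{T}WX$.

```python
import numpy as np
from scipy import stats

def wald_test(beta_hat, info, idx):
    """Wald chi^2 test of H0: beta[idx] = 0, given the Fisher information X^T W X."""
    cov = np.linalg.inv(info)                 # asymptotic covariance of beta_hat
    b = beta_hat[idx]
    sub_cov = cov[np.ix_(idx, idx)]           # covariance of the tested subset
    z2 = b @ np.linalg.solve(sub_cov, b)      # beta^T Sigma^{-1} beta
    return z2, stats.chi2.sf(z2, len(idx))    # asymptotically chi^2_k under H0

# with idx of length 1 this reduces to the square of the z statistic above
```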
## GLM vs GAM
[[Generalized Additive Models]] (GAMs) have the form $g[\mu_{Y}(X)]=\alpha +\sum_{i} f_{i}(X_{i}),$ which allows non-linear functions $f_{i}(X_{i})$ when estimating $g(\mu)$; GLMs only use linear combinations of the original predictors.
## Resources
MIT OCW

[Cross Validated Post](https://stats.stackexchange.com/questions/40876/what-is-the-difference-between-a-link-function-and-a-canonical-link-function)