> [!tldr]
> **Iteratively Reweighted Least Squares (IRLS)** is an iterative algorithm that solves the MLE of a [[Generalized Linear Models|GLM]].
>
> It iteratively solves a series of weighted least squares problems, whose coefficients converge to the MLE.
>
> It is equivalent to applying the Newton-Raphson method to maximize the log-likelihood.

Consider the GLM problem $Y \sim X$ with $\mu:= \mathbb{E}[Y]$ modeled by $g(\mu)=\beta^{T}X=: \eta$, where the responses $Y=(Y_{1},\dots,Y_{n})$ are from the same exponential dispersion family.

Applying **Newton-Raphson** to solve the MLE gives the iterations $\beta ^{(k+1)}=\beta ^{(k)}+J^{-1}\frac{ \partial l }{ \partial \beta }\Big|_{\beta=\beta ^{(k)}},$ where $J$ is the observed information, and the **Fisher scoring algorithm** replaces $J$ with the expected information $I$. They are equal when $g$ is the [[Link Function#Canonical Link Function|canonical link]].

```mermaid
flowchart LR
    Y[Y] --> Err[Error]
    MU[MU] --> Err
    Err --> |Nudge| Z[Z]
    subgraph GLM["GLM"]
        X[X] --> ETA
        BETA[BETA] --> ETA
        ETA --> |Link| MU
    end
    Z ---> |refit| GLM
```

> [!algorithm] The IRLS Algorithm
> $[0]$ Initialization.
> - Set $\mu_{i}^{(0)}=\bar{y}$ or $y_{i}$ for all $i$.
> - Compute the linear predictors $\eta_{i}^{(0)}=g(\mu_{i}^{(0)})$ term-wise.
> - Compute the errors $e_{i}^{(0)}=y_{i}-\mu_{i}^{(0)}$ ($=y_{i}-\bar{y}$ under the first initialization).
> - Nudge the linear predictor along the gradient $(g^{-1})'$ to get $\begin{align*} z_{i}^{(0)}&= \eta_{i}^{(0)}+e_{i}^{(0)} / (g^{-1})'(\eta^{(0)}_{i})\\ &= \eta_{i}^{(0)}+e_{i}^{(0)}\cdot g'(\mu_{i}^{(0)}) \end{align*}$
> - Compute the weights $W^{(0)}_{ii}=\frac{1}{\phi V(\mu) g'(\mu)^{2}}\Bigg|_{\mu=\mu^{(0)}_{i}}.$
>
> $[k=0,\dots]$ Iterate until convergence in $\beta^{(k)}$:
> - Solve the weighted LS problem $z^{(k)} \sim X$ with weights $W_{ii}^{(k)}$, giving coefficients $\beta^{(k+1)}=(X^{T}W^{(k)}X)^{-1}X^{T}W^{(k)}z^{(k)}.$
> - Update:
>     - The linear predictor $\eta ^{(k+1)}=X\beta ^{(k+1)}$.
>     - The predicted means $\mu_{i} ^{(k+1)}=g^{-1}(\eta_{i} ^{(k+1)})$.
>     - The errors $e_{i}^{(k+1)}=y_{i}-\mu_{i}^{(k+1)}$.
>     - The nudged linear predictors $z_{i}^{(k+1)}= \eta_{i}^{(k+1)}+e_{i}^{(k+1)}\cdot g'(\mu_{i}^{(k+1)})$.
>     - The new weights $W^{(k+1)}_{ii}=\frac{1}{\phi V(\mu) g'(\mu)^{2}}\Bigg|_{\mu=\mu^{(k+1)}_{i}}.$

> [!exposition] IRLS with Canonical Links
> The expressions for the IRLS components are significantly simpler if $g$ is chosen to be the canonical link:
>
> Canonical links have ${g'(\mu)} = 1 / V(\mu)$, so $W_{ii}=V(\mu) / \phi$, and $W^{(k+1)}=\frac{1}{\phi}\mathrm{diag}(V(\mu ^{(k+1)}_{1}),\dots,V(\mu ^{(k+1)}_{n})).$
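To make the updates concrete, here is a minimal NumPy sketch of the iteration above for a Poisson GLM with the canonical log link, so $V(\mu)=\mu$, $g'(\mu)=1/\mu$, and (taking $\phi=1$) $W_{ii}=\mu_{i}$. The function `irls_poisson` and the toy data are illustrative, not from any library.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Minimal IRLS sketch for a Poisson GLM with the canonical log link (phi = 1)."""
    n, p = X.shape
    mu = np.clip(y.astype(float), 0.5, None)   # initialize mu_i near y_i (kept positive)
    eta = np.log(mu)                            # eta = g(mu)
    beta = np.zeros(p)
    for _ in range(n_iter):
        # working response: z = eta + (y - mu) * g'(mu), with g'(mu) = 1/mu
        z = eta + (y - mu) / mu
        # weights: W_ii = 1 / (V(mu) g'(mu)^2) = mu for the canonical log link
        w = mu
        # weighted least squares step: beta = (X^T W X)^{-1} X^T W z
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
        eta = X @ beta
        mu = np.exp(eta)
    return beta

# toy usage with hypothetical data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = rng.poisson(np.exp(0.3 + 0.7 * X[:, 1]))
print(irls_poisson(X, y))
```

Note that $\phi$ never appears in the sketch: it cancels between $W$ and the weighted least squares solution, as discussed in the dispersion section below.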
### Geometry of IRLS

Consider IRLS as a projection of $\mathbf{y} \in \mathbb{R}^{n}$ onto a constrained set of $\pmb{\mu} \in \mathbb{R}^{n}$ called the **model manifold**, which in the case of a GLM is given by $\{ \pmb{\mu}=g^{-1}(\pmb{\eta}) ~|~ \pmb{\eta}=\mathbf{X}\pmb{\beta} \},$ that is, the set obtained by applying $g^{-1}$ to all possible linear combinations of the features $\mathbf{X}$. In general, *the model manifold is a $p$-dimensional manifold.*

The **equivalent fit lines** in $\mathbb{R}^{n}$ are the sets of $\mathbf{y}$ that lead to the same MLE $\hat{\pmb{\beta}}$. They are lines (affine sets of dimension $n-p$) because the log-likelihood of an EDF looks like $\ell=\sum_{i}\frac{y_{i}\theta_{i}-\kappa(\theta_{i})}{\phi}+\mathrm{const.},$ which is linear in $\mathbf{y}$: for a fixed $\hat{\pmb{\beta}}$, the score equations $\partial \ell / \partial \pmb{\beta}=0$ are linear in $\mathbf{y}$, so the responses sharing that MLE form an affine subspace of dimension $n-p$.

The following is a GLM problem with $p=2$, $n=3$ (for low-dimensional visualizations):

![[GLMManifoldEquivalentFits.png#invert|center]]

> [!help]-
> - The left panel plots the dataset.
> - The right panel is in $\mathbb{R}^{3}$, i.e. the space of all possible $\mathbf{y}$. The surface is the $2$-manifold of the GLM, and the lines are equivalent fit lines. The solid dot is the observed value of $\mathbf{y}$.

Therefore, *our sole goal is to find the MLE by simply projecting $\mathbf{y}$ along its equivalent fit line onto the model manifold*. However, this is not directly feasible:

- We do not know the equivalent fit line, so the direction of the projection is unknown.
- The direction of orthogonal projection is known (i.e. by solving an OLS onto a tangent plane), but it does not give the correct $\hat{\beta}$ since *the equivalent fit lines are not orthogonal to the model manifold in general*.

IRLS essentially *transforms the data so that the equivalent fit lines are approximately orthogonal to the model manifold*, so that orthogonal projection gives decent estimates.

- However, the orthogonality only holds for the equivalent fit line crossing the current estimate $\pmb{\mu}^{(k)}$ (not the desired $\mathbf{y}$), so the estimate needs to be iterated several times to converge.
- Nevertheless, $\pmb{\mu}^{(k)}$ will (hopefully) converge to the point where $\mathbf{y}$'s equivalent fit line meets the manifold, so IRLS indeed converges toward the correct projection.
- Therefore, the orthogonal projection essentially projects $\mathbf{y}$ onto the **tangent space** (the hyperplane tangent to the model manifold) at $\pmb{\mu}^{(k)}$.
- Note that this tangent space is spanned by the columns of $\nabla_{\pmb{\beta}}\pmb{\mu}=\left( \frac{ \partial \mu_{i} }{ \partial \beta_{j} } \right) \in \mathbb{R}^{n \times p}$.

Translated into geometric language, IRLS performs the following steps. Steps (a) ~ (d) correspond to the plot below (for a GLM with $n=2$, $p=1$), and a numerical sketch of one iteration follows the plot.

> [!algorithm] IRLS as Geometric Transformations
> $[0]$ Initialize.
>
> For $[k=0,\dots]$ until convergence:
> - (a) Start in the original space of $\mathbf{y}$ (●) and $\pmb{\mu}^{(k)}$ (■).
>
> - (b) *Center* the dataset at $\pmb{\mu}^{(k)}$, transforming ● into $\mathbf{e}^{(k)}$, so that the subsequent rotation/scaling make sense.
>
> - (c) *Scale* the errors with the factors $g'(\pmb{\mu}^{(k)})=(g'(\mu_{1}^{(k)}),\dots,g'(\mu_{n}^{(k)}))$, so that the scaled tangent space lands in the column space of $\mathbf{X}$, and the projection actually equals $\mathbf{X}\hat{\pmb{\beta}}^{(k+1)}$ for some $\hat{\pmb{\beta}}^{(k+1)}$. At this point, ● has been transformed into $\mathbf{z}^{(k)}$.
>
> - (d) *Reorient* the entire dataset (both the transformed tangent space and the dataset $\mathbf{X}$) with the weights $\mathbf{W}$, so that the equivalent fit line going through (what used to be) $\pmb{\mu}^{(k)}$ is orthogonal to the (scaled and reoriented) tangent space and model manifold.
>
> - (e) *Project* ● onto the tangent space -- if ● and ■ are close (usually they are), the equivalent fit line going through ● will also be approximately orthogonal to the model manifold.
>
> - (f) *Estimate* the $\hat{\pmb{\beta}}^{(k+1)}$ corresponding to the projection, which amounts to solving an OLS in the transformed space, or a weighted OLS in the space obtained after (c).
>
> - (g) *Repeat* with $\pmb{\mu}^{(k+1)}:=g^{-1}(\mathbf{X}\hat{\pmb{\beta}}^{(k+1)})$ as ■.

![[IRLSSteps.png#invert]]
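The following is a minimal numerical sketch of a single iteration in the same illustrative Poisson/log-link setup as before (`beta_k` is an arbitrary current estimate), decomposing the update into center / scale / reweight / project and checking that the weighted OLS equals an ordinary projection in the $\mathbf{W}^{1/2}$-rescaled space.

```python
import numpy as np

# One IRLS step written as the geometric moves (b)-(f), for a Poisson GLM with
# log link; X, y, beta_k are illustrative placeholders.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.poisson(np.exp(0.2 + 0.5 * X[:, 1]))
beta_k = np.array([0.1, 0.2])                     # current estimate (the square)

eta = X @ beta_k
mu = np.exp(eta)                                  # current mu^(k) on the manifold
e = y - mu                                        # (b) center: errors e^(k)
z = eta + e / mu                                  # (c) scale by g'(mu) = 1/mu -> working response z^(k)
w = mu                                            # (d) reorient with weights W_ii = mu (canonical link)

# (e)-(f) project: weighted OLS of z on X gives beta^(k+1)
WX = X * w[:, None]
beta_next = np.linalg.solve(X.T @ WX, WX.T @ z)

# equivalently, an ordinary (unweighted) OLS in the W^{1/2}-transformed space
Xs, zs = np.sqrt(w)[:, None] * X, np.sqrt(w) * z
beta_check, *_ = np.linalg.lstsq(Xs, zs, rcond=None)
print(np.allclose(beta_next, beta_check))         # True: same projection
```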
### Variance of IRLS Estimates

The term *$X^{T}WX$ found in the weighted regression step of IRLS is in fact the Fisher information matrix of the coefficients $\beta$*, assuming that the response $Y$ follows an [[Exponential Dispersion Family|EDF]]. Hence, as a standard result for MLEs, the covariance of the parameter estimates $\hat{\beta}$ is approximately $\mathrm{Cov}(\hat{\beta})\approx(X^{T}WX)^{-1}.$

Asymptotically, the normality of the MLE gives $\hat{\beta}\approx N(\beta, I^{-1})\approx N(\beta, (X^{T}WX)^{-1}),$ where $I^{-1}$ is replaced with its estimate $(X^{T}WX)^{-1}$ ($W$ is evaluated at the MLE $\hat{\beta}$). This gives the approximate $(1-\alpha)$ confidence interval for $\beta_{j}$: $\hat{\beta}_{j}\pm z_{\alpha / 2}\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_{j})}= \hat{\beta}_{j} \pm z_{\alpha / 2} \sqrt{ \left[ (X^{T}WX)^{-1} \right]_{jj} }.$

### Estimating the EDF Dispersion Parameter $\phi$

For an EDF, *the IRLS algorithm itself does not require knowledge of $\phi$* (the dispersion parameter), which cancels out in all steps. Its value can be estimated from the fitted model as $\hat{\phi}=\frac{1}{n-p}\sum_{i=1}^{n} \frac{(y_{i}-\hat{\mu}_{i})^{2}}{V(\hat{\mu}_{i})}.$

- The matrix $W$ contains $\phi$, so $\hat{\phi}$ is still needed for estimating the parameter variances.
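Continuing the illustrative Poisson sketch from earlier (reusing the `X`, `y`, and `irls_poisson` defined there), a rough sketch of the Wald intervals and the Pearson estimate of $\phi$:

```python
import numpy as np

# Continues the earlier Poisson sketch: X, y, irls_poisson are the illustrative
# objects defined there (phi = 1 and W_ii = mu_i for the canonical log link).
beta_hat = irls_poisson(X, y)
mu_hat = np.exp(X @ beta_hat)

# X^T W X evaluated at the MLE is the (estimated) Fisher information.
XtWX = X.T @ (X * mu_hat[:, None])
cov = np.linalg.inv(XtWX)                      # approximate Cov(beta_hat)
se = np.sqrt(np.diag(cov))                     # standard errors

z = 1.959964                                   # z_{alpha/2} for a 95% interval
print(np.column_stack([beta_hat - z * se, beta_hat + z * se]))

# Pearson estimate of the dispersion; V(mu) = mu for the Poisson family,
# and phi_hat should be close to 1 for genuinely Poisson data.
n, p = X.shape
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - p)
print(phi_hat)
```

For a family where $\phi$ is not fixed at $1$, the estimated covariance above would be scaled by $\hat{\phi}$, which is why the last bullet notes that $\hat{\phi}$ is still needed for the variances.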