> [!info] This material corresponds to chapter 3 in the ESL.
> The main focus is on fitting the models as machine learning methods.
>
> Refer to [[Inference in OLS|this note]] for the stochastic aspect of OLS (inferences and statistical tests).
>
> Refer to [[Orthogonal Projection, Confounding, and Missing Variable in OLS|this note]] for a discussion of misspecification in OLS and of OLS as an orthogonal-projection technique from linear algebra.
Consider the model $Y=g(X)+\epsilon$ where $g$ is a general function on $\mathbb{R}^{p}$, and $\epsilon$ is some additive error. A **linear model** of predictors $\mathbf{x}^{T}=(x_{1},x_{2},\dots,x_{p})$ has the form $f_{\beta,\beta_{0}}(\mathbf{x})=\beta_{0} + \beta^{T}\mathbf{x} =\beta_{0}+\sum_{j=1}^{p}\beta_{j}x_{j},$where $\beta \in \mathbb{R}^{p},\beta_{0}\in \mathbb{R}$ are the model's parameters. For a particular sample $i$ with features $\mathbf{x}_{i}^{T}=(x_{i1},x_{i2},\dots,x_{ip})$, the model's fitted value is $\hat{y}_{i}=f_{\beta}(\mathbf{x}_{i}).$*From now on, we assume one of the entries in $\mathbf{x}$ is the constant $1$* (WLOG the zeroth entry), so that its coefficient plays the role of the intercept and we do not have to set aside a separate $\beta_{0}$ -- this simplifies the model to $f_{\beta}(\mathbf{x})=\beta^{T}\mathbf{x}.$
Now the natural next step is to find the "best" $\beta$ -- but in what sense, and how do we solve for it?
## Least Squares Linear Regression
**Least squares regression** (or **ordinary least squares, OLS**) is a method of determining $\beta$ by minimizing the **residual sum of squares (RSS)** given by $\mathrm{RSS}= \sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}=\| \mathbf{y}-\mathbf{X}\beta \|_{2}^{2},$which is minimised by setting its derivative wrt. $\beta=(\beta_{0},\dots,\beta_{p})$ to $0$: $\nabla_{\!\beta}~\mathrm{RSS}= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\equiv \mathbf{0},$which gives the least squares coefficients $\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y},$assuming invertibility of $\mathbf{X}^{T}\mathbf{X}$, i.e. that the columns of $\mathbf{X}$ (the predictors) are linearly independent.
The predicted values are then $\hat{\mathbf{y}}=\mathbf{X}\hat{\beta}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y},$and the matrix $\mathbf{H}:= \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is called the **hat matrix**.
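As a minimal sketch of these closed forms in R (illustration only -- see the numerical caveat below; the simulated `X` and `y` here are hypothetical):
```r
# Minimal sketch: closed-form OLS on a simulated design (constant column absorbed into X)
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column is the constant 1
y <- X %*% c(2, 1, -0.5) + rnorm(n)                   # linear signal plus noise

beta_hat <- solve(t(X) %*% X, t(X) %*% y)             # (X'X)^{-1} X'y via a linear solve
H <- X %*% solve(t(X) %*% X) %*% t(X)                 # hat matrix
y_hat <- H %*% y                                      # fitted values, same as X %*% beta_hat
```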
Directly computing the closed forms above is very [[Numerical Stability|unstable]] and expensive (due to inverting a matrix). Instead, we can use the [[QR Factorisation#Solving Linear Systems|QR factorisation to solve least squares problems]]:
If $\mathbf{X}=\hat{Q}\hat{R}$ is the thin QR factorisation, we have $\begin{align*}
\hat{R}\hat{\beta}&= \hat{Q}^{T}\mathbf{y},\\
\hat{\mathbf{y}}&= \mathbf{X}\hat{\beta},
\end{align*}$where $\hat{\beta}$ can be found in a backward stable manner efficiently with back substitution, since $\hat{R}$ is upper-triangular.
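A sketch of the QR route using base R's `qr` machinery (reusing the simulated `X`, `y` above); `lm()` itself solves least squares this way internally:
```r
# Sketch: least squares via the thin QR factorisation (numerically preferable)
qr_X <- qr(X)                           # Householder QR of the design matrix
Q <- qr.Q(qr_X)                         # thin Q (n x p)
R <- qr.R(qr_X)                         # upper-triangular R (p x p)
beta_qr <- backsolve(R, t(Q) %*% y)     # back substitution solves R beta = Q'y
all.equal(c(beta_qr), c(beta_hat))      # agrees with the normal-equation solution
```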
## Penalty Shrinkage Methods
When there are a large number of predictors, OLS can overfit the data by learning spurious patterns that are just noise; the extra, useless coefficients inflate the variance of the predictions $\hat{y}$, making them unstable.
To mitigate this, shrinkage methods reduce the coefficients of variables -- ideally those of the irrelevant ones. The most common approach adds a penalty term against large coefficients, giving objectives of the form
$\hat{\beta}_{\lambda}=\min_{\beta} ~\text{risk}(\beta) + \lambda\cdot \text{penalty}(\beta),~~~~({\ast})$
where $\text{risk}$ can be just the $\mathrm{RSS}=\| \mathbf{y}-\mathbf{X}\beta \|^{2}$ like in OLS.
They are called **shrinkage** methods because the penalty causes the optimal $\hat{\beta}_{\lambda}$ to be shrunk towards $\mathbf{0}$, compared to the unpenalised OLS coefficient $\hat{\beta}_{\mathrm{OLS}}$. Naturally, the larger $\lambda$ is, the harder the shrinkage.
> [!implementations] Shrinkage methods in R
> `glmnet::glmnet(x, y, alpha, lambda)`
> - Specifying `alpha=0` runs the ridge regression, `alpha=1` runs the lasso, and anything in between for a mix of the two.
> - The best lambda: `cv.glmnet(x,y,alpha)$lambda.min`
> - Prediction: `predict(model, newx = new_predictors, s = lambda)`
> - Predict with the automatically selected lambda: `predict(cv_model, newx = new_predictors, s = "lambda.min")` (the `"lambda.min"` shortcut needs the `cv.glmnet` fit)
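A hedged end-to-end sketch of this workflow (the matrices `x`, `x_new` and the response `y` are assumed to exist; argument names follow `glmnet`'s interface):
```r
library(glmnet)

fit_ridge <- glmnet(x, y, alpha = 0)        # ridge path over glmnet's automatic lambda grid
fit_lasso <- glmnet(x, y, alpha = 1)        # lasso path

cv_lasso    <- cv.glmnet(x, y, alpha = 1)   # cross-validation over lambda
best_lambda <- cv_lasso$lambda.min

pred_manual <- predict(fit_lasso, newx = x_new, s = best_lambda)
pred_auto   <- predict(cv_lasso, newx = x_new, s = "lambda.min")  # cv.glmnet objects accept "lambda.min"
```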
### Shrinkage as Budget
As characterised in [[Equivalence of Budgets and Tradeoff in Optimisation]], the tradeoff in the objective $({\ast})$ is equivalent to optimising $\mathrm{risk}$ under a budget constraint $\mathrm{penalty}(\beta)\le t$ (where $t$ is a function of $\lambda$).
- Technically, the equivalence is with the dual problem of the constrained problem, but here we assume the risk and penalty functions are convex, so the problem is convex, and the dual and primal have the same optimisers.
Consider, for example, $\mathrm{penalty}(\beta)=\| \beta \|_{i}$ for $i=1,2$ (corresponding to the LASSO and ridge regression respectively, discussed later). The set of $\beta$ within the budget is then a ball under $\| \cdot \|_{i}$ (i.e. a diamond for $i=1$, and an actually round ball for $i=2$).
Solving the regression coefficients is then a convex optimisation problem minimising the risk (e.g. $\mathrm{RSS}$) within the constraints:
![[LassoRidgeOptimization.png#invert|center|w80]]
In the diagram, we see that the risk's level sets often first touch the $l_{1}$ diamond at a corner, where some of the coefficients (in this case $\beta_{1}$) are exactly $0$. This is not the case for the (hyper)sphere, which yields tiny but non-zero coefficients.
## Ridge Regression
Ridge regression encourages smaller coefficients with an [[Ridge Penalty|$l_{2}$ penalty term]]. It shrinks the coefficients towards $0$, but generally not exactly to $0$.
> [!definition|*] Ridge Regression
> Instead of minimizing the $\mathrm{RSS}$ like OLS, **ridge regression** finds coefficients $\hat{\beta}^{\text{\,ridge}} =(\hat{\beta}_{0},\dots,\hat{\beta}_{p})$ that minimize the objective $J_{\mathrm{ridge}}(\hat{\beta})=\mathrm{RSS}(\hat{\beta})+\underbrace{\lambda\sum_{i=1}^p \hat{\beta}_{i}^{2}}_{l_{2}\text{ penalty}}$where $\lambda$ is the **tuning parameter**, and a larger $\lambda$ encourages smaller coefficients.
>
> > [!remarks]
> > - To account for different scales, the predictors are assumed to be standardized.
> > - In the penalty term, the index starts at $1$ to avoid penalizing the intercept.
> [!exposition] Solving ridge regression coefficients
> In matrix form, the ridge regression objective is $J_{\mathrm{ridge}}=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)+\lambda \beta^{T}\beta,$which is minimized by setting its gradient to zero, giving$\hat{\beta}^{\text{\,ridge}}=(\mathbf{X}^{T}\mathbf{X}+\lambda I)^{-1}\mathbf{X}^{T}\mathbf{y}.$
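A minimal numerical sketch of this closed form (reusing the simulated `X`, `y` from the OLS sketch above, and ignoring the standardisation/intercept caveats of the remark purely to show the algebra):
```r
# Sketch: ridge coefficients via the closed form (X'X + lambda I)^{-1} X'y
lambda <- 5
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
# lambda -> 0 recovers the OLS solution; lambda -> Inf shrinks all coefficients to 0
```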
Ridge regression can also arise from a #bayesian model, following the interpretation of the ridge penalty as a prior:
- Let the coefficients have prior $\beta \sim N\left( \mathbf{0}, \frac{\sigma^{2}}{\lambda}I \right)$, the errors $\epsilon \sim N(\mathbf{0}, \sigma^{2}I)$, and a generation process of $Y=X\beta+\epsilon$.
- This gives a posterior mode/mean (the two coincide for a Gaussian posterior) of $\mathbb{E}[\beta \,|\, \mathbf{y},\mathbf{X}]=(\mathbf{X}^{T}\mathbf{X} + \lambda I)^{-1}\mathbf{X}^{T}\mathbf{y},$the same as the ridge estimator $\hat{\beta}^\mathrm{\,ridge}$.
### Mathematical Properties
The shrinking effect of ridge regression takes place in the [[Principal Component Analysis|principal component]] of $\mathbf{X}$:
- If the [[Singular Value Decomposition|SVD]] of $\mathbf{X}$ is $\mathbf{X}=\mathbf{UDV}^{T}$, then the principal components are the columns of $\mathbf{V}$, and the projections of $\mathbf{X}$ onto them are the columns of $\mathbf{UD}$.
- Substituting into the fitted values of ridge regression, $\begin{align*}
\mathbf{\hat{y}}^{\text{ ridge}}&= \mathbf{X}\hat{\beta}^{\text{\,ridge}}\\
&= \mathbf{X}(\mathbf{X}^{T}\mathbf{X}+\lambda I)^{-1}\mathbf{X}^{T}\mathbf{y}\\
&= \mathbf{UDV}^{T}(\mathbf{VD}^{2}\mathbf{V}^{T}+\lambda I)^{-1}\mathbf{VDU}^{T}\mathbf{y}\\
&= \mathbf{UD}(\mathbf{D}^{2}+\lambda I)^{-1}\mathbf{DU}^{T}\mathbf{y}\\
&= \sum_{j=1}^{p} \Big(\underbrace{ \frac{d_{j}^{2}}{d_{j}^{2}+\lambda}}_{\substack{\text{shrinking}\\\text{factor}}}\cdot\underbrace{\vphantom{ \frac{d_{j}^{2}}{d_{j}^{2}+\lambda}}({u}_{j}^{T}\mathbf{y})}_{\substack{\text{projection} \\ \text{onto }u_{j}}}\Big) {u}_{j}
\end{align*}$so the component of $\mathbf{y}$ in the direction of ${u_{j}}$ is shrunk by a factor of $\frac{d_{j}^{2}}{d_{j}^{2}+\lambda}$ -- *that is, the smaller the singular value $d_{j}$ (i.e. the less variance the principal component explains), the harder the shrinking*.
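The identity is easy to verify numerically; a sketch reusing `X`, `y` and `lambda` from the ridge sketch above:
```r
# Sketch: ridge fitted values via the SVD shrinkage formula
sv <- svd(X)                                     # thin SVD, X = U D V'
shrink <- sv$d^2 / (sv$d^2 + lambda)             # per-component shrinking factors
y_ridge_svd <- sv$u %*% (shrink * (t(sv$u) %*% y))
y_ridge <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
all.equal(c(y_ridge_svd), c(y_ridge))            # the two expressions agree
```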
The [[Degree of Freedom#Special Form for Linear Estimators|effective degrees of freedom]] measures the effective number of parameters under the current strength of shrinking:
![[Degree of Freedom#^918055]]
This coincides with the sum of the shrinking factors.
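Continuing the sketch, the two ways of computing the effective degrees of freedom (sum of shrinking factors vs. trace of the linear smoother) agree:
```r
# Sketch: effective degrees of freedom of the ridge fit
df_shrink <- sum(sv$d^2 / (sv$d^2 + lambda))                         # sum of shrinking factors
S_ridge <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))  # y_hat = S y
df_trace <- sum(diag(S_ridge))                                       # trace of the smoother matrix
all.equal(df_shrink, df_trace)
```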
> [!theorem|*] Duplicate Features Share Ridge Coefficients
> Fix $\lambda$, design matrix $\mathbf{X}$ and response $\mathbf{y}$.
> - Write $\mathbf{X}=(\mathbf{X}^{\ast},\mathbf{x}_{p})$, and make $k$ copies of the feature $\mathbf{x}_{p}$, giving the matrix $\mathbf{\tilde{X}}=(\mathbf{X}^{\ast},\mathbf{X}_{p})$, where $\mathbf{X}_{p}=(\mathbf{x}_{p},\dots,\mathbf{x}_{p})$ has $k$ columns.
> - Fit the ridge regression $\mathbf{y} \sim \mathbf{\tilde{X}}$, giving coefficients $\tilde{\beta}=(\tilde{\beta}^{\ast},\tilde{\beta}_{p},\dots,\tilde{\beta}_{p+k-1})$. Here $\tilde{\beta}^{\ast}$ are the coefficients of $\mathbf{X}^{\ast}$, and $\tilde{\beta}_{p},\dots,\tilde{\beta}_{p+k-1}$ those of the duplicate columns.
>
> Then $\tilde{\beta}_{p}=\cdots =\tilde{\beta}_{p+k-1}.$Furthermore, their sum $\gamma$ (i.e. the total contribution of the duplicated column) *solves the original ridge regression problem $\mathbf{y} \sim \mathbf{X}$ but with a weighted penalty*: $\underset{\beta=(\beta ^{\ast},\gamma)}{\arg\min}~\left\{ \| \mathbf{y}-\mathbf{X}\beta \| ^{2} +\lambda \| \beta ^{\ast} \|^{2}+\frac{\lambda}{k}\gamma^{2} \right\}.$Here the duplicated column's effect $\gamma$ is penalized less, by a factor of $\sqrt{ k }$ (on the coefficient scale).
>
> > [!proof]- Proof that $\tilde{\beta}_{p}=\dots =\tilde{\beta}_{p+k-1}$
> > Consider the ridge objective in regressing $\mathbf{y} \sim \mathbf{\tilde{X}}$: $\begin{align*}
> > J(\beta;\mathbf{\tilde{X}}, \mathbf{y},\lambda)&:= \| \mathbf{y}-\mathbf{\tilde{X}}\beta \|^{2} + \lambda \| \beta \| ^{2}\\
> > &\hphantom{:}= \underbrace{\vphantom{\sum_{i=p}^{p+k-1}}\| \mathbf{y}-\mathbf{X}^{\ast}\beta ^{\ast}-(\beta_{p} + \cdots + \beta_{p+k-1})\mathbf{x}_{p} \|^{2} }_{\mathrm{RSS}} +\underbrace{ \lambda \| \beta ^{\ast} \| ^{2}+\lambda \sum_{i=p}^{p+k-1}\beta_{i}^{2}}_{\text{penalty}},
> > \end{align*}$where $\beta ^{\ast}\in \mathbb{R}^{p-1}$ are the coefficients of $\mathbf{X}^{\ast}$ and $\beta=(\beta ^{\ast},\beta_{p},\dots,\beta_{p+k-1})$.
> >
> > The problem is symmetric in $\beta_{p},\dots,\beta_{p+k-1}$, and given any optimizer $\tilde{\beta}$, permuting their values gives another optimizer.
> >
> > If any two of these values differed, permuting them would yield a second, distinct minimizer, contradicting the strict convexity of the ridge objective (which guarantees a unique minimizer). Hence $\tilde{\beta}$ must be invariant under such permutations, i.e. $\tilde{\beta}_{p}=\dots=\tilde{\beta}_{p+k-1}$.
>
> > [!proof]- Proof of equivalence with weighted ridge
> > Substitute $\gamma:= \sum_{j=p}^{p+k-1}\beta_{j}$ to get the new objective$\begin{align*} J'(\beta ^{\ast},\gamma~;\mathbf{X},\mathbf{y},\lambda)&= \left\| \mathbf{y}-\mathbf{X}\begin{pmatrix} \beta ^{\ast} \\ \gamma\end{pmatrix} \right\| ^{2} +\lambda \| \beta ^{\ast} \|^{2}+ \lambda k\left( \frac{\gamma}{k} \right)^{2}\\ &= \left\| \mathbf{y}-\mathbf{X}\begin{pmatrix} \beta ^{\ast} \\ \gamma\end{pmatrix} \right\| ^{2}+\lambda \left\|\begin{pmatrix} \beta ^{\ast} \\ \gamma / \sqrt{ k }\end{pmatrix} \right\| ^{2}.
> > \end{align*}$
> > Therefore, the problem is equivalent to an ordinary ridge regression without duplicates, but *with a weighted penalty that penalizes the duplicated column less, by a factor of $\sqrt{ k }$.*
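A quick numerical check of the theorem (a self-contained sketch with made-up data; the ridge solver is just the bare closed form with a per-coefficient penalty):
```r
# Sketch: duplicating a column k times splits its ridge coefficient equally
ridge_solve <- function(X, y, penalty) solve(crossprod(X) + diag(penalty), crossprod(X, y))

set.seed(7)
k <- 3; lambda <- 2
X_star <- matrix(rnorm(50 * 2), 50, 2)            # the other features
x_p    <- rnorm(50)                               # the feature to be duplicated
y_dup  <- rnorm(50)

X_tilde    <- cbind(X_star, matrix(rep(x_p, k), ncol = k))   # k copies of x_p appended
beta_tilde <- ridge_solve(X_tilde, y_dup, rep(lambda, 2 + k))
beta_tilde[3:(2 + k)]                             # the duplicate coefficients are equal

gamma_dup <- sum(beta_tilde[3:(2 + k)])           # total contribution of the duplicates
beta_wtd  <- ridge_solve(cbind(X_star, x_p), y_dup, c(lambda, lambda, lambda / k))
all.equal(gamma_dup, beta_wtd[3])                 # gamma solves the lambda/k-weighted ridge
```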
Ridge regression is also equivalent to fitting OLS on an augmented dataset that includes $q$ columns of independent noise $\pmb{\epsilon}_{j}\overset{\mathrm{iid.}}{\sim}N\left( 0, \frac{\lambda}{q} \right), ~~ j=1,\dots,q,$and if the fitted coefficients of the non-noise features are $\hat{\beta}_{q}$ (taking the minimum-norm solution when the augmented problem is underdetermined), then $\hat{\beta}_{q} \xrightarrow{\mathrm{a.s.}}\hat{\beta}_{\lambda} ~~ \text{as } q \to \infty,$where $\hat{\beta}_{\lambda}$ are the ridge-regularised coefficients; a numerical sketch follows the bullets below. ^cd8b6d
- Source: [this Cross Validated post](https://stats.stackexchange.com/questions/328630/is-ridge-regression-useless-in-high-dimensions-n-ll-p-how-can-ols-fail-to?ref=quantymacro.com), quoted in [quantymacro's article](https://www.quantymacro.com/when-a-ridge-regression-maxi-meets-random-forest/).
- Therefore, in OLS involving a lot of noisy features, the non-noise features get penalised in some sense, and sometimes a [negative ridge penalty](http://www.tandfonline.com/doi/abs/10.1080/03610928308828440) can help.
- Conversely, adding random noise as features can act as some form of regularisation.
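A rough, self-contained simulation of the quoted result (the augmented OLS fit is taken to be the minimum-norm solution, computed via the Moore-Penrose pseudoinverse; agreement is only approximate for finite $q$):
```r
# Sketch: appending q noise columns ~ N(0, lambda/q) and taking the minimum-norm
# least squares fit approximately reproduces the ridge coefficients
library(MASS)                                             # for ginv()
set.seed(42)
n0 <- 50; p0 <- 5; lambda <- 3; q <- 10000
X0 <- matrix(rnorm(n0 * p0), n0, p0)
y0 <- X0 %*% rnorm(p0) + rnorm(n0)

E <- matrix(rnorm(n0 * q, sd = sqrt(lambda / q)), n0, q)  # noise features
beta_aug   <- ginv(cbind(X0, E)) %*% y0                   # minimum-norm OLS on [X0, E]
beta_ridge <- solve(crossprod(X0) + lambda * diag(p0), crossprod(X0, y0))
cbind(noise_aug = beta_aug[1:p0], ridge = c(beta_ridge))  # approximately equal
```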
## The Lasso
> [!definition|*] The lasso
> Similar to ridge regression, **the lasso** also penalizes large coefficients, but with their $l_{1}$ norm: it finds ${\beta}$ that minimizes $\text{lasso error}=\frac{1}{2}\text{RSS}+\underbrace{\lambda\sum_{i=1}^p|\beta_{i}|}_{l_{1}\text{ penalty}}.$
Because of the $l_{1}$ penalty, the lasso coefficients have no closed form; they can nevertheless be computed -- indeed the whole regularisation path can -- with roughly the same cost as a ridge regression fit.
As noted in [[#Shrinkage as Budget]], although a shrinkage method, *the lasso also performs variable selection* by shrinking some coefficients to exactly $0$.
Setting the derivative of the lasso objective with respect to an active coefficient $\beta_{j}$ to zero (scaling the RSS term by $\frac{1}{2n}$ rather than $\frac{1}{2}$) gives the **subgradient condition** $\frac{1}{n} \left< \mathbf{x}_{j},\mathbf{y}-\mathbf{X}\beta \right>=\lambda \cdot\mathrm{sign}(\beta_{j}),$which holds whenever $j$ is in the set of active predictors indexed by $\mathcal{A}$.
- Assuming the data have been centered, every active predictor $\mathbf{x}_{j}$ has the same absolute (sample) covariance with the residual, namely $\pm\lambda$.
In matrix form, this is $\mathbf{X}_{\mathcal{A}}^{T}(\mathbf{y}-\mathbf{X}_{\mathcal{A}}\beta_{\mathcal{A}})=n\lambda \cdot\mathrm{sign}(\beta_{\mathcal{A}}).$Solving for $\beta_{\mathcal{A}}$ gives $\beta_{\mathcal{A}}=(\mathbf{X}^{T}_{\mathcal{A}}\mathbf{X}_{\mathcal{A}})^{-1}\left(\mathbf{X}_{\mathcal{A}}^{T}\mathbf{y} - n\lambda \cdot \mathrm{sign}(\beta_{\mathcal{A}})\right), $so as long as $\mathcal{A}$ is constant (i.e. no predictor becomes inactive/active) and $\mathrm{sign}(\beta_{\mathcal{A}})$ does not change, the coefficients $\beta_{\mathcal{A}}$ are affine in $\lambda$. *In other words, each coefficient $\beta_{j}$ has a piecewise-linear trajectory when plotted against $\lambda$.*
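A sketch visualising this with `glmnet` (hypothetical `x`, `y` as in the earlier implementation note; `glmnet` solves on a grid of $\lambda$ values, so the plot approximates the exact path):
```r
# Sketch: lasso coefficient paths plotted against lambda are piecewise linear,
# with kinks where the active set or a sign changes
library(glmnet)
fit <- glmnet(x, y, alpha = 1)
matplot(fit$lambda, t(as.matrix(fit$beta)), type = "l", lty = 1,
        xlab = expression(lambda), ylab = "coefficient")
```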
For fitting **generalized lasso models** (i.e. [[Generalized Linear Models|GLMs]] with an $l_{1}$ penalty), we can use the objective $\text{lasso objective}=\text{(some suitable loss)}+\lambda \| \hat{\pmb{\beta}} \| _{1},$e.g. for logistic regression, we can use$\underbrace{-\frac{1}{n}\sum_{i}\Big[y_{i}\log \hat{\mu}_{i}+(1-y_{i})\log (1-\hat{\mu}_{i})\Big]}_{\text{cross-entropy}}+\lambda \| \hat{\pmb{\beta}} \|_{1}.$However, an issue is that in general *the gradient of the loss is no longer linear in $\pmb{\beta}$ -- here $\hat{\mu}_{i}$ depends on $\hat{\pmb{\beta}}$ through a nonlinear link -- so the coefficient path is no longer piecewise linear*.
- We can no longer explicitly solve for the exact point at which a new variable joins the active set.
- Instead, we need to actually nudge the coefficients along their path, which is more costly.
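For the penalised logistic regression above, the same `glmnet` interface applies with `family = "binomial"` (a sketch; `x`, a binary response `y_bin`, and `x_new` are assumed):
```r
# Sketch: l1-penalised logistic regression (a generalized lasso model)
fit_logit <- glmnet(x, y_bin, family = "binomial", alpha = 1)
cv_logit  <- cv.glmnet(x, y_bin, family = "binomial", alpha = 1)
predict(cv_logit, newx = x_new, s = "lambda.min", type = "response")  # fitted probabilities
# the solution path here is computed numerically on a lambda grid, not in closed form
```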
### Least Angle Regression
**Least Angle Regression (LAR)** is a step-wise regression method that gradually builds up the coefficients $\hat{\beta}$ of the predictors.
- At each step, LAR has a set of active predictors whose coefficients are being modified; variables are gradually added to the set.
- In particular, the active predictors all have the same absolute correlation with the current residual; when an inactive variable ties the active ones in absolute correlation, it is added to the active set.
- In spirit, LAR coefficients growing in time are similar to lasso coefficients as a function of decreasing $\lambda$.
> [!algorithm] Least Angle Regression
> $[0]$ Begin with the empty model $\hat{\beta}=(0,\dots,0)$, and the active set $A_{0}=\emptyset$. Normalize and center the predictors. Initialize the residuals to be $\mathbf{r}=\mathbf{y}$.
>
> $[1]$ Activate the predictor with the strongest correlation to $\mathbf{r}$ (say $\mathbf{x}_{1}$), giving $A_{1}=\{ \mathbf{x}_{1} \}$.
>
> For $[k=2,...]$:
> - Move the coefficients $\hat{\beta}_{A_{k}}$ in the direction of $\tilde{\beta}_{A_{k}}$, the OLS coefficients of $\mathbf{r} \sim A_{k}$.
> - Recompute the residual $\mathbf{r}$; its correlation with all predictors in $A_{k}$ should have decreased, while still being equal across $\mathbf{x}_{j} \in A_{k}$.
> - Keep moving until a new variable ties with $A_{k}$ in correlation with $\mathbf{r}$. Append it to $A_{k}$ and go to $[k+1]$.
>
> $[\mathrm{END}]$ The algorithm terminates after $\min(p,N-1)$ steps: if $p<N$ the model has reached the full OLS fit, and if $p\ge N-1$ a perfect (zero-residual) fit is reached.
> - Record the sequence of coefficients $\hat{\beta}_{A_{k}}$ at break points when a new variable is added, as well as the (shared) correlation $\lambda_{k}$ at that point.
> - Return the sequence $\{ \hat{\beta}_{A_{k}},\lambda_{k} \}_{k=1,\dots}$.
LAR can be performed efficiently to compute the entire path of fitted coefficients:
- Since the correlations $\left< \mathbf{r}, \mathbf{x}_{j} \right>$ change linearly as the coefficients move along a fixed direction, *we can compute the exact moment when a new variable catches up with the active set $A_{k}$*; there is no need for incremental nudges in the for loop.
- With a modification, LAR is an efficient algorithm for computing the entire lasso path as $\lambda$ decreases from $\infty$ to $0$: in particular, *if a predictor is dropped from the active set as soon as its coefficient hits $0$ (and the direction is recomputed), then LAR returns the lasso path* -- see the sketch below.
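The `lars` package implements both variants; a sketch, assuming it is installed and `x`, `y` are as before:
```r
# Sketch: exact LAR and lasso paths via the lars package
library(lars)
fit_lar   <- lars(x, y, type = "lar")     # plain least angle regression
fit_lasso <- lars(x, y, type = "lasso")   # LAR with the dropping modification => lasso path
plot(fit_lasso)                           # piecewise-linear coefficient profiles
fit_lasso$lambda                          # the shared correlation at each breakpoint
```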
### Extensions to the Lasso
Coefficients estimated by the LASSO are biased towards $0$ and, for a fixed level of shrinkage, are not consistent: they do not converge to the true coefficients as $N \to \infty$, even if the data-generating model is linear.
Some alternatives use the LASSO for predictor selection only: suppose [[Cross Validation|CV]] selects the lasso model with $\lambda=\lambda_{0}$ and active predictors $A_{0}$,
- One can fit an OLS of $\mathbf{y} \sim\mathbf{X}_{A_{0}}$ to obtain coefficients $\hat{\beta}_{A_{0}}^{~\mathrm{OLS}}$.
- **Relaxed lasso** \[[[BestSubsetForwardStepwiseLasso.pdf|paper]]\] fits another lasso model of $\mathbf{y} \sim \mathbf{X}_{A_{0}}$, whose new $\lambda=\lambda_{1}$ is selected by CV.
- Alternatively, one can linearly interpolate between $\hat{\beta}_{A_{0}}^{~\mathrm{LASSO}}$ and $\hat{\beta}_{A_{0}}^{~\mathrm{OLS}}$ with some hyperparameter $\mu$ to get $\hat{\beta}_{A_{0}}^{~\mathrm{relaxed}}=\mu\hat{\beta}_{A_{0}}^{~\mathrm{LASSO}} + (1-\mu)\hat{\beta}_{A_{0}}^{~\mathrm{OLS}}$.
- Any of these techniques *separate the variable selection* (controlled by $\lambda_{0}$ and the first LASSO fit) *from the shrinkage* (controlled by the second fit).
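A manual sketch of the two-stage recipe (select with the lasso, then re-estimate; `x`, `y` hypothetical as before, and the interpolation weight `mu` chosen arbitrarily here):
```r
# Sketch: lasso for variable selection, OLS for re-estimation, optional interpolation
cv_fit  <- cv.glmnet(x, y, alpha = 1)
b_lasso <- as.vector(coef(cv_fit, s = "lambda.min"))[-1]   # drop the intercept
active  <- which(b_lasso != 0)                             # the selected set A_0

b_ols <- coef(lm(y ~ x[, active]))[-1]                     # OLS refit on the active set

mu <- 0.5                                                  # interpolation hyperparameter
b_relaxed <- mu * b_lasso[active] + (1 - mu) * b_ols       # between lasso and OLS shrinkage
# recent glmnet versions expose a similar idea directly via glmnet(..., relax = TRUE)
```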
Reducing the penalty on larger coefficients also helps:
- Smoothly clipped absolute deviation (SCAD) penalty $J_{a}(\beta,\lambda)$ is an alternative, where $a$ is a parameter controlling the range of $\beta$ for which the penalty is capped. It is not convex, so computation is a lot harder.
- The **adaptive lasso** uses a weighted penalty $\sum_{j}w_{j}|\beta_{j}|$ with weights $w_{j}=|\hat{\beta}_{j}|^{-\nu}$, where $\hat{\beta}_{j}$ is the OLS coefficient and $\nu$ is a tuning parameter. The penalty is convex and yields consistent estimates.
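The adaptive lasso fits naturally into `glmnet` through its per-coefficient penalty weights (a sketch; it needs more samples than predictors for the pilot OLS fit):
```r
# Sketch: adaptive lasso via a weighted l1 penalty
nu    <- 1
b_ols <- coef(lm(y ~ x))[-1]                  # pilot OLS coefficients
w     <- abs(b_ols)^(-nu)                     # weights w_j = |beta_hat_j|^{-nu}
fit_adaptive <- glmnet(x, y, alpha = 1, penalty.factor = w)
cv_adaptive  <- cv.glmnet(x, y, alpha = 1, penalty.factor = w)
```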
The $l_{1}$ penalty can be generalised:
- Instead of taking the $l_{1}$ norm of the entire coefficient vector, we can group the coefficients into sub-vectors $\beta=(\theta_{1},\dots,\theta_{q})$ and penalize the mixed norm $\sum_{k} \| \theta_{k} \|_{2}, $giving results sparse in the groups $\theta_{k}$, i.e. selection of whole groups of variables (the group lasso).
- In models like graphs and matrices, particular $l_{1}$-like penalties (e.g. sum of singular values) can give similar sparse results.
- $l_{1}$ penalties in sparse PCA give loadings that are sparse in the original variables.