> [!info] This material corresponds to chapter 3 in the ESL.
> The main focus is on fitting the models as machine learning methods.
>
> Refer to [[Inference in OLS|this note]] for the stochastic aspect of OLS (inferences and statistical tests).
>
> Refer to [[Orthogonal Projection, Confounding, and Missing Variable in OLS|this note]] for a discussion of misspecification of OLS and of OLS as a projective linear algebra technique.

Consider the model $Y=g(X)+\epsilon$ where $g$ is a general function in $\mathbb{R}^{p}$, and $\epsilon$ is some additive error.

A **linear model** of predictors $\mathbf{x}^{T}=(x_{1},x_{2},\dots,x_{p})$ has the form $f_{\beta,\beta_{0}}(\mathbf{x})=\beta_{0} + \beta^{T}\mathbf{x} =\beta_{0}+\sum_{j=1}^{p}\beta_{j}x_{j},$where $\beta \in \mathbb{R}^{p},\beta_{0}\in \mathbb{R}$ are the model's parameters.

For a particular sample $i$ with features $\mathbf{x}_{i}^{T}=(x_{1i},x_{2i},\dots,x_{pi})$, the model's fitted value is $\hat{y}_{i}=f_{\beta}(\mathbf{x}_{i}).$

*From now on, we assume one of the entries in $\mathbf{x}$ is the constant $1$* (WLOG the zeroth entry), so that $\beta_{0}$ is the intercept term and we do not have to set it aside separately -- this simplifies the model to $f_{\beta}(\mathbf{x})=\beta^{T}\mathbf{x}.$

Now the natural next step is to find the "best" $\beta$ -- but in what sense, and how do we solve for it?

## Least Squares Linear Regression

**Least squares regression** (or **ordinary least squares, OLS**) is a method of determining $\beta$ by minimizing the **residual sum of squares (RSS)** given by $\mathrm{RSS}= \sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}=\| \mathbf{y}-\mathbf{X}\beta \|_{2}^{2},$which is minimized by setting its derivative wrt. $\beta=(\beta_{0},\dots,\beta_{p})$ to $\mathbf{0}$: $\nabla_{\!\beta}~\mathrm{RSS}= -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\equiv \mathbf{0},$which gives the least squares coefficients $\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y},$assuming invertibility of $\mathbf{X}^{T}\mathbf{X}$, i.e. that the predictors are all linearly independent. The predicted values are then $\hat{\mathbf{y}}=\mathbf{X}\hat{\beta}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y},$and the matrix $\mathbf{H}:= \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is called the **hat matrix**.

Directly computing the closed forms above is very [[Numerical Stability|unstable]] and expensive (due to inverting a matrix). Instead, we can use the [[QR Factorization#Solving Linear Systems|QR factorisation to solve least squares problems]]: if $\mathbf{X}=\hat{Q}\hat{R}$ is the thin QR, we have $\begin{align*} \hat{R}\hat{\beta}&= \hat{Q}^{T}\mathbf{y},\\ \hat{\mathbf{y}}&= \mathbf{X}\hat{\beta}, \end{align*}$where $\hat{\beta}$ can be found efficiently and in a backward stable manner with back substitution, since $\hat{R}$ is upper-triangular.

## Shrinkage Methods

Instead of selecting a subset of variables directly, shrinkage methods reduce the coefficients of variables -- ideally those of the irrelevant ones. Most commonly, shrinkage is done with a penalty term against large coefficients, e.g. in **ridge regression** and **the lasso**.

As the variance of the prediction $\hat{y}$ is quadratic in the coefficients $\hat{\beta}$, *shrinkage limits the variance incurred from including all predictors*.

> [!implementations] Shrinkage methods in R
> `glmnet::glmnet(x, y, alpha, lambda)`
> - Specifying `alpha=0` runs the ridge regression, `alpha=1` runs the lasso, and anything in between gives a mix of the two.
> - The best lambda: `cv.glmnet(x,y,alpha)$lambda.min`
> - Prediction: `predict(model, s=lambda, new_predictors)`
> - Predict with automatically selected lambda: `predict(model, s="lambda.min", new_predictors)`
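To make the callout concrete, here is a minimal sketch of that `glmnet` workflow on simulated data (the data, dimensions, and seed are made up for illustration):

```r
library(glmnet)

set.seed(42)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)               # glmnet expects the predictors as a matrix
y <- drop(x[, 1:5] %*% rep(2, 5)) + rnorm(n)  # only the first 5 predictors matter

ridge <- glmnet(x, y, alpha = 0)              # alpha = 0: ridge, over a path of lambdas
lasso <- glmnet(x, y, alpha = 1)              # alpha = 1: lasso

cv_lasso <- cv.glmnet(x, y, alpha = 1)        # cross-validate lambda
cv_lasso$lambda.min                           # best lambda by CV error

coef(cv_lasso, s = "lambda.min")                      # coefficients at the selected lambda
predict(cv_lasso, newx = x[1:3, ], s = "lambda.min")  # predictions for new rows
```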
## Ridge Regression

Ridge regression encourages smaller coefficients with a [[Ridge Penalty|$l_{2}$ penalty term]]. It shrinks the coefficients to generally non-$0$ values.

> [!definition|*] Ridge Regression
> Instead of minimizing the $\mathrm{RSS}$ like OLS, **ridge regression** finds coefficients $\hat{\beta}^{\text{\,ridge}} =(\hat{\beta}_{0},\dots,\hat{\beta}_{p})$ that minimize the objective $J_{\mathrm{ridge}}(\hat{\beta})=\mathrm{RSS}(\hat{\beta})+\underbrace{\lambda\sum_{i=1}^p \hat{\beta}_{i}^{2}}_{l_{2}\text{ penalty}},$where $\lambda$ is the **tuning parameter**, and a larger $\lambda$ encourages smaller coefficients.
>
> > [!remarks]
> > - To account for different scales, the predictors are assumed to be standardized.
> > - In the penalty term, the index starts at $1$ to avoid penalizing the intercept.

> [!exposition] Solving ridge regression coefficients
> In matrix form, the ridge regression objective is $J_{\mathrm{ridge}}=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)+\lambda \beta^{T}\beta,$which is minimized (by setting its gradient to $\mathbf{0}$) at $\hat{\beta}^{\text{\,ridge}}=(\mathbf{X}^{T}\mathbf{X}+\lambda I)^{-1}\mathbf{X}^{T}\mathbf{y}.$

Ridge regression can also arise from a #bayesian model, following the interpretation of the ridge penalty as a prior:
- Let the coefficients have prior $\beta \sim N\left( \mathbf{0}, \frac{\sigma^{2}}{\lambda}I \right)$, the errors $\epsilon \sim N(\mathbf{0}, \sigma^{2}I)$, and a generation process of $Y=X\beta+\epsilon$.
- This gives a posterior mode (MAP)/mean of $\mathbb{E}[\beta \,|\, \mathbf{X},\mathbf{y}]=(\mathbf{X}^{T}\mathbf{X} + \lambda I)^{-1}\mathbf{X}^{T}\mathbf{y},$the same as the ridge estimator $\hat{\beta}^\mathrm{\,ridge}$.

### Mathematical Properties

The shrinking effect of ridge regression takes place along the [[Principal Component Analysis|principal components]] of $\mathbf{X}$:
- If the [[Singular Value Decomposition|SVD]] of $\mathbf{X}$ is $\mathbf{X}=\mathbf{UDV}^{T}$, then the principal components are the columns of $\mathbf{V}$, and the projections of $\mathbf{X}$ onto them are the columns of $\mathbf{UD}$.
- Substituting into the fitted values of ridge regression, $\begin{align*} \mathbf{\hat{y}}^{\text{ ridge}}&= \mathbf{X}\hat{\beta}^{\text{\,ridge}}\\ &= \mathbf{X}(\mathbf{X}^{T}\mathbf{X}+\lambda I)^{-1}\mathbf{X}^{T}\mathbf{y}\\ &= \mathbf{UDV}^{T}(\mathbf{VD}^{2}\mathbf{V}^{T}+\lambda I)^{-1}\mathbf{VDU}^{T}\mathbf{y}\\ &= \mathbf{UD}(\mathbf{D}^{2}+\lambda I)^{-1}\mathbf{DU}^{T}\mathbf{y}\\ &= \sum_{j=1}^{p} \Big(\underbrace{ \frac{d_{j}^{2}}{d_{j}^{2}+\lambda}}_{\substack{\text{shrinking}\\\text{factor}}}\cdot\underbrace{\vphantom{ \frac{d_{j}^{2}}{d_{j}^{2}+\lambda}}({u}_{j}^{T}\mathbf{y})}_{\substack{\text{projection} \\ \text{onto }u_{j}}}\Big) {u}_{j} \end{align*}$so the component of $\mathbf{y}$ in the direction of ${u_{j}}$ is shrunk by a factor of $\frac{d_{j}^{2}}{d_{j}^{2}+\lambda}$ -- *that is, the smaller the singular value $d_{j}$ (i.e. the less variance along that principal direction), the harder the shrinking*.

The [[Degree of Freedom#Special Form for Linear Estimators|effective degrees of freedom]] measures the effective number of parameters under the current strength of shrinking:

![[Degree of Freedom#^918055]]

This coincides with the sum of the shrinking factors, $\mathrm{df}(\lambda)=\sum_{j=1}^{p} \frac{d_{j}^{2}}{d_{j}^{2}+\lambda}$.
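Below is a minimal numerical check of this SVD view of ridge on simulated data (standardized predictors, no intercept, as assumed above); the closed-form fit, the shrinking factors $d_{j}^{2}/(d_{j}^{2}+\lambda)$, and the effective degrees of freedom all agree:

```r
set.seed(1)
n <- 100; p <- 5; lambda <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(2, -1, 0, 0, 0)) + rnorm(n); y <- y - mean(y)

# Closed-form ridge coefficients and fitted values
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
yhat_ridge <- X %*% beta_ridge

# Same fitted values via the SVD: shrink each u_j-component of y by d_j^2 / (d_j^2 + lambda)
s <- svd(X)
shrink <- s$d^2 / (s$d^2 + lambda)
yhat_svd <- s$u %*% (shrink * crossprod(s$u, y))
max(abs(yhat_ridge - yhat_svd))   # ~ 0

# Effective degrees of freedom = trace of the ridge smoother = sum of the shrinking factors
sum(shrink)
sum(diag(X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)))   # same value
```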
> [!theorem|*] Duplicate Features Share Ridge Coefficients
> Fix $\lambda$, design matrix $\mathbf{X}$ and response $\mathbf{y}$.
> - Write $\mathbf{X}=(\mathbf{X}^{\ast},\mathbf{x}_{p})$, and make $k$ copies of the feature $\mathbf{x}_{p}$, giving the matrix $\mathbf{\tilde{X}}=(\mathbf{X}^{\ast},\mathbf{X}_{p})$, where $\mathbf{X}_{p}=(\mathbf{x}_{p},\dots,\mathbf{x}_{p})$ has $k$ columns.
> - Fit the ridge regression $\mathbf{y} \sim \mathbf{\tilde{X}}$, giving coefficients $\tilde{\beta}=(\tilde{\beta}^{\ast},\tilde{\beta}_{p},\dots,\tilde{\beta}_{p+k-1})$. Here $\tilde{\beta}^{\ast}$ are the coefficients of $\mathbf{X}^{\ast}$, and $\tilde{\beta}_{p},\dots,\tilde{\beta}_{p+k-1}$ those of the duplicate columns.
>
> Then $\tilde{\beta}_{p}=\cdots =\tilde{\beta}_{p+k-1}.$Furthermore, their sum $\gamma$ (i.e. the total contribution of the duplicated column) *solves the original ridge regression problem $\mathbf{y} \sim \mathbf{X}$ but with a weighted penalty*: $\underset{\beta=(\beta ^{\ast},\gamma)}{\arg\min}~\left\{ \| \mathbf{y}-\mathbf{X}\beta \| ^{2} +\lambda \| \beta ^{\ast} \|^{2}+\frac{\lambda}{k}\gamma^{2} \right\}.$Here the duplicated column's total effect $\gamma$ is penalized less: its penalty weight is $\lambda / k$, i.e. $\gamma$ is scaled down by a factor of $\sqrt{ k }$ on the coefficient scale before being penalized.
>
> > [!proof]- Proof that $\tilde{\beta}_{p}=\dots =\tilde{\beta}_{p+k-1}$
> > Consider the ridge objective in regressing $\mathbf{y} \sim \mathbf{\tilde{X}}$: $\begin{align*}
> > J(\beta;\mathbf{\tilde{X}}, \mathbf{y},\lambda)&:= \| \mathbf{y}-\mathbf{\tilde{X}}\beta \|^{2} + \lambda \| \beta \| ^{2}\\
> > &\hphantom{:}= \underbrace{\vphantom{\sum_{i=p}^{p+k-1}}\| \mathbf{y}-\mathbf{X}^{\ast}\beta ^{\ast}-(\beta_{p} + \cdots + \beta_{p+k-1})\mathbf{x}_{p} \|^{2} }_{\mathrm{RSS}} +\underbrace{ \lambda \| \beta ^{\ast} \| ^{2}+\lambda \sum_{i=p}^{p+k-1}\beta_{i}^{2}}_{\text{penalty}},
> > \end{align*}$where $\beta ^{\ast}\in \mathbb{R}^{p-1}$ are the coefficients of $\mathbf{X}^{\ast}$ and $\beta=(\beta ^{\ast},\beta_{p},\dots,\beta_{p+k-1})$.
> >
> > The problem is symmetric in $\beta_{p},\dots,\beta_{p+k-1}$: given any optimizer $\tilde{\beta}$, permuting their values gives another optimizer.
> >
> > If these values were not all equal, some permutation would yield a second, distinct optimizer, contradicting the strict convexity (hence uniqueness of the minimizer) of the problem. So $\tilde{\beta}$ must be invariant under such permutations, i.e. $\tilde{\beta}_{p}=\dots=\tilde{\beta}_{p+k-1}$.
>
> > [!proof]- Proof of equivalence with weighted ridge
> > Substitute $\gamma:= \sum_{j=p}^{p+k-1}\beta_{j}$ (so that each duplicate coefficient is $\gamma / k$) to get the new objective$\begin{align*} J'(\beta ^{\ast},\gamma~;\mathbf{X},\mathbf{y},\lambda)&= \left\| \mathbf{y}-\mathbf{X}\begin{pmatrix} \beta ^{\ast} \\ \gamma\end{pmatrix} \right\| ^{2} +\lambda \| \beta ^{\ast} \|^{2}+ \lambda k\left( \frac{\gamma}{k} \right)^{2}\\ &= \left\| \mathbf{y}-\mathbf{X}\begin{pmatrix} \beta ^{\ast} \\ \gamma\end{pmatrix} \right\| ^{2}+\lambda \left\|\begin{pmatrix} \beta ^{\ast} \\ \gamma / \sqrt{ k }\end{pmatrix} \right\| ^{2}. \end{align*}$
> >
> > Therefore, the problem is equivalent to an ordinary ridge regression without duplicates, but *with a weighted penalty that penalizes the duplicated column less, by a factor of $\sqrt{ k }$.*
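A quick numerical check of the theorem, using the closed-form ridge estimator on simulated data (no intercept, for simplicity):

```r
set.seed(2)
n <- 50; lambda <- 3; k <- 4
X_star <- matrix(rnorm(n * 3), n, 3)
x_p <- rnorm(n)
X <- cbind(X_star, x_p)                    # original design (p = 4)
X_dup <- cbind(X_star, matrix(x_p, n, k))  # x_p duplicated k times
y <- rnorm(n)

# Ridge with a (possibly weighted) diagonal penalty
ridge <- function(X, y, penalty) solve(crossprod(X) + diag(penalty), crossprod(X, y))

# Ridge on the duplicated design: the last k coefficients coincide
beta_dup <- ridge(X_dup, y, rep(lambda, ncol(X_dup)))
tail(beta_dup, k)                          # all equal

# Their sum solves the weighted ridge on the original design (penalty lambda / k on x_p)
gamma <- sum(tail(beta_dup, k))
beta_weighted <- ridge(X, y, c(rep(lambda, 3), lambda / k))
c(gamma, beta_weighted[4])                 # same value
```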
The ridge fit is also equivalent to fitting OLS on an augmented dataset that includes $q$ columns of independent noise $\pmb{\epsilon}_{j}\overset{\mathrm{iid.}}{\sim}N\left( 0, \frac{\lambda}{q} \right), ~~ j=1,\dots,q,$and if the fitted coefficients of the non-noise features are $\hat{\beta}_{q}$, then as $q \to \infty$, $\hat{\beta}_{q} \xrightarrow{\mathrm{a.s.}}\hat{\beta}_{\lambda},$where $\hat{\beta}_{\lambda}$ are the ridge-regularised coefficients. ^cd8b6d

- Source: [this Cross Validated post](https://stats.stackexchange.com/questions/328630/is-ridge-regression-useless-in-high-dimensions-n-ll-p-how-can-ols-fail-to?ref=quantymacro.com), quoted in [quantymacro's article](https://www.quantymacro.com/when-a-ridge-regression-maxi-meets-random-forest/).
- Therefore, in OLS involving a lot of noisy features, the non-noise features get penalised in some sense, and sometimes a [negative ridge penalty](http://www.tandfonline.com/doi/abs/10.1080/03610928308828440) can help.
- Conversely, adding random noise as features can act as some form of regularisation.

## The Lasso

> [!definition|*] The lasso
> Similar to ridge regression, **the lasso** also penalizes large coefficients, but with their $l_{1}$ norm: it finds ${\beta}$ that minimizes $\text{lasso error}=\frac{1}{2}\text{RSS}+\underbrace{\lambda\sum_{i=1}^p|\beta_{i}|}_{l_{1}\text{ penalty}}.$

Because of the $l_{1}$ penalty, the lasso's coefficients have no closed form; they can still be computed with the same order of complexity as ridge regression coefficients.

Although a shrinkage method, *the lasso also performs variable selection* by shrinking some coefficients to exactly $0$.

> [!exposition] Penalty Terms and Convex Optimization
> The $l_{1},l_{2}$ penalties are equivalent to giving a budget to the magnitude of the coefficients ${\hat{\beta}}$.
> - For $l_{1}$, the coefficients fitting this budget form a polytope (polyhedron in general $\mathbb{R}^{n}$).
> - The $l_{2}$ constraint in ridge regression gives (hyper)spheres instead.
>
> Solving the regression coefficients is then a convex optimization problem minimizing the $\mathrm{RSS}$ within the constraints:
>
> ![[LassoRidgeOptimization.png#invert]]
>
> - The $\mathrm{RSS}$ contours often first touch the constraint region at a corner of the $l_{1}$ polytope, where some of the coefficients (in this case $\beta_{1}$) are exactly $0$.
> - This is not the case for (hyper)spheres, where in this example $\beta_{1}$ is tiny but non-$0$.

Taking the (sub)derivative of the lasso objective with respect to $\beta_{j}$ gives the **subgradient condition** $\frac{1}{n} \left< \mathbf{x}_{j},\mathbf{y}-\mathbf{X}\beta \right>=\lambda \cdot\mathrm{sign}(\beta_{j}),$assuming $\beta_{j}\neq 0$, i.e. $j$ is in the set of active predictors indexed by $\mathcal{A}$.
- Assuming that the data has been centered, every active predictor $\mathbf{x}_{j}$ has covariance proportional to $\pm \lambda$ with the residual.

In matrix form, this is $\mathbf{X}_{\mathcal{A}}^{T}(\mathbf{y}-\mathbf{X}_{\mathcal{A}}\beta_{\mathcal{A}})=n\lambda \cdot\mathrm{sign}(\beta_{\mathcal{A}}).$Solving for $\beta_{\mathcal{A}}$ gives $\beta_{\mathcal{A}}=(\mathbf{X}^{T}_{\mathcal{A}}\mathbf{X}_{\mathcal{A}})^{-1}\left(\mathbf{X}_{\mathcal{A}}^{T}\mathbf{y} - n\lambda \cdot \mathrm{sign}(\beta_{\mathcal{A}})\right), $so as long as $\mathcal{A}$ is constant (i.e. no predictor becomes inactive/active) and $\mathrm{sign}(\beta)$ does not change, the coefficients $\beta$ are linear in $\lambda$. *In other words, each coefficient $\beta_{j}$ has a piece-wise linear trajectory when plotted against $\lambda$.*
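The subgradient condition can be checked numerically: `glmnet`'s Gaussian objective scales the RSS by $1/(2n)$, which matches the $1/n$ factor above, so at the fitted coefficients the active predictors' (scaled) inner products with the residual should sit at $\pm\lambda$. A minimal sketch on simulated, centered data with standardization and the intercept disabled:

```r
library(glmnet)

set.seed(3)
n <- 200; p <- 10; lambda <- 0.5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1:3] %*% c(3, -2, 1)) + rnorm(n); y <- y - mean(y)

fit <- glmnet(X, y, alpha = 1, lambda = lambda,
              standardize = FALSE, intercept = FALSE, thresh = 1e-12)
beta <- as.vector(as.matrix(coef(fit)))[-1]     # drop the (zero) intercept entry

grad <- drop(crossprod(X, y - X %*% beta)) / n  # (1/n) <x_j, y - X beta> for each j
active <- which(beta != 0)
cbind(grad = grad[active], rhs = lambda * sign(beta[active]))  # approximately equal
max(abs(grad[-active]))                         # <= lambda for the inactive predictors
```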
For fitting **generalized lasso models** (i.e. [[Generalized Linear Models|GLMs]] with a $l_{1}$ penalty), we can also use the lasso loss given by $\text{lasso loss}=\text{(some suitable loss)}+\lambda \| \hat{\pmb{\beta}} \| _{1},$e.g. for logistic regression, we can use$\underbrace{-\frac{1}{n}\sum_{i}\Big[y_{i}\log \hat{\mu}_{i}+(1-y_{i})\log (1-\hat{\mu}_{i})\Big]}_{\text{entropy}}+\lambda \| \hat{\pmb{\beta}} \|_{1}.$However, an issue with this is that in general, *the loss term is not quadratic in $\pmb{\beta}$, so its gradient is not linear in $\pmb{\beta}$* -- in this case $\hat{\mu}_{i}$ depends non-linearly on $\hat{\pmb{\beta}}$ through the link, so the path is no longer piecewise linear.
- We can no longer explicitly solve for the exact point at which a new variable joins the active set.
- Instead, we need to actually nudge the coefficients along their path, which is more costly.

### Least Angle Regression

**Least Angle Regression (LAR)** is a step-wise regression method that gradually increases the coefficients $\hat{\beta}$ of predictors.
- At each step, LAR has a set of active predictors whose coefficients are being modified; variables are gradually added to the set.
- In particular, the active predictors all have the same (absolute) correlation with the current residual; when an inactive variable ties the active ones in correlation, it is added to the active set.
- In spirit, LAR coefficients growing in time are similar to lasso coefficients as a function of decreasing $\lambda$.

> [!algorithm] Least Angle Regression
> $[0]$ Begin with the empty model $\hat{\beta}=(0,\dots,0)$, and the active set $A_{0}=\emptyset$. Normalize and center the predictors. Initialize the residuals to be $\mathbf{r}=\mathbf{y}$.
>
> $[1]$ Activate the predictor with the strongest correlation to $\mathbf{r}$ (say $\mathbf{x}_{1}$), giving $A_{1}=\{ \mathbf{x}_{1} \}$.
>
> For $[k=2,...]$:
> - Move the coefficients $\hat{\beta}_{A_{k}}$ in the direction of $\tilde{\beta}_{A_{k}}$, the OLS coefficients of $\mathbf{r} \sim A_{k}$.
> - Recompute the residual $\mathbf{r}$; its correlation with all predictors in $A_{k}$ should have decreased, while still being equal across $\mathbf{x}_{j} \in A_{k}$.
> - Keep moving until a new variable ties with $A_{k}$ in correlation with $\mathbf{r}$. Append it to $A_{k}$ and go to $[k+1]$.
>
> $[\mathrm{END}]$ The algorithm terminates after $\min(p,N-1)$ steps: if $p < N$, the model becomes the full OLS fit; if $p \ge N$, a perfect fit (zero residual) is reached after $N-1$ steps.
> - Record the sequence of coefficients $\hat{\beta}_{A_{k}}$ at break points when a new variable is added, as well as the (shared) correlation $\lambda_{k}$ at that point.
> - Return the sequence $\{ \hat{\beta}_{A_{k}},\lambda_{k} \}_{k=1,\dots}$.

LAR can be performed efficiently to compute the entire path of fitted coefficients:
- Since the correlations $\left< \mathbf{r}, \mathbf{x}_{j} \right>$ are inner products (hence linear in the step size), *we can compute the exact moment when a new variable catches up with the active set $A_{k}$*; there is no need for incremental nudges in the for loop.
- With a modification, LAR is an efficient algorithm for computing the entire lasso path as $\lambda$ decreases from $\infty$ to $0$: in particular, *if a predictor is deactivated from $A_{k}$ when its coefficient hits $0$, then LAR returns the lasso path*.
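Assuming the `lars` package (the R implementation accompanying the LAR paper) is available, a short sketch computes the full LAR and lasso paths and shows the piecewise-linear trajectories:

```r
library(lars)

set.seed(4)
n <- 100; p <- 8
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1:3] %*% c(2, -1.5, 1)) + rnorm(n)

fit_lar   <- lars(X, y, type = "lar")    # plain least angle regression
fit_lasso <- lars(X, y, type = "lasso")  # LAR with the lasso modification (drop variables hitting 0)

coef(fit_lasso)   # coefficients at each breakpoint of the path
plot(fit_lasso)   # piecewise-linear coefficient trajectories
```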
### Extensions to the Lasso

Coefficients estimated by the lasso are biased towards $0$ and are not consistent: the lasso coefficients do not converge to the true ones when $N \to \infty$, even if the data generating model is linear.

Some alternatives use the lasso for predictor selection only: suppose [[Cross Validation|CV]] selects the lasso model with $\lambda=\lambda_{0}$ and active predictors $A_{0}$,
- One can fit an OLS of $\mathbf{y} \sim\mathbf{X}_{A_{0}}$ to obtain coefficients $\hat{\beta}_{A_{0}}^{~\mathrm{OLS}}$.
- **Relaxed lasso** \[[[BestSubsetForwardStepwiseLasso.pdf|paper]]\] fits another lasso model of $\mathbf{y} \sim \mathbf{X}_{A_{0}}$, whose new $\lambda=\lambda_{1}$ is selected by CV.
- Alternatively, one can linearly interpolate between $\hat{\beta}_{A_{0}}^{~\mathrm{LASSO}}$ and $\hat{\beta}_{A_{0}}^{~\mathrm{OLS}}$ with some hyperparameter $\mu$ to get $\hat{\beta}_{A_{0}}^{~\mathrm{relaxed}}=\mu\hat{\beta}_{A_{0}}^{~\mathrm{LASSO}} + (1-\mu)\hat{\beta}_{A_{0}}^{~\mathrm{OLS}}$.
- Any of these techniques *separate the variable selection* (controlled by $\lambda_{0}$ and the first lasso fit) *from the shrinkage* (controlled by the second fit).

Reducing the penalty on larger coefficients also helps:
- The smoothly clipped absolute deviation (SCAD) penalty $J_{a}(\beta,\lambda)$ is an alternative, where $a$ is a parameter controlling the range of $\beta$ for which the penalty is capped. It is not convex, so computation is a lot harder.
- The adaptive lasso uses a weighted penalty $\sum_{j}w_{j}|\beta_{j}|$ with weights $w_{j}=|\hat{\beta}_{j}|^{-\nu}$, where $\hat{\beta}_{j}$ is the OLS coefficient and $\nu$ is a tuning parameter. The penalty is convex and yields consistent estimates.

The $l_{1}$ penalty can be generalised:
- Instead of computing the $l_{1}$ norm of the entire vector of coefficients $\beta$, grouping the coefficients into sub-vectors $\theta_{k}$, so that $\beta=(\theta_{1},\dots,\theta_{q})$, gives the group-wise $l_{1}$ "norm" $\| \beta \|_{1}=\sum_{k} \| \theta_{k} \|_{2}, $giving results sparse in $\theta$, i.e. selection of groups of variables (the **group lasso**).
- In models like graphs and matrices, particular $l_{1}$-like penalties (e.g. the sum of singular values) can give similar sparse results.
- $l_{1}$ penalties in sparse PCA give loadings that are sparse in the original variables.
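A minimal sketch of the "select with the lasso, then relax" idea above -- lasso for variable selection, an OLS refit on the active set, and the interpolation with $\mu$ (simulated data; recent `glmnet` versions also expose a built-in `relax` option):

```r
library(glmnet)

set.seed(5)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:4] %*% c(3, -2, 1.5, 1)) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1)                     # first fit: variable selection
beta_lasso <- as.vector(as.matrix(coef(cv_fit, s = "lambda.min")))[-1]
A0 <- which(beta_lasso != 0)                             # active set selected by the lasso

ols_fit <- lm(y ~ X[, A0, drop = FALSE])                 # OLS refit on the selected predictors
beta_ols <- unname(coef(ols_fit))[-1]

mu <- 0.5                                                # interpolation hyperparameter (tune by CV)
beta_relaxed <- mu * beta_lasso[A0] + (1 - mu) * beta_ols
cbind(lasso = beta_lasso[A0], ols = beta_ols, relaxed = beta_relaxed)
```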