> [!tldr] Ridge Penalty
> A ridge penalty generically refers to a penalty of the form $\beta^{T}D\beta$ that is added to an (unpenalized) objective when optimizing for some $\beta$. This commonly relates to a #bayesian prior of $\beta \sim N(0, D^{-1})$, since $\beta^{T}D\beta$ is the only term in its log-density containing $\beta$.

Common forms include:

- [[Linear Regression Methods#Ridge Regression|Ridge regression]], where $D \propto I$,
- The Bayesian formulation of [[Mixed Linear Models]],
- The [[Splines#The Wiggliness Penalty]], which penalizes $\beta$ based on the wiggliness (second derivative) of the basis.

From a purely predictive perspective, the penalty reduces variance in the [[Bias-Variance Tradeoff]] by reducing the [[Degree of Freedom|degrees of freedom]] of the model, i.e. the degree to which the coefficients $\beta$ are allowed to vary.

### Bayesian / Mixed Modeling Interpretations

We can also interpret the ridge penalty as a Bayesian prior that $\beta \sim N\left( 0, \frac{\sigma^{2}_{\epsilon}}{\lambda} D^{-1} \right)$, i.e. a mixed model where $\beta$ are (partly) random effects.

- $D$ captures our "idea" of how $\beta$ behaves -- e.g. continuous, centered around 0, etc. Setting some entries of $D$ to 0, so that the corresponding terms are not penalized, amounts to an (improper) uniform prior on those terms' coefficients.
- A large $\lambda$ corresponds to a small prior variance $\frac{\sigma^{2}_{\epsilon}}{\lambda}$, representing a belief that $\beta$ shouldn't deviate much from $0$. In a mixed model context, $\lambda$ controls the signal-to-noise ratio: assuming the noise $\pmb{\epsilon} \sim N(\mathbf{0},\sigma^{2}_{\epsilon}I)$ is independent of everything else and $\mathrm{Cov}(\beta)= \sigma^{2}_{b}I$ (i.e. $D = I$), we have $\lambda=\frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{b}}$. Therefore, a high signal-to-noise ratio corresponds to a small $\lambda$: we can penalize less without worrying about overfitting.
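
To see where this ratio comes from, assume a linear model $y = X\beta + \pmb{\epsilon}$ (writing $X$ for the design matrix, which the note leaves implicit) with the variance components above and $D = I$. Up to terms not involving $\beta$, the negative log posterior is

$$
\frac{1}{2\sigma^{2}_{\epsilon}}\lVert y - X\beta \rVert^{2} + \frac{1}{2\sigma^{2}_{b}}\beta^{T}\beta
\;\propto\;
\lVert y - X\beta \rVert^{2} + \frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{b}}\,\beta^{T}\beta,
$$

which is exactly the penalized least-squares objective with $\lambda = \frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{b}}$.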
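
As a numerical sanity check of the equivalence between the penalized estimate and the posterior mean, here is a minimal sketch in Python/NumPy on simulated data; the data-generating setup and variable names are illustrative assumptions, not part of the note.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
sigma_eps, sigma_b = 1.0, 0.5          # noise and coefficient scales (assumed known)
beta_true = rng.normal(scale=sigma_b, size=p)
y = X @ beta_true + rng.normal(scale=sigma_eps, size=n)

D = np.eye(p)                          # ordinary ridge: D proportional to I
lam = sigma_eps**2 / sigma_b**2        # lambda = sigma_eps^2 / sigma_b^2

# Penalized least squares: argmin_b ||y - X b||^2 + lam * b' D b
beta_pen = np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# Bayesian view: posterior mean under beta ~ N(0, (sigma_eps^2 / lam) * D^{-1})
# with y | beta ~ N(X beta, sigma_eps^2 I)
post_prec = X.T @ X / sigma_eps**2 + (lam / sigma_eps**2) * D
beta_bayes = np.linalg.solve(post_prec, X.T @ y / sigma_eps**2)

print(np.allclose(beta_pen, beta_bayes))   # True: identical estimates
```

Here $\lambda$ is taken directly from the (known) variance components via $\lambda = \sigma^{2}_{\epsilon}/\sigma^{2}_{b}$; in practice it would instead be chosen by cross-validation or estimated through the mixed-model formulation.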