> [!info] This material corresponds to chapter 5 in the ESL.
## Basis Expansion
Using linear models in both regression and classification is a simplifying approximation that accepts some bias in exchange for lower variance and better interpretability.
Instead of working with just the predictors $X_{j}$ as given, **basis expansions** introduce transformations to the original predictors. Expansions can introduce:
- Non-linearity: e.g. $h(X)=X_{j}^{2},\log(X_{j})$, etc.
- Interaction terms: e.g. $h(X)=X_{j}X_{k}$.
- Localization: e.g. $h(X)=\mathbb{1}_{X_{j} \in [a_{j},b_{j}]}$, as in splines.
In general, basis expansion methods create a **dictionary** $\mathcal{D}$ of transformed predictors, whose large size makes overfitting an issue.
- Restricting assumptions, e.g. linearity of the model in the expanded basis, limit the class of functions considered a priori.
- Model selection and regularization limit the variability of the coefficients.
One typical example is introducing higher powers and interaction terms, i.e. $\mathcal{D}(x_{1},\dots,x_{n})=\{ x_{1},\dots,x_{n} \} \cup \{ x_{i}x_{j} ~|~ 1 \le i,j \le n \} \cup \cdots$ The expanded design matrix is then $\pmb{\Phi}$, and we can use it to fit models like the OLS $Y \sim \pmb{\Phi}$; see the sketch after this list.
- The variability can be controlled with [[Linear Regression Methods#Shrinkage Methods|shrinkage methods like lasso and ridge regression]].
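A minimal sketch of this workflow, assuming scikit-learn and a synthetic dataset (the names `X`, `y`, and the chosen `alpha` are illustrative, not from the source):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))                 # synthetic predictors
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(200)

# Dictionary D: original predictors, squares, and pairwise interactions
expand = PolynomialFeatures(degree=2, include_bias=False)
Phi = expand.fit_transform(X)                     # expanded design matrix

# Ridge regression on the expanded basis controls coefficient variability
model = Ridge(alpha=1.0).fit(Phi, y)
```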
Another class of basis expansions is [[Splines]], which use piecewise polynomials (usually linear or cubic) as bases.
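For concreteness, here is a hedged numpy sketch of one common spline basis, the truncated-power basis for a cubic spline (the knot locations below are arbitrary illustrations):

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, and (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 100)
Phi = truncated_power_basis(x, knots=[0.25, 0.5, 0.75])  # 100 x 7 design matrix
```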
## Wavelet Smoothing
Wavelet smoothing models the response function as a linear combination of [[Orthogonal Wavelets|wavelet bases]], which are orthonormal function bases.
By filtering out bases with small coefficients, *wavelet smoothing compresses the response by imposing sparsity*.
### Wavelet Fitting
Wavelet smoothing is convenient when the response is evenly spaced, e.g. daily temperatures and images. Suppose $\mathbf{y}\in \mathbb{R}^{2^{J}}$ contains $2^{J}$ observations; then we can model it in the space $V_{J}$, spanned by $2^{J}$ basis wavelets.
Denoting by $\mathbf{W}$ the $2^{J}\times 2^{J}$ matrix whose columns are the basis wavelets evaluated at the observations, the **wavelet transform** maps $\mathbf{y} \mapsto \mathbf{y}^{*} := \mathbf{W}^{T}\mathbf{y}$, where $\mathbf{y}^{*}$ contains the OLS coefficients of $\mathbf{y}$ in terms of the $2^{J}$ wavelets (since $\mathbf{W}$ is orthonormal).
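A self-contained numpy sketch, using the Haar wavelets as a concrete orthonormal basis (the recursive construction below is one standard way to build $\mathbf{W}$; the signal `y` is synthetic):

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar basis: columns are the n basis wavelets
    evaluated at n equispaced points (n must be a power of 2)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2).T                       # rows of the coarser basis
    top = np.kron(h, [1.0, 1.0])                    # scaling functions, refined
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # localized differences
    return (np.vstack([top, bottom]) / np.sqrt(2)).T

J = 7
n = 2 ** J
rng = np.random.default_rng(0)
t = np.arange(n) / n
y = np.sin(4 * np.pi * t) + 0.3 * rng.standard_normal(n)  # noisy signal

W = haar_matrix(n)
assert np.allclose(W.T @ W, np.eye(n))  # orthonormal columns
y_star = W.T @ y                        # wavelet transform = OLS coefficients
```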
### Wavelet Filtering
Wavelet smoothing shrinks the smaller coefficients to $0$, producing a filtering effect and a sparser representation of the response: $
\mathbf{y}^{*}= (E_{1},\epsilon_{1},\epsilon_{2},E_{2},E_{3},\dots)
\xrightarrow{\,\text{ shrink }\,}\hat{\theta}=(E_{1}',0,0,E_{2}',E_{3}',\dots)
$ which is used by the **inverse wavelet transform** to compute the fitted values $\hat{\mathbf{f}}(t):=\mathbf{W}\hat{\theta}$.
**SURE shrinkage** (Stein Unbiased Risk Estimation) is one such filtering method: it shrinks every coefficient toward $0$ by $\lambda$, truncating at $0$: $\hat{\theta}_{j}(\lambda)=\mathrm{sign}(y^{*}_{j})\cdot (|y^{*}_{j}|-\lambda)_{+}.$
![[LassoShrinkage.png#invert|w60|center]]
It is derived from a criterion analogous to that of the lasso: $\hat{\theta}(\lambda)=
\underset{\theta}{\mathrm{argmin}}\bigg[ \| \mathbf{y}-\mathbf{W}\theta \|_{2}^{2}+2\lambda \| \theta \|_{1} \bigg] $ where $\lambda$ is the shrinkage parameter; a natural choice is $\sigma \sqrt{ 2\log N }$, approximately the expected value of $\max_{j}|y^{*}_{j}|$ when the response $\mathbf{y}$ is pure white noise, i.e. i.i.d. $N(0,\sigma^{2})$.