> [!info] This material corresponds to chapter 5 in the ESL.
## Basis Expansion
Using linear models in both regression and classification is a simplifying approximation that accepts some bias in exchange for lower variance and better interpretability.
Instead of working with just the predictors $X_{j}$ as given, **basis expansions** introduce transformations to the original predictors. Expansions can introduce:
- Non-linearity: e.g. $h(X)=X_{j}^{2},\log(X_{j})$, etc.
- Interaction terms: e.g. $h(X)=X_{j}X_{k}$.
- Localization: e.g. $h(X)=\mathbb{1}_{X_{j} \in [a_{j},b_{j}]}$, as in splines.
In general, basis expansion methods create a **dictionary** $\mathcal{D}$ of transformed predictors, whose large size makes overfitting an issue.
- Restricting assumptions, e.g. linearity of the model in the expanded basis, limit the class of functions considered a priori.
- Model selection and regularization limit the variability of the coefficients.
One typical example is introducing higher powers and interaction terms, i.e. $\mathcal{D}(x_{1},\dots,x_{n})=\{ x_{1},\dots,x_{n} \} \cup \{ x_{i}x_{j} ~|~ 1 \le i,j \le n \} \cup \cdots$ The expanded design matrix is then $\pmb{\Phi}$, and we can use it to fit models like the OLS $Y \sim \pmb{\Phi}$; see the sketch after this list.
- The variability can be controlled with [[Linear Regression Methods#Shrinkage Methods|shrinkage methods like lasso and ridge regression]].
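A minimal sketch of this workflow, assuming scikit-learn and a synthetic dataset (the names `X`, `y`, and the chosen `alpha` are illustrative, not from the source):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))                 # synthetic predictors
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(200)

# Dictionary D: original predictors, squares, and pairwise interactions
expand = PolynomialFeatures(degree=2, include_bias=False)
Phi = expand.fit_transform(X)                     # expanded design matrix

# Ridge regression on the expanded basis controls coefficient variability
model = Ridge(alpha=1.0).fit(Phi, y)
```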
Another class of basis expansions is [[Splines]], which use piecewise polynomials (usually linear or cubic) as bases.
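For concreteness, here is a hedged numpy sketch of one common spline basis, the truncated-power basis for a cubic spline (the knot locations below are arbitrary illustrations):

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, and (x - k)_+^3 per knot."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 100)
Phi = truncated_power_basis(x, knots=[0.25, 0.5, 0.75])  # 100 x 7 design matrix
```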
## Wavelet Smoothing
Wavelet smoothing models the response function as a linear combination of [[Orthogonal Wavelets|wavelet bases]], which are orthonormal function bases.
By filtering out bases with small coefficients, *wavelet smoothing compresses the response by imposing sparsity*.
### Wavelet Fitting
Wavelet smoothing is convenient when the response is evenly spaced, e.g. daily temperatures and images. Suppose $\mathbf{y}\in \mathbb{R}^{2^{J}}$ contains $2^{J}$ observations; then we can model it in the space $V_{J}$, spanned by $2^{J}$ basis wavelets.
Denoting by $\mathbf{W}$ the $2^{J}\times 2^{J}$ matrix whose columns are the basis wavelets evaluated at the observations, the **wavelet transform** maps $\mathbf{y} \mapsto \mathbf{y}^{*} := \mathbf{W}^{T}\mathbf{y}$, where $\mathbf{y}^{*}$ contains the OLS coefficients of $\mathbf{y}$ in terms of the $2^{J}$ wavelets (since $\mathbf{W}$ is orthonormal).
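A self-contained numpy sketch, using the Haar wavelets as a concrete orthonormal basis (the recursive construction below is one standard way to build $\mathbf{W}$; the signal `y` is synthetic):

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar basis: columns are the n basis wavelets
    evaluated at n equispaced points (n must be a power of 2)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2).T                       # rows of the coarser basis
    top = np.kron(h, [1.0, 1.0])                    # scaling functions, refined
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # localized differences
    return (np.vstack([top, bottom]) / np.sqrt(2)).T

J = 7
n = 2 ** J
rng = np.random.default_rng(0)
t = np.arange(n) / n
y = np.sin(4 * np.pi * t) + 0.3 * rng.standard_normal(n)  # noisy signal

W = haar_matrix(n)
assert np.allclose(W.T @ W, np.eye(n))  # orthonormal columns
y_star = W.T @ y                        # wavelet transform = OLS coefficients
```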
### Wavelet Filtering
Wavelet smoothing shrinks the smaller coefficients to $0$, producing a filtering effect and a sparser representation of the response: $
\mathbf{y}^{*}= (E_{1},\epsilon_{1},\epsilon_{2},E_{2},E_{3},\dots)
\xrightarrow{\,\text{ shrink }\,}\hat{\theta}=(E_{1}',0,0,E_{2}',E_{3}',\dots)
$ which is used by the **inverse wavelet transform** to compute the fitted values $\hat{\mathbf{f}}(t):=\mathbf{W}\hat{\theta}$.
**SURE shrinkage** (Stein Unbiased Risk Estimation) is one such filtering method: it shrinks every coefficient toward $0$ by $\lambda$, truncating at $0$: $\hat{\theta}_{j}(\lambda)=\mathrm{sign}(y^{*}_{j})\cdot (|y^{*}_{j}|-\lambda)_{+}.$
![[LassoShrinkage.png#invert|w60|center]]
It is derived from a criterion analogous to that of the lasso: $\hat{\theta}(\lambda)=
\underset{\theta}{\mathrm{argmin}}\bigg[ \| \mathbf{y}-\mathbf{W}\theta \|_{2}^{2}+2\lambda \| \theta \|_{1} \bigg] $ where $\lambda$ is the shrinkage parameter; a natural choice is $\sigma \sqrt{ 2\log N }$, approximately the expected value of $\max_{j}|y^{*}_{j}|$ when the response $\mathbf{y}$ is pure white noise, i.e. i.i.d. $N(0,\sigma^{2})$.