**Generalized additive models (GAMs)** further generalize OLS and [[Generalized Linear Models|GLMs]], modeling $\mu_{Y}(X)=\mathbb{E}[Y|X]$ through a [[Link Function|link function]] $g$:
$$g(\mu_{Y}(X))=\alpha + \sum_{i}f_{i}(X_{i}),$$
where the $f_{i}$ are (in general) non-linear functions. Expanding the summation by explicit subsets of features:
$$g(\pmb{\mu})=\alpha + \mathbf{A}\pmb{\theta}+\sum_{j}f_{j}(\mathbf{x}_{j})+\sum_{j,k}f_{j,k}(\mathbf{x}_{j}, \mathbf{x}_{k})+\cdots,$$
- $\alpha$ is the intercept shared by all data points;
- $\mathbf{A}\pmb{\theta}$ collects explicitly parametrized effects, e.g. a [[Basis Expansion and Regularization#Basis Expansion|basis expansion]] on the original features $\mathbf{X}$;
- $f_{j}$ are univariate effects, $f_{j,k}$ are first-order interactions, etc., all of which are fitted with non-parametric methods like [[Scatterplot Smoothers|scatterplot smoothers]].

### Additive Models

The basic form of GAMs stems from an **additive model** with the identity link $g: \mu \mapsto \mu$:
$$\pmb{\mu}=\alpha + \mathbf{A}\pmb{\theta}+\sum_{j}f_{j}(\mathbf{x}_{j})+\sum_{j,k}f_{j,k}(\mathbf{x}_{j}, \mathbf{x}_{k})+\cdots.$$

### Fitting GAMs

The **backfitting algorithm** fits simple GAMs (without interaction terms) given a scatterplot smoother $\mathcal{S}(y \sim x)$:

> [!algorithm]
> $[1]$ Initialize with $\hat{\alpha}=\bar{y}$ and $\hat{f}_{j}\equiv 0$.
> $[2]$ Cycle through the predictors $j=1,\dots,p$:
> $$\hat{f}_{j} \leftarrow \mathcal{S}\left[ \text{residuals}_{-j} \sim x_{j} \right],$$
> where $\text{residuals}_{-j}=\left( y_{i} -\hat{\alpha} - \sum\limits_{k \ne j}\hat{f}_{k}(x_{ik}) \right)_{i=1}^{N}$, i.e. the residuals of fitting $\mathbf{y}$ with the current estimates $\hat{\alpha}$ and $\hat{f}_{k}(x_{ik})$ for all $k \ne j$.
> $[3]$ Re-center $\hat{f}_{j}$ around $0$ with $\hat{f}_{j} \leftarrow \hat{f}_{j}-\frac{1}{N}\sum_{i=1}^{N}\hat{f}_{j}(x_{ij})$.
> $[4]$ Repeat $[2, 3]$ until sufficient convergence in all $\hat{f}_{j}$.

A runnable sketch of this algorithm follows the Limitations section below.

### Limitations of GAMs

Vanilla GAM fitting algorithms have trouble in large data-mining applications:
- They do not perform predictor selection, fitting a relation for every predictor. This is inefficient and likely overfits when many irrelevant predictors are present.
- As a remedy, LASSO-type penalties in the scatterplot smoothers can produce sparser models (a sparse variant is sketched at the end of this note).
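
As a concrete illustration of the backfitting algorithm above, here is a minimal NumPy sketch. A $k$-nearest-neighbor running mean stands in for the smoother $\mathcal{S}$ (any scatterplot smoother could be substituted); the function names, window size `k`, tolerance, and the synthetic demo are illustrative assumptions, not from the note.

```python
import numpy as np

def knn_mean_smoother(x, y, k=15):
    """S[y ~ x]: smooth y against x with a k-nearest-neighbor running mean.

    Returns fitted values at each observed x. Stand-in for any scatterplot
    smoother (local linear, smoothing spline, ...).
    """
    n = len(x)
    order = np.argsort(x)
    fitted = np.empty(n)
    for rank, i in enumerate(order):
        lo = max(0, rank - k // 2)
        hi = min(n, lo + k)
        lo = max(0, hi - k)               # keep window size k near the edges
        fitted[i] = y[order[lo:hi]].mean()
    return fitted

def backfit(X, y, k=15, tol=1e-6, max_iter=100):
    """Fit an additive model y ~ alpha + sum_j f_j(x_j) by backfitting."""
    n, p = X.shape
    alpha = y.mean()                      # [1] initialize alpha = mean(y)
    f = np.zeros((n, p))                  # f[:, j] holds f_j evaluated at x_ij
    for _ in range(max_iter):
        f_old = f.copy()
        for j in range(p):                # [2] cycle through predictors
            # residuals_{-j}: leave out the current component f_j
            partial_resid = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = knn_mean_smoother(X[:, j], partial_resid, k)
            f[:, j] -= f[:, j].mean()     # [3] re-center f_j around 0
        if np.abs(f - f_old).max() < tol: # [4] stop when all f_j stabilize
            break
    return alpha, f

# Demo on synthetic data: two additive, non-linear effects.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)
alpha, f = backfit(X, y)
# alpha is close to mean(y); f[:, 0] tracks sin(x_0) and f[:, 1] tracks
# x_1^2, each up to the centering constant removed in step [3].
```

The re-centering in step $[3]$ is what makes the decomposition identifiable: without it, constants could shift freely between $\hat{\alpha}$ and the $\hat{f}_{j}$.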
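
The LASSO-type remedy mentioned under Limitations can be sketched as a small change to the same loop, in the spirit of sparse additive models (SpAM): after smoothing, each component is soft-thresholded by its empirical norm, so weak components are set exactly to zero. This is one possible construction, not the note's prescribed method; the penalty level `lam` is a hypothetical tuning parameter.

```python
def sparse_backfit(X, y, lam=0.1, k=15, tol=1e-6, max_iter=100):
    """Backfitting with a SpAM-style soft-threshold on each component,
    shrinking irrelevant predictors' f_j exactly to zero."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(max_iter):
        f_old = f.copy()
        for j in range(p):
            partial_resid = y - alpha - f.sum(axis=1) + f[:, j]
            g = knn_mean_smoother(X[:, j], partial_resid, k)
            norm = np.sqrt(np.mean(g ** 2))   # empirical L2 norm of f_j
            # Soft-threshold: scale the component down, or kill it entirely.
            shrink = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
            f[:, j] = shrink * g
            f[:, j] -= f[:, j].mean()
        if np.abs(f - f_old).max() < tol:
            break
    return alpha, f
```

With `lam` chosen by cross-validation, components whose smoothed fit is weaker than the threshold drop out, giving the predictor selection that vanilla backfitting lacks.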