**Generalized additive models (GAMs)** generalize OLS and [[Generalized Linear Models|GLMs]], modeling $\mu_{Y}(X)=\mathbb{E}[Y|X]$ through a [[Link Function|link function]] $g$: $g(\mu_{Y}(X))=\alpha + \sum_{i}f_{i}(X_{i}),$ where the $f_{i}$ can in general be non-linear functions. Expanding the summation into explicit subsets of features: $g(\pmb{\mu})=\alpha + \mathbf{A}\pmb{\theta}+\sum_{j}f_{j}(\mathbf{x}_{j})+\sum_{j,k}f_{j,k}(\mathbf{x}_{j}, \mathbf{x}_{k})+\cdots,$ where
- $\alpha$ is the intercept shared by all data points,
- $\mathbf{A\theta}$ are explicitly parametrized effects, e.g. [[Basis Expansion and Regularization#Basis Expansion|basis expansion]] on the original features $\mathbf{X}$.
- $f_{j}$ are univariate effects, $f_{j,k}$ are pairwise (first-order) interactions, etc., all of which are fitted with non-parametric methods like [[Scatterplot Smoothers|scatterplot smoothers]].
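A minimal sketch of how a fitted GAM turns the additive predictor into $\mathbb{E}[Y|X]$ through the inverse link, here with a logit link for a binary response. The component functions `f1`, `f2` and the intercept value are illustrative assumptions, not fitted quantities:

```python
import numpy as np

# Hypothetical fitted univariate effects f_1, f_2 and intercept alpha.
f1 = lambda x: np.sin(3 * x)
f2 = lambda x: 0.5 * x ** 2
alpha = -0.2

def predict_proba(x1, x2):
    # Additive predictor on the link scale: g(mu) = alpha + f1(x1) + f2(x2)
    eta = alpha + f1(x1) + f2(x2)
    # Invert the logit link g(mu) = log(mu / (1 - mu)) to recover mu = E[Y|X]
    return 1.0 / (1.0 + np.exp(-eta))
```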
### Additive Models
GAMs reduce to a plain **additive model** under the identity link $g: \mu \mapsto \mu$: $\pmb{\mu}=\alpha + \mathbf{A}\pmb{\theta}+\sum_{j}f_{j}(\mathbf{x}_{j})+\sum_{j,k}f_{j,k}(\mathbf{x}_{j}, \mathbf{x}_{k})+\cdots.$
### Fitting GAMs
The **backfitting algorithm** fits simple additive models (without interaction terms) using a scatterplot smoother $\mathcal{S}(y \sim x)$:
> [!algorithm]
> $[1]$ Initialize with $\hat{\alpha}=\bar{y}$, $\hat{f}_{j}\equiv 0$.
> $[2]$ Cycling through predictors $j=1,\dots,p$:
> $\hat{f}_{j} \leftarrow \mathcal{S}\left[ \text{residuals}_{-j} \sim x_{j} \right]$, where $\text{residuals}_{-j}=\left( y_{i} -\hat{\alpha} - \sum\limits_{k \ne j}\hat{f}_{k}(x_{ik}) \right)_{i=1}^{N}$, i.e. the residuals of fitting $\mathbf{y}$ with the current estimates $\hat{\alpha}$ and $\hat{f}_{k}$ for all $k \ne j$.
>
> $[3]$ Re-center $\hat{f}_{j}$ around $0$ with $\hat{f}_{j} \leftarrow \hat{f}_{j}-\frac{1}{N}\sum_{i=1}^{N}\hat{f}_{j}(x_{ij})$.
> $[4]$ Repeat $[2,3]$ until sufficient convergence in all $\hat{f}_{j}$.
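A minimal sketch of the algorithm above for the identity-link additive model, with a Nadaraya-Watson kernel smoother standing in for $\mathcal{S}$. The smoother choice, bandwidth, and function names are illustrative assumptions:

```python
import numpy as np

def kernel_smoother(x, y, bandwidth=0.3):
    """Nadaraya-Watson scatterplot smoother: fitted values at each x."""
    # Gaussian kernel weights between all pairs of observations
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)

def backfit(X, y, n_iter=20, tol=1e-6):
    """Backfitting for y ~ alpha + sum_j f_j(x_j), no interactions."""
    N, p = X.shape
    alpha = y.mean()              # step [1]: intercept, f_j = 0
    F = np.zeros((N, p))          # F[:, j] holds f_j evaluated at x_{ij}
    for _ in range(n_iter):
        F_old = F.copy()
        for j in range(p):        # step [2]: cycle through predictors
            # partial residuals: remove alpha and all other fitted effects
            r = y - alpha - F.sum(axis=1) + F[:, j]
            F[:, j] = kernel_smoother(X[:, j], r)
            F[:, j] -= F[:, j].mean()     # step [3]: re-center around 0
        if np.max(np.abs(F - F_old)) < tol:
            break                 # step [4]: stop at convergence
    return alpha, F
```

For example, `alpha, F = backfit(X, y)` on data generated as `y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + noise` should recover the two component shapes in `F[:, 0]` and `F[:, 1]` up to centering.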
### Limitations of GAMs
Vanilla GAM fitting algorithms have trouble in large data mining applications:
- They do not perform predictor selection: a relation is fitted for every predictor, which is inefficient and likely to overfit when many irrelevant predictors are present.
- As a remedy, LASSO-type penalties on the smoothed components can zero out irrelevant predictors and produce sparser models (see the sketch below).
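One way such a penalty can enter backfitting, sketched in the style of sparse additive models: after the ordinary smoother fit in step $[2]$, the component is group-soft-thresholded so that weak effects are shrunk exactly to zero. The update rule, names, and penalty parameterization here are illustrative assumptions, not the source's algorithm:

```python
import numpy as np

def sparse_update(x_j, r, smoother, lam):
    """Penalized replacement for one backfitting update of f_j.

    x_j: values of predictor j; r: partial residuals for predictor j;
    smoother: scatterplot smoother S(y ~ x); lam: penalty level.
    """
    f_j = smoother(x_j, r)                 # ordinary smoother fit
    norm = np.sqrt(np.mean(f_j ** 2))      # empirical L2 norm of f_j
    shrink = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
    f_j = shrink * f_j                     # exactly zero when norm <= lam
    return f_j - f_j.mean()                # keep the component centered
```

With `lam = 0` this reduces to the plain backfitting update; larger `lam` removes more predictors from the model entirely.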