Splines are a form of [[Basis Expansion and Regularization|basis expansion methods]] in regression, using piecewise polynomials whose pieces are joined at **knots $\{ \xi_{k} \}_{k=1}^{K}$**.
> [!definition|*] Splines
> In general, an **order-$M$ spline** is a piecewise polynomial of degree $(M-1)$. It is required to be continuous up to the $(M-2)$th derivative (at the knots).
>
> Given a basis $\mathcal{B}:=\{ b_{j}(X) : j = 1,\dots, p\}$, one can linearly combine them to get an estimated relationship $\hat{f}(X)=\sum_{j}\hat{\beta}_{j}b_{j}(X)$ for some (estimated) coefficients $\hat{\beta}=(\hat{\beta}_{1},\dots,\hat{\beta}_{p})$.
Given the basis, we have the **augmented design matrix** $\mathbf{X}^{\ast}:=\mathbf{B}(\mathbf{x}):= \begin{bmatrix} b_{1}(\mathbf{x})&\dots&b_{p}(\mathbf{x}) \end{bmatrix}\in \mathbb{R}^{n\times p}$ on the dataset $(\mathbf{x},\mathbf{y})$, where $b_{j}(\mathbf{x})$ is the column vector $(b_{j}(x_{1}),\dots,b_{j}(x_{n}))^T$. Here we usually assume the constant/intercept $\mathbf{1}=(1,\dots,1)$ and the original $\mathbf{x}$ are contained in $\mathbf{B}(\mathbf{x})$, or lie in its span.
> [!warning] Which is which
> $b_{1},\dots$ are the basis functions.
> $\mathcal{B}=\{ b_{1},\dots \}$ is the (unordered) set of basis functions.
> $\mathbf{b}=(b_{1},\dots, b_{p})$ is the ordered tuple of them.
>
> $b_{j}(\mathbf{x})$ are vectors produced by applying $b_{j}$ entry-wise to $\mathbf{x}$.
> $\mathbf{b}(x) =(b_{1}(x),\dots,b_{p}(x))$ is the basis evaluated at some point $x$.
> $\mathbf{B}(\mathbf{x}) \in \mathbb{R}^{n \times p}$ is the matrix produced by concatenating $b_{1}(\mathbf{x}),\dots,b_{p}(\mathbf{x})$, i.e. the augmented design matrix.
Splines (just like all basis expansion methods) have to be fitted by answering two questions:
*What is the (spline) basis?* This involves choosing:
- The degree of the spline,
- The basis representation (what are the knots and piecewise polynomial functions).
*What are their coefficients?* This involves choosing:
- The form of the objective (error and penalty),
- The strength of penalty in the objective.
## Choice of Basis: First Steps
A basic implementation uses the **truncated power basis** of some degree $d$, defined as $\mathcal{B} :=\{ x^{i}: i=0,\dots,d \} \cup \{ (x-\xi_{k})^{d}_{+}: k = 1,\dots,K\},$ where $(\cdot)_{+}:=\max(\cdot,0)$, so the basis consists of *the polynomials of degree up to $d$, plus one truncated polynomial of degree $d$ per knot.*
However, this basis is far from orthogonal: its functions are highly correlated, which leads to ill-conditioned design matrices, numerical instability, and high computational cost.
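As a quick illustration, here is a minimal sketch (assuming NumPy; the function name is just illustrative) that builds the augmented design matrix for a truncated power basis and exposes the conditioning problem:

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Columns: 1, x, ..., x^degree, then (x - xi_k)_+^degree for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [x**i for i in range(degree + 1)]
    cols += [np.clip(x - xi, 0.0, None)**degree for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 10.0, 200)
knots = np.linspace(1.0, 9.0, 8)
B = truncated_power_basis(x, knots)   # augmented design matrix B(x), shape (200, 12)
print(np.linalg.cond(B.T @ B))        # enormous: the basis functions are highly correlated
```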
### Piecewise Linear Basis
One improvement over the truncated (linear) basis is the **tent functions**: $b_{k}(x):=\begin{dcases*}
\frac{x-\xi_{k-1}}{\xi_{k}-\xi_{k-1}} &\text{if $x \in (\xi_{k-1},\xi_{k}]$} \\[0.4em]
\frac{\xi_{k+1}-x}{\xi_{k+1}-\xi_{k}} &\text{if $x \in (\xi_{k}, \xi_{k+1})$}\\[0.4em]
\,\,0 &\text{otherwise}
\end{dcases*}$
for $k=2,\dots,K-1$. For $k=1$ and $k=K$, the branch that would extend beyond the boundary knots is dropped.
Given distinct datapoints $(\mathbf{x},\mathbf{y})$ (dots in the left panel), setting one knot at each $x_{i}$ produces the tent functions shown in the left panel:
![[LinearBasisInterpolation.png#invert|center]]
The right panel shows the linearly combined tents that *interpolate* the data.
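Below is a minimal sketch of the tent basis (assuming NumPy; names are illustrative). It follows the definition above, except that the first tent is taken to equal $1$ at its own knot so every data point is covered; with one knot per data point the design matrix is the identity, so the coefficients are exactly $\mathbf{y}$:

```python
import numpy as np

def tent_basis(x, knots):
    """Evaluate the K tent functions (one per knot) at the points x."""
    x = np.asarray(x, dtype=float)
    K = len(knots)
    B = np.zeros((len(x), K))
    for k in range(K):
        if k > 0:                                   # rising branch on (xi_{k-1}, xi_k]
            m = (x > knots[k - 1]) & (x <= knots[k])
            B[m, k] = (x[m] - knots[k - 1]) / (knots[k] - knots[k - 1])
        else:                                       # first tent: no rising branch, value 1 at its knot
            B[np.isclose(x, knots[0]), k] = 1.0
        if k < K - 1:                               # falling branch on (xi_k, xi_{k+1})
            m = (x > knots[k]) & (x < knots[k + 1])
            B[m, k] = (knots[k + 1] - x[m]) / (knots[k + 1] - knots[k])
    return B

# One knot per (sorted, distinct) data point: B(x) is the identity matrix,
# so the coefficients equal y and the combined tents interpolate the data.
xs = np.array([0.0, 1.0, 2.5, 4.0, 6.0])
ys = np.array([1.0, 0.2, 1.5, 0.7, 2.0])
beta = np.linalg.solve(tent_basis(xs, xs), ys)      # equals ys
grid = np.linspace(0.0, 6.0, 101)
fit = tent_basis(grid, xs) @ beta                   # piecewise-linear interpolant of the data
```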
## Penalized Spline Fitting
Just like any basis expansion technique, splines risk overfitting the data due to their flexibility. By imposing constraints and penalties, we can require the fitted spline to have a few desirable qualities:
(yapping alert -- expand to see justifications for those constraints/penalties)
> [!info]- Smoothness penalty on the second derivative
> We may expect the relationship $\mathbb{E}[Y~|~X]=f(X)$ to be reasonably smooth in $X$. *A rapidly changing second derivative causes the function to look very wiggly*, so we can penalize the fitted relationship's second derivative.

> [!info]- Continuity up to the second derivative
> Piecewise linear splines, although they provide decent fits, are not differentiable at the knots, so the fitted relationship looks janky. In fact, the human eye can perceive discontinuities up to the second derivative.
>
> *Second-derivative constraints require the polynomial basis to be at least cubic*, since a quadratic basis does not provide enough flexibility (its 3 coefficients per piece are already fully determined by the $C^{0},C^{1},C^{2}$ continuity constraints).
### The Wiggliness Penalty
Formalizing the smoothness requirements,
> [!definition|*] Wiggliness-Penalized Curve Fitting
> Fixing a penalty parameter $\lambda$, we can solve the problem $\begin{align*}
\underset{f \in C^{2}[a,b]}{\arg\min}~\| \mathbf{y}-f(\mathbf{x}) \|^{2}+\lambda \int _{a}^{b}(f'')^{2} ~ dx,& &(\ast)
\end{align*}$ i.e. the RSS with a penalty for wiggliness.
^3e1e8f
- This requires the optimizer $\hat{f}$ to be $C^{2}$-continuous, and also controls its wiggliness.
However, in the context of basis expansion with basis $\mathbf{b}$ and coefficients $\beta$, this reduces to a finite-dimensional problem:
> [!theorem|*] Wiggliness penalty as $l_2$ penalty
> The wiggliness penalty is equivalent to the $l_{2}$ penalty: if $f:=\sum_{j}\beta_{j}b_{j}$, and denoting $\left< f,g \right>:= \int _{a}^{b} fg~ dx$, the penalties are $\begin{align*}
\int _{a}^{b} (f'')^{2}~ dx &= \| f'' \|^{2}\\
&= \| \sum_{j}\beta_{j}b_{j}'' \| ^{2}\\
&= \beta^{T}D\beta,
\end{align*}$where $D_{ij}:= \left< b_{i}'',b_{j}'' \right>$.
Of course, the optimal coefficients are $\hat{\beta}=(\mathbf{X}^{\ast T}\mathbf{X}^{\ast}+\lambda D)^{-1}\mathbf{X}^{\ast T}\mathbf{y},$ where $\mathbf{X}^{\ast}$ is the augmented design matrix.
> [!connection]
> Therefore, the wiggliness penalty in basis expansions is equivalent to a [[Ridge Penalty|ridge penalty]] with weights $D$.
> - In [[Linear Regression Methods#Ridge Regression|ridge regression]], the shrinkage is applied uniformly via the identity matrix $I$.
> - In basis expansion, the shrinkage is applied in proportion to wiggliness, via the penalty matrix $D$.
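As a concrete sketch of the closed form (assuming NumPy; names are illustrative, and $D$ is approximated by quadrature rather than computed exactly), using the cubic truncated power basis:

```python
import numpy as np

def cubic_tpb(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - xi_k)_+^3."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - xi, 0.0, None)**3 for xi in knots]
    return np.column_stack(cols)

def cubic_tpb_d2(x, knots):
    """Second derivatives of the same basis: 0, 0, 2, 6x, 6(x - xi_k)_+."""
    x = np.asarray(x, dtype=float)
    cols = [np.zeros_like(x), np.zeros_like(x), 2.0 * np.ones_like(x), 6.0 * x]
    cols += [6.0 * np.clip(x - xi, 0.0, None) for xi in knots]
    return np.column_stack(cols)

def wiggliness_penalty(knots, a, b, n_grid=2001):
    """D[i, j] = int_a^b b_i'' b_j'' dx, approximated with trapezoid weights."""
    grid = np.linspace(a, b, n_grid)
    B2 = cubic_tpb_d2(grid, knots)
    w = np.full(n_grid, (b - a) / (n_grid - 1))
    w[[0, -1]] /= 2.0
    return B2.T @ (w[:, None] * B2)

def penalized_fit(x, y, knots, lam, a, b):
    """Closed-form ridge-type solution: (B'B + lam * D)^{-1} B'y."""
    B = cubic_tpb(x, knots)
    D = wiggliness_penalty(knots, a, b)
    return np.linalg.solve(B.T @ B + lam * D, B.T @ y)
```

Note that the rows and columns of $D$ corresponding to the intercept and linear terms are zero (their second derivatives vanish), so only the curved components get shrunk -- exactly the weighted-ridge behavior described above.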
A natural question now is *which (family of equivalent) bases contains the optimizer of the above problem*. In fact, we shall see that the answer is piecewise-cubic bases.
- Note that there are multiple bases that all span the vector space of piecewise cubics -- any one of them works in theory, even though some are easier to use in practice.
### Penalized Splines as a Mixed Linear Model
Following the characterization of [[Mixed Linear Models#Relationship to Ridge-Like Penalties|mixed linear models as a ridge-penalized linear regression]], we can use the equivalence to cast ridge-penalized splines as a mixed linear model:
Suppose $\mathbf{X}=[\mathbf{1} ~~ \mathbf{x}]$ is the original design matrix, and $\mathbf{Z}$ contains the spline basis expansion. The penalized spline optimizes the objective $J=\left\| \mathbf{y}-\mathbf{X}\beta - \mathbf{Z}b \right\|^{2}+\lambda b^{T}Db,$where $\beta$ are the (unpenalized) coefficients of the original data, and $b$ are those of the spline basis.
This corresponds to a mixed linear model of
$\begin{align*}
Y&= X\beta+Zb+\epsilon,\\
b&\sim N\left( 0, \frac{\sigma_{\epsilon}^{2}}{\lambda}D^{-1} \right),\\
\epsilon&\overset{\mathrm{iid.}}{\sim} N(0, \sigma^{2}_{\epsilon}).
\end{align*}$
^09ce88
Of course, $\sigma^{2}_{\epsilon}$ and $\sigma^{2}_{b}$ are not individually pinned down by the penalized objective; assuming $D=I$, all that matters is the noise-to-signal ratio $\lambda:= \frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{b}}$.
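As a sanity check of this correspondence, the sketch below (assuming NumPy; the data and $\mathbf{Z}$ are synthetic stand-ins, with $D=I$) verifies numerically that the penalized least-squares solution coincides with the GLS estimate of $\beta$ and the BLUP of $b$ from the mixed model when $\lambda = \sigma^{2}_{\epsilon}/\sigma^{2}_{b}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 40, 6
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])   # original design [1  x]
Z = rng.normal(size=(n, q))                                     # stand-in for the spline expansion
y = rng.normal(size=n)
sig_e2, sig_b2 = 1.0, 0.25
lam = sig_e2 / sig_b2                                           # noise-to-signal ratio (D = I)

# (1) penalized least squares on [X Z], penalizing only the spline block
C = np.hstack([X, Z])
P = np.zeros((C.shape[1], C.shape[1]))
P[2:, 2:] = lam * np.eye(q)                                     # no penalty on beta = (intercept, slope)
sol_pls = np.linalg.solve(C.T @ C + P, C.T @ y)

# (2) GLS estimate of beta and BLUP of b from the mixed model, V = sig_b2 * ZZ' + sig_e2 * I
V = sig_b2 * Z @ Z.T + sig_e2 * np.eye(n)
Vi = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
b_blup = sig_b2 * Z.T @ Vi @ (y - X @ beta_gls)

print(np.allclose(sol_pls, np.concatenate([beta_gls, b_blup])))  # True: same fit
```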
### Selecting the Penalty $\lambda$
From a purely theoretical standpoint, we can estimate $\lambda$ using the above characterization as a mixed linear model:
> [!definition|*] Mixed Model Likelihood-Based Selection
> Set $D=I$ in the mixed model, and compute $\hat{\lambda}={\widehat{\sigma^{2}_{\epsilon}}}~ \Big/ ~{\widehat{\sigma^{2}_{b}}},$where $\widehat{\sigma_{\epsilon}^{2}},\widehat{\sigma^{2}_{b}}$ are MLE or REML estimates in the mixed model.
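A minimal sketch of the REML route (assuming NumPy/SciPy, with $D=I$ and synthetic data standing in for the real dataset and spline expansion):

```python
import numpy as np
from scipy.optimize import minimize

def neg_reml(log_vars, y, X, Z):
    """Negative REML log-likelihood of y ~ N(X beta, sig_b2 * ZZ' + sig_e2 * I)."""
    sig_e2, sig_b2 = np.exp(log_vars)                 # optimize on the log scale for positivity
    n = len(y)
    V = sig_b2 * Z @ Z.T + sig_e2 * np.eye(n)
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    return 0.5 * (np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtViX)[1] + y @ P @ y)

# synthetic stand-ins for the data, original design, and spline expansion
rng = np.random.default_rng(1)
n = 60
x = np.sort(rng.uniform(0.0, 10.0, n))
X = np.column_stack([np.ones(n), x])                                # [1  x]
Z = np.clip(x[:, None] - np.linspace(1.0, 9.0, 8), 0.0, None) ** 3  # truncated cubic terms
y = np.sin(x) + rng.normal(scale=0.3, size=n)

res = minimize(neg_reml, x0=np.zeros(2), args=(y, X, Z), method="Nelder-Mead")
sig_e2_hat, sig_b2_hat = np.exp(res.x)
lam_hat = sig_e2_hat / sig_b2_hat                     # REML-based estimate of the penalty
```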
Alternatively, we can use standard model selection methods like [[Cross Validation|cross validation]]. Since the [[#Splines as Linear Smoothers|spline is a linear smoother]], the [[Cross Validation#LOOCV for Linear Smoothers|convenient results for LOOCV]] apply.
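A sketch of the LOOCV shortcut (assuming NumPy, with `B` the augmented design matrix, `D` the penalty matrix, and `y` the response as in the earlier sketches):

```python
import numpy as np

def loocv_score(B, D, y, lam):
    """LOOCV shortcut for a linear smoother y_hat = S y, with S = B (B'B + lam D)^{-1} B'."""
    S = B @ np.linalg.solve(B.T @ B + lam * D, B.T)
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

# assuming B, D, and y have been built as in the earlier sketches
lambdas = np.logspace(-4, 4, 30)
scores = [loocv_score(B, D, y, lam) for lam in lambdas]
lam_hat = lambdas[int(np.argmin(scores))]
```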
## Cubic Splines
As explained above, $C^{2}$ continuity requires a polynomial basis to be at least cubic, and indeed **cubic splines** are the most common polynomial splines. But why stop at cubics?
### Optimality Results
> [!bigidea] Cubic splines make the least wiggly interpolant, and are optimizers of the wiggliness-penalized curve fitting problem.
Consider an interpolation task on a 2D dataset $\mathcal{D}=(\mathbf{x},\mathbf{y})$ with strictly increasing feature $x_{1}<\cdots<x_{n}$.
- Examples include a single polynomial of order $n$ (degree $n-1$) through all $n$ points -- see [[Polynomial Interpolation]].
> [!definition|*] Natural Cubic Splines
> An interpolating cubic spline $g$ of the dataset is called a **natural cubic spline** if
> - $g$ is a piecewise cubic polynomial with knots on each $x_{i}$,
> - $g$ is continuous up to the second derivative (so $g \in C^{2}$),
> - and $g$ is linear beyond the boundary knots, with $g''(x_{1})=g''(x_{n})=0$.
The cubic spline has enough flexibility (especially continuous second derivatives) to look smooth to the human eye, while being stable enough to avoid excessive overfitting.
- The last requirement of natural cubic splines avoids the erratic behavior of cubics (and high-order polynomials) in extrapolation.
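In practice, SciPy's `CubicSpline` with `bc_type='natural'` constructs such an interpolant directly; a small sketch:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.5, 4.0, 6.0])
y = np.array([1.0, 0.2, 1.5, 0.7, 2.0])

g = CubicSpline(x, y, bc_type="natural")    # C^2 piecewise cubic with g''(x_1) = g''(x_n) = 0

print(np.allclose(g(x), y))                 # True: interpolates the data
print(np.allclose(g(x[[0, -1]], 2), 0.0))   # True: zero second derivative at the end points
```

Note that `CubicSpline` extrapolates with the boundary cubic pieces by default, so the "linear beyond the boundary knots" convention has to be enforced separately if extrapolation is needed.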
The natural cubic interpolant is optimal in the sense that
> [!theorem|*] Optimality of natural cubic spline as interpolant
> *The natural cubic interpolant is the least wiggly interpolant*, i.e. it is the minimizer of the problem $\min_{\substack{\hat{f} \in C^{2}[x_{1},x_{n}] \\ \text{ interpolating }}} \int_{x_{1}}^{x_{n}} \hat{f}''(x)^{2} ~ dx.$
>
> > [!proof]-
> > Let the natural cubic spline be $g$, and consider any other $C^{2}$ interpolant $\hat{f}$. Define $h:=\hat{f}-g$.
> >
> > Then $\begin{align*}
> \int \hat{f}''^{2} ~ dx&= \int (h''+g'')^{2} ~ dx\\
> &= \int h''^{2} ~ dx +2\int h''g'' ~ dx +\int g''^{2} ~ dx
> \end{align*}$The cross term evaluates to $0$: integrating by parts, and using that $g'''$ is piecewise constant with value $g'''(x_{i}^{\ast})$ on each $(x_{i},x_{i+1})$: $\begin{align*}
> \int h''g'' ~ dx&= \big[ h'g'' \big]_{x_{1}}^{x_{n}}-\int h'g''' ~ dx\\
> &= 0-\sum_{i=1}^{n-1}g'''(x_{i}^{\ast})\int _{x_{i}}^{x_{i+1}}h' ~ dx\\
> &= -\sum_{i}g'''(x_{i}^{\ast})(h(x_{i+1})-h(x_{i}))\\
> &= 0.
> \end{align*}$The boundary term vanishes because the natural cubic spline has $g''=0$ at the end points, and each $h(x_{i})=\hat{f}(x_{i})-g(x_{i})=0$ because both functions interpolate the data.
> >
> > Therefore, $\int \hat{f}''^{2} ~ dx \geq \int g''^{2} ~ dx$, with equality only if $h''\equiv 0$, i.e. if $\hat{f}=g$.
>
With a more complex dataset with dense or even duplicate $x