Splines are a form of [[Basis Expansion and Regularization|basis expansion methods]] in regression, using piecewise polynomials that are cut off at **knots $\{ \xi_{k} \}_{k=1}^{K}$**.

> [!definition|*] Splines
> In general, an **order-$M$ spline** is a piecewise polynomial of order $(M-1)$. It is required to be continuous up to the $(M-2)$th derivative (at the knots).
>
> Given a basis $\mathcal{B}:=\{ b_{j}(X) : j = 1,\dots, p\}$, one can linearly combine them to get an estimated relationship $\hat{f}(X)=\sum_{j}\hat{\beta}_{j}b_{j}(X)$ for some (estimated) coefficients $\hat{\beta}=(\hat{\beta}_{1},\dots,\hat{\beta}_{p})$.

Given the basis, we have the **augmented design matrix** $\mathbf{X}^{\ast}:=\mathbf{B}(\mathbf{x}):= \begin{bmatrix} b_{1}(\mathbf{x})&\dots&b_{p}(\mathbf{x}) \end{bmatrix}\in \mathbb{R}^{n\times p}$ on the dataset $(\mathbf{x},\mathbf{y})$, where $b_{j}(\mathbf{x})$ is the column vector $(b_{j}(x_{1}),\dots,b_{j}(x_{n}))^T$. Here we usually assume the constant/intercept $\mathbf{1}=(1,\dots,1)$ and the original $\mathbf{x}$ are contained in $\mathbf{B}(\mathbf{x})$, or are in its span.

> [!warning] Which is which
> $b_{1},\dots$ are the basis functions.
> $\mathcal{B}=\{ b_{1},\dots \}$ is the (unordered) set of basis functions.
> $\mathbf{b}=(b_{1},\dots, b_{p})$ is the ordered tuple of them.
>
> $b_{j}(\mathbf{x})$ are vectors produced by applying $b_{j}$ entry-wise to $\mathbf{x}$.
> $\mathbf{b}(x) =(b_{1}(x),\dots,b_{p}(x))$ is the basis evaluated at some point $x$.
> $\mathbf{B}(\mathbf{x}) \in \mathbb{R}^{n \times p}$ is the matrix produced by concatenating $b_{1}(\mathbf{x}),\dots,b_{p}(\mathbf{x})$, i.e. the augmented design matrix.

Splines (just like all basis expansion methods) are fitted by answering two questions:

*What is the (spline) basis?* This involves choosing:
- the degree of the spline,
- the basis representation (what are the knots and piecewise polynomial functions).

*What are the coefficients?* This involves choosing:
- the form of the objective (error and penalty),
- the strength of the penalty in the objective.

## Choice of Basis: First Steps

A basic implementation uses the **truncated power basis** of some power $p$, consisting of $\{ x^{i}: i=0,\dots,p \} \cup \{ (x-\xi_{k})^{p}_{+}: k = 1,\dots,K\},$ where $(\dots)_{+}:=\max(\dots,0)$, so the basis is *polynomials of power up to $p$, plus truncated polynomials of power $p$, one per knot.* However, this basis is far from orthogonal, causing numerical instability and high computational cost.

### Piecewise Linear Basis

One improvement over the truncated (linear) power basis is the **tent functions**: $b_{k}(x):=\begin{dcases*} \frac{x-\xi_{k-1}}{\xi_{k}-\xi_{k-1}} &\text{if $x \in (\xi_{k-1},\xi_{k}]$} \\[0.4em] \frac{\xi_{k+1}-x}{\xi_{k+1}-\xi_{k}} &\text{if $x \in (\xi_{k}, \xi_{k+1})$}\\[0.4em] \,\,0 &\text{otherwise} \end{dcases*}$ for $k=2,\dots,K-1$. For $k=1,K$, the branch going beyond the knots is dropped.

Given distinct datapoints $(\mathbf{x},\mathbf{y})$ (dots in the left panels), setting one knot at each $x_{i}$ produces the tent functions shown in the left panel:

![[LinearBasisInterpolation.png#invert|center]]

The right panel shows the linearly combined tents that *interpolate* the data (a small sketch follows below).
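As a concrete toy example, here is a minimal sketch of this interpolation (the `tent` helper, the simulated data, and the seed are illustrative assumptions of this sketch, not the code behind the figure above):

```python fold
# Piecewise-linear (tent) basis with one knot per data point: the design matrix
# evaluated at the knots is the identity, so the interpolant's coefficients are
# just the observed responses.
import numpy as np

def tent(x, knots, k):
    """Tent function centered at knots[k]; boundary tents drop the outer branch."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    out[x == knots[k]] = 1.0
    if k > 0:                                   # rising branch on (knots[k-1], knots[k])
        m = (x > knots[k - 1]) & (x < knots[k])
        out[m] = (x[m] - knots[k - 1]) / (knots[k] - knots[k - 1])
    if k < len(knots) - 1:                      # falling branch on (knots[k], knots[k+1])
        m = (x > knots[k]) & (x < knots[k + 1])
        out[m] = (knots[k + 1] - x[m]) / (knots[k + 1] - knots[k])
    return out

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 8))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, len(x))

B = np.column_stack([tent(x, x, k) for k in range(len(x))])   # = identity here
coef = np.linalg.solve(B, y)                                  # so coef == y

x_grid = np.linspace(x.min(), x.max(), 200)
B_grid = np.column_stack([tent(x_grid, x, k) for k in range(len(x))])
f_hat = B_grid @ coef                                         # piecewise-linear interpolant
```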
## Penalized Spline Fitting

Just like any basis expansion technique, splines risk overfitting the data due to their flexibility. By imposing constraints and penalties, we can require the spline to display a few desirable qualities (yapping alert -- expand to see justifications for those constraints/penalties):

> [!info]- Smoothness penalty on the second derivative
> We may expect the relationship $\mathbb{E}[Y~|~X]=f(X)$ to be reasonably smooth in $X$. *A rapidly changing second derivative causes the function to look very wiggly*, so we can penalize the fitted relationship's second derivative.

> [!info]- Continuity up to the second derivative
> Piecewise linear splines, although providing decent fits, are not differentiable at the knots, so the fitted relationship looks janky. In fact, the human eye can perceive discontinuities up to the second derivative, so we ask for *continuity up to the second derivative*.
>
> *Second-derivative constraints require a polynomial basis to be at least cubic*, since a quadratic basis does not provide enough flexibility (its 3 coefficients per piece are all determined by the $C_{0},C_{1},C_{2}$ constraints).

### The Wiggliness Penalty

Formalizing the smoothness requirements,

> [!definition|*] Wiggliness-Penalized Curve Fitting
> Fixing a penalty parameter $\lambda$, we can solve the problem $\begin{align*} \underset{f \in C^{2}[a,b]}{\arg\min}~\| \mathbf{y}-f(\mathbf{x}) \|^{2}+\lambda \int _{a}^{b}(f'')^{2} ~ dx,& &(\ast) \end{align*}$ i.e. the RSS with a penalty for wiggliness.

^3e1e8f

- This requires the optimizer $\hat{f}$ to be $C^{2}$-continuous, and also controls its wiggliness.

However, in the context of basis expansion with basis $b$ and coefficients $\beta$, this reduces to

> [!theorem|*] Wiggliness penalty as $l_2$ penalty
> The wiggliness penalty is equivalent to an $l_{2}$ penalty: if $f:=\sum_{j}\beta_{j}b_{j}$, and denoting $\left< f,g \right>:= \int _{a}^{b} fg~ dx$, the penalty is $\begin{align*} \int _{a}^{b} (f'')^{2}~ dx &= \| f'' \|^{2}\\ &= \Big\| \sum_{j}\beta_{j}b_{j}'' \Big\| ^{2}\\ &= \beta^{T}D\beta, \end{align*}$ where $D_{ij}:= \left< b_{i}'',b_{j}'' \right>$.

Of course, the optimal coefficients are $\hat{\beta}=(\mathbf{X}^{\ast T}\mathbf{X}^{\ast}+\lambda D)^{-1}\mathbf{X}^{\ast T}\mathbf{y},$ where $\mathbf{X}^{\ast}$ is the expanded design matrix (a numerical sketch follows at the end of this subsection).

> [!connection]
> Therefore, the wiggliness penalty in basis expansions is equivalent to a [[Ridge Penalty|ridge penalty]] with weight matrix $D$.
> - In [[Linear Regression Methods#Ridge Regression|ridge regression]], the shrinkage is applied equally via the identity matrix $I$.
> - In basis expansion, the shrinkage is applied in proportion to the wiggliness matrix $D$.

A natural question is now *which (family of equivalent) bases provides the optimizer of the above problem*. We shall see that the answer is piecewise-cubic bases.
- Note that there are multiple bases that all span the vector space of piecewise cubics -- any one of them works in theory, even though some are easier to use in practice.
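To make the ridge form concrete, here is a hedged numerical sketch: the truncated-power basis, the toy data, and the finite-difference approximation of $D$ are all illustrative assumptions of this sketch (for polynomial bases $D$ also has a closed form).

```python fold
# Sketch: the wiggliness penalty as a ridge penalty.
# D_ij = <b_i'', b_j''> is approximated by finite-differencing each basis
# function on a fine grid; beta_hat then follows the formula above.
import numpy as np

def penalty_matrix(basis_funcs, a, b, m=2001):
    grid = np.linspace(a, b, m)
    h = grid[1] - grid[0]
    # numerical second derivative of each basis function, column by column
    B2 = np.column_stack([np.gradient(np.gradient(f(grid), h), h) for f in basis_funcs])
    return B2.T @ B2 * h                       # Riemann-sum approximation of the integrals

def fit_penalized(B, y, D, lam):
    # beta_hat = (B'B + lambda * D)^{-1} B'y  -- the ridge-type solution above
    return np.linalg.solve(B.T @ B + lam * D, B.T @ y)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 50)

knots = np.linspace(0.1, 0.9, 5)               # truncated cubic power basis
basis_funcs = ([lambda t, i=i: t**i for i in range(4)]
               + [lambda t, k=k: np.clip(t - k, 0, None)**3 for k in knots])

B = np.column_stack([f(x) for f in basis_funcs])
D = penalty_matrix(basis_funcs, 0.0, 1.0)
beta_hat = fit_penalized(B, y, D, lam=1e-3)
```

Note how the constant and linear basis functions have zero second derivative, so their rows and columns of $D$ vanish and they are left unpenalized.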
### Penalized Splines as a Mixed Linear Model

Following the characterization of [[Mixed Linear Models#Relationship to Ridge-Like Penalties|mixed linear models as a ridge-penalized linear regression]], we can use the equivalence to cast ridge-penalized splines as a mixed linear model.

Suppose $\mathbf{X}=[\mathbf{1} ~~ \mathbf{x}]$ is the original design matrix, and $\mathbf{Z}$ contains the spline basis expansion. The penalized spline optimizes the objective $J=\left\| \mathbf{y}-\mathbf{X}\beta - \mathbf{Z}b \right\|^{2}+\lambda b^{T}Db,$ where $\beta$ are the (unpenalized) coefficients of the original predictors, and $b$ are those of the spline basis.

This objective corresponds to the mixed linear model $\begin{align*} Y&= X\beta+Zb+\epsilon,\\ b&\sim N\left( 0, \frac{\sigma_{\epsilon}^{2}}{\lambda}D^{-1} \right),\\ \epsilon&\overset{\mathrm{iid.}}{\sim} N(0, \sigma^{2}_{\epsilon}). \end{align*}$

^09ce88

Of course, $\sigma^{2}_{\epsilon}$ is not determined here; we only care about the value of the noise-to-signal ratio $\lambda:= \frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{b}}$ if we assume $D=I$.

### Selecting the Penalty $\lambda$

From a purely theoretical standpoint, we can estimate $\lambda$ using the above characterization as a mixed linear model:

> [!definition|*] Mixed Model Likelihood-Based Selection
> Set $D=I$ in the mixed model, and compute $\hat{\lambda}={\widehat{\sigma^{2}_{\epsilon}}}~ \Big/ ~{\widehat{\sigma^{2}_{b}}},$ where $\widehat{\sigma_{\epsilon}^{2}},\widehat{\sigma^{2}_{b}}$ are the MLE or REML estimates in the mixed model.

Alternatively, we can use standard model selection methods like [[Cross Validation|cross validation]]. Since a [[#Splines as Linear Smoothers|spline is a linear smoother]], the same [[Cross Validation#LOOCV for Linear Smoothers|convenient results for LOOCV]] apply.

## Cubic Splines

As explained above, $C^{2}$ continuity requires a polynomial basis to be at least cubic, and indeed **cubic splines** are the most common polynomial splines. But why stop at cubics?

### Optimality Results

> [!bigidea] Cubic splines make the least wiggly interpolant, and are optimizers of the wiggliness-penalized curve fitting problem.

Consider an interpolation task on a 2D dataset $\mathcal{D}=(\mathbf{x},\mathbf{y})$ with strictly increasing feature $x_{1}<\cdots<x_{n}$.
- One candidate interpolant is a polynomial of order $n$ -- see [[Polynomial Interpolation]].

> [!definition|*] Natural Cubic Splines
> An interpolating cubic spline $g$ of the dataset is called a **natural cubic spline** if
> - $g$ is a piecewise cubic polynomial with knots at each $x_{i}$,
> - $g$ is continuous up to the second derivative (so $g \in C^{2}$),
> - and $g$ is linear beyond the end points, with $g''(x_{1})=g''(x_{n})=0$.

The cubic spline has enough flexibility (esp. continuous second derivatives) to look smooth to the human eye, and is stable enough to avoid excessive overfitting.
- The last requirement of natural cubic splines avoids the erratic extrapolation behavior of cubics (and of high-order polynomials in general).

The natural cubic interpolant is optimal in the sense that

> [!theorem|*] Optimality of the natural cubic spline as interpolant
> *The natural cubic interpolant is the least wiggly interpolant*, i.e. it is the minimizer of the problem $\min_{\substack{\hat{f} \in C^{2}[x_{1},x_{n}] \\ \text{ interpolating }}} \int_{x_{1}}^{x_{n}} \hat{f}''(x)^{2} ~ dx.$
>
> > [!proof]-
> > Let the natural cubic spline be $g$, and consider any other interpolant $\hat{f}$. Define $h:=\hat{f}-g$.
> >
> > Then $\begin{align*} \int \hat{f}''^{2} ~ dx&= \int (h''+g'')^{2} ~ dx\\ &= \int h''^{2} ~ dx +2\int h''g'' ~ dx +\int g''^{2} ~ dx. \end{align*}$ The cross term evaluates to $0$: integrating by parts on each interval between knots, $\begin{align*} \int h''g'' ~ dx&= \sum_{i=1}^{n-1}\left( \big[ h'g'' \big]_{x_{i}}^{x_{i+1}}-\int _{x_{i}}^{x_{i+1}}h'g''' ~ dx \right)\\ &= g''(x_{n})h'(x_{n})-g''(x_{1})h'(x_{1})-\sum_{i=1}^{n-1}g'''(x_{i}^{\ast})\big( h(x_{i+1})-h(x_{i}) \big)\\ &= 0, \end{align*}$ where $x^{\ast}_{i} \in (x_{i}, x_{i+1})$: the boundary terms telescope (as $g''$ is continuous) and vanish because the natural cubic spline has $g''=0$ at the end points, while the remaining sum vanishes because $g'''$ is constant on each interval and $h$ is $0$ at every knot (both $g$ and $\hat{f}$ interpolate the data).
> >
> > Therefore, $\int \hat{f}''^{2} ~ dx \ge \int g''^{2} ~ dx$, with equality only when $\hat{f}=g$.
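A quick numerical illustration of this theorem (a hedged sketch assuming `scipy` is available; the toy data and the not-a-knot competitor are assumptions of this sketch): among $C^{2}$ interpolants of the same points, the natural cubic spline has the smallest integrated squared second derivative.

```python fold
# Compare the "wiggliness" of the natural cubic interpolant with another C^2
# cubic interpolant (scipy's default not-a-knot boundary conditions).
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 10))
y = np.sin(2 * np.pi * x)

def roughness(spline, a, b, m=4001):
    grid = np.linspace(a, b, m)
    d2 = spline.derivative(2)(grid)
    return np.sum(d2**2) * (grid[1] - grid[0])   # Riemann approximation of int f''^2 dx

natural = CubicSpline(x, y, bc_type="natural")
not_a_knot = CubicSpline(x, y)                   # another C^2 interpolant of the same data

print("natural   :", roughness(natural, x[0], x[-1]))
print("not-a-knot:", roughness(not_a_knot, x[0], x[-1]))   # never smaller, by the theorem
```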
With a more complex dataset containing dense or even duplicate $x$-values, it is not desirable (or even possible) to fit an interpolant. Instead, it is easier to model $Y=f(X)+\epsilon,$ allowing for some error in approximating $f$. Using splines for this purpose produces **smoothing splines**, an example of [[Scatterplot Smoothers|scatterplot smoothers]].

Following the optimality of the interpolating natural cubic spline, smoothing splines are also the best at reducing wiggliness:

> [!theorem|*] Optimality of Cubic Splines
> Fixing a penalty parameter $\lambda$, the optimizer of the [[#^3e1e8f|wiggliness-penalized curve fitting problem $(*)$]] is a cubic spline.
>
> > [!proof]-
> > Suppose for contradiction that some other (non-cubic-spline) function $h$ is the minimizer. The interpolating natural cubic spline $g$ with $g(\mathbf{x}')=h(\mathbf{x}')$ (where $\mathbf{x}'$ are the unique $x$-values) attains the same $l_{2}$ loss but a strictly smaller wiggliness penalty whenever $h \neq g$, contradicting the optimality of $h$.

### Choice of Basis: B-Splines

The problem remains of choosing a basis of piecewise cubic polynomials with which to construct the cubic spline. The most common choice is the **B-spline** basis $\{ B_{i,k} ~|~i=1,\dots,n \}$, with $i$ indexing location and $k$ the degree.

> [!idea] Why B-splines
> The main strength of cubic B-splines is that any linear combination of them is $C^{2}$-continuous at the interior knots. To see why, see the technicalities box below.

The basis is defined over a set of (internal and boundary) knots $\{ t_{i} ~|~ i=0,\dots,N+1 \}$ and a maximum degree $d$ ($d=3$ for cubic splines).

> [!exposition] Technicalities of Defining the B-Spline
>
> Augment the knots by adding $d-1$ extra knots beyond $t_{0}$ and $t_{N+1}$ respectively, e.g. by adding $d-1$ copies of each. This gives the knot sequence $t:=(t_{0},\dots,t_{0},t_{1},\dots,t_{N+1}, \dots, t_{N+1}),$ which we then re-index with $j=0,\dots,N+2d-1$.
>
> Now recursively define the basis with a linear transition between two lower-order bases: defining $\alpha(x;a,b):= \frac{x-a}{b-a}$, with its graph joining $(a,0),(b,1)$ and representing the increasing half of the transition, the recursive definition is
> $\begin{align*}
> B_{i,k}(x)&= \alpha(x;t_{i},t_{i+k})\cdot B_{i,k-1}(x)\\
> &~~~~~~+ (1-\alpha(x; t_{i+1},t_{i+k+1}) )\cdot B_{i+1,k-1}(x).
> \end{align*}$
> Lastly, the base case is simply the indicator of the intervals between knots: $B_{i,0}(x):= \mathbf{1}\{ x \in [t_{i},t_{i+1}) \}.$

Visually, B-spline bases are recursively constructed as:

![[BSplineDefinition.png#invert|center]]

```python fold
def bspline(x, knots, i, degree = 3):
    # pad the boundary knots with (degree - 1) extra copies each, as in the box above
    padded_knots = [knots[0]] * (degree - 1) + knots + [knots[-1]] * (degree - 1)

    def bspline_(x, knots, i, degree = 3):
        # base case: indicator of the interval [t_i, t_{i+1})
        if degree == 0:
            return (x >= knots[i]) * (x < knots[i+1])
        # Cox-de Boor recursion: linearly blend two lower-degree B-splines
        left_component = bspline_(x, knots, i, degree - 1)
        right_component = bspline_(x, knots, i + 1, degree - 1)
        left_proportion = (x - knots[i]) / (knots[i + degree] - knots[i])
        right_proportion = (knots[i + degree + 1] - x) / (knots[i + degree + 1] - knots[i + 1])
        return left_component * left_proportion + right_component * right_proportion

    # shift the index so that i refers to the original (unpadded) knots
    return bspline_(x, knots=padded_knots, i = i + degree - 1, degree = degree)
```

```python fold
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ax = plt.subplot()
plot_x = np.linspace(-0.5, 4.5, 100)
knots = list(range(8))

# the two lower-degree components B(i, k-1) and B(i+1, k-1)
for i, label in zip(range(2), ["B(i, k-1)", "B(i+1, k-1)"]):
    low_order_spline = bspline(plot_x, knots, i, degree = 2)
    sns.lineplot(x = plot_x, y = low_order_spline, alpha = 0.3, color = "#aaaaaa", ax = ax)
    plt.fill_between(x = plot_x, y1 = low_order_spline, alpha = 0.2, color = "#aaaaaa")
    highest_point_idx = np.argmax(low_order_spline)
    ax.text(x = plot_x[highest_point_idx], y = 0.01 + low_order_spline[highest_point_idx],
            s = label, ha = 'center', va = 'bottom')

# the blended higher-degree B-spline B(i, k)
high_order_spline = bspline(plot_x, knots, 0, degree = 3)
sns.lineplot(x = plot_x, y = high_order_spline, ax = ax)
plt.fill_between(x = plot_x, y1 = high_order_spline, alpha = 0.3)
ax.text(x = 2, y = 0.5, s = "B(i, k)", ha = 'center', va = 'bottom',
        color = "#233f7d", size = 14, fontweight = 'bold')

ax.set_xticks([0, 1, 3, 4])
ax.set_xticklabels(["x(i)", "x(i+1)", "x(i+k)", "x(i+k+1)"])
ax.set_xlim(-0.5, 4.5)
ax.spines[["left", 'right', 'top']].set_visible(False)
ax.set_yticks([])
```

As a result, $B_{i,k}$ inherits smoothness from its components: by induction on $k$, it is continuous up to the $(k-1)$th derivative at the knots shared by both $B_{i,k-1}$ and $B_{i+1,k-1}$ (i.e. $t_{i+1},\dots,t_{i+k}$).

Moreover, it has the [excellent property](https://doi.org/10.1006/jath.1998.3310) that

> [!lemma|*] Bound on the condition number of the augmented data
> The augmented design matrix $\mathbf{B}(\mathbf{x})$ has [[Numerical Stability#^af2303|condition number]] $\kappa_{2}(\mathbf{B}(\mathbf{x})) < k 2^{k},$ where $k$ is the order of the basis (i.e. $k=3$ for cubic splines).

- As a result, the Gram matrix $\mathbf{B}^{T}\mathbf{B}$ has condition number bounded by $\kappa_{2}(\mathbf{B})^{2}<k^{2}2^{2k}$, which is usually small enough for most purposes.

### Splines for Regression

Denoting a spline in the predictor $X$ as $f(X)=\sum_{l}\theta_{l}h_{l}(X)$, using $f(X)$ as a predictor for the response $Y$ provides a basis expansion (a fitting sketch follows at the end of this subsection).

Splines can also be used for regularizing regression coefficients: in this example, a logistic regression is fitted to classify a phoneme based on certain measurements at different frequencies.

![[SplineCoeffSmoothing.png#invert|w80|center]]

- If the predictors are meaningfully indexed (in this case by frequency), we can replace the raw coefficients (jagged grey line) with their estimates from a spline fitted against the index (smooth red curve).
- This effectively forces the coefficients to change smoothly, which reduces variance and can give better predictions.
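As a small end-to-end illustration, here is a hedged sketch of least-squares regression on a cubic B-spline basis built with the `bspline` helper above. The knot grid, the toy data, and the choice to skip the boundary B-splines that touch the duplicated end knots (to keep the indexing simple) are all assumptions of this sketch.

```python fold
# Regression y ~ f(x) with f expanded in cubic B-splines, plus the intercept and
# linear terms assumed to be in the span of the basis (as noted earlier).
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4, 60))
y = np.sin(1.5 * x) + rng.normal(0, 0.2, size=len(x))

knots = list(np.linspace(-1, 5, 10))        # knots extend slightly beyond the data

def design(x_vals):
    cols = [np.ones_like(x_vals), x_vals]
    # interior cubic B-splines only; boundary ones hitting duplicated knots are skipped
    cols += [bspline(x_vals, knots, i) for i in range(len(knots) - 4)]
    return np.column_stack(cols)

theta, *_ = np.linalg.lstsq(design(x), y, rcond=None)
x_grid = np.linspace(0, 4, 200)
f_hat = design(x_grid) @ theta              # fitted spline  f(x) = sum_l theta_l h_l(x)
```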
## Splines as Linear Smoothers

If we fit the coefficients of $b$ using some form of linear regression $\mathbf{y} \sim \mathbf{X}$ ($\mathbf{X}$ being the augmented design matrix), spline regressions are a form of [[Linear Smoothers]], with some hat matrix $\mathbf{S}$ that maps the response $\mathbf{y}$ to the predicted values $\hat{\mathbf{y}}:= \mathbf{Sy}.$

### Ranks

Since $\mathbf{S}=\mathbf{X}(\cdots)$ is a matrix product, its rank equals that of $\mathbf{X}$ -- the dimension of the expanded basis. Therefore, if the size of the basis $J$ is smaller than the number of observations $n$, the sorted eigenvalues of $\mathbf{S}$ are $\lambda_{1}>\dots>\lambda_{\sim J}>\lambda_{\sim J+1}=\dots=\lambda_{n}=0.$

> [!warning]- Number of non-0 eigenvalues
> Note that the number of non-$0$ eigenvalues is not exactly $J$, since the expanded basis may be linearly dependent when evaluated on the dataset.
>
> For example, in a dataset with sorted $\mathbf{x}$, $(x-x_{1})_{+}^{3}$ is linearly independent as a function from $\{ 1,x,x^{2},x^{3} \}$, but when evaluated on the dataset, $(\mathbf{x}-x_{1})^{3}_{+}$ is in the span of $\{ \mathbf{1},\mathbf{x},\mathbf{x}^{2},\mathbf{x}^{3} \}$.

Therefore, one can consider $\mathbf{S}$ as a compression of the response $\mathbf{y}$ into a lower-dimensional representation.

Example: for a dataset with $n=20$, we fit cubic splines with (1, left) knots at every data point, and (2, right) only 9 knots.

![[SplineBasesRank.png#invert|center]]

We see that the 9-knot spline has eigenvalues that vanish after a rank of 11 ($9$ knots $+$ $4$ bases in $\{ 1,x,x^2,x^3 \}$, minus $2$ ranks lost to dependence issues as above).

### Degree of Freedom

Following the definition in [[Degree of Freedom#Special Forms for Linear Estimators]], we can use $\mathrm{df}(\lambda):= \mathrm{tr}(\mathbf{S}_{\lambda}),$ where $\mathbf{S}_{\lambda}$ is the hat matrix fitted using penalty $\lambda$ (a sketch follows below).

When $\lambda=0$, we have $\mathrm{df}(0)=\mathrm{tr}(\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T})=J=(p+1)+K,$ i.e. the number of columns in $\mathbf{X}$, which has $p+1$ polynomial terms and $K$ knots/truncated-power terms.

When $\lambda \to \infty$, only unpenalized terms can have non-0 coefficients, and the problem reduces to fitting OLS using those terms. If there are $r$ of them, then standard OLS results give $\mathrm{df}(\infty)=r$.

On the other hand, the [[Degree of Freedom#Residual Degree of Freedom]] is $\mathrm{df}^{\ast}:= n-2\,\mathrm{tr}(\mathbf{S}_{\lambda})+\mathrm{tr}(\mathbf{S}_{\lambda}^{T}\mathbf{S}_{\lambda}).$
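A self-contained sketch of the rank and degrees-of-freedom computations (the toy data, knot placement, and use of a pseudo-inverse to guard against the dependence issue are assumptions of this sketch, so the numbers need not match the figure above):

```python fold
# Unpenalized hat matrix S for a cubic truncated-power basis: its eigenvalues are
# ~1 up to the rank of the design and ~0 afterwards, and df(0) = tr(S) equals the
# number of linearly independent columns.
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 20))
y = rng.normal(size=20)

knots = np.quantile(x, np.linspace(0.1, 0.9, 9))
X = np.column_stack([x**p for p in range(4)]
                    + [np.clip(x - k, 0, None)**3 for k in knots])

S = X @ np.linalg.pinv(X.T @ X) @ X.T          # hat matrix (pinv handles near-dependence)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

print("df(0) = tr(S) =", round(np.trace(S), 2))
print("eigenvalues   =", np.round(eigvals, 2)) # ~1 up to the rank, then ~0
```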
## Inference

Treating $\hat{f}$ as random (due to randomness in the response $\mathbf{y}$), we are interested in the accuracy of the fitted relationship $\hat{f}$. We can ask pointwise questions like:
- What is the pointwise variability $\mathrm{Var}(\hat{f}(x))$ or MSE $\mathbb{E}(\hat{f}(x)-f(x))^{2}$?
- What is the predictive variability $\mathbb{E}(\hat{f}(x)-Y)^{2}$ for some new, independent $Y~|~X=x$?

We can also ask global inference questions, e.g. for simultaneous confidence bands over the whole range of $x$ (see the last subsection below).

### Local Variability

The local variability can be captured with some estimate of $\mathrm{sd}(\hat{f}(x)):= \sqrt{ \mathrm{Var}(\hat{f}(x)) },$ and with this estimate, if we further assume $\hat{f}(x)$ is Gaussian (e.g. assuming the data $Y~|~X$ is Gaussian), we have the confidence interval $\hat{f}(x) \pm z_{\alpha / 2}\cdot\mathrm{sd}(\hat{f}(x)).$

The task remains to find the estimate $\widehat{\mathrm{sd}}(\hat{f}(x))$. If $\hat{f}(x)$ is linear in the data $\mathbf{y}$, say $\hat{f}(x)=w(x)\cdot \mathbf{y}$ for some fixed weights $w(x)$, and we further assume homoscedasticity $\mathrm{Cov}(\mathbf{y})=\mathrm{Cov}(\pmb{\epsilon})=\sigma^{2}I$, we have $\begin{align*} \mathrm{Var}(\hat{f}(x))&= w(x)^{T}\mathrm{Cov}(\mathbf{y})w(x)=\sigma^{2}\| w(x) \| ^{2},\\ &\Longrightarrow \mathrm{sd}(\hat{f}(x))=\sigma \| w(x) \|, \end{align*}$ and the standard deviation can be estimated as $\widehat{\mathrm{sd}}(\hat{f}(x))=\hat{\sigma}\| w(x) \|$ using some $\hat{\sigma}$ that estimates the variability of the noise term $\epsilon$.

This approach has a few issues:
- $w(x)$ is not usually fixed -- it can depend on penalty parameters that are themselves chosen based on the dataset.
- $\hat{\sigma}$ also depends on the data -- this can be accounted for with a $t$ distribution with $\mathrm{df}=[\mathrm{df}_{\text{residual}}]$ (the brackets indicate rounding) instead of a Gaussian.
- It is a confidence interval not of $f(x)$ but of $\mathbb{E}\hat{f}(x)$, so it does not account for bias.

Approximating the predictive interval (i.e. that of $Y~|~X=x$) only requires a small tweak to account for the extra variance of $Y$ (compared to $f(x)$): with $\mathbb{E}$ denoting expectation conditional on $X=x$, $\begin{align*} \mathbb{E}[(Y-\hat{f}(x))^{2}]&= \mathbb{E}[(Y-f(x))^{2}]+\mathbb{E}[(f(x)-\hat{f}(x))^{2}] &&[Y,\hat{f}\text{ are indep.}]\\ &\approx \mathrm{Var}(Y~|~X=x)+ \mathrm{Var}(\hat{f}(x)) && [f(x) \approx \mathbb{E}\hat{f}(x)]\\ &= \sigma^{2}+\sigma^{2}\| w(x) \| ^{2}, \end{align*}$ so the predictive interval is $\hat{f}(x)\pm z_{\alpha / 2} \sqrt{ \hat{\mathbb{E}}[(Y-\hat{f}(x))^{2}]}=\hat{f}(x)\pm z_{\alpha / 2}\cdot\hat{\sigma} \sqrt{ 1+\| w (x)\|^{2} }.$

In the case of splines fitted with the wiggliness penalty matrix $D$ and penalty $\lambda$, $w(x)=\mathbf{X}^{\ast}(\mathbf{X}^{\ast T} \mathbf{X}^{\ast}+\lambda D)^{-1}\mathbf{b}(x)^{T},$ where $\mathbf{X}^{\ast}=\mathbf{B}(\mathbf{x})$ is the augmented design matrix, and $\mathbf{b}(x)=(b_{1}(x),\dots)$ is the basis evaluated at $x$ (a sketch follows below).
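A hedged sketch of this pointwise interval, reusing the hypothetical `B`, `D`, `y`, `basis_funcs` objects from the penalized-fit sketch earlier (the noise-scale estimate `sigma_hat` is also an assumed input):

```python fold
# Pointwise CI at a new point x_new:  w(x) = B (B'B + lambda D)^{-1} b(x),
# sd(f_hat(x)) = sigma * ||w(x)||, interval = f_hat(x) +- z * sd.
import numpy as np
from scipy.stats import norm

def pointwise_ci(x_new, B, D, y, lam, sigma_hat, basis_funcs, alpha=0.05):
    A = np.linalg.inv(B.T @ B + lam * D)             # (X*'X* + lambda D)^{-1}
    beta_hat = A @ B.T @ y
    b_x = np.array([f(x_new) for f in basis_funcs])  # basis evaluated at x_new
    w = B @ A @ b_x                                  # weights such that f_hat(x) = w . y
    f_hat = b_x @ beta_hat
    sd = sigma_hat * np.linalg.norm(w)
    z = norm.ppf(1 - alpha / 2)
    return f_hat - z * sd, f_hat + z * sd

# e.g.: lo, hi = pointwise_ci(0.5, B, D, y, lam=1e-3, sigma_hat=0.2, basis_funcs=basis_funcs)
```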
### Inference Using Mixed Models

In contrast, we do not need to ignore the bias $\mathbb{E} \hat{f}(x)-f(x)$ if we use the mixed model formulation. Suppose we have the formulation above:

![[#^09ce88]]

Let $\tilde{f}(x)=\mathbf{b}(x)\begin{bmatrix}\tilde{\beta} \\ \tilde{b}\end{bmatrix}$ be the estimator given by the [[Mixed Linear Models#BLUP of $ beta,b$|BLUPs]] $\tilde{\gamma}=\begin{bmatrix}\tilde{\beta} \\\tilde{b}\end{bmatrix}$. We are interested in $\begin{align*} \mathbb{E}(\hat{f}(x)-f(x))^{2} &\approx \mathbb{E}(\tilde{f}(x)-f(x))^{2}\\ &= \mathrm{Var}(\tilde{f}(x)-f(x))\\ &= \mathbf{b}(x)^{T}\mathrm{Cov}(\tilde{\gamma}-\gamma)\mathbf{b}(x). & ({\dagger}) \end{align*}$

In our case we have $\begin{align*}\Sigma&= \sigma^{2}_{b}D^{-1},\\ \mathbf{C}&= \mathbf{X}^{\ast},\end{align*}$ so plugging into the formula $\begin{align*}\tilde{\gamma}&= (\mathbf{C}^{T} \mathbf{C}+\sigma^{2}_{\epsilon}\cdot\widetilde{\Sigma^{-1}})^{-1}\mathbf{C}^{T}\mathbf{y}\\ &= (\mathbf{X}^{\ast T} \mathbf{X}^{\ast}+\lambda\cdot\widetilde{D})^{-1}\mathbf{X}^{\ast T}(\mathbf{X}^{\ast}\gamma+\pmb{\epsilon}) \end{align*}$ (where $\tilde{D}$ pads $D$ with zeros for the unpenalized $\beta$) gives $\mathrm{Cov}(\tilde{\gamma}-\gamma)=\sigma^{2}_{\epsilon}(\mathbf{X}^{\ast T}\mathbf{X}^{\ast}+\lambda \tilde{D})^{-1}$ (algebra omitted).

Therefore, plugging this into $({\dagger})$ gives $\begin{align*} \mathbb{E}(\hat{f}(x)-f(x))^{2}&\approx \sigma^{2}_{\epsilon}\mathbf{b}(x)^{T}(\mathbf{X}^{\ast T}\mathbf{X}^{\ast}+\lambda \tilde{D})^{-1}\mathbf{b}(x),\\ \widehat{\mathrm{sd}}(\hat{f}(x)-f(x))&\approx \hat{\sigma}_{\epsilon} \sqrt{ \mathbf{b}(x)^{T}(\mathbf{X}^{\ast T}\mathbf{X}^{\ast}+\lambda \tilde{D})^{-1}\mathbf{b}(x) }. \end{align*}$

### Simultaneous Confidence Intervals

Let $L(x),U(x)$ be some lower and upper bounds on $f$ that we want to study. They are ordinary confidence intervals if $\mathbb{P}[L(x) \le f(x) \le U(x)] \ge 1-\alpha~~\forall x \in \mathcal{X},$ where $\alpha$ is some fixed level. Here $L, U$ are the random quantities, since they are estimated from the data.

They are **simultaneous confidence intervals**, in contrast, if $\mathbb{P}[L(x) \le f(x) \le U(x) ~~ \forall x \in \mathcal{X}] \ge 1-\alpha.$ Note how this is far stricter than the definition above for ordinary CIs.

For splines, simultaneous CIs can be approximated with a grid $\mathbf{g}:=(g_{1},\dots,g_{M})$ over $\mathcal{X}$, requiring $\{ L(x) \le f(x)\le U(x) ~~\forall x \in \mathcal{X} \} \approx \{ L(x) \le f(x)\le U(x) ~~\forall x \in \mathbf{g} \},$ by appealing to the continuity of $f$ and choosing a fine mesh for $\mathbf{g}$. Therefore, we look for $L,U$ such that $\mathbb{P}[L(x) \le f(x) \le U(x) ~~\forall x \in \mathbf{g}] \gtrsim 1-\alpha.$

Let our estimate $\hat{f}$ be the estimated BLUP of $f$, so that over the grid, $\hat{f}(\mathbf{g})-f(\mathbf{g})=\mathbf{B}(\mathbf{g})(\hat{\gamma}-\gamma),$ and the coefficients have distribution $\hat{\gamma}-\gamma \sim N(\mathbf{0}, ~\sigma^{2}_{\epsilon}(\mathbf{X}^{\ast T}\mathbf{X}^{\ast}+\lambda \tilde{D})^{-1}),$ so after pre-multiplying by $\mathbf{B}(\mathbf{g})$, $\hat{f}(\mathbf{g})-f(\mathbf{g}) \sim N(\mathbf{0}, ~\mathbf{B}(\mathbf{g})\,\mathrm{Cov}(\hat{\gamma}-\gamma)\,\mathbf{B}(\mathbf{g})^{T}).$

Now for the simultaneous interval, we only need a bound such that $| \hat{f}(x)-f(x) | \le \text{the bound}~~\forall x \in \mathbf{g}$ with high probability. We can obtain a relatively sharp bound using the largest variance-adjusted error over the grid, $\max_{x \in \mathbf{g}}\left| \frac{\hat{f}(x)-f(x)}{\widehat{\mathrm{sd}}(\hat{f}(x)-f(x))} \right|,$ and take its (approximate) upper-$\alpha$ quantile $q_{1-\alpha}$ by simulating from the Gaussian above, giving the band $(L(x),U(x))=\hat{f}(x) \pm q_{1-\alpha}\cdot \widehat{\mathrm{sd}}(\hat{f}(x)-f(x)).$

## References

[The R package crs's primer document](https://cran.r-project.org/web/packages/crs/vignettes/spline_primer.pdf)

[Paper by Buja, Hastie, and Tibshirani](https://www.stat.cmu.edu/~rnugent/PCMI2016/papers/LinearSmoothersBuja.pdf)