Consider a linear regression model $Y=\beta^{T}X + \epsilon$ with noise $\epsilon \sim N(0,\sigma^{2})$ and prior $\beta \sim N(\mathbf{0}, \Sigma)$. Given a design matrix $\mathbf{X}$, we can conduct inference directly on $\mathbf{y} ~|~ \mathbf{X}$ via the marginal $p(\mathbf{y} ~|~ \mathbf{X})=\int \pi(\beta)\, p(\mathbf{y} ~|~ \mathbf{X},\beta)~ d\beta$, which is the density of $N(\mathbf{0}, \sigma^{2}I+\mathbf{X}\Sigma \mathbf{X}^{T})$.
Therefore, we can remove the parametrisation by integrating out $\beta$ .
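As a sanity check, here is a minimal NumPy sketch (variable names are illustrative, not from any particular library) comparing the closed-form marginal covariance $\sigma^{2}I+\mathbf{X}\Sigma \mathbf{X}^{T}$ with a Monte Carlo estimate obtained by sampling $\beta$ and $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma2 = 5, 3, 0.1
X = rng.standard_normal((n, d))
Sigma_beta = np.eye(d)                                   # prior covariance of beta

# Closed form: Cov(y | X) = X Sigma_beta X^T + sigma^2 I after integrating out beta
analytic = X @ Sigma_beta @ X.T + sigma2 * np.eye(n)

# Monte Carlo: sample beta and eps, form y = X beta + eps, estimate Cov(y)
betas = rng.multivariate_normal(np.zeros(d), Sigma_beta, size=100_000)
eps = np.sqrt(sigma2) * rng.standard_normal((100_000, n))
ys = betas @ X.T + eps
empirical = np.cov(ys, rowvar=False)
print(np.max(np.abs(empirical - analytic)))              # small sampling error
```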
A Gaussian process captures this non-parametric idea by directly modelling some function/process $f: \mathcal{X} \to \mathbb{R}$, with the rather general assumption that
> [!definition|*] Gaussian Process
> A random function $f$ is a **Gaussian process** if its pointwise values $\{ F_{x} := f(x) ~|~ x \in \mathcal{X} \}$ are jointly Gaussian.
>
> This means that when sampled at any finite subset of points $(x_{1},\dots,x_{n})$, for any $n \ge 1$, the values $(f(x_{1}),\dots,f(x_{n}))=(F_{x_{1}},\dots, F_{x_{n}})$ follow a multivariate normal distribution with some suitable mean and covariance.
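Since the definition only constrains finite-dimensional marginals, sampling a GP at finitely many inputs is just sampling a multivariate normal. A minimal sketch, assuming a squared-exponential covariance purely for illustration:

```python
import numpy as np

def rbf(x, xp, lengthscale=1.0):
    """Squared-exponential covariance k(x, x') on 1-D inputs (one possible choice)."""
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale**2)

# Pick any finite set of inputs; the values (F_{x_1}, ..., F_{x_n}) are jointly
# Gaussian by definition, so we can draw them directly.
x = np.linspace(0, 5, 50)
K = rbf(x, x) + 1e-9 * np.eye(len(x))        # small jitter for numerical stability
rng = np.random.default_rng(0)
f_draws = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # three prior draws of f at x
```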
## GP Regression with Gaussian Noise
In a regression problem, we can model $f(X):=\mathbb{E}[Y ~|~ X]$ as a GP with a covariance of our choice, then use the observed data $\mathbf{X},\mathbf{y}$ to conduct inference on $\mathbf{f}:=(f(x_{1}),\dots,f(x_{n}))$, whose prior covariance we denote $\Sigma:=\mathrm{Cov}(\mathbf{f})$. Since we haven't introduced new data (yet), this is an in-sample problem.
### In-Sample GP Posterior
Suppose $\mathbf{y}$ is a corrupted version of $\mathbf{f}$, say
$\mathbf{y}=\mathbf{f}+\pmb{\epsilon},$
where $\pmb{\epsilon} \sim N(0, \Sigma_{\epsilon})$ is assumed to be independent Gaussian noise.
To conduct inference on $\mathbf{f}$, we have
$\begin{bmatrix}
\mathbf{y} \\ \mathbf{f}
\end{bmatrix} \sim N\left( 0, \begin{bmatrix}
\Sigma_{\epsilon} + \Sigma & \Sigma \\ \Sigma & \Sigma
\end{bmatrix} \right),$
so the posterior is
$\mathbf{f} ~|~ \mathbf{y} \sim N(\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y}, \Sigma-\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\Sigma).$
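A minimal sketch of this in-sample posterior (the helper name and arguments are ours, not a library API):

```python
import numpy as np

def insample_posterior(y, Sigma, Sigma_eps):
    """Posterior of f given y = f + eps, with f ~ N(0, Sigma) and eps ~ N(0, Sigma_eps)."""
    S = Sigma_eps + Sigma
    A = np.linalg.solve(S, Sigma).T      # Sigma (Sigma_eps + Sigma)^{-1}, using symmetry of S and Sigma
    mean = A @ y                         # posterior mean
    cov = Sigma - A @ Sigma              # posterior covariance
    return mean, cov
```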
### Maximising Joint Distribution $p(\mathbf{f}, \mathbf{y})$
Maximising the posterior is equivalent to maximising the joint density $p(\mathbf{f},\mathbf{y})$ (since the marginal $p(\mathbf{y})$ does not depend on $\mathbf{f}$), whose log is $\begin{align*}
l(\mathbf{f},\mathbf{y})&= l(\mathbf{y} ~|~ \mathbf{f})+l(\mathbf{f})\\
&= \mathrm{const.}-\frac{1}{2}\| \mathbf{y}-\mathbf{f} \| _{\Sigma_{\epsilon}^{-1}}^{2} -\frac{1}{2}\mathbf{f}^{T}\Sigma^{-1} \mathbf{f}.
\end{align*}$
By writing $\mathbf{f}:=\Sigma\beta$ for some $\beta$ (recall that $\Sigma$ is positive definite hence invertible), optimising this is equivalent to the problem
$\min_{\beta} \|\mathbf{y}-\Sigma\beta \|_{\Sigma_{\epsilon}^{-1}}^{2}+ \beta^{T}\Sigma\beta.$
A quick differentiation gives $\nabla_{\!\beta}\left( \| \mathbf{y}-\Sigma \beta \|^{2}_{\Sigma^{-1}_{\epsilon}}+\beta^{T}\Sigma \beta \right)=-2\Sigma \Sigma_{\epsilon}^{-1}\mathbf{y}+2\Sigma \Sigma_{\epsilon}^{-1}\Sigma \beta+2\Sigma \beta,$ so setting it to zero gives the closed-form solution $\hat{\beta}=(\Sigma_{\epsilon}^{-1}\Sigma+I)^{-1}\Sigma^{-1}_{\epsilon}\mathbf{y}=(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y},$ which agrees with the previous section; with homoscedastic noise $\Sigma_{\epsilon}=\sigma^{2}I$, it becomes $\hat{\beta}=(\Sigma+\sigma^{2}I)^{-1}\mathbf{y}.$
Moreover, since the posterior mean is also the MAP of $\mathbf{f}$, we have
$\mathbb{E}[\mathbf{f} ~|~ \mathbf{y}]= \Sigma \hat{\beta}.$
One implication is that *the posterior mean is a linear combination of the covariance function $k$ evaluated at $\mathbf{x}$*.
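A quick numerical check of this derivation, with illustrative values: the gradient vanishes at $\hat{\beta}=(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y}$, so $\Sigma\hat{\beta}$ is indeed the MAP/posterior mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
Sigma = B @ B.T + np.eye(n)                    # an arbitrary positive-definite prior covariance
Sigma_eps = 0.5 * np.eye(n)                    # homoscedastic noise, sigma^2 = 0.5
y = rng.standard_normal(n)

beta_hat = np.linalg.solve(Sigma + Sigma_eps, y)        # (Sigma + Sigma_eps)^{-1} y

# Gradient of ||y - Sigma beta||^2_{Sigma_eps^{-1}} + beta^T Sigma beta, evaluated at beta_hat
grad = (-2 * Sigma @ np.linalg.solve(Sigma_eps, y)
        + 2 * Sigma @ np.linalg.solve(Sigma_eps, Sigma @ beta_hat)
        + 2 * Sigma @ beta_hat)
print(np.max(np.abs(grad)))      # ~0, so beta_hat is the minimiser and Sigma @ beta_hat is E[f | y]
```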
### Predictive GP Posterior
Now suppose we collected new data points $x^{\ast}=(x^{\ast}_{1},\dots,x^{\ast}_{m})$, and wish to make predictions $\mathbf{f}^{\ast}:=(f(x_{1}^{\ast}),\dots,f(x_{m}^{\ast}))$.
To keep things simple, first suppose we can directly observe $\mathbf{f}$, and use that to infer its distribution at arbitrary new inputs $x^{\ast}$. We have $\begin{bmatrix} \mathbf{f} \\ \mathbf{f}^{\ast}\end{bmatrix} \sim N\left(0, \begin{bmatrix}\Sigma & C\\ C^{T} & \Sigma^{\ast}\end{bmatrix}\right),$where the entries are $\Sigma:=\mathrm{Cov}(\mathbf{f})$, $\Sigma ^{\ast}:= \mathrm{Cov}(\mathbf{f}^{\ast})$, and $C:=C(\mathbf{f},\mathbf{f}^{\ast})=(k(x_{i}, x^{\ast}_{j}))_{i,j}$ the $n \times m$ **cross covariance matrix** of $\mathbf{f},\mathbf{f}^{\ast}$; it will be abbreviated as $C$ for now. Then standard Gaussian results give $\mathbf{f}^{\ast} ~|~ \mathbf{f} \sim N\left( C^{T}\Sigma^{-1}\mathbf{f}, \Sigma^{\ast}-C^T \Sigma^{-1}C \right).$
Now we add back the noise: instead of $\mathbf{f}$ itself, we observe $\mathbf{y}$ where $\mathbf{y} ~|~ \mathbf{f} \sim N(\mathbf{f}, \Sigma_{\epsilon})$. Since the noise is Gaussian, we do not need to explicitly marginalise to know that $(\mathbf{y},\mathbf{f}^{\ast})$ are also jointly Gaussian. Their cross-covariance is simply $\mathrm{Cov}(\mathbf{y},\mathbf{f}^{\ast})=\mathrm{Cov}(\mathbf{f}+\pmb{\epsilon},\mathbf{f}^{\ast})=\mathrm{Cov}(\mathbf{f},\mathbf{f}^{\ast})=C$, as the noise is independent of everything else.
Therefore
$\begin{bmatrix}
\mathbf{y} \\ \mathbf{f}^{\ast}
\end{bmatrix}\sim N\left(0, \begin{bmatrix}\Sigma_{\epsilon}+\Sigma & C\\ C^{T} & \Sigma^{\ast}\end{bmatrix}\right).$
The predictive distribution is
$\mathbf{f}^{\ast} ~|~ \mathbf{y} \sim N(C^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y},~ \Sigma^{\ast}- C^T (\Sigma_{\epsilon}+\Sigma)^{-1}C).$
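A minimal sketch of the predictive posterior (our own helper; it assumes `k(A, B)` returns the matrix $(k(a_{i},b_{j}))_{ij}$, e.g. the `rbf` sketched earlier):

```python
import numpy as np

def gp_predict(X, y, X_star, k, sigma2):
    """Predictive mean and covariance of f* | y with i.i.d. Gaussian noise Sigma_eps = sigma2 * I."""
    Sigma = k(X, X)                                   # Cov(f)
    C = k(X, X_star)                                  # cross-covariance Cov(f, f*), shape (n, m)
    Sigma_star = k(X_star, X_star)                    # Cov(f*)
    Ky = Sigma + sigma2 * np.eye(len(X))              # Sigma_eps + Sigma
    mean = C.T @ np.linalg.solve(Ky, y)               # C^T (Sigma_eps + Sigma)^{-1} y
    cov = Sigma_star - C.T @ np.linalg.solve(Ky, C)   # Sigma* - C^T (Sigma_eps + Sigma)^{-1} C
    return mean, cov
```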
### GP Posterior is GP
For prediction at a single test point $x'$, we have $f(x') ~|~ \mathbf{y} \sim N\left( \sum_{i=1}^{n}k(x',x_{i})\left((\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y}\right)_{i} ,\dots\right),$
so *the (predictive) posterior mean is a linear combination of the kernel evaluations $k(x',x_{i})$ at the training inputs.*
For a pair of test points $x, x'$, the predictive covariance is $\begin{pmatrix}
k(x,x) & k(x,x') \\ k(x',x) & k(x',x')
\end{pmatrix} - C(x,x')^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}C(x,x'),$
with $C(x,x'):= \begin{pmatrix}k(x,x_{1}) & k(x',x_{1}) \\ \vdots & \vdots \\ k(x,x_{n}) & k(x',x_{n})\end{pmatrix}$. Therefore the posterior values also form a GP, with covariance function $k_{\text{posterior}}(x,x')= k(x,x')- k(x,\mathbf{x})^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}k(x', \mathbf{x}),$ where $k(x,\mathbf{x}):=(k(x,x_{1}),\dots,k(x,x_{n}))^{T}$.
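As a concrete check (reusing the illustrative `rbf` and `gp_predict` helpers from above), the off-diagonal entry of the two-point predictive covariance matches $k_{\text{posterior}}(x,x')$:

```python
import numpy as np

X = np.array([0.0, 1.0, 2.5])                 # training inputs
y = np.array([0.3, -0.1, 0.8])                # noisy observations
sigma2 = 0.1

x_pair = np.array([0.7, 1.9])                 # the two test points x, x'
_, cov = gp_predict(X, y, x_pair, rbf, sigma2)

# cov[0, 1] should equal k(x, x') - k(x, x_train)^T (Sigma_eps + Sigma)^{-1} k(x', x_train)
Ky = rbf(X, X) + sigma2 * np.eye(len(X))
k_post = rbf(x_pair[:1], x_pair[1:])[0, 0] - rbf(X, x_pair[:1]).ravel() @ np.linalg.solve(
    Ky, rbf(X, x_pair[1:]).ravel())
print(np.isclose(cov[0, 1], k_post))          # True: the posterior is again a GP with this kernel
```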
## Covariance Models
It is common to assume the jointly Gaussian distribution has mean $0$ (if we believe otherwise, we can WLOG subtract the mean function from $f$), so the main task is to select the **covariance function** $k(x,x'):= \mathrm{Cov}(f(x),f(x'))$.
But what makes a good covariance function, and how does it determine the shape of the GP estimates?
### GP Inference in RKHS
Notice how similar the posterior log-likelihood is to the kernel ridge regression objective in an [[Reproducing Kernel Hilbert Space|RKHS]]:
> [!embed]
> ![[Reproducing Kernel Hilbert Space#^3f51e8]]
The two problems are equivalent if (1) the noise is homoscedastic, $\Sigma_{\epsilon}=\sigma^{2}I$, with the regularisation weight $\lambda= \sigma^{2}$ equal to the noise variance, and (2) $K=\Sigma$ (i.e. the RKHS's kernel $\kappa$ is the GP's covariance function $k$).
In terms of the original RKHS penalised ERM, we get $\min_{f}~ \| \mathbf{y}-f(\mathbf{X}) \|^{2}+{\sigma^{2}}\| f \|_{\mathcal{H}(k)}^{2},$ i.e. a generic functional optimisation problem, regularised by the "complexity" of $f$ in the RKHS, which measures how plausible its values $\mathbf{f}$ are under the GP prior.
- A high $\sigma^{2}$ means that we can explain most of the variance in $\mathbf{y}$ with noise, hence we heavily penalise the complexity $\| f \|_{\mathcal{H}(k)}^{2}$ of $f$ and prefer simpler fits.
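To see the equivalence concretely, here is a sketch (again with the illustrative `rbf` from above, and SciPy's generic optimiser) that minimises the kernel ridge objective numerically and compares the fit with the in-sample GP posterior mean:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.linspace(0, 3, 6)
y = np.sin(X) + 0.1 * rng.standard_normal(6)
sigma2 = 0.1
K = rbf(X, X)                                  # Gram matrix of the chosen kernel

# Kernel ridge objective in the representer parameterisation f(X) = K @ alpha,
# with lambda = sigma^2:  ||y - K alpha||^2 + sigma^2 * alpha^T K alpha
obj = lambda a: np.sum((y - K @ a) ** 2) + sigma2 * a @ K @ a
alpha_opt = minimize(obj, np.zeros(len(X))).x

gp_mean = K @ np.linalg.solve(K + sigma2 * np.eye(len(X)), y)   # Sigma (Sigma + sigma^2 I)^{-1} y
print(np.max(np.abs(K @ alpha_opt - gp_mean)))                  # ~0 up to optimiser tolerance
```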