Consider a linear regression model $Y=\beta^{T}X + \epsilon$ with noise $\epsilon \sim N(0,\sigma^{2})$ and prior $\beta \sim N(\mathbf{0}, \Sigma)$. We can conduct inference directly on $\mathbf{y} ~|~ \mathbf{X}$, since $p(\mathbf{y} ~|~ \mathbf{X})=\int \pi(\beta)\, p(\mathbf{y} ~|~ \mathbf{X},\beta)~ d\beta=N(\mathbf{0}, \sigma^{2}I+\mathbf{X}\Sigma \mathbf{X}^{T}).$
Therefore, we can remove the parametrization by integrating out $\beta$ .
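As a minimal numerical sketch of this marginalisation (the dimensions, $\sigma^{2}$, and $\Sigma$ below are arbitrary toy choices), the marginal covariance $\sigma^{2}I+\mathbf{X}\Sigma\mathbf{X}^{T}$ can be formed directly, and a Monte Carlo pass through the parametric route recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- assumed toy setup ---
n, d = 5, 3                        # rows of X, dimension of beta
sigma2 = 0.25                      # noise variance
Sigma = np.diag([1.0, 2.0, 0.5])   # prior covariance of beta
X = rng.normal(size=(n, d))

# Marginal covariance of y | X with beta integrated out: sigma^2 I + X Sigma X^T
K = sigma2 * np.eye(n) + X @ Sigma @ X.T

# Monte Carlo check via the parametric route: sample beta, then y = X beta + eps
betas = rng.multivariate_normal(np.zeros(d), Sigma, size=100_000)
ys = betas @ X.T + rng.normal(scale=np.sqrt(sigma2), size=(100_000, n))

print(np.max(np.abs(np.cov(ys, rowvar=False) - K)))  # small relative to K's entries
```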
A Gaussian process captures this non-parametric idea by directly modeling some function/process $f: \mathcal{X} \to \mathbb{R}$, with the rather general assumption that
> [!definition|*] Gaussian Process
> A function $f$ is a **Gaussian process** if its pointwise values $\{ F_{x} := f(x) ~|~ x \in \mathcal{X} \}$ are jointly Gaussian.
>
> This means that, when sampled at any finite set of points $(x_{1},\dots,x_{n})$, $n \ge 1$, the values $(F_{x_{1}},\dots, F_{x_{n}}) := (f(x_{1}),\dots,f(x_{n}))$ follow a multivariate normal distribution for some suitable mean and covariance.
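A minimal sketch of this finite-dimensional view (the squared-exponential covariance below is an assumed illustrative choice): at any finite set of inputs, the GP is just an ordinary multivariate normal.

```python
import numpy as np

rng = np.random.default_rng(1)

def k(x, xp):
    """Assumed squared-exponential covariance with unit lengthscale and variance."""
    return np.exp(-0.5 * (x - xp) ** 2)

x = np.array([0.0, 0.3, 1.0, 2.5])      # any finite set of inputs
K = k(x[:, None], x[None, :])           # covariance of (f(x_1), ..., f(x_n))

# One draw of (f(x_1), ..., f(x_n)) -- a plain multivariate normal sample
f = rng.multivariate_normal(np.zeros(len(x)), K)
print(f)

# Consistency: dropping inputs just drops rows/columns of K,
# e.g. (f(x_1), f(x_2)) ~ N(0, K[:2, :2]).
```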
## In-Sample GP Regression
In a regression problem, we can model $f(X):=\mathbb{E}[Y ~|~ X]$ as a GP with our choice of $\Sigma$, then use observed data $\mathbf{X},\mathbf{y}$ to conduct inference on $\mathbf{f}:=(f(x_{1}),\dots,f(x_{n}))$. Since we haven't introduced new data (yet), this is an in-sample problem.
### GP Posterior with Gaussian Noise
Suppose that, instead of $\mathbf{f}$ itself, we observe a corrupted version $\mathbf{y}=\mathbf{f}+\pmb{\epsilon}$, where $\pmb{\epsilon} \sim N(0, \Sigma_{\epsilon})$ is independent Gaussian noise.
To conduct inference on $\mathbf{f}$, we have $\begin{bmatrix} \mathbf{y} \\ \mathbf{f} \end{bmatrix} \sim N\left( 0, \begin{bmatrix} \Sigma_{\epsilon} + \Sigma & \Sigma \\ \Sigma & \Sigma \end{bmatrix} \right),$ so the posterior is $\mathbf{f} ~|~ \mathbf{y} \sim N(\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y},\ \Sigma-\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\Sigma).$
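A minimal sketch of this conditioning step (the inputs, squared-exponential prior covariance, and noise level below are assumed toy choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# --- assumed toy setup ---
x = np.linspace(0.0, 5.0, 8)
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # assumed prior covariance of f
Sigma_eps = 0.1 * np.eye(len(x))                       # homoscedastic noise

# Simulate y = f + eps from the model itself
f_true = rng.multivariate_normal(np.zeros(len(x)), Sigma)
y = f_true + rng.multivariate_normal(np.zeros(len(x)), Sigma_eps)

# Posterior of f | y via the conditioning formula above
A = np.linalg.solve(Sigma_eps + Sigma, Sigma).T        # = Sigma (Sigma_eps + Sigma)^{-1}, by symmetry
post_mean = A @ y
post_cov = Sigma - A @ Sigma

print(post_mean)
print(np.diag(post_cov))                               # pointwise posterior variances
```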
### Maximizing Joint Distribution $p(\mathbf{f}, \mathbf{y})$
Maximizing the posterior is equivalent to maximizing the joint likelihood $p(\mathbf{f},\mathbf{y})$ (since the marginal $p(\mathbf{y})$ is a constant), whose log is $\begin{align*} l(\mathbf{f},\mathbf{y})&= l(\mathbf{y} ~|~ \mathbf{f})+l(\mathbf{f})\\ &= \mathrm{const.}-\frac{1}{2}\| \mathbf{y}-\mathbf{f} \|_{\Sigma_{\epsilon}^{-1}}^{2} -\frac{1}{2}\mathbf{f}^{T}\Sigma^{-1} \mathbf{f}. \end{align*}$
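A quick numerical check of this claim (toy inputs, kernel, and noise are again assumed; scipy is used only as a generic optimiser): maximising the log joint over $\mathbf{f}$ recovers the posterior mean from the previous section.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# --- assumed toy setup ---
x = np.linspace(0.0, 4.0, 6)
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-8 * np.eye(len(x))
Sigma_eps = 0.2 * np.eye(len(x))
y = rng.multivariate_normal(np.zeros(len(x)), Sigma + Sigma_eps)

Sigma_inv = np.linalg.inv(Sigma)
Sigma_eps_inv = np.linalg.inv(Sigma_eps)

def neg_log_joint(f):
    """-l(f, y) up to a constant: data-fit term plus GP prior term."""
    r = y - f
    return 0.5 * r @ Sigma_eps_inv @ r + 0.5 * f @ Sigma_inv @ f

f_map = minimize(neg_log_joint, x0=np.zeros(len(x))).x
f_post = np.linalg.solve(Sigma_eps + Sigma, Sigma).T @ y   # posterior mean from above

print(np.max(np.abs(f_map - f_post)))   # ~0: the MAP coincides with the posterior mean
```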
By writing $\mathbf{f}:=\Sigma\beta$ for some $\beta$ (recall that $\Sigma$ is positive definite, hence invertible), optimizing this is equivalent to the problem $\min_{\beta} \|\mathbf{y}-\Sigma\beta \|_{\Sigma_{\epsilon}^{-1}}^{2}+ \beta^{T}\Sigma\beta.$
A quick differentiation gives $\nabla_{\!\beta}\left( \| \mathbf{y}-\Sigma \beta \|^{2}_{\Sigma^{-1}_{\epsilon}}+\beta^{T}\Sigma \beta \right)=-2\Sigma \Sigma_{\epsilon}^{-1}\mathbf{y}+2\Sigma \Sigma_{\epsilon}^{-1}\Sigma \beta+2\Sigma \beta,$ so setting it to zero gives the closed-form solution $\hat{\beta}=(\Sigma_{\epsilon}^{-1}\Sigma+I)^{-1}\Sigma^{-1}_{\epsilon}\mathbf{y}=(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y}.$ The fitted values $\hat{\mathbf{f}}=\Sigma\hat{\beta}=\Sigma(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y}$ agree with the posterior mean from the previous section, and with homoscedastic noise $\Sigma_{\epsilon}=\sigma^{2}I$ the solution becomes $\hat{\beta}=(\Sigma+\sigma^{2}I)^{-1}\mathbf{y}.$
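The simplification of $\hat{\beta}$ above is just the one-line identity $(\Sigma_{\epsilon}^{-1}\Sigma+I)^{-1}\Sigma_{\epsilon}^{-1}=\left[ \Sigma_{\epsilon}(\Sigma_{\epsilon}^{-1}\Sigma+I) \right]^{-1}=(\Sigma+\Sigma_{\epsilon})^{-1}.$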
## Covariance Models
It is common to assume the jointly Gaussian distribution has mean $0$ (if we believe otherwise, we can WLOG subtract the non-zero mean from $f$), so the main task is to select the **covariance function** $k(x,x'):= \mathrm{Cov}(f(x),f(x'))$.
But what makes a good covariance function, and how does it determine the shape of the GP estimates?
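As a small taste of the second question (the squared-exponential kernel and the two lengthscales below are assumed toy choices), the covariance function controls how strongly nearby values of $f$ are tied together, and hence how wiggly sampled functions are:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 100)

def sq_exp(x, lengthscale):
    """Assumed squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

for ell in (0.2, 2.0):
    K = sq_exp(x, ell) + 1e-8 * np.eye(len(x))      # jitter for numerical stability
    L = np.linalg.cholesky(K)
    f = (L @ rng.normal(size=(len(x), 3))).T        # three sample paths over the grid
    # smaller lengthscales decorrelate neighbouring inputs, so increments are larger
    print(ell, np.mean(np.abs(np.diff(f, axis=1))))
```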
### GP Inference in RKHS
Notice how similar the posterior log-likelihood is to the kernel ridge regression objective in an [[Reproducing Kernel Hilbert Space|RKHS]]:
> [!embed]
> ![[Reproducing Kernel Hilbert Space#^3f51e8]]
The two problems are equivalent if (1) the noise is homoscedastic, $\Sigma_{\epsilon}=\sigma^{2}I$, with the regularisation strength equal to the noise level, $\lambda= \sigma^{2}$; and (2) $K=\Sigma$ (i.e. the RKHS kernel $\kappa$ is the GP's covariance function $k$).
In terms of the original RKHS penalised ERM, we get $\min_{f}~ \| \mathbf{y}-f(\mathbf{X}) \|^{2}+{\sigma^{2}}\| f \|_{\mathcal{H}(k)}^{2},$ i.e. a generic functional optimisation problem, regularised by the "complexity" of $f$ in the RKHS, which measures how well its values $\mathbf{f}$ fit under the GP prior.
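A quick sanity check of this equivalence (a sketch assuming scikit-learn is available, with an assumed squared-exponential kernel of lengthscale $1$ and $\sigma^{2}=0.1$): kernel ridge regression with $\lambda=\sigma^{2}$ and the GP posterior mean give the same in-sample fit.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(5)

# --- assumed toy data ---
X = rng.uniform(0.0, 5.0, size=(30, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=30)

lengthscale, sigma2 = 1.0, 0.1

# GP regression: the noise variance enters through `alpha`;
# `optimizer=None` keeps the kernel hyperparameters fixed.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=lengthscale),
                               alpha=sigma2, optimizer=None).fit(X, y)

# Kernel ridge: lambda = sigma^2, and gamma = 1 / (2 l^2) makes
# sklearn's "rbf" kernel match the GP kernel above.
krr = KernelRidge(alpha=sigma2, kernel="rbf",
                  gamma=1.0 / (2.0 * lengthscale ** 2)).fit(X, y)

print(np.allclose(gpr.predict(X), krr.predict(X)))   # True
```

The point estimates coincide; the GP route additionally returns posterior variances.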
- A high $\sigma^{2}$ means that we can explain most of the variance in $\mathbf{y}$ with noise, hence we heavily penalise the complexity $\| f \|_{\mathcal{H}(k)}^{2}$ of $f$.