Consider a linear regression model $Y=\beta^{T}X + \epsilon$ with prior $\beta \sim N(\mathbf{0}, \Sigma)$. We can conduct inference directly on $Y ~|~ X$ as $p(y ~|~ \mathbf{X})=\int \pi(\beta) p(y ~|~ \mathbf{X},\beta)~ d\beta=N(\mathbf{0}, \sigma^{2}I+\mathbf{X}\Sigma \mathbf{X}^{T}).$ Therefore, we can remove the parametrization by integrating out $\beta$. A Gaussian process captures this non-parametric idea by directly modeling some function/process $f: \mathcal{X} \to \mathbb{R}$, with the rather general assumption that

> [!definition|*] Gaussian Process
> A function $f$ is a **Gaussian process** if its pointwise values $\{ F_{x} := f(x) ~|~ x \in \mathcal{X} \}$ are jointly Gaussian.
>
> This means that when sampled at any finite set of points $(x_{1},\dots,x_{n})$, for any $n \ge 1$, the values are distributed as $(f(x_{1}),\dots,f(x_{n}))=(F_{x_{1}},\dots, F_{x_{n}}) \sim N$ for some suitable mean and covariance.

## In-Sample GP Regression

In a regression problem, we can model $f(X):=\mathbb{E}[Y ~|~ X]$ as a GP with our choice of covariance $\Sigma$, then use observed data $\mathbf{X},\mathbf{y}$ to conduct inference on $\mathbf{f}:=(f(x_{1}),\dots,f(x_{n}))$. Since we haven't introduced new data (yet), this is an in-sample problem.

### GP Posterior with Gaussian Noise

Suppose we observe a corrupted version of $\mathbf{f}$, say $\mathbf{y}=\mathbf{f}+\pmb{\epsilon},$ where $\pmb{\epsilon} \sim N(0, \Sigma_{\epsilon})$ is assumed to be independent Gaussian noise. To conduct inference on $\mathbf{f}$, we have $\begin{bmatrix} \mathbf{y} \\ \mathbf{f} \end{bmatrix} \sim N\left( 0, \begin{bmatrix} \Sigma_{\epsilon} + \Sigma & \Sigma \\ \Sigma & \Sigma \end{bmatrix} \right),$ so the posterior is $\mathbf{f} ~|~ \mathbf{y} \sim N(\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y}, \Sigma-\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\Sigma).$

### Maximizing Joint Distribution $p(\mathbf{f}, \mathbf{y})$

Maximizing the posterior is equivalent to maximizing the joint likelihood $p(\mathbf{f},\mathbf{y})$ (since the marginal $p(\mathbf{y})$ is constant in $\mathbf{f}$), whose log is $\begin{align*} l(\mathbf{f},\mathbf{y})&= l(\mathbf{y} ~|~ \mathbf{f})+l(\mathbf{f})\\ &= \mathrm{const.}-\frac{1}{2}\| \mathbf{y}-\mathbf{f} \| _{\Sigma_{\epsilon}^{-1}}^{2} -\frac{1}{2}\mathbf{f}^{T}\Sigma^{-1} \mathbf{f}. \end{align*}$

By writing $\mathbf{f}:=\Sigma\beta$ for some $\beta$ (recall that $\Sigma$ is positive definite hence invertible), maximizing this is equivalent to the problem $\min_{\beta} \|\mathbf{y}-\Sigma\beta \|_{\Sigma_{\epsilon}^{-1}}^{2}+ \beta^{T}\Sigma\beta.$ A quick differentiation gives $\nabla_{\!\beta}\left( \| \mathbf{y}-\Sigma \beta \|^{2}_{\Sigma^{-1}_{\epsilon}}+\beta^{T}\Sigma \beta \right)=-2\Sigma \Sigma_{\epsilon}^{-1}\mathbf{y}+2\Sigma \Sigma_{\epsilon}^{-1}\Sigma \beta+2\Sigma \beta,$ so setting it to zero gives the closed-form solution $\hat{\beta}=(\Sigma_{\epsilon}^{-1}\Sigma+I)^{-1}\Sigma^{-1}_{\epsilon}\mathbf{y}=(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y},$ hence $\hat{\mathbf{f}}=\Sigma\hat{\beta}=\Sigma(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y}$, which agrees with the posterior mean from the previous section; with homoscedastic noise $\Sigma_{\epsilon}=\sigma^{2}I$, it is $\hat{\beta}=(\Sigma+\sigma^{2}I)^{-1}\mathbf{y}.$

## Covariance Models

It is common to assume the jointly Gaussian distribution has mean $0$ (if we believe otherwise, we can WLOG subtract the mean function from $f$), so the main task is to select the **covariance function** $k(x,x'):= \mathrm{Cov}(f(x),f(x'))$. But what makes a good covariance function, and how does it determine the shape of the GP estimates?
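To get a first numerical feel for this, here is a minimal NumPy sketch (the toy 1-D data and parameter values are made up; it uses the Gaussian/RBF kernel $\exp(-\lambda\| x-x' \|^{2})$ introduced later in this section, with homoscedastic noise) that evaluates the in-sample posterior from above for two values of $\lambda$: a larger $\lambda$ makes the correlation decay faster, giving wigglier, more local estimates.

```python
import numpy as np

# Toy 1-D data (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

def rbf(x1, x2, lam):
    """Gaussian/RBF covariance k(x, x') = exp(-lam * (x - x')^2) for 1-D inputs."""
    return np.exp(-lam * (x1[:, None] - x2[None, :]) ** 2)

sigma2 = 0.1 ** 2  # homoscedastic noise, Sigma_eps = sigma^2 * I

for lam in (5.0, 500.0):  # larger lam => correlations decay faster => wigglier estimates
    Sigma = rbf(x, x, lam)
    A = Sigma + sigma2 * np.eye(x.size)  # Sigma_eps + Sigma
    # Posterior: f | y ~ N(Sigma (Sigma_eps + Sigma)^{-1} y,
    #                      Sigma - Sigma (Sigma_eps + Sigma)^{-1} Sigma)
    post_mean = Sigma @ np.linalg.solve(A, y)
    post_sd = np.sqrt(np.diag(Sigma - Sigma @ np.linalg.solve(A, Sigma)))
    print(f"lambda={lam}: mean[:3]={post_mean[:3]}, sd[:3]={post_sd[:3]}")
```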
### GP Inference in RKHS

Notice how similar the posterior log-likelihood is to the kernel ridge regression objective in an [[Reproducing Kernel Hilbert Space|RKHS]]:

> [!embed]
> ![[Reproducing Kernel Hilbert Space#^3f51e8]]

The two problems are equivalent if (1) we have homoscedastic noise $\Sigma_{\epsilon}=\sigma^{2}I$, with $\lambda= \sigma^{2}$ being the amount of noise, and (2) $K=\Sigma$ (i.e. the RKHS's kernel $\kappa$ is the GP's covariance function $k$). In terms of the original RKHS penalised ERM, we get $\min_{f}~ \| \mathbf{y}-f(\mathbf{X}) \|^{2}+{\sigma^{2}}\| f \|_{\mathcal{H}(k)}^{2},$ i.e. a generic functional optimisation problem, regularised by the function's "complexity" in the RKHS, i.e. how plausible its values $\mathbf{f}$ are under the GP prior.

- A high $\sigma^{2}$ means that we can explain most of the variance in $\mathbf{y}$ with noise, hence we heavily penalise $f$'s complexity, pulling the estimate towards the prior mean $0$.
- Conversely, a low $\sigma^{2}$ means that we really need $f(\mathbf{X})$ to fit $\mathbf{y}$ very well.

Therefore,

> [!idea]
> The GP estimate is a linear combination of the covariance functions/kernels $\kappa(\cdot, x_{i}), ~i=1,\dots,n$. So if we choose a linear kernel, we get linear estimates (and recover the Bayesian LR at the start of the note, since the Gram matrix reduces to $\mathbf{X}^{T}\mathbf{X}$), etc.

### Choosing Covariance Functions

The **Gaussian/RBF** kernel $\exp(-\lambda \| x_{1}-x_{2} \|^{2})$ is a basic choice for $k(x_{1},x_{2})$, but it applies the same scale $\lambda$ to every coordinate, so it is not scale-invariant. Therefore, it is common to use the **automatic relevance determination (ARD)** kernel, which has hyperparameters $\tau$ (overall scale) and $\eta\in \mathbb{R}^{p}$ (scale of each coordinate). It is essentially an unnormalised Gaussian density with covariance matrix $\mathrm{diag}(\eta)^2$: $k_{\tau,\eta}(x,x'):= \tau \exp\left( -\frac{1}{2}\| x-x' \|_{\mathrm{diag}(\eta)^{-2}}^2 \right)= \tau \exp\left( -\frac{1}{2}\sum_{i} \frac{(x_{i}-x'_{i})^{2}}{\eta_{i}^{2}} \right).$

We can perform hyperparameter selection on $\tau$ and $\eta$, hence "automatically" (not really) determining the relevance of each coordinate -- a large $\eta_{i}$ means that the $i$th coordinate is not too important.

One downside of RBF/ARD is that they are giga smooth -- they (hence their estimates) are infinitely differentiable. This may cause too much bias when fitting rougher patterns. Therefore, another common choice is the **Matérn kernel**.

## Prediction with GP Regression

Now suppose we collected new inputs $x^{\ast}=(x^{\ast}_{1},\dots,x^{\ast}_{k})$, and wish to make predictions $\mathbf{f}^{\ast}:=(f(x_{1}^{\ast}),\dots,f(x_{k}^{\ast}))$. To keep things simple, first suppose we can directly observe $\mathbf{f}$, and use that to infer the distribution of $f$ at arbitrary new inputs $x^{\ast}$. We have $\begin{bmatrix} \mathbf{f} \\ \mathbf{f}^{\ast}\end{bmatrix} \sim N\left(0, \begin{bmatrix}\Sigma & C\\ C^{T} & \Sigma^{\ast}\end{bmatrix}\right),$ where the entries are $\Sigma:=\mathrm{Cov}(\mathbf{f})$, $\Sigma ^{\ast}:= \mathrm{Cov}(\mathbf{f}^{\ast})$, and $C:=\mathrm{Cov}(\mathbf{f},\mathbf{f}^{\ast})=(k(x_{i}, x^{\ast}_{j}))_{i,j}$, the $n \times k$ **cross-covariance matrix** of $\mathbf{f}$ and $\mathbf{f}^{\ast}$.
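To make these objects concrete, here is a minimal NumPy sketch (the dimensions, inputs, and hyperparameter values are made up for illustration) that implements the ARD kernel above and assembles $\Sigma$, $\Sigma^{\ast}$, and the cross-covariance $C$:

```python
import numpy as np

def ard_kernel(X1, X2, tau, eta):
    """ARD kernel k(x, x') = tau * exp(-0.5 * sum_i (x_i - x'_i)^2 / eta_i^2)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / eta  # pairwise differences, scaled per coordinate
    return tau * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

# Made-up observed and new inputs in R^p with p = 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))         # n = 50 observed inputs
X_star = rng.standard_normal((5, 3))     # k = 5 new inputs

tau = 1.0
eta = np.array([0.5, 2.0, 10.0])         # a large eta_i downweights the i-th coordinate

Sigma = ard_kernel(X, X, tau, eta)                 # Cov(f),  n x n
Sigma_star = ard_kernel(X_star, X_star, tau, eta)  # Cov(f*), k x k
C = ard_kernel(X, X_star, tau, eta)                # cross-covariance of f and f*, n x k
```

These are exactly the blocks that enter the conditional distribution derived next.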
Then standard Gaussian results give $\mathbf{f}^{\ast} ~|~ \mathbf{f} \sim N\left( C^{T}\Sigma^{-1}\mathbf{f}, \Sigma^{\ast}-C^T \Sigma^{-1}C \right).$

Now we add back the noise: we observe $\mathbf{y}$, where $\mathbf{y}~|~ \mathbf{f} \sim N(\mathbf{f}, \Sigma_{\epsilon})$, instead of $\mathbf{f}$ itself. Since the noise is Gaussian, we do not need to explicitly integrate out $\mathbf{f}$ to know that $(\mathbf{y},\mathbf{f}^{\ast})$ is also jointly Gaussian. Their cross-covariance is simply $\mathrm{Cov}(\mathbf{y},\mathbf{f}^{\ast})=\mathrm{Cov}(\mathbf{f}+\pmb{\epsilon},\mathbf{f}^{\ast})=\mathrm{Cov}(\mathbf{f},\mathbf{f}^{\ast})=C,$ as the noise is independent of all else. Therefore $\begin{bmatrix} \mathbf{y} \\ \mathbf{f}^{\ast} \end{bmatrix}\sim N\left(0, \begin{bmatrix}\Sigma_{\epsilon}+\Sigma & C\\ C^{T} & \Sigma^{\ast}\end{bmatrix}\right),$ and conditioning as before gives the predictive distribution $\mathbf{f}^{\ast} ~|~ \mathbf{y} \sim N\left( C^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y},~ \Sigma^{\ast}-C^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}C \right).$

## GP with Non-Gaussian Likelihoods

Things quickly break down when the likelihood is not Gaussian:

- When the noise $\epsilon$ is not Gaussian, $\mathbf{y} ~|~ \mathbf{f}$ is not Gaussian; instead $p(\mathbf{y} ~|~ \mathbf{f})=\prod_{i}f_{\epsilon}(y_{i}-f_{i})$ if we assume $\epsilon_{i} \overset{\mathrm{iid.}}{\sim} f_{\epsilon}$.
- If the additive-noise model $y_{i}=f_{i}+\epsilon_{i}$ does not hold at all, for example in a classification setting with a link function $\psi$ where $\psi(f(x))=\mathbb{P}[Y=1 ~|~ X=x]$, we have $p(\mathbf{y} ~|~ \mathbf{f})=\prod_{i} \psi(f_{i})^{y_{i}} \cdot (1-\psi(f_{i}))^{1-y_{i}}.$

In that case, we have to compute the Bayes posterior $p(\mathbf{f} ~|~ \mathbf{y}) = \frac{p(\mathbf{y} ~|~ \mathbf{f})\cdot \pi(\mathbf{f})}{p(\mathbf{y})}.$ For most non-Gaussian likelihoods, $p(\mathbf{y})$ (and by extension the posterior $p(\mathbf{f} ~|~ \mathbf{y})$) has no closed form. There are a few ways forward:

- Approximate the normalizing constant $p(\mathbf{y})$;
- Resort to using [[Simulation|various sampling methods]] on $p(\mathbf{f} ~|~ \mathbf{y})$, many of which do not require knowing the normalizing constant;
- Approximate $p(\mathbf{y} ~|~ \mathbf{f})$, e.g. with the [[Laplace Approximation]], to make it Gaussian, hence conjugate with the prior.

To conduct predictions, we have $p(\mathbf{f}^{\ast} ~|~ \mathbf{y})=\int \underset{\text{noise-free GP}}{p(\mathbf{f}^{\ast} ~|~ \mathbf{f})}\cdot \underset{\text{posterior}}{p(\mathbf{f}~|~\mathbf{y})} ~ d\mathbf{f},$ and any of the three methods above allows us to evaluate the integral. Another alternative is to replace all intractable terms with point estimates, here $p(\mathbf{f}^{\ast} ~|~ \mathbf{y}) \approx p(\mathbf{f}^{\ast} ~|~ \hat{\mathbf{f}}_{\text{MAP}}),$ where we replace $p(\mathbf{f} ~|~ \mathbf{y})$ with a unit mass placed on $\hat{\mathbf{f}}_{\text{MAP}}:= \underset{\mathbf{f}}{\arg\max}~p(\mathbf{f} ~|~ \mathbf{y})=\underset{\mathbf{f}}{\arg\max}~\pi(\mathbf{f})\cdot p(\mathbf{y} ~|~ \mathbf{f}),$ the maximum a posteriori (MAP) estimate of $\mathbf{f}$. It is easy to solve for using numerical methods (see the sketch in the next subsection).

### GP Regression with Activation Functions

Similar to [[Generalized Linear Models|GLMs]] like logistic regression, we can plug the GP into a link function, e.g. the logistic (inverse-logit) map $\psi:f \mapsto \frac{e^{f}}{1+e^{f}}$, to squish it into $[0,1]$ and output it as a probability, say $\mathbb{P}[Y=1 ~|~ X=x]$. For simplicity we consider prediction at a single new point $x^{\ast}$, but the computations extend easily to multiple new points.
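As a sketch of the numerical step mentioned above (finding $\hat{\mathbf{f}}_{\text{MAP}}$), here is a minimal Newton iteration for the Bernoulli/logistic likelihood; the toy data, function name, and iteration count are illustrative choices, not part of the derivation.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def gp_map_logistic(Sigma, y, n_iter=25):
    """Newton iterations for f_MAP = argmax_f log p(y|f) - 0.5 * f^T Sigma^{-1} f,
    where p(y_i|f_i) = Bernoulli(sigmoid(f_i)) and y_i is in {0, 1}."""
    n = len(y)
    Sigma_inv = np.linalg.inv(Sigma + 1e-8 * np.eye(n))  # small jitter for stability
    f = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(f)
        grad = y - p                  # gradient of log p(y|f)
        W = np.diag(p * (1.0 - p))    # negative Hessian of log p(y|f)
        # Newton update: f <- (Sigma^{-1} + W)^{-1} (W f + grad)
        f = np.linalg.solve(Sigma_inv + W, W @ f + grad)
    return f

# Toy example: 1-D inputs, RBF covariance, labels that switch at x = 0.5.
x = np.linspace(0, 1, 30)
Sigma = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)
y = (x > 0.5).astype(float)
f_map = gp_map_logistic(Sigma, y)
print(sigmoid(f_map)[:5])  # fitted P[Y = 1] at the first few training inputs
```

The resulting $\hat{\mathbf{f}}_{\text{MAP}}$ (or a Gaussian approximation around it) is what gets plugged into the predictive formula below.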
We have the prediction $\begin{align*} \mathbb{P}[Y =1 ~|~ x^{\ast}]&= \mathbb{E}_{f^{\ast}}[\psi(f^{\ast})]\\ &= \int \psi(f^{\ast})\cdot p(f^{\ast} ~|~ \mathbf{y})~ df^{\ast} \\ &= \iint \psi(f^{\ast})\cdot p(f^{\ast} ~|~ \mathbf{f})\cdot p(\mathbf{f} ~|~ \mathbf{y}) ~df^{\ast}d\mathbf{f}. \end{align*}$ This is even worse than the non-Gaussian noise, as