Consider a linear regression model $Y=\beta^{T}X + \epsilon$ with prior $\beta \sim N(\mathbf{0}, \Sigma)$. We can conduct inference directly on $Y ~|~ X$ as $p(y ~|~ \mathbf{X})=\int \pi(\beta) p(y ~|~ \mathbf{X},\beta)~ d\beta=N(\mathbf{0}, \sigma^{2}I+\mathbf{X}\Sigma \mathbf{X}^{T}).$ Therefore, we can remove the parametrisation by integrating out $\beta$.

A Gaussian process captures this non-parametric idea by directly modelling some function/process $f: \mathcal{X} \to \mathbb{R}$, with the rather general assumption that

> [!definition|*] Gaussian Process
> A function $f$ is a **Gaussian process** if its pointwise values $\{ F_{x} := f(x) ~|~ x \in \mathcal{X} \}$ are jointly Gaussian.
>
> This means that when sampled at any finite subset of points $(x_{1},\dots,x_{n})$, for any $n \ge 1$, they are distributed as $(f(x_{1}),\dots,f(x_{n}))=(F_{x_{1}},\dots, F_{x_{n}}) \sim N$ for some suitable mean and covariance.

## GP Regression with Gaussian Noise

In a regression problem, we can model $f(X):=\mathbb{E}[Y ~|~ X]$ as a GP with our choice of covariance $\Sigma$, then use observed data $\mathbf{X},\mathbf{y}$ to conduct inference on $\mathbf{f}:=(f(x_{1}),\dots,f(x_{n}))$. Since we haven't introduced new data (yet), this is an in-sample problem.

### In-Sample GP Posterior

Suppose $\mathbf{y}$ is a corrupted version of $\mathbf{f}$, say $\mathbf{y}=\mathbf{f}+\pmb{\epsilon},$ where $\pmb{\epsilon} \sim N(0, \Sigma_{\epsilon})$ is assumed to be independent Gaussian noise. To conduct inference on $\mathbf{f}$, we have $\begin{bmatrix} \mathbf{y} \\ \mathbf{f} \end{bmatrix} \sim N\left( 0, \begin{bmatrix} \Sigma_{\epsilon} + \Sigma & \Sigma \\ \Sigma & \Sigma \end{bmatrix} \right),$ so the posterior is $\mathbf{f} ~|~ \mathbf{y} \sim N(\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y}, \Sigma-\Sigma(\Sigma_{\epsilon}+\Sigma)^{-1}\Sigma).$

### Maximising Joint Distribution $p(\mathbf{f}, \mathbf{y})$

Maximising the posterior is equivalent to maximising the joint likelihood $p(\mathbf{f},\mathbf{y})$ (since the marginal $p(\mathbf{y})$ is a constant), whose log is
$\begin{align*} l(\mathbf{f},\mathbf{y})&= l(\mathbf{y} ~|~ \mathbf{f})+l(\mathbf{f})\\ &= \mathrm{const.}-\frac{1}{2}\| \mathbf{y}-\mathbf{f} \| _{\Sigma_{\epsilon}^{-1}}^{2} -\frac{1}{2}\mathbf{f}^{T}\Sigma^{-1} \mathbf{f}. \end{align*}$
By writing $\mathbf{f}:=\Sigma\beta$ for some $\beta$ (recall that $\Sigma$ is positive definite, hence invertible), optimising this is equivalent to the problem $\min_{\beta} \|\mathbf{y}-\Sigma\beta \|_{\Sigma_{\epsilon}^{-1}}^{2}+ \beta^{T}\Sigma\beta.$ A quick differentiation gives $\nabla_{\!\beta}\left( \| \mathbf{y}-\Sigma \beta \|^{2}_{\Sigma^{-1}_{\epsilon}}+\beta^{T}\Sigma \beta \right)=-2\Sigma \Sigma_{\epsilon}^{-1}\mathbf{y}+2\Sigma \Sigma_{\epsilon}^{-1}\Sigma \beta+2\Sigma \beta,$ so setting it to zero gives the closed-form solution $\hat{\beta}=(\Sigma_{\epsilon}^{-1}\Sigma+I)^{-1}\Sigma^{-1}_{\epsilon}\mathbf{y}=(\Sigma+\Sigma_{\epsilon})^{-1}\mathbf{y},$ which agrees with the previous section; with homoscedastic noise $\Sigma_{\epsilon}=\sigma^{2}I$, it becomes $\hat{\beta}=(\Sigma+\sigma^{2}I)^{-1}\mathbf{y}.$

Moreover, since the posterior mean is also the MAP of $\mathbf{f}$ (a Gaussian's mean and mode coincide), we have $\mathbb{E}[\mathbf{f} ~|~ \mathbf{y}]= \Sigma \hat{\beta}.$ One implication is that *the posterior mean is a linear combination of the covariance function $k$ evaluated at the training inputs $\mathbf{x}$*.
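
As a quick reference, here is a minimal NumPy sketch of the in-sample formulas above; the function name, the toy covariance matrix, and the example data are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def insample_posterior(Sigma, Sigma_eps, y):
    """Posterior mean and covariance of f given y = f + eps."""
    A = Sigma_eps + Sigma                                # Cov(y)
    beta_hat = np.linalg.solve(A, y)                     # (Sigma + Sigma_eps)^{-1} y
    mean = Sigma @ beta_hat                              # E[f | y] = Sigma beta_hat
    cov = Sigma - Sigma @ np.linalg.solve(A, Sigma)      # Cov(f | y)
    return mean, cov

# Toy example with homoscedastic noise Sigma_eps = sigma^2 I.
rng = np.random.default_rng(0)
n, sigma2 = 5, 0.1
x = rng.normal(size=(n, 1))
Sigma = np.exp(-0.5 * (x - x.T) ** 2)                    # some positive-definite covariance
y = rng.normal(size=n)
mean, cov = insample_posterior(Sigma, sigma2 * np.eye(n), y)
```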
### Predictive GP Posterior

Now suppose we collected new data points $x^{\ast}=(x^{\ast}_{1},\dots,x^{\ast}_{k})$, and wish to make predictions $\mathbf{f}^{\ast}:=(f(x_{1}^{\ast}),\dots,f(x_{k}^{\ast}))$. To keep things simple, first suppose we can directly observe $\mathbf{f}$, and use that to infer the distribution of $f$ at the arbitrary new inputs $x^{\ast}$. We have $\begin{bmatrix} \mathbf{f} \\ \mathbf{f}^{\ast}\end{bmatrix} \sim N\left(0, \begin{bmatrix}\Sigma & C\\ C^{T} & \Sigma^{\ast}\end{bmatrix}\right),$ where the entries are $\Sigma:=\mathrm{Cov}(\mathbf{f})$, $\Sigma ^{\ast}:= \mathrm{Cov}(\mathbf{f}^{\ast})$, and $C:=C(\mathbf{f},\mathbf{f}^{\ast})=(k(x_{i}, x^{\ast}_{j}))_{i,j}$, the $n \times k$ **cross covariance matrix** of $\mathbf{f}$ and $\mathbf{f}^{\ast}$; it will be abbreviated as $C$ for now. Then standard Gaussian results give $\mathbf{f}^{\ast} ~|~ \mathbf{f} \sim N\left( C^{T}\Sigma^{-1}\mathbf{f}, \Sigma^{\ast}-C^T \Sigma^{-1}C \right).$

Now we add back the noise: we observe $\mathbf{y}$, where $\mathbf{y}~|~ \mathbf{f} \sim N(\mathbf{f}, \Sigma_{\epsilon})$, instead of $\mathbf{f}$ itself. Since the noise is Gaussian and independent, $(\mathbf{y},\mathbf{f}^{\ast})$ is a linear transformation of the jointly Gaussian $(\mathbf{f},\mathbf{f}^{\ast},\pmb{\epsilon})$, so it is also jointly Gaussian without any explicit marginalisation. Their cross-covariance is simply $\mathrm{Cov}(\mathbf{y},\mathbf{f}^{\ast})=\mathrm{Cov}(\mathbf{f}+\pmb{\epsilon},\mathbf{f}^{\ast})=\mathrm{Cov}(\mathbf{f},\mathbf{f}^{\ast})=C$ as the noise is independent of all else. Therefore $\begin{bmatrix} \mathbf{y} \\ \mathbf{f}^{\ast} \end{bmatrix}\sim N\left(0, \begin{bmatrix}\Sigma_{\epsilon}+\Sigma & C\\ C^{T} & \Sigma^{\ast}\end{bmatrix}\right).$ The predictive distribution is $\mathbf{f}^{\ast} ~|~ \mathbf{y} \sim N(C^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y}, \Sigma^{\ast}- C^T (\Sigma_{\epsilon}+\Sigma)^{-1}C).$

### GP Posterior is GP

For a single test point $x'$, we have $f(x') ~|~ \mathbf{y} \sim N\left( \sum_{i=1}^{n}k(x',x_{i})\,((\Sigma_{\epsilon}+\Sigma)^{-1}\mathbf{y})_{i} ,\dots\right),$ so *the (predictive) posterior mean is a linear combination of the covariance functions $k(x',x_{i})$.* For a pair of test points $x, x'$, the predictions' covariance is $\begin{pmatrix} k(x,x) & k(x,x') \\ k(x,x') & k(x',x') \end{pmatrix} - C(x,x')^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}C(x,x'),$ with $C(x,x'):= \begin{pmatrix}k(x,x_{1}) & k(x',x_{1}) \\ \vdots & \vdots \\ k(x,x_{n}) & k(x',x_{n})\end{pmatrix}$. Therefore, the predictions also form a GP, with covariance function $k_{\text{posterior}}(x,x')= k(x,x')- k(x,\mathbf{x})^{T}(\Sigma_{\epsilon}+\Sigma)^{-1}k(x', \mathbf{x}),$ where $k(x,\mathbf{x}):=(k(x,x_{1}),\dots,k(x,x_{n}))^{T}$.

## Covariance Models

It is common to assume the jointly Gaussian distribution has mean $0$ (if we believe otherwise, we can WLOG subtract the non-zero mean from $f$), so the main task is to select the **covariance function** $k(x,x'):= \mathrm{Cov}(f(x),f(x'))$. But what makes a good covariance function, and how does it determine the shape of the GP estimates?

### GP Inference in RKHS

Notice how similar the posterior log-likelihood is to the kernel ridge regression objective in an [[Reproducing Kernel Hilbert Space|RKHS]]:

> [!embed]
> ![[Reproducing Kernel Hilbert Space#^3f51e8]]

The two problems are equivalent if (1) we have homoscedastic noise $\Sigma_{\epsilon}=\sigma^{2}I$, with the regularisation strength $\lambda= \sigma^{2}$ equal to the noise variance, and (2) $K=\Sigma$, i.e. the RKHS's kernel $\kappa$ is the GP's covariance function $k$.
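
Before developing the RKHS view further, here is a minimal NumPy sketch of the predictive equations above. The kernel is left as a parameter (the RBF used in the toy example is purely illustrative, as are the function names), and with $\Sigma_{\epsilon}=\sigma^{2}I$ the predictive mean is exactly the kernel ridge fit with $\lambda=\sigma^{2}$.

```python
import numpy as np

def rbf(A, B, lam=1.0):
    """k(a, b) = exp(-lam * ||a - b||^2); a toy choice, not prescribed by the text."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def gp_predict(X, y, X_star, k, sigma2):
    """Predictive mean and covariance of f* | y under homoscedastic noise sigma2."""
    K = k(X, X)                         # Sigma = Cov(f)
    C = k(X, X_star)                    # cross covariance of f and f*
    K_star = k(X_star, X_star)          # Sigma* = Cov(f*)
    A = K + sigma2 * np.eye(len(X))     # Cov(y)
    alpha = np.linalg.solve(A, y)       # (Sigma_eps + Sigma)^{-1} y
    mean = C.T @ alpha                  # also the kernel ridge fit with lambda = sigma2
    cov = K_star - C.T @ np.linalg.solve(A, C)
    return mean, cov

# Toy usage.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-3, 3, 50)[:, None]
mean, cov = gp_predict(X, y, X_star, rbf, sigma2=0.01)
```

In practice one would factor $\Sigma_{\epsilon}+\Sigma$ once (e.g. by Cholesky) and reuse it for both solves; the direct solves above are kept for readability.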
In terms of the original RKHS penalised ERM, we get $\min_{f}~ \| \mathbf{y}-f(\mathbf{X}) \|^{2}+{\sigma^{2}}\| f \|_{\mathcal{H}(k)}^{2},$ i.e. a generic functional optimisation problem, regularised by the function's "complexity" in the RKHS, i.e. how plausible its values $\mathbf{f}$ are under the GP prior.

- A high $\sigma^{2}$ means that we can explain most of the variance in $\mathbf{y}$ with noise, hence we heavily penalise $f$'s complexity, pulling the estimate towards a simple, heavily smoothed fit.
- Conversely, a low $\sigma^{2}$ means that we really need $f(\mathbf{X})$ to fit $\mathbf{y}$ very well.

Therefore,

> [!idea]
> The GP estimate is a linear combination of the covariance function/kernels $\kappa(\cdot, x_{i}), ~i=1,\dots, n$.

So if we choose a linear kernel, we get linear estimates (and recover the Bayesian LR at the start of the note, since the Gram matrix reduces to $\mathbf{X}\Sigma \mathbf{X}^{T}$), etc.

### Choosing Covariance Functions

The **Gaussian/RBF** kernel $\exp(-\lambda \| x_{1}-x_{2} \|^{2})$ is a basic choice for $k(x_{1},x_{2})$, but it is not scale-invariant. Therefore, it is common to use the **automatic relevance determination (ARD)** kernel, which has hyperparameters $\tau$ (overall scale) and $\eta\in \mathbb{R}^{p}$ (a length scale for each coordinate). It is essentially a Gaussian density with covariance matrix $\mathrm{diag}(\eta)^2$:
$k_{\tau,\eta}(x,x'):= \tau \exp\left( -\frac{1}{2}\| x-x' \|_{\mathrm{diag}(\eta)^{-2}}^2 \right)= \tau \exp\left( -\frac{1}{2}\sum_{i} \frac{(x_{i}-x'_{i})^{2}}{\eta_{i}^{2}} \right).$
We can perform hyperparameter selection on $\tau$ and $\eta$, hence "automatically" (not really) determining the relevance of each coordinate -- a large $\eta_{i}$ means that the $i$th coordinate is not very important.

One downside of RBF/ARD is that they are giga smooth -- they (and hence their estimates, whose means are linear combinations of the kernels) are infinitely differentiable. This may cause too much bias when fitting rougher patterns. Therefore, another common choice is the **Matérn kernel**, whose smoothness is controlled by an extra parameter.

## GP with Non-Gaussian Likelihoods

Things quickly break down when the likelihood is not Gaussian:

- When the noise $\epsilon$ is not Gaussian, $\mathbf{y} ~|~ \mathbf{f}$ is not Gaussian; instead, $p(\mathbf{y} ~|~ \mathbf{f})=\prod_{i}f_{\epsilon}(y_{i}-f_{i})$ if we assume $\epsilon_{i} \overset{\mathrm{iid.}}{\sim} f_{\epsilon}$.
- If the additive-noise model $y_{i}=f_{i}+\epsilon_{i}$ does not hold at all, for example in a classification setting with a link function $\psi$ satisfying $\psi(f(x))=\mathbb{P}[Y=1 ~|~ X=x]$, then $p(\mathbf{y} ~|~ \mathbf{f})=\prod_{i} \psi(f_{i})^{y_{i}} \cdot (1-\psi(f_{i}))^{1-y_{i}}.$

In either case, we have to compute the Bayes posterior $p(\mathbf{f} ~|~ \mathbf{y}) = \frac{p(\mathbf{y} ~|~ \mathbf{f})\cdot \pi(\mathbf{f})}{p(\mathbf{y})}.$ For most non-Gaussian likelihoods, $p(\mathbf{y})$ (and by extension the posterior $p(\mathbf{f} ~|~ \mathbf{y})$) has no closed form. There are a few ways forward:

- Approximate the normalising constant $p(\mathbf{y})$;
- Resort to [[Simulation|various sampling methods]] on $p(\mathbf{f} ~|~ \mathbf{y})$, many of which do not require knowing the normalising constant;
- Approximate $p(\mathbf{y} ~|~ \mathbf{f})$, e.g. with the [[Laplace Approximation]], to make it Gaussian and hence conjugate with the prior.
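
Whichever route we take, the common target is the unnormalised posterior $\pi(\mathbf{f})\,p(\mathbf{y} ~|~ \mathbf{f})$. For the Bernoulli likelihood above with the logistic link $\psi(f)=e^{f}/(1+e^{f})$, its log is, up to a constant, $-\frac{1}{2}\mathbf{f}^{T}\Sigma^{-1}\mathbf{f}+\sum_{i}\left( y_{i}f_{i}-\log(1+e^{f_{i}}) \right)$. A minimal sketch (the function name is illustrative, not from any library):

```python
import numpy as np

def log_unnormalised_posterior(f, y, Sigma):
    """log pi(f) + log p(y | f) up to an additive constant, for labels y_i in {0, 1},
    assuming the logistic link psi(f) = e^f / (1 + e^f)."""
    log_prior = -0.5 * f @ np.linalg.solve(Sigma, f)   # Gaussian GP prior term
    log_lik = np.sum(y * f - np.logaddexp(0.0, f))     # sum_i [y_i f_i - log(1 + e^{f_i})]
    return log_prior + log_lik
```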
To conduct predictions, we have $p(\mathbf{f}^{\ast} ~|~ \mathbf{y})=\int \underset{\text{noise-free GP}}{p(\mathbf{f}^{\ast} ~|~ \mathbf{f})}\cdot \underset{\text{posterior}}{p(\mathbf{f}~|~\mathbf{y})} ~ d\mathbf{f},$ and any of the three methods above allows us to evaluate the integral. Another alternative is to replace all intractable terms with point estimates, here $p(\mathbf{f}^{\ast} ~|~ \mathbf{y}) \approx p(\mathbf{f}^{\ast} ~|~ \hat{\mathbf{f}}_{\text{MAP}}),$ where we replace $p(\mathbf{f} ~|~ \mathbf{y})$ with a unit mass placed on $\hat{\mathbf{f}}_{\text{MAP}}:= \underset{\mathbf{f}}{\arg\max}~p(\mathbf{f} ~|~ \mathbf{y})=\underset{\mathbf{f}}{\arg\max}~\pi(\mathbf{f})\cdot p(\mathbf{y} ~|~ \mathbf{f}),$ the maximum a posteriori (MAP) estimate of $\mathbf{f}$. It is easy to solve for with numerical methods, since the normalising constant is not needed.

### GP Regression with Activation Functions

Similar to [[Generalized Linear Models|GLMs]] like logistic regression, we can plug the GP into link functions, e.g. the logistic (inverse-logit) map $\psi:f \mapsto \frac{e^{f}}{1+e^{f}}$, to squish it into $[0,1]$ and output it as a probability, say $\mathbb{P}[Y=1 ~|~ X=x]$. For simplicity we consider prediction at a single point $x^{\ast}$, but the computations extend easily to multiple test points. We have the prediction
$\begin{align*} \mathbb{P}[Y =1 ~|~ x^{\ast}]&= \mathbb{E}_{f^{\ast}}[\psi(f^{\ast})]\\ &= \int \psi(f^{\ast})\cdot p(f^{\ast} ~|~ \mathbf{y})~ df^{\ast} \\ &= \iint \psi(f^{\ast})\cdot p(f^{\ast} ~|~ \mathbf{f})\cdot p(\mathbf{f} ~|~ \mathbf{y}) ~df^{\ast}d\mathbf{f}. \end{align*}$
This is even worse than the non-Gaussian noise case, as the intractable posterior $p(\mathbf{f} ~|~ \mathbf{y})$ now sits inside a double integral that must also be pushed through the nonlinearity $\psi$.
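
As a rough illustration of the point-estimate route, here is a sketch (illustrative names, logistic link assumed, no claim of numerical robustness) that finds $\hat{\mathbf{f}}_{\text{MAP}}$ by Newton steps on the log-posterior above, then approximates $\mathbb{P}[Y=1 ~|~ x^{\ast}]$ by a Monte Carlo average of $\psi(f^{\ast})$ over the Gaussian $p(\mathbf{f}^{\ast} ~|~ \hat{\mathbf{f}}_{\text{MAP}})$:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def map_estimate(K, y, n_iter=50):
    """Newton iterations for f_MAP = argmax_f [log pi(f) + log p(y | f)], logistic link."""
    K_inv = np.linalg.inv(K)                    # fine for a sketch; add jitter to K if ill-conditioned
    f = np.zeros(len(y))
    for _ in range(n_iter):
        p = sigmoid(f)
        grad = (y - p) - K_inv @ f              # gradient of the log-posterior
        hess = K_inv + np.diag(p * (1 - p))     # negative Hessian (positive definite)
        f = f + np.linalg.solve(hess, grad)
    return f

def predictive_prob(K, C, K_star, f_map, n_samples=2000, seed=0):
    """P[Y = 1 | x*] ~= E[psi(f*)], with p(f | y) replaced by a point mass at f_MAP and
    the remaining Gaussian integral over f* | f_MAP done by Monte Carlo."""
    mean = C.T @ np.linalg.solve(K, f_map)            # E[f* | f], noise-free GP formula
    cov = K_star - C.T @ np.linalg.solve(K, C)        # Cov[f* | f]
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    return sigmoid(samples).mean(axis=0)
```

The [[Laplace Approximation]] goes one step further: instead of a point mass, it replaces $p(\mathbf{f} ~|~ \mathbf{y})$ with a Gaussian centred at $\hat{\mathbf{f}}_{\text{MAP}}$ whose covariance is the inverse of the negative Hessian computed above.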