Lecture 12.5

GPs: Regression

GP regression uses the Gaussian conditioning property to produce a posterior distribution over functions given training data. The result is a closed-form predictive mean and variance: uncertainty is small near observed data and grows in regions far from it.

Learning Objectives

Set up the GP regression model with a Gaussian noise observation model.
Write the joint distribution over training targets and test function values as a block-structured Gaussian.
Derive the closed-form posterior predictive mean $\mu(\mathbf{x}^*)$ and variance $\sigma^2(\mathbf{x}^*)$ by Gaussian conditioning.
Interpret the predictive mean as a kernel-weighted average of training targets.
Explain how uncertainty grows away from training data and how hyperparameters are tuned by marginal likelihood maximization.

1. Model

Assume observations arise from an unknown function $f$ perturbed by i.i.d. Gaussian noise:

$$t_n = f(\mathbf{x}_n) + \varepsilon_n, \quad \varepsilon_n \sim \mathcal{N}(0, \beta^{-1}).$$

Place a zero-mean GP prior on $f$: $f \sim \mathcal{GP}(0, k(\cdot,\cdot))$. The noisy observations then have covariance:

$$\mathrm{Cov}[t_n, t_m] = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1}\delta_{nm},$$

so $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$ where $\mathbf{C} = \mathbf{K} + \beta^{-1}\mathbf{I}$ and $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$.
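The construction of $\mathbf{C}$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the notes leave the kernel $k$ generic, so the squared-exponential kernel below, along with the names `rbf_kernel` and `train_covariance` and the toy inputs, are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice; the notes keep k generic)."""
    # Pairwise squared Euclidean distances between rows of X1 and rows of X2.
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def train_covariance(X, beta, **kernel_args):
    """C = K + beta^{-1} I for training inputs X of shape (N, D)."""
    K = rbf_kernel(X, X, **kernel_args)
    return K + np.eye(X.shape[0]) / beta

# Toy inputs (hypothetical): three scalar training points, noise precision 25.
X = np.array([[0.0], [0.5], [2.0]])
C = train_covariance(X, beta=25.0)
print(C)
```

The noise term only touches the diagonal, which is also what keeps $\mathbf{C}$ well conditioned when two training inputs are nearly identical.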

2. Joint Distribution Over Training and Test Points

Let $\mathbf{x}^*$ be a new test point and $f^* = f(\mathbf{x}^*)$. The joint distribution of training targets and the test function value is a block-structured Gaussian:

$$\begin{pmatrix}\mathbf{t} \\ f^*\end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix}\mathbf{C} & \mathbf{k}_* \\ \mathbf{k}_*^\top & c_{**}\end{pmatrix}\right),$$

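where $(\mathbf{k}_*)_n = k(\mathbf{x}_n, \mathbf{x}^*)$ and $c_{**} = k(\mathbf{x}^*, \mathbf{x}^*) + \beta^{-1}$. The $\beta^{-1}$ term means the resulting predictive variance is that of a noisy observation at $\mathbf{x}^*$; for the noise-free function value alone, use $k(\mathbf{x}^*, \mathbf{x}^*)$ in its place.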

3. Posterior Predictive Distribution

Applying the Gaussian conditioning formula (Lecture 12.1) to condition on the observed $\mathbf{t}$:

GP Regression Predictive Distribution

$$p(f^* \mid \mathbf{x}^*, \mathbf{t}) = \mathcal{N}(f^* \mid \mu(\mathbf{x}^*), \sigma^2(\mathbf{x}^*)),$$

where

$$\mu(\mathbf{x}^*) = \mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{t},$$

$$\sigma^2(\mathbf{x}^*) = c_{**} - \mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{k}_*.$$
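The two expressions above are an instance of the standard conditioning identity for a partitioned Gaussian. As a reminder, written here in generic block notation that may differ from the notation of Lecture 12.1: if

$$\begin{pmatrix}\mathbf{a} \\ \mathbf{b}\end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{B}^\top & \mathbf{D}\end{pmatrix}\right),$$

then

$$\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\!\left(\mathbf{B}^\top \mathbf{A}^{-1}\mathbf{a},\; \mathbf{D} - \mathbf{B}^\top \mathbf{A}^{-1}\mathbf{B}\right).$$

Substituting $\mathbf{a} = \mathbf{t}$, $\mathbf{A} = \mathbf{C}$, $\mathbf{B} = \mathbf{k}_*$, and $\mathbf{D} = c_{**}$ recovers $\mu(\mathbf{x}^*)$ and $\sigma^2(\mathbf{x}^*)$.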

Both the mean and variance are available in closed form. Training requires factorizing (or inverting) $\mathbf{C}$, an $O(N^3)$ operation done once. With $\mathbf{C}^{-1}\mathbf{t}$ cached, each test point then costs $O(N)$ for the mean and $O(N^2)$ for the variance.
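The training and prediction steps might look as follows in NumPy. This is a sketch under stated assumptions: the unit-variance squared-exponential kernel, the function names `gp_fit` and `gp_predict`, and the toy data are all illustrative choices, not part of the notes. A Cholesky factor is used in place of an explicit $\mathbf{C}^{-1}$, which is a common choice for numerical stability.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Unit-variance squared-exponential kernel (assumed for illustration)."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_fit(X, t, beta, lengthscale=1.0):
    """One-off O(N^3) training step: factorize C and cache a = C^{-1} t."""
    C = rbf_kernel(X, X, lengthscale) + np.eye(len(X)) / beta
    L = np.linalg.cholesky(C)  # lower-triangular factor, C = L @ L.T
    # np.linalg.solve is a general solver, used here for brevity; a
    # triangular solver would exploit the structure of L.
    a = np.linalg.solve(L.T, np.linalg.solve(L, t))
    return L, a

def gp_predict(X, L, a, X_star, beta, lengthscale=1.0):
    """Predictive mean and variance for each row of X_star."""
    K_star = rbf_kernel(X, X_star, lengthscale)  # column m is k_* for test point m
    mean = K_star.T @ a                          # k_*^T C^{-1} t
    V = np.linalg.solve(L, K_star)               # V = L^{-1} k_*, so V^T V = k_*^T C^{-1} k_*
    c_ss = 1.0 + 1.0 / beta                      # k(x*, x*) = 1 for this kernel
    var = c_ss - np.sum(V**2, axis=0)
    return mean, var

# Toy data (hypothetical): three noisy-free samples of sin(x), treated as noisy targets.
X = np.array([[0.0], [1.0], [2.0]])
t = np.sin(X[:, 0])
L, a = gp_fit(X, t, beta=100.0)
mean, var = gp_predict(X, L, a, np.array([[0.5], [5.0]]), beta=100.0)
print(mean, var)
```

The second test point lies well outside the training inputs, so its variance should come out noticeably larger than that of the first.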

4. Interpretation

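Predictive mean: $\mu(\mathbf{x}^*) = \mathbf{k}_*^\top \mathbf{C}^{-1}\mathbf{t}$ is a linear combination of the training targets, $\mu(\mathbf{x}^*) = \sum_n w_n t_n$ with weights $\mathbf{w} = \mathbf{C}^{-1}\mathbf{k}_*$. The weights are driven by the kernel similarity between $\mathbf{x}^*$ and each $\mathbf{x}_n$, adjusted by $\mathbf{C}^{-1}$ for correlations among the training inputs and for noise. Equivalently, $\mu(\mathbf{x}^*) = \sum_n a_n k(\mathbf{x}_n, \mathbf{x}^*)$ with $\mathbf{a} = \mathbf{C}^{-1}\mathbf{t}$. Points close to $\mathbf{x}^*$ in kernel space pull the prediction toward their target values.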

Predictive variance: $c_{**}$ is the prior variance at $\mathbf{x}^*$, and the subtracted term $\mathbf{k}_*^\top \mathbf{C}^{-1}\mathbf{k}_*$ measures how much the training data have reduced our uncertainty. When $\mathbf{x}^*$ is near many training points, the entries of $\mathbf{k}_*$ are large, the subtracted term is large, and the posterior variance is small.

Uncertainty Grows Away from Data

At a training point $\mathbf{x}^* = \mathbf{x}_n$, $\mathbf{k}_*$ coincides with the $n$-th column of $\mathbf{K}$, the subtracted term is large, and the posterior variance falls toward the noise level $\beta^{-1}$ as observations accumulate nearby. Far from all training points, $\mathbf{k}_* \approx \mathbf{0}$ for any kernel that decays with distance (the squared exponential, for example), and the posterior variance approaches the prior variance $c_{**} = k(\mathbf{x}^*, \mathbf{x}^*) + \beta^{-1}$. This input-dependent uncertainty supports active learning: gather new observations in high-uncertainty regions.
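The two limits can be checked numerically. The sketch below assumes a unit-variance squared-exponential kernel and a hypothetical dense cluster of 200 training inputs on $[0, 1]$; the helper name `predictive_variance` is likewise an illustrative choice. The variance does not depend on the targets $\mathbf{t}$, so none are needed.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Unit-variance squared-exponential kernel (assumed for illustration)."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def predictive_variance(X, X_star, beta, lengthscale=1.0):
    """sigma^2(x*) = c_** - k_*^T C^{-1} k_* for each row of X_star."""
    C = rbf_kernel(X, X, lengthscale) + np.eye(len(X)) / beta
    K_star = rbf_kernel(X, X_star, lengthscale)
    c_ss = 1.0 + 1.0 / beta  # k(x*, x*) = 1 for this kernel
    return c_ss - np.sum(K_star * np.linalg.solve(C, K_star), axis=0)

beta = 10.0
X = np.linspace(0.0, 1.0, 200)[:, None]  # dense cluster of training inputs
X_star = np.array([[0.5], [10.0]])       # inside the cluster, then far outside it
var_near, var_far = predictive_variance(X, X_star, beta)

print("noise level 1/beta      :", 1.0 / beta)
print("variance inside cluster :", var_near)
print("prior variance c_**     :", 1.0 + 1.0 / beta)
print("variance far from data  :", var_far)
```

Inside the cluster the variance should sit just above $\beta^{-1}$, while far from the data it should be essentially equal to $c_{**}$.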

5. Hyperparameter Learning

The kernel parameters $\boldsymbol{\theta}$ (and noise $\beta$) enter through $\mathbf{C}(\boldsymbol{\theta})$. They are tuned by maximizing the log marginal likelihood:

$$\ln p(\mathbf{t} \mid \boldsymbol{\theta}) = -\frac{1}{2}\ln|\mathbf{C}| - \frac{1}{2}\mathbf{t}^\top\mathbf{C}^{-1}\mathbf{t} - \frac{N}{2}\ln 2\pi.$$

This trades off data fit (the quadratic term) against model complexity (the log-determinant). Because the objective is maximized, gradient-based optimization (gradient ascent or a quasi-Newton method) in $\boldsymbol{\theta}$-space is the practical choice: both $\mathbf{C}^{-1}$ and the derivatives $\partial\mathbf{C}/\partial\theta_i$ are computable. This is the GP analogue of empirical Bayes (Lecture 5.3).
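A minimal sketch of the objective and one of its gradients, assuming a unit-variance squared-exponential kernel whose only hyperparameter is a length scale $\ell$ (so $\boldsymbol{\theta} = \ell$ here). The gradient uses the standard identity $\partial \ln p / \partial\theta_i = -\tfrac{1}{2}\mathrm{Tr}(\mathbf{C}^{-1}\,\partial\mathbf{C}/\partial\theta_i) + \tfrac{1}{2}\mathbf{t}^\top\mathbf{C}^{-1}(\partial\mathbf{C}/\partial\theta_i)\mathbf{C}^{-1}\mathbf{t}$. The function name `lml_and_grad` and the toy data are hypothetical.

```python
import numpy as np

def lml_and_grad(X, t, beta, lengthscale):
    """Log marginal likelihood ln p(t | lengthscale) and its derivative
    with respect to the length scale, for a unit-variance RBF kernel."""
    N = len(t)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2 / lengthscale**2)
    C = K + np.eye(N) / beta

    L = np.linalg.cholesky(C)                        # C = L @ L.T
    a = np.linalg.solve(L.T, np.linalg.solve(L, t))  # a = C^{-1} t
    # ln|C| = 2 * sum(log(diag(L))), so -0.5 * ln|C| = -sum(log(diag(L))).
    lml = -np.sum(np.log(np.diag(L))) - 0.5 * t @ a - 0.5 * N * np.log(2.0 * np.pi)

    # dC/d(lengthscale) for the RBF kernel; the noise term does not depend on it.
    dC = K * d2 / lengthscale**3
    grad = -0.5 * np.trace(np.linalg.solve(C, dC)) + 0.5 * a @ dC @ a
    return lml, grad

# Toy data (hypothetical).
X = np.linspace(0.0, 3.0, 8)[:, None]
t = np.sin(2.0 * X[:, 0])
lml, grad = lml_and_grad(X, t, beta=10.0, lengthscale=1.0)
print(lml, grad)
```

In practice the value and gradient would be handed to an off-the-shelf optimizer, often with $\ln \ell$ as the optimization variable so that $\ell$ stays positive.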

GP Regression vs. Parametric Bayesian Regression

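Parametric Bayesian regression (Lectures 4.3–4.5) places a prior on the weights $\mathbf{w}$ and marginalizes over them. GP regression places a prior directly on functions and marginalizes over functions drawn from that prior. The parametric model is a special case: with the prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ and the kernel $k(\mathbf{x},\mathbf{x}') = \frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$, the two approaches yield identical predictive distributions. GPs are more general because kernels such as the squared exponential correspond to infinitely many basis functions, which no finite-dimensional $\boldsymbol{\phi}$ reproduces.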
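The equivalence can be verified numerically. The sketch below assumes hypothetical polynomial features $\boldsymbol{\phi}(x) = (1, x, x^2)$ and made-up data; the parametric route uses the standard Bayesian linear regression posterior with $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}$, and the GP route uses the formulas of Section 3 with $c_{**}$ including the noise term.

```python
import numpy as np

def phi(x):
    """Hypothetical polynomial features [1, x, x^2] for scalar inputs."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

alpha, beta = 2.0, 25.0                # prior precision on w, noise precision
x = np.array([-1.0, -0.3, 0.4, 1.2])   # made-up training inputs
t = np.array([0.9, 0.1, 0.3, 1.6])     # made-up training targets
x_star = np.array([0.0, 2.0])          # test inputs

Phi, Phi_star = phi(x), phi(x_star)    # shapes (N, M) and (S, M)

# Parametric route: Gaussian posterior over w, then the predictive distribution.
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
mean_param = Phi_star @ m_N
var_param = 1.0 / beta + np.sum((Phi_star @ S_N) * Phi_star, axis=1)

# GP route: kernel k(x, x') = phi(x)^T phi(x') / alpha.
C = Phi @ Phi.T / alpha + np.eye(len(x)) / beta
K_star = Phi @ Phi_star.T / alpha      # column s is k_* for test point s
c_ss = np.sum(Phi_star**2, axis=1) / alpha + 1.0 / beta
mean_gp = K_star.T @ np.linalg.solve(C, t)
var_gp = c_ss - np.sum(K_star * np.linalg.solve(C, K_star), axis=0)

print(np.allclose(mean_param, mean_gp), np.allclose(var_param, var_gp))
```

The parametric route inverts an $M \times M$ matrix and the GP route an $N \times N$ one, so which is cheaper depends on whether there are fewer basis functions or fewer data points.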