Lecture 12.5

GPs: Regression

GP regression uses the Gaussian conditioning property to produce a posterior distribution over functions given training data. The result is a closed-form predictive mean and variance: uncertainty is small near observed data and grows in regions far from it.

Learning Objectives
  • Set up the GP regression model with a Gaussian noise observation model.
  • Write the joint distribution over training targets and test function values as a block-structured Gaussian.
  • Derive the closed-form posterior predictive mean $\mu(\mathbf{x}^*)$ and variance $\sigma^2(\mathbf{x}^*)$ by Gaussian conditioning.
  • Interpret the predictive mean as a kernel-weighted linear combination of training targets.
  • Explain how uncertainty grows away from training data and how hyperparameters are tuned by marginal likelihood maximization.

1. Model

Assume observations arise from an unknown function $f$ perturbed by i.i.d. Gaussian noise:

$$t_n = f(\mathbf{x}_n) + \varepsilon_n, \quad \varepsilon_n \sim \mathcal{N}(0, \beta^{-1}).$$

Place a zero-mean GP prior on $f$: $f \sim \mathcal{GP}(0, k(\cdot,\cdot))$. The noisy observations then have covariance:

$$\mathrm{Cov}[t_n, t_m] = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1}\delta_{nm},$$

so $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$ where $\mathbf{C} = \mathbf{K} + \beta^{-1}\mathbf{I}$ and $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$.
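As a minimal sketch of the model above, the covariance $\mathbf{C}$ can be assembled numerically. The squared-exponential kernel, the helper name `rbf_kernel`, the input locations, and the value of $\beta$ are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between rows of A (n, d) and rows of B (m, d).

    The kernel choice is an assumption made for illustration.
    """
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

beta = 25.0                              # noise precision; noise variance is 1 / beta
X = np.linspace(-3.0, 3.0, 8)[:, None]   # N = 8 one-dimensional training inputs
K = rbf_kernel(X, X)                     # K_nm = k(x_n, x_m)
C = K + np.eye(len(X)) / beta            # C = K + beta^{-1} I

print("diagonal of C:", np.diag(C))
print("smallest eigenvalue of C:", np.linalg.eigvalsh(C).min())
```

The noise term only touches the diagonal, which is what keeps $\mathbf{C}$ well conditioned even when $\mathbf{K}$ itself is close to singular.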

2. Joint Distribution Over Training and Test Points

Let $\mathbf{x}^*$ be a new test point and $f^* = f(\mathbf{x}^*)$. The joint distribution of training targets and the test function value is a block-structured Gaussian:

$$\begin{pmatrix}\mathbf{t} \\ f^*\end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix}\mathbf{C} & \mathbf{k}_* \\ \mathbf{k}_*^\top & c_{**}\end{pmatrix}\right),$$

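where $(\mathbf{k}_*)_n = k(\mathbf{x}_n, \mathbf{x}^*)$ and $c_{**} = k(\mathbf{x}^*, \mathbf{x}^*) + \beta^{-1}$. Because $c_{**}$ includes the noise term $\beta^{-1}$, the predictive variance derived from it is that of a noisy observation at $\mathbf{x}^*$; the variance of the noise-free value $f(\mathbf{x}^*)$ is obtained by dropping $\beta^{-1}$ from $c_{**}$. The predictive mean is the same in both cases.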
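The block structure can be made concrete with a small numerical sketch. It follows the text's convention of including $\beta^{-1}$ in $c_{**}$; the kernel, the helper `rbf_kernel`, and all numbers are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel (an illustrative choice) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

beta = 25.0
X = np.array([[-1.0], [0.0], [1.5]])      # N = 3 training inputs
x_star = np.array([[0.5]])                # a single test input

C = rbf_kernel(X, X) + np.eye(len(X)) / beta       # (N, N)
k_star = rbf_kernel(X, x_star)                     # (N, 1)
c_ss = rbf_kernel(x_star, x_star) + 1.0 / beta     # (1, 1)

# Assemble the (N+1) x (N+1) covariance [[C, k_*], [k_*^T, c_**]].
joint_cov = np.block([[C, k_star], [k_star.T, c_ss]])

# Draw a few joint prior samples; the last column is the value at x_star.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(X) + 1), joint_cov, size=5)
print(joint_cov.round(3))
```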

3. Posterior Predictive Distribution

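For a zero-mean joint Gaussian over blocks $(\mathbf{a}, \mathbf{b})$, the conditional $p(\mathbf{b} \mid \mathbf{a})$ is Gaussian with mean $\boldsymbol{\Sigma}_{ba}\boldsymbol{\Sigma}_{aa}^{-1}\mathbf{a}$ and covariance $\boldsymbol{\Sigma}_{bb} - \boldsymbol{\Sigma}_{ba}\boldsymbol{\Sigma}_{aa}^{-1}\boldsymbol{\Sigma}_{ab}$ (Lecture 12.1). Applying this with $\mathbf{a} = \mathbf{t}$ and $\mathbf{b} = f^*$ gives: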

GP Regression Predictive Distribution

$$p(f^* \mid \mathbf{x}^*, \mathbf{t}) = \mathcal{N}(f^* \mid \mu(\mathbf{x}^*), \sigma^2(\mathbf{x}^*)),$$

where

$$\mu(\mathbf{x}^*) = \mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{t},$$

$$\sigma^2(\mathbf{x}^*) = c_{**} - \mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{k}_*.$$

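Both the mean and variance are available in closed form. Training requires factorizing or inverting $\mathbf{C}$, an $O(N^3)$ operation done once. With $\mathbf{C}^{-1}\mathbf{t}$ precomputed, the predictive mean costs $O(N)$ per test point; the predictive variance costs $O(N^2)$ per test point because of the quadratic form $\mathbf{k}_*^\top \mathbf{C}^{-1} \mathbf{k}_*$.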
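The predictive equations can be sketched in a few lines of NumPy. This is one possible arrangement, not a reference implementation: the function names `gp_fit` and `gp_predict`, the squared-exponential kernel, and the toy sine data are all assumptions. A Cholesky factor is used in place of an explicit inverse, which is a common numerical choice rather than something the derivation requires.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel (an illustrative choice) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_fit(X, t, beta, kernel):
    """One-off training work: factorize C = L L^T and solve for C^{-1} t."""
    C = kernel(X, X) + np.eye(len(X)) / beta
    L = np.linalg.cholesky(C)
    # np.linalg.solve is a general solver and does not exploit that L is
    # triangular; scipy.linalg.solve_triangular would, but is avoided here
    # to keep the sketch NumPy-only.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    return L, alpha

def gp_predict(X, X_star, L, alpha, beta, kernel):
    """Predictive mean and variance at each row of X_star."""
    K_star = kernel(X, X_star)                   # column j is k_* for test point j
    mean = K_star.T @ alpha                      # k_*^T C^{-1} t
    V = np.linalg.solve(L, K_star)               # V^T V = k_*^T C^{-1} k_*
    c_ss = np.diag(kernel(X_star, X_star)) + 1.0 / beta
    var = c_ss - np.sum(V**2, axis=0)            # c_** - k_*^T C^{-1} k_*
    return mean, var

# Toy data (assumed): noisy samples of a sine function.
rng = np.random.default_rng(0)
beta = 25.0
X = np.linspace(-3.0, 3.0, 8)[:, None]
t = np.sin(X[:, 0]) + rng.normal(scale=beta**-0.5, size=len(X))

X_star = np.linspace(-5.0, 5.0, 11)[:, None]
L, alpha = gp_fit(X, t, beta, rbf_kernel)
mean, var = gp_predict(X, X_star, L, alpha, beta, rbf_kernel)
for x, m, v in zip(X_star[:, 0], mean, var):
    print(f"x* = {x:+.1f}   mean = {m:+.3f}   variance = {v:.3f}")
```

Because `alpha` is computed once in `gp_fit`, each additional test point only needs its own $\mathbf{k}_*$ for the mean, which matches the cost split described above.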

4. Interpretation

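Predictive mean: $\mu(\mathbf{x}^*) = \mathbf{k}_*^\top \mathbf{C}^{-1}\mathbf{t}$ is a weighted sum of the training targets, $\mu(\mathbf{x}^*) = \sum_n w_n t_n$ with weights $\mathbf{w} = \mathbf{C}^{-1}\mathbf{k}_*$. The weights depend on the kernel similarity between $\mathbf{x}^*$ and each $\mathbf{x}_n$ but are not simply proportional to it, since $\mathbf{C}^{-1}$ also accounts for correlations among the training points and for the noise; they need not be positive or sum to one. Training points close to $\mathbf{x}^*$ in kernel space generally pull the prediction toward their target values. Equivalently, $\mu(\mathbf{x}^*) = \sum_n a_n k(\mathbf{x}_n, \mathbf{x}^*)$ with $\mathbf{a} = \mathbf{C}^{-1}\mathbf{t}$.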

Predictive variance: $c_{**}$ is the prior variance at $\mathbf{x}^*$, and the subtracted term $\mathbf{k}_*^\top \mathbf{C}^{-1}\mathbf{k}_*$ measures how much the training data have reduced our uncertainty. When $\mathbf{x}^*$ is near many training points, the entries of $\mathbf{k}_*$ are large, the subtracted term is large, and the posterior variance is small.
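To see the weighting concretely, the sketch below computes the vector $\mathbf{C}^{-1}\mathbf{k}_*$ that multiplies the targets, for one test point near the data and one far from it. The kernel, the helper `mean_weights`, and the toy inputs are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel (an illustrative choice) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

beta = 25.0
X = np.linspace(-3.0, 3.0, 8)[:, None]
t = np.sin(X[:, 0])                        # noise-free targets, for readability
C = rbf_kernel(X, X) + np.eye(len(X)) / beta

def mean_weights(x_star):
    """Return w = C^{-1} k_*, so that the predictive mean is w @ t."""
    k_star = rbf_kernel(X, np.array([[x_star]]))[:, 0]
    return np.linalg.solve(C, k_star)

w_near = mean_weights(0.4)     # test point inside the training range
w_far = mean_weights(50.0)     # test point far from every training input

mu_near = w_near @ t
# The same mean written as a sum of kernel functions with coefficients a = C^{-1} t.
a = np.linalg.solve(C, t)
mu_near_alt = rbf_kernel(X, np.array([[0.4]]))[:, 0] @ a

print("weights near the data:", w_near.round(3), " sum:", w_near.sum().round(3))
print("weights far from data:", w_far)
print("mean near the data:", mu_near, "and via kernel expansion:", mu_near_alt)
```

Far from the data every weight vanishes, so the prediction falls back to the zero prior mean.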

Uncertainty Grows Away from Data

At a training point $\mathbf{x}^* = \mathbf{x}_n$, $\mathbf{k}_*$ equals the $n$-th column of $\mathbf{K}$, and the posterior variance drops toward the noise level $\beta^{-1}$, which it cannot go below and which it approaches as observations accumulate around $\mathbf{x}_n$. Far from all training points, and for a kernel that decays with distance, $\mathbf{k}_* \approx \mathbf{0}$, and the posterior variance approaches the prior variance $c_{**} = k(\mathbf{x}^*, \mathbf{x}^*) + \beta^{-1}$. This calibrated uncertainty enables active learning: gather new observations in high-uncertainty regions.
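The behaviour of the variance, and the active-learning rule it suggests, can be sketched as follows. The kernel, the deliberately lopsided training inputs, the candidate grid, and the "pick the largest variance" rule are illustrative assumptions; practical acquisition rules are often more elaborate.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel (an illustrative choice) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

beta = 25.0
X = np.array([[-2.0], [-1.5], [-1.0], [0.5]])    # training inputs, clustered on the left
C = rbf_kernel(X, X) + np.eye(len(X)) / beta

def predictive_variance(X_star):
    """sigma^2(x*) = c_** - k_*^T C^{-1} k_* for each row of X_star."""
    K_star = rbf_kernel(X, X_star)
    c_ss = np.diag(rbf_kernel(X_star, X_star)) + 1.0 / beta
    return c_ss - np.sum(K_star * np.linalg.solve(C, K_star), axis=0)

var_train = predictive_variance(X)                      # at the training inputs
var_far = predictive_variance(np.array([[50.0]]))[0]    # far from all data

# Active-learning step: query the candidate with the largest predictive variance.
candidates = np.linspace(-3.0, 3.0, 61)[:, None]
var_candidates = predictive_variance(candidates)
x_next = candidates[np.argmax(var_candidates), 0]

print("noise level 1/beta:        ", 1.0 / beta)
print("variance at training points:", var_train.round(4))
print("variance far from data:     ", var_far)
print("next query location:        ", x_next)
```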

5. Hyperparameter Learning

The kernel parameters $\boldsymbol{\theta}$ (and the noise precision $\beta$) enter through $\mathbf{C}(\boldsymbol{\theta})$. They are tuned by maximizing the log marginal likelihood:

$$\ln p(\mathbf{t} \mid \boldsymbol{\theta}) = -\frac{1}{2}\ln|\mathbf{C}| - \frac{1}{2}\mathbf{t}^\top\mathbf{C}^{-1}\mathbf{t} - \frac{N}{2}\ln 2\pi.$$

This trades off data fit (the quadratic term) against model complexity (the log-determinant). Gradient-based maximization in $\boldsymbol{\theta}$-space is practical since both $\mathbf{C}^{-1}$ and the derivatives $\partial\mathbf{C}/\partial\theta_i$ are computable; the objective is generally non-convex, so the optimizer may find only a local maximum. This is the GP analogue of empirical Bayes (Lecture 5.3).
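The sketch below evaluates the log marginal likelihood and its derivative with respect to a single hyperparameter, using the standard identity $\partial \ln p / \partial\theta_i = -\tfrac{1}{2}\mathrm{Tr}(\mathbf{C}^{-1}\,\partial\mathbf{C}/\partial\theta_i) + \tfrac{1}{2}\mathbf{t}^\top\mathbf{C}^{-1}(\partial\mathbf{C}/\partial\theta_i)\mathbf{C}^{-1}\mathbf{t}$. The squared-exponential kernel with a length-scale parameter, the function name `lml_and_grad`, the toy data, and the coarse grid search are all assumptions for illustration; a real fit would more likely hand the gradient to an optimizer.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    d = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d, 0.0)

def lml_and_grad(X, t, lengthscale, beta):
    """Log marginal likelihood and its derivative w.r.t. the length scale."""
    N = len(X)
    D = sq_dists(X, X)
    K = np.exp(-0.5 * D / lengthscale**2)          # assumed kernel
    C = K + np.eye(N) / beta
    L = np.linalg.cholesky(C)                      # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))   # C^{-1} t
    lml = (-np.sum(np.log(np.diag(L)))             # equals -(1/2) ln|C|
           - 0.5 * t @ alpha
           - 0.5 * N * np.log(2.0 * np.pi))
    dC = K * D / lengthscale**3                    # dC / d(lengthscale)
    C_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))
    grad = -0.5 * np.trace(C_inv @ dC) + 0.5 * alpha @ dC @ alpha
    return lml, grad

# Toy data (assumed): noisy samples of a sine function.
rng = np.random.default_rng(0)
beta = 25.0
X = np.linspace(-3.0, 3.0, 20)[:, None]
t = np.sin(X[:, 0]) + rng.normal(scale=beta**-0.5, size=len(X))

# Coarse grid search over the length scale, with beta held fixed.
lengthscales = np.linspace(0.3, 3.0, 28)
lmls = np.array([lml_and_grad(X, t, ls, beta)[0] for ls in lengthscales])
best = lengthscales[np.argmax(lmls)]
print("best length scale on the grid:", best)
print("value and gradient at 1.0:", lml_and_grad(X, t, 1.0, beta))
```

The sign of the gradient indicates which direction increases the marginal likelihood, so a positive value at a given length scale suggests moving to a larger one.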

GP Regression vs. Parametric Bayesian Regression

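Parametric Bayesian regression (Lectures 4.3–4.5) places a prior on the weights $\mathbf{w}$ and marginalizes over them. GP regression places a prior directly on functions, with the kernel determining which functions are probable. The parametric model is a special case: with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ and kernel $k(\mathbf{x},\mathbf{x}') = \frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$, the two approaches yield identical predictive distributions. GPs are more general because a kernel can be specified directly, without an explicit basis, and some kernels (such as the squared exponential) correspond to infinitely many basis functions.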
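The equivalence can be checked numerically. The sketch below compares the weight-space predictive distribution of Bayesian linear regression with the GP predictive distribution under the kernel $\frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$. The quadratic polynomial basis, the values of $\alpha$ and $\beta$, and the toy data are assumptions; both variances are for a noisy target, so they include $\beta^{-1}$.

```python
import numpy as np

def features(x):
    """Polynomial basis phi(x) = (1, x, x^2) for a 1-D array x; shape (len(x), 3)."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)

alpha, beta = 2.0, 25.0                    # prior precision and noise precision
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10)
t = 0.5 - x + 2.0 * x**2 + rng.normal(scale=beta**-0.5, size=10)
x_star = np.linspace(-1.5, 1.5, 7)

Phi, Phi_star = features(x), features(x_star)

# Weight-space view: posterior over w, then push it through the basis.
S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                  # posterior mean
mean_w = Phi_star @ m_N
var_w = 1.0 / beta + np.sum((Phi_star @ S_N) * Phi_star, axis=1)

# Function-space view: GP with kernel k(x, x') = phi(x)^T phi(x') / alpha.
C = Phi @ Phi.T / alpha + np.eye(len(x)) / beta
K_star = Phi @ Phi_star.T / alpha                             # column j is k_*
mean_f = K_star.T @ np.linalg.solve(C, t)
c_ss = np.sum(Phi_star**2, axis=1) / alpha + 1.0 / beta
var_f = c_ss - np.sum(K_star * np.linalg.solve(C, K_star), axis=0)

print("max |mean difference|:    ", np.max(np.abs(mean_w - mean_f)))
print("max |variance difference|:", np.max(np.abs(var_w - var_f)))
```

The weight-space route inverts a $3 \times 3$ matrix and the function-space route works with an $N \times N$ one, which is why the parametric form is cheaper when the number of basis functions is small compared with $N$.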