Lecture 4.5

Bayesian Predictive Distributions

Computing predictive distributions that properly propagate weight uncertainty, yielding calibrated confidence intervals on new predictions.

Learning Objectives

Define the Bayesian predictive distribution as a marginalization over the weight posterior.
State the closed-form Gaussian result for the predictive mean and variance.
Interpret the two terms in the predictive variance: irreducible noise and model uncertainty.
Explain why predictive uncertainty is smaller near training data and larger in unobserved regions.

1. Bayesian Model Averaging

A point estimate $\hat{\mathbf{w}}$ (MLE or MAP) commits to a single model. The Bayesian approach instead averages predictions over all weight values, weighted by how probable each is given the data:

Predictive Distribution

The Bayesian predictive distribution for a new target $t'$ at input $x'$ marginalizes out the weights:

$$p(t' \mid x', \mathbf{X}, \mathbf{t}) = \int p(t' \mid x', \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \alpha, \beta)\, d\mathbf{w}.$$

Each model $p(t'|x', \mathbf{w}, \beta)$ is weighted by the posterior probability $p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \alpha, \beta)$ of that model given the data. The result is a distribution over $t'$ that reflects both measurement noise and uncertainty about the model weights.
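The marginalization can also be read as a two-step sampling recipe: draw weights from the posterior, then draw a target from the likelihood for those weights. The sketch below illustrates this by Monte Carlo. The two-weight posterior (`m_post`, `S_post`), the noise precision `beta`, and the feature map $\boldsymbol{\phi}(x') = (1, x')$ are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian posterior over two weights (bias, slope); illustrative values only.
m_post = np.array([0.5, -1.0])
S_post = np.array([[0.04, 0.01],
                   [0.01, 0.09]])
beta = 25.0                      # assumed noise precision
phi_new = np.array([1.0, 0.8])   # assumed features phi(x') = (1, x') at x' = 0.8

n_samples = 200_000
# Step 1: draw weight vectors from the posterior p(w | X, t).
w_samples = rng.multivariate_normal(m_post, S_post, size=n_samples)
# Step 2: for each w, draw a target from the likelihood p(t' | x', w, beta).
t_samples = w_samples @ phi_new + rng.normal(0.0, beta ** -0.5, size=n_samples)

# The t_samples are draws from the predictive distribution itself.
mc_mean = t_samples.mean()
mc_var = t_samples.var()
print(f"Monte Carlo predictive mean:     {mc_mean:.3f}")
print(f"Monte Carlo predictive variance: {mc_var:.3f}")
```

A histogram of `t_samples` approximates the predictive density. Its spread combines the spread of the weights with the measurement noise.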

2. Closed-Form Gaussian Predictive Distribution

Because the likelihood is Gaussian with a mean that is linear in $\mathbf{w}$, and the posterior over $\mathbf{w}$ is also Gaussian, this integral can be evaluated analytically (Bishop §2.3). The result is again Gaussian:

Gaussian Predictive Distribution

$$p(t' \mid x', \mathbf{X}, \mathbf{t}) = \mathcal{N}(t' \mid \mu_N(x'),\, \sigma_N^2(x')),$$

with

$$\mu_N(x') = \mathbf{m}_N^\top \boldsymbol{\phi}(x'),$$

$$\sigma_N^2(x') = \underbrace{\beta^{-1}}_{\text{noise}} + \underbrace{\boldsymbol{\phi}(x')^\top \mathbf{S}_N\, \boldsymbol{\phi}(x')}_{\text{model uncertainty}},$$

where $\mathbf{m}_N$ and $\mathbf{S}_N$ are the posterior mean and covariance from Lecture 4.3.
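One way to see where these two expressions come from is to skip the integral and work with moments. This is a sketch that assumes the model is written as $t' = \mathbf{w}^\top \boldsymbol{\phi}(x') + \epsilon$, with noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ independent of the posterior draw $\mathbf{w} \sim \mathcal{N}(\mathbf{m}_N, \mathbf{S}_N)$:

$$\begin{aligned}
\mathbb{E}[t'] &= \mathbb{E}[\mathbf{w}]^\top \boldsymbol{\phi}(x') + \mathbb{E}[\epsilon] = \mathbf{m}_N^\top \boldsymbol{\phi}(x'), \\
\operatorname{Var}[t'] &= \boldsymbol{\phi}(x')^\top \operatorname{Cov}[\mathbf{w}]\, \boldsymbol{\phi}(x') + \operatorname{Var}[\epsilon] = \boldsymbol{\phi}(x')^\top \mathbf{S}_N\, \boldsymbol{\phi}(x') + \beta^{-1}.
\end{aligned}$$

Since $t'$ is a linear function of jointly Gaussian variables, it is itself Gaussian, so these two moments determine the whole predictive distribution.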

3. Understanding the Predictive Variance

The predictive variance $\sigma_N^2(x')$ has two additive contributions:

$\beta^{-1}$: irreducible measurement noise — present even with infinitely many training observations.
$\boldsymbol{\phi}(x')^\top \mathbf{S}_N \boldsymbol{\phi}(x')$: uncertainty in the weights, encoded in the posterior covariance. This term vanishes as $N \to \infty$ since $\mathbf{S}_N \to \mathbf{0}$ (Lecture 4.4).
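The two contributions can be computed separately, as in the minimal NumPy sketch below. It assumes the Lecture 4.3 posterior takes the standard form for a zero-mean isotropic prior, $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$. The helper names (`gaussian_basis`, `posterior`, `predictive`), the basis layout, and the values of `alpha` and `beta` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Design matrix: a bias column followed by one Gaussian bump per center."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    bumps = np.exp(-0.5 * ((x - centers) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), bumps])

def posterior(Phi, t, alpha, beta):
    """Posterior mean m_N and covariance S_N, assuming the prior N(0, alpha^{-1} I)."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(Phi_new, m_N, S_N, beta):
    """Predictive mean, noise variance, and per-point model variance phi^T S_N phi."""
    mean = Phi_new @ m_N
    model_var = np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, 1.0 / beta, model_var

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # assumed prior and noise precisions
centers, width = np.linspace(0, 1, 9), 0.1   # assumed basis layout

x_train = rng.uniform(0, 1, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, beta ** -0.5, size=10)
m_N, S_N = posterior(gaussian_basis(x_train, centers, width), t_train, alpha, beta)

x_new = np.linspace(0, 1, 5)
mean, noise_var, model_var = predictive(gaussian_basis(x_new, centers, width), m_N, S_N, beta)
total_var = noise_var + model_var

for x, mu, mv, tv in zip(x_new, mean, model_var, total_var):
    print(f"x'={x:.2f}  mean={mu:+.3f}  noise={noise_var:.3f}  model={mv:.3f}  total={tv:.3f}")
```

The noise column is the same at every $x'$. Only the model column varies with the query point.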

Limit $N \to \infty$

As $N \to \infty$, $\mathbf{S}_N \to \mathbf{0}$, so $\sigma_N^2(x') \to \beta^{-1}$ for all $x'$. The predictive uncertainty collapses to the irreducible noise floor: even with the weights known exactly, a new target still scatters around the mean with variance $\beta^{-1}$.

4. Uncertainty Grows Away from Data

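The model uncertainty term $\boldsymbol{\phi}(x')^\top \mathbf{S}_N \boldsymbol{\phi}(x')$ is small when $x'$ lies near training inputs and grows as $x'$ moves into regions that the data have not constrained. One caveat applies to purely localized bases such as Gaussians: far outside the region covered by the basis functions, every component of $\boldsymbol{\phi}(x')$ decays to zero, so this term shrinks again and only the noise $\beta^{-1}$ remains.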

Simulated Example (Sine Curve)

Fit a Gaussian-basis Bayesian model to observations from $t = \sin(2\pi x) + \epsilon$. With $N = 1$ observation, the predictive band is wide everywhere except very near that data point. With $N = 4$, bands tighten around the observed region. With $N = 25$, the band is narrow across the whole domain and the mean closely tracks the true sine.

The intuition: observations near $x'$ constrain the weights of the basis functions that are active at $x'$. Where data exists, the posterior $p(\mathbf{w}|\mathcal{D})$ is sharp in those directions and predictions are confident. Where no data exists, the relevant weights keep roughly their prior spread, leading to wider bands.
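The narrowing of the band in the sine example can be reproduced numerically. In the sketch below, the posterior form $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, the basis layout, the precisions, and the random training inputs are all assumptions made for illustration. Under that form the band width depends only on the training inputs, not on the targets, so the sine values are not needed to compute it.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Design matrix: a bias column followed by one Gaussian bump per center."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    bumps = np.exp(-0.5 * ((x - centers) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), bumps])

def model_variance(x_train, x_grid, centers, width, alpha, beta):
    """phi(x)^T S_N phi(x) on a grid, assuming S_N^{-1} = alpha I + beta Phi^T Phi."""
    Phi = gaussian_basis(x_train, centers, width)
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    Phi_grid = gaussian_basis(x_grid, centers, width)
    return np.einsum("ij,jk,ik->i", Phi_grid, S_N, Phi_grid)

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # assumed prior and noise precisions
centers, width = np.linspace(0, 1, 9), 0.1   # assumed basis layout
x_pool = rng.uniform(0, 1, size=25)          # training inputs; the first N are used
x_grid = np.linspace(0, 1, 101)

model_var = {}
for n in (1, 4, 25):
    model_var[n] = model_variance(x_pool[:n], x_grid, centers, width, alpha, beta)
    band = np.sqrt(1.0 / beta + model_var[n])   # predictive standard deviation
    print(f"N={n:2d}  mean half-width={band.mean():.3f}  max half-width={band.max():.3f}")
```

Because the three training sets are nested, each added observation can only increase the posterior precision, so the model variance at every grid point is non-increasing in $N$.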

Two Equivalent Views of Uncertainty

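(1) Plot $\mu_N(x') \pm \sigma_N(x')$ as a shaded predictive band around the mean prediction. (2) Sample multiple weight vectors $\mathbf{w} \sim p(\mathbf{w}|\mathcal{D})$ and plot the corresponding curves $y(x) = \boldsymbol{\phi}(x)^\top \mathbf{w}$. Both reflect the same underlying posterior, with one difference: the spread of the sampled curves at a given $x$ corresponds to the model uncertainty term $\boldsymbol{\phi}(x)^\top \mathbf{S}_N \boldsymbol{\phi}(x)$ alone, while the band in the first view also includes the noise variance $\beta^{-1}$ and is therefore slightly wider.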
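The relation between the two views can be checked numerically. The sketch below draws weight vectors from an assumed posterior of the form $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, $\mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$, evaluates each sampled curve on a grid, and compares the pointwise spread of the curves with the analytic quantities. The basis, the precisions, and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                      # assumed prior and noise precisions
centers, width = np.linspace(0, 1, 9), 0.1   # assumed basis layout

def gaussian_basis(x):
    """Design matrix: a bias column followed by one Gaussian bump per center."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones((x.shape[0], 1)),
                      np.exp(-0.5 * ((x - centers) / width) ** 2)])

# Noisy sine observations and the resulting Gaussian posterior over the weights.
x_train = rng.uniform(0, 1, size=8)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, beta ** -0.5, size=8)
Phi = gaussian_basis(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
S_N = 0.5 * (S_N + S_N.T)                    # remove round-off asymmetry before sampling
m_N = beta * S_N @ Phi.T @ t_train

x_grid = np.linspace(0, 1, 50)
Phi_grid = gaussian_basis(x_grid)

# View 1: analytic standard deviations on the grid.
model_std = np.sqrt(np.einsum("ij,jk,ik->i", Phi_grid, S_N, Phi_grid))
band_std = np.sqrt(1.0 / beta + model_std ** 2)   # includes the noise term

# View 2: sample weight vectors and evaluate the curves y(x) = phi(x)^T w.
w_samples = rng.multivariate_normal(m_N, S_N, size=20_000)
curves = w_samples @ Phi_grid.T                   # shape (20000, 50)
curve_std = curves.std(axis=0)

print("largest relative gap between curve spread and model_std:",
      np.max(np.abs(curve_std / model_std - 1.0)))
```

Up to sampling error, `curve_std` tracks `model_std`. Adding independent noise of variance $\beta^{-1}$ to each sampled curve value would bring the spread up to `band_std`.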