Lecture 4.5
Bayesian Predictive Distributions
Computing predictive distributions that properly propagate weight uncertainty, yielding calibrated confidence intervals on new predictions.
- Define the Bayesian predictive distribution as a marginalization over the weight posterior.
- State the closed-form Gaussian result for the predictive mean and variance.
- Interpret the two terms in the predictive variance: irreducible noise and model uncertainty.
- Explain why predictive uncertainty is smaller near training data and larger in unobserved regions.
1. Bayesian Model Averaging
A point estimate $\hat{\mathbf{w}}$ (MLE or MAP) commits to a single model. The Bayesian approach instead averages predictions over all weight values, weighting each by its posterior probability given the data.
The Bayesian predictive distribution for a new target $t'$ at input $x'$ marginalizes out the weights:
$$p(t' \mid x', \mathbf{X}, \mathbf{t}) = \int p(t' \mid x', \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \alpha, \beta)\, d\mathbf{w}.$$
Each model $p(t'|x', \mathbf{w}, \beta)$ is weighted by the posterior probability $p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \alpha, \beta)$ of that model given the data. The result is a distribution over $t'$ that reflects both measurement noise and uncertainty about the model weights.
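The marginalization can be approximated by sampling: draw weights from the posterior, push each through the model, and add observation noise. The sketch below is illustrative only. It assumes the standard Gaussian posterior for a zero-mean isotropic prior with precision $\alpha$ (the notes defer the posterior to Lecture 4.3), and the cubic polynomial basis, the data, and the values of $\alpha$ and $\beta$ are hypothetical choices.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(m_N, S_N), assuming a zero-mean isotropic Gaussian prior."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    S_N = 0.5 * (S_N + S_N.T)  # guard against round-off asymmetry
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def mc_predictive_samples(phi_new, m_N, S_N, beta, n_samples, rng):
    """Sample t' by drawing w from the posterior, then adding Gaussian noise."""
    w = rng.multivariate_normal(m_N, S_N, size=n_samples)  # (n_samples, M)
    y = w @ phi_new                                        # noise-free predictions
    return y + rng.normal(0.0, np.sqrt(1.0 / beta), size=n_samples)

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0  # hypothetical prior and noise precisions

# Hypothetical training data from a noisy sine
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(1.0 / beta), size=10)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])  # cubic basis (assumption)

m_N, S_N = posterior(Phi, t, alpha, beta)
phi_new = np.array([1.0, 0.5, 0.25, 0.125])  # basis vector at x' = 0.5
samples = mc_predictive_samples(phi_new, m_N, S_N, beta, 200_000, rng)

print("Monte Carlo mean:", samples.mean())
print("Monte Carlo variance:", samples.var())
```

The empirical mean and variance of `samples` estimate the first two moments of $p(t' \mid x', \mathbf{X}, \mathbf{t})$ at $x' = 0.5$.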
2. Closed-Form Gaussian Predictive Distribution
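Because both the likelihood and the posterior are Gaussian, this integral can be evaluated analytically (Bishop §2.3). The result is also Gaussian:
$$p(t' \mid x', \mathbf{X}, \mathbf{t}) = \mathcal{N}\!\left(t' \mid \mu_N(x'),\, \sigma_N^2(x')\right),$$
with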
$$\mu_N(x') = \mathbf{m}_N^\top \boldsymbol{\phi}(x'),$$
$$\sigma_N^2(x') = \underbrace{\beta^{-1}}_{\text{noise}} + \underbrace{\boldsymbol{\phi}(x')^\top \mathbf{S}_N\, \boldsymbol{\phi}(x')}_{\text{model uncertainty}},$$
where $\mathbf{m}_N$ and $\mathbf{S}_N$ are the posterior mean and covariance from Lecture 4.3.
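As a rough sketch, the two formulas translate into a few lines of NumPy. The posterior expressions used for $\mathbf{m}_N$ and $\mathbf{S}_N$ assume a zero-mean isotropic prior with precision $\alpha$. The helper names `gaussian_basis` and `predictive`, the basis centers and width, and the synthetic data are all hypothetical.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Design matrix: a constant column plus Gaussian bumps (assumed basis)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    bumps = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), bumps])

def predictive(Phi_train, t, Phi_new, alpha, beta):
    """Return mu_N(x') and sigma_N^2(x') for each row of Phi_new."""
    M = Phi_train.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi_train.T @ Phi_train)
    m_N = beta * S_N @ Phi_train.T @ t
    mean = Phi_new @ m_N
    # Row-wise quadratic form phi(x')^T S_N phi(x')
    model_var = np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, 1.0 / beta + model_var

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                       # hypothetical precisions
centers, width = np.linspace(0.0, 1.0, 9), 0.1

x_train = rng.uniform(0.0, 1.0, size=25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, size=25)
x_new = np.linspace(0.0, 1.0, 5)

mean, var = predictive(gaussian_basis(x_train, centers, width), t_train,
                       gaussian_basis(x_new, centers, width), alpha, beta)
for xi, mu, v in zip(x_new, mean, var):
    print(f"x'={xi:.2f}  mean={mu:+.3f}  variance={v:.4f}")
```

Every reported variance is at least $\beta^{-1}$, since the model-uncertainty term is a quadratic form in a positive-definite matrix.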
3. Understanding the Predictive Variance
The predictive variance $\sigma_N^2(x')$ has two additive contributions:
- $\beta^{-1}$: irreducible measurement noise, present even with infinitely many training observations.
- $\boldsymbol{\phi}(x')^\top \mathbf{S}_N \boldsymbol{\phi}(x')$: uncertainty in the weights, encoded in the posterior covariance. This term vanishes as $N \to \infty$ since $\mathbf{S}_N \to \mathbf{0}$ (Lecture 4.4).
As $N \to \infty$, $\mathbf{S}_N \to \mathbf{0}$, so $\sigma_N^2(x') \to \beta^{-1}$ for all $x'$. The predictive uncertainty collapses to the irreducible noise floor, the best any model can achieve.
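The limit can be checked numerically. The sketch below tracks the model-uncertainty term at a fixed $x' = 0.5$ as the training set grows. It assumes the posterior covariance $\mathbf{S}_N = (\alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1}$ of a zero-mean isotropic prior, and the cubic polynomial basis and uniformly drawn inputs are hypothetical choices. Under this assumption $\mathbf{S}_N$ depends only on the inputs, so no targets are needed.

```python
import numpy as np

def model_variance(x_train, x_new, alpha, beta, degree=3):
    """phi(x')^T S_N phi(x') for a polynomial basis (hypothetical choice)."""
    Phi = np.vander(x_train, degree + 1, increasing=True)
    phi = np.vander(np.atleast_1d(x_new), degree + 1, increasing=True)[0]
    S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)
    return phi @ S_N @ phi

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0  # hypothetical precisions
sizes = [1, 10, 100, 1000, 10000]

# Nested training sets: each larger set contains the smaller ones
x_all = rng.uniform(0.0, 1.0, size=max(sizes))
model_terms = [model_variance(x_all[:n], 0.5, alpha, beta) for n in sizes]
total_vars = [1.0 / beta + m for m in model_terms]

for n, m, v in zip(sizes, model_terms, total_vars):
    print(f"N={n:6d}  model term={m:.6f}  sigma_N^2={v:.6f}  (noise floor {1 / beta:.3f})")
```

The model term shrinks with each added batch of inputs, while $\sigma_N^2(x')$ approaches the noise floor $\beta^{-1}$ from above.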
4. Uncertainty Grows Away from Data
The model uncertainty term $\boldsymbol{\phi}(x')^\top \mathbf{S}_N \boldsymbol{\phi}(x')$ is small when $x'$ lies near training inputs and large when $x'$ is far from them.
Fit a Gaussian-basis Bayesian model to observations from $t = \sin(2\pi x) + \epsilon$. With $N = 1$ observation, the predictive band is wide everywhere except very near that data point. With $N = 4$, bands tighten around the observed region. With $N = 25$, the band is narrow across the whole domain and the mean closely tracks the true sine.
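The intuition: predictions at $x'$ depend on the weights of the basis functions active near $x'$, and the posterior $p(\mathbf{w}|\mathcal{D})$ constrains those weights only where data has been observed. Where data exists, the posterior over the relevant weights is sharp and predictions are confident. Where no data exists, the model falls back on the prior, leading to wider bands.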
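A small numerical check of this intuition, under stated assumptions: a single observation at $x = 0.3$, nine Gaussian basis functions of width 0.1 (hypothetical choices), and a zero-mean isotropic prior with precision $\alpha$, so the prior model variance at $x'$ is $\boldsymbol{\phi}(x')^\top \boldsymbol{\phi}(x') / \alpha$. The sketch compares posterior and prior model variance across the domain.

```python
import numpy as np

def design(x, centers, width):
    """Gaussian basis functions (hypothetical centers and width)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

alpha, beta = 2.0, 25.0                       # hypothetical precisions
centers, width = np.linspace(0.0, 1.0, 9), 0.1

x_train = np.array([0.3])                     # a single observation, N = 1
Phi = design(x_train, centers, width)
# S_N depends only on the inputs, so no target value is needed here
S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

grid = np.linspace(0.0, 1.0, 101)
Phi_grid = design(grid, centers, width)
post_var = np.sum((Phi_grid @ S_N) * Phi_grid, axis=1)   # phi^T S_N phi
prior_var = np.sum(Phi_grid ** 2, axis=1) / alpha        # phi^T phi / alpha

for x_query in (0.3, 0.5, 0.9):
    i = int(np.argmin(np.abs(grid - x_query)))
    print(f"x'={grid[i]:.2f}  posterior/prior model variance = {post_var[i] / prior_var[i]:.3f}")
```

Near the observation the ratio is far below one; far from it the ratio is close to one, meaning the data has told the model almost nothing there.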
Two complementary ways to visualize the predictive distribution: (1) Plot $\mu_N(x') \pm \sigma_N(x')$ as a shaded confidence band around the mean prediction. (2) Sample multiple weight vectors $\mathbf{w} \sim p(\mathbf{w}|\mathcal{D})$ and plot the corresponding curves $y(x) = \boldsymbol{\phi}(x)^\top \mathbf{w}$. Both reflect the same underlying posterior: the spread of the sampled curves matches the model-uncertainty part of the band, while the band itself is wider by the noise term $\beta^{-1}$.
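The two views can be compared numerically. The following sketch computes the analytic band and the pointwise spread of sampled curves on a grid. It assumes a zero-mean isotropic prior, and the Gaussian basis, its centers and width, the $N = 4$ synthetic observations, and the sample count are hypothetical. Note that the sampled curves are noise-free, so their spread corresponds to the model-uncertainty term, and the band additionally includes $\beta^{-1}$.

```python
import numpy as np

def design(x, centers, width=0.1):
    """Gaussian basis functions (hypothetical centers and width)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0                       # hypothetical precisions
centers = np.linspace(0.0, 1.0, 9)

x_train = rng.uniform(0.0, 1.0, size=4)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, size=4)
Phi = design(x_train, centers)
S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)
S_N = 0.5 * (S_N + S_N.T)                     # guard against round-off asymmetry
m_N = beta * S_N @ Phi.T @ t_train

grid = np.linspace(0.0, 1.0, 50)
Phi_grid = design(grid, centers)

# View 1: analytic mean and band half-width
mean = Phi_grid @ m_N
model_std = np.sqrt(np.sum((Phi_grid @ S_N) * Phi_grid, axis=1))
band_std = np.sqrt(1.0 / beta + model_std ** 2)

# View 2: curves y(x) = phi(x)^T w for posterior samples w
W = rng.multivariate_normal(m_N, S_N, size=5000)   # (5000, M)
curves = W @ Phi_grid.T                            # (5000, 50)
curve_std = curves.std(axis=0)

print("max |curve_std - model_std|:", np.max(np.abs(curve_std - model_std)))
```

The arrays `mean`, `band_std`, and a handful of rows of `curves` can be passed to a plotting library, for example matplotlib's `fill_between` for the band and `plot` for the sampled curves.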