Lecture 2.6

Bayesian Prediction

Learning Objectives

After this lecture you should be able to:

Explain what makes a modeling approach Bayesian: consistent application of the sum and product rules to maintain uncertainty at all levels.
Compare MLE, MAP, and Bayesian prediction: what each optimizes or marginalizes, and at what level uncertainty is retained.
Write down the Bayesian predictive distribution as a marginalization of the predictive likelihood over the posterior on $\mathbf{w}$, and interpret it as Bayesian model averaging.
Explain why data that is conditionally independent given $\mathbf{w}$ is generally not marginally independent after $\mathbf{w}$ is integrated out.
Identify the main practical challenge of Bayesian inference (intractable evidence integral) and explain when an analytic solution is possible (Gaussian conjugacy).

MLE and MAP both commit to a single set of parameters $\mathbf{w}^*$ and discard all uncertainty about the model. The fully Bayesian approach never makes such a commitment — it retains the entire posterior distribution over $\mathbf{w}$ and propagates that uncertainty into every prediction.

1. What Makes an Approach Bayesian?

A modeling approach is Bayesian if it applies the sum and product rules of probability consistently, at every level of the model. In practice this means:

Uncertainty over targets is captured by predictive distributions $p(t|\mathbf{x}, \mathbf{w})$ — MLE and MAP both do this.
Uncertainty over model parameters is captured by the posterior $p(\mathbf{w}|\mathcal{D})$ — MAP acknowledges this but then discards it by taking the mode. The fully Bayesian approach keeps it.

2. The Three Approaches Side by Side

MLE, MAP, and Bayesian Prediction

MLE: find $\mathbf{w}^*_{ML} = \arg\max_{\mathbf{w}} p(\mathcal{D}|\mathbf{w})$; use it as a fixed point estimate. No uncertainty over $\mathbf{w}$.
MAP: find $\mathbf{w}^*_{MAP} = \arg\max_{\mathbf{w}} p(\mathbf{w}|\mathcal{D})$; use it as a fixed point estimate. Acknowledges the posterior but takes only its mode.
Bayesian: retain the full posterior $p(\mathbf{w}|\mathcal{D})$ and marginalize over $\mathbf{w}$ when making predictions. No point estimate is ever committed to.
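As a concrete sketch of these three rules, consider a coin whose heads probability plays the role of $\mathbf{w}$. The data, the Beta prior, and the variable names below are illustrative assumptions, not part of the lecture; the formulas are the standard Beta-Bernoulli results.

```python
# Coin-flip illustration of the three approaches (all numbers are made up).
heads, tails = 3, 0          # hypothetical data: three heads, no tails
a, b = 2.0, 2.0              # hypothetical Beta(a, b) prior on theta = P(heads)

# MLE: maximize the likelihood theta^heads * (1 - theta)^tails.
theta_mle = heads / (heads + tails)

# MAP: mode of the Beta(a + heads, b + tails) posterior
# (this mode formula is valid when both posterior parameters exceed 1).
theta_map = (a + heads - 1) / (a + b + heads + tails - 2)

# Bayesian: P(next flip is heads | data) is the posterior mean of theta,
# i.e. the Bernoulli likelihood averaged over the whole posterior.
p_heads_bayes = (a + heads) / (a + b + heads + tails)

print(theta_mle, theta_map, p_heads_bayes)
```

With only three flips, MLE predicts heads with certainty, MAP is pulled toward the prior, and the Bayesian prediction is more cautious still because it averages over every plausible value of the heads probability.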

MLE and MAP both reduce to point estimation: they select a single model and treat it as fixed, which is why they are sometimes called frequentist in spirit (even though MAP does use a prior). The Bayesian approach is fundamentally different: it treats $\mathbf{w}$ as a random variable throughout.

3. The Bayesian Predictive Distribution

Given a new input $\mathbf{x}^*$, we want $p(t^* | \mathbf{x}^*, \mathcal{D})$ — the distribution over the target after seeing all training data, without committing to any particular $\mathbf{w}$. We obtain it by marginalizing the joint distribution over $\mathbf{w}$:

$$p(t^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(t^*, \mathbf{w} \mid \mathbf{x}^*, \mathcal{D})\, d\mathbf{w}$$

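Applying the product rule to the integrand gives $p(t^* \mid \mathbf{x}^*, \mathbf{w}, \mathcal{D})\; p(\mathbf{w} \mid \mathbf{x}^*, \mathcal{D})$. Two simplifications follow. The first factor drops $\mathcal{D}$ because $t^*$ depends on the training data only through $\mathbf{w}$ (the new test point is independent of the training data given the model). The second factor drops $\mathbf{x}^*$ because a new input without its target carries no information about $\mathbf{w}$. The result is: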

Bayesian Predictive Distribution $$p(t^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(t^* \mid \mathbf{x}^*, \mathbf{w})\; p(\mathbf{w} \mid \mathcal{D})\; d\mathbf{w}$$

This is a weighted average of predictive distributions, one per model $\mathbf{w}$, each weighted by how plausible that model is given the data. It is called Bayesian model averaging.

Models with high posterior probability $p(\mathbf{w}|\mathcal{D})$, meaning they fit the data well and are plausible under the prior, contribute more to the average; implausible models are down-weighted automatically.
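When the integral has no closed form, one common way to approximate this weighted average is Monte Carlo: draw samples of $\mathbf{w}$ from the posterior and average the per-sample predictions. The sketch below uses a hypothetical coin-flip model (Beta prior, Bernoulli likelihood), chosen only because its exact answer is known and can be used as a check; none of the numbers come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

heads, tails = 3, 0          # hypothetical data
a, b = 2.0, 2.0              # hypothetical Beta(a, b) prior on w = P(heads)

# Draw samples w_s ~ p(w | D); here the posterior is Beta(a + heads, b + tails).
w_samples = rng.beta(a + heads, b + tails, size=200_000)

# For a Bernoulli likelihood p(t* = heads | w) is simply w, so the
# predictive integral becomes the average of w over posterior samples.
p_heads_mc = w_samples.mean()

# Closed form for comparison: the mean of the Beta posterior.
p_heads_exact = (a + heads) / (a + b + heads + tails)

print(p_heads_mc, p_heads_exact)
```

Each sample acts as one "model" in the average, and sampling from the posterior supplies the weighting: plausible values of $\mathbf{w}$ are simply drawn more often.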

4. Conditional Independence Does Not Imply Marginal Independence

Under the i.i.d. assumption, data points are conditionally independent given $\mathbf{w}$: $$p(x_1, \ldots, x_N \mid \mathbf{w}) = \prod_{i=1}^{N} p(x_i \mid \mathbf{w})$$

It might seem that the data would remain independent after marginalizing out $\mathbf{w}$. In general they do not. The marginal joint distribution is:

$$p(x_1, \ldots, x_N) = \int \prod_{i=1}^{N} p(x_i \mid \mathbf{w})\; p(\mathbf{w})\; d\mathbf{w}$$

This is a single integral over one shared $\mathbf{w}$ — the same parameters appear in every factor. In contrast, if the $x_i$ were truly marginally independent, we could write $p(x_1,\ldots,x_N) = \prod_i p(x_i)$ where each $p(x_i) = \int p(x_i|\mathbf{w}_i)p(\mathbf{w}_i)d\mathbf{w}_i$ with its own independent $\mathbf{w}_i$. These two expressions are not equal in general.

Intuition: even if individual coin flips are independent for a fixed coin, once you are uncertain about whether the coin is fair, observing several heads in a row updates your belief about the coin — and that updated belief makes future heads more likely. The observations become correlated through the shared unknown $\mathbf{w}$.
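The coin intuition can be checked with exact arithmetic. The sketch below assumes a made-up two-point prior (the coin is either fair or lands heads 90% of the time, each with probability 1/2) and compares the joint probability of two heads with the product of the marginals.

```python
from fractions import Fraction as F

# Hypothetical prior: the coin is fair or heads-biased, each with probability 1/2.
prior = {F(1, 2): F(1, 2), F(9, 10): F(1, 2)}   # {P(heads | coin): P(coin)}

# Marginal probability of one head: sum over coins of p(H | w) p(w).
p_h = sum(w * pw for w, pw in prior.items())

# Joint probability of two heads: ONE shared coin generates both flips.
p_hh = sum(w * w * pw for w, pw in prior.items())

# Probability of a second head after having seen a first.
p_h2_given_h1 = p_hh / p_h

print(p_h, p_hh, p_h * p_h, p_h2_given_h1)
# → 7/10 53/100 49/100 53/70
```

The joint probability 53/100 exceeds the product of marginals 49/100, so the flips are not marginally independent: seeing one head raises the probability of the next from 7/10 to 53/70.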

5. The Evidence Integral and Gaussian Conjugacy

The posterior requires computing the evidence: $$p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\; p(\mathbf{w})\; d\mathbf{w}$$

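This integral is typically intractable in closed form for complex models. Practitioners either approximate the posterior (for example with variational inference, which optimizes a lower bound on the evidence) or sidestep the evidence entirely (for example with MCMC, which needs the posterior only up to its normalizing constant).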

One important exception: when the likelihood is Gaussian with a mean that is linear in $\mathbf{w}$ and the prior over $\mathbf{w}$ is also Gaussian, the posterior is Gaussian too. A prior that yields a posterior in the same family is called conjugate. The product of two Gaussian densities is proportional to a Gaussian, so the integral can be evaluated analytically. This is one reason Gaussian distributions are used so heavily throughout this course: they are not just a good model for noise but also mathematically convenient for Bayesian inference.

Caveat: choosing Gaussians partly for mathematical convenience is worth acknowledging. The Gaussian is often a good model for measurement noise (by the central limit theorem, when the noise is the sum of many small independent effects), but in other settings the choice is more a matter of tractability than accuracy.
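A minimal sketch of the Gaussian case, under assumptions not taken from the lecture: a scalar unknown mean $w$, a known noise standard deviation, and a Gaussian prior on $w$. It computes the standard analytic posterior and checks it against a brute-force grid normalization, which is only feasible because $w$ is one-dimensional. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: unknown mean w, known noise std sigma, prior N(w | mu0, s0^2).
sigma, mu0, s0 = 1.0, 0.0, 2.0
x = rng.normal(loc=1.5, scale=sigma, size=10)
n = x.size

# Analytic posterior N(w | mu_n, s_n^2): precisions add,
# and the mean is a precision-weighted combination of prior and data.
post_prec = 1.0 / s0**2 + n / sigma**2
mu_n = (mu0 / s0**2 + x.sum() / sigma**2) / post_prec
s_n = post_prec ** -0.5

# Numerical check: evaluate likelihood * prior on a grid and normalize.
# The normalizer is a Riemann-sum estimate of a quantity proportional to the
# evidence (Gaussian constants and the max shift below do not depend on w).
w = np.linspace(-10.0, 10.0, 20001)
dw = w[1] - w[0]
log_lik = (-0.5 * ((x[:, None] - w[None, :]) / sigma) ** 2).sum(axis=0)
log_prior = -0.5 * ((w - mu0) / s0) ** 2
log_unnorm = log_lik + log_prior
unnorm = np.exp(log_unnorm - log_unnorm.max())   # subtract max for stability
post = unnorm / (unnorm.sum() * dw)

grid_mean = (w * post).sum() * dw
grid_std = np.sqrt(((w - grid_mean) ** 2 * post).sum() * dw)

print(mu_n, grid_mean, s_n, grid_std)
```

The grid approach needs a number of evaluations that grows exponentially with the dimension of $\mathbf{w}$, which is why the closed-form Gaussian result matters in practice.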

6. Two Levels of Uncertainty

The Bayesian predictive distribution captures uncertainty at two levels simultaneously:

Observation noise — each $p(t^*|\mathbf{x}^*, \mathbf{w})$ is a distribution, not a point. This is present in MLE and MAP too.
Model uncertainty — the integral over $p(\mathbf{w}|\mathcal{D})$ reflects that we are not sure which $\mathbf{w}$ is correct. This is unique to the Bayesian approach and typically widens the predictive distribution, especially in regions with little training data.
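The two levels can be separated explicitly in Bayesian linear regression, where the standard Gaussian result gives a predictive variance equal to the noise variance plus a model term $\boldsymbol{\phi}(\mathbf{x}^*)^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}^*)$. The sketch below assumes a zero-mean isotropic Gaussian prior with precision `alpha`, a known noise precision `beta`, and a straight-line feature map; these choices and the synthetic data are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta = 1.0, 25.0        # hypothetical prior precision and noise precision

def features(x):
    """Feature map phi(x) = [1, x] for a straight-line model (an assumed choice)."""
    x = np.atleast_1d(x).astype(float)
    return np.stack([np.ones_like(x), x], axis=1)

# Hypothetical training data clustered in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=20)
t_train = 0.5 + 2.0 * x_train + rng.normal(scale=beta ** -0.5, size=20)

# Gaussian posterior over w:
#   S_N^{-1} = alpha * I + beta * Phi^T Phi,   m_N = beta * S_N Phi^T t
Phi = features(x_train)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

def predictive(x_star):
    """Return predictive mean, noise variance, and model variance at x_star."""
    phi = features(x_star)
    mean = phi @ m_N
    noise_var = np.full(len(phi), 1.0 / beta)            # observation noise
    model_var = np.einsum("ij,jk,ik->i", phi, S_N, phi)  # phi^T S_N phi
    return mean, noise_var, model_var

# One query inside the training range and one far outside it.
mean, noise_var, model_var = predictive([0.5, 5.0])
print(noise_var + model_var)   # total predictive variance at each query
```

The noise term is the same at both query points, while the model term is much larger at `5.0`, far from the training inputs, which is the widening described above.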