Lecture 5.2
Bayesian Model Comparison
Using the marginal likelihood (model evidence) to compare models of different complexity without a held-out test set.
- Define model evidence (marginal likelihood) and explain why it uses all available data.
- State the Bayes factor as the principled criterion for comparing two models.
- Derive the log-evidence approximation and identify the likelihood and complexity terms.
- Explain intuitively why the model evidence favors models of medium complexity.
1. Model Posterior and Model Evidence
We consider a set of candidate models $\{\mathcal{M}_i\}$ indexed, for example, by the number or type of basis functions. Given data $\mathcal{D}$, the posterior over models is
$$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i).$$
The term $p(\mathcal{D} \mid \mathcal{M}_i)$ is called the model evidence or marginal likelihood. It is the likelihood of the data under model $\mathcal{M}_i$ with the weight parameters $\mathbf{w}$ marginalized out:
$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}.$$
With a flat model prior $p(\mathcal{M}_i) \propto \text{const}$, selecting the most probable model reduces to maximizing the model evidence. The comparison between two models is quantified by the Bayes factor $p(\mathcal{D} \mid \mathcal{M}_1)/p(\mathcal{D} \mid \mathcal{M}_2)$; a value above 1 favors $\mathcal{M}_1$.
Model evidence uses the full dataset $\mathcal{D}$ — no data is withheld for a separate validation set. This is especially valuable when data is scarce.
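To make the evidence integral concrete, the following is a minimal sketch for one case where it has a closed form: a linear-in-the-weights model with a zero-mean Gaussian prior and Gaussian noise. The polynomial basis, the precisions `alpha` and `beta`, the synthetic data, and the helper names `poly_design` and `log_evidence` are all assumptions made for illustration; they are not taken from the lecture.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns x**0, ..., x**degree (assumed polynomial basis)."""
    return np.vander(x, degree + 1, increasing=True)

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | M) for t = Phi w + noise, with w ~ N(0, I/alpha), noise ~ N(0, I/beta).

    Marginalizing out w gives t ~ N(0, C) with C = Phi Phi^T / alpha + I / beta,
    so the evidence is a zero-mean Gaussian density evaluated at t.
    """
    n = len(t)
    C = Phi @ Phi.T / alpha + np.eye(n) / beta
    _, logdet = np.linalg.slogdet(C)      # log-determinant of C (C is positive definite)
    quad = t @ np.linalg.solve(C, t)      # t^T C^{-1} t without forming the inverse
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Synthetic data from a straight line with small noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
t = 2.0 * x - 1.0 + 0.1 * rng.standard_normal(x.size)

alpha, beta = 1.0, 100.0  # assumed prior precision and noise precision (1 / 0.1**2)

log_ev_const = log_evidence(poly_design(x, 0), t, alpha, beta)   # constant model
log_ev_linear = log_evidence(poly_design(x, 1), t, alpha, beta)  # linear model

# Log Bayes factor: positive values favor the linear model.
log_bayes_factor = log_ev_linear - log_ev_const
print(f"ln Bayes factor (linear vs constant): {log_bayes_factor:.1f}")
```

Both evidences are computed from all 20 data points; nothing is held out. Working with the log Bayes factor rather than the ratio itself avoids numerical underflow when the evidences are very small.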
2. Log-Evidence Approximation and Complexity Penalty
The marginal likelihood is the normalization constant of the weight posterior: $p(\mathcal{D} \mid \mathcal{M}) = p(\mathcal{D} \mid \mathbf{w}, \mathcal{M})\, p(\mathbf{w} \mid \mathcal{M}) / p(\mathbf{w} \mid \mathcal{D}, \mathcal{M})$. Approximating the posterior as sharply peaked (roughly Gaussian) around $\mathbf{w}_{\text{MAP}}$ with width $\Delta w_{\text{post}}$, taking the prior to be uniform with width $\Delta w_{\text{prior}}$, and assuming the same width ratio for each of the $M$ parameters, gives:
$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \underbrace{\ln p(\mathcal{D} \mid \mathbf{w}_{\text{MAP}})}_{\text{fit quality}} + \underbrace{M \ln\!\left(\frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}}\right)}_{\text{complexity penalty}}.$$
Since the posterior is narrower than the prior ($\Delta w_{\text{post}} < \Delta w_{\text{prior}}$), the ratio is less than 1 and its logarithm is negative. The penalty grows with model dimension $M$: more parameters incur a larger penalty. The model evidence therefore balances two competing forces:
- Maximizing data fit (first term) pushes toward more complex models.
- The complexity penalty (second term) pushes toward simpler models.
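As a sketch of where the approximation comes from, consider a single parameter $w$ and leave the model dependence implicit. Under the assumptions stated above, the flat prior contributes a constant factor $1/\Delta w_{\text{prior}}$, and the sharply peaked integrand contributes roughly its peak height times its width:

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw \approx p(\mathcal{D} \mid w_{\text{MAP}})\, \frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}}.$$

Taking logarithms gives $\ln p(\mathcal{D}) \approx \ln p(\mathcal{D} \mid w_{\text{MAP}}) + \ln(\Delta w_{\text{post}}/\Delta w_{\text{prior}})$. If each of the $M$ parameters is assumed to have the same width ratio, the second term is multiplied by $M$. For illustration, with a made-up ratio of $0.1$ the penalty is $M \ln 0.1 \approx -2.30\,M$, which is about $-6.9$ for $M = 3$ and about $-23.0$ for $M = 10$. A more complex model must therefore improve the fit term by more than this amount to achieve higher evidence.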
3. Why Medium Complexity Wins
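Consider three models $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$ of increasing complexity. Because $p(\mathcal{D} \mid \mathcal{M}_i)$ is a normalized distribution over datasets, every model has the same total probability mass of 1 to distribute. A simple model $\mathcal{M}_1$ can generate only a narrow range of datasets; within that range, each dataset has relatively high probability. A complex model $\mathcal{M}_3$ can generate many different datasets, but the probability of any particular dataset is spread thin across all those possibilities.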
For an observed dataset $\mathcal{D}$:
- $\mathcal{M}_1$ has near-zero evidence if $\mathcal{D}$ lies outside its representable range.
- $\mathcal{M}_3$ assigns low probability to $\mathcal{D}$ because it spreads probability mass over too many alternatives.
- The model $\mathcal{M}_2$ that is just complex enough to represent $\mathcal{D}$ achieves the highest evidence — a built-in form of Occam's Razor.
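The argument above can be illustrated with a deliberately schematic toy. Here the space of possible datasets is discretized into 100 hypothetical values, and each model spreads its unit probability mass uniformly over the range of datasets it can generate. The ranges, the observed index, and the helper `uniform_evidence` are invented for illustration and do not correspond to any real model.

```python
import numpy as np

N_DATASETS = 100  # hypothetical discretized space of possible datasets

def uniform_evidence(lo, hi):
    """Toy p(D | M): unit mass spread evenly over dataset indices lo..hi (inclusive)."""
    p = np.zeros(N_DATASETS)
    p[lo:hi + 1] = 1.0 / (hi - lo + 1)
    return p

# Three toy models of increasing complexity (wider range of representable datasets).
evidence = {
    "M1": uniform_evidence(45, 54),  # simple: covers 10 datasets
    "M2": uniform_evidence(35, 64),  # medium: covers 30 datasets
    "M3": uniform_evidence(0, 99),   # complex: covers all 100 datasets
}

observed = 60  # index of the observed dataset (outside M1's range, inside M2's)

for name, p in evidence.items():
    print(f"{name}: p(D | M) = {p[observed]:.4f}")

# With a flat model prior, the most probable model maximizes the evidence.
best = max(evidence, key=lambda name: evidence[name][observed])
print("highest evidence:", best)
```

In this toy, $\mathcal{M}_1$ assigns zero probability to the observed dataset, $\mathcal{M}_3$ assigns $1/100$, and $\mathcal{M}_2$ assigns $1/30$, so the medium model has the highest evidence. If the observed index were moved inside the range of $\mathcal{M}_1$, the simplest model would win instead, since it concentrates its mass on the fewest datasets.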