Lecture 5.2

Bayesian Model Comparison

Using the marginal likelihood (model evidence) to compare models of different complexity without a held-out validation set.

Learning Objectives
  • Define model evidence (marginal likelihood) and explain why it uses all available data.
  • State the Bayes factor as the principled criterion for comparing two models.
  • Derive the log-evidence approximation and identify the likelihood and complexity terms.
  • Explain intuitively why the model evidence favors models of medium complexity.

1. Model Posterior and Model Evidence

We consider a set of candidate models $\{\mathcal{M}_i\}$ indexed, for example, by the number or type of basis functions. Given data $\mathcal{D}$, the posterior over models is

$$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i).$$

Model Evidence (Marginal Likelihood)

The term $p(\mathcal{D} \mid \mathcal{M}_i)$ is called the model evidence or marginal likelihood. It is the likelihood of the data under model $\mathcal{M}_i$ with the weight parameters $\mathbf{w}$ marginalized out:

$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}.$$

With a flat model prior, $p(\mathcal{M}_i) = \text{const}$, selecting the most probable model reduces to maximizing the model evidence. Two models are compared through the Bayes factor $p(\mathcal{D} \mid \mathcal{M}_1)/p(\mathcal{D} \mid \mathcal{M}_2)$: a value above 1 favors $\mathcal{M}_1$, a value below 1 favors $\mathcal{M}_2$.
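For a model that is linear in its weights, with a zero-mean Gaussian prior and Gaussian noise, the integral over $\mathbf{w}$ has a closed form, which makes the Bayes factor easy to illustrate. The sketch below assumes exactly that setting. The polynomial basis, the synthetic data, and the hyperparameters `alpha` (prior precision) and `beta` (noise precision) are illustrative choices, not part of the lecture.

```python
import numpy as np


def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood of targets t for a linear-Gaussian model.

    Assumes w ~ N(0, alpha^{-1} I) and Gaussian noise with precision beta,
    so that marginally t ~ N(0, beta^{-1} I + alpha^{-1} Phi Phi^T).
    """
    N = Phi.shape[0]
    C = np.eye(N) / beta + Phi @ Phi.T / alpha   # marginal covariance of t
    _, logdet = np.linalg.slogdet(C)             # stable log-determinant
    quad = t @ np.linalg.solve(C, t)             # t^T C^{-1} t
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)


def poly_design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


# Synthetic data from a quadratic with small noise (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
t = 0.5 - 1.0 * x + 2.0 * x**2 + rng.normal(scale=0.1, size=x.size)

alpha, beta = 1.0, 100.0  # assumed hyperparameters (beta = 1 / 0.1**2)

# Model 1: quadratic polynomial. Model 2: constant only.
log_ev_1 = log_evidence(poly_design(x, 2), t, alpha, beta)
log_ev_2 = log_evidence(poly_design(x, 0), t, alpha, beta)

# The log Bayes factor is the difference of the log evidences.
log_bayes_factor = log_ev_1 - log_ev_2
print(f"ln p(D|M1) = {log_ev_1:.2f}, ln p(D|M2) = {log_ev_2:.2f}")
print(f"ln Bayes factor (M1 vs M2) = {log_bayes_factor:.2f}")
```

A positive log Bayes factor corresponds to a Bayes factor above 1, that is, support for the quadratic model. Working in log space avoids numerical underflow, since evidences for even moderately sized datasets are extremely small numbers.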

Advantage Over Cross-Validation

Model evidence uses the full dataset $\mathcal{D}$ — no data is withheld for a separate validation set. This is especially valuable when data is scarce.

2. Log-Evidence Approximation and Complexity Penalty

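By Bayes' theorem, the marginal likelihood is the normalization constant of the weight posterior: $p(\mathcal{D} \mid \mathcal{M}) = p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}) / p(\mathbf{w} \mid \mathcal{D})$, where the conditioning on $\mathcal{M}$ is suppressed on the right-hand side. Consider first a single parameter $w$. If the posterior is approximately Gaussian and sharply peaked around $w_{\text{MAP}}$ with width $\Delta w_{\text{post}}$, and the prior is uniform with width $\Delta w_{\text{prior}}$, so that $p(w) = 1/\Delta w_{\text{prior}}$, then the evidence integral is roughly the peak height of the integrand times its width: $p(\mathcal{D} \mid \mathcal{M}) \approx p(\mathcal{D} \mid w_{\text{MAP}})\, \Delta w_{\text{post}} / \Delta w_{\text{prior}}$. For $M$ parameters, assuming each has the same ratio $\Delta w_{\text{post}} / \Delta w_{\text{prior}}$, taking the logarithm gives: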

$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \underbrace{\ln p(\mathcal{D} \mid \mathbf{w}_{\text{MAP}})}_{\text{fit quality}} + \underbrace{M \ln\!\left(\frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}}\right)}_{\text{complexity penalty}}.$$

Complexity Penalty

Since the posterior is narrower than the prior ($\Delta w_{\text{post}} < \Delta w_{\text{prior}}$), the ratio is less than 1 and its logarithm is negative. The penalty grows with model dimension $M$: more parameters incur a larger penalty. The model evidence therefore balances two competing forces:

  • Maximizing data fit (first term) pushes toward more complex models.
  • The complexity penalty (second term) pushes toward simpler models.
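The two-term decomposition can be checked numerically in a one-parameter case ($M = 1$). The sketch below is a minimal example under stated assumptions: the data are Gaussian with unknown mean $w$ and known noise `sigma`, the prior on $w$ is uniform with width `delta_prior`, and the posterior width is taken as $\Delta w_{\text{post}} = \sqrt{2\pi}\,\sigma/\sqrt{N}$, the width for which "peak height times width" equals the area under a Gaussian. All numerical values are illustrative.

```python
import numpy as np

# Assumed toy setup: N noisy observations of an unknown mean w.
rng = np.random.default_rng(1)
sigma, N = 0.5, 10
t = rng.normal(loc=1.0, scale=sigma, size=N)

# Uniform prior on w over [-delta_prior / 2, +delta_prior / 2].
delta_prior = 10.0
w_grid = np.linspace(-delta_prior / 2, delta_prior / 2, 20001)
dw = w_grid[1] - w_grid[0]


def log_likelihood(w):
    """ln p(D | w) for each value in w (Gaussian noise, known sigma)."""
    w = np.atleast_1d(w)
    resid = t[None, :] - w[:, None]
    return (-0.5 * N * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * np.sum(resid**2, axis=1) / sigma**2)


# "Exact" evidence: integrate likelihood times prior density on the grid.
evidence_exact = np.sum(np.exp(log_likelihood(w_grid))) * dw / delta_prior

# Approximation: fit term at w_MAP plus the complexity penalty.
w_map = t.mean()                                   # flat prior: MAP = sample mean
delta_post = np.sqrt(2.0 * np.pi) * sigma / np.sqrt(N)
log_fit = log_likelihood(w_map)[0]
log_penalty = np.log(delta_post / delta_prior)     # negative: posterior is narrower
log_evidence_approx = log_fit + log_penalty

print(f"ln evidence (numerical integral) = {np.log(evidence_exact):.4f}")
print(f"ln evidence (fit + penalty)      = {log_evidence_approx:.4f}")
print(f"  fit term = {log_fit:.4f}, penalty term = {log_penalty:.4f}")
```

In this setting the likelihood is exactly Gaussian in $w$, so the two numbers should agree closely. For non-Gaussian posteriors the decomposition is only a rough guide. Widening `delta_prior` leaves the fit term unchanged and makes the penalty more negative.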

3. Why Medium Complexity Wins

Thought Experiment: Simple, Medium, and Complex Models

Consider three models of increasing complexity. A simple model $\mathcal{M}_1$ can generate only a narrow range of datasets; within that range, each dataset has relatively high probability. A complex model $\mathcal{M}_3$ can generate many different datasets, but the probability of any particular dataset is spread thin across all those possibilities.

For an observed dataset $\mathcal{D}$:

  • $\mathcal{M}_1$ has near-zero evidence if $\mathcal{D}$ lies outside its representable range.
  • $\mathcal{M}_3$ assigns low probability to $\mathcal{D}$ because it spreads probability mass over too many alternatives.
  • The model $\mathcal{M}_2$ that is just complex enough to represent $\mathcal{D}$ achieves the highest evidence — a built-in form of Occam's Razor.
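The thought experiment can be made concrete with a deliberately crude toy, sketched below. It assumes that the possible datasets are the integers 0 to 999 and that each model spreads its probability uniformly over the datasets it can generate. The supports and the observed dataset are hypothetical numbers chosen only to mirror the three cases above.

```python
from fractions import Fraction

# Hypothetical toy setup: each model is uniform over the datasets it can
# generate. A more complex model covers more datasets and therefore
# assigns less probability to each one.
supports = {
    "M1 (simple)": range(0, 10),
    "M2 (medium)": range(0, 100),
    "M3 (complex)": range(0, 1000),
}


def evidence(support, dataset):
    """p(D | M) for a model that is uniform over `support`."""
    if dataset in support:
        return Fraction(1, len(support))
    return Fraction(0)


observed = 50  # lies outside M1's range, inside M2's and M3's

evidences = {name: evidence(s, observed) for name, s in supports.items()}
best = max(evidences, key=evidences.get)

for name, ev in evidences.items():
    print(f"{name}: p(D | M) = {float(ev):.4f}")
print("highest evidence:", best)
```

Here $\mathcal{M}_1$ has zero evidence because it cannot produce the observed dataset, and $\mathcal{M}_3$ is beaten by $\mathcal{M}_2$ by a factor of 10 because it spreads its probability over ten times as many datasets. Setting `observed = 5` instead makes the simple model win, since all three can represent the data and the simplest one concentrates the most probability on it.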