Lecture 5.2

Bayesian Model Comparison

Using the marginal likelihood (model evidence) to compare models of different complexity without a held-out test set.

Learning Objectives

Define model evidence (marginal likelihood) and explain why it uses all available data.
State the Bayes factor as the principled criterion for comparing two models.
Derive the log-evidence approximation and identify the likelihood and complexity terms.
Explain intuitively why the model evidence favors models of medium complexity.

1. Model Posterior and Model Evidence

We consider a set of candidate models $\{\mathcal{M}_i\}$ indexed, for example, by the number or type of basis functions. Given data $\mathcal{D}$, the posterior over models is

$$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i).$$

Model Evidence (Marginal Likelihood)

The term $p(\mathcal{D} \mid \mathcal{M}_i)$ is called the model evidence or marginal likelihood. It is the likelihood of the data under model $\mathcal{M}_i$ with the weight parameters $\mathbf{w}$ marginalized out:

$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}.$$

With a flat model prior, $p(\mathcal{M}_i) = \text{const}$, selecting the most probable model reduces to maximizing the model evidence. The comparison between two models is quantified by the Bayes factor $p(\mathcal{D} \mid \mathcal{M}_1)/p(\mathcal{D} \mid \mathcal{M}_2)$; a value above 1 favors $\mathcal{M}_1$, a value below 1 favors $\mathcal{M}_2$.
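As a concrete illustration, the evidence has a closed form for linear-in-the-weights regression with Gaussian prior and Gaussian noise. The sketch below is one possible setup, not the lecture's own: the polynomial basis, the prior precision `alpha`, the noise precision `beta`, and the synthetic data are all assumptions made for the example. Under them, $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, \beta^{-1}\mathbf{I} + \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^\top)$, so the log evidence is just a Gaussian log density.

```python
import numpy as np

def log_evidence(x, t, degree, alpha=1.0, beta=25.0):
    """Log marginal likelihood ln p(t | M) for polynomial regression.

    Assumptions (illustrative, not from the lecture): prior w ~ N(0, I/alpha)
    and Gaussian noise with precision beta, so that t ~ N(0, C) with
    C = I/beta + Phi Phi^T / alpha.
    """
    Phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    n = len(t)
    C = np.eye(n) / beta + Phi @ Phi.T / alpha
    _, logdet = np.linalg.slogdet(C)          # C is positive definite
    quad = t @ np.linalg.solve(C, t)          # t^T C^{-1} t
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Synthetic data from a quadratic function plus noise (hypothetical example).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
t = 0.5 - 1.0 * x + 2.0 * x**2 + rng.normal(scale=0.2, size=x.size)

log_ev_quadratic = log_evidence(x, t, degree=2)
log_ev_constant = log_evidence(x, t, degree=0)

# Log Bayes factor: ln [ p(D | M_quadratic) / p(D | M_constant) ]
log_bayes_factor = log_ev_quadratic - log_ev_constant
print("log evidence (quadratic):", log_ev_quadratic)
print("log evidence (constant): ", log_ev_constant)
print("log Bayes factor:        ", log_bayes_factor)
```

A positive log Bayes factor corresponds to a Bayes factor above 1 and therefore favors the first model. Working in log space avoids numerical underflow, since the evidence itself is often extremely small.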

Advantage Over Cross-Validation

Model evidence uses the full dataset $\mathcal{D}$ — no data is withheld for a separate validation set. This is especially valuable when data is scarce.

2. Log-Evidence Approximation and Complexity Penalty

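The marginal likelihood is the normalization constant of the weight posterior. Rearranging Bayes' theorem gives $p(\mathcal{D} \mid \mathcal{M}) = p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}) / p(\mathbf{w} \mid \mathcal{D})$, which holds for every $\mathbf{w}$ (conditioning on $\mathcal{M}$ is left implicit on the right-hand side). Evaluating it at $\mathbf{w}_{\text{MAP}}$, approximating the posterior as a Gaussian of width $\Delta w_{\text{post}}$ concentrated around $\mathbf{w}_{\text{MAP}}$ and the prior as uniform with width $\Delta w_{\text{prior}}$, and assuming the same width ratio for each of the $M$ parameters, gives: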

$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \underbrace{\ln p(\mathcal{D} \mid \mathbf{w}_{\text{MAP}})}_{\text{fit quality}} + \underbrace{M \ln\!\left(\frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}}\right)}_{\text{complexity penalty}}.$$
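The intermediate steps can be spelled out for a single parameter $w$. This is a sketch under the stated width approximations, not a rigorous derivation. A uniform prior of width $\Delta w_{\text{prior}}$ has density $1/\Delta w_{\text{prior}}$, and a sharply peaked posterior of width $\Delta w_{\text{post}}$ has peak height of roughly $1/\Delta w_{\text{post}}$ (for a Gaussian posterior with standard deviation $\sigma$ this is exact if one defines $\Delta w_{\text{post}} = \sqrt{2\pi}\,\sigma$). Evaluating Bayes' theorem at $w_{\text{MAP}}$:

$$p(\mathcal{D} \mid \mathcal{M}) = \frac{p(\mathcal{D} \mid w_{\text{MAP}})\, p(w_{\text{MAP}})}{p(w_{\text{MAP}} \mid \mathcal{D})} \approx p(\mathcal{D} \mid w_{\text{MAP}})\, \frac{1/\Delta w_{\text{prior}}}{1/\Delta w_{\text{post}}} = p(\mathcal{D} \mid w_{\text{MAP}})\, \frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}}.$$

Taking logarithms:

$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \ln p(\mathcal{D} \mid w_{\text{MAP}}) + \ln\!\left(\frac{\Delta w_{\text{post}}}{\Delta w_{\text{prior}}}\right).$$

For $M$ parameters, assuming each contributes the same ratio $\Delta w_{\text{post}}/\Delta w_{\text{prior}}$, the second term is counted $M$ times, which yields the factor $M$ in the complexity penalty.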

Complexity Penalty

Since the posterior is narrower than the prior ($\Delta w_{\text{post}} < \Delta w_{\text{prior}}$), the ratio is less than 1 and its logarithm is negative, so the second term lowers the log evidence. Its magnitude grows linearly with the number of parameters $M$: more parameters incur a larger penalty. The model evidence therefore balances two competing forces:

Maximizing data fit (first term) pushes toward more complex models.
The complexity penalty (second term) pushes toward simpler models.
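The approximation and the size of the penalty can be checked numerically in one dimension. The following is a minimal sketch with made-up numbers: the peak likelihood `L_max`, its location `w_map`, its width `sigma`, and the prior range are all hypothetical. It compares the evidence obtained by direct numerical integration against the fit-plus-penalty formula, taking $\Delta w_{\text{post}} = \sqrt{2\pi}\,\sigma$ for a Gaussian-shaped likelihood.

```python
import numpy as np

# Hypothetical 1-D setup: as a function of w, the likelihood is Gaussian-shaped
# with peak value L_max at w_map and width sigma.
L_max, w_map, sigma = 0.3, 0.4, 0.05
prior_lo, prior_hi = -5.0, 5.0
width_prior = prior_hi - prior_lo          # uniform prior density is 1 / width_prior

def likelihood(w):
    return L_max * np.exp(-0.5 * ((w - w_map) / sigma) ** 2)

# "Exact" evidence: integrate likelihood * prior density on a fine grid.
w = np.linspace(prior_lo, prior_hi, 200_001)
dw = w[1] - w[0]
evidence_exact = np.sum(likelihood(w)) * dw / width_prior

# Approximation: fit term plus complexity penalty (M = 1).
width_post = np.sqrt(2.0 * np.pi) * sigma
log_evidence_approx = np.log(L_max) + np.log(width_post / width_prior)

print("ln evidence (numerical):", np.log(evidence_exact))
print("ln evidence (approx):   ", log_evidence_approx)

# With the same width ratio per parameter, the penalty scales linearly in M.
penalties = {M: M * np.log(width_post / width_prior) for M in (1, 5, 20)}
for M, penalty in penalties.items():
    print(f"M = {M:2d}: complexity penalty = {penalty:.2f}")
```

With a flat prior the posterior peak coincides with the likelihood peak, so `w_map` here is also the maximum-likelihood value. The two log-evidence values agree closely because the likelihood is narrow compared with the prior range.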

3. Why Medium Complexity Wins

Thought Experiment: Simple, Medium, and Complex Models

Consider three models of increasing complexity. Because $p(\mathcal{D} \mid \mathcal{M})$ is a normalized distribution over all possible datasets, each model has a fixed total probability mass of 1 to distribute. A simple model $\mathcal{M}_1$ can generate only a narrow range of datasets; within that range, each dataset has relatively high probability. A complex model $\mathcal{M}_3$ can generate many different datasets, but the probability of any particular dataset is spread thin across all those possibilities.

For an observed dataset $\mathcal{D}$:

$\mathcal{M}_1$ has near-zero evidence if $\mathcal{D}$ lies outside its representable range.
$\mathcal{M}_3$ assigns low probability to $\mathcal{D}$ because it spreads probability mass over too many alternatives.
The model $\mathcal{M}_2$ that is just complex enough to represent $\mathcal{D}$ achieves the highest evidence — a built-in form of Occam's Razor.
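The thought experiment can be mimicked with a deliberately crude toy. In this sketch, datasets are labelled by integers and each model spreads its probability uniformly over the datasets it can generate. Both the uniform spread and the specific ranges are assumptions chosen only to make the normalization argument visible.

```python
# Hypothetical models: each can generate the datasets in its range and
# assigns them equal probability, so the probabilities sum to 1 per model.
models = {
    "M1 (simple)": range(0, 10),
    "M2 (medium)": range(0, 100),
    "M3 (complex)": range(0, 10_000),
}

observed = 42  # label of the observed dataset (illustrative)

def evidence(support, dataset):
    """p(D | M) for a model that is uniform over its representable datasets."""
    return 1.0 / len(support) if dataset in support else 0.0

evidences = {name: evidence(support, observed) for name, support in models.items()}
best = max(evidences, key=evidences.get)

for name, value in evidences.items():
    print(f"{name}: p(D | M) = {value}")
print("highest evidence:", best)
```

Here the simple model cannot produce the observed dataset at all, the complex model can but assigns it a very small share of its mass, and the medium model, the smallest one whose range contains the observation, receives the highest evidence.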