Lecture 5.3

Model Evidence Approximation & Empirical Bayes

The evidence approximation (empirical Bayes): optimizing hyperparameters by maximizing the marginal likelihood.

Learning Objectives
  • Identify why full Bayesian integration over hyperparameters is intractable.
  • State the evidence approximation and explain what it assumes about the hyperparameter posterior.
  • Connect hyperparameter optimization to maximizing the marginal likelihood.
  • Interpret a model-evidence vs. polynomial-order plot to automatically select model complexity.

1. The Full Bayesian Treatment and Why It Is Intractable

A fully Bayesian predictive distribution marginalizes over everything: model weights $\mathbf{w}$, noise precision $\beta$, and prior precision $\alpha$:

$$p(t' \mid x', \mathcal{D}) = \iiint p(t' \mid x', \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta)\, p(\alpha, \beta \mid \mathcal{D})\, d\mathbf{w}\, d\alpha\, d\beta.$$

The inner integral over $\mathbf{w}$ is tractable in the Gaussian case (Lecture 4.5). The outer integrals over $\alpha$ and $\beta$ are not: they generally have no closed form, and the hyperparameter posterior $p(\alpha, \beta \mid \mathcal{D})$ itself requires priors on the hyperparameters, which in turn may require hyperpriors, and so on. In practice this hierarchy is truncated at some level.
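To make the nesting explicit, the $\mathbf{w}$ integral can be carried out first, leaving a two-dimensional integral over the hyperparameters. The hyperparameter prior $p(\alpha, \beta)$ below is not specified in this lecture; it is written only to show where it would enter.

$$p(t' \mid x', \mathcal{D}) = \iint p(t' \mid x', \mathcal{D}, \alpha, \beta)\, p(\alpha, \beta \mid \mathcal{D})\, d\alpha\, d\beta, \qquad p(t' \mid x', \mathcal{D}, \alpha, \beta) = \int p(t' \mid x', \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta)\, d\mathbf{w},$$

$$p(\alpha, \beta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \alpha, \beta)\, p(\alpha, \beta)}{p(\mathcal{D})}.$$

The marginal likelihood $p(\mathcal{D} \mid \alpha, \beta)$ plays the role of the likelihood for the hyperparameters. If the prior $p(\alpha, \beta)$ is assumed to be relatively flat, the mode of the hyperparameter posterior coincides with the maximizer of the marginal likelihood.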

2. The Evidence Approximation (Empirical Bayes)

The key approximation is to assume the hyperparameter posterior $p(\alpha, \beta \mid \mathcal{D})$ is sharply peaked at optimal values $\alpha^*$, $\beta^*$. Under this assumption the outer integrals reduce to a point evaluation:

Evidence Approximation

Approximate the full predictive distribution by the Bayesian predictive distribution evaluated at the optimal hyperparameters:

$$p(t' \mid x', \mathcal{D}) \approx p(t' \mid x', \mathcal{D}, \alpha^*, \beta^*),$$

where $\alpha^*$, $\beta^*$ are chosen to maximize the marginal likelihood (model evidence):

$$(\alpha^*, \beta^*) = \arg\max_{\alpha,\beta}\, p(\mathcal{D} \mid \alpha, \beta, \mathcal{M}) = \arg\max_{\alpha,\beta} \int p(\mathcal{D} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}.$$

This procedure, using the data to set hyperparameters by maximizing the marginal likelihood, is called empirical Bayes or the evidence approximation. Iterative algorithms for computing $\alpha^*$, $\beta^*$ in the Gaussian setting are described in Bishop §3.5.2.
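As a minimal sketch of what such an iterative scheme might look like, the code below assumes the standard Gaussian setup: a zero-mean isotropic prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, Gaussian noise with precision $\beta$, and a design matrix `Phi`. In that case the log-evidence has the closed form $\ln p(\mathcal{D} \mid \alpha, \beta) = \tfrac{K}{2}\ln\alpha + \tfrac{N}{2}\ln\beta - E(\mathbf{m}_N) - \tfrac{1}{2}\ln|\mathbf{A}| - \tfrac{N}{2}\ln 2\pi$, where $K$ is the number of basis functions, $\mathbf{A} = \alpha\mathbf{I} + \beta\Phi^\top\Phi$, and $\mathbf{m}_N$ is the posterior mean. The function names, the toy data, the starting values, and the stopping rule are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def posterior_mean(Phi, t, alpha, beta):
    """Posterior mean m_N and precision A of w for fixed alpha, beta."""
    n_basis = Phi.shape[1]
    A = alpha * np.eye(n_basis) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    return m_N, A

def log_evidence(Phi, t, alpha, beta):
    """Closed-form ln p(D | alpha, beta) for Bayesian linear regression."""
    n_obs, n_basis = Phi.shape
    m_N, A = posterior_mean(Phi, t, alpha, beta)
    # Regularized error evaluated at the posterior mean.
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * (m_N @ m_N)
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * n_basis * np.log(alpha) + 0.5 * n_obs * np.log(beta)
            - E - 0.5 * logdet_A - 0.5 * n_obs * np.log(2 * np.pi))

def fit_hyperparameters(Phi, t, alpha=1.0, beta=1.0, max_iter=500, tol=1e-6):
    """Fixed-point re-estimation of alpha and beta (assumed starting values)."""
    n_obs = Phi.shape[0]
    # Eigenvalues of Phi^T Phi; scaling by beta gives those of beta * Phi^T Phi.
    eig = np.clip(np.linalg.eigvalsh(Phi.T @ Phi), 0.0, None)
    for _ in range(max_iter):
        m_N, _ = posterior_mean(Phi, t, alpha, beta)
        lam = beta * eig
        gamma = np.sum(lam / (alpha + lam))   # effective number of parameters
        new_alpha = gamma / (m_N @ m_N)
        new_beta = (n_obs - gamma) / np.sum((t - Phi @ m_N) ** 2)
        done = (abs(new_alpha - alpha) <= tol * alpha
                and abs(new_beta - beta) <= tol * beta)
        alpha, beta = new_alpha, new_beta
        if done:
            break
    return alpha, beta

# Hypothetical toy data: noisy sine observations with a cubic polynomial basis.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
Phi = np.vander(x, 4, increasing=True)   # columns 1, x, x^2, x^3

alpha_hat, beta_hat = fit_hyperparameters(Phi, t)
print("alpha:", alpha_hat, "beta:", beta_hat)
print("log-evidence:", log_evidence(Phi, t, alpha_hat, beta_hat))
```

The update for $\alpha$ and $\beta$ depends on $\mathbf{m}_N$, which in turn depends on $\alpha$ and $\beta$, so the two steps are alternated until the values stop changing. A gradient-based optimizer applied directly to `log_evidence` would be a reasonable alternative.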

Relationship to Cross-Validation

Both cross-validation (Lecture 4.1) and the evidence approximation are methods for selecting hyperparameters. Cross-validation withholds data for a validation set; the evidence approximation uses all data by integrating out $\mathbf{w}$ analytically. The evidence approximation is therefore more data-efficient, but relies on the Gaussian modeling assumptions being correct.

3. Automatic Model Order Selection

Polynomial Basis Functions: Evidence vs. Order

Fit polynomial models of increasing order $M$ to observations from a sine wave. Plotting the log-evidence against $M$ reveals:

  • $M = 0$ (constant): very low evidence, since a constant cannot represent the sine's shape.
  • $M = 1$ (linear): better, but still limited.
  • $M = 2$ (quadratic): evidence drops below $M=1$. The quadratic adds an even-order term, which barely improves the fit to an odd-function target but does increase the complexity penalty.
  • $M = 3$ (cubic): evidence peaks, because the cubic term enables a much better fit to the sine, outweighing the added complexity.
  • $M > 3$: evidence decreases again, because additional terms increase the complexity penalty while improving the fit only slightly.

The model with the highest evidence ($M=3$) is selected automatically, without any held-out data.
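The comparison above can be sketched numerically. The setup here is an assumption made for illustration, not the lecture's experiment: 30 noisy samples of $\sin(2\pi x)$ on $[0, 1]$, noise standard deviation 0.2, $\alpha$ held fixed at $5 \times 10^{-3}$, and $\beta$ set to the true noise precision. Which order wins depends on these choices, so the printed values need not match the plot described above.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Closed-form ln p(D | alpha, beta) for Bayesian linear regression."""
    n_obs, n_basis = Phi.shape
    A = alpha * np.eye(n_basis) + beta * Phi.T @ Phi      # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)            # posterior mean
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * (m_N @ m_N)
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * n_basis * np.log(alpha) + 0.5 * n_obs * np.log(beta)
            - E - 0.5 * logdet_A - 0.5 * n_obs * np.log(2 * np.pi))

# Assumed synthetic data (not the lecture's dataset).
rng = np.random.default_rng(0)
n_obs = 30
noise_std = 0.2
x = np.linspace(0.0, 1.0, n_obs)
t = np.sin(2 * np.pi * x) + noise_std * rng.standard_normal(n_obs)

alpha = 5e-3                  # assumed fixed prior precision
beta = 1.0 / noise_std ** 2   # assumed known noise precision

orders = list(range(0, 9))
log_ev = np.array([
    # A polynomial of order M has M + 1 basis functions: 1, x, ..., x^M.
    log_evidence(np.vander(x, order + 1, increasing=True), t, alpha, beta)
    for order in orders
])
best_order = orders[int(np.argmax(log_ev))]

for order, value in zip(orders, log_ev):
    print(f"M = {order}: log-evidence = {value:.2f}")
print("selected order:", best_order)
```

Every model is scored on the full dataset, with no validation split. Holding $\alpha$ and $\beta$ fixed keeps the sketch short; re-estimating them separately for each order would be closer to the evidence approximation described in Section 2.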