Lecture 13.1

Model Combination vs. Bayesian Model Averaging

Predictive performance can often be improved by combining multiple models into a committee. This lecture introduces the ensemble idea, contrasts hard versus soft model selection, and clarifies the fundamental difference between Bayesian model averaging and model combination methods.

Learning Objectives

Explain what a committee of models is and why combining models can reduce error.
Distinguish averaging committees (parallel training) from boosting (sequential training).
Describe hard model selection (decision trees) and soft model selection (mixtures of experts).
State the key conceptual difference between Bayesian model averaging and general model combination methods.

1. Why Combine Models?

Training a single model produces one set of predictions. Training many models and combining their predictions, forming a committee, can achieve lower error than any individual member. The effect shows up repeatedly in machine learning competitions, where top-ranked solutions very often rely on some form of ensemble.

We already saw a version of this in the bias-variance analysis (Lecture 4.2): averaging many flexible, low-bias regression models trained on different datasets reduces variance, while the bias stays as low as that of the individual models. Ensemble methods make this practical when only one dataset is available.
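One way to make the averaging idea concrete with a single dataset is to resample it with replacement (the bootstrap) and fit one model per resample. The sketch below is illustrative only: the cubic polynomial model, the noisy sine data, and the function name `bagged_predictions` are assumptions, not part of the lecture.

```python
import numpy as np

def bagged_predictions(x, t, x_test, n_models=20, degree=3, seed=0):
    """Fit one polynomial per bootstrap resample; return an (M, N_test) array."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        # Draw N indices with replacement: one bootstrap dataset.
        idx = rng.integers(0, len(x), size=len(x))
        coeffs = np.polyfit(x[idx], t[idx], degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.array(preds)

# Toy data (assumed): noisy samples of a sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 40)
x_test = np.linspace(0.05, 0.95, 50)
t_true = np.sin(2 * np.pi * x_test)

preds = bagged_predictions(x, t, x_test)

# Squared error of each member versus squared error of the averaged committee.
member_mse = np.mean((preds - t_true) ** 2, axis=1)
committee_mse = np.mean((preds.mean(axis=0) - t_true) ** 2)
print("mean member MSE:", member_mse.mean())
print("committee MSE:  ", committee_mse)
```

Because squared error is convex, the committee's error can never exceed the average error of its members. How much it improves depends on how correlated the members' errors are.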

Committee of Models

A committee consists of $M$ individual predictors $\{y_m(\mathbf{x})\}_{m=1}^M$ whose outputs are combined to form a final prediction. Two strategies for combination:

Averaging (parallel): train each model independently, then average outputs. Bootstrap aggregation (bagging) falls here.
Boosting (sequential): train models one at a time; each new model targets the errors of its predecessors.
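The sequential strategy can be sketched by fitting each new model to the residuals left by the models before it. This is one simple instance of boosting for squared error, not AdaBoost or any specific algorithm from the lecture. The regression stumps, the learning rate `lr`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-threshold regression stump for residuals r on 1-D inputs x."""
    best = None
    for s in np.unique(x)[:-1]:  # dropping the max keeps both sides non-empty
        left = x <= s
        c_left, c_right = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - c_left) ** 2) + np.sum((r[~left] - c_right) ** 2)
        if best is None or sse < best[0]:
            best = (sse, s, c_left, c_right)
    _, s, c_left, c_right = best
    return lambda z: np.where(z <= s, c_left, c_right)

def boost(x, t, n_rounds=10, lr=0.5):
    """Train stumps one at a time, each on the current residuals."""
    stumps, history = [], []
    pred = np.zeros_like(t)
    for _ in range(n_rounds):
        stump = fit_stump(x, t - pred)      # target the errors made so far
        stumps.append(stump)
        pred = pred + lr * stump(x)
        history.append(np.mean((t - pred) ** 2))
    return stumps, history

def predict_boosted(stumps, z, lr=0.5):
    """Sum the scaled stump outputs; lr must match the value used in training."""
    out = np.zeros_like(z, dtype=float)
    for stump in stumps:
        out = out + lr * stump(z)
    return out

# Toy data (assumed): a noiseless sine curve.
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x)
stumps, history = boost(x, t)
print("training MSE per round:", np.round(history, 4))
```

Unlike the averaging committee, the members here cannot be trained in parallel: each stump depends on the residuals of all earlier ones.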

2. Hard vs. Soft Model Selection

Rather than averaging all committee members uniformly, we can select or weight them based on the input $\mathbf{x}$.

Hard Selection: Decision Trees

A decision tree routes each input through a sequence of binary decisions to a single "expert" leaf node. Each internal node asks a simple threshold question (e.g., is $x_1 > 4.5$?); the final leaf makes the prediction. Only one expert is consulted per input — hence hard selection.
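The routing can be sketched with a small hand-built tree. The `Node` class, the thresholds other than 4.5, and the leaf values are hypothetical. Feature index 0 plays the role of $x_1$.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the input dimension tested
    threshold: Optional[float] = None
    left: Optional["Node"] = None      # taken when x[feature] <= threshold
    right: Optional["Node"] = None     # taken when x[feature] > threshold
    value: Optional[float] = None      # set only on leaves

def predict(node, x):
    """Follow threshold decisions to a leaf; return its value and the path taken."""
    path = []
    while node.value is None:
        go_right = x[node.feature] > node.threshold
        path.append((node.feature, node.threshold, bool(go_right)))
        node = node.right if go_right else node.left
    return node.value, path

# Hypothetical tree: the root asks "is x_1 > 4.5?", the right child asks "is x_2 > 2.0?".
tree = Node(
    feature=0, threshold=4.5,
    left=Node(value=1.0),
    right=Node(feature=1, threshold=2.0,
               left=Node(value=3.0),
               right=Node(value=7.0)),
)

value, path = predict(tree, [5.0, 1.0])
print(value, path)
```

Exactly one leaf is reached for any input, so the prediction changes discontinuously as $\mathbf{x}$ crosses a threshold.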

Soft Selection: Mixtures of Experts

A mixture of experts computes a weighted combination of $K$ specialist models:

$$p(t \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, p(t \mid \mathbf{x}, \text{expert}_k),$$

where $\pi_k(\mathbf{x}) \geq 0$ and $\sum_k \pi_k(\mathbf{x}) = 1$ are input-dependent mixing coefficients, often called gating functions. They generalize the fixed mixing coefficients of a Gaussian mixture model by letting them vary with $\mathbf{x}$. All experts contribute softly to every prediction.
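A minimal sketch of the predictive mean under this formula, for scalar inputs. It assumes a softmax gate with linear logits and experts whose conditional means are linear in $x$. Neither choice is specified in the lecture, and all parameter names and values are hypothetical.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def moe_predict_mean(x, gate_w, gate_b, expert_w, expert_b):
    """Return gates pi_k(x), shape (N, K), and the mixture's predictive mean, shape (N,)."""
    gates = softmax(np.outer(x, gate_w) + gate_b)      # pi_k(x) for each input
    expert_means = np.outer(x, expert_w) + expert_b    # mean of p(t | x, expert_k)
    return gates, np.sum(gates * expert_means, axis=1)

# Two hypothetical experts: mean +x and mean -x. The gate favours the first for x > 0.
gate_w, gate_b = np.array([4.0, -4.0]), np.zeros(2)
expert_w, expert_b = np.array([1.0, -1.0]), np.zeros(2)

x = np.array([-2.0, 0.0, 2.0])
gates, mean = moe_predict_mean(x, gate_w, gate_b, expert_w, expert_b)
print(gates)
print(mean)
```

With these values the combined mean behaves roughly like $|x|$, a function neither linear expert could represent alone. Near $x = 0$ both experts receive substantial weight, which is what distinguishes soft from hard selection.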

3. Bayesian Model Averaging vs. Model Combination

Bayesian model averaging (BMA) looks superficially like model combination — both marginalize over a set of models — but they rest on different generative assumptions.

The Key Distinction

Bayesian model averaging assumes the entire dataset $\mathcal{D}$ was generated by a single model $h$; we are merely uncertain which one:

$$p(\mathcal{D}) = \sum_h p(\mathcal{D} \mid h)\, p(h).$$

As data accumulate, the posterior $p(h \mid \mathcal{D})$ sharpens onto one model. BMA is a way of expressing uncertainty about the true model.
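The sharpening of the posterior can be illustrated with a toy example. The sketch assumes three candidate models, each a coin with a fixed, known heads probability, and a deterministic data pattern with 7 heads in every 10 tosses. The candidates, the pattern, and the function name `model_posterior` are all hypothetical.

```python
import numpy as np

def model_posterior(data, thetas, prior):
    """Posterior p(h | D) over coin models with fixed heads probabilities."""
    heads = int(np.sum(data))
    tails = len(data) - heads
    # Work in log space so long datasets do not underflow.
    log_post = np.log(prior) + heads * np.log(thetas) + tails * np.log(1.0 - thetas)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

thetas = np.array([0.3, 0.5, 0.7])     # one candidate model h per value
prior = np.full(3, 1.0 / 3.0)          # uniform prior p(h)
pattern = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])  # 7 heads, 3 tails

for n_repeats in (1, 10, 100):
    data = np.tile(pattern, n_repeats)
    print(len(data), np.round(model_posterior(data, thetas, prior), 4))

p_small = model_posterior(pattern, thetas, prior)
p_large = model_posterior(np.tile(pattern, 10), thetas, prior)
```

With 10 tosses the posterior still spreads noticeable mass over the candidates. With 100 it is almost entirely on the 0.7 coin, and any BMA prediction is then close to that single model's prediction.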

Model combination (e.g., Gaussian mixture models) assumes different data points may have been generated by different processes. Each point $\mathbf{x}_n$ has its own latent variable $z_n$:

$$p(\mathbf{x}_n) = \sum_{z_n} p(\mathbf{x}_n \mid z_n)\, p(z_n).$$

More data does not collapse the mixture onto a single component; heterogeneity in the data is genuine, not a reflection of our uncertainty about the model.
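Writing both likelihoods for a whole dataset $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ makes the contrast concrete. As a sketch, assume the points are i.i.d. given the model and that each $h$ is fully specified, with no parameters left to integrate out (the lecture states neither assumption):

$$p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \sum_h p(h) \prod_{n=1}^{N} p(\mathbf{x}_n \mid h) \qquad \text{(BMA: one } h \text{ shared by all points)},$$

$$p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} \sum_{z_n} p(\mathbf{x}_n \mid z_n)\, p(z_n) \qquad \text{(mixture: one } z_n \text{ per point)}.$$

The sum over models sits outside the product in the first expression and inside it in the second. This is why evidence accumulates for a single $h$ under BMA but never forces all points onto a single component in a mixture.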

Intuition

In BMA, two data points $\mathbf{x}$ and $\mathbf{x}'$ are both explained by whichever single model $h^*$ the posterior eventually selects. In a mixture model, $\mathbf{x}$ might come from component $z = 1$ (e.g., one cluster) and $\mathbf{x}'$ from $z = 2$ (another cluster) — the two points genuinely arise from different processes.
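The mixture side of this intuition can be checked numerically by computing, for each point separately, the posterior over its own latent variable. The sketch assumes a one-dimensional Gaussian mixture with known parameters. The component means, weights, and the two test points are hypothetical.

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    """Posterior p(z_n = k | x_n) for a 1-D Gaussian mixture with known parameters."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (N, 1)
    log_pdf = (-0.5 * ((x - means) / stds) ** 2
               - np.log(stds) - 0.5 * np.log(2.0 * np.pi))
    log_joint = np.log(weights) + log_pdf              # log p(x_n, z_n = k)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exp
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)

# Two well-separated clusters (assumed parameters).
weights = np.array([0.5, 0.5])
means = np.array([-3.0, 3.0])
stds = np.array([1.0, 1.0])

# x sits near the first cluster, x' near the second.
r = responsibilities([-3.2, 2.9], weights, means, stds)
print(np.round(r, 4))
```

Each row is a separate posterior: the first point is attributed almost entirely to the first component and the second point to the other. Adding more points adds more rows rather than shifting every row toward one shared component.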