Lecture 13.1
Model Combination vs. Bayesian Model Averaging
Predictive performance can often be improved by combining multiple models into a committee. This lecture introduces the ensemble idea, contrasts hard versus soft model selection, and clarifies the fundamental difference between Bayesian model averaging and model combination methods.
- Explain what a committee of models is and why combining models can reduce error.
- Distinguish averaging committees (parallel training) from boosting (sequential training).
- Describe hard model selection (decision trees) and soft model selection (mixtures of experts).
- State the key conceptual difference between Bayesian model averaging and general model combination methods.
1. Why Combine Models?
Training a single model produces one set of predictions. Training several models and combining their predictions, that is, forming a committee, often gives lower error than any individual member achieves on its own. The same pattern shows up in machine learning competitions, where top-ranked solutions very often rely on some form of ensemble.
We already saw a version of this in the bias-variance analysis (Lecture 4.2): averaging many regression models trained on different datasets reduces variance while preserving low bias. Ensemble methods make this practical when only one dataset is available.
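The variance argument can be made concrete with a short sketch for sum-of-squares error. The notation is introduced here for illustration only: $h(\mathbf{x})$ is the true function, $M$ is the number of models, and $\epsilon_m(\mathbf{x})$ is the error of model $m$, so that $y_m(\mathbf{x}) = h(\mathbf{x}) + \epsilon_m(\mathbf{x})$. The average error of the individual models and the error of their average are
$$E_{\mathrm{AV}} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_{\mathbf{x}}\!\left[\epsilon_m(\mathbf{x})^2\right], \qquad E_{\mathrm{COM}} = \mathbb{E}_{\mathbf{x}}\!\left[\left(\frac{1}{M}\sum_{m=1}^{M} \epsilon_m(\mathbf{x})\right)^{2}\right].$$
Under the assumption that the errors have zero mean and are uncorrelated, the cross terms vanish and
$$E_{\mathrm{COM}} = \frac{1}{M^2}\sum_{m=1}^{M} \mathbb{E}_{\mathbf{x}}\!\left[\epsilon_m(\mathbf{x})^2\right] = \frac{1}{M}\, E_{\mathrm{AV}}.$$
That assumption rarely holds in practice: models trained on overlapping data tend to make correlated errors, so the reduction is usually much smaller than a factor of $M$. What holds without it is the weaker statement $E_{\mathrm{COM}} \leq E_{\mathrm{AV}}$, which follows from the convexity of the square.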
A committee consists of $M$ individual predictors $\{y_m(\mathbf{x})\}_{m=1}^M$ whose outputs are combined to form a final prediction. There are two main strategies for combining them:
- Averaging (parallel): train each model independently, then average outputs. Bootstrap aggregation (bagging) falls here.
- Boosting (sequential): train models one at a time; each new model targets the errors of its predecessors.
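The averaging strategy can be tried on a toy problem. The following is a minimal sketch of bagging, not a reference implementation: the noisy sine data, the degree-5 polynomial members, and the helper names `bagged_polynomials` and `member_predictions` are all assumptions made for illustration.

```python
import numpy as np

def bagged_polynomials(x, t, n_models=20, degree=5, seed=0):
    """Fit n_models polynomials, each on a bootstrap resample of (x, t)."""
    rng = np.random.default_rng(seed)
    coeffs = []
    for _ in range(n_models):
        # Bootstrap: draw len(x) indices with replacement.
        idx = rng.integers(0, len(x), size=len(x))
        coeffs.append(np.polyfit(x[idx], t[idx], degree))
    return coeffs

def member_predictions(coeffs, x):
    """Return an (M, N) array whose row m holds y_m(x) for every input."""
    return np.stack([np.polyval(c, x) for c in coeffs])

# Hypothetical data: a noisy sine curve observed at 40 random inputs.
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=40)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, size=40)
x_test = np.linspace(0.05, 0.95, 200)
t_test = np.sin(2 * np.pi * x_test)  # noise-free targets for evaluation

coeffs = bagged_polynomials(x_train, t_train)
preds = member_predictions(coeffs, x_test)

member_mse = np.mean((preds - t_test) ** 2, axis=1)          # one value per member
committee_mse = np.mean((preds.mean(axis=0) - t_test) ** 2)  # error of the average

print(f"mean member MSE: {member_mse.mean():.4f}")
print(f"best member MSE: {member_mse.min():.4f}")
print(f"committee MSE:   {committee_mse:.4f}")
```

The committee error can never exceed the mean member error for squared loss, but it is not guaranteed to beat the best single member; printing both makes that distinction visible.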
2. Hard vs. Soft Model Selection
Rather than averaging all committee members uniformly, we can select or weight them based on the input $\mathbf{x}$.
A decision tree routes each input through a sequence of binary decisions to a single "expert" leaf node. Each internal node asks a simple threshold question (e.g., is $x_1 > 4.5$?); the final leaf makes the prediction. Only one expert is consulted per input — hence hard selection.
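The routing just described can be sketched in a few lines. The tree below is invented for illustration: only the $x_1 > 4.5$ question comes from the text, while the second threshold, the leaf values, and the `Node` and `predict` names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    """Internal nodes set feature/threshold/left/right; leaves set only value."""
    feature: Optional[int] = None      # 0-based index, so feature 0 is x_1
    threshold: Optional[float] = None
    left: Optional["Node"] = None      # followed when x[feature] <= threshold
    right: Optional["Node"] = None     # followed when x[feature] > threshold
    value: Optional[float] = None      # the leaf "expert's" prediction

def predict(node: Node, x: Sequence[float]) -> float:
    """Route x from the root to exactly one leaf and return that leaf's value."""
    while node.value is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.value

# Hypothetical tree: first ask "is x_1 > 4.5?", then "is x_2 > 2.0?".
tree = Node(
    feature=0, threshold=4.5,
    left=Node(value=1.0),
    right=Node(feature=1, threshold=2.0,
               left=Node(value=3.0),
               right=Node(value=7.0)),
)

for x in ([3.0, 9.0], [5.0, 1.0], [5.0, 2.5]):
    print(x, "->", predict(tree, x))
```

Each input reaches exactly one leaf, so the other leaves have no influence on its prediction. That is the sense in which the selection is hard.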
A mixture of experts computes a weighted combination of $K$ specialist models:
$$p(t \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, p(t \mid \mathbf{x}, \text{expert}_k),$$
where $\pi_k(\mathbf{x}) \geq 0$ and $\sum_k \pi_k(\mathbf{x}) = 1$ are input-dependent mixing coefficients, often called gating functions. They play the role of the mixing coefficients in a Gaussian mixture model, except that they vary with $\mathbf{x}$. Every expert contributes to every prediction, weighted by its gate; hence soft selection.
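A small numerical sketch may help make the weighted sum concrete. The text does not specify the form of the gates or the experts, so this example assumes a softmax over linear scores for $\pi_k(\mathbf{x})$ and Gaussian experts with linear means; all parameter values are chosen by hand rather than learned.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def gaussian_pdf(t, mean, std):
    """Density of N(t | mean, std^2); broadcasts over array arguments."""
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def moe_density(t, x, gate_w, gate_b, expert_w, expert_b, expert_std):
    """Return (p(t | x), pi(x)) for scalar t and scalar x."""
    pi = softmax(gate_w * x + gate_b)                       # gates pi_k(x)
    expert_density = gaussian_pdf(t, expert_w * x + expert_b, expert_std)
    return float(pi @ expert_density), pi                   # sum_k pi_k * p_k

# Hypothetical parameters for K = 3 experts.
gate_w = np.array([-2.0, 0.0, 2.0])
gate_b = np.zeros(3)
expert_w = np.array([1.0, 0.0, -1.0])
expert_b = np.array([0.0, 1.0, 0.0])
expert_std = np.array([0.5, 0.5, 0.5])

for x in (-3.0, 0.0, 3.0):
    density, pi = moe_density(0.0, x, gate_w, gate_b, expert_w, expert_b, expert_std)
    print(f"x = {x:+.1f}  gates = {np.round(pi, 3)}  p(t=0 | x) = {density:.4f}")
```

With these parameters the gates favour a different expert at each end of the input range and weight all three equally at $x = 0$, yet no expert's weight is ever exactly zero.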
3. Bayesian Model Averaging vs. Model Combination
Bayesian model averaging (BMA) looks superficially like model combination — both marginalize over a set of models — but they rest on different generative assumptions.
Bayesian model averaging assumes the entire dataset was generated by a single model $h$; we are merely uncertain which one:
$$p(\mathcal{D}) = \sum_h p(\mathcal{D} \mid h)\, p(h),$$
where $\mathcal{D}$ denotes the whole dataset. As data accumulate, the posterior $p(h \mid \mathcal{D})$ concentrates increasingly on one model. BMA is therefore a way of expressing uncertainty about which model is the true one.
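The sharpening of $p(h \mid \mathcal{D})$ can be illustrated numerically. The sketch below assumes three candidate models, each a unit-variance Gaussian with a fixed mean, a uniform prior over them, and data drawn from the middle candidate. These choices, and the function names, are illustrative assumptions.

```python
import numpy as np

def log_gaussian(x, mean, std=1.0):
    """Log density of N(x | mean, std^2), elementwise over x."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def model_posterior(data, means, prior):
    """p(h | D) over candidate models, assuming i.i.d. data under each model."""
    log_lik = np.array([log_gaussian(data, m).sum() for m in means])
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical candidates: N(0, 1), N(0.5, 1), N(1, 1), with a uniform prior.
means = np.array([0.0, 0.5, 1.0])
prior = np.full(3, 1.0 / 3.0)

rng = np.random.default_rng(0)
data = rng.normal(0.5, 1.0, size=1000)  # all points come from the second model

posteriors = {n: model_posterior(data[:n], means, prior) for n in (1, 10, 100, 1000)}
for n, post in posteriors.items():
    print(f"N = {n:4d}  p(h | D) = {np.round(post, 3)}")
```

With few points the posterior stays spread across the candidates; with many it puts nearly all its mass on one. If the true process were not among the candidates, the posterior would still concentrate, on whichever candidate fits best.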
Model combination (e.g., Gaussian mixture models) assumes different data points may have been generated by different processes. Each point $\mathbf{x}_n$ has its own latent variable $z_n$:
$$p(\mathbf{x}_n) = \sum_{z_n} p(\mathbf{x}_n \mid z_n)\, p(z_n).$$
More data does not collapse the mixture onto a single component; heterogeneity in the data is genuine, not a reflection of our uncertainty about the model.
In BMA, two data points $\mathbf{x}$ and $\mathbf{x}'$ are both explained by whichever single model $h^*$ the posterior eventually selects. In a mixture model, $\mathbf{x}$ might come from component $z = 1$ (e.g., one cluster) and $\mathbf{x}'$ from $z = 2$ (another cluster) — the two points genuinely arise from different processes.
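The mixture side of this contrast can be sketched the same way. The example assumes a one-dimensional mixture of two Gaussians with known, hand-picked parameters. It samples a latent label $z_n$ for every point and then computes the posterior over components for two specific points. Components are indexed from 0 in the code, so index 0 corresponds to $z = 1$ in the text.

```python
import numpy as np

def sample_mixture(n, weights, means, stds, rng):
    """Draw a latent label z_n per point, then x_n ~ N(means[z_n], stds[z_n]^2)."""
    z = rng.choice(len(weights), size=n, p=weights)
    x = rng.normal(means[z], stds[z])
    return x, z

def component_posterior(x, weights, means, stds):
    """p(z_n = k | x_n) for each point; each row sums to one."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    dens = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return dens / dens.sum(axis=1, keepdims=True)

# Hypothetical mixture: two equally weighted unit-variance clusters.
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
stds = np.array([1.0, 1.0])

rng = np.random.default_rng(0)
fractions = {}
for n in (100, 10_000):
    _, z = sample_mixture(n, weights, means, stds, rng)
    fractions[n] = float(np.mean(z == 0))   # share of points from component 0
    print(f"N = {n:6d}  fraction from component 0 = {fractions[n]:.3f}")

# Two points, one near each cluster centre.
post = component_posterior([-2.0, 2.0], weights, means, stds)
print(np.round(post, 4))
```

However many points are drawn, about half come from each component, and the two test points are attributed to different components. Nothing here collapses onto a single explanation as the dataset grows.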