Lecture 13.2
Bagging & Feature Bagging
Averaging many complex models trained on different datasets reduces prediction variance without increasing bias. Bootstrap aggregation manufactures dataset variety from a single observed dataset; the random subspace method decorrelates learners further by randomly selecting features at training time.
- Explain why averaging low-bias models reduces variance but averaging high-bias models does not help much.
- Describe the bootstrap procedure (sampling with replacement) and how it generates diverse training sets.
- Show that the committee error equals $\frac{1}{B}$ times the average individual model error when model errors are zero-mean and uncorrelated.
- Describe the random subspace method (feature bagging) and explain why it further reduces inter-model correlation.
1. Averaging Reduces Variance
Recall (Lecture 4.2) that expected prediction error decomposes as bias² + variance + noise. Complex models have low bias (they can represent the true function) but high variance (they change substantially with different training sets). If we could train $L$ models on $L$ independent datasets and average them, the average model would inherit the low bias of the individual models while its variance collapses:
$$\bar{y}(\mathbf{x}) = \frac{1}{L}\sum_{\ell=1}^{L} y_\ell(\mathbf{x}).$$
This only helps when the individual models already have low bias. Averaging high-bias models merely gives a high-bias average.
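The variance claim can be illustrated with a small simulation. The sketch below is not part of the lecture: it assumes a sinusoidal target `h`, Gaussian noise, and degree-5 polynomial fits via `np.polyfit`, and all names and constants are hypothetical choices made for illustration. It compares the spread of a single model's prediction at one point with the spread of an average of $L$ models, each trained on its own independent dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Assumed true target function (an illustrative choice).
    return np.sin(2 * np.pi * x)

def fit_and_predict(x0, n_points=30, degree=5, noise=0.3):
    # Draw a fresh, independent training set and fit a flexible polynomial.
    x = rng.uniform(0.0, 1.0, n_points)
    t = h(x) + rng.normal(0.0, noise, n_points)
    coeffs = np.polyfit(x, t, degree)
    return np.polyval(coeffs, x0)

x0 = 0.5      # query point
L = 10        # number of models averaged
trials = 300  # repetitions used to estimate the variance

# Prediction of one model per trial.
single = np.array([fit_and_predict(x0) for _ in range(trials)])

# Average of L models per trial, each trained on its own dataset.
averaged = np.array(
    [np.mean([fit_and_predict(x0) for _ in range(L)]) for _ in range(trials)]
)

var_single = single.var()
var_averaged = averaged.var()
print("variance of a single model:", var_single)
print("variance of the L-model average:", var_averaged)
```

With independent datasets the second variance should come out at roughly $1/L$ of the first, while the mean prediction (and hence the bias) stays about the same.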
We usually have exactly one training dataset. Training many models on disjoint subsets wastes data and produces even poorer individual models. The bootstrap solves this by artificially generating variability.
2. Bootstrap Aggregation (Bagging)
Given an original dataset of $N$ points, generate $B$ new datasets $\{\mathcal{D}_1, \dots, \mathcal{D}_B\}$ by sampling $N$ points with replacement from the original. Each $\mathcal{D}_b$ is the same size as the original but differs: some points appear multiple times; others do not appear at all. Train one model $y_b$ on each $\mathcal{D}_b$. The committee prediction is
$$y_{\text{com}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B} y_b(\mathbf{x}).$$
This is called bootstrap aggregation or bagging.
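The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the base learner (a degree-5 polynomial fitted with `np.polyfit`), the synthetic sinusoidal data, and the helper names `bootstrap_indices`, `bagged_polyfit`, and `committee_predict` are all assumptions introduced here.

```python
import numpy as np

def bootstrap_indices(n, rng):
    # Sample n indices from {0, ..., n-1} with replacement.
    return rng.integers(0, n, size=n)

def bagged_polyfit(x, t, n_models, degree, rng):
    # Train one polynomial per bootstrap dataset D_b.
    models = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(x), rng)
        models.append(np.polyfit(x[idx], t[idx], degree))
    return models

def committee_predict(models, x_new):
    # y_com(x) = (1/B) * sum_b y_b(x)
    preds = np.array([np.polyval(c, x_new) for c in models])
    return preds.mean(axis=0)

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)  # assumed synthetic data

models = bagged_polyfit(x, t, n_models=25, degree=5, rng=rng)
x_new = np.linspace(0.1, 0.9, 5)
y_com = committee_predict(models, x_new)

# Fraction of distinct original points that appear in one bootstrap dataset.
idx = bootstrap_indices(N, rng)
unique_fraction = len(np.unique(idx)) / N
print("committee predictions:", y_com)
print("distinct points in one bootstrap sample:", unique_fraction)
```

The expected fraction of distinct points in a bootstrap sample is $1 - (1 - 1/N)^N$, which tends to $1 - 1/e \approx 0.632$ for large $N$. The remaining points are the ones that "do not appear at all" in that dataset.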
3. Error Reduction: Derivation
Let $\epsilon_b(\mathbf{x}) = y_b(\mathbf{x}) - h(\mathbf{x})$ be the error of model $b$ at $\mathbf{x}$, where $h(\mathbf{x})$ is the true target function. The average error of an individual model and the committee error (both averaged over $\mathbf{x}$) are
$$E_{\text{av}} = \frac{1}{B}\sum_{b=1}^{B}\mathbb{E}_{\mathbf{x}}[\epsilon_b^2], \qquad E_{\text{com}} = \mathbb{E}_{\mathbf{x}}\!\left[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_b\right)^{\!2}\right].$$
Assume (i) $\mathbb{E}_{\mathbf{x}}[\epsilon_b] = 0$ for each model (zero mean error) and (ii) errors across models are uncorrelated: $\mathbb{E}_{\mathbf{x}}[\epsilon_b \epsilon_{b'}] = 0$ for $b \neq b'$. Then
$$E_{\text{com}} = \mathbb{E}_{\mathbf{x}}\!\left[\frac{1}{B^2}\sum_{b,b'}\epsilon_b\epsilon_{b'}\right] = \frac{1}{B^2}\sum_b \mathbb{E}_{\mathbf{x}}[\epsilon_b^2] = \frac{1}{B}\,E_{\text{av}}.$$
The cross terms with $b \neq b'$ vanish by assumption (ii), leaving only the $B$ diagonal terms. Under the uncorrelated-error assumption:
$$E_{\text{com}} = \frac{1}{B}\,E_{\text{av}}.$$
In general (even when errors are correlated), it can be shown that $E_{\text{com}} \leq E_{\text{av}}$ always holds: the committee is never worse than the average individual model, although it can still be worse than the best one. The gain is maximized when individual model errors are uncorrelated.
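Both statements can be checked numerically. The sketch below uses synthetic Gaussian errors rather than errors of real trained models. The shared-component construction for correlated errors and the names `committee_stats` and `rho` are assumptions made for illustration only.

```python
import numpy as np

def committee_stats(eps):
    """eps has shape (B, n_x): row b holds model b's errors at sampled x."""
    e_av = np.mean(eps ** 2)                # average individual squared error
    e_com = np.mean(eps.mean(axis=0) ** 2)  # squared error of the averaged model
    return e_av, e_com

rng = np.random.default_rng(0)
B, n_x = 10, 200_000

# Case 1: zero-mean, uncorrelated errors, matching assumptions (i) and (ii).
eps_indep = rng.normal(0.0, 1.0, size=(B, n_x))
e_av_indep, e_com_indep = committee_stats(eps_indep)

# Case 2: every model's error contains a common component, so any two
# models have error correlation rho (unit variance is preserved).
rho = 0.5
shared = rng.normal(0.0, 1.0, size=(1, n_x))
private = rng.normal(0.0, 1.0, size=(B, n_x))
eps_corr = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
e_av_corr, e_com_corr = committee_stats(eps_corr)

ratio_indep = e_com_indep / e_av_indep
ratio_corr = e_com_corr / e_av_corr
print("uncorrelated E_com / E_av:", ratio_indep)
print("correlated   E_com / E_av:", ratio_corr)
```

In the uncorrelated case the ratio should sit close to $1/B$. For this particular equicorrelated construction the ratio is $\rho + (1-\rho)/B$ in expectation, which stays below 1 but does not shrink to zero as $B$ grows.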
Bootstrap datasets overlap substantially (the same original points often recur across datasets), so the models' errors are positively correlated rather than uncorrelated. The error reduction is real but falls short of the full factor $1/B$. The bound $E_{\text{com}} \leq E_{\text{av}}$ still holds.
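One way to see why the bound needs no assumption on correlations is sketched below; it is a standard argument supplied here for reference, not taken from the lecture. Write $\bar{\epsilon}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B}\epsilon_b(\mathbf{x})$ for the committee error at $\mathbf{x}$ (a symbol introduced only for this sketch). At every $\mathbf{x}$,
$$\frac{1}{B}\sum_{b=1}^{B}\epsilon_b^2 - \bar{\epsilon}^{\,2} = \frac{1}{B}\sum_{b=1}^{B}\left(\epsilon_b - \bar{\epsilon}\right)^2 \geq 0.$$
Taking $\mathbb{E}_{\mathbf{x}}$ of both sides gives
$$E_{\text{av}} - E_{\text{com}} = \mathbb{E}_{\mathbf{x}}\!\left[\frac{1}{B}\sum_{b=1}^{B}\left(\epsilon_b - \bar{\epsilon}\right)^2\right] \geq 0.$$
The gap between the two errors is the average spread of the individual errors around the committee error, so it is large when the models disagree and vanishes when they all make the same error.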
4. Feature Bagging (Random Subspace Method)
An additional strategy to decorrelate committee members: rather than varying which data points are used, vary which features are used.
At each training run, randomly select a subset of $R < D$ features and train the model on only those features. Repeat $B$ times with different feature subsets. Average the resulting models.
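A minimal sketch of this procedure follows. It assumes a linear least-squares base learner (`np.linalg.lstsq`) and synthetic Gaussian data purely for illustration. The helper names `fit_feature_bagged` and `predict_feature_bagged` are hypothetical, and any other base model could be substituted.

```python
import numpy as np

def fit_feature_bagged(X, t, n_models, n_features, rng):
    # Each member sees only a random subset of R = n_features columns.
    D = X.shape[1]
    members = []
    for _ in range(n_models):
        feats = rng.choice(D, size=n_features, replace=False)
        w, *_ = np.linalg.lstsq(X[:, feats], t, rcond=None)
        members.append((feats, w))
    return members

def predict_feature_bagged(members, X_new):
    # Each member predicts from its own features; the committee averages.
    preds = np.array([X_new[:, feats] @ w for feats, w in members])
    return preds.mean(axis=0)

rng = np.random.default_rng(0)
N, D, R, B = 100, 20, 5, 50  # illustrative sizes with R < D
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
t = X @ w_true + rng.normal(0.0, 0.1, size=N)  # assumed synthetic targets

members = fit_feature_bagged(X, t, n_models=B, n_features=R, rng=rng)
y_com = predict_feature_bagged(members, X)
print("first committee predictions:", y_com[:3])
```

Each member stores its own feature indices so that the same columns are used at prediction time. Restricting a member to $R$ features generally raises its individual bias, so the choice of $R$ trades member accuracy against decorrelation.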
When features are uncorrelated, models trained on different subsets make nearly independent errors, so the committee error approaches $E_{\text{av}}/B$. Feature bagging is especially beneficial when $D \gg N$, where a full-feature model can easily memorize the training set via a dominant feature that does not generalize.
Bootstrapping and feature bagging can be combined: resample both data points and features simultaneously. When applied to decision trees, this combination produces Random Forests (Lecture 13.4).