Lecture 13.3

Boosting & AdaBoost

Boosting constructs a strong ensemble by training simple models sequentially, with each new model concentrating on the examples that previous models got wrong. AdaBoost is the canonical boosting algorithm and can be derived as sequential minimization of an exponential loss.

Learning Objectives

Contrast boosting (reduces bias) with bagging (reduces variance).
State the AdaBoost algorithm: data-weight initialization, sequential classifier training, weight update, and final weighted-majority-vote prediction.
Show that the AdaBoost update rules follow from sequentially minimizing an exponential loss function.
List the limitations of AdaBoost.

1. Boosting vs. Bagging

Bagging starts with flexible, low-bias models and reduces their variance by averaging. Boosting takes the opposite direction: it starts with weak learners — simple models that barely outperform random guessing — and reduces bias by combining them sequentially so that each new model corrects the mistakes of its predecessors.

Weak Learner

A weak learner is a classifier whose weighted error rate $\epsilon_m$ (defined in Section 2) satisfies $\epsilon_m < 0.5$, i.e., it does better than chance, possibly only slightly. AdaBoost requires only this weak condition, yet given enough rounds it can produce a committee that achieves near-zero training error.

2. The AdaBoost Algorithm

Let $\{(\mathbf{x}_n, t_n)\}_{n=1}^N$ be the training set with $t_n \in \{-1, +1\}$. Each base classifier $y_m(\mathbf{x}) \in \{-1, +1\}$ is trained on a weighted version of the data.

AdaBoost Algorithm

Initialize: $w_n^{(1)} = 1/N$ for all $n$.

For $m = 1, \dots, M$:

Step 1. Train base classifier $y_m$ to minimize the weighted misclassification count: $\displaystyle\min_{y_m} \sum_{n=1}^N w_n^{(m)}\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n]$.
Step 2. Compute the weighted error rate: $\displaystyle\epsilon_m = \frac{\sum_n w_n^{(m)}\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n]}{\sum_n w_n^{(m)}}$.
Step 3. Compute the model weight: $\alpha_m = \ln\!\dfrac{1-\epsilon_m}{\epsilon_m}$.
Step 4. Update data weights: $w_n^{(m+1)} = w_n^{(m)} \exp\!\bigl(\alpha_m\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n]\bigr)$.

Final prediction: $y(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{m=1}^M \alpha_m\, y_m(\mathbf{x})\right).$
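The algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the base classifier is assumed to be a decision stump (the notes do not fix a particular weak learner), the function names `fit_stump`, `stump_predict`, `adaboost_fit`, and `adaboost_predict` are made up for this example, and two safeguards are added that are not part of the algorithm as stated (clipping $\epsilon_m$ away from 0 and 1, and rescaling the weights each round, which leaves $\epsilon_m$ and $\alpha_m$ unchanged).

```python
import numpy as np

def stump_predict(X, stump):
    """Predict +1/-1 with a decision stump (feature index, threshold, polarity)."""
    j, thr, pol = stump
    return np.where(X[:, j] <= thr, pol, -pol)

def fit_stump(X, t, w):
    """Exhaustively pick the stump with the smallest weighted misclassification."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                stump = (j, thr, pol)
                err = np.sum(w * (stump_predict(X, stump) != t))
                if err < best_err:
                    best, best_err = stump, err
    return best

def adaboost_fit(X, t, M=10):
    """Run M boosting rounds; return a list of (alpha_m, stump_m) pairs."""
    N = len(t)
    w = np.full(N, 1.0 / N)                    # initialize w_n = 1/N
    committee = []
    for _ in range(M):
        stump = fit_stump(X, t, w)             # train on the weighted data
        miss = stump_predict(X, stump) != t    # indicator 1[y_m(x_n) != t_n]
        eps = np.sum(w * miss) / np.sum(w)     # weighted error rate
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # assumption: guard against log(0)
        alpha = np.log((1 - eps) / eps)        # model weight
        w = w * np.exp(alpha * miss)           # boost misclassified points only
        w = w / np.sum(w)                      # assumption: rescale for numerical safety
        committee.append((alpha, stump))
    return committee

def adaboost_predict(X, committee):
    """Weighted majority vote; a zero score is mapped to +1 (an arbitrary choice)."""
    score = sum(alpha * stump_predict(X, stump) for alpha, stump in committee)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data: the negative class sits in a band, so no single stump fits it.
X = np.arange(6, dtype=float).reshape(-1, 1)
t = np.array([1, 1, -1, -1, 1, 1])
committee = adaboost_fit(X, t, M=3)
print(adaboost_predict(X, committee))
```

On this toy set every single stump misclassifies at least two of the six points, whereas the three-stump committee classifies all of them correctly. The sketch does not check the weak-learner condition $\epsilon_m < 0.5$; a fuller version would stop or resample when it fails.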

Reading the Update Rule

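Each misclassified point has its weight multiplied by $e^{\alpha_m} = (1-\epsilon_m)/\epsilon_m$. When $\epsilon_m$ is small (good model), $\alpha_m$ is large, so misclassified points receive a large weight boost. When $\epsilon_m \approx 0.5$ (weak model), $\alpha_m \approx 0$ and the weights barely change. The weights of correctly classified points are left unchanged, so their share of the total weight shrinks and the hard cases gain importance in the next round.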
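To get a feel for the numbers, the short sketch below tabulates $\alpha_m$ and the multiplier $e^{\alpha_m}$ applied to misclassified points for a few error rates. The helper name `model_weight` and the chosen values of $\epsilon_m$ are illustrative only.

```python
import math

def model_weight(eps):
    """alpha_m = ln((1 - eps) / eps) for a weighted error rate 0 < eps < 1."""
    return math.log((1.0 - eps) / eps)

for eps in (0.05, 0.25, 0.45, 0.5):
    alpha = model_weight(eps)
    # exp(alpha) = (1 - eps) / eps is the factor applied to each misclassified point.
    print(f"eps={eps:.2f}  alpha={alpha:.3f}  multiplier={math.exp(alpha):.3f}")
```

For instance, $\epsilon_m = 0.25$ gives $\alpha_m = \ln 3$, so each misclassified point has its weight tripled, while $\epsilon_m = 0.5$ gives $\alpha_m = 0$ and no change at all.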

3. Derivation: Exponential Loss

AdaBoost can be derived as greedy sequential minimization of the exponential loss:

$$E = \sum_{n=1}^{N} \exp\!\bigl(-t_n f_M(\mathbf{x}_n)\bigr), \qquad f_M(\mathbf{x}) = \tfrac{1}{2}\sum_{m=1}^{M}\alpha_m\, y_m(\mathbf{x}).$$

At stage $m$, all previous parameters $\{y_l, \alpha_l\}_{l < m}$ are fixed. Grouping their contribution into weights $w_n^{(m)} = \exp(-t_n f_{m-1}(\mathbf{x}_n))$, the loss becomes

$$E = \sum_n w_n^{(m)} \exp\!\bigl(-\tfrac{1}{2}\alpha_m\, t_n\, y_m(\mathbf{x}_n)\bigr).$$

Splitting into correctly ($t_n y_m = +1$) and incorrectly ($t_n y_m = -1$) classified points:

$$E = e^{-\alpha_m/2}\!\!\sum_{n:\,\text{correct}} w_n^{(m)} + e^{+\alpha_m/2}\!\!\sum_{n:\,\text{wrong}} w_n^{(m)}.$$

This can be rewritten using the indicator function as

$$E = \Bigl(e^{\alpha_m/2} - e^{-\alpha_m/2}\Bigr)\sum_n w_n^{(m)}\,\mathbf{1}[y_m \neq t_n] + e^{-\alpha_m/2}\sum_n w_n^{(m)}.$$
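The rewrite relies only on the fact that every point is either correctly or incorrectly classified. Spelled out as a sketch in the notation above:

$$\sum_{n:\,\text{wrong}} w_n^{(m)} = \sum_n w_n^{(m)}\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n], \qquad \sum_{n:\,\text{correct}} w_n^{(m)} = \sum_n w_n^{(m)} - \sum_n w_n^{(m)}\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n].$$

Substituting both sums into the split form of $E$ and collecting the indicator terms gives the expression above.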

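Optimizing $y_m$: the second term does not depend on $y_m$, and the coefficient $e^{\alpha_m/2} - e^{-\alpha_m/2}$ of the first term is positive whenever $\alpha_m > 0$ (i.e., $\epsilon_m < 0.5$), so minimizing $E$ over $y_m$ is exactly the weighted misclassification objective in step 1 of AdaBoost. ✓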

Optimizing $\alpha_m$: setting $\partial E / \partial \alpha_m = 0$ and solving gives

$$\alpha_m = \ln\frac{1-\epsilon_m}{\epsilon_m},$$

matching step 3 of AdaBoost. ✓
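For completeness, the intermediate algebra can be written out as follows, differentiating the form of $E$ that is split into correctly and incorrectly classified points:

$$\frac{\partial E}{\partial \alpha_m} = -\tfrac{1}{2}\,e^{-\alpha_m/2}\!\!\sum_{n:\,\text{correct}} w_n^{(m)} + \tfrac{1}{2}\,e^{+\alpha_m/2}\!\!\sum_{n:\,\text{wrong}} w_n^{(m)} = 0 \;\;\Longrightarrow\;\; e^{\alpha_m} = \frac{\sum_{n:\,\text{correct}} w_n^{(m)}}{\sum_{n:\,\text{wrong}} w_n^{(m)}} = \frac{1-\epsilon_m}{\epsilon_m}.$$

The last equality follows by dividing numerator and denominator by $\sum_n w_n^{(m)}$ and using the definition of $\epsilon_m$; taking logarithms gives the stated $\alpha_m$.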

Weight update: expanding $w_n^{(m+1)} = \exp(-t_n f_m(\mathbf{x}_n))$ using $f_m = f_{m-1} + \tfrac{1}{2}\alpha_m y_m$ and dropping the factor $e^{-\alpha_m/2}$, which is the same for every $n$, yields

$$w_n^{(m+1)} \propto w_n^{(m)}\exp\!\bigl(\alpha_m\,\mathbf{1}[y_m(\mathbf{x}_n)\neq t_n]\bigr),$$

matching step 4 of AdaBoost. ✓
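The step that turns the product $t_n y_m(\mathbf{x}_n)$ into an indicator is worth writing out. Since $t_n y_m(\mathbf{x}_n)$ is $+1$ for a correct prediction and $-1$ for a wrong one,

$$t_n\, y_m(\mathbf{x}_n) = 1 - 2\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n],$$

and therefore

$$w_n^{(m+1)} = w_n^{(m)} \exp\!\bigl(-\tfrac{1}{2}\alpha_m\, t_n\, y_m(\mathbf{x}_n)\bigr) = e^{-\alpha_m/2}\; w_n^{(m)} \exp\!\bigl(\alpha_m\,\mathbf{1}[y_m(\mathbf{x}_n) \neq t_n]\bigr).$$

The prefactor $e^{-\alpha_m/2}$ multiplies every weight equally, so it cancels in $\epsilon_{m+1}$ and does not change which classifier minimizes the weighted misclassification objective in the next round.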

Summary: AdaBoost = Sequential Exponential Loss Minimization

The AdaBoost rules for (a) fitting each base classifier, (b) computing $\alpha_m$, and (c) updating data weights all emerge from greedily minimizing $\sum_n \exp(-t_n f_M(\mathbf{x}_n))$ one stage at a time.

4. Limitations

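Sensitive to outliers. The exponential loss $\exp(-t_n f(\mathbf{x}_n))$ grows exponentially as the margin $t_n f(\mathbf{x}_n)$ becomes more negative, so outliers and mislabeled points that are badly misclassified dominate the objective and distort the committee.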
No probabilistic output. The exponential loss has no direct probabilistic interpretation, unlike cross-entropy. Uncertainty estimates are not naturally available.
Sequential training. Each model must wait for the previous one to finish, so training cannot be parallelized across rounds (unlike bagging). This makes weak learners that are fast to train desirable.
Multi-class extension is non-trivial. The binary $\{-1,+1\}$ formulation does not generalize straightforwardly to $K > 2$ classes.
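The outlier sensitivity in the first limitation can be made concrete by comparing how the two losses mentioned above penalize a point as its margin $t_n f(\mathbf{x}_n)$ becomes more negative. The sketch below assumes the cross-entropy is taken in its $\{-1,+1\}$ form with a logistic link, $\ln(1 + e^{-\text{margin}})$; the function names and margin values are illustrative.

```python
import math

def exp_loss(margin):
    """Exponential loss exp(-t * f(x)) as a function of the margin t * f(x)."""
    return math.exp(-margin)

def log_loss(margin):
    """Cross-entropy for labels in {-1, +1} with a logistic link (an assumption)."""
    return math.log1p(math.exp(-margin))

for margin in (2.0, 0.0, -2.0, -5.0, -10.0):
    print(f"margin={margin:6.1f}  exp_loss={exp_loss(margin):12.3f}  "
          f"log_loss={log_loss(margin):8.3f}")
```

For strongly negative margins the cross-entropy grows roughly linearly in the size of the margin, while the exponential loss grows exponentially, which is why a handful of badly misclassified points can dominate the AdaBoost objective.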