Lecture 13.4

Decision Trees & Random Forests

Decision trees partition input space into rectangular regions using a sequence of greedy binary splits, producing highly interpretable models. Random forests — decision trees combined with bootstrap aggregation and per-split feature bagging — are among the strongest general-purpose classifiers for tabular data.

Learning Objectives

Describe how a decision tree partitions input space and makes predictions (regression: region mean; classification: majority vote).
Explain the greedy tree-building algorithm and its connection to minimizing within-region heterogeneity.
Explain why large trees overfit and describe cost-complexity pruning.
State why the misclassification rate is a poor splitting criterion and why the Gini index or cross-entropy is preferred.
Describe how Random Forests combine bootstrap aggregation with per-split feature bagging to decorrelate trees.

1. Decision Trees: The Core Idea

A decision tree partitions $\mathbb{R}^D$ into $J$ axis-aligned rectangular regions $\{R_1, \dots, R_J\}$ via a binary tree of threshold questions. Each internal node asks "is feature $x_i \leq \tau$?" and routes the input left or right accordingly. Each leaf node assigns a prediction to the region it covers.

Predictions

Regression: predict the mean target value of training points in the leaf region: $\hat{y}_{R_j} = \dfrac{1}{|R_j|}\displaystyle\sum_{n:\,\mathbf{x}_n \in R_j} t_n$.
Classification: predict the majority class in the leaf region (or use class probabilities).
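
The two leaf rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names are hypothetical and the inputs are assumed to be the training targets (or labels) that fall in a single leaf region.

```python
from collections import Counter

def regression_leaf_prediction(targets):
    """Mean of the training targets t_n that fall in the leaf region."""
    return sum(targets) / len(targets)

def classification_leaf_prediction(labels):
    """Majority class and class proportions of the labels in the leaf region."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    proportions = {k: c / len(labels) for k, c in counts.items()}
    return majority, proportions
```

For example, a leaf holding targets 1.0, 2.0, and 6.0 predicts their mean, 3.0, and a leaf holding three points of one class and one of another predicts the first class with probability 0.75.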

Baseball Salary Example

Predict a player's salary from Years (experience) and Hits. The first split on Years < 4.5 identifies juniors (uniformly low salary → Region 1) vs. seniors. Among seniors, a split on Hits separates low-hit players (Region 2) from high-hit players (Region 3, highest salary). The tree is interpretable: Years is the most important feature (top of tree), with Hits as a secondary discriminator for experienced players.
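
Read as code, the example tree is just two nested threshold tests. The sketch below assumes the Years threshold of 4.5 from the text; the Hits threshold is not stated here, so it is passed in as a parameter (`hits_threshold`, a hypothetical name) rather than filled in.

```python
def salary_region(years, hits, hits_threshold):
    """Route a player to Region 1, 2, or 3 of the example tree."""
    if years < 4.5:             # root split from the example: juniors
        return 1
    if hits < hits_threshold:   # assumed parameter; the text gives no value
        return 2
    return 3
```

The nesting makes the interpretation explicit: Hits is only ever consulted for players who have already been classified as seniors.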

2. Greedy Tree Building

A decision tree is built top-down by recursive binary splitting. At each step, for every candidate feature $i$ and threshold $\tau$, evaluate the quality of splitting the current region into $\{x_i \leq \tau\}$ and $\{x_i > \tau\}$.

Splitting Criterion (Regression)

Choose the split $(i^*, \tau^*)$ that minimizes the total within-region sum of squared errors over the regions that result from the split:

$$\min_{i,\,\tau}\; \sum_{j=1}^{J}\sum_{\mathbf{x}_n \in R_j}\bigl(t_n - \hat{y}_{R_j}\bigr)^2.$$

The algorithm is greedy: at each step it commits to the split that is best for the current region, without looking ahead to how that choice affects later splits.
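
A single greedy step can be sketched as an exhaustive search over features and thresholds. The sketch assumes candidate thresholds are midpoints between consecutive distinct feature values (a common choice, not something the notes specify); the names `sse` and `best_split` are illustrative.

```python
def sse(targets):
    """Sum of squared errors around the mean of a (possibly empty) region."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def best_split(X, t):
    """Return (feature index, threshold, total SSE) of the best single split.

    X is a list of feature rows, t the matching list of targets.
    Returns None if no feature has two distinct values.
    """
    best = None
    for i in range(len(X[0])):
        values = sorted(set(row[i] for row in X))
        # Assumed candidate thresholds: midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            tau = (lo + hi) / 2
            left = [tn for row, tn in zip(X, t) if row[i] <= tau]
            right = [tn for row, tn in zip(X, t) if row[i] > tau]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (i, tau, total)
    return best
```

Only the two child regions of the region being split change, so it suffices to compare the SSE of the two children across candidates.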

Tree building stops when a user-defined criterion is met (e.g., minimum $N_{\min}$ points per leaf, or maximum tree depth).
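
Putting the split search and the stopping rules together gives a recursive builder. The following is a sketch under stated assumptions: trees are stored as nested dicts, thresholds are midpoints between distinct values, and `max_depth` and `n_min` are illustrative parameter names for the two stopping criteria mentioned above.

```python
def sse(targets):
    """Sum of squared errors around the region mean."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def build_tree(X, t, depth=0, max_depth=3, n_min=2):
    """Greedily grow a regression tree; returns nested dicts."""
    leaf = {"prediction": sum(t) / len(t)}
    # Stop at maximum depth, when children could not hold n_min points,
    # or when the region is already perfectly fit.
    if depth >= max_depth or len(t) < 2 * n_min or sse(t) == 0:
        return leaf
    best = None
    for i in range(len(X[0])):
        values = sorted(set(row[i] for row in X))
        for lo, hi in zip(values, values[1:]):
            tau = (lo + hi) / 2
            left = [n for n, row in enumerate(X) if row[i] <= tau]
            right = [n for n, row in enumerate(X) if row[i] > tau]
            if len(left) < n_min or len(right) < n_min:
                continue  # would violate the minimum leaf size
            total = sse([t[n] for n in left]) + sse([t[n] for n in right])
            if best is None or total < best[0]:
                best = (total, i, tau, left, right)
    if best is None:
        return leaf
    _, i, tau, left, right = best
    return {
        "feature": i,
        "threshold": tau,
        "left": build_tree([X[n] for n in left], [t[n] for n in left],
                           depth + 1, max_depth, n_min),
        "right": build_tree([X[n] for n in right], [t[n] for n in right],
                            depth + 1, max_depth, n_min),
    }

def predict(tree, x):
    """Follow threshold questions from the root down to a leaf."""
    while "prediction" not in tree:
        side = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["prediction"]
```

With `max_depth=0` the builder returns a single leaf that predicts the global mean, which is the coarsest possible tree.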

3. Overfitting and Pruning

A deep tree can create one leaf per training point, giving zero training error but very poor generalization. Shallow trees underfit by making only coarse splits. The standard solution is to grow a large tree and then prune it back.

Cost-Complexity Pruning

For a subtree $T$ with $|T|$ leaf nodes, minimize

$$C_\alpha(T) = \sum_{j=1}^{|T|}\sum_{\mathbf{x}_n \in R_j}\bigl(t_n - \hat{y}_{R_j}\bigr)^2 + \alpha\,|T|.$$

The regularization parameter $\alpha \geq 0$ penalizes tree complexity; it is chosen by cross-validation. Increasing $\alpha$ prunes more aggressively, producing simpler trees.
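
The effect of $\alpha$ can be seen by evaluating $C_\alpha(T)$ on a handful of candidate subtrees. The sketch below assumes each candidate is summarized by its leaf count $|T|$ and its training sum of squared errors; the numbers in the usage note are made up purely for illustration, and the function names are hypothetical.

```python
def cost_complexity(sse_total, n_leaves, alpha):
    """C_alpha(T): training SSE plus alpha times the number of leaves."""
    return sse_total + alpha * n_leaves

def select_subtree(candidates, alpha):
    """Pick the (n_leaves, sse_total) pair with the smallest C_alpha."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[0], alpha))

# Hypothetical pruning sequence: (number of leaves, training SSE).
candidates = [(1, 100.0), (2, 40.0), (4, 30.0), (8, 28.0)]
```

With these illustrative values, $\alpha = 0$ selects the 8-leaf tree, $\alpha = 2$ the 4-leaf tree, $\alpha = 10$ the 2-leaf tree, and $\alpha = 100$ the single-leaf tree. In practice the candidates come from pruning the grown tree, and $\alpha$ is chosen by cross-validation rather than by training cost.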

Greedy Limitation

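The greedy procedure can miss good trees: a split that yields only a small immediate improvement in sum of squared errors may enable large improvements in subsequent splits, so stopping as soon as the improvement falls below a threshold is short-sighted. Pruning partially compensates by allowing the tree to grow deep and then removing unnecessary nodes post hoc.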
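
An extreme case of this limitation is an XOR-style target, where no single split helps at all but two levels of splits fit the data exactly. The toy data below are hypothetical and chosen only to make the effect visible.

```python
def sse(targets):
    """Sum of squared errors around the mean of a (possibly empty) region."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def split_sse(X, t, feature, tau):
    """Total SSE after splitting on x[feature] <= tau."""
    left = [tn for x, tn in zip(X, t) if x[feature] <= tau]
    right = [tn for x, tn in zip(X, t) if x[feature] > tau]
    return sse(left) + sse(right)

# XOR-style toy data (hypothetical): target is 1 when the features differ.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
t = [0.0, 1.0, 1.0, 0.0]

parent_sse = sse(t)                                        # 1.0
one_split = min(split_sse(X, t, f, 0.5) for f in (0, 1))   # 1.0, no gain

# Split on feature 0 anyway, then split each half on feature 1.
two_level = 0.0
for side in (lambda x: x[0] <= 0.5, lambda x: x[0] > 0.5):
    Xs = [x for x in X if side(x)]
    ts = [tn for x, tn in zip(X, t) if side(x)]
    two_level += split_sse(Xs, ts, 1, 0.5)                 # ends at 0.0
```

A rule that stops when the best split brings no improvement would never split this data, whereas growing two levels and pruning afterwards keeps the useful structure.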

4. Classification Trees: Splitting Criteria

For classification, the targets are class labels, not real numbers. The splitting criterion must therefore measure within-region impurity, that is, how mixed the classes are. Let $\hat{p}_{jk}$ be the proportion of class-$k$ points in region $R_j$. Two preferred criteria:

Gini Index and Cross-Entropy

Gini index: $\displaystyle G_j = \sum_k \hat{p}_{jk}(1 - \hat{p}_{jk}).$

Cross-entropy: $\displaystyle H_j = -\sum_k \hat{p}_{jk}\ln \hat{p}_{jk}.$

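Both attain their minimum value of zero when a region is pure (all points belong to one class; by convention $0 \ln 0 = 0$). Choose the split minimizing the size-weighted impurity $\sum_j |R_j| \cdot G_j$ (or $\sum_j |R_j| \cdot H_j$), where $|R_j|$ is the number of training points in $R_j$.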
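
Both criteria and the weighted split score are short to compute from the labels in each region. This is a minimal sketch with hypothetical function names; only classes that actually occur in a region contribute, which matches the $0 \ln 0 = 0$ convention.

```python
import math
from collections import Counter

def proportions(labels):
    """Class proportions p_jk for the labels in one region."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    """Gini index: sum over classes of p * (1 - p)."""
    return sum(p * (1 - p) for p in proportions(labels))

def cross_entropy(labels):
    """Cross-entropy: minus the sum over classes of p * ln p."""
    return -sum(p * math.log(p) for p in proportions(labels))

def split_score(regions, impurity):
    """Size-weighted impurity: sum over regions of |R_j| * impurity(R_j)."""
    return sum(len(r) * impurity(r) for r in regions)
```

A pure region scores 0 under both criteria, and a region split evenly between two classes scores 0.5 under Gini and $\ln 2$ under cross-entropy.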

Why Not Use Misclassification Rate?

Suppose a split produces two regions with class fractions $(0.15, 0.85)$ and $(0.85, 0.15)$. Each region has misclassification rate $0.15$, so the per-region rates sum to $0.30$ (the regions are treated as equally sized, so per-region values are simply added). An alternative split gives $(0.0, 1.0)$ and $(0.70, 0.30)$, with rates $0$ and $0.30$, which again sum to $0.30$. The second split is clearly better because it produces one pure region, but the misclassification rate cannot tell the two apart. The Gini index and cross-entropy both favor purity and therefore distinguish these cases.
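
The comparison can be checked numerically. The sketch below follows the example in adding per-region values without size weights, which amounts to assuming equally sized regions; the variable and function names are illustrative.

```python
import math

def misclassification(p):
    """Fraction of points not in the majority class."""
    return 1 - max(p)

def gini(p):
    return sum(q * (1 - q) for q in p)

def cross_entropy(p):
    # Skip zero proportions, following the convention 0 ln 0 = 0.
    return -sum(q * math.log(q) for q in p if q > 0)

def summed(split, criterion):
    """Add the criterion over the regions of a split (equal weights assumed)."""
    return sum(criterion(p) for p in split)

split_a = [(0.15, 0.85), (0.85, 0.15)]
split_b = [(0.0, 1.0), (0.70, 0.30)]

for name, criterion in [("misclassification", misclassification),
                        ("gini", gini),
                        ("cross-entropy", cross_entropy)]:
    print(name, summed(split_a, criterion), summed(split_b, criterion))
```

Under this equal-weight assumption the misclassification totals agree (about 0.30 each), while the Gini totals are about 0.51 versus 0.42 and the cross-entropy totals about 0.85 versus 0.61, so both impurity measures prefer the split with the pure region.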

5. Random Forests

Decision trees are highly interpretable but weak predictors on their own: small changes in the data can flip high-level splits, producing very different trees (high variance). Random forests ensemble many trees to suppress this variance.

Random Forest Algorithm

Bootstrap: for $b = 1, \dots, B$, draw a bootstrap dataset $\mathcal{D}_b$ (sample $N$ points with replacement).
Grow: build a deep tree on $\mathcal{D}_b$, but at each split consider only a random subset of $m \approx \sqrt{D}$ features (where $D$ is the total number of features).
Aggregate: average predictions (regression) or take majority vote (classification) over all $B$ trees.
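
The three steps above can be sketched for classification in plain Python. This is an illustrative sketch, not a reference implementation: it assumes the Gini index as the splitting criterion, midpoint thresholds, class labels that are not tuples (tuples are used for internal nodes), and hypothetical names such as `grow` and `random_forest`.

```python
import math
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def grow(X, y, m, rng):
    """Grow a deep tree, sampling m candidate features at every split."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1:
        return majority                            # pure leaf
    best = None
    for i in rng.sample(range(len(X[0])), m):      # per-split feature subset
        values = sorted(set(row[i] for row in X))
        for lo, hi in zip(values, values[1:]):
            tau = (lo + hi) / 2
            left = [n for n, row in enumerate(X) if row[i] <= tau]
            right = [n for n, row in enumerate(X) if row[i] > tau]
            score = (len(left) * gini([y[n] for n in left])
                     + len(right) * gini([y[n] for n in right]))
            if best is None or score < best[0]:
                best = (score, i, tau, left, right)
    if best is None:                               # sampled features are constant
        return majority
    _, i, tau, left, right = best
    return (i, tau,
            grow([X[n] for n in left], [y[n] for n in left], m, rng),
            grow([X[n] for n in right], [y[n] for n in right], m, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):                 # internal node
        i, tau, left, right = tree
        tree = left if x[i] <= tau else right
    return tree

def random_forest(X, y, B=25, seed=0):
    rng = random.Random(seed)
    m = max(1, round(math.sqrt(len(X[0]))))        # m is roughly sqrt(D)
    forest = []
    for _ in range(B):
        # Bootstrap: N draws with replacement from the training set.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(grow([X[n] for n in idx], [y[n] for n in idx], m, rng))
    return forest

def predict_forest(forest, x):
    """Aggregate by majority vote over the B trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

For regression, the same structure applies with the sum of squared errors as the split criterion, leaf means as predictions, and an average in place of the vote.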

Why Per-Split Feature Bagging?

If one feature is strongly predictive, most bootstrapped trees will choose it at the root, so the trees will be highly correlated and averaging them yields little variance reduction. By randomizing the feature subset at every split, trees are forced to make different decisions throughout their structure, not just at the root. This decorrelates the trees and increases the variance reduction obtained from averaging.
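
The role of correlation can be made precise with a standard identity. As an idealization (the symbols $\hat{y}_b$, $\sigma^2$, and $\rho$ are introduced here and are not part of the notes), suppose the $B$ tree predictions $\hat{y}_1, \dots, \hat{y}_B$ at a given input each have variance $\sigma^2$ and pairwise correlation $\rho$. Then

$$\operatorname{Var}\!\Bigl(\frac{1}{B}\sum_{b=1}^{B}\hat{y}_b\Bigr) = \frac{1}{B^2}\Bigl(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\Bigr) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.$$

The second term vanishes as $B$ grows, but the first does not: with $\rho$ close to 1 the average is barely better than a single tree, so lowering $\rho$ is what makes adding more trees worthwhile.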

Empirical Benefit

On a cancer gene-expression classification task, a single decision tree achieves high error. A random forest (with $m = \sqrt{D}$ features per split) reduces the error substantially. Using all $D$ features per split, which reduces to plain bagging with no feature subsampling, performs worse than $m = \sqrt{D}$, indicating that decorrelating the trees, not just resampling the data, is the critical ingredient.

Random forests require no feature scaling, handle mixed feature types naturally, provide built-in variable importance estimates, and are often competitive with gradient boosting methods on tabular benchmarks.