Lecture 5.4
Classification With Decision Regions
Framing classification as partitioning input space into decision regions, and the role of decision boundaries.
- Distinguish classification targets from regression targets; define one-hot encoding.
- Define decision regions and decision boundaries.
- Explain why combining one-vs-rest or one-vs-one binary classifiers leaves ambiguous regions in the multi-class case.
- Name the three classification strategies: discriminant functions, probabilistic discriminative, and probabilistic generative.
1. Classification: Discrete Targets
In classification, the target $t$ belongs to one of $K$ discrete classes $\{C_1, \dots, C_K\}$ rather than taking a continuous value. For binary classification ($K=2$) we encode the target as $t \in \{0, 1\}$. For $K > 2$, a simple integer label carries a spurious ordering between classes, so we use one-hot encoding:
For $K$ classes, encode class $C_k$ as a $K$-dimensional binary vector $\mathbf{t}$ with a 1 in position $k$ and 0 elsewhere. For example, class $C_3$ of five classes becomes $\mathbf{t} = (0,0,1,0,0)^\top$. This encoding is numerical and avoids any implied ordering between classes.
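As a minimal sketch of this encoding, the hypothetical helper `one_hot` below (the name and the 1-based indexing are assumptions chosen to match the notation $C_k$) builds the vector for a given class index using only the Python standard library.

```python
def one_hot(k, num_classes):
    """Return the one-hot vector for class C_k (k is 1-based, as in the text)."""
    if not 1 <= k <= num_classes:
        raise ValueError("k must lie in 1..num_classes")
    # Position j holds 1 exactly when j == k, and 0 otherwise.
    return [1 if j == k else 0 for j in range(1, num_classes + 1)]


t = one_hot(3, 5)
print(t)  # [0, 0, 1, 0, 0]
```

Every vector produced this way sums to 1, and no vector is "larger" than another, which is the sense in which the encoding carries no ordering.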
2. Decision Regions and Decision Boundaries
Partition the $d$-dimensional input space $\mathbb{R}^d$ into $K$ disjoint regions $\mathcal{R}_1, \dots, \mathcal{R}_K$. Any input $\mathbf{x} \in \mathcal{R}_k$ is assigned to class $C_k$. The surfaces separating adjacent regions are called decision boundaries.
A classifier is called linear if its decision boundaries are $(d-1)$-dimensional hyperplanes. A dataset is linearly separable if a linear classifier can perfectly separate all classes.
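To make these definitions concrete, here is a small sketch of a two-class linear classifier in $\mathbb{R}^2$. The weight vector `w`, the bias `b`, and the convention that a score of at least zero means $\mathcal{R}_1$ are illustrative assumptions, not something fixed by the lecture; the decision boundary is the hyperplane where the score is zero.

```python
def linear_score(x, w, b):
    """Compute w . x + b for a point x given as a sequence of floats."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def decision_region(x, w, b):
    """Return 1 if x lies in R_1 (score >= 0), otherwise 2."""
    return 1 if linear_score(x, w, b) >= 0 else 2


def is_separated(points, labels, w, b):
    """True if this linear classifier labels every point correctly."""
    return all(decision_region(x, w, b) == t for x, t in zip(points, labels))


# Hypothetical parameters: the boundary is the line x1 = x2 in R^2.
w, b = (1.0, -1.0), 0.0

print(decision_region((2.0, 0.5), w, b))  # 1
print(decision_region((0.5, 2.0), w, b))  # 2

points = [(2.0, 0.5), (3.0, 1.0), (0.5, 2.0), (1.0, 4.0)]
labels = [1, 1, 2, 2]
print(is_separated(points, labels, w, b))  # True
```

The toy dataset is linearly separable because at least one such `(w, b)` classifies all of its points correctly.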
3. Pitfalls of Binary Classifiers for Multi-Class Problems
Two natural strategies for extending binary classifiers to $K > 2$ classes both produce problematic ambiguous regions:
- One-vs-rest: train $K-1$ classifiers, each distinguishing one class from all others; a point claimed by none of them is assigned to the remaining class. Ambiguous regions arise where more than one classifier claims a point belongs to its class.
- One-vs-one: train one classifier per pair of classes ($K(K-1)/2$ total) and decide by majority vote. Again, ambiguous regions arise where the vote is tied: for $K = 3$, for example, there can be a central region in which each class receives exactly one of the three votes.
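The one-vs-one ambiguity can be reproduced with a small sketch for $K = 3$ in $\mathbb{R}^2$. The three pairwise linear classifiers below are hand-picked for illustration (they are assumptions, not learned from data), chosen so that their boundaries enclose a triangle around the origin.

```python
from collections import Counter

# Hypothetical pairwise classifiers, one per pair of classes.
# Each entry: (class voted for if score > 0, class voted for otherwise, w, b).
PAIRWISE = [
    (1, 2, (1.0, -1.0), 1.0),
    (2, 3, (0.0, 1.0), 1.0),
    (3, 1, (-1.0, 0.0), 1.0),
]


def one_vs_one_votes(x):
    """Collect one vote from each pairwise classifier."""
    votes = Counter()
    for pos, neg, w, b in PAIRWISE:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        votes[pos if score > 0 else neg] += 1
    return votes


def classify(x):
    """Return the majority class, or None when the top vote count is tied."""
    ranked = one_vs_one_votes(x).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


print(classify((5.0, 0.0)))  # 1
print(classify((0.0, 0.0)))  # None: each class receives exactly one vote
```

Far from the origin two classifiers agree and the vote is decisive, but inside the triangle the votes are cyclic and no class wins.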
The correct approach is a single classifier that maps $\mathbf{x}$ directly to one of $K$ classes, eliminating ambiguous regions by construction. The upcoming lectures develop this via probabilistic models.
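One common way to build such a single classifier, offered here as an assumption rather than as the construction the lecture has in mind, is to compute one linear score per class and pick the class with the largest score. Every input then receives exactly one label; ties occur only on the boundaries themselves and are broken here in favour of the lowest class index.

```python
def argmax_classifier(x, weights, biases):
    """Assign x to the class k (1-based) with the largest linear score."""
    scores = [
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    # list.index returns the first maximiser, so exact ties go to the lowest k.
    return scores.index(max(scores)) + 1


# Hypothetical parameters for K = 3 classes in R^2.
WEIGHTS = [(1.0, 0.0), (-0.5, 0.866), (-0.5, -0.866)]
BIASES = [0.0, 0.0, 0.0]

print(argmax_classifier((2.0, 0.0), WEIGHTS, BIASES))    # 1
print(argmax_classifier((-1.0, 2.0), WEIGHTS, BIASES))   # 2
print(argmax_classifier((-1.0, -2.0), WEIGHTS, BIASES))  # 3
```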
4. Three Classification Strategies
- Discriminant functions: learn a direct mapping $f(\mathbf{x}; \mathbf{w}) \to C_k$ without any probabilistic interpretation. Decisions rely on heuristic loss functions.
- Probabilistic discriminative models: model the posterior class probabilities $p(C_k \mid \mathbf{x})$ directly. Decision theory then gives the optimal classifier (Lecture 5.5).
- Probabilistic generative models: model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and class priors $p(C_k)$. Bayes' theorem then gives $p(C_k \mid \mathbf{x})$. Because these two ingredients together define the joint distribution $p(\mathbf{x}, C_k) = p(\mathbf{x} \mid C_k)\, p(C_k)$, the model can also be used to generate synthetic data (Lecture 5.6).
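The generative route can be sketched in one dimension. The Gaussian form of the class-conditional densities and all parameter values below are assumptions made for illustration; the lecture does not specify a density family at this point. The sketch shows both uses of the model: computing posteriors via Bayes' theorem and sampling from the joint distribution.

```python
import math
import random

# Hypothetical 1-D model with K = 3 classes.
PRIORS = [0.5, 0.3, 0.2]   # p(C_k)
MEANS = [-2.0, 0.0, 3.0]   # mean of p(x | C_k)
STDS = [1.0, 1.0, 1.5]     # standard deviation of p(x | C_k)


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x):
    """Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / sum_j p(x | C_j) p(C_j)."""
    joint = [gaussian_pdf(x, m, s) * p for m, s, p in zip(MEANS, STDS, PRIORS)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def sample(rng):
    """Draw (x, k) from the joint: a class from the prior, then x from its density."""
    k = rng.choices(range(len(PRIORS)), weights=PRIORS)[0]
    return rng.gauss(MEANS[k], STDS[k]), k + 1


post = posterior(-2.0)
print(post.index(max(post)) + 1)  # 1

x, k = sample(random.Random(0))   # one synthetic (input, class) pair
```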