Lecture 5.4

Classification With Decision Regions

Framing classification as partitioning input space into decision regions, and the role of decision boundaries.

Learning Objectives

Distinguish classification targets from regression targets; define one-hot encoding.
Define decision regions and decision boundaries.
Explain why combining one-vs-rest or one-vs-one binary classifiers leaves ambiguous regions in the multi-class case.
Name the three classification strategies: discriminant functions, probabilistic discriminative, and probabilistic generative.

1. Classification: Discrete Targets

In classification, the target $t$ belongs to one of $K$ discrete classes $\{C_1, \dots, C_K\}$ rather than taking a continuous value. For binary classification ($K=2$) we encode the target as $t \in \{0, 1\}$. For $K > 2$, a simple integer label carries a spurious ordering between classes, so we use one-hot encoding:

One-Hot Encoding

For $K$ classes, encode class $C_k$ as a $K$-dimensional binary vector $\mathbf{t}$ with a 1 in position $k$ and 0 elsewhere. For example, class $C_3$ of five classes becomes $\mathbf{t} = (0,0,1,0,0)^\top$. This encoding is numerical and avoids any implied ordering between classes.
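
The encoding can be sketched in a few lines of Python. The function name `one_hot` and its arguments are illustrative choices, not notation from the lecture; classes are indexed from 1 to match $C_1, \dots, C_K$.

```python
def one_hot(k: int, num_classes: int) -> list[int]:
    """Encode class C_k (1-indexed, as in the text) as a K-dimensional 0/1 vector."""
    if not 1 <= k <= num_classes:
        raise ValueError("k must lie between 1 and num_classes")
    # Position k gets a 1, every other position gets a 0.
    return [1 if j == k else 0 for j in range(1, num_classes + 1)]


print(one_hot(3, 5))  # [0, 0, 1, 0, 0]
```

Every pair of distinct one-hot vectors is the same distance apart, which is one way to see that no ordering between classes is implied.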

2. Decision Regions and Decision Boundaries

Decision Regions and Boundaries

Partition the $d$-dimensional input space $\mathbb{R}^d$ into $K$ disjoint regions $\mathcal{R}_1, \dots, \mathcal{R}_K$. Any input $\mathbf{x} \in \mathcal{R}_k$ is assigned to class $C_k$. The surfaces separating adjacent regions are called decision boundaries.

A classifier is called linear if its decision boundaries are $(d-1)$-dimensional hyperplanes. A dataset is linearly separable if some linear classifier assigns every point in it to the correct class.
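
As a minimal sketch of these definitions for $K = 2$, assume the decision boundary is the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$. The symbols `w` and `b`, the helper names, and the toy dataset are assumptions made for illustration; they do not come from the lecture.

```python
def region(x, w, b):
    """Return 1 if x lies in R_1 (w.x + b >= 0), otherwise 2 for R_2."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 2


def separates(points, labels, w, b):
    """True if the hyperplane (w, b) puts every point in its labelled region."""
    return all(region(x, w, b) == t for x, t in zip(points, labels))


# Toy 2-D dataset: only the point (1, 1) belongs to class C_1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [2, 2, 2, 1]

# The line x1 + x2 - 1.5 = 0 is a 1-dimensional hyperplane in R^2.
print(separates(points, labels, w=(1.0, 1.0), b=-1.5))  # True
```

Finding one such hyperplane is enough to show that this toy dataset is linearly separable. A failed check for a particular `w` and `b` shows only that this hyperplane does not work, not that none exists.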

3. Pitfalls of Binary Classifiers for Multi-Class Problems

Two natural strategies for extending binary classifiers to $K > 2$ classes both produce problematic ambiguous regions:

One-vs-rest: train $K-1$ classifiers, each distinguishing one class from all others. Ambiguous regions arise where more than one classifier claims a point belongs to its class.
One-vs-one: train one classifier per pair of classes ($K(K-1)/2$ total) and decide by majority vote. Ambiguous regions remain wherever the vote is tied. For $K = 3$, for example, there is a central region in which each class receives exactly one of the three votes.
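
Both kinds of ambiguity can be reproduced with a toy example for $K = 3$ in two dimensions. The thresholds below are hand-picked, hypothetical binary classifiers chosen only to make the ambiguous regions easy to hit; nothing is trained, and `None` is used here to mark an ambiguous decision.

```python
from collections import Counter
from itertools import combinations

K = 3  # three classes C_1, C_2, C_3 in a 2-D input space


def one_vs_rest(x):
    """K-1 = 2 hypothetical binary classifiers; C_3 is the fallback class."""
    claims = []
    if x[0] > 0:      # classifier 1: "C_1" vs "not C_1"
        claims.append(1)
    if x[1] > 0:      # classifier 2: "C_2" vs "not C_2"
        claims.append(2)
    if not claims:
        return 3      # nobody claims x, so it falls to the remaining class
    return claims[0] if len(claims) == 1 else None  # None marks ambiguity


def one_vs_one(x):
    """One hypothetical classifier per pair of classes, then a majority vote."""
    votes = Counter([
        1 if x[0] > 0 else 2,          # C_1 vs C_2
        2 if x[1] > 0 else 3,          # C_2 vs C_3
        1 if x[0] + x[1] < 1 else 3,   # C_1 vs C_3
    ])
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                    # tied vote: no class wins
    return ranked[0][0]


num_pairs = len(list(combinations(range(1, K + 1), 2)))
print(num_pairs)                  # 3, i.e. K(K-1)/2
print(one_vs_rest((1.0, -1.0)))   # 1
print(one_vs_rest((1.0, 1.0)))    # None
print(one_vs_one((0.2, 0.2)))     # 1
print(one_vs_one((1.0, 1.0)))     # None
```

At the point $(1, 1)$ both one-vs-rest classifiers claim the input, and the three pairwise classifiers each vote for a different class, so neither scheme produces a unique label.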

Solution: A Single $K$-Class Classifier

The remedy is a single classifier that maps $\mathbf{x}$ directly to exactly one of the $K$ classes, so every input receives a unique label and ambiguous regions are eliminated by construction. The upcoming lectures develop this via probabilistic models.
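
One common way to build such a classifier, assumed here for illustration since the text above does not fix a particular form, is to compute one linear score per class and pick the class with the largest score. The weights, biases, and the name `classify` are hypothetical.

```python
def classify(x, weights, biases):
    """Assign x to the class whose linear score w_k.x + b_k is largest."""
    scores = [
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    # Index of the largest score, shifted so that classes are numbered from 1.
    # Exact ties occur only on decision boundaries; max() then keeps the first.
    return 1 + max(range(len(scores)), key=scores.__getitem__)


# Hypothetical parameters for K = 3 classes in a 2-D input space.
weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
biases = [0.0, 0.0, 0.0]

print(classify((2.0, 1.0), weights, biases))    # 1
print(classify((1.0, 2.0), weights, biases))    # 2
print(classify((-1.0, -1.0), weights, biases))  # 3
```

Because the decision is a single argmax over $K$ scores, every input gets exactly one label, which is what removes the ambiguous regions.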

4. Three Classification Strategies

Classification Strategies

Discriminant functions: learn a function $f(\mathbf{x}; \mathbf{w})$ that maps each input directly to a class label $C_k$, without any probabilistic interpretation. Training relies on a heuristically chosen loss function rather than on a probabilistic model.
Probabilistic discriminative models: model the posterior class probabilities $p(C_k \mid \mathbf{x})$ directly. Decision theory then gives the optimal classifier (Lecture 5.5).
Probabilistic generative models: model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and class priors $p(C_k)$. Bayes' theorem then gives $p(C_k \mid \mathbf{x}) = p(\mathbf{x} \mid C_k)\, p(C_k) / p(\mathbf{x})$. Because these two factors define the joint distribution $p(\mathbf{x}, C_k)$, the model can also generate synthetic data (Lecture 5.6).
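
The generative strategy can be sketched for a one-dimensional input with two classes. The Gaussian class-conditional densities, the equal priors, and all names below are assumptions chosen for illustration; the lecture does not prescribe a particular density.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as p(x | C_k)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


PRIORS = [0.5, 0.5]                # assumed p(C_1), p(C_2)
PARAMS = [(-1.0, 1.0), (1.0, 1.0)]  # assumed (mean, std) of p(x | C_k)


def posterior(x):
    """Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)."""
    joint = [p * gaussian_pdf(x, m, s) for p, (m, s) in zip(PRIORS, PARAMS)]
    evidence = sum(joint)           # p(x), the sum of the joint over classes
    return [j / evidence for j in joint]


def sample(rng):
    """Draw (x, class) from the joint: first a class, then x given the class."""
    k = rng.choices(range(len(PRIORS)), weights=PRIORS)[0]
    mean, std = PARAMS[k]
    return rng.gauss(mean, std), k + 1


print(posterior(0.0))               # [0.5, 0.5] by symmetry
print(posterior(2.0))               # class C_2 is far more probable here
print(sample(random.Random(0)))     # one synthetic (x, class) pair
```

A probabilistic discriminative model would instead fit `posterior` directly and would have no counterpart to `sample`.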