Lecture 5.5

Decision Theory

Minimizing expected loss: asymmetric costs, the confusion matrix, and why optimal Bayes classifiers minimize total expected risk.

Learning Objectives

Read a confusion matrix and identify correct classifications versus misclassifications.
Derive the optimal decision rule that minimizes the misclassification rate.
Show that the optimal rule selects the class with the highest posterior probability $p(C_k \mid \mathbf{x})$.
Generalize to asymmetric loss via the loss matrix and expected loss minimization.

1. Measuring Classifier Performance: The Confusion Matrix

For a classifier with $K$ classes, the confusion matrix has entry $(k, j)$ equal to the number of times a point with true class $C_j$ was predicted as $C_k$. Diagonal entries are correct predictions; off-diagonal entries are misclassifications.

Reading a Confusion Matrix

If the classifier predicts class $C_1$ for 47 points that actually belong to $C_1$ (a diagonal entry: correct) and predicts $C_1$ for 5 points that actually belong to $C_2$ (an off-diagonal entry: error), then the row for $C_1$ reads 47 and 5. The misclassification rate is the sum of all off-diagonal entries divided by the total number of points, so minimizing it means minimizing that off-diagonal sum.
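The bookkeeping can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, classes $C_1, C_2$ are stored as indices 0 and 1, and the second row of counts (3 and 45) is made up to complete the example. The convention follows the definition above: rows are predictions, columns are true classes.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Return M with M[k][j] = number of points of true class j predicted as class k."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_j, pred_k in zip(true_labels, predicted_labels):
        matrix[pred_k][true_j] += 1
    return matrix


def misclassification_rate(matrix):
    """Sum of off-diagonal entries divided by the total number of points."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return (total - correct) / total


# 47 points of class 0 predicted as 0, 5 points of class 1 predicted as 0,
# plus assumed counts: 3 points of class 0 predicted as 1, 45 of class 1 predicted as 1.
true_labels = [0] * 47 + [1] * 5 + [0] * 3 + [1] * 45
predicted_labels = [0] * 47 + [0] * 5 + [1] * 3 + [1] * 45

cm = confusion_matrix(true_labels, predicted_labels, num_classes=2)
rate = misclassification_rate(cm)
print(cm)    # [[47, 5], [3, 45]]
print(rate)  # 0.08
```

Some libraries use the transposed convention (rows are true classes), so it is worth checking which axis is which before reading off errors.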

2. Optimal Decision Rule: Maximum Posterior

Assume data $(\mathbf{x}, C_k)$ is drawn from a joint distribution $p(\mathbf{x}, C_k)$. The probability of a correct classification is

$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k,\, C_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, C_k)\, d\mathbf{x}.$$

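Each $\mathbf{x}$ lies in exactly one region, so it contributes exactly one term $p(\mathbf{x}, C_k)$ to this sum. The total is therefore maximized by assigning each $\mathbf{x}$ to the class $C_k$ for which $p(\mathbf{x}, C_k)$ is largest. Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$ and $p(\mathbf{x})$ does not depend on $k$: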

Optimal Bayes Classifier

Assign input $\mathbf{x}$ to the class with the highest posterior probability:

$$\hat{k} = \arg\max_k\, p(C_k \mid \mathbf{x}).$$

This minimizes the misclassification rate. Overlapping class-conditional distributions guarantee some irreducible error; the Bayes classifier achieves the minimum possible error rate (the Bayes error rate).
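As a small illustration of the rule, the sketch below computes posteriors from priors $p(C_k)$ and class-conditional likelihoods $p(\mathbf{x} \mid C_k)$ at a single input and picks the largest. The function names and the three-class numbers are assumptions chosen for the example, not values from the lecture.

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem at one input x: p(C_k | x) = p(x | C_k) p(C_k) / p(x)."""
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]  # p(x, C_k)
    evidence = sum(joint)                                             # p(x)
    return [j / evidence for j in joint]


def bayes_classify(priors, likelihoods):
    """Return the index k that maximizes the posterior p(C_k | x)."""
    post = posteriors(priors, likelihoods)
    return max(range(len(post)), key=lambda k: post[k])


# Assumed values for one input x and three classes.
priors = [0.5, 0.3, 0.2]        # p(C_k)
likelihoods = [0.1, 0.4, 0.3]   # p(x | C_k) evaluated at x

post = posteriors(priors, likelihoods)
k_hat = bayes_classify(priors, likelihoods)
print(k_hat)  # 1, even though class 0 has the largest prior
```

Because the evidence $p(\mathbf{x})$ is the same for every class, taking the argmax over the joint values would give the same answer; normalizing is only needed when the posterior values themselves are of interest.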

Optimal Decision Boundary (1D)

In the binary 1D case, the decision boundary $\hat{x}$ is where $p(C_1 \mid x) = p(C_2 \mid x)$, or equivalently where $p(x, C_1) = p(x, C_2)$. Placing the boundary at this point removes the reducible part of the error. What remains is the irreducible error, which comes from the overlap between $p(x, C_1)$ and $p(x, C_2)$ and cannot be removed by any choice of boundary.
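A numerical check can make this concrete. The sketch below assumes two Gaussian class-conditional densities with equal variance (an assumption made here so that there is a single crossing point with a closed form); the means, priors, and function names are illustrative. It evaluates the error rate at the crossing point and at a few shifted boundaries.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1D Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def error_rate(boundary, prior1, mean1, prior2, mean2, std):
    """p(mistake) when x < boundary goes to C_1 and x >= boundary to C_2 (mean1 < mean2)."""
    c1_in_r2 = prior1 * (1.0 - normal_cdf(boundary, mean1, std))  # C_1 mass right of boundary
    c2_in_r1 = prior2 * normal_cdf(boundary, mean2, std)          # C_2 mass left of boundary
    return c1_in_r2 + c2_in_r1


def crossing_point(prior1, mean1, prior2, mean2, std):
    """Solve p(x, C_1) = p(x, C_2) for equal-variance Gaussians."""
    return 0.5 * (mean1 + mean2) + std ** 2 * math.log(prior1 / prior2) / (mean2 - mean1)


# Assumed parameters for illustration.
params = dict(prior1=0.7, mean1=0.0, prior2=0.3, mean2=2.0, std=1.0)

x_hat = crossing_point(**params)
bayes_error = error_rate(x_hat, **params)
shifted_errors = [error_rate(x_hat + d, **params) for d in (-0.5, -0.1, 0.1, 0.5)]

print(x_hat)           # boundary shifts toward the less probable class
print(bayes_error)     # strictly positive: the irreducible error
print(shifted_errors)  # every shifted boundary does worse
```

With unequal priors the boundary moves away from the midpoint of the two means, toward the rarer class, because the more probable class claims more of the input space.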

3. Asymmetric Losses and the Loss Matrix

Not all misclassifications have equal consequences. In medical diagnosis, classifying a sick patient as healthy (false negative) is usually far more harmful than classifying a healthy patient as sick (false positive). Decision theory handles this via a loss matrix:

Loss Matrix and Expected Loss

Define $L_{kj}$ as the loss incurred when the true class is $C_j$ but the classifier predicts $C_k$. The expected loss is

$$\mathbb{E}[L] = \sum_{k} \sum_{j} L_{kj} \int_{\mathcal{R}_k} p(\mathbf{x}, C_j)\, d\mathbf{x}.$$

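The optimal classifier minimizes $\mathbb{E}[L]$ pointwise: each $\mathbf{x}$ is assigned to the class $C_k$ (that is, placed in region $\mathcal{R}_k$) for which $\sum_j L_{kj}\, p(C_j \mid \mathbf{x})$ is smallest. With the 0-1 loss $L_{kj} = 1 - \delta_{kj}$, this sum equals $1 - p(C_k \mid \mathbf{x})$, so the rule reduces to choosing the class with the highest posterior.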
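The minimum-expected-loss rule can be sketched as follows. The indexing matches the definition of $L_{kj}$ (first index is the prediction, second is the true class); the loss values, the posterior, and the function names are hypothetical choices for illustration, with index 0 standing for "healthy" and index 1 for "sick".

```python
def expected_losses(loss, posterior):
    """For each candidate prediction k, compute sum_j loss[k][j] * p(C_j | x)."""
    num_classes = len(posterior)
    return [
        sum(loss[k][j] * posterior[j] for j in range(num_classes))
        for k in range(len(loss))
    ]


def min_risk_decision(loss, posterior):
    """Return the prediction k with the smallest expected loss."""
    risks = expected_losses(loss, posterior)
    return min(range(len(risks)), key=lambda k: risks[k])


# loss[k][j]: cost of predicting k when the truth is j (0 = healthy, 1 = sick).
zero_one = [[0, 1],
            [1, 0]]
asymmetric = [[0, 100],   # predicting healthy for a sick patient costs 100 (assumed)
              [1, 0]]     # predicting sick for a healthy patient costs 1 (assumed)

posterior = [0.9, 0.1]    # assumed p(healthy | x), p(sick | x)

print(min_risk_decision(zero_one, posterior))    # 0: same as the maximum-posterior rule
print(min_risk_decision(asymmetric, posterior))  # 1: the costly false negative is avoided
```

Under the assumed asymmetric losses, predicting "sick" has expected loss $0.9$ while predicting "healthy" has expected loss $10$, so the decision flips even though "healthy" is nine times more probable.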

Class Imbalance

When one class is rare (e.g., cancer affects 1% of patients), a classifier that always predicts "healthy" achieves 99% accuracy while detecting no cancer cases at all, so accuracy alone is misleading. Proper evaluation requires examining the full confusion matrix or using class-weighted metrics, particularly when combined with an asymmetric loss matrix.
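The 99% figure is easy to reproduce. The sketch below assumes a hypothetical cohort of 1000 patients with 10 cancer cases and a classifier that always predicts "healthy"; the per-class recall and its average (often called balanced accuracy) are shown here as one example of a class-weighted metric, not as the only appropriate choice.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of points whose prediction matches the true class."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def recall(true_labels, predicted_labels, cls):
    """Fraction of points of true class `cls` that were predicted as `cls`."""
    in_class = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == cls]
    return sum(p == cls for _, p in in_class) / len(in_class)


# Assumed cohort: 990 healthy (0) and 10 sick (1); classifier always says healthy.
true_labels = [0] * 990 + [1] * 10
predicted_labels = [0] * 1000

acc = accuracy(true_labels, predicted_labels)
recall_healthy = recall(true_labels, predicted_labels, cls=0)
recall_sick = recall(true_labels, predicted_labels, cls=1)
balanced_acc = 0.5 * (recall_healthy + recall_sick)

print(acc)           # 0.99
print(recall_sick)   # 0.0: every cancer case is missed
print(balanced_acc)  # 0.5: no better than guessing once classes are weighted equally
```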