Lecture 5.5

Decision Theory

Minimizing expected loss: asymmetric costs, the confusion matrix, and why optimal Bayes classifiers minimize total expected risk.

Learning Objectives
  • Read a confusion matrix and identify correct classifications versus misclassifications.
  • Derive the optimal decision rule that minimizes the misclassification rate.
  • Show that the optimal rule selects the class with the highest posterior probability $p(C_k \mid \mathbf{x})$.
  • Generalize to asymmetric loss via the loss matrix and expected loss minimization.

1. Measuring Classifier Performance: The Confusion Matrix

For a classifier with $K$ classes, the confusion matrix is a $K \times K$ table whose entry $(k, j)$ is the number of points with true class $C_j$ that were predicted as $C_k$. Rows therefore index the predicted class and columns index the true class. Diagonal entries count correct predictions; off-diagonal entries count misclassifications.

Reading a Confusion Matrix

Suppose the classifier predicts $C_1$ for 47 points that actually belong to $C_1$ (diagonal entry: correct) and for 5 points that actually belong to $C_2$ (off-diagonal entry: error). Both counts sit in the row for $C_1$, because that row collects every point predicted as $C_1$. The misclassification rate is the sum of all off-diagonal entries divided by the total number of points, so minimizing it means minimizing that sum.
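A minimal sketch of building and reading such a matrix, using the convention above (row = predicted, column = true). The counts 47 and 5 come from the example; the remaining counts (3 and 45), the helper name `confusion_matrix`, and the 0-indexed labels (class 0 standing in for $C_1$) are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Entry (k, j) counts points with true class j that were predicted as class k."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1  # row = predicted class, column = true class
    return cm

# Hypothetical labels: 50 points per class, 0-indexed (0 ~ C_1, 1 ~ C_2).
y_true = np.array([0] * 50 + [1] * 50)
y_pred = np.array([0] * 47 + [1] * 3 + [0] * 5 + [1] * 45)

cm = confusion_matrix(y_true, y_pred, num_classes=2)

# Off-diagonal entries are the misclassifications.
errors = cm.sum() - np.trace(cm)
error_rate = errors / cm.sum()

print(cm)          # row 0 holds the 47 / 5 split from the example
print(error_rate)
```

Note that some libraries use the transposed layout: scikit-learn's `confusion_matrix`, for instance, puts the true class on the rows and the predicted class on the columns.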

2. Optimal Decision Rule: Maximum Posterior

Assume data $(\mathbf{x}, C_k)$ is drawn from a joint distribution $p(\mathbf{x}, C_k)$. The probability of a correct classification is

$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k,\, C_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, C_k)\, d\mathbf{x}.$$

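Because the regions $\mathcal{R}_k$ partition the input space, each $\mathbf{x}$ contributes exactly one term $p(\mathbf{x}, C_k)$ to this sum, so it is maximized by assigning each $\mathbf{x}$ to the class $C_k$ for which $p(\mathbf{x}, C_k)$ is largest. Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$ and $p(\mathbf{x})$ does not depend on $k$, this is the same as choosing the class with the largest posterior: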

Optimal Bayes Classifier

Assign input $\mathbf{x}$ to the class with the highest posterior probability:

$$\hat{k} = \arg\max_k\, p(C_k \mid \mathbf{x}).$$

This minimizes the misclassification rate. When the class-conditional distributions overlap, some error is unavoidable; no decision rule can achieve a lower error rate than the Bayes classifier, and that minimum is called the Bayes error rate.
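The rule can be sketched for a toy model in which the joint distribution is known exactly. Everything specific here is an assumption for illustration: two classes, equal priors, and 1D Gaussian class-conditionals with means 0 and 2 and unit variance. In practice the posterior has to be estimated from data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed toy model (not from the lecture): priors p(C_k) and Gaussian p(x | C_k).
priors = np.array([0.5, 0.5])
means = np.array([0.0, 2.0])
stds = np.array([1.0, 1.0])

def posteriors(x):
    """Return p(C_k | x) for each input, shape (N, K)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]  # shape (N, 1)
    joint = gaussian_pdf(x, means, stds) * priors            # p(x, C_k)
    return joint / joint.sum(axis=1, keepdims=True)          # divide by p(x)

def bayes_classify(x):
    """Assign each input to the class with the highest posterior."""
    return np.argmax(posteriors(x), axis=1)

labels = bayes_classify([-1.0, 0.5, 1.5, 3.0])
print(labels)
```

Dividing by $p(\mathbf{x})$ does not change the argmax, so `np.argmax` over the joint would give the same labels; the normalization is kept only so that `posteriors` returns proper probabilities.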

Optimal Decision Boundary (1D)

In the binary 1D case, when the two joint densities cross at a single point, the decision boundary $\hat{x}$ is where $p(C_1 \mid x) = p(C_2 \mid x)$, or equivalently where $p(x, C_1) = p(x, C_2)$. Placing the boundary at this point minimizes the misclassification probability. The error that remains is irreducible: it comes from the overlap between the two class-conditional densities, which no choice of boundary can remove.
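As a rough numerical check of this claim, the sketch below scans candidate boundaries for an assumed toy model (two unit-variance Gaussians with means 0 and 2 and equal priors, none of which come from the lecture) and compares the best one with the crossing point. For equal variances the crossing point has the closed form $\hat{x} = \tfrac{1}{2}(\mu_1 + \mu_2) + \sigma^2 \ln(\pi_1 / \pi_2) / (\mu_2 - \mu_1)$, where $\pi_k$ denotes the prior $p(C_k)$.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed toy model: priors and equal-variance Gaussian class-conditionals, mu1 < mu2.
prior1, prior2 = 0.5, 0.5
mu1, mu2, sigma = 0.0, 2.0, 1.0

def error_rate(threshold):
    """Misclassification probability when predicting C_1 for x < threshold, else C_2."""
    miss_c1 = prior1 * (1.0 - normal_cdf((threshold - mu1) / sigma))  # C_1 points above threshold
    miss_c2 = prior2 * normal_cdf((threshold - mu2) / sigma)          # C_2 points below threshold
    return miss_c1 + miss_c2

# Point where p(x, C_1) = p(x, C_2) for equal variances.
x_hat = 0.5 * (mu1 + mu2) + sigma**2 * math.log(prior1 / prior2) / (mu2 - mu1)

# Brute-force scan over candidate boundaries.
thresholds = np.linspace(-2.0, 4.0, 601)
errors = np.array([error_rate(t) for t in thresholds])
best_threshold = thresholds[np.argmin(errors)]

bayes_error = error_rate(x_hat)  # the irreducible error for this model
print(x_hat, best_threshold, bayes_error)
```

The scan should land on the crossing point up to the grid spacing, and the error there is nonzero because the two densities overlap.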

3. Asymmetric Losses and the Loss Matrix

Not all misclassifications have equal consequences. In medical diagnosis, classifying a sick patient as healthy (false negative) is usually far more harmful than classifying a healthy patient as sick (false positive). Decision theory handles this via a loss matrix:

Loss Matrix and Expected Loss

Define $L_{kj}$ as the loss incurred when the true class is $C_j$ but the classifier predicts $C_k$. The expected loss is

$$\mathbb{E}[L] = \sum_{k} \sum_{j} L_{kj} \int_{\mathcal{R}_k} p(\mathbf{x}, C_j)\, d\mathbf{x}.$$

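The optimal classifier minimizes $\mathbb{E}[L]$ by assigning each $\mathbf{x}$ to the class $C_k$ (that is, placing it in the region $\mathcal{R}_k$) that minimizes $\sum_j L_{kj}\, p(C_j \mid \mathbf{x})$; the common factor $p(\mathbf{x})$ does not affect the choice. For the 0-1 loss $L_{kj} = 1 - \delta_{kj}$ this sum equals $1 - p(C_k \mid \mathbf{x})$, so the rule reduces to choosing the class with the highest posterior.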
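A small sketch of this rule, assuming the posteriors are already available. The loss values (100 for a missed sick patient, 1 for a false alarm), the posterior values, and the function name are hypothetical; the indexing follows the definition above, with `loss[k, j]` the loss for predicting class `k` when the true class is `j`.

```python
import numpy as np

# Hypothetical loss matrix; class 0 = healthy, class 1 = sick.
# loss[k, j]: predicted class k, true class j.
loss = np.array([[0.0, 100.0],   # predict healthy: costly if the patient is sick
                 [1.0,   0.0]])  # predict sick: mild cost if the patient is healthy

def min_expected_loss_decision(posterior, loss):
    """posterior has shape (N, K) with rows p(C_j | x).

    Returns the class k minimizing sum_j loss[k, j] * p(C_j | x) for each row,
    together with the table of expected losses.
    """
    expected = posterior @ loss.T  # expected[n, k] = sum_j posterior[n, j] * loss[k, j]
    return np.argmin(expected, axis=1), expected

# Hypothetical posteriors for two patients: [p(healthy | x), p(sick | x)].
posterior = np.array([[0.95, 0.05],
                      [0.995, 0.005]])

decisions, expected = min_expected_loss_decision(posterior, loss)

# With 0-1 loss the same rule reduces to the maximum-posterior classifier.
zero_one = 1.0 - np.eye(2)
map_decisions, _ = min_expected_loss_decision(posterior, zero_one)

print(decisions, map_decisions)
```

With these numbers the first patient is flagged as sick even though the posterior for "healthy" is 0.95, because the expected loss of predicting healthy (5.0) exceeds that of predicting sick (0.95).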

Class Imbalance

When one class is rare (e.g., cancer affects 1% of patients), a classifier that always predicts "healthy" achieves 99% accuracy while detecting no cancer cases at all, so accuracy alone is a misleading metric. Proper evaluation requires examining the full confusion matrix or using class-weighted metrics. This matters most when the losses are also asymmetric, as when missing the rare class is the costly error.
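The effect can be illustrated with a hypothetical cohort of 1000 patients, 10 of whom are sick, matching the 1% rate above. Balanced accuracy (the mean of the per-class recalls) is used here as one example of a class-weighted metric; the cohort size and the choice of metric are assumptions for illustration.

```python
import numpy as np

# Hypothetical cohort: 1 = sick (1% of patients), 0 = healthy.
y_true = np.array([1] * 10 + [0] * 990)

# Trivial classifier that always predicts "healthy".
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)

def recall_for(cls):
    """Fraction of points with true class `cls` that were predicted as `cls`."""
    mask = y_true == cls
    return np.mean(y_pred[mask] == cls)

recall_sick = recall_for(1)
recall_healthy = recall_for(0)
balanced_accuracy = 0.5 * (recall_sick + recall_healthy)

print(accuracy, recall_sick, balanced_accuracy)
```

Accuracy looks excellent, but the recall on the sick class is zero and the balanced accuracy is no better than chance for two classes.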