Lecture 5.5
Decision Theory
Minimizing expected loss: asymmetric costs, the confusion matrix, and why optimal Bayes classifiers minimize total expected risk.
- Read a confusion matrix and identify correct classifications versus misclassifications.
- Derive the optimal decision rule that minimizes the misclassification rate.
- Show that the optimal rule selects the class with the highest posterior probability $p(C_k \mid \mathbf{x})$.
- Generalize to asymmetric loss via the loss matrix and expected loss minimization.
1. Measuring Classifier Performance: The Confusion Matrix
For a classifier with $K$ classes, the confusion matrix is a $K \times K$ table whose entry $(k, j)$ is the number of times a point with true class $C_j$ was predicted as $C_k$. Under this convention, rows index the predicted class and columns index the true class. Diagonal entries are correct predictions; off-diagonal entries are misclassifications.
If the classifier predicts class $C_1$ for 47 points that actually belong to $C_1$ (diagonal: correct) and predicts $C_1$ for 5 points that actually belong to $C_2$ (off-diagonal: error), then the row for $C_1$ shows this split. For a fixed dataset, minimizing the misclassification rate means minimizing the sum of all off-diagonal entries.
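The bookkeeping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, classes are encoded as indices 0 and 1 for $C_1$ and $C_2$, and only the counts 47 and 5 come from the text. The counts 3 and 45 in the second row are invented to complete the example.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Return a matrix whose entry [k][j] counts points with true class j predicted as class k."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        matrix[predicted_label][true_label] += 1  # row = predicted, column = true
    return matrix


def misclassification_rate(matrix):
    """Fraction of all points that land in off-diagonal entries."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return (total - correct) / total


# Index 0 stands for C_1 and index 1 for C_2.
# The counts 47 and 5 come from the text; 3 and 45 are made-up values.
true_labels = [0] * 47 + [1] * 5 + [0] * 3 + [1] * 45
predicted_labels = [0] * 52 + [1] * 48

cm = confusion_matrix(true_labels, predicted_labels, num_classes=2)
error_rate = misclassification_rate(cm)
print(cm)          # [[47, 5], [3, 45]]
print(error_rate)  # 0.08
```

Note that some libraries use the transposed convention (rows for the true class, columns for the predicted class), so it is worth checking which one applies before reading a matrix.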
2. Optimal Decision Rule: Maximum Posterior
Assume data $(\mathbf{x}, C_k)$ is drawn from a joint distribution $p(\mathbf{x}, C_k)$. A decision rule partitions the input space into decision regions $\mathcal{R}_1, \dots, \mathcal{R}_K$, where every $\mathbf{x} \in \mathcal{R}_k$ is assigned to class $C_k$. The probability of a correct classification is
$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k,\, C_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, C_k)\, d\mathbf{x}.$$
This is maximized by assigning each $\mathbf{x}$ to the class $C_k$ for which $p(\mathbf{x}, C_k)$ is largest. Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$ and $p(\mathbf{x})$ does not depend on $k$, the rule can be stated in terms of the posterior:
Assign input $\mathbf{x}$ to the class with the highest posterior probability:
$$\hat{k} = \arg\max_k\, p(C_k \mid \mathbf{x}).$$
This minimizes the misclassification rate. When the class-conditional distributions overlap, some error is irreducible; the Bayes classifier achieves the minimum possible error rate (the Bayes error rate).
In the binary 1D case with a single crossing point, the decision boundary $\hat{x}$ is where $p(C_1 \mid x) = p(C_2 \mid x)$, or equivalently where $p(x, C_1) = p(x, C_2)$. Placing the boundary at this point removes the avoidable part of the error; what remains is the irreducible error caused by the overlap between the two class-conditional densities.
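As a rough numerical illustration of the binary 1D case, the sketch below locates the boundary by bisection on $p(C_1 \mid x) - 0.5$. Everything specific here is an assumption made for the example: the class-conditional densities are taken to be Gaussian with equal variance (so the posteriors cross exactly once), and the means, standard deviations, priors, and function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian, used here as an assumed class-conditional density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_c1(x, prior1, mean1, std1, mean2, std2):
    """p(C_1 | x) via Bayes' theorem for two classes."""
    joint1 = prior1 * gaussian_pdf(x, mean1, std1)          # p(x, C_1)
    joint2 = (1.0 - prior1) * gaussian_pdf(x, mean2, std2)  # p(x, C_2)
    return joint1 / (joint1 + joint2)


def find_boundary(lo, hi, prior1, mean1, std1, mean2, std2, tol=1e-10):
    """Bisection for the point where p(C_1 | x) = 0.5.

    Assumes C_1 is more probable at lo and C_2 is more probable at hi.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if posterior_c1(mid, prior1, mean1, std1, mean2, std2) > 0.5:
            lo = mid  # still on the C_1 side, so the boundary lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Equal priors: the boundary sits midway between the two means.
boundary_equal = find_boundary(0.0, 2.0, prior1=0.5, mean1=0.0, std1=1.0, mean2=2.0, std2=1.0)

# A larger prior for C_1 pushes the boundary toward the mean of C_2.
boundary_skewed = find_boundary(0.0, 2.0, prior1=0.8, mean1=0.0, std1=1.0, mean2=2.0, std2=1.0)

print(boundary_equal, boundary_skewed)
```

For equal-variance Gaussians the crossing point also has a closed form, $\hat{x} = \tfrac{1}{2}(\mu_1 + \mu_2) + \frac{\sigma^2}{\mu_2 - \mu_1} \ln \frac{p(C_1)}{p(C_2)}$, which the bisection result should match.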
3. Asymmetric Losses and the Loss Matrix
Not all misclassifications have equal consequences. In medical diagnosis, classifying a sick patient as healthy (false negative) is usually far more harmful than classifying a healthy patient as sick (false positive). Decision theory handles this via a loss matrix:
Define $L_{kj}$ as the loss incurred when the true class is $C_j$ but the classifier predicts $C_k$. The expected loss is
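$$\mathbb{E}[L] = \sum_{k} \sum_{j} L_{kj} \int_{\mathcal{R}_k} p(\mathbf{x}, C_j)\, d\mathbf{x}.$$
Because each $\mathbf{x}$ can be assigned independently and $p(\mathbf{x}, C_j) = p(C_j \mid \mathbf{x})\, p(\mathbf{x})$ with $p(\mathbf{x})$ common to all classes, the optimal classifier minimizes $\mathbb{E}[L]$ by assigning each $\mathbf{x}$ to the class $C_k$ (that is, placing it in region $\mathcal{R}_k$) that minimizes $\sum_j L_{kj}\, p(C_j \mid \mathbf{x})$. With the 0-1 loss $L_{kj} = 1 - \delta_{kj}$, this sum equals $1 - p(C_k \mid \mathbf{x})$, which recovers the maximum-posterior rule.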
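To make the rule concrete, here is a minimal sketch that evaluates $\sum_j L_{kj}\, p(C_j \mid \mathbf{x})$ for each candidate prediction $k$ and picks the smallest. The function names, the loss values (1 and 100), and the posterior are all hypothetical numbers chosen for illustration; index 0 stands for "healthy" and index 1 for "sick".

```python
def expected_losses(posterior, loss):
    """Return, for each prediction k, the sum over j of loss[k][j] * posterior[j].

    loss[k][j] is the loss when the true class is j and the prediction is k.
    """
    num_classes = len(posterior)
    return [
        sum(loss[k][j] * posterior[j] for j in range(num_classes))
        for k in range(len(loss))
    ]


def decide(posterior, loss):
    """Pick the prediction with the smallest expected loss."""
    risks = expected_losses(posterior, loss)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical loss matrices (rows = predicted class, columns = true class).
zero_one_loss = [[0, 1],
                 [1, 0]]
asymmetric_loss = [[0, 100],  # predicting healthy for a sick patient costs 100
                   [1, 0]]    # predicting sick for a healthy patient costs 1

# Hypothetical posterior: 95% healthy, 5% sick.
posterior = [0.95, 0.05]

decision_zero_one = decide(posterior, zero_one_loss)      # follows the larger posterior
decision_asymmetric = decide(posterior, asymmetric_loss)  # weighs the costly error
print(decision_zero_one, decision_asymmetric)  # 0 1
```

Under the 0-1 loss the rule picks "healthy", matching the maximum-posterior decision. Under the asymmetric loss the expected loss of predicting "healthy" is about 5 versus 0.95 for predicting "sick", so the decision flips even though the posterior for "sick" is only 0.05.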
When one class is rare (e.g., cancer affects 1% of patients), a classifier that always predicts "healthy" achieves 99% accuracy, which is misleading because it never detects a single sick patient. Proper evaluation requires examining the full confusion matrix or using class-weighted metrics, particularly when the losses are also asymmetric.
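The accuracy paradox is easy to reproduce. The sketch below builds a hypothetical cohort of 1000 patients with 1% prevalence, applies the always-"healthy" classifier, and compares plain accuracy with per-class recall. Balanced accuracy (the mean of the per-class recalls) is used here as one possible class-weighted metric; the cohort size and helper name are assumptions made for the example.

```python
num_patients = 1000  # hypothetical cohort size
num_sick = 10        # 1% prevalence, as in the text

# Label 1 = sick, label 0 = healthy.
true_labels = [1] * num_sick + [0] * (num_patients - num_sick)
predicted_labels = [0] * num_patients  # classifier that always predicts "healthy"

accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / num_patients


def recall(true_labels, predicted_labels, label):
    """Fraction of points whose true class is `label` that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(true_labels, predicted_labels) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


recall_healthy = recall(true_labels, predicted_labels, label=0)
recall_sick = recall(true_labels, predicted_labels, label=1)
balanced_accuracy = 0.5 * (recall_healthy + recall_sick)

print(accuracy)           # 0.99
print(recall_sick)        # 0.0
print(balanced_accuracy)  # 0.5
```

The 99% accuracy hides the fact that recall on the rare class is zero, while balanced accuracy drops to 0.5, the same score a trivial constant classifier gets on any two-class problem.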