Lecture 6.4

Discriminant Functions: Least Squares

Applying least squares regression to the multi-class classification problem: a closed-form discriminant with appealing simplicity but well-known pathologies for classification tasks.

Learning Objectives

Set up the least squares classification problem using one-hot target encoding and matrix notation.
Derive the closed-form solution $\widetilde{\mathbf{W}} = \widetilde{\mathbf{X}}^\dagger \mathbf{T}$ for the weight matrix.
Explain why least squares is sensitive to outliers in the classification setting.
Describe the masking problem that arises in multi-class least squares.

1. Setup: One Discriminant Per Class

We assign each of $K$ classes its own linear discriminant $y_k(\mathbf{x}) = \widetilde{\mathbf{w}}_k^\top \widetilde{\mathbf{x}}$, where $\widetilde{\mathbf{x}} = (1, \mathbf{x}^\top)^\top$ is the input augmented with a constant 1 and $\widetilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^\top)^\top$ absorbs the bias $w_{k0}$. Stacking the $K$ augmented weight vectors as columns of $\widetilde{\mathbf{W}}$ gives all $K$ outputs at once:

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^\top \widetilde{\mathbf{x}} \in \mathbb{R}^K.$$

We assign $\mathbf{x}$ to the class $C_k$ whose discriminant $y_k(\mathbf{x})$ is largest.
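The prediction rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `predict` and the weight values in `W_demo` are made up for the example, and the weight matrix is assumed to have shape $(d{+}1) \times K$ with the biases in its first row.

```python
import numpy as np

def predict(W_tilde, X):
    """Return (scores, labels) for inputs X of shape (N, d)."""
    # Prepend a column of ones so the first row of W_tilde acts as the bias.
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    # Row n of `scores` holds all K discriminants y_k(x_n).
    scores = X_tilde @ W_tilde
    # Assign each input to the class with the largest discriminant.
    return scores, np.argmax(scores, axis=1)

# Hypothetical weights for d = 2 inputs and K = 3 classes.
W_demo = np.array([[0.0, 0.0, 0.0],    # biases
                   [1.0, 0.0, -1.0],   # weights on the first input
                   [0.0, 1.0, 0.0]])   # weights on the second input
X_demo = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

scores_demo, labels_demo = predict(W_demo, X_demo)
print(scores_demo)
print(labels_demo)
```

Class indices here run from 0 to $K-1$, following NumPy's conventions rather than the 1-based $C_k$ of the text.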

2. One-Hot Targets and the Sum of Squared Errors

We use one-hot encoding: the target for a class-$k$ example is the vector $\mathbf{t}$ with $t_k = 1$ and all other entries 0. Collect all $N$ targets row-wise into an $N \times K$ target matrix $\mathbf{T}$, and all augmented inputs row-wise into $\widetilde{\mathbf{X}}$, the $N \times (d{+}1)$ design matrix with a leading column of 1s, where $d$ is the input dimension. The sum of squared errors is

$$E(\widetilde{\mathbf{W}}) = \tfrac{1}{2} \operatorname{tr}\!\Bigl\{\bigl(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}\bigr)^\top \bigl(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}\bigr)\Bigr\}.$$

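The trace sums the squared errors over all $N$ data points and all $K$ class outputs, so $E$ is half the squared Frobenius norm of the residual matrix $\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}$.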
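As a quick check of the notation, the sketch below builds a one-hot target matrix and evaluates the error both as a trace and as a plain sum of squared residuals. The helper names `one_hot` and `sse` and the small arrays are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def one_hot(labels, K):
    """Build the N x K target matrix T from integer labels in 0..K-1."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def sse(X_tilde, W_tilde, T):
    """Sum-of-squares error written with the trace, as in the text."""
    R = X_tilde @ W_tilde - T          # N x K residual matrix
    return 0.5 * np.trace(R.T @ R)

labels = np.array([0, 2, 1])
T = one_hot(labels, K=3)

# Toy design matrix (N = 3, d = 1) with the leading column of ones.
X_tilde = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
W_zero = np.zeros((2, 3))              # all-zero weights, for illustration

E_zero = sse(X_tilde, W_zero, T)
# Same quantity as an elementwise sum over all N * K residuals.
R = X_tilde @ W_zero - T
E_frob = 0.5 * np.sum(R ** 2)
print(E_zero, E_frob)
```

With all-zero weights every residual equals minus the target, so the error is half the number of training points.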

3. Closed-Form Solution

Setting $\partial E / \partial \widetilde{\mathbf{W}} = \mathbf{0}$ gives the multi-output version of the normal equations:

Least Squares Discriminant (Multi-class)

$$\widetilde{\mathbf{W}} = \widetilde{\mathbf{X}}^\dagger \mathbf{T} = \bigl(\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}\bigr)^{-1} \widetilde{\mathbf{X}}^\top \mathbf{T}.$$
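For reference, the intermediate steps can be sketched as follows. Writing $\mathbf{A} = \widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}$ and using the standard identity $\partial \operatorname{tr}(\mathbf{A}^\top \mathbf{A}) / \partial \widetilde{\mathbf{W}} = 2\,\widetilde{\mathbf{X}}^\top \mathbf{A}$, the stationarity condition becomes

$$\frac{\partial E}{\partial \widetilde{\mathbf{W}}} = \widetilde{\mathbf{X}}^\top \bigl(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}\bigr) = \mathbf{0} \quad\Longrightarrow\quad \widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}\, \widetilde{\mathbf{W}} = \widetilde{\mathbf{X}}^\top \mathbf{T}.$$

Left-multiplying both sides by $\bigl(\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}\bigr)^{-1}$ recovers the expression for $\widetilde{\mathbf{W}}$.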

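Here $\widetilde{\mathbf{X}}^\dagger$ is the pseudo-inverse of $\widetilde{\mathbf{X}}$; the explicit formula on the right requires $\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}$ to be invertible, i.e. $\widetilde{\mathbf{X}}$ must have full column rank. The solution has the same form as in linear regression (Lecture 3.2), applied to each column of $\mathbf{T}$ independently, so the entire weight matrix is obtained in a single linear solve.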
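A minimal sketch of the closed-form fit, assuming NumPy and synthetic Gaussian data invented for the example (the class means, sample sizes, and the name `fit_least_squares` are not from the lecture). It computes the weights three ways, via `np.linalg.lstsq`, the pseudo-inverse, and the normal equations, which should agree when the design matrix has full column rank.

```python
import numpy as np

def fit_least_squares(X, labels, K):
    """Least squares discriminant: returns W_tilde of shape (d + 1, K)."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])   # add the bias column
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1.0               # one-hot targets
    # lstsq solves min ||X_tilde W - T||_F for all K columns at once.
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)
    return W_tilde

# Synthetic 2-D data: three classes, 30 points each (made-up means).
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
X = means[labels] + rng.normal(size=(90, 2))

W_ls = fit_least_squares(X, labels, K=3)

# Cross-check against the two formulas in the text.
X_tilde = np.hstack([np.ones((90, 1)), X])
T = np.eye(3)[labels]
W_pinv = np.linalg.pinv(X_tilde) @ T
W_normal = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ T)
print(np.allclose(W_ls, W_pinv), np.allclose(W_ls, W_normal))
```

In practice `lstsq` or `pinv` is usually preferable to forming $\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}$ explicitly, since the latter squares the condition number of the problem.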

4. Problems with Least Squares for Classification

Despite the elegant closed-form solution, least squares classification has three serious drawbacks.

Sensitivity to Outliers

The regression targets are 0 or 1, but the model predicts real-valued $y_k(\mathbf{x})$. Because $y_k(\mathbf{x})$ is proportional to the signed distance from $\mathbf{x}$ to the hyperplane $y_k(\mathbf{x}) = 0$ (Lecture 6.3), a target of 1 effectively demands that every class-$k$ example sit at a fixed distance of $1/\|\mathbf{w}_k\|$ from that hyperplane (with $\mathbf{w}_k$ excluding the bias). Points that lie far on the correct side produce outputs well above 1 and incur a large squared error for being "too easily classified," so the fit shifts the boundary toward them, sometimes at the cost of misclassifying points near the boundary. A few such outlying points can therefore move the boundary substantially.
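The effect can be seen on a tiny one-dimensional, two-class example. The data below are invented for illustration: four symmetric points, then the same points plus three class-1 examples placed far on the correct side. The helpers `fit` and `boundary` are hypothetical names; `boundary` solves $y_0(x) = y_1(x)$ for $x$.

```python
import numpy as np

def fit(x, labels, K=2):
    """Closed-form least squares weights for 1-D inputs, shape (2, K)."""
    X_tilde = np.column_stack([np.ones_like(x), x])
    T = np.eye(K)[labels]                      # one-hot targets
    return np.linalg.pinv(X_tilde) @ T

def boundary(W):
    """Point where the two discriminants are equal."""
    diff = W[:, 1] - W[:, 0]                   # (bias gap, slope gap)
    return -diff[0] / diff[1]

# Clean, symmetric data: class 0 on the left, class 1 on the right.
x_clean = np.array([-2.0, -1.0, 1.0, 2.0])
labels_clean = np.array([0, 0, 1, 1])

# Same data plus three class-1 points far on the correct side.
x_far = np.concatenate([x_clean, [20.0, 21.0, 22.0]])
labels_far = np.concatenate([labels_clean, [1, 1, 1]])

W_clean = fit(x_clean, labels_clean)
W_far = fit(x_far, labels_far)
b_clean = boundary(W_clean)
b_far = boundary(W_far)

# Predicted class of the class-1 training point x = 1 under each fit.
x_query = np.array([1.0, 1.0])                 # [bias, x]
pred_clean = int(np.argmax(x_query @ W_clean))
pred_far = int(np.argmax(x_query @ W_far))
print(b_clean, b_far, pred_clean, pred_far)
```

On this toy data the boundary sits at the origin for the clean set and moves past $x = 1$ once the far points are added, so a correctly labeled class-1 point ends up on the wrong side even though the extra points were never in doubt.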

Masking in Multi-class Problems

When one class lies geometrically between two others (e.g., three class clouds arranged along a line), least squares often produces a weight matrix for which the middle class's discriminant is never the largest: one of the two neighboring discriminants dominates it everywhere, so no input is assigned to the middle class and it "disappears." This masking effect is a structural failure of the method, not something that tuning can fix.
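A small one-dimensional sketch makes masking concrete. The twelve points below are invented so that the three classes sit symmetrically along a line; by symmetry the middle class's fitted discriminant comes out flat at about $1/3$, while the outer two slope in opposite directions.

```python
import numpy as np

# Three classes on a line: left (0), middle (1), right (2). Made-up data.
x = np.array([-6.0, -5.0, -4.0, -3.0,
              -1.5, -0.5, 0.5, 1.5,
              3.0, 4.0, 5.0, 6.0])
labels = np.repeat(np.arange(3), 4)

X_tilde = np.column_stack([np.ones_like(x), x])
T = np.eye(3)[labels]                      # one-hot targets
W = np.linalg.pinv(X_tilde) @ T            # closed-form least squares fit

scores = X_tilde @ W
pred = scores.argmax(axis=1)

print(W[:, 1])    # middle-class discriminant: bias ~ 1/3, slope ~ 0
print(pred)       # the middle class index never appears
```

Every middle-class training point is claimed by one of its neighbors, so training accuracy here is 8 out of 12 even though the classes are perfectly separated along the line.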

Outputs Are Not Probabilities

The $y_k$ values can be negative or greater than 1, so they are not proper probabilities. There is one weak guarantee: because the one-hot targets sum to 1 across classes and the model includes a bias term, the predictions satisfy $\sum_k y_k(\mathbf{x}) = 1$ for every $\mathbf{x}$. But individual $y_k$ can still fall outside $[0, 1]$, so the analogy to probabilities breaks down.
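Both halves of this statement are easy to check numerically. The sketch below uses a made-up one-dimensional, three-class data set and evaluates the fitted model at a few query points, including two outside the training range; the variable names are illustrative only.

```python
import numpy as np

# Toy 1-D training set with three classes (invented for this example).
x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
labels = np.array([0, 0, 1, 1, 2, 2])

X_tilde = np.column_stack([np.ones_like(x), x])   # includes the bias column
T = np.eye(3)[labels]                             # one-hot targets
W = np.linalg.pinv(X_tilde) @ T

# Evaluate at new points, two of them well outside the training range.
x_new = np.array([-5.0, 0.25, 5.0])
Y = np.column_stack([np.ones_like(x_new), x_new]) @ W

row_sums = Y.sum(axis=1)
print(row_sums)            # each row sums to 1 (up to rounding)
print(Y.min(), Y.max())    # but entries fall below 0 and above 1
```

The sum-to-one property depends on the bias column: the all-ones target sum is then exactly representable by the model, so the fitted outputs reproduce it at every input.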