Lecture 6.4
Discriminant Functions: Least Squares
Applying least squares regression to the multi-class classification problem: a closed-form discriminant with appealing simplicity but well-known pathologies for classification tasks.
- Set up the least squares classification problem using one-hot target encoding and matrix notation.
- Derive the closed-form solution $\widetilde{\mathbf{W}} = \widetilde{\mathbf{X}}^\dagger \mathbf{T}$ for the weight matrix.
- Explain why least squares is sensitive to outliers in the classification setting.
- Describe the masking problem that arises in multi-class least squares.
1. Setup: One Class Per Discriminant
We assign each of $K$ classes its own linear discriminant $y_k(\mathbf{x}) = \mathbf{w}_k^\top \widetilde{\mathbf{x}}$, where $\widetilde{\mathbf{x}} = (1, \mathbf{x}^\top)^\top$ includes the bias. Stacking the $K$ weight vectors as columns of $\widetilde{\mathbf{W}}$, all predictions are obtained at once:
$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^\top \widetilde{\mathbf{x}} \in \mathbb{R}^K.$$
We assign $\mathbf{x}$ to the class $C_k$ whose discriminant $y_k(\mathbf{x})$ is largest.
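The decision rule can be sketched in a few lines of NumPy. This is an illustration only: the weight values and the helper name `discriminants` are made up for the example, with $d = 2$ inputs and $K = 3$ classes.

```python
import numpy as np

# Hypothetical weight matrix for d = 2 inputs and K = 3 classes.
# Rows: bias, x1 coefficient, x2 coefficient. Column k holds w_k.
W_tilde = np.array([
    [0.0, 0.5, -0.5],
    [1.0, 0.0, -1.0],
    [0.0, 1.0, -1.0],
])

def discriminants(W_tilde, x):
    """Return y(x) = W_tilde^T x_tilde for a single input vector x."""
    x_tilde = np.concatenate(([1.0], x))  # prepend the constant bias input
    return W_tilde.T @ x_tilde

x = np.array([2.0, 1.0])
y = discriminants(W_tilde, x)        # one score per class
predicted_class = int(np.argmax(y))  # index k of the largest y_k
```

For these illustrative weights the scores are $(2.0, 1.5, -3.5)$, so the input is assigned to the first class (index 0).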
2. One-Hot Targets and the Sum of Squared Errors
Use one-hot encoding: the target for a class-$k$ example is the vector $\mathbf{t}$ with $t_k = 1$ and all other entries 0. Collect all $N$ targets row-wise into an $N \times K$ target matrix $\mathbf{T}$, and all inputs row-wise into $\widetilde{\mathbf{X}}$ (the $N \times (d{+}1)$ design matrix with a leading column of 1s). The sum of squared errors is
$$E(\widetilde{\mathbf{W}}) = \tfrac{1}{2} \operatorname{tr}\!\Bigl\{\bigl(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}\bigr)^\top \bigl(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T}\bigr)\Bigr\}.$$
The trace sums the squared errors over all $N$ data points and all $K$ class outputs simultaneously.
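As a small sketch of this setup, the snippet below builds $\mathbf{T}$ and $\widetilde{\mathbf{X}}$ for a toy data set and evaluates the error both through the trace formula and as a plain sum of squared residuals. The helper names (`one_hot`, `add_bias`, `sse`) and the data values are assumptions chosen for illustration.

```python
import numpy as np

def one_hot(labels, K):
    """N x K target matrix T: row n has a 1 in column labels[n]."""
    return np.eye(K)[labels]

def add_bias(X):
    """N x (d+1) design matrix with a leading column of ones."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def sse(X_tilde, W_tilde, T):
    """Sum-of-squares error, written with the trace as in the text."""
    R = X_tilde @ W_tilde - T  # N x K residual matrix
    return 0.5 * np.trace(R.T @ R)

# Toy data: N = 4 points, d = 2 features, K = 3 classes.
labels = np.array([0, 2, 1, 2])
X = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, -1.0], [3.0, 2.0]])

T = one_hot(labels, K=3)
X_tilde = add_bias(X)
W_tilde = np.zeros((3, 3))  # arbitrary candidate weights

E_trace = sse(X_tilde, W_tilde, T)
E_sum = 0.5 * np.sum((X_tilde @ W_tilde - T) ** 2)  # same quantity, elementwise
```

With all-zero weights the residual is $-\mathbf{T}$, so the error is half the number of data points (here 2.0), and both expressions agree.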
3. Closed-Form Solution
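Setting $\partial E / \partial \widetilde{\mathbf{W}} = \mathbf{0}$ gives the multi-output version of the normal equations, $\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}} \widetilde{\mathbf{W}} = \widetilde{\mathbf{X}}^\top \mathbf{T}$. When $\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}$ is invertible, this yields
$$\widetilde{\mathbf{W}} = \bigl(\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}\bigr)^{-1} \widetilde{\mathbf{X}}^\top \mathbf{T} = \widetilde{\mathbf{X}}^\dagger \mathbf{T}.$$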
The solution has exactly the same pseudo-inverse form as in linear regression (Lecture 3.2), now applied to each output column of $\mathbf{T}$ simultaneously. The entire weight matrix is obtained in one shot.
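A minimal sketch of the closed-form fit, assuming NumPy and synthetic Gaussian class clouds (the function name `fit_least_squares`, the class means, and the random seed are illustrative choices, not part of the lecture). It computes the weights with the pseudo-inverse and cross-checks them against a direct solve of the normal equations.

```python
import numpy as np

def fit_least_squares(X, labels, K):
    """Closed-form least squares weights: W_tilde = pinv(X_tilde) @ T."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    T = np.eye(K)[labels]
    return np.linalg.pinv(X_tilde) @ T

# Synthetic data: three Gaussian clouds in 2D (assumed for illustration).
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
X = means[labels] + rng.normal(size=(90, 2))

W_tilde = fit_least_squares(X, labels, K=3)  # shape (d+1, K) = (3, 3)

# Cross-check: solve the normal equations X^T X W = X^T T directly.
X_tilde = np.hstack([np.ones((90, 1)), X])
T = np.eye(3)[labels]
W_normal = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ T)

# Class predictions on the training inputs.
pred = np.argmax(X_tilde @ W_tilde, axis=1)
```

The pseudo-inverse route is used here because it also behaves sensibly when $\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}}$ is singular or badly conditioned, whereas the explicit solve requires invertibility.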
4. Problems with Least Squares for Classification
Despite the elegant closed-form solution, least squares classification has three serious drawbacks.
First, sensitivity to outliers. The regression targets are 0 or 1, but the model predicts real-valued $y_k(\mathbf{x})$. Because $y_k(\mathbf{x})$ is proportional to the signed distance from $\mathbf{x}$ to the decision boundary (Lecture 6.3), a target of 1 effectively demands that each class-$k$ example sit at a fixed distance of $1/\|\mathbf{w}_k\|$ from the boundary (with the norm taken over the non-bias weights). Points that lie far on the correct side of the boundary are penalized for being "too easily classified," and the fit shifts the boundary toward them, sometimes at the cost of misclassifying points that lie close to it.
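The effect can be reproduced on a tiny one-dimensional, two-class example. The sketch below is illustrative only: the data values and the helper names (`fit`, `boundary`, `predict`) are assumptions. It fits the model twice, once on four well-separated points and once after adding three class-1 points far on the correct side.

```python
import numpy as np

def fit(x, labels, K=2):
    """Least squares weights for 1D inputs with a bias column."""
    X_tilde = np.column_stack([np.ones_like(x), x])
    T = np.eye(K)[labels]
    return np.linalg.pinv(X_tilde) @ T

def boundary(W_tilde):
    """Location x where y_0(x) = y_1(x) in the 1D two-class case."""
    d = W_tilde[:, 1] - W_tilde[:, 0]  # (bias, slope) of y_1 - y_0
    return -d[0] / d[1]

def predict(W_tilde, x):
    X_tilde = np.column_stack([np.ones_like(x), x])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Base data: class 0 on the left, class 1 on the right.
x_base = np.array([-2.0, -1.0, 1.0, 2.0])
labels_base = np.array([0, 0, 1, 1])
W_base = fit(x_base, labels_base)

# Add three class-1 points far on the correct side of the boundary.
x_far = np.array([20.0, 21.0, 22.0])
x_all = np.concatenate([x_base, x_far])
labels_all = np.concatenate([labels_base, np.ones(3, dtype=int)])
W_all = fit(x_all, labels_all)

b_base = boundary(W_base)           # boundary at the midpoint, x = 0
b_all = boundary(W_all)             # boundary dragged toward the far points
pred_base = predict(W_base, x_base)
pred_all = predict(W_all, x_base)   # the class-1 point at x = 1 now flips
```

In this toy case the boundary moves from $x = 0$ to roughly $x \approx 1.16$, so the class-1 point at $x = 1$ is misclassified even though the added points were never in doubt.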
Second, masking. When one class lies geometrically between two others (e.g., three collinear class clouds), least squares often produces a weight matrix for which the middle class's discriminant is dominated everywhere by one of the two neighboring discriminants, so the middle class "disappears" and is never predicted. This masking effect is a structural consequence of fitting linear functions to one-hot targets, not something that more data or refitting can fix.
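Masking is easy to see in one dimension. The following sketch uses a small, symmetric, hand-picked data set (an assumption made so the result is deterministic) with three classes laid out left, middle, and right along a line.

```python
import numpy as np

# Three collinear classes: left, middle, right (values chosen for symmetry).
x = np.array([-6.0, -5.0, -4.0, -3.0,
              -1.5, -0.5, 0.5, 1.5,
               3.0, 4.0, 5.0, 6.0])
labels = np.repeat(np.arange(3), 4)

X_tilde = np.column_stack([np.ones_like(x), x])
T = np.eye(3)[labels]
W_tilde = np.linalg.pinv(X_tilde) @ T

# Predictions on the training points.
pred = np.argmax(X_tilde @ W_tilde, axis=1)

# Predictions on a grid that spans the data (the grid does not contain x = 0,
# where all three discriminants tie in this symmetric example).
grid = np.linspace(-8.0, 8.0, 160)
G_tilde = np.column_stack([np.ones_like(grid), grid])
grid_pred = np.argmax(G_tilde @ W_tilde, axis=1)

middle_slope = W_tilde[1, 1]  # slope of the middle class's discriminant
```

Here the middle discriminant is essentially flat at $1/3$ (its fitted slope is zero up to rounding), while the left and right discriminants are lines that cross it at $x = 0$. One of the outer classes therefore always wins, and the middle class is never predicted, either on the training points or on the grid.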
Third, the outputs are not probabilities. The $y_k$ values can be negative or greater than 1. There is one weak guarantee: because the one-hot targets sum to 1 across classes and the model includes a bias term, the least squares predictions satisfy $\sum_k y_k(\mathbf{x}) = 1$ for every $\mathbf{x}$. But individual $y_k$ can still fall outside $[0, 1]$, so the analogy to probabilities breaks down.
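Both properties can be checked numerically. The sketch below assumes synthetic Gaussian class clouds and a few hand-picked query points (all illustrative choices): it fits the least squares weights and then inspects the row sums and the range of the outputs.

```python
import numpy as np

# Synthetic data: three Gaussian clouds in 2D (assumed for illustration).
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
X = means[labels] + rng.normal(size=(90, 2))

X_tilde = np.hstack([np.ones((90, 1)), X])
T = np.eye(3)[labels]
W_tilde = np.linalg.pinv(X_tilde) @ T

# Query points, including two that lie well outside the training clouds.
X_new = np.array([[10.0, 0.0], [0.0, 10.0], [1.0, 1.0]])
X_new_tilde = np.hstack([np.ones((3, 1)), X_new])
Y_new = X_new_tilde @ W_tilde

row_sums = Y_new.sum(axis=1)  # each row sums to 1 (up to rounding)
has_negative = bool(Y_new.min() < 0.0)
has_above_one = bool(Y_new.max() > 1.0)
```

The row sums equal 1 at every query point, yet away from the training data some outputs drop below 0 and others exceed 1, which is why the $y_k$ cannot be read as class probabilities.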