Lecture 11.3
SVMs: Maximum Margin Classifiers
Among all linear classifiers that correctly separate the training data, the one with the maximum margin, that is, the largest gap between the boundary and the nearest data points, is the most robust to perturbations. This geometric principle leads to support vector machines.
- Define the signed distance from a data point to a linear decision boundary.
- Define the margin and explain why maximizing it leads to better generalization.
- Derive the hard-margin SVM as a constrained quadratic program: minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1$.
- Identify support vectors as the training points that lie exactly on the margin.
1. Linear Classifier and Signed Distance
Consider a binary classifier with targets $t_n \in \{-1, +1\}$ and the linear discriminant $y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$. From Lecture 6.3, the signed distance of a point $\mathbf{x}_n$ to the decision surface $\{y = 0\}$ is
$$r_n = \frac{t_n\, y(\mathbf{x}_n)}{\|\mathbf{w}\|}.$$
$r_n > 0$ means the point is on the correct side; the magnitude measures how far.
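The formula for $r_n$ can be checked numerically. The following is a minimal sketch in plain Python; the helper name `signed_distance` and the boundary $3x_1 + 4x_2 - 5 = 0$ are illustrative assumptions, not part of the lecture.

```python
import math

def signed_distance(w, b, x, t):
    """Return r = t * (w.x + b) / ||w|| for one labelled point."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + b
    return t * y / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical boundary 3*x1 + 4*x2 - 5 = 0, chosen so that ||w|| = 5.
w, b = [3.0, 4.0], -5.0

# The same point with the two possible labels: y(x) = 9 + 16 - 5 = 20.
r_correct = signed_distance(w, b, [3.0, 4.0], +1)  # label agrees with sign of y
r_wrong = signed_distance(w, b, [3.0, 4.0], -1)    # label disagrees with sign of y
```

Here `r_correct` is positive and `r_wrong` is its negative, which matches the reading of the sign of $r_n$ given above.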
2. The Margin
The margin is the perpendicular distance from the decision boundary to the closest training point:
$$\text{margin} = \min_n r_n = \min_n \frac{t_n\,y(\mathbf{x}_n)}{\|\mathbf{w}\|}.$$
Maximizing the margin yields the most stable classifier: a small perturbation to the data or boundary is unlikely to flip a classification.
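To make the definition concrete, the sketch below computes the margin of two candidate boundaries on the same toy data. The data set, the two boundaries, and the helper name `margin` are assumptions made for illustration; neither boundary is claimed to be the maximum-margin one.

```python
import math

def margin(w, b, X, T):
    """Smallest value of t_n * (w.x_n + b) / ||w|| over the data set."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(t * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, t in zip(X, T))

# Hypothetical toy data: two points per class.
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]]
T = [+1, +1, -1, -1]

m_diag = margin([1.0, 1.0], 0.0, X, T)  # boundary x1 + x2 = 0
m_vert = margin([1.0, 0.0], 0.0, X, T)  # boundary x1 = 0
```

Both boundaries separate the data, but `m_diag` is larger than `m_vert`, so of these two the diagonal boundary leaves more room around its closest training points.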
3. Canonical Form and the Hard-Margin Constraint
Scaling $\mathbf{w}$ and $b$ together by any positive constant does not change the decision boundary. We use this freedom to fix a canonical scale: require that the closest point(s) satisfy $t_n\,y(\mathbf{x}_n) = 1$. Every other point is at least as far from the boundary, so $t_n\,y(\mathbf{x}_n) \geq 1$ for all $n$, and the margin becomes simply $1/\|\mathbf{w}\|$.
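The rescaling step can be sketched as follows. The helper name `canonicalize` and the toy data are assumptions for illustration; the function only rescales a given separating $(\mathbf{w}, b)$ and does not search for the best one.

```python
import math

def canonicalize(w, b, X, T):
    """Rescale (w, b) so that the closest point has t_n * y(x_n) = 1."""
    s = min(t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for x, t in zip(X, T))
    if s <= 0:
        raise ValueError("(w, b) does not separate the data")
    return [wi / s for wi in w], b / s

# Hypothetical toy data and an arbitrarily scaled separating boundary.
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]]
T = [+1, +1, -1, -1]

w_c, b_c = canonicalize([4.0, 4.0], 0.0, X, T)

# In canonical form the margin is 1 / ||w||.
margin_c = 1.0 / math.sqrt(sum(wi * wi for wi in w_c))
```

The boundary $x_1 + x_2 = 0$ is unchanged by the rescaling; only the representation $(\mathbf{w}, b)$ changes, and `margin_c` agrees with the geometric distance from the boundary to the closest point.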
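Maximizing $1/\|\mathbf{w}\|$ is equivalent to minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2$, giving the hard-margin SVM:
$$\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1 \ \text{ for all } n.$$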
This is a convex quadratic program with linear constraints. When the training data are linearly separable, the constraints can all be satisfied, the minimum is unique, and standard QP solvers can find it.
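A small worked example may help fix ideas. Take a hypothetical two-point data set (chosen here for illustration, not taken from the lecture): $\mathbf{x}_1 = (1, 1)^\top$ with $t_1 = +1$ and $\mathbf{x}_2 = (-1, -1)^\top$ with $t_2 = -1$. Writing $\mathbf{w} = (w_1, w_2)^\top$, the two constraints $t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1$ read
$$w_1 + w_2 + b \geq 1, \qquad w_1 + w_2 - b \geq 1.$$
Adding them gives $w_1 + w_2 \geq 1$. The shortest $\mathbf{w}$ satisfying this is $w_1 = w_2 = \tfrac{1}{2}$, and with $w_1 + w_2 = 1$ the two constraints reduce to $b \geq 0$ and $b \leq 0$, so $b = 0$. The resulting margin is
$$\frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{1/4 + 1/4}} = \sqrt{2},$$
which is half the distance $2\sqrt{2}$ between the two points, as expected for a boundary placed midway between them.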
4. Support Vectors
Training points for which the constraint holds with equality, $t_n\,y(\mathbf{x}_n) = 1$, are called support vectors. They lie exactly on the margin boundary. All other points satisfy a strict inequality and do not influence the decision boundary. The solution $(\mathbf{w}, b)$ is entirely determined by the support vectors.
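Given a canonical solution, the support vectors can be read off by testing which constraints are tight. The sketch below assumes a hypothetical four-point data set whose hard-margin solution is $\mathbf{w} = (\tfrac{1}{2}, \tfrac{1}{2})^\top$, $b = 0$; this holds because that pair is optimal for the two points $(1,1)$ and $(-1,-1)$ alone and remains feasible once the other two points are added. The helper name and the tolerance `tol` are illustrative choices.

```python
def support_vector_indices(w, b, X, T, tol=1e-9):
    """Indices n where t_n * (w.x_n + b) equals 1 up to a tolerance."""
    return [n for n, (x, t) in enumerate(zip(X, T))
            if abs(t * (sum(wi * xi for wi, xi in zip(w, x)) + b) - 1.0) <= tol]

# Hypothetical toy data; (w, b) is the hard-margin solution assumed above.
X = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -2.0]]
T = [+1, +1, -1, -1]
w, b = [0.5, 0.5], 0.0

sv = support_vector_indices(w, b, X, T)  # points 0 and 2 lie on the margin
```

A tolerance is used because a numerical solver rarely returns constraint values that equal 1 exactly.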
Any linear classifier that separates the classes perfectly is valid, but classifiers with small margins have decision boundaries very close to some training points. A new point that falls only slightly beyond those training points can then land on the wrong side of the boundary and be misclassified. The maximum-margin classifier maximizes the buffer zone and is therefore much less sensitive to individual data points.
5. Towards Kernelized SVMs
The hard-margin primal involves $\mathbf{x}_n$ explicitly. By solving the corresponding dual problem (Lecture 11.4–11.5), predictions take the kernel form $y(\mathbf{x}) = \sum_n a_n t_n k(\mathbf{x}_n, \mathbf{x}) + b$, and the kernel trick enables nonlinear decision boundaries. The dual also reveals the sparsity of the solution: only support vectors have $a_n \neq 0$.
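The kernel form of the prediction can be sketched as below. The multipliers `a` are assumed values, hand-picked for this hypothetical toy data so that $\sum_n a_n t_n \mathbf{x}_n = (\tfrac{1}{2}, \tfrac{1}{2})^\top$; they are not derived here, since the dual is the subject of the later lectures. A linear kernel is used so the result can be compared with $\mathbf{w}^\top\mathbf{x} + b$ by hand.

```python
def linear_kernel(u, v):
    """k(u, v) = u.v; any other kernel function could be passed instead."""
    return sum(ui * vi for ui, vi in zip(u, v))

def predict(x, X, T, a, b, kernel=linear_kernel):
    """y(x) = sum_n a_n t_n k(x_n, x) + b, skipping terms with a_n = 0."""
    return sum(an * tn * kernel(xn, x)
               for xn, tn, an in zip(X, T, a) if an != 0.0) + b

# Hypothetical toy data; a is nonzero only for the two points on the margin.
X = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -2.0]]
T = [+1, +1, -1, -1]
a = [0.25, 0.0, 0.25, 0.0]  # assumed multipliers, not computed here
b = 0.0

y_sv = predict([1.0, 1.0], X, T, a, b)   # evaluated at a support vector
y_new = predict([4.0, 0.0], X, T, a, b)  # evaluated at an unseen point
```

Only two of the four training points enter the sum, which illustrates the sparsity mentioned above: prediction cost scales with the number of support vectors rather than with the size of the training set.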