Lecture 11.3

SVMs: Maximum Margin Classifiers

Among all linear classifiers that correctly separate the training data, the one with the maximum margin (the largest gap between the boundary and the nearest data points) is the most robust to small perturbations of the data. This geometric principle leads to support vector machines.

Learning Objectives

Define the signed distance from a data point to a linear decision boundary.
Define the margin and explain why maximizing it leads to better generalization.
Derive the hard-margin SVM as a constrained quadratic program: minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1$.
Identify support vectors as the training points that lie exactly on the margin.

1. Linear Classifier and Signed Distance

Consider a binary classifier with targets $t_n \in \{-1, +1\}$ and the linear discriminant $y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$. From Lecture 6.3, the signed distance of a point $\mathbf{x}_n$ to the decision surface $\{y = 0\}$ is

$$r_n = \frac{t_n\, y(\mathbf{x}_n)}{\|\mathbf{w}\|}.$$

A value $r_n > 0$ means the point lies on the correct side of the boundary, and $|r_n|$ is its perpendicular distance from the boundary.
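The formula for $r_n$ can be checked numerically. The following is a minimal sketch in plain Python; the function name `signed_distance` and the example boundary are illustrative assumptions, not part of the lecture.

```python
import math

def signed_distance(w, b, x, t):
    """Return r = t * (w.x + b) / ||w|| for a point x with label t in {-1, +1}."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + b      # linear discriminant y(x)
    norm_w = math.sqrt(sum(wi * wi for wi in w))      # Euclidean norm of w
    return t * y / norm_w

# Hypothetical boundary x1 + x2 - 1 = 0 in two dimensions.
w, b = [1.0, 1.0], -1.0
r_correct = signed_distance(w, b, [2.0, 2.0], +1)   # label agrees with sign of y(x)
r_wrong = signed_distance(w, b, [2.0, 2.0], -1)     # label disagrees with sign of y(x)
```

Here `r_correct` is positive and `r_wrong` is its negation: the same point at the same geometric distance, but on the wrong side for its label.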

2. The Margin

Margin

The margin is the perpendicular distance from the decision boundary to the closest training point:

$$\text{margin} = \min_n r_n = \min_n \frac{t_n\,y(\mathbf{x}_n)}{\|\mathbf{w}\|}.$$

Maximizing the margin yields the most stable classifier: a small perturbation to the data or boundary is unlikely to flip a classification.
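To make the definition concrete, the sketch below computes the margin of two different separating boundaries on a small toy set. The data, the two boundaries, and the helper name `margin` are assumptions chosen for illustration.

```python
import math

def margin(w, b, X, T):
    """Smallest value of t_n * y(x_n) / ||w|| over the training set."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        t * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
        for x, t in zip(X, T)
    )

# Toy separable data: two negative points on the left, two positive on the right.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]]
T = [-1, -1, +1, +1]

margin_a = margin([1.0, 0.0], -2.0, X, T)   # boundary x1 = 2, midway between classes
margin_b = margin([1.0, 0.0], -1.2, X, T)   # boundary x1 = 1.2, close to a negative point
```

Both boundaries classify every training point correctly, but the second passes much closer to the point at $(1, 0)$, so its margin is smaller.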

3. Canonical Form and the Hard-Margin Constraint

Scaling $\mathbf{w}$ and $b$ by the same positive constant changes neither the decision boundary nor the distances $r_n$. We use this freedom to fix a canonical scale: require that the closest point(s) satisfy $t_n\,y(\mathbf{x}_n) = 1$. It follows that $t_n\,y(\mathbf{x}_n) \geq 1$ for all $n$, and the margin becomes simply $1/\|\mathbf{w}\|$.
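The rescaling step can be sketched in a few lines. The helper `canonicalize` and the toy data are hypothetical, and the sketch assumes the given $(\mathbf{w}, b)$ already separates the data, so the scale factor is positive.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def canonicalize(w, b, X, T):
    """Rescale (w, b) so that the closest point satisfies t * y(x) = 1."""
    scale = min(t * (dot(w, x) + b) for x, t in zip(X, T))
    if scale <= 0:
        raise ValueError("(w, b) does not separate the data")
    return [wi / scale for wi in w], b / scale

X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]]
T = [-1, -1, +1, +1]

# Boundary x1 = 2 written with an arbitrary scale factor of 5.
w_c, b_c = canonicalize([5.0, 0.0], -10.0, X, T)
functional = [t * (dot(w_c, x) + b_c) for x, t in zip(X, T)]
geometric_margin = 1.0 / math.sqrt(dot(w_c, w_c))
```

After rescaling, the smallest entry of `functional` equals 1 and the margin is read off directly as $1/\|\mathbf{w}\|$.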

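Because $\|\mathbf{w}\| > 0$ and squaring is monotonic on positive numbers, maximizing $1/\|\mathbf{w}\|$ is equivalent to minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2$ (the factor $\tfrac{1}{2}$ only simplifies later derivatives), giving the hard-margin SVM: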

Hard-Margin SVM (Primal)

$$\min_{\mathbf{w},\,b}\; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to} \quad t_n\bigl(\mathbf{w}^\top\mathbf{x}_n + b\bigr) \geq 1 \quad \forall\, n.$$

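This is a convex quadratic program with linear constraints. When the training data are linearly separable, the feasible set is non-empty and the minimizing $\mathbf{w}$ is unique, so standard QP solvers find the global optimum.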
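In the smallest nontrivial case the program can be solved by hand. As a sketch, assume a training set with exactly one point per class, called `x_plus` and `x_minus` here (hypothetical names). The maximum-margin boundary is then the perpendicular bisector of the segment joining them, and the canonical solution is $\mathbf{w} = 2(\mathbf{x}_+ - \mathbf{x}_-)/\|\mathbf{x}_+ - \mathbf{x}_-\|^2$ with $b = -\mathbf{w}^\top(\mathbf{x}_+ + \mathbf{x}_-)/2$. Larger problems would be handed to a QP solver instead.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def two_point_hard_margin(x_plus, x_minus):
    """Closed-form hard-margin solution for one positive and one negative point."""
    diff = [p - m for p, m in zip(x_plus, x_minus)]
    d2 = dot(diff, diff)                        # squared distance between the points
    w = [2.0 * di / d2 for di in diff]          # direction x_plus - x_minus, canonical scale
    mid = [(p + m) / 2.0 for p, m in zip(x_plus, x_minus)]
    b = -dot(w, mid)                            # boundary passes through the midpoint
    return w, b

x_plus, x_minus = [3.0, 1.0], [1.0, 1.0]
w, b = two_point_hard_margin(x_plus, x_minus)

y_plus = dot(w, x_plus) + b        # constraint for the positive point: +1 * y_plus
y_minus = dot(w, x_minus) + b      # constraint for the negative point: -1 * y_minus
margin = 1.0 / math.sqrt(dot(w, w))
half_distance = math.sqrt(dot([p - m for p, m in zip(x_plus, x_minus)],
                              [p - m for p, m in zip(x_plus, x_minus)])) / 2.0
```

Both constraints hold with equality, and the margin equals half the distance between the two points, as the geometry suggests.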

4. Support Vectors

Support Vectors

Training points for which the constraint holds with equality, $t_n\,y(\mathbf{x}_n) = 1$, are called support vectors. They lie exactly on the margin boundary. All other points satisfy a strict inequality and do not influence the decision boundary. The solution $(\mathbf{w}, b)$ is entirely determined by the support vectors.
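Given a canonical solution, the support vectors can be found by testing which constraints are active. The sketch below uses a toy data set and a hand-derived solution $(\mathbf{w}, b)$ for it; both are assumptions for illustration, and the tolerance `tol` stands in for exact equality under floating-point arithmetic.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def support_vector_indices(w, b, X, T, tol=1e-9):
    """Indices n where the constraint t_n * y(x_n) >= 1 holds with equality."""
    return [
        n for n, (x, t) in enumerate(zip(X, T))
        if abs(t * (dot(w, x) + b) - 1.0) <= tol
    ]

X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]]
T = [-1, -1, +1, +1]
w, b = [1.0, 0.0], -2.0     # canonical maximum-margin solution for this toy set

sv = support_vector_indices(w, b, X, T)
# Constraint values of the remaining points, which should all exceed 1.
slack_free = [T[n] * (dot(w, X[n]) + b) for n in range(len(X)) if n not in sv]
```

Only the points at $(1, 0)$ and $(3, 0)$ sit on the margin boundary. The other two satisfy the constraint strictly.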

Why the Margin Classifier Is Preferred

Any linear classifier that separates the training classes perfectly achieves zero training error, but a classifier with a small margin places its decision boundary very close to some training points. A new point that falls only slightly beyond one of those training points lands on the wrong side and is misclassified. The maximum-margin classifier maximizes this buffer zone and is therefore much less sensitive to small variations in individual data points.

5. Towards Kernelized SVMs

The hard-margin primal involves $\mathbf{x}_n$ explicitly. By solving the corresponding dual problem (Lectures 11.4–11.5), predictions take the kernel form $y(\mathbf{x}) = \sum_n a_n t_n k(\mathbf{x}_n, \mathbf{x}) + b$, and the kernel trick enables nonlinear decision boundaries. The dual also reveals the sparsity of the solution: only support vectors have $a_n \neq 0$.
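The kernel form of the prediction can be sketched as follows. The coefficients `a` below are hand-picked for a toy set rather than obtained from a dual solver, and the check against the primal relies on the standard relation $\mathbf{w} = \sum_n a_n t_n \mathbf{x}_n$, which is assumed here and derived in the dual lectures. The names `predict_dual` and `linear_kernel` are hypothetical.

```python
def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def predict_dual(x, X, T, a, b, kernel):
    """Evaluate y(x) = sum_n a_n * t_n * k(x_n, x) + b."""
    return sum(a_n * t_n * kernel(x_n, x) for x_n, t_n, a_n in zip(X, T, a)) + b

X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]]
T = [-1, -1, +1, +1]
a = [0.0, 0.5, 0.5, 0.0]    # assumed coefficients: nonzero only for the two margin points
b = -2.0

# With a linear kernel the dual form collapses to the primal w (assumed relation).
w = [sum(a_n * t_n * x_n[d] for x_n, t_n, a_n in zip(X, T, a)) for d in range(2)]

x_new = [4.0, 1.0]
y_dual = predict_dual(x_new, X, T, a, b, linear_kernel)
y_primal = linear_kernel(w, x_new) + b
```

Replacing `linear_kernel` with any other valid kernel function changes the decision boundary without changing `predict_dual`. Points with $a_n = 0$ contribute nothing to the sum, which is the sparsity noted above.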