Lecture 11.6
SVMs: Soft-Margin Classifiers
Real datasets are rarely perfectly separable. The soft-margin SVM introduces per-point slack variables that allow some training points to violate the margin, with the total violation penalized through the hyperparameter $C$. The dual problem and kernel trick still apply, but each dual variable is now bounded above by $C$.
- Motivate slack variables as a way to handle overlapping class distributions.
- Write the soft-margin SVM primal objective and constraints.
- Derive the soft-margin dual and identify the box constraint $0 \leq a_n \leq C$.
- Classify training points into three categories based on their dual variable value.
- Explain the role of $C$ as a trade-off between margin width and misclassification penalty.
1. Motivation: Overlapping Distributions
When class-conditional distributions overlap, the training set is generally not linearly separable. Forcing perfect separation then either fails (the hard-margin problem has no feasible solution) or, with a sufficiently flexible kernel, succeeds only with a very narrow margin and a contorted boundary that overfits. The soft-margin SVM allows controlled violations instead.
2. Slack Variables
Introduce a non-negative slack variable $\xi_n \geq 0$ for each training point:
- $\xi_n = 0$: point is on the margin or beyond it on the correct side; no violation.
- $0 < \xi_n < 1$: point is inside the margin but correctly classified.
- $\xi_n \geq 1$: point lies on the decision boundary ($\xi_n = 1$) or is misclassified ($\xi_n > 1$).
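The cases above can be checked numerically. The following is a minimal sketch, assuming a linear decision function $\mathbf{w}^\top\mathbf{x} + b$ and the smallest slack that satisfies the margin constraint, $\xi_n = \max(0,\, 1 - t_n(\mathbf{w}^\top\mathbf{x}_n + b))$. The function names `slack` and `regime` and the example numbers are illustrative, not part of the lecture.

```python
def slack(w, b, x, t):
    """Smallest feasible slack for one point: max(0, 1 - t * (w.x + b)).

    t is the label in {-1, +1}; w and x are equal-length sequences.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - t * score)


def regime(xi):
    """Map a slack value to one of the cases in the list above."""
    if xi == 0.0:
        return "no violation"
    if xi < 1.0:
        return "inside margin, correctly classified"
    return "on the decision boundary or misclassified"


if __name__ == "__main__":
    w, b = (1.0, 0.0), 0.0  # hypothetical weights, chosen for illustration
    for x, t in [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), 1)]:
        xi = slack(w, b, x, t)
        print(x, t, xi, regime(xi))
```

With these weights the three positive-class points land in the three different cases, one each.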
3. Soft-Margin Primal
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_n \xi_n$$
subject to $t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1 - \xi_n$ and $\xi_n \geq 0$ for all $n$.
The hyperparameter $C > 0$ penalizes slack: large $C$ demands few violations (approaching the hard-margin solution); small $C$ tolerates many violations in exchange for a wider margin that may generalize better.
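To make the trade-off concrete, here is a small sketch that evaluates the standard soft-margin objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_n \xi_n$ for two hand-picked candidate solutions on overlapping 1-D data. The data, the candidates, and the name `primal_objective` are assumptions made for illustration. Each $\xi_n$ is set to its smallest feasible value, $\max(0,\, 1 - t_n(\mathbf{w}^\top\mathbf{x}_n + b))$.

```python
def primal_objective(w, b, X, T, C):
    """0.5 * ||w||^2 + C * sum of slacks, each slack at its smallest feasible value."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_sum = 0.0
    for x, t in zip(X, T):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack_sum += max(0.0, 1.0 - t * score)
    return margin_term + C * slack_sum


# Toy 1-D data with overlap: -0.5 is labelled +1 and 0.5 is labelled -1.
X_1D = [(2.0,), (1.0,), (-0.5,), (-2.0,), (-1.0,), (0.5,)]
T_1D = [1, 1, 1, -1, -1, -1]

NARROW = ((1.0,), 0.0)  # larger ||w||, narrower margin, less total slack
WIDE = ((0.5,), 0.0)    # smaller ||w||, wider margin, more total slack

if __name__ == "__main__":
    for C in (0.1, 1.0, 10.0):
        narrow = primal_objective(*NARROW, X_1D, T_1D, C)
        wide = primal_objective(*WIDE, X_1D, T_1D, C)
        print(C, narrow, wide, "wide wins" if wide < narrow else "narrow wins")
```

On this data the wide-margin candidate has the lower objective at $C = 0.1$ and the narrow-margin candidate has the lower objective at $C = 1$ and $C = 10$. Neither candidate is claimed to be the optimum; the comparison only shows how $C$ shifts the balance between the two terms.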
4. Dual and Box Constraints
Introducing Lagrange multipliers $a_n \geq 0$ for the margin constraints and $\mu_n \geq 0$ for the non-negativity of $\xi_n$, and applying the KKT stationarity conditions, gives the same dual Lagrangian as the hard-margin case:
$$\max_{\mathbf{a}}\; \sum_n a_n - \frac{1}{2}\sum_{n,m} a_n a_m t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m)$$
but with the additional constraint from $\partial\mathcal{L}/\partial\xi_n = 0$ (which gives $a_n = C - \mu_n$) and dual feasibility ($\mu_n \geq 0$):
$$0 \leq a_n \leq C \quad \text{for all } n.$$
The dual is otherwise identical to the hard-margin case. The kernel trick applies unchanged, so nonlinear soft-margin SVMs are obtained by replacing $\mathbf{x}_n^\top\mathbf{x}_m$ with $k(\mathbf{x}_n, \mathbf{x}_m)$.
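For reference, the stationarity step can be written out in full. This is a sketch of the standard derivation, assuming the primal objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_n \xi_n$ with the multipliers $a_n$ and $\mu_n$ named above, and the linear case where $k(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^\top\mathbf{x}_m$. The Lagrangian is

$$\mathcal{L} = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_n \xi_n - \sum_n a_n\left\{ t_n(\mathbf{w}^\top\mathbf{x}_n + b) - 1 + \xi_n \right\} - \sum_n \mu_n \xi_n .$$

Setting its derivatives with respect to the primal variables to zero gives

$$\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_n a_n t_n \mathbf{x}_n, \qquad \frac{\partial\mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_n a_n t_n = 0, \qquad \frac{\partial\mathcal{L}}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n .$$

Substituting these back, the terms in $\xi_n$ cancel because $C - a_n - \mu_n = 0$, which is why the dual objective has the same form as in the hard-margin case. The only trace of the slack variables is the upper bound: $\mu_n \geq 0$ and $a_n = C - \mu_n$ together give $a_n \leq C$, and combined with $a_n \geq 0$ this is the box constraint. The equality constraint $\sum_n a_n t_n = 0$ carries over unchanged.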
5. Three Types of Training Points
At the optimum, KKT conditions classify each training point into one of three regimes:
- $a_n = 0$: point lies outside the margin with $\xi_n = 0$. Does not contribute to predictions.
- $0 < a_n < C$: point lies exactly on the margin with $\xi_n = 0$ (by complementary slackness on $\mu_n$). These are the classical support vectors.
- $a_n = C$: then $\mu_n = 0$, so $\xi_n$ is no longer forced to zero and the point can lie inside the margin or be misclassified. These points also contribute to predictions; they are "soft" support vectors.
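The three regimes translate directly into a check on the dual variables returned by a solver. The sketch below assumes the usual kernel decision function $y(\mathbf{x}) = \sum_n a_n t_n k(\mathbf{x}_n, \mathbf{x}) + b$ and a small tolerance for floating-point comparisons. The function names, the category labels, and the dual values in the example are hand-picked for illustration and do not come from a solved problem.

```python
def linear_kernel(u, v):
    """Plain dot product; any valid kernel function could be passed instead."""
    return sum(ui * vi for ui, vi in zip(u, v))


def point_type(a_n, C, tol=1e-8):
    """Classify a training point by its dual variable a_n in [0, C]."""
    if a_n <= tol:
        return "not a support vector"   # a_n = 0
    if a_n >= C - tol:
        return "soft support vector"    # a_n = C
    return "margin support vector"      # 0 < a_n < C


def decision_value(x, A, T, X, b, kernel=linear_kernel, tol=1e-8):
    """y(x) = sum_n a_n t_n k(x_n, x) + b, skipping points with a_n = 0."""
    total = b
    for a_n, t_n, x_n in zip(A, T, X):
        if a_n > tol:
            total += a_n * t_n * kernel(x_n, x)
    return total


if __name__ == "__main__":
    C = 1.0
    A = [0.0, 0.25, 1.0, 0.75]  # illustrative values satisfying sum(a_n * t_n) = 0
    T = [1, -1, 1, -1]
    X = [(3.0,), (-1.0,), (0.5,), (-0.5,)]
    print([point_type(a, C) for a in A])
    print(decision_value((2.0,), A, T, X, b=0.0))
```

Only points with $a_n > 0$ enter the sum in `decision_value`, which is the sparsity that makes prediction cheap when few training points are support vectors.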
6. The Role of $C$
- $C \to \infty$: no slack is tolerated; this recovers the hard-margin SVM (or fails if the data are not separable).
- $C \to 0$: slack costs almost nothing; the margin grows without bound and essentially all points become support vectors, losing the computational advantage of sparsity.
In practice, $C$ is selected by cross-validation.
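The effect of $C$ on margin width and sparsity can be seen on a toy problem. The sketch below is not the solver used in practice and departs from the formulation in these notes in one labeled way: the bias $b$ is absorbed by appending a constant feature of 1 to every input, so $b$ is regularized along with $\mathbf{w}$ and the equality constraint $\sum_n a_n t_n = 0$ drops out. That leaves a box-constrained dual that simple coordinate ascent can handle: each $a_n$ is updated in turn and clipped to $[0, C]$. The linear kernel, the toy data, and the names `fit_dual_cd` and `weight_norm` are assumptions made for illustration.

```python
import math


def fit_dual_cd(X, T, C, epochs=500):
    """Coordinate ascent on the box-constrained dual of a linear soft-margin SVM.

    Assumption: the bias is absorbed by appending a constant 1 feature, so the
    returned w has one extra entry (the bias) at the end.
    """
    Xa = [tuple(x) + (1.0,) for x in X]
    a = [0.0] * len(Xa)
    w = [0.0] * len(Xa[0])  # maintained as sum_n a_n t_n x_n
    for _ in range(epochs):
        for i, (x, t) in enumerate(zip(Xa, T)):
            q_ii = sum(v * v for v in x)  # k(x_i, x_i), positive due to the 1 feature
            # Gradient of the negated dual objective with respect to a_i.
            grad = t * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            new_a = min(max(a[i] - grad / q_ii, 0.0), C)  # clip to the box [0, C]
            delta = new_a - a[i]
            if delta != 0.0:
                w = [wj + delta * t * xj for wj, xj in zip(w, x)]
                a[i] = new_a
    return a, w


def weight_norm(w):
    """Norm of the weight part only; the last entry is the absorbed bias."""
    return math.sqrt(sum(v * v for v in w[:-1]))


# Toy 2-D data with overlap: (0.5, 0.5) is labelled +1 and (1, 1) is labelled -1.
TOY_X = [(2.0, 2.0), (3.0, 3.0), (2.5, 1.5), (0.5, 0.5),
         (-2.0, -2.0), (-3.0, -3.0), (-1.5, -2.5), (1.0, 1.0)]
TOY_T = [1, 1, 1, 1, -1, -1, -1, -1]

if __name__ == "__main__":
    for C in (0.01, 1.0):
        a, w = fit_dual_cd(TOY_X, TOY_T, C)
        at_bound = sum(1 for v in a if v >= C - 1e-8)
        nonzero = sum(1 for v in a if v > 1e-8)
        print(f"C={C}: margin ~ {1.0 / weight_norm(w):.3f}, "
              f"support vectors={nonzero}, at bound={at_bound}")
```

With $C = 0.01$ every dual variable ends at the upper bound, so all eight points are support vectors and the margin is wide. With $C = 1$ the weight norm is larger, so the margin is narrower. Because of the bias treatment, the numbers will differ slightly from those of a solver that handles $b$ exactly.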