Lecture 11.6

SVMs: Soft-Margin Classifiers

Real datasets are rarely perfectly separable. The soft-margin SVM introduces per-point slack variables that allow some training points to violate the margin, controlled by the hyperparameter $C$. The dual problem and kernel trick still apply, but the dual variables are now bounded.

Learning Objectives

Motivate slack variables as a way to handle overlapping class distributions.
Write the soft-margin SVM primal objective and constraints.
Derive the soft-margin dual and identify the box constraint $0 \leq a_n \leq C$.
Classify training points into three categories based on their dual variable value.
Explain the role of $C$ as a trade-off between margin width and misclassification penalty.

1. Motivation: Overlapping Distributions

When class-conditional distributions overlap, no linear boundary can separate the data perfectly. Forcing perfect separation either fails (the hard-margin problem has no feasible solution) or, when a flexible kernel makes separation possible, produces a narrow-margin, contorted boundary that overfits. The soft-margin SVM instead allows controlled violations of the margin.

2. Slack Variables

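Introduce a non-negative slack variable $\xi_n \geq 0$ for each training point and relax the margin constraint, where $y(\mathbf{x}_n) = \mathbf{w}^\top\mathbf{x}_n + b$ is the decision function and $t_n \in \{-1, +1\}$ is the target: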

Slack Variables and Soft Constraints $$t_n\,y(\mathbf{x}_n) \geq 1 - \xi_n, \quad \xi_n \geq 0.$$

$\xi_n = 0$: point is on or beyond the correct side of the margin — no violation.
$0 < \xi_n < 1$: point is within the margin but correctly classified.
$\xi_n \geq 1$: point is on the decision boundary ($\xi_n = 1$) or misclassified ($\xi_n > 1$).
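The three regimes can be checked numerically. The sketch below assumes a fixed linear separator and uses the smallest slack that satisfies both constraints, $\xi_n = \max(0,\, 1 - t_n\,y(\mathbf{x}_n))$, which is the value the slack takes at the optimum. The weights, bias, toy points, and function names are hypothetical and chosen only for illustration.

```python
def decision_value(w, b, x):
    """Linear decision function y(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def minimal_slack(w, b, x, t):
    """Smallest xi satisfying t * y(x) >= 1 - xi and xi >= 0."""
    return max(0.0, 1.0 - t * decision_value(w, b, x))


def slack_regime(xi):
    """Map a slack value to one of the regimes listed above."""
    if xi == 0.0:
        return "no violation"
    if xi < 1.0:
        return "inside margin, correctly classified"
    return "on the decision boundary or misclassified"


# Hypothetical separator and toy points (x, t), for illustration only.
w, b = (1.0, 0.0), 0.0
points = [
    ((2.0, 0.0), +1),   # well beyond the margin
    ((0.5, 1.0), +1),   # inside the margin, correct side
    ((-1.0, 0.0), +1),  # wrong side of the boundary
    ((-3.0, 0.0), -1),  # well beyond the margin
]

slacks = [minimal_slack(w, b, x, t) for x, t in points]
for (x, t), xi in zip(points, slacks):
    print(x, t, xi, slack_regime(xi))
```

Only the second and third points need non-zero slack here; the other two satisfy the original hard-margin constraint.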

3. Soft-Margin Primal

Soft-Margin SVM (Primal) $$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n$$

subject to $t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1 - \xi_n$ and $\xi_n \geq 0$ for all $n$.

The hyperparameter $C > 0$ sets the price of slack: a large $C$ penalizes violations heavily (approaching the hard-margin solution as $C \to \infty$), while a small $C$ tolerates many violations in exchange for a wider margin that may generalize better.
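To see the trade-off concretely, the sketch below evaluates the primal objective for two hand-picked candidate separators on a one-dimensional toy set containing one outlier. The candidates are assumptions made for illustration, not the output of an optimizer, and the slack for each point is taken as the smallest feasible value $\max(0,\, 1 - t_n(\mathbf{w}^\top\mathbf{x}_n + b))$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def primal_objective(w, b, X, t, C):
    """0.5 * ||w||^2 + C * sum of minimal slacks, for a fixed (w, b)."""
    slacks = [max(0.0, 1.0 - tn * (dot(w, xn) + b)) for xn, tn in zip(X, t)]
    return 0.5 * dot(w, w) + C * sum(slacks)


def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy 1-D data; the last point is a positive outlier sitting near the negatives.
X = [(2.0,), (1.0,), (-1.0,), (-2.0,), (-0.5,)]
t = [+1, +1, -1, -1, +1]

# Hypothetical candidates: "wide" ignores the outlier, "narrow" fits it.
candidates = {"wide": ((1.0,), 0.0), "narrow": ((4.0,), 3.0)}

for C in (1.0, 10.0):
    for name, (w, b) in candidates.items():
        print(C, name, margin_width(w), primal_objective(w, b, X, t, C))
```

For these two candidates the wide separator costs $0.5 + 1.5\,C$ and the narrow one costs $8$, so the wide margin has the lower objective when $C < 5$ and the narrow one when $C > 5$.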

4. Dual and Box Constraints

Introducing Lagrange multipliers $a_n \geq 0$ for the margin constraints and $\mu_n \geq 0$ for the non-negativity of $\xi_n$, and applying the KKT stationarity conditions, gives the same dual Lagrangian as the hard-margin case:

$$\max_{\mathbf{a}}\; \sum_n a_n - \frac{1}{2}\sum_{n,m} a_n a_m t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m)$$

The constraints differ, however. Stationarity in the slack, $\partial\mathcal{L}/\partial\xi_n = 0$, gives $a_n = C - \mu_n$; combined with dual feasibility ($\mu_n \geq 0$) this bounds each $a_n$ from above:

Box Constraint $$0 \leq a_n \leq C \quad \forall\, n, \qquad \sum_n a_n t_n = 0.$$
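The steps behind the box constraint can be written out explicitly. The following is a sketch of the standard derivation, using the multipliers $a_n$ and $\mu_n$ introduced above and writing $y(\mathbf{x}_n) = \mathbf{w}^\top\mathbf{x}_n + b$. The Lagrangian is

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \mathbf{a}, \boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n}\xi_n - \sum_{n} a_n\bigl\{t_n\,y(\mathbf{x}_n) - 1 + \xi_n\bigr\} - \sum_{n}\mu_n\xi_n.$$

Setting its derivatives with respect to the primal variables to zero gives

$$\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_n a_n t_n \mathbf{x}_n, \qquad \frac{\partial\mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_n a_n t_n = 0, \qquad \frac{\partial\mathcal{L}}{\partial\xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n.$$

Substituting back, the slack terms collect into $\sum_n (C - a_n - \mu_n)\,\xi_n = 0$, so $\boldsymbol{\xi}$ and $C$ vanish from the objective. $C$ survives only in the constraint $a_n = C - \mu_n \leq C$.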

The dual is otherwise identical to the hard-margin case. The kernel trick applies unchanged: the data enter only through $k(\mathbf{x}_n, \mathbf{x}_m)$, which equals $\mathbf{x}_n^\top\mathbf{x}_m$ in the linear case, so nonlinear soft-margin SVMs are obtained by substituting any valid kernel.
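A minimal sketch of the dual objective and its constraints follows, evaluated on a two-point toy problem ($x = +1$ with $t = +1$, $x = -1$ with $t = -1$) that can be solved by hand. The function names, the toy data, and the RBF width `gamma` are assumptions made for illustration. The code only evaluates the objective and checks feasibility; it is not a solver.

```python
import math


def linear_kernel(x, z):
    """k(x, z) = x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2); gamma is an illustrative choice."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def dual_objective(a, X, t, kernel=linear_kernel):
    """sum_n a_n - 0.5 * sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)."""
    n = len(a)
    quad = sum(
        a[i] * a[j] * t[i] * t[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(a) - 0.5 * quad


def is_dual_feasible(a, t, C, tol=1e-9):
    """Check the box constraint 0 <= a_n <= C and sum_n a_n t_n = 0."""
    in_box = all(-tol <= an <= C + tol for an in a)
    balanced = abs(sum(an * tn for an, tn in zip(a, t))) <= tol
    return in_box and balanced


X = [(1.0,), (-1.0,)]
t = [+1, -1]

a_hard = [0.5, 0.5]    # hard-margin optimum for this toy problem
a_soft = [0.25, 0.25]  # optimum once the box constraint with C = 0.25 binds

print(is_dual_feasible(a_hard, t, C=0.25))  # False: violates a_n <= C
print(is_dual_feasible(a_soft, t, C=0.25))  # True
print(dual_objective(a_hard, X, t))         # 0.5
print(dual_objective(a_soft, X, t))         # 0.375
print(dual_objective(a_soft, X, t, kernel=rbf_kernel))  # same code, different kernel
```

With the linear kernel and $C = 0.25$, the dual value $0.375$ matches the primal objective at $w = 0.5$, $b = 0$, $\xi_1 = \xi_2 = 0.5$, namely $\tfrac{1}{2}(0.5)^2 + 0.25 \cdot 1 = 0.375$.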

5. Three Types of Training Points

At the optimum, KKT conditions classify each training point into one of three regimes:

$a_n = 0$: point lies on or outside the margin on the correct side, with $\xi_n = 0$. It does not contribute to predictions.
$0 < a_n < C$: point lies exactly on the margin. Here $\mu_n = C - a_n > 0$, so complementary slackness ($\mu_n \xi_n = 0$) forces $\xi_n = 0$. These are the classical support vectors.
$a_n = C$: here $\mu_n = 0$, so $\xi_n$ is free to be positive; the point generally lies inside the margin or is misclassified. These points also contribute to predictions; they are "soft" support vectors.
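In code, this three-way split amounts to comparing each dual variable against $0$ and $C$ with a small tolerance, since numerical solvers rarely return exact zeros or exact bounds. The sketch below assumes a vector of dual variables is already available; the function name, the labels, and the tolerance are illustrative choices.

```python
def classify_by_dual(a, C, tol=1e-8):
    """Label each training point by the regime of its dual variable a_n."""
    labels = []
    for an in a:
        if an <= tol:
            labels.append("non-support vector")    # a_n = 0
        elif an >= C - tol:
            labels.append("bound support vector")  # a_n = C
        else:
            labels.append("margin support vector") # 0 < a_n < C
    return labels


# Hypothetical dual variables for four training points with C = 1.
a = [0.0, 0.3, 1.0, 1e-12]
print(classify_by_dual(a, C=1.0))
```

Only the points labelled as margin or bound support vectors enter the prediction sum, which is what keeps prediction sparse.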

Role of $C$

$C \to \infty$: no slack is tolerated, which recovers the hard-margin SVM (and has no solution if the data are not separable). $C \to 0$: slack is nearly free, so $\|\mathbf{w}\|$ shrinks, the margin widens without bound, and most or all training points end up inside it as support vectors, which removes the sparsity that makes prediction cheap. In practice, $C$ is selected by cross-validation.
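Both limits can be seen on a toy problem small enough to solve by hand: two points, $x = +1$ with $t = +1$ and $x = -1$ with $t = -1$, under a linear kernel. The constraint $\sum_n a_n t_n = 0$ forces $a_1 = a_2 = a$, the dual objective reduces to $2a - 2a^2$ on $[0, C]$, and its maximizer is $a = \min(C, 0.5)$. The sketch below encodes that closed form, taking $b = 0$ by symmetry; it applies to this toy problem only and the function name is hypothetical.

```python
def two_point_soft_margin(C):
    """Closed-form soft-margin solution for x=+1 (t=+1) and x=-1 (t=-1)."""
    a = min(C, 0.5)  # maximizer of 2a - 2a^2 subject to 0 <= a <= C
    w = 2.0 * a      # w = sum_n a_n t_n x_n = a + a
    return {
        "a": a,
        "w": w,
        "margin_half_width": 1.0 / w,  # distance from boundary to margin
        "slack": max(0.0, 1.0 - w),    # same for both points when b = 0
        "at_bound": a >= C,            # True when the box constraint binds
    }


for C in (0.01, 0.1, 0.5, 10.0):
    print(C, two_point_soft_margin(C))
```

For $C \geq 0.5$ the result is the hard-margin solution with zero slack. As $C$ shrinks below $0.5$ both points sit at the bound $a_n = C$ and the margin half-width grows as $1/(2C)$.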