Lecture 11.5

SVMs: Kernel SVM

Applying dual Lagrangian optimization to the maximum-margin classifier yields the SVM dual — a problem that naturally takes kernel form and produces a sparse solution depending only on the support vectors.

Learning Objectives

Derive the SVM dual Lagrangian from the primal by eliminating $\mathbf{w}$ and $b$.
State the dual optimization problem and identify its constraints.
Apply the kernel trick to obtain kernel SVMs with nonlinear decision boundaries.
Explain why KKT complementary slackness implies sparsity: only support vectors have $a_n \neq 0$.
Show how to recover $b$ from the support vectors.

1. Primal Lagrangian

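Starting from the hard-margin primal (Lecture 11.3), which minimizes $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $t_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1$ for all $n$, introduce one Lagrange multiplier $a_n \geq 0$ per constraint: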

$$\mathcal{L}(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \bigl[t_n(\mathbf{w}^\top\mathbf{x}_n + b) - 1\bigr].$$

2. Stationarity Conditions

Setting the gradient with respect to the primal variables to zero:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \mathbf{x}_n,$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0.$$

3. The Dual Lagrangian

Substituting $\mathbf{w} = \sum_n a_n t_n \mathbf{x}_n$ into $\mathcal{L}$ and using $\sum_n a_n t_n = 0$ to simplify yields the dual:

SVM Dual Problem

$$\max_{\mathbf{a}}\; \ell(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n,m=1}^{N} a_n a_m t_n t_m\, \mathbf{x}_n^\top \mathbf{x}_m$$

subject to $a_n \geq 0$ for all $n$ and $\displaystyle\sum_{n=1}^N a_n t_n = 0$.
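The substitution can be spelled out as follows; this is a sketch of the algebra using only the two stationarity conditions from Section 2. Expanding the Lagrangian gives

$$\mathcal{L} = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \mathbf{w}^\top\sum_{n=1}^{N} a_n t_n \mathbf{x}_n - b\sum_{n=1}^{N} a_n t_n + \sum_{n=1}^{N} a_n.$$

The second term reduces to $-\mathbf{w}^\top\mathbf{w}$ because $\mathbf{w} = \sum_n a_n t_n \mathbf{x}_n$, and the third vanishes because $\sum_n a_n t_n = 0$, so

$$\mathcal{L} = \sum_{n=1}^{N} a_n - \frac{1}{2}\mathbf{w}^\top\mathbf{w} = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n,m=1}^{N} a_n a_m t_n t_m\, \mathbf{x}_n^\top \mathbf{x}_m.$$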

The data appear only through the inner products $\mathbf{x}_n^\top\mathbf{x}_m$. Replacing each inner product with a valid (positive semidefinite) kernel $k(\mathbf{x}_n, \mathbf{x}_m)$ gives the kernel SVM, enabling nonlinear decision boundaries:

$$\max_{\mathbf{a}}\; \sum_n a_n - \frac{1}{2}\sum_{n,m} a_n a_m t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m).$$
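As a rough illustration of the kernelized objective, the sketch below evaluates it for a given multiplier vector. The function names (`linear_kernel`, `gram_matrix`, `dual_objective`) and the two-point toy dataset are assumptions made for this example, not part of the lecture; any kernel callable can be passed in place of the linear one.

```python
def linear_kernel(x, z):
    """Plain inner product x^T z (hypothetical helper for this sketch)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def gram_matrix(X, kernel):
    """Matrix of kernel values K[n][m] = k(x_n, x_m)."""
    return [[kernel(xn, xm) for xm in X] for xn in X]


def dual_objective(a, t, K):
    """sum_n a_n - 1/2 * sum_{n,m} a_n a_m t_n t_m K[n][m]."""
    N = len(a)
    quad = sum(a[n] * a[m] * t[n] * t[m] * K[n][m]
               for n in range(N) for m in range(N))
    return sum(a) - 0.5 * quad


# Toy 1-D dataset (assumed for illustration): one point per class.
X = [(1.0,), (-1.0,)]
t = [1, -1]
K = gram_matrix(X, linear_kernel)

# a = (0.5, 0.5) satisfies a_n >= 0 and sum_n a_n t_n = 0.
a = [0.5, 0.5]
value = dual_objective(a, t, K)
print(value)  # 0.5
```

For this toy problem the constraint forces $a_1 = a_2 = a$, the objective becomes $2a - 2a^2$, and its maximum is at $a = 0.5$, which is the value evaluated above.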

4. Prediction

After solving for $\mathbf{a}$, a new input $\mathbf{x}$ is classified by the sign of the decision function:

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n\, k(\mathbf{x}_n, \mathbf{x}) + b.$$
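The prediction rule might be coded along these lines. This is a minimal sketch: `decision_function` and `predict` are hypothetical names, and the multipliers and bias below are the hand-derived solution for an assumed two-point toy problem, not output from a solver.

```python
def linear_kernel(x, z):
    """Plain inner product x^T z (hypothetical helper for this sketch)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def decision_function(x, X, t, a, b, kernel):
    """y(x) = sum_n a_n t_n k(x_n, x) + b."""
    return sum(an * tn * kernel(xn, x) for xn, tn, an in zip(X, t, a)) + b


def predict(x, X, t, a, b, kernel):
    """Class label in {-1, +1}, taken as the sign of y(x)."""
    return 1 if decision_function(x, X, t, a, b, kernel) >= 0 else -1


# Assumed toy problem: points at +1 and -1, with dual solution a = (0.5, 0.5), b = 0.
X = [(1.0,), (-1.0,)]
t = [1, -1]
a = [0.5, 0.5]
b = 0.0

y_new = decision_function((2.0,), X, t, a, b, linear_kernel)
label_pos = predict((2.0,), X, t, a, b, linear_kernel)
label_neg = predict((-0.5,), X, t, a, b, linear_kernel)
print(y_new, label_pos, label_neg)  # 2.0 1 -1
```

Both training points satisfy $t_n y(\mathbf{x}_n) = 1$ here, so each sits exactly on the margin.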

5. Sparsity: Support Vectors

The KKT complementary slackness condition is $a_n\bigl[t_n y(\mathbf{x}_n) - 1\bigr] = 0$, so at the optimum every training point satisfies at least one of two cases:

If $a_n = 0$: the point drops out of the sum in $y(\mathbf{x})$ and does not contribute to the prediction. Every point strictly outside the margin ($t_n y(\mathbf{x}_n) > 1$) falls in this case.
If $a_n > 0$: the point lies exactly on the margin ($t_n y(\mathbf{x}_n) = 1$). It is a support vector and directly shapes the classifier.

Support Vectors and Sparsity

In many problems most training points have $a_n = 0$; only the support vectors (on the margin) have $a_n > 0$. A prediction at test time then costs $O(N_{\mathrm{sv}} M)$, where $M$ is the cost of one kernel evaluation and often $N_{\mathrm{sv}} \ll N$. This is a key advantage of SVMs over kernel methods whose predictions sum over all $N$ training points.
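To see the sparsity numerically, the sketch below solves the hard-margin dual on a tiny separable dataset. It uses pairwise coordinate ascent in the style of SMO, which is a technique swapped in for illustration rather than the lecture's method; real applications would typically call a QP solver or an SVM library. The function name `solve_hard_margin_dual`, the tolerances, and the dataset are all assumptions.

```python
def linear_kernel(x, z):
    """Plain inner product x^T z (hypothetical helper for this sketch)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def solve_hard_margin_dual(X, t, kernel, sweeps=100, tol=1e-12):
    """Pairwise coordinate ascent on the hard-margin dual (SMO-style sketch).

    Each step optimizes two multipliers at once so that sum_n a_n t_n
    stays at zero, then clips to keep both multipliers nonnegative.
    Assumes the data are separable in the kernel's feature space.
    """
    N = len(X)
    K = [[kernel(xn, xm) for xm in X] for xn in X]
    a = [0.0] * N
    for _ in range(sweeps):
        changed = False
        for i in range(N):
            for j in range(i + 1, N):
                # Bias-free outputs; b cancels in the error difference below.
                f_i = sum(a[n] * t[n] * K[n][i] for n in range(N))
                f_j = sum(a[n] * t[n] * K[n][j] for n in range(N))
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= tol:
                    continue
                new_j = a[j] + t[j] * ((f_i - t[i]) - (f_j - t[j])) / eta
                # Clip so that both a_i and a_j remain >= 0.
                if t[i] == t[j]:
                    new_j = min(max(new_j, 0.0), a[i] + a[j])
                else:
                    new_j = max(new_j, 0.0, a[j] - a[i])
                new_i = a[i] + t[i] * t[j] * (a[j] - new_j)
                if abs(new_j - a[j]) > tol:
                    changed = True
                # max(..., 0.0) only guards against floating-point round-off.
                a[i], a[j] = max(new_i, 0.0), new_j
        if not changed:
            break
    return a


# Assumed 1-D toy data: the two inner points define the margin.
X = [(-2.0,), (-1.0,), (1.0,), (3.0,)]
t = [-1, -1, 1, 1]

a = solve_hard_margin_dual(X, t, linear_kernel)
support = [n for n, an in enumerate(a) if an > 1e-8]
print(support)  # [1, 2]
```

Only the points at $-1$ and $+1$ end up with nonzero multipliers (both approximately $0.5$); the outer points at $-2$ and $3$ lie strictly outside the margin and receive $a_n = 0$.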

6. Recovering the Bias $b$

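For any support vector $(\mathbf{x}_s, t_s)$ with $a_s > 0$, we know $t_s y(\mathbf{x}_s) = 1$. Multiplying both sides by $t_s$ and using $t_s^2 = 1$ gives $y(\mathbf{x}_s) = t_s$; substituting the expression for $y$ and solving for $b$: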

$$b = t_s - \sum_{n:\,a_n>0} a_n t_n\, k(\mathbf{x}_n, \mathbf{x}_s).$$

In practice, averaging over all support vectors gives a more numerically stable estimate of $b$.
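The averaging step could look like the following sketch, which computes the single-support-vector estimate for each support vector and takes the mean. The name `recover_bias`, the threshold `tol` used to decide which multipliers count as nonzero, and the toy data with hand-derived multipliers are assumptions for illustration.

```python
def linear_kernel(x, z):
    """Plain inner product x^T z (hypothetical helper for this sketch)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def recover_bias(X, t, a, kernel, tol=1e-8):
    """Average of t_s - sum_{n: a_n > 0} a_n t_n k(x_n, x_s) over support vectors."""
    sv = [n for n in range(len(a)) if a[n] > tol]
    estimates = []
    for s in sv:
        f_s = sum(a[n] * t[n] * kernel(X[n], X[s]) for n in sv)
        estimates.append(t[s] - f_s)
    return sum(estimates) / len(estimates)


# Assumed toy problem: points at 0 (class -1) and 2 (class +1).
# The maximum-margin boundary is x = 1, i.e. w = 1 and b = -1,
# with dual solution a = (0.5, 0.5).
X = [(0.0,), (2.0,)]
t = [-1, 1]
a = [0.5, 0.5]

b = recover_bias(X, t, a, linear_kernel)
print(b)  # -1.0
```

With exact multipliers every support vector gives the same estimate; averaging matters when $\mathbf{a}$ comes from an iterative solver and each estimate carries a small numerical error.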

Nonlinear Boundaries via Gaussian Kernel

With the Gaussian kernel $k(\mathbf{x},\mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2))$, the kernel SVM implicitly operates in an infinite-dimensional feature space and can represent highly nonlinear decision boundaries. The bandwidth $\sigma$ controls smoothness: large $\sigma$ gives smooth boundaries; small $\sigma$ allows sharp, locally fitted boundaries that can overfit the training data.
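The effect of the bandwidth can be seen directly in the kernel values. The sketch below, with an assumed helper name `gaussian_kernel` and arbitrarily chosen points and bandwidths, compares how similar two fixed points look under a wide and a narrow kernel.

```python
import math


def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Two points at squared distance 2 (chosen only for illustration).
x = (0.0, 0.0)
z = (1.0, 1.0)

wide = gaussian_kernel(x, z, sigma=5.0)    # exp(-0.04), close to 1
narrow = gaussian_kernel(x, z, sigma=0.2)  # exp(-25), close to 0
print(wide > 0.9, narrow < 1e-10)  # True True
```

Under the wide kernel the two points look almost identical, so each support vector influences a large region and the boundary varies slowly. Under the narrow kernel they look nearly unrelated, so each support vector influences only its immediate neighborhood.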