Lecture 11.1
Kernelizing Linear Models
Linear models can be recast in a dual form where predictions depend only on inner products between data points — the doorway to kernel methods and non-parametric modeling.
- Distinguish parametric from non-parametric models.
- Derive the dual formulation of ridge regression, expressing the solution through the kernel matrix.
- Define the kernel function $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ as a similarity measure in feature space.
- Compare primal versus dual computational costs, and explain why the dual opens up infinite-dimensional feature spaces.
1. Parametric vs. Non-Parametric Models
In all models covered so far, predictions are made via a fixed, finite set of weights $\mathbf{w}$. Once trained, the training data can in principle be discarded. These are parametric models.
Non-parametric models, by contrast, keep the training data and base their predictions directly on it, here via a kernel similarity measure. This permits implicitly working with infinite-dimensional feature spaces, as will become clear in Lecture 11.2.
2. The Kernel Function
Given a feature map $\boldsymbol{\phi}: \mathcal{X} \to \mathbb{R}^M$, the kernel is the inner product in feature space:
$$k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}').$$
It measures the similarity between two inputs. The $N \times N$ matrix $\mathbf{K}$ with entries $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$ is the Gram matrix.
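The definitions above can be made concrete with a small sketch. The feature map `phi` (a quadratic polynomial map for scalar inputs, so $M = 3$) and the sample inputs `xs` are assumptions chosen for illustration; they do not come from the lecture.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for a scalar input: (1, x, x^2), so M = 3."""
    return np.array([1.0, x, x * x])

def kernel(x, x_prime):
    """k(x, x') = phi(x)^T phi(x'), the inner product in feature space."""
    return float(phi(x) @ phi(x_prime))

def gram_matrix(xs):
    """N x N Gram matrix with K[n, m] = k(x_n, x_m)."""
    Phi = np.stack([phi(x) for x in xs])  # N x M design matrix
    return Phi @ Phi.T

# Illustrative inputs (assumed, N = 3)
xs = np.array([-1.0, 0.0, 2.0])
K = gram_matrix(xs)
print(K)
```

Because $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$, the Gram matrix built this way is symmetric and positive semi-definite.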
3. Dual Formulation of Ridge Regression
Minimizing the regularized least-squares objective
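$$J(\mathbf{w}) = \sum_{n=1}^{N} \bigl(\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n) - t_n\bigr)^2 + \lambda \|\mathbf{w}\|^2$$
and setting $\nabla_\mathbf{w} J = 0$ yields $\mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{a}$, where $a_n = -\tfrac{1}{\lambda}(\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n) - t_n)$ and $\boldsymbol{\Phi}$ is the $N \times M$ design matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$. Substituting $\mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{a}$ back into the definition of $a_n$ gives $\lambda \mathbf{a} = \mathbf{t} - \mathbf{K}\mathbf{a}$; equivalently, applying the matrix inversion lemma to the primal solution converts the $M \times M$ inverse into an $N \times N$ inverse. Either way, the dual solution is
$$\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I}_N)^{-1}\mathbf{t},$$
where $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ is the Gram matrix and $\mathbf{t} = (t_1, \dots, t_N)^\top$ collects the targets.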
Predictions at a new point $\mathbf{x}'$ are then
$$y(\mathbf{x}') = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}') = \sum_{n=1}^{N} a_n\, k(\mathbf{x}', \mathbf{x}_n) = \mathbf{k}(\mathbf{x}')^\top \mathbf{a},$$
where $\mathbf{k}(\mathbf{x}')$ is the $N$-vector with entries $k(\mathbf{x}', \mathbf{x}_n)$. The prediction is a weighted sum of kernel evaluations against all training points. No explicit $\mathbf{w}$ appears.
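As a numerical check, the sketch below fits ridge regression in both forms and compares the predictions. It assumes the standard dual solution $\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I}_N)^{-1}\mathbf{t}$ and the primal solution $\mathbf{w} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda \mathbf{I}_M)^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$; the feature map, the synthetic data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for a scalar input: (1, x, x^2), so M = 3."""
    return np.array([1.0, x, x * x])

# Synthetic data (assumed for illustration): N = 20 noisy samples of a sine
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=20)
t = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(20)
lam = 0.1  # regularization strength lambda (assumed value)

Phi = np.stack([phi(x) for x in x_train])  # N x M design matrix
N, M = Phi.shape

# Primal: solve an M x M system for the weights w
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual: solve an N x N system for the coefficients a
K = Phi @ Phi.T  # Gram matrix, K[n, m] = k(x_n, x_m)
a = np.linalg.solve(K + lam * np.eye(N), t)

# Predict at a new point in both forms
x_new = 0.3
y_primal = w @ phi(x_new)
k_vec = Phi @ phi(x_new)  # entries k(x_new, x_n)
y_dual = k_vec @ a

print(y_primal, y_dual)
```

The two predictions agree up to floating-point error, and the primal weights can be recovered as `Phi.T @ a`. The dual prediction itself touches the feature map only through inner products.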
4. Computational Comparison
The two representations have different computational costs:
- Primal: training requires inverting an $M \times M$ matrix ($O(M^3)$); prediction is $O(M)$.
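- Dual: training requires inverting an $N \times N$ matrix ($O(N^3)$); prediction needs $N$ kernel evaluations, which is $O(NM)$ when each kernel value is computed from explicit $M$-dimensional features.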
For $N \gg M$, the primal is cheaper. Yet the dual is important for two reasons.
First: in the dual, $\boldsymbol{\phi}$ appears only through $k$. We can specify $k$ directly — even one corresponding to an infinite-dimensional feature space — and kernel evaluations remain tractable. This is the kernel trick (Lecture 11.2).
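To illustrate specifying $k$ directly, the sketch below runs the same dual ridge computation with a Gaussian kernel, a common example of a kernel whose feature space is infinite-dimensional. The kernel choice, the `length_scale` parameter, the data, and `lam` are all assumptions made for this example; no feature map is ever constructed.

```python
import numpy as np

def gaussian_kernel(X, Z, length_scale=0.5):
    """Gaussian kernel matrix between 1-D input arrays X and Z.

    Entry [i, j] is exp(-(X[i] - Z[j])^2 / (2 * length_scale^2)).
    """
    sq_dists = (X[:, None] - Z[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

# Synthetic, noise-free data (assumed for illustration)
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1.0, 1.0, size=30))
t = np.sin(np.pi * x_train)
lam = 1e-3  # regularization strength lambda (assumed value)

# Dual training: only the N x N Gram matrix is needed
K = gaussian_kernel(x_train, x_train)
a = np.linalg.solve(K + lam * np.eye(len(x_train)), t)

# Dual prediction: kernel evaluations against all training points
x_new = np.array([-0.5, 0.0, 0.5])
y_new = gaussian_kernel(x_new, x_train) @ a
print(y_new)
```

Training and prediction cost the same as for any other kernel here: the dimension of the implicit feature space never enters the computation.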
Second: support vector machines (Lectures 11.3–11.5) produce sparse dual solutions where only a small number of $a_n$ are non-zero (the corresponding training points are the support vectors). With $N'$ support vectors, dual predictions cost only $O(N' M)$, where $N' \ll N$.
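The effect of sparsity on prediction can be sketched as follows. The coefficient vector `a` below is hand-written for illustration and is not the output of an SVM solver; the linear kernel and the data are likewise assumptions, and any bias term is left out.

```python
import numpy as np

def linear_kernel(x, z):
    """Simple kernel used for illustration: k(x, z) = x^T z."""
    return float(np.dot(x, z))

def predict_dual(a, X_train, x_new, kernel):
    """Full dual prediction: sum over all N training points."""
    return sum(a_n * kernel(x_new, x_n) for a_n, x_n in zip(a, X_train))

def predict_sparse(a, X_train, x_new, kernel):
    """Same prediction, visiting only the points whose coefficient is non-zero."""
    support = np.flatnonzero(a)  # indices n with a_n != 0
    return sum(a[n] * kernel(x_new, X_train[n]) for n in support)

# Hypothetical training inputs (N = 5) and sparse coefficients (N' = 2)
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
a = np.array([0.0, 0.7, 0.0, -0.2, 0.0])
x_new = np.array([1.0, 2.0])

y_full = predict_dual(a, X_train, x_new, linear_kernel)
y_sparse = predict_sparse(a, X_train, x_new, linear_kernel)
print(y_full, y_sparse)
```

Both functions return the same value, but the sparse version evaluates the kernel only $N'$ times instead of $N$ times.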