Lecture 11.1

Kernelizing Linear Models

Linear models can be recast in a dual form where predictions depend only on inner products between data points — the doorway to kernel methods and non-parametric modeling.

Learning Objectives

Distinguish parametric from non-parametric models.
Derive the dual formulation of ridge regression, expressing the solution through the kernel matrix.
Define the kernel function $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ as a similarity measure in feature space.
Compare primal versus dual computational costs, and explain why the dual opens up infinite-dimensional feature spaces.

1. Parametric vs. Non-Parametric Models

In all models covered so far, predictions are made via a fixed, finite set of weights $\mathbf{w}$. Once trained, the training data can in principle be discarded. These are parametric models.

Non-parametric models, such as the kernel methods introduced here, base their predictions directly on the training data, using a kernel as the measure of similarity between inputs. This makes it possible to work implicitly with infinite-dimensional feature spaces, as will become clear in Lecture 11.2.

2. The Kernel Function

Kernel Function

Given a feature map $\boldsymbol{\phi}: \mathcal{X} \to \mathbb{R}^M$, the kernel is the inner product in feature space:

$$k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}').$$

It measures the similarity between two inputs in feature space. The $N \times N$ matrix $\mathbf{K}$ with entries $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$, evaluated on the training inputs, is the Gram matrix.
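The definition can be checked numerically. The following is a minimal sketch, assuming a hypothetical quadratic feature map `phi` on scalar inputs and three made-up data points (neither comes from the lecture). It builds the Gram matrix entry by entry from the kernel and confirms that it matches $\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$, where row $n$ of $\boldsymbol{\Phi}$ is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: scalar x -> (1, x, x^2), so M = 3."""
    return np.array([1.0, x, x ** 2])

def kernel(x, x_prime):
    """Kernel as the inner product of the two feature vectors."""
    return phi(x) @ phi(x_prime)

# Three illustrative inputs (N = 3).
X = np.array([0.0, 1.0, 2.0])

# Matrix of feature vectors: row n is phi(x_n), shape (N, M).
Phi = np.stack([phi(x) for x in X])

# Gram matrix built entry by entry from the kernel ...
K = np.array([[kernel(xn, xm) for xm in X] for xn in X])

# ... agrees with the matrix product Phi Phi^T.
assert np.allclose(K, Phi @ Phi.T)
```

The Gram matrix is symmetric by construction, since the inner product does not depend on the order of its arguments.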

3. Dual Formulation of Ridge Regression

Minimizing the regularized least-squares objective

$$J(\mathbf{w}) = \sum_{n=1}^{N} \bigl(\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n) - t_n\bigr)^2 + \lambda \|\mathbf{w}\|^2$$

by setting $\nabla_\mathbf{w} J = 0$ yields $\mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{a}$, where $a_n = -\tfrac{1}{\lambda}(\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n) - t_n)$ and $\boldsymbol{\Phi}$ is the $N \times M$ design matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$. Substituting back, or equivalently applying the matrix inversion lemma to the primal solution to convert its $M \times M$ inverse into an $N \times N$ inverse, gives the dual solution in terms of the Gram matrix $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$:

Dual Solution (Ridge Regression)

$$\mathbf{a} = \bigl(\mathbf{K} + \lambda \mathbf{I}_N\bigr)^{-1} \mathbf{t}.$$
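One way to fill in the intermediate steps is sketched below. It uses direct substitution together with $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$, which follows from $K_{nm} = \boldsymbol{\phi}(\mathbf{x}_n)^\top\boldsymbol{\phi}(\mathbf{x}_m)$ when the $n$-th row of $\boldsymbol{\Phi}$ is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$.

$$
\begin{aligned}
\nabla_\mathbf{w} J &= 2\sum_{n=1}^{N} \bigl(\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n) - t_n\bigr)\boldsymbol{\phi}(\mathbf{x}_n) + 2\lambda\mathbf{w} = 0
\;\;\Longrightarrow\;\;
\mathbf{w} = \sum_{n=1}^{N} a_n \boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\Phi}^\top\mathbf{a}, \\
\mathbf{a} &= -\tfrac{1}{\lambda}\bigl(\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}\bigr)
= -\tfrac{1}{\lambda}\bigl(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\mathbf{a} - \mathbf{t}\bigr)
= -\tfrac{1}{\lambda}\bigl(\mathbf{K}\mathbf{a} - \mathbf{t}\bigr), \\
\lambda\mathbf{a} + \mathbf{K}\mathbf{a} &= \mathbf{t}
\;\;\Longrightarrow\;\;
\bigl(\mathbf{K} + \lambda\mathbf{I}_N\bigr)\mathbf{a} = \mathbf{t}.
\end{aligned}
$$

The same result follows from the primal solution through the identity

$$\bigl(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda\mathbf{I}_M\bigr)^{-1}\boldsymbol{\Phi}^\top = \boldsymbol{\Phi}^\top\bigl(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top + \lambda\mathbf{I}_N\bigr)^{-1},$$

which trades the $M \times M$ inverse for an $N \times N$ one.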

Predictions at a new point $\mathbf{x}'$ are then

$$y(\mathbf{x}') = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}') = \sum_{n=1}^{N} a_n\, k(\mathbf{x}', \mathbf{x}_n) = \mathbf{k}(\mathbf{x}')^\top \mathbf{a}.$$

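Here $\mathbf{k}(\mathbf{x}')$ is the $N$-vector with entries $k_n(\mathbf{x}') = k(\mathbf{x}', \mathbf{x}_n)$. The prediction is a weighted sum of kernel evaluations against all $N$ training points. No explicit $\mathbf{w}$ appears.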
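The equivalence of the two forms can be checked numerically. The sketch below uses randomly generated features and targets; the sizes `N`, `M` and the value of `lam` are illustrative assumptions, not values from the lecture. It solves the primal and dual systems separately and confirms that they give the same weights and the same prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N training points, M features, regularizer lam.
N, M, lam = 20, 5, 0.1
Phi = rng.normal(size=(N, M))   # design matrix, row n is phi(x_n)
t = rng.normal(size=N)          # targets
phi_new = rng.normal(size=M)    # features phi(x') of a new point

# Primal: solve the M x M system (Phi^T Phi + lam I_M) w = Phi^T t.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
y_primal = w @ phi_new

# Dual: solve the N x N system (K + lam I_N) a = t with K = Phi Phi^T.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
k_new = Phi @ phi_new           # entries k(x', x_n) = phi(x_n)^T phi(x')
y_dual = k_new @ a

# The dual coefficients reproduce the primal weights and prediction.
assert np.allclose(w, Phi.T @ a)
assert np.isclose(y_primal, y_dual)
```

Note that the dual path touches the features only through the products `Phi @ Phi.T` and `Phi @ phi_new`, that is, only through kernel evaluations.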

4. Computational Comparison

The two representations have different computational costs:

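Primal: training requires inverting an $M \times M$ matrix ($O(M^3)$, plus $O(NM^2)$ to form $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$); prediction is $O(M)$.
Dual: training requires inverting an $N \times N$ matrix ($O(N^3)$, plus $O(N^2 M)$ to form $\mathbf{K}$ from explicit features); prediction is $O(NM)$, since each of the $N$ kernel evaluations costs $O(M)$.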
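To give a sense of scale, take the assumed sizes $N = 10^4$ and $M = 10^2$ (chosen for illustration, not taken from the lecture) and ignore constant factors:

$$
\begin{aligned}
\text{primal inversion: } & M^3 = 10^{6}, & \text{primal prediction: } & M = 10^{2}, \\
\text{dual inversion: } & N^3 = 10^{12}, & \text{dual prediction: } & NM = 10^{6}.
\end{aligned}
$$

Under these assumptions the dual inversion is a factor of $10^6$ more expensive than the primal one, and each dual prediction a factor of $10^4$ more expensive.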

For $N \gg M$, the primal is cheaper. Yet the dual is important for two reasons.

Why Use the Dual?

First: in the dual, $\boldsymbol{\phi}$ appears only through $k$. We can specify $k$ directly — even one corresponding to an infinite-dimensional feature space — and kernel evaluations remain tractable. This is the kernel trick (Lecture 11.2).
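As a sketch of what specifying $k$ directly looks like, the example below runs the dual ridge solution with a Gaussian kernel, whose feature space is infinite-dimensional. The choice of kernel, the `length_scale` parameter, the toy sine data, and the value of `lam` are all assumptions made for illustration; the lecture does not prescribe them here.

```python
import numpy as np

def gaussian_kernel(A, B, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2)) for all row pairs."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative only).
X = np.linspace(0.0, 1.0, 15).reshape(-1, 1)
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=15)

lam = 1e-2
K = gaussian_kernel(X, X, length_scale=0.2)        # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # dual coefficients

# Predict at new inputs using kernel evaluations only; phi is never formed.
X_new = np.array([[0.25], [0.75]])
y_new = gaussian_kernel(X_new, X, length_scale=0.2) @ a
```

At no point is a feature vector computed: both training and prediction go through `gaussian_kernel` alone, which is what makes the infinite-dimensional case tractable.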

Second: support vector machines (Lectures 11.3–11.5) produce sparse dual solutions in which only a small number of the $a_n$ are non-zero; the corresponding training points are the support vectors. Dual predictions then cost only $O(N' M)$, where $N' \ll N$ is the number of support vectors.
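The saving from sparsity can be illustrated with a small sketch. The dual vector `a` below is hand-made with five non-zero entries at arbitrary indices; it is a hypothetical stand-in, not the output of an actual SVM solver, and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 100, 4
Phi = rng.normal(size=(N, M))   # rows are phi(x_n)
phi_new = rng.normal(size=M)    # phi(x') for a new point

# Hypothetical sparse dual vector: only 5 of the 100 coefficients are non-zero.
a = np.zeros(N)
support = np.array([3, 17, 42, 58, 90])
a[support] = rng.normal(size=support.size)

# Full prediction: N kernel evaluations.
y_full = (Phi @ phi_new) @ a

# Sparse prediction: only N' = 5 kernel evaluations, against the support vectors.
y_sparse = (Phi[support] @ phi_new) @ a[support]

assert np.isclose(y_full, y_sparse)
```

Terms with $a_n = 0$ contribute nothing to the sum, so the corresponding training points need not be stored or evaluated at prediction time.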