Lecture 11.2
The Kernel Trick
By replacing every occurrence of $\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$ with a kernel function $k(\mathbf{x}, \mathbf{x}')$, we can work implicitly in arbitrarily high-dimensional, or even infinite-dimensional, feature spaces. The cost is that of evaluating $k$, independent of the dimension of the feature space.
- State the kernel trick and explain what it means to apply it to a dual formulation.
- Define the Gram matrix and state the conditions for a kernel to be valid (symmetric positive semi-definite).
- Prove that a kernel defined via $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$ is always valid.
- Identify common kernel families (polynomial, Gaussian/RBF) and state basic kernel construction rules.
1. The Kernel Trick
Formulate your model so that the input $\mathbf{x}$ appears only through inner products $\boldsymbol{\phi}(\mathbf{x}_n)^\top \boldsymbol{\phi}(\mathbf{x}_m)$ (or through a scalar product $\mathbf{x}_n^\top \mathbf{x}_m$ in the original space). Then replace every such inner product with a kernel function
$$k(\mathbf{x}_n, \mathbf{x}_m).$$
All kernel values are collected in the $N \times N$ Gram matrix $\mathbf{K}$, with $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$. The kernel implicitly corresponds to an inner product in some (possibly infinite-dimensional) feature space, without ever computing $\boldsymbol{\phi}(\mathbf{x})$ explicitly.
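As a minimal sketch of how the Gram matrix is assembled, the snippet below evaluates a kernel on every pair of inputs. The names `linear_kernel` and `gram_matrix`, and the three toy points, are illustrative assumptions rather than part of the lecture.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Plain inner product x^T z (feature map phi(x) = x)."""
    return sum(xi * zi for xi, zi in zip(x, z))


def gram_matrix(kernel: Kernel, X: Sequence[Vector]) -> List[List[float]]:
    """Return the N x N matrix with entries K[n][m] = kernel(X[n], X[m])."""
    return [[kernel(xn, xm) for xm in X] for xn in X]


# Hypothetical toy inputs: N = 3 points in 2D.
X = [(1.0, 2.0), (0.0, 1.0), (3.0, -1.0)]
K = gram_matrix(linear_kernel, X)
```

Any other kernel function with the same two-argument signature can be passed in place of `linear_kernel`; the rest of the model only ever sees `K`.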
2. Valid Kernels
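A kernel $k$ is valid if, for every finite set of inputs $\mathbf{x}_1, \dots, \mathbf{x}_N$, its Gram matrix $\mathbf{K}$ is symmetric and positive semi-definite, i.e. for all vectors $\mathbf{z} \in \mathbb{R}^N$,
$$\mathbf{z}^\top \mathbf{K} \mathbf{z} \geq 0.$$
If $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ for some feature map $\boldsymbol{\phi}$, stack the feature vectors as the rows of the design matrix $\boldsymbol{\Phi}$ (row $n$ is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$), so that $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$. Then $\mathbf{K}$ is symmetric and $\mathbf{z}^\top \mathbf{K} \mathbf{z} = \mathbf{z}^\top \boldsymbol{\Phi}\boldsymbol{\Phi}^\top \mathbf{z} = \|\boldsymbol{\Phi}^\top \mathbf{z}\|^2 \geq 0$, so any kernel of this form is automatically valid.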
Crucially, the converse also holds: every valid kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space. We can therefore work purely with $k$, knowing a $\boldsymbol{\phi}$ exists even if we never compute it.
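The validity argument can be checked numerically. The sketch below assumes a small hypothetical feature map $\boldsymbol{\phi}(x) = (1, x, x^2)$ on 1D inputs, builds the Gram matrix from it, and compares $\mathbf{z}^\top \mathbf{K} \mathbf{z}$ with $\|\boldsymbol{\Phi}^\top \mathbf{z}\|^2$ for random $\mathbf{z}$. It illustrates the identity on one example and is not a proof.

```python
import random


def phi(x):
    """Hypothetical feature map for 1D inputs: phi(x) = (1, x, x^2)."""
    return (1.0, x, x * x)


def kernel(x, z):
    """Kernel induced by phi: k(x, z) = phi(x)^T phi(z)."""
    return sum(a * b for a, b in zip(phi(x), phi(z)))


def quadratic_form(K, z):
    """Compute z^T K z for a square matrix K given as nested lists."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))


rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [rng.uniform(-2.0, 2.0) for _ in range(5)]
K = [[kernel(a, b) for b in xs] for a in xs]

# z^T K z should equal ||Phi^T z||^2, which can never be negative.
for _ in range(100):
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    phi_t_z = [sum(z[n] * phi(xs[n])[d] for n in range(len(xs))) for d in range(3)]
    norm_sq = sum(v * v for v in phi_t_z)
    assert abs(quadratic_form(K, z) - norm_sq) < 1e-8 * (1.0 + norm_sq)
```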
3. Example: Polynomial Kernel
For $\mathbf{x}, \mathbf{z} \in \mathbb{R}^2$, define $k(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^2$. Expanding:
$$k(\mathbf{x},\mathbf{z}) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2.$$
This equals $\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{z})$ where
$$\boldsymbol{\phi}(\mathbf{x}) = \bigl(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\bigr)^\top.$$
A 2D input is implicitly mapped to a 6D feature space. Evaluating $k(\mathbf{x}, \mathbf{z})$ directly requires only 3 multiplications (two for $\mathbf{x}^\top\mathbf{z}$ and one for the square), far cheaper than computing and dotting the 6D vectors. For degree-$M$ polynomial kernels in $D$ dimensions, the implicit feature space has dimension $\binom{D+M}{M}$, which grows like $D^M/M!$ for large $D$ at fixed $M$ and quickly becomes too large to handle explicitly.
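The equivalence between the kernel and the explicit 6D feature map can be verified on a concrete pair of points. In this sketch the function names and the test points are illustrative assumptions; `feature_dim` simply evaluates $\binom{D+M}{M}$.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (1 + x^T z)^2 for 2D inputs."""
    s = 1.0 + x[0] * z[0] + x[1] * z[1]
    return s * s


def phi(x):
    """Explicit 6D feature map matching the expansion in the text."""
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1])


def feature_dim(D, M):
    """Number of monomials of degree <= M in D variables: C(D + M, M)."""
    return math.comb(D + M, M)


# Hypothetical test points.
x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)                            # kernel evaluated directly
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # inner product in 6D
```

Here $\mathbf{x}^\top\mathbf{z} = 1$, so both routes give $k = 4$ up to floating-point rounding, and `feature_dim(2, 2)` recovers the 6 dimensions of $\boldsymbol{\phi}$.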
4. Common Kernel Families
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top \mathbf{x}')^M$ — implicit $\binom{D+M}{M}$-dimensional feature space.
- Gaussian (RBF): $k(\mathbf{x}, \mathbf{x}') = \exp\!\bigl(-\|\mathbf{x}-\mathbf{x}'\|^2 / (2\sigma^2)\bigr)$ — implicit infinite-dimensional feature space. One of the most widely used kernels.
- Radial basis functions: kernels that depend on the inputs only through their distance, $k(\mathbf{x}, \mathbf{x}') = f(\|\mathbf{x}-\mathbf{x}'\|^2)$. Not every choice of $f$ yields a valid kernel; the Gaussian is a valid special case.
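The two named families translate directly into code. This is a minimal sketch; the function names and default parameter values (`M=2`, `sigma=1.0`) are assumptions chosen for illustration.

```python
import math


def polynomial_kernel(x, z, M=2):
    """k(x, z) = (1 + x^T z)^M."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** M


def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical test points: two orthogonal unit vectors.
x, z = (1.0, 0.0), (0.0, 1.0)
k_poly = polynomial_kernel(x, z, M=3)
k_gauss = gaussian_kernel(x, z, sigma=1.0)
```

Note that the Gaussian kernel depends only on $\|\mathbf{x}-\mathbf{x}'\|^2$, so $k(\mathbf{x}, \mathbf{x}) = 1$ for every input, while the polynomial kernel depends on the inner product $\mathbf{x}^\top\mathbf{x}'$.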
5. Constructing New Kernels
Valid kernels can be combined to produce new valid kernels. If $k_1$ and $k_2$ are valid, so are:
- $c\,k_1(\mathbf{x}, \mathbf{x}')$ for any constant $c > 0$
- $k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
- $k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
- $\exp(k_1(\mathbf{x}, \mathbf{x}'))$
- $f(\mathbf{x})\,k_1(\mathbf{x}, \mathbf{x}')\,f(\mathbf{x}')$ for any function $f$
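The construction rules can be sketched as higher-order functions that take kernels and return new kernels. All names here are hypothetical, and the example additionally assumes that the constant kernel $k(\mathbf{x}, \mathbf{x}') = 1$ is valid (it is the inner product for the feature map $\phi(\mathbf{x}) = 1$).

```python
import math


def linear(x, z):
    """Linear kernel x^T z."""
    return sum(a * b for a, b in zip(x, z))


def scale(k, c):
    """c * k(x, z), intended for c > 0."""
    return lambda x, z: c * k(x, z)


def add(k1, k2):
    """k1(x, z) + k2(x, z)."""
    return lambda x, z: k1(x, z) + k2(x, z)


def multiply(k1, k2):
    """k1(x, z) * k2(x, z)."""
    return lambda x, z: k1(x, z) * k2(x, z)


def exponentiate(k):
    """exp(k(x, z))."""
    return lambda x, z: math.exp(k(x, z))


def conjugate(k, f):
    """f(x) * k(x, z) * f(z) for an arbitrary scalar function f."""
    return lambda x, z: f(x) * k(x, z) * f(z)


# Example: (1 + x^T z)^2 assembled from the sum and product rules.
one = lambda x, z: 1.0          # constant kernel, assumed valid (phi(x) = 1)
affine = add(one, linear)       # 1 + x^T z
quadratic = multiply(affine, affine)
```

The code only composes functions; it does not check validity. Validity of the result follows from the rules, provided the inputs are valid kernels.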
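The Gaussian kernel can be derived from the linear kernel $\mathbf{x}^\top\mathbf{x}'$. Expanding the squared norm, $\|\mathbf{x}-\mathbf{x}'\|^2 = \mathbf{x}^\top\mathbf{x} - 2\,\mathbf{x}^\top\mathbf{x}' + \mathbf{x}'^\top\mathbf{x}'$, gives
$$\exp\!\bigl(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2)\bigr) = \exp\!\bigl(-\mathbf{x}^\top\mathbf{x}/(2\sigma^2)\bigr)\,\exp\!\bigl(\mathbf{x}^\top\mathbf{x}'/\sigma^2\bigr)\,\exp\!\bigl(-\mathbf{x}'^\top\mathbf{x}'/(2\sigma^2)\bigr).$$
The middle factor is valid by the scaling and exponentiation rules applied to the linear kernel. The outer factors have the form $f(\mathbf{x})$ and $f(\mathbf{x}')$ with $f(\mathbf{x}) = \exp(-\mathbf{x}^\top\mathbf{x}/(2\sigma^2))$, so the last rule applies. Each step preserves validity, which gives a constructive proof that the Gaussian kernel is valid.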
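A quick numerical check of this construction: expanding the squared norm splits the Gaussian kernel into $f(\mathbf{x})\,\exp(\mathbf{x}^\top\mathbf{x}'/\sigma^2)\,f(\mathbf{x}')$ with $f(\mathbf{x}) = \exp(-\mathbf{x}^\top\mathbf{x}/(2\sigma^2))$. The sketch below evaluates both sides; the function names, test points, and value of `sigma` are illustrative assumptions.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Direct evaluation: exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gaussian_from_linear(x, z, sigma):
    """Rebuild the Gaussian kernel from the linear kernel x^T z."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    f = lambda u: math.exp(-dot(u, u) / (2.0 * sigma ** 2))  # outer factor f(.)
    middle = math.exp(dot(x, z) / sigma ** 2)  # exp of the scaled linear kernel
    return f(x) * middle * f(z)


# Hypothetical test values.
x, z, sigma = (0.5, -1.0), (1.5, 0.25), 0.8
direct = gaussian_kernel(x, z, sigma)
rebuilt = gaussian_from_linear(x, z, sigma)
```

The two values should agree up to floating-point rounding for any choice of inputs and any $\sigma > 0$.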