Lecture 5.1

The Equivalent Kernel

Interpreting Bayesian linear regression as a linear smoother via the equivalent kernel — a bridge toward Gaussian processes.

Learning Objectives
  • Rewrite the Bayesian predictive mean as a linear combination of training targets $t_n$.
  • Define the equivalent kernel and identify it as a similarity measure between inputs.
  • Show that the covariance between two predictions equals the kernel evaluated at those inputs.
  • Explain the locality property of the kernel and its connection to future kernel methods.

1. The Predictive Mean as a Linear Smoother

Recall from Lecture 4.5 that the Bayesian predictive mean at $x'$ is

$$\mu_N(x') = \mathbf{m}_N^\top \boldsymbol{\phi}(x'),$$

where $\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$ is the posterior mean (Lecture 4.3). Substituting and expanding the matrix-vector product:

$$\mu_N(x') = \beta\, \boldsymbol{\phi}(x')^\top \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t} = \beta\, \boldsymbol{\phi}(x')^\top \mathbf{S}_N \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n = \sum_{n=1}^{N} k(x', x_n)\, t_n.$$

Equivalent Kernel

The function

$$k(x', x_n) = \beta\, \boldsymbol{\phi}(x')^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_n)$$

is called the equivalent kernel. The predictive mean is a weighted sum of training targets, with weight $k(x', x_n)$ determining how much target $t_n$ contributes to the prediction at $x'$.
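
The identity above can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a zero-mean isotropic prior with precision `alpha`, so that $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, and it uses Gaussian basis functions on a toy sine data set. The helper names (`gaussian_design`, `equivalent_kernel`) and all numerical values are illustrative choices, not part of the lecture.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix with one Gaussian basis function per column."""
    x = np.atleast_1d(x)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Hypothetical toy problem; every number below is an illustrative choice.
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # prior precision, noise precision
x_train = rng.uniform(-1.0, 1.0, size=30)
t_train = np.sin(np.pi * x_train) + rng.normal(0.0, beta ** -0.5, size=30)
centers = np.linspace(-1.0, 1.0, 11)
width = 0.1

Phi = gaussian_design(x_train, centers, width)             # shape (N, M)
# Assumed prior N(0, alpha^{-1} I), so S_N^{-1} = alpha I + beta Phi^T Phi.
S_N = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train                         # posterior mean

def equivalent_kernel(x_query):
    """Weights k(x', x_n) for all training inputs; shape (len(x_query), N)."""
    return beta * gaussian_design(x_query, centers, width) @ S_N @ Phi.T

x_query = np.array([-0.5, 0.0, 0.7])
mean_from_weights = gaussian_design(x_query, centers, width) @ m_N   # m_N^T phi(x')
mean_from_kernel = equivalent_kernel(x_query) @ t_train              # sum_n k(x', x_n) t_n
print(np.allclose(mean_from_weights, mean_from_kernel))
```

Both routes compute the same predictive mean, so the two arrays should agree to floating-point precision. Each row of `equivalent_kernel(x_query)` holds the weights that one query point assigns to the training targets.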

2. Locality: Close Points Contribute More

Plotting $k(x', x)$ as a function of both arguments shows that the kernel takes large values whenever $x'$ and $x$ are close to each other, and smaller values as they move apart. The kernel is therefore localized: for a given $x'$, the dominant contributions to $\mu_N(x')$ come from training targets $t_n$ whose inputs $x_n$ lie nearby.

Gaussian Basis Functions

For Gaussian basis functions centered across the input domain, the equivalent kernel resembles a Gaussian bump centered at $x'$. Training points close to $x'$ receive large weights; those far away receive weights of small magnitude. This locality is not unique to localized basis functions: equivalent kernels built from non-local bases, such as polynomial or sigmoidal basis functions, are also concentrated around $x'$.
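
To see the locality concretely, one can evaluate the kernel weights for a single query point against a dense grid of training inputs. The sketch below is illustrative only: it assumes the prior $\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, so that $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, and the grid, basis centers, width, and the helper `gaussian_design` are hypothetical choices. Note that the kernel depends only on the inputs, so no targets are needed.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix with one Gaussian basis function per column."""
    x = np.atleast_1d(x)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Illustrative settings (assumptions, not values from the lecture).
alpha, beta = 2.0, 25.0
x_train = np.linspace(-1.0, 1.0, 200)        # evenly spaced training inputs
centers = np.linspace(-1.0, 1.0, 11)
width = 0.1

Phi = gaussian_design(x_train, centers, width)
# Assumed prior N(0, alpha^{-1} I), so S_N^{-1} = alpha I + beta Phi^T Phi.
S_N = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)

x_query = 0.0
# weights[n] = k(x', x_n) = beta * phi(x')^T S_N phi(x_n)
weights = (beta * gaussian_design(x_query, centers, width) @ S_N @ Phi.T)[0]

peak_location = x_train[np.argmax(weights)]
far = np.abs(x_train - x_query) > 0.5
print("largest weight at x_n =", peak_location)
print("largest |weight| beyond distance 0.5, relative to the peak:",
      np.abs(weights[far]).max() / weights.max())
```

In this setup the largest weight sits at the training input nearest to $x'$, and the weights far from $x'$ are small relative to the peak. They need not be exactly zero or even positive: small oscillating side lobes are typical.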

3. The Kernel as Covariance Between Predictions

The kernel also has a clean probabilistic interpretation. Consider two predictions $y(x_1) = \mathbf{w}^\top \boldsymbol{\phi}(x_1)$ and $y(x_2) = \mathbf{w}^\top \boldsymbol{\phi}(x_2)$ at inputs $x_1$ and $x_2$, viewed as functions of the random weight vector $\mathbf{w}$ drawn from the posterior $p(\mathbf{w}|\mathcal{D})$. Their covariance is:

$$\mathrm{Cov}[y(x_1), y(x_2)] = \mathbb{E}_\mathbf{w}[\boldsymbol{\phi}(x_1)^\top \mathbf{w}\, \mathbf{w}^\top \boldsymbol{\phi}(x_2)] - \mathbb{E}_\mathbf{w}[\boldsymbol{\phi}(x_1)^\top \mathbf{w}]\,\mathbb{E}_\mathbf{w}[\mathbf{w}^\top \boldsymbol{\phi}(x_2)].$$

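The feature vectors $\boldsymbol{\phi}(x_1)$ and $\boldsymbol{\phi}(x_2)$ are fixed, so they can be pulled outside the expectations, leaving $\mathbb{E}_\mathbf{w}[\mathbf{w}\mathbf{w}^\top] - \mathbb{E}_\mathbf{w}[\mathbf{w}]\,\mathbb{E}_\mathbf{w}[\mathbf{w}]^\top$ in the middle. This is the posterior covariance of $\mathbf{w}$, namely $\mathbf{S}_N$, so the expression simplifies to: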

Kernel as Predictive Covariance

$$\mathrm{Cov}[y(x_1), y(x_2)] = \boldsymbol{\phi}(x_1)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_2) = \frac{1}{\beta}\, k(x_1, x_2).$$

The equivalent kernel quantifies the statistical dependence between predictions at different inputs. Nearby predictions covary strongly; distant predictions are nearly independent.
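
The covariance identity can also be checked by sampling. The sketch below draws weight vectors from the posterior $\mathcal{N}(\mathbf{m}_N, \mathbf{S}_N)$, evaluates $y(x) = \mathbf{w}^\top \boldsymbol{\phi}(x)$ at two inputs, and compares the sample covariance with $\boldsymbol{\phi}(x_1)^\top \mathbf{S}_N \boldsymbol{\phi}(x_2)$ and with $k(x_1, x_2)/\beta$. It assumes a zero-mean isotropic prior with precision `alpha`; the data set, the two query inputs, and the helper `gaussian_design` are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix with one Gaussian basis function per column."""
    x = np.atleast_1d(x)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Illustrative toy problem (all values are assumptions).
rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0
x_train = rng.uniform(-1.0, 1.0, size=30)
t_train = np.sin(np.pi * x_train) + rng.normal(0.0, beta ** -0.5, size=30)
centers = np.linspace(-1.0, 1.0, 11)
width = 0.1

Phi = gaussian_design(x_train, centers, width)
# Assumed prior N(0, alpha^{-1} I), so S_N^{-1} = alpha I + beta Phi^T Phi.
S_N = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

# Feature vectors of two nearby query inputs.
phi_1 = gaussian_design(0.30, centers, width)[0]
phi_2 = gaussian_design(0.35, centers, width)[0]

analytic_cov = phi_1 @ S_N @ phi_2           # phi(x_1)^T S_N phi(x_2)
kernel_value = beta * phi_1 @ S_N @ phi_2    # k(x_1, x_2)

# Draw w ~ N(m_N, S_N) using a Cholesky factor L with L L^T = S_N.
L = np.linalg.cholesky(S_N)
w_samples = m_N + rng.standard_normal((200_000, len(centers))) @ L.T
y_1 = w_samples @ phi_1                      # samples of y(x_1)
y_2 = w_samples @ phi_2                      # samples of y(x_2)
sample_cov = np.cov(y_1, y_2)[0, 1]

print("phi_1^T S_N phi_2 :", analytic_cov)
print("k(x_1, x_2) / beta:", kernel_value / beta)
print("Monte Carlo       :", sample_cov)
```

The first two printed numbers agree by construction. The Monte Carlo estimate should match them up to sampling error, which shrinks as the number of draws grows.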

Bridge to Kernel Methods and Gaussian Processes

This kernel-based viewpoint removes the explicit weight vector $\mathbf{w}$ from the prediction: the predictive mean is computed from the training targets and the kernel evaluations $k(x', x_n)$ alone. Later in the course, we will work directly with kernels without defining basis functions at all. This perspective leads naturally to Gaussian processes, where the prior over functions is specified entirely by a kernel function.