Lecture 12.2

Kernelizing Bayesian Regression

By rewriting Bayesian linear regression in dual form, predictions become weighted sums of kernel evaluations. This shows that the equivalent kernel from Lecture 5.1 already made Bayesian linear regression a kernel method, and it motivates the move to Gaussian processes.

Learning Objectives

Recall the Bayesian linear regression posterior and predictive distribution.
Show that the predictive mean can be written as a weighted sum of kernel evaluations (equivalent kernel form).
Identify the limitations of the parametric approach that motivate kernelization.
Explain how the kernel trick connects Bayesian linear regression to Gaussian processes.

1. Bayesian Linear Regression Recap

From Lectures 4.3–4.5, the Bayesian linear regression model with Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ and Gaussian noise with precision $\beta$ gives:

Posterior: $p(\mathbf{w} | \mathcal{D}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}$.
Predictive mean: $\mu(\mathbf{x}') = \mathbf{m}_N^\top \boldsymbol{\phi}(\mathbf{x}') = \beta\, \boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$.
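The recap formulas can be sketched numerically as follows. This is a minimal illustration, not a reference implementation: the helper names (`gaussian_basis`, `posterior`), the synthetic sine data, the Gaussian basis with a bias column, and the values of $\alpha$ and $\beta$ are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Design matrix: a bias column followed by Gaussian basis functions (illustrative choice)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    rbf = np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), rbf])

def posterior(Phi, t, alpha, beta):
    """Return (m_N, S_N) for the prior N(0, alpha^{-1} I) and noise precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # S_N^{-1} = alpha I + beta Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # m_N = beta S_N Phi^T t
    return m_N, S_N

# Synthetic training data (assumed for illustration only).
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=20)

alpha, beta = 2.0, 25.0                  # assumed hyperparameter values
centers = np.linspace(0.0, 1.0, 9)
Phi = gaussian_basis(x_train, centers, width=0.1)
m_N, S_N = posterior(Phi, t_train, alpha, beta)

# Predictive mean mu(x') = m_N^T phi(x') at a few new inputs.
x_new = np.array([0.25, 0.5, 0.75])
Phi_new = gaussian_basis(x_new, centers, width=0.1)
mu_new = Phi_new @ m_N
```

The explicit inverse mirrors the formula for $\mathbf{S}_N$; in practice a linear solve or Cholesky factorization would usually be preferred for numerical stability.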

2. The Equivalent Kernel Form

Defining $k(\mathbf{x}', \mathbf{x}_n) = \beta\,\boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n)$, the predictive mean becomes:

Predictive Mean as Kernel Smoother $$\mu(\mathbf{x}') = \sum_{n=1}^{N} k(\mathbf{x}', \mathbf{x}_n)\, t_n = \mathbf{k}(\mathbf{x}')^\top \mathbf{t}.$$

This is the equivalent kernel from Lecture 5.1 — the predictive mean is already a weighted sum over training targets, with weights given by a kernel. The Bayesian regression model was implicitly a kernel method all along.
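A short numerical check of this identity, under assumed Gaussian basis functions and synthetic data (the helper `gaussian_features` and all numeric values are hypothetical choices for the example): the matrix `K_eq` holds the weights $k(\mathbf{x}', \mathbf{x}_n)$, and multiplying it by the targets should reproduce the predictive mean computed from $\mathbf{m}_N$.

```python
import numpy as np

def gaussian_features(x, centers, width=0.1):
    """Gaussian basis functions evaluated at x (illustrative feature map)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)

# Synthetic data and hyperparameters, assumed for illustration.
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0.0, 1.0, size=15))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=15)
alpha, beta = 2.0, 25.0
centers = np.linspace(0.0, 1.0, 9)

Phi = gaussian_features(x_train, centers)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

x_new = np.linspace(0.0, 1.0, 5)
Phi_new = gaussian_features(x_new, centers)

# Equivalent kernel: row i holds k(x_new[i], x_n) = beta * phi(x_new[i])^T S_N phi(x_n).
K_eq = beta * Phi_new @ S_N @ Phi.T          # shape (5, 15)

mu_primal = Phi_new @ m_N                    # m_N^T phi(x')
mu_smoother = K_eq @ t_train                 # sum_n k(x', x_n) t_n
```

Because $\mathbf{S}_N$ is built from all training inputs, the weights in `K_eq` depend on the whole training set, not only on the pair $(\mathbf{x}', \mathbf{x}_n)$.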

3. Limitations of the Parametric Approach

The equivalent kernel $k(\mathbf{x}', \mathbf{x}_n) = \beta\,\boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n)$ is derived from $\boldsymbol{\phi}$, which we must choose manually. Two problems arise:

Feature design: choosing appropriate basis functions in high dimensions is difficult. Poorly placed basis functions (e.g., a Gaussian basis centered far from the data) contribute nothing.
Computation: training requires inverting $\mathbf{S}_N^{-1}$, an $M \times M$ matrix, at $O(M^3)$ cost. More expressive models need larger $M$, so this cost grows quickly.
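The feature-design problem can be illustrated with a small sketch. It assumes Gaussian basis functions of width 0.1, noise-free sine targets on $[0, 1]$, and a deliberately misplaced basis center at $x = 10$; all of these are hypothetical choices for the example. The misplaced basis function evaluates to (numerically) zero on every training input, so the data carry no information about its weight.

```python
import numpy as np

def gaussian_features(x, centers, width=0.1):
    """Gaussian basis functions evaluated at x (illustrative feature map)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)

# Training inputs live in [0, 1]; the last basis center is far outside that range.
x_train = np.linspace(0.0, 1.0, 20)
t_train = np.sin(2 * np.pi * x_train)
alpha, beta = 2.0, 25.0                       # assumed hyperparameter values

centers = np.array([0.2, 0.5, 0.8, 10.0])     # 10.0 is the poorly placed center
Phi = gaussian_features(x_train, centers)

S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

far_activation = Phi[:, -1].max()   # largest response of the far basis on the data
far_mean = m_N[-1]                  # posterior mean of its weight
far_variance = S_N[-1, -1]          # posterior variance of its weight
```

Here the posterior for the misplaced weight stays at the prior: mean 0 and variance $\alpha^{-1}$. The basis function still adds a row and column to the $M \times M$ system while contributing nothing to the fit.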

4. The Kernel Trick as the Solution

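As written in Section 1, the predictive mean involves the $M \times M$ matrix $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ rather than inner products between data points. The identity $(\alpha\mathbf{I}_M + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top = \boldsymbol{\Phi}^\top(\alpha\mathbf{I}_N + \beta\boldsymbol{\Phi}\boldsymbol{\Phi}^\top)^{-1}$ gives the dual form $\mu(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x}')^\top\boldsymbol{\Phi}^\top\left(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top + \tfrac{\alpha}{\beta}\mathbf{I}_N\right)^{-1}\mathbf{t}$, in which the features appear only through inner products $\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$. We can therefore apply the kernel trick: replace $\boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ with any valid kernel $k(\mathbf{x}, \mathbf{x}')$. This allows implicitly infinite-dimensional feature spaces without specifying $\boldsymbol{\phi}$.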
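The following sketch checks numerically that the $M \times M$ (primal) predictive mean agrees with an $N \times N$ expression written purely in terms of feature inner products, and then swaps those inner products for a squared-exponential kernel. The helper names (`gaussian_features`, `rbf_kernel`, `dual_mean`), the synthetic data, the length scale, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def gaussian_features(x, centers, width=0.1):
    """Explicit Gaussian basis functions (illustrative feature map)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)

def rbf_kernel(a, b, length_scale=0.1):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-0.5 * ((a - b) / length_scale) ** 2)

def dual_mean(K_train, K_cross, t, ridge):
    """Dual-form mean: K_cross (K_train + ridge * I)^{-1} t."""
    N = K_train.shape[0]
    return K_cross @ np.linalg.solve(K_train + ridge * np.eye(N), t)

# Synthetic data and hyperparameters, assumed for illustration.
rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0.0, 1.0, size=15))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=15)
x_new = np.linspace(0.0, 1.0, 7)
alpha, beta = 2.0, 25.0
centers = np.linspace(0.0, 1.0, 9)

Phi = gaussian_features(x_train, centers)      # (N, M)
Phi_new = gaussian_features(x_new, centers)    # (N_new, M)

# Primal form: beta * phi(x')^T S_N Phi^T t, with an M x M inverse.
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
mu_primal = beta * Phi_new @ S_N @ Phi.T @ t_train

# Dual form: only the inner products Phi Phi^T and Phi_new Phi^T are needed (N x N solve).
mu_dual = dual_mean(Phi @ Phi.T, Phi_new @ Phi.T, t_train, ridge=alpha / beta)

# Kernel trick: replace the inner products by kernel evaluations; no phi is ever formed.
mu_rbf = dual_mean(rbf_kernel(x_train, x_train), rbf_kernel(x_new, x_train),
                   t_train, ridge=alpha / beta)
```

`mu_primal` and `mu_dual` should agree up to floating-point error, while `mu_rbf` is a different model, since the squared-exponential kernel corresponds to a different (implicit) feature space. The dual route solves an $N \times N$ system instead of inverting an $M \times M$ matrix, so its cost scales with the number of training points rather than the number of basis functions.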

From Bayesian Regression to Gaussian Processes

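Applying the kernel trick fully leads directly to Gaussian processes. Under the prior on $\mathbf{w}$, the function $f(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x})$ has zero mean and covariance $\mathbb{E}[f(\mathbf{x})\,f(\mathbf{x}')] = \frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}') = k(\mathbf{x}, \mathbf{x}')$, so the prior on $\mathbf{w}$ defines a prior over functions. Note that this $k$ is the prior covariance function, not the equivalent kernel of Section 2. The GP framework specifies the distribution over functions directly through the kernel, without reference to an explicit $\mathbf{w}$ at all. This is developed in Lecture 12.3.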
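As a rough numerical illustration of the weight-space prior inducing a function-space prior, the sketch below draws many weight vectors from $\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, evaluates the resulting functions on a small grid, and compares their empirical covariance with $\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$. The Gaussian feature map, grid, sample count, and value of $\alpha$ are assumptions for the example.

```python
import numpy as np

def gaussian_features(x, centers, width=0.1):
    """Gaussian basis functions evaluated at x (illustrative feature map)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
alpha = 2.0                                   # assumed prior precision
centers = np.linspace(0.0, 1.0, 9)
x_grid = np.linspace(0.0, 1.0, 6)
Phi = gaussian_features(x_grid, centers)      # shape (6, 9)

# Kernel matrix implied by the prior: k(x, x') = (1/alpha) * phi(x)^T phi(x').
K = Phi @ Phi.T / alpha

# Weight-space view: sample w ~ N(0, alpha^{-1} I) and evaluate f(x) = w^T phi(x).
n_samples = 200_000
W = rng.normal(0.0, np.sqrt(1.0 / alpha), size=(n_samples, centers.size))
F = W @ Phi.T                                 # each row is one sampled function on x_grid

# Empirical covariance of the function values (the mean is known to be zero).
K_empirical = F.T @ F / n_samples
max_abs_error = np.abs(K_empirical - K).max()
```

The discrepancy `max_abs_error` should be small and shrink as `n_samples` grows. A Gaussian process skips the sampling of `W` altogether and works with `K` directly.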