Lecture 12.2

Kernelizing Bayesian Regression

By rewriting Bayesian linear regression in dual form, predictions become weighted sums of kernel evaluations. This reveals that the equivalent kernel we met in Lecture 5.1 was already a kernel in the sense used by kernel methods, and it motivates the move to Gaussian processes.

Learning Objectives
  • Recall the Bayesian linear regression posterior and predictive distribution.
  • Show that the predictive mean can be written as a weighted sum of kernel evaluations (equivalent kernel form).
  • Identify the limitations of the parametric approach that motivate kernelization.
  • Explain how the kernel trick connects Bayesian linear regression to Gaussian processes.

1. Bayesian Linear Regression Recap

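From Lectures 4.3–4.5, the Bayesian linear regression model with Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ and Gaussian noise with precision $\beta$ gives the following, where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_n)^\top$ and $\mathbf{t}$ is the vector of training targets: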

  • Posterior: $p(\mathbf{w} | \mathcal{D}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}$.
  • Predictive mean: $\mu(\mathbf{x}') = \mathbf{m}_N^\top \boldsymbol{\phi}(\mathbf{x}') = \beta\, \boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$.
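The recap formulas can be sketched numerically as follows. This is a minimal illustration, not the lecture's own code: the Gaussian basis, its centers and width, the toy sine data, and the values of `alpha` and `beta` are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Design matrix: one row per input, one Gaussian bump per center."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def posterior(Phi, t, alpha, beta):
    """Return (m_N, S_N) for prior N(0, alpha^-1 I) and noise precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # M x M precision matrix
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Toy data (illustrative assumption, not from the lecture).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(20)

centers = np.linspace(0.0, 1.0, 9)   # hypothetical basis centers
alpha, beta = 2.0, 25.0              # hypothetical precisions

Phi = gaussian_basis(x_train, centers, width=0.15)
m_N, S_N = posterior(Phi, t_train, alpha, beta)

# Predictive mean at new inputs: mu(x') = m_N^T phi(x').
x_new = np.array([0.25, 0.75])
Phi_new = gaussian_basis(x_new, centers, width=0.15)
mu_new = Phi_new @ m_N
```

Forming the explicit inverse mirrors the formula for $\mathbf{S}_N$; in practice a linear solve is usually preferred for numerical stability.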

2. The Equivalent Kernel Form

Defining $k(\mathbf{x}', \mathbf{x}_n) = \beta\,\boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n)$, the predictive mean becomes:

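Predictive Mean as Kernel Smoother

$$\mu(\mathbf{x}') = \sum_{n=1}^{N} k(\mathbf{x}', \mathbf{x}_n)\, t_n = \mathbf{k}(\mathbf{x}')^\top \mathbf{t},$$

where $\mathbf{k}(\mathbf{x}') = \big(k(\mathbf{x}', \mathbf{x}_1), \dots, k(\mathbf{x}', \mathbf{x}_N)\big)^\top$ collects the $N$ kernel weights.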

This is the equivalent kernel from Lecture 5.1 — the predictive mean is already a weighted sum over training targets, with weights given by a kernel. The Bayesian regression model was implicitly a kernel method all along.
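As a quick check of the smoother view, the sketch below computes the equivalent-kernel weights $\beta\,\boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n)$ and confirms that weighting the targets by them reproduces $\mathbf{m}_N^\top\boldsymbol{\phi}(\mathbf{x}')$. The helper names, the Gaussian basis, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def features(x, centers, width=0.15):
    """Gaussian basis features, shape (len(x), len(centers))."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def equivalent_kernel(x_new, x_train, centers, alpha, beta):
    """Matrix of weights k(x', x_n) = beta * phi(x')^T S_N phi(x_n)."""
    Phi = features(x_train, centers)
    Phi_new = features(x_new, centers)
    S_N = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
    return beta * Phi_new @ S_N @ Phi.T   # shape (len(x_new), N)

# Toy setup (illustrative assumption).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(20)
centers = np.linspace(0.0, 1.0, 9)
alpha, beta = 2.0, 25.0
x_new = np.array([0.1, 0.5, 0.9])

# Route 1: kernel smoother, a weighted sum of training targets.
K_eq = equivalent_kernel(x_new, x_train, centers, alpha, beta)
mu_smoother = K_eq @ t_train

# Route 2: explicit posterior mean over weights.
Phi = features(x_train, centers)
S_N = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train
mu_weights = features(x_new, centers) @ m_N
```

Each row of `K_eq` holds the weights placed on the training targets for one test input, so plotting a row against `x_train` shows which training points influence that prediction.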

3. Limitations of the Parametric Approach

The equivalent kernel $k(\mathbf{x}', \mathbf{x}_n) = \beta\,\boldsymbol{\phi}(\mathbf{x}')^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n)$ is derived from $\boldsymbol{\phi}$, which we must choose manually. Two problems arise:

  • Feature design: choosing appropriate basis functions in high dimensions is difficult. Poorly placed basis functions (e.g., a Gaussian basis centered far from the data) contribute nothing.
  • Computation: obtaining $\mathbf{S}_N$ requires inverting the $M \times M$ matrix $\mathbf{S}_N^{-1}$, which costs $O(M^3)$. More expressive models need larger $M$, so the cost grows quickly.

4. The Kernel Trick as the Solution

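The equivalent kernel still contains $\mathbf{S}_N$, so it is not yet expressed through inner products alone. Using the identity $(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top = \boldsymbol{\Phi}^\top(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}\boldsymbol{\Phi}^\top)^{-1}$, the predictive mean takes the dual form $\mu(\mathbf{x}') = \tilde{\mathbf{k}}(\mathbf{x}')^\top(\mathbf{K} + \beta^{-1}\mathbf{I})^{-1}\mathbf{t}$, where $K_{nm} = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}_n)^\top\boldsymbol{\phi}(\mathbf{x}_m)$ and $\tilde{k}_n(\mathbf{x}') = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}')^\top\boldsymbol{\phi}(\mathbf{x}_n)$ (the tilde distinguishes this vector from the equivalent-kernel weights of Section 2). In this form the data enter only through inner products of feature vectors, so we can apply the kernel trick: replace the scaled inner product $\alpha^{-1}\boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ with any valid kernel $k(\mathbf{x}, \mathbf{x}')$. This allows implicitly infinite-dimensional feature spaces without specifying $\boldsymbol{\phi}$.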
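The sketch below checks numerically that the weight-space predictive mean $\beta\,\boldsymbol{\phi}(\mathbf{x}')^\top\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}$ agrees with a dual computation that touches the features only through the Gram matrix $\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$, and then swaps in a different kernel without building any features. The function names, the Gaussian basis, the squared-exponential kernel with its length scale, and the toy data are assumptions for illustration.

```python
import numpy as np

alpha, beta = 2.0, 25.0                # hypothetical precisions
centers = np.linspace(0.0, 1.0, 6)     # hypothetical basis centers

def features(x):
    """Gaussian basis features, shape (len(x), M)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.2) ** 2)

def feature_kernel(a, b):
    """Kernel induced by the explicit features: phi(a)^T phi(b) / alpha."""
    return features(a) @ features(b).T / alpha

def rbf_kernel(a, b, length=0.2):
    """Squared-exponential kernel, evaluated without any feature vector."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def dual_mean(kernel, x_train, t, x_new):
    """Dual-form predictive mean: k(x')^T (K + I/beta)^{-1} t (N x N solve)."""
    K = kernel(x_train, x_train)
    k_new = kernel(x_new, x_train)
    return k_new @ np.linalg.solve(K + np.eye(len(t)) / beta, t)

def primal_mean(x_train, t, x_new):
    """Weight-space predictive mean: beta * phi(x')^T S_N Phi^T t (M x M inverse)."""
    Phi = features(x_train)
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    return features(x_new) @ (beta * S_N @ Phi.T @ t)

# Toy data (illustrative assumption).
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, 15)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(15)
x_new = np.linspace(0.0, 1.0, 5)

mu_primal = primal_mean(x_train, t_train, x_new)
mu_dual = dual_mean(feature_kernel, x_train, t_train, x_new)   # same model, dual form
mu_rbf = dual_mean(rbf_kernel, x_train, t_train, x_new)        # kernel swapped in
```

`mu_primal` and `mu_dual` should agree to numerical precision, while `mu_rbf` is a different model that reuses the same dual machinery. Note the trade: the dual form solves an $N \times N$ system instead of inverting an $M \times M$ matrix.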

From Bayesian Regression to Gaussian Processes

Applying the kernel trick fully — treating the prior on $\mathbf{w}$ as defining a prior over functions via $k(\mathbf{x}, \mathbf{x}') = \frac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$ — leads directly to Gaussian processes. The GP framework specifies the distribution over functions directly through the kernel, without reference to an explicit $\mathbf{w}$ at all. This is developed in Lecture 12.3.
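To make the function-space view concrete, the sketch below draws weights from the prior $\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, evaluates the resulting functions $\boldsymbol{\phi}(\mathbf{x})^\top\mathbf{w}$ at a few inputs, and compares their empirical covariance with the kernel matrix $\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$. The basis, the evaluation points, the sample count, and the tolerance are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 2.0                                  # hypothetical prior precision

# Evaluation inputs and a Gaussian basis (both illustrative assumptions).
x = np.linspace(-1.0, 1.0, 4)
centers = np.linspace(-1.0, 1.0, 5)
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.5) ** 2)   # shape (4, 5)

# Kernel matrix implied by the weight prior: K = Phi Phi^T / alpha.
K = Phi @ Phi.T / alpha

# Weight-space view: sample w ~ N(0, alpha^-1 I), then evaluate f = Phi w.
n_samples = 200_000
W = rng.standard_normal((n_samples, centers.size)) / np.sqrt(alpha)
F = W @ Phi.T                                # each row: one function at the 4 inputs

# Empirical covariance of the function values (the prior mean is zero).
K_empirical = F.T @ F / n_samples

max_abs_error = np.max(np.abs(K_empirical - K))
```

Up to Monte Carlo error, `K_empirical` matches `K`, so the function values are jointly Gaussian with covariance given by the kernel. A Gaussian process takes that covariance as the starting point and never needs `W`.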