Lecture 12.3
Gaussian Processes
A Gaussian process is a distribution over functions: every time you "sample" from it, you get a random function. The distribution is fully specified by a mean function and a kernel, which together determine the prior beliefs about what the function looks like before any data is observed.
- Define a Gaussian process (GP) and state its defining property.
- Express a GP through its mean function $m(\mathbf{x})$ and covariance (kernel) function $k(\mathbf{x}, \mathbf{x}')$.
- Show that evaluating a GP at finitely many points gives a multivariate Gaussian.
- Connect GPs to Bayesian linear regression: the prior over $\mathbf{w}$ induces a GP with a specific kernel.
- Describe how to sample GP functions using the reparameterization trick.
1. Definition
A Gaussian process is a collection of random variables — one for each point in an input domain — such that any finite subcollection is jointly Gaussian distributed. We write
$$f \sim \mathcal{GP}\bigl(m(\cdot),\; k(\cdot, \cdot)\bigr),$$
where the mean function $m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$ and the covariance function (kernel) $k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}[f(\mathbf{x}), f(\mathbf{x}')]$ fully characterize the distribution. The kernel must be a valid (positive semi-definite) function.
Think of $f$ as an infinitely high-dimensional random vector: one component per input point. Because the input space is continuous, we work with functions rather than vectors.
2. Finite Evaluations Are Multivariate Gaussian
For any finite set of input points $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the vector of function values $(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))$ is jointly Gaussian:
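$$\mathbf{f} = \bigl(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N)\bigr)^\top \sim \mathcal{N}(\mathbf{m}, \mathbf{K}),$$
where $\mathbf{m}$ is the mean vector with entries $m_n = m(\mathbf{x}_n)$ and $\mathbf{K}$ is the Gram matrix with entries $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$. These finite-dimensional distributions are mutually consistent because of the marginalization property of multivariate Gaussians (Lecture 12.1): dropping some of the points leaves a Gaussian whose mean and covariance are the corresponding sub-vector of $\mathbf{m}$ and sub-block of $\mathbf{K}$.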
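As a small illustration of the finite-evaluation view, the sketch below builds $\mathbf{m}$ and $\mathbf{K}$ on five points and checks that marginalizing onto a subset just selects the matching entries of $\mathbf{K}$. The squared-exponential kernel, the zero mean, and the helper names `rbf_kernel` and `zero_mean` are assumptions chosen for the example, not part of the lecture.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs (assumed choice)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def zero_mean(x):
    """Zero mean function m(x) = 0 (assumed choice)."""
    return np.zeros_like(x, dtype=float)

x = np.linspace(-2.0, 2.0, 5)   # N = 5 input points
m = zero_mean(x)                # mean vector, m_n = m(x_n)
K = rbf_kernel(x, x)            # Gram matrix, K_nm = k(x_n, x_m)

# Marginalizing onto points 0, 2, 4 keeps the matching rows and columns of K.
idx = np.array([0, 2, 4])
K_sub = K[np.ix_(idx, idx)]

# Building the Gram matrix directly on the subset gives the same matrix.
K_direct = rbf_kernel(x[idx], x[idx])
print(np.allclose(K_sub, K_direct))
```

The agreement between `K_sub` and `K_direct` is what makes it safe to talk about one GP rather than a separate Gaussian for every choice of input points.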
3. GP from Bayesian Linear Regression
Bayesian linear regression with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ defines a GP on $f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$:
- Mean: $\mathbb{E}[f(\mathbf{x})] = \mathbb{E}[\mathbf{w}]^\top\boldsymbol{\phi}(\mathbf{x}) = \mathbf{0}$.
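- Covariance: because the mean is zero, $\mathrm{Cov}[f(\mathbf{x}), f(\mathbf{x}')] = \mathbb{E}[f(\mathbf{x})f(\mathbf{x}')] = \boldsymbol{\phi}(\mathbf{x})^\top \mathbb{E}[\mathbf{w}\mathbf{w}^\top] \boldsymbol{\phi}(\mathbf{x}')$. With $\mathbb{E}[\mathbf{w}\mathbf{w}^\top] = \alpha^{-1}\mathbf{I}$ this gives $\tfrac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$.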
So $f \sim \mathcal{GP}(0,\; k(\mathbf{x}, \mathbf{x}'))$ with $k(\mathbf{x}, \mathbf{x}') = \tfrac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$. Bayesian linear regression is a special case of a GP.
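The kernel derived above can be checked numerically. The sketch below draws many weight vectors from the prior, evaluates $f(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x})$ on a few points, and compares the empirical covariance with $\tfrac{1}{\alpha}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}')$. The cubic polynomial features, the value $\alpha = 2$, and the sample count are assumptions made only for this illustration.

```python
import numpy as np

def phi(x, degree=3):
    """Hypothetical polynomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

alpha = 2.0                            # prior precision (assumed value)
x = np.linspace(-1.0, 1.0, 4)          # N = 4 input points
Phi = phi(x)                           # design matrix, shape (N, M)

# Kernel predicted by the derivation: k(x_n, x_m) = phi(x_n)^T phi(x_m) / alpha.
K_kernel = Phi @ Phi.T / alpha

# Monte Carlo check: each row of W is one draw w ~ N(0, I / alpha).
rng = np.random.default_rng(0)
W = rng.normal(scale=alpha ** -0.5, size=(200_000, Phi.shape[1]))
F = W @ Phi.T                          # each row holds f at the N points

# The mean is zero, so the average outer product estimates the covariance.
K_empirical = F.T @ F / F.shape[0]
max_abs_error = np.max(np.abs(K_empirical - K_kernel))
print(max_abs_error)
```

The two matrices agree up to Monte Carlo noise, which shrinks as the number of sampled weight vectors grows.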
Think of a continuous function as a vector of infinite dimension, with one entry for every point in the input space. A GP is a distribution over such infinite-dimensional vectors, defined consistently across all finite sub-collections by the mean function and the kernel. With an expressive kernel (e.g., Gaussian RBF), the implicit feature space is infinite-dimensional, so the model is not restricted to a fixed, finite set of basis functions.
4. Sampling from a GP
To draw a sample function, evaluate the GP on a finite grid $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$:
- Build the Gram matrix $\mathbf{K}$ with $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$.
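- Factorize $\mathbf{K} = \mathbf{L}\mathbf{L}^\top$ (Cholesky). Alternatively, compute the eigendecomposition $\mathbf{K} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$ and take $\mathbf{L} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}$, which also satisfies $\mathbf{L}\mathbf{L}^\top = \mathbf{K}$.
- Sample $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and set $\mathbf{f} = \mathbf{m} + \mathbf{L}\boldsymbol{\varepsilon}$. Then $\mathbb{E}[\mathbf{f}] = \mathbf{m}$ and $\mathrm{Cov}[\mathbf{f}] = \mathbf{L}\,\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\top]\,\mathbf{L}^\top = \mathbf{K}$, as required.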
As the grid becomes denser, the sampled vector traces a continuous random function. The kernel controls how smooth and structured these sampled functions look (explored in Lecture 12.4).
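The sampling recipe in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the squared-exponential kernel, the zero mean, the grid, and the function names `rbf_kernel` and `sample_gp` are choices made for the example. The small `jitter` term added to the diagonal is also an assumption; it is a common numerical safeguard that keeps the Cholesky factorization stable on dense grids, not a step from the lecture.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=0.5):
    """Squared-exponential kernel between two 1-D arrays of inputs (assumed choice)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_gp(x, mean_fn, kernel_fn, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples functions from GP(mean_fn, kernel_fn) on the grid x."""
    rng = np.random.default_rng(seed)
    m = mean_fn(x)                                  # mean vector
    K = kernel_fn(x, x) + jitter * np.eye(len(x))   # Gram matrix plus jitter
    L = np.linalg.cholesky(K)                       # K = L L^T, L lower triangular
    eps = rng.standard_normal((len(x), n_samples))  # eps ~ N(0, I), one column per sample
    return m[:, None] + L @ eps                     # f = m + L eps

x = np.linspace(0.0, 5.0, 100)
samples = sample_gp(x, lambda t: np.zeros_like(t), rbf_kernel)
print(samples.shape)
```

Each column of `samples` is one sampled function evaluated on the grid; plotting the columns against `x` shows the random functions, and changing `lengthscale` changes how quickly they vary.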