Lecture 12.4

GPs: Exponential Kernel

By choosing a specific kernel, we control the qualitative character of the functions sampled from a GP — their smoothness, amplitude, global trend, and noise. The exponential (RBF) kernel provides a flexible four-parameter family that covers a wide range of behaviors.

Learning Objectives

Write the four-parameter exponential kernel and identify each parameter's role.
Explain how $\theta_0$ (amplitude), $\theta_1$ (length scale), $\theta_2$ (constant offset), and $\theta_3$ (linear trend) affect GP samples.
Describe the limiting behavior when $\theta_1 \to \infty$ (white noise) and $\theta_1 \to 0$ (very smooth, nearly constant).

1. The Exponential Kernel

Four-Parameter Exponential Kernel $$k(x_n, x_m) = \theta_0 \exp\!\Bigl(-\frac{\theta_1}{2}(x_n - x_m)^2\Bigr) + \theta_2 + \theta_3\, x_n x_m.$$

Each parameter shapes a distinct qualitative property of the sampled functions:

$\theta_0 \geq 0$: amplitude. Scales the overall variance. Larger $\theta_0$ produces functions with greater fluctuation magnitude.
$\theta_1 \geq 0$: inverse length scale. Controls how quickly correlation decays with distance. Large $\theta_1$ means short length scale → wiggly functions; small $\theta_1$ means long length scale → smooth functions.
$\theta_2 \geq 0$: constant offset. Adds position-independent correlation: all function values are equally correlated regardless of distance. In the extreme ($\theta_0 = \theta_3 = 0$), functions are flat random constants — straight horizontal lines.
$\theta_3 \geq 0$: linear trend. The term $x_n x_m$ is an inner product; it adds a global linear drift to sampled functions. In the extreme, samples are random lines through the origin.
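The kernel above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: the function name `exp_kernel`, the input grid, and the parameter values are all assumptions chosen for the example.

```python
import numpy as np

def exp_kernel(xa, xb, theta):
    """Gram matrix of the four-parameter kernel for 1-D inputs xa (N,) and xb (M,)."""
    theta0, theta1, theta2, theta3 = theta
    xa = np.asarray(xa, dtype=float)[:, None]  # column, shape (N, 1)
    xb = np.asarray(xb, dtype=float)[None, :]  # row, shape (1, M)
    # Smooth component: amplitude theta0, inverse length scale theta1.
    smooth = theta0 * np.exp(-0.5 * theta1 * (xa - xb) ** 2)
    # Constant offset theta2 and linear term theta3 * x_n * x_m.
    return smooth + theta2 + theta3 * xa * xb

x = np.linspace(-1.0, 1.0, 5)
theta = (1.0, 4.0, 0.5, 2.0)  # illustrative values, not from the lecture
K = exp_kernel(x, x, theta)

# Diagonal entries equal theta0 + theta2 + theta3 * x**2.
print(np.diag(K))
```

Because of the linear term, the prior variance on the diagonal grows with $x^2$; with $\theta_3 = 0$ it would be the constant $\theta_0 + \theta_2$.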

2. Effect of the Inverse Length Scale $\theta_1$

Smoothness vs. Wiggliness

With $\theta_2 = \theta_3 = 0$ and fixed $\theta_0$:

Large $\theta_1$ (short length scale): the exponential decays rapidly as $|x_n - x_m|$ grows. Points just slightly apart are nearly uncorrelated. Sampled functions are highly oscillatory.
Small $\theta_1$ (long length scale): the exponential remains close to 1 even for distant points. All points co-vary strongly. Sampled functions are very smooth and slowly varying.
$\theta_1 \to \infty$: for distinct inputs the Gram matrix tends to $\theta_0 I$ ($K_{nm} \to 0$ for $n \neq m$). Sampled "functions" are independent noise at each point: white noise.
$\theta_1 \to 0$: every entry of the Gram matrix tends to $\theta_0$, so all points are perfectly correlated and the samples flatten into random constants.
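The role of $\theta_1$ as an inverse length scale, as in the kernel definition of Section 1, can be checked numerically. The sketch below assumes $\theta_2 = \theta_3 = 0$; the helper name `rbf_gram`, the grid, and the parameter values are illustrative choices, not part of the lecture.

```python
import numpy as np

def rbf_gram(x, theta0, theta1):
    """Gram matrix of theta0 * exp(-theta1/2 * (x_n - x_m)^2) on a 1-D grid."""
    d = x[:, None] - x[None, :]
    return theta0 * np.exp(-0.5 * theta1 * d ** 2)

x = np.linspace(0.0, 1.0, 6)  # neighbouring points are 0.2 apart
theta0 = 2.0

# Correlation between neighbouring points shrinks as theta1 grows.
for theta1 in (0.01, 1.0, 100.0, 1e4):
    K = rbf_gram(x, theta0, theta1)
    corr = K[0, 1] / theta0
    print(f"theta1={theta1:g}  neighbour correlation={corr:.3g}")

# Limits: very large theta1 gives theta0 * I (white noise),
# very small theta1 gives a constant matrix (samples are flat).
K_rough = rbf_gram(x, theta0, 1e4)
K_flat = rbf_gram(x, theta0, 1e-6)
```

Here `K_rough` is numerically indistinguishable from $\theta_0 I$, while every entry of `K_flat` is close to $\theta_0$.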

3. The Offset Term $\theta_2$

The constant $\theta_2$ contributes a rank-1 matrix $\theta_2 \mathbf{1}\mathbf{1}^\top$ to the Gram matrix. Every pair of points receives the same additional covariance $\theta_2$ regardless of distance. This is equivalent to adding a single shared random variable $c \sim \mathcal{N}(0, \theta_2)$ to all function values: each sample is shifted up or down as a whole, so the offset changes a sample's vertical position but not its shape. With $\theta_2$ alone, the samples are horizontal lines at random heights.
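The rank-1 structure can be verified directly. The following sketch, with an illustrative value of $\theta_2$ and an arbitrary grid (both assumptions), shows that the offset matrix has a single nonzero eigenvalue, $N\theta_2$, whose eigenvector is the constant vector, so samples drawn along it are flat.

```python
import numpy as np

theta2 = 1.5  # illustrative value
x = np.linspace(-1.0, 1.0, 8)
N = x.size
K_offset = theta2 * np.ones((N, N))  # theta2 * 1 1^T

# eigh returns eigenvalues in ascending order: N - 1 (numerical) zeros, then N * theta2.
eigvals, eigvecs = np.linalg.eigh(K_offset)
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

# Sampling along the only nonzero direction gives f = sqrt(top_val) * z * top_vec,
# which is the constant function c with c ~ N(0, theta2), up to the sign of z.
rng = np.random.default_rng(0)
z = rng.standard_normal(4)
samples = np.sqrt(top_val) * z[:, None] * top_vec[None, :]
spread = samples.max(axis=1) - samples.min(axis=1)
print(eigvals.round(6), spread)
```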

4. The Linear Term $\theta_3$

The term $\theta_3 x_n x_m$ is a linear kernel. It corresponds to the feature map $\phi(x) = \sqrt{\theta_3}\, x$, since $\phi(x_n)\phi(x_m) = \theta_3 x_n x_m$. Points with the same sign of $x$ co-vary positively; points with opposite signs co-vary negatively. When $\theta_0 = \theta_2 = 0$, sampled functions are linear: $f(x) = ax$ with a random slope $a \sim \mathcal{N}(0, \theta_3)$, giving random lines through the origin.
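As a quick check of the feature-map view, the sketch below builds the linear Gram matrix both ways and draws a few random lines. The value of $\theta_3$, the grid, and the variable names are illustrative assumptions.

```python
import numpy as np

theta3 = 2.0  # illustrative value
x = np.linspace(-1.0, 1.0, 5)

phi = np.sqrt(theta3) * x          # feature map phi(x) = sqrt(theta3) * x
K_lin = theta3 * np.outer(x, x)    # entries theta3 * x_n * x_m

# The Gram matrix is the outer product of the features, hence rank 1.
print(np.allclose(K_lin, np.outer(phi, phi)), np.linalg.matrix_rank(K_lin))

# Samples from this prior: f(x) = a * x with a ~ N(0, theta3).
rng = np.random.default_rng(1)
slopes = np.sqrt(theta3) * rng.standard_normal(3)
lines = slopes[:, None] * x[None, :]  # each row is one sampled line
```

Every row of `lines` passes through the origin, and the sign of `K_lin[n, m]` matches the sign of $x_n x_m$.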

5. Combining Components

Since sums of valid kernels are valid (Lecture 11.2), the full kernel is itself a valid kernel. It combines three components governed by the four parameters: a smooth exponentially correlated component with amplitude $\theta_0$ and inverse length scale $\theta_1$, a global offset from $\theta_2$, and a linear trend from $\theta_3$. This composition allows GP priors that capture several real-world function behaviors simultaneously: smooth local variations, global baselines, and linear trends. The hyperparameters $\theta_0, \theta_1, \theta_2, \theta_3$ are tuned by maximizing the marginal likelihood (Lecture 12.5).
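To see the combination at work, one can build the three component Gram matrices, confirm that each one and their sum are positive semidefinite, and draw samples from the resulting zero-mean prior. This is a sketch under assumptions: the parameter values, the grid, and the small diagonal jitter added for a numerically stable Cholesky factorization are choices made for this example, not part of the lecture.

```python
import numpy as np

def gram_parts(x, theta0, theta1, theta2, theta3):
    """Return the smooth, offset, and linear Gram matrices separately."""
    d = x[:, None] - x[None, :]
    K_smooth = theta0 * np.exp(-0.5 * theta1 * d ** 2)
    K_offset = theta2 * np.ones((x.size, x.size))
    K_linear = theta3 * np.outer(x, x)
    return K_smooth, K_offset, K_linear

x = np.linspace(-1.0, 1.0, 50)
parts = gram_parts(x, 1.0, 10.0, 0.5, 2.0)  # illustrative parameter values
K = parts[0] + parts[1] + parts[2]

# Smallest eigenvalue of each component and of the sum (zero up to round-off).
min_eigs = [np.linalg.eigvalsh(P).min() for P in (*parts, K)]
print(min_eigs)

# Draw three functions from N(0, K) via a Cholesky factor of K + jitter * I.
jitter = 1e-6  # numerical safeguard, an assumption of this sketch
L = np.linalg.cholesky(K + jitter * np.eye(x.size))
rng = np.random.default_rng(2)
samples = (L @ rng.standard_normal((x.size, 3))).T  # shape (3, 50)
```

Plotting the rows of `samples` against `x` should show smooth wiggles superimposed on a random vertical shift and a random tilt.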