Lecture 4.3

Gaussian Posteriors

Bayesian linear regression: the posterior over weights is Gaussian, enabling exact inference in closed form.

Learning Objectives

Write the Gaussian likelihood for a single data point and for the full dataset.
State the Gaussian prior over weights and explain the conjugacy property.
Give the closed-form Gaussian posterior for the isotropic prior case.
Recover the MLE solution as the limit of the MAP estimate when the prior becomes flat.

1. Probabilistic Setting

We model the target $t'$ for a new input $x'$ as a Gaussian centered on a linear prediction in feature space:

$$p(t' \mid x', \mathbf{w}, \beta) = \mathcal{N}(t' \mid \boldsymbol{\phi}(x')^\top \mathbf{w},\, \beta^{-1}),$$

where $\boldsymbol{\phi}(x')$ is the feature vector (basis functions evaluated at $x'$), $\mathbf{w}$ is the weight vector, and $\beta$ is the noise precision, i.e. the inverse noise variance (a hyperparameter).

Assuming the $N$ observations are i.i.d., the joint likelihood over all targets $\mathbf{t} = (t_1,\dots,t_N)^\top$ is the product of the individual likelihoods. Each factor is a univariate Gaussian in $t_n$ with variance $\beta^{-1}$, so the product is a multivariate Gaussian in $\mathbf{t}$ with diagonal covariance:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w},\, \beta^{-1}\mathbf{I}),$$

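where $\mathbf{X}$ collects the inputs $x_1,\dots,x_N$ and $\boldsymbol{\Phi}$ is the design matrix introduced in Lecture 3.2, whose $n$-th row is $\boldsymbol{\phi}(x_n)^\top$.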
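The equivalence between the product of per-point likelihoods and the single multivariate Gaussian can be checked numerically. The sketch below is illustrative only: the polynomial basis, the helper names (`design_matrix`, `log_lik_pointwise`, `log_lik_joint`), and the synthetic data are assumptions made for this example, not part of the lecture.

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial features 1, x, ..., x^degree (an illustrative choice of basis)."""
    return np.vander(x, degree + 1, increasing=True)

def log_lik_pointwise(t, Phi, w, beta):
    """Sum of N univariate Gaussian log-densities N(t_n | phi_n^T w, 1/beta)."""
    resid = t - Phi @ w
    return np.sum(0.5 * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid**2)

def log_lik_joint(t, Phi, w, beta):
    """Log-density of the multivariate Gaussian N(t | Phi w, I/beta)."""
    n = t.shape[0]
    resid = t - Phi @ w
    cov = np.eye(n) / beta
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Synthetic data (assumed values, chosen only for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
w_true = np.array([0.5, -1.0, 2.0])
beta = 25.0
Phi = design_matrix(x, degree=2)
t = Phi @ w_true + rng.normal(scale=beta**-0.5, size=x.shape)

ll_point = log_lik_pointwise(t, Phi, w_true, beta)
ll_joint = log_lik_joint(t, Phi, w_true, beta)
print(ll_point, ll_joint)  # the two values should agree up to rounding
```

Working with log-densities turns the product over data points into a sum, which is also the numerically safer way to evaluate the likelihood.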

2. Gaussian Prior and Conjugacy

We place a Gaussian prior over the weights:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$

Conjugate Prior

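A prior is conjugate to a likelihood if the resulting posterior belongs to the same family of distributions as the prior. For a Gaussian likelihood whose mean is linear in $\mathbf{w}$ and whose precision $\beta$ is known, the Gaussian prior is conjugate: the posterior $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$ is again Gaussian. (The conditioning on $\mathbf{X}$ and $\beta$ is suppressed to keep the notation light.)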

This holds because both factors are exponentials of quadratic forms in $\mathbf{w}$, $\exp(-\tfrac{1}{2}(\cdots)^\top(\cdots))$; multiplying them adds the exponents, and a sum of quadratics in $\mathbf{w}$ is again quadratic in $\mathbf{w}$, which is the exponent of another (unnormalized) Gaussian.

3. Closed-Form Posterior

The posterior is $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$. For the isotropic prior $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ (zero-mean, precision $\alpha$), completing the square in $\mathbf{w}$ gives:

Gaussian Posterior: Isotropic Prior

$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi},$$

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\,\boldsymbol{\Phi}^\top\mathbf{t}.$$
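The completing-the-square step can be sketched as follows, using the likelihood from Section 1 and the isotropic prior, and absorbing every term that does not depend on $\mathbf{w}$ into "const":

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^\top(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const}$$

$$= -\frac{1}{2}\mathbf{w}^\top\left(\alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\right)\mathbf{w} + \beta\,\mathbf{w}^\top\boldsymbol{\Phi}^\top\mathbf{t} + \text{const}.$$

A Gaussian $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ has log-density

$$-\frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^\top\mathbf{S}_N^{-1}(\mathbf{w} - \mathbf{m}_N) + \text{const} = -\frac{1}{2}\mathbf{w}^\top\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^\top\mathbf{S}_N^{-1}\mathbf{m}_N + \text{const}.$$

Matching the quadratic terms gives $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$, and matching the linear terms gives $\mathbf{S}_N^{-1}\mathbf{m}_N = \beta\,\boldsymbol{\Phi}^\top\mathbf{t}$, i.e. $\mathbf{m}_N = \beta\,\mathbf{S}_N\,\boldsymbol{\Phi}^\top\mathbf{t}$.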

The MAP estimate — the mode of the posterior — equals the mean $\mathbf{m}_N$, because the mode of a Gaussian is its mean.
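As a minimal sketch, the two posterior formulas translate almost directly into NumPy. The function name `posterior`, the two-feature basis, and the synthetic data below are assumptions made for illustration; only the formulas for $\mathbf{S}_N$ and $\mathbf{m}_N$ come from the text.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Return (m_N, S_N) for the prior N(w | 0, I/alpha) and noise precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    # Solve the linear system for the mean instead of multiplying by the inverse
    m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)
    return m_N, S_N

# Synthetic data (assumed values, chosen only for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
Phi = np.column_stack([np.ones_like(x), x])   # features: 1 and x
alpha, beta = 2.0, 25.0
t = Phi @ np.array([-0.3, 0.5]) + rng.normal(scale=beta**-0.5, size=x.shape)

m_N, S_N = posterior(Phi, t, alpha, beta)
print("posterior mean:", m_N)
print("posterior std of each weight:", np.sqrt(np.diag(S_N)))
```

The explicit inverse is computed here only because $\mathbf{S}_N$ itself is of interest as the uncertainty estimate; if only the mean were needed, the linear solve alone would suffice.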

Connection to Ridge Regression

The MAP estimate $\mathbf{m}_N$ under the isotropic Gaussian prior is identical to the ridge regression solution with $\lambda = \alpha/\beta$ — established in Lecture 3.5. The Bayesian framework additionally yields the full posterior covariance $\mathbf{S}_N$, which quantifies uncertainty in the weights.
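The correspondence with ridge regression can be checked numerically. The sketch below assumes the ridge objective $\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$, whose minimizer is $(\lambda\mathbf{I} + \boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$; the random data and the particular values of $\alpha$ and $\beta$ are arbitrary choices for illustration.

```python
import numpy as np

# Random data (assumed, for illustration only)
rng = np.random.default_rng(2)
N, M = 40, 4
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha, beta = 3.0, 10.0

# Posterior mean: m_N = beta * S_N * Phi^T t
S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)

# Ridge solution with lam = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

max_diff = np.max(np.abs(m_N - w_ridge))
print("max |m_N - w_ridge| =", max_diff)  # should be at rounding-error level
```

Dividing $\mathbf{S}_N^{-1}$ by $\beta$ shows why only the ratio $\alpha/\beta$ affects the mean, whereas the covariance $\mathbf{S}_N$ depends on $\alpha$ and $\beta$ separately.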

4. Limiting Cases

Flat Prior Recovers MLE ($\alpha \to 0$)

As $\alpha \to 0$, the prior becomes infinitely broad and carries no information about the weights. Provided $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is invertible, $\mathbf{S}_N^{-1} \to \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and

$$\mathbf{m}_N \to \beta \cdot (\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = \mathbf{w}_{\mathrm{MLE}}.$$

This is the pseudoinverse solution from Lecture 3.2. With no prior information, the posterior mean agrees with maximum likelihood.
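A small numerical sketch of this limit: as $\alpha$ shrinks, the posterior mean should approach the least-squares solution. The data, the value of $\beta$, and the helper name `posterior_mean` are assumptions for illustration; `np.linalg.lstsq` stands in for the pseudoinverse solution.

```python
import numpy as np

# Synthetic data (assumed values, chosen only for illustration)
rng = np.random.default_rng(3)
N, M = 50, 3
Phi = rng.normal(size=(N, M))   # full column rank with probability 1
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
beta = 100.0

# Maximum-likelihood (least-squares) weights
w_mle, *_ = np.linalg.lstsq(Phi, t, rcond=None)

def posterior_mean(alpha):
    """m_N for the isotropic prior with precision alpha."""
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    return beta * np.linalg.solve(S_N_inv, Phi.T @ t)

# Distance between m_N and w_MLE for progressively flatter priors
gaps = [np.linalg.norm(posterior_mean(a) - w_mle) for a in (1.0, 1e-3, 1e-6)]
print(gaps)  # the gap should shrink as alpha decreases
```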

Narrow Prior Forces Weights to Zero ($\alpha \to \infty$)

As $\alpha \to \infty$ with $\beta$ and the data held fixed, the $\alpha\mathbf{I}$ term dominates $\mathbf{S}_N^{-1}$, so $\mathbf{S}_N \approx \alpha^{-1}\mathbf{I} \to \mathbf{0}$ and $\mathbf{m}_N \approx (\beta/\alpha)\,\boldsymbol{\Phi}^\top\mathbf{t} \to \mathbf{0}$. The posterior collapses to a delta function at zero: the model ignores the data entirely in favor of the prior belief that the weights are zero.
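The opposite limit can be sketched the same way: increasing $\alpha$ should shrink both the posterior mean and the posterior covariance toward zero. As before, the data, $\beta$, and the helper name `posterior` are illustrative assumptions.

```python
import numpy as np

# Synthetic data (assumed values, chosen only for illustration)
rng = np.random.default_rng(4)
N, M = 50, 3
Phi = rng.normal(size=(N, M))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
beta = 100.0

def posterior(alpha):
    """Return (m_N, S_N) for the isotropic prior with precision alpha."""
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return beta * S_N @ Phi.T @ t, S_N

norms, traces = [], []
for alpha in (1.0, 1e4, 1e8):
    m_N, S_N = posterior(alpha)
    norms.append(np.linalg.norm(m_N))   # size of the posterior mean
    traces.append(np.trace(S_N))        # total posterior variance
    print(f"alpha={alpha:g}  |m_N|={norms[-1]:.2e}  trace(S_N)={traces[-1]:.2e}")
```

Both printed quantities should decrease as $\alpha$ grows, with the trace of $\mathbf{S}_N$ approaching $M/\alpha$.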