Lecture 4.3

Gaussian Posteriors

Bayesian linear regression: the posterior over weights is Gaussian, enabling exact inference in closed form.

Learning Objectives
  • Write the Gaussian likelihood for a single data point and for the full dataset.
  • State the Gaussian prior over weights and explain the conjugacy property.
  • Give the closed-form Gaussian posterior for the isotropic prior case.
  • Recover the MLE solution as the limit of the MAP estimate when the prior becomes flat.

1. Probabilistic Setting

We model the target $t'$ for a new input $x'$ as a Gaussian centered on a linear prediction in feature space:

$$p(t' \mid x', \mathbf{w}, \beta) = \mathcal{N}(t' \mid \boldsymbol{\phi}(x')^\top \mathbf{w},\, \beta^{-1}),$$

where $\boldsymbol{\phi}(x')$ is the feature vector (basis functions evaluated at $x'$), $\mathbf{w}$ is the weight vector, and $\beta$ is the noise precision, i.e. the inverse noise variance (a hyperparameter).

Assuming the $N$ observations are i.i.d., the joint likelihood over all targets $\mathbf{t} = (t_1,\dots,t_N)^\top$ is the product of the individual likelihoods. Because each factor is a univariate Gaussian in its own $t_n$, the product is a multivariate Gaussian in $\mathbf{t}$ with diagonal covariance:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w},\, \beta^{-1}\mathbf{I}),$$

where $\boldsymbol{\Phi}$ is the design matrix introduced in Lecture 3.2.
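
The equivalence between the product of per-point likelihoods and the single multivariate Gaussian can be checked numerically. The following is a minimal sketch, assuming a polynomial basis and synthetic data; the helper names `design_matrix`, `log_lik_pointwise`, and `log_lik_joint` are illustrative and do not come from the lecture.

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial basis (an assumption for illustration): columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def log_lik_pointwise(t, Phi, w, beta):
    """Sum of N univariate Gaussian log densities N(t_n | phi_n^T w, 1/beta)."""
    resid = t - Phi @ w
    return np.sum(0.5 * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid**2)

def log_lik_joint(t, Phi, w, beta):
    """Log density of the multivariate Gaussian N(t | Phi w, (1/beta) I)."""
    n = t.shape[0]
    resid = t - Phi @ w
    return (0.5 * n * np.log(beta)
            - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * beta * (resid @ resid))

# Synthetic data; all numerical values are arbitrary choices for the demo.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
Phi = design_matrix(x, degree=3)
w = rng.normal(size=4)
beta = 25.0
t = Phi @ w + rng.normal(scale=beta**-0.5, size=20)

ll_point = log_lik_pointwise(t, Phi, w, beta)
ll_joint = log_lik_joint(t, Phi, w, beta)
print(np.isclose(ll_point, ll_joint))
```

Both functions evaluate the same quantity; the joint form simply collects the $N$ normalizing constants and the sum of squared residuals into one expression.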

2. Gaussian Prior and Conjugacy

We place a Gaussian prior over the weights:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$

Conjugate Prior

A prior is conjugate to a likelihood if the resulting posterior belongs to the same family of distributions. For a Gaussian likelihood, the Gaussian prior is conjugate: the posterior $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$ is again Gaussian.

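This holds because both factors are exponentials of a quadratic function of $\mathbf{w}$: the likelihood through $-\tfrac{\beta}{2}\lVert\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\rVert^2$ and the prior through $-\tfrac{1}{2}(\mathbf{w} - \mathbf{m}_0)^\top\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0)$. Multiplying the factors adds the exponents, and a sum of quadratics in $\mathbf{w}$ is again a quadratic in $\mathbf{w}$, so the product is another (unnormalized) Gaussian.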

3. Closed-Form Posterior

The posterior is $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$. For the isotropic prior $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ (zero-mean, precision $\alpha$), completing the square in $\mathbf{w}$ gives:

Gaussian Posterior (Isotropic Prior)

$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi},$$

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\,\boldsymbol{\Phi}^\top\mathbf{t}.$$
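
The completing-the-square step can be sketched as follows, dropping every term that does not depend on $\mathbf{w}$ into a constant. Taking the log of likelihood times prior for the isotropic case gives

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^\top(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const} = -\frac{1}{2}\mathbf{w}^\top(\alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi})\mathbf{w} + \beta\,\mathbf{w}^\top\boldsymbol{\Phi}^\top\mathbf{t} + \text{const}.$$

A Gaussian $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ has log density

$$-\frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^\top\mathbf{S}_N^{-1}(\mathbf{w} - \mathbf{m}_N) + \text{const} = -\frac{1}{2}\mathbf{w}^\top\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^\top\mathbf{S}_N^{-1}\mathbf{m}_N + \text{const}.$$

Matching the quadratic terms gives $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$, and matching the linear terms gives $\mathbf{S}_N^{-1}\mathbf{m}_N = \beta\,\boldsymbol{\Phi}^\top\mathbf{t}$, hence $\mathbf{m}_N = \beta\,\mathbf{S}_N\,\boldsymbol{\Phi}^\top\mathbf{t}$.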

The MAP estimate — the mode of the posterior — equals the mean $\mathbf{m}_N$, because the mode of a Gaussian is its mean.
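
The posterior formulas translate directly into a few lines of NumPy. This is a minimal sketch, assuming a two-column design matrix (bias and slope) and synthetic data with made-up parameter values; the function name `posterior` is illustrative.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Return (m_N, S_N) for the prior N(w | 0, alpha^{-1} I) and noise precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    # Solve the linear system for the mean instead of multiplying by the explicit inverse.
    m_N = np.linalg.solve(S_N_inv, beta * Phi.T @ t)
    return m_N, S_N

# Synthetic data: a straight line plus Gaussian noise (all values are illustrative).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
Phi = np.column_stack([np.ones_like(x), x])
beta = 25.0
t = Phi @ np.array([-0.3, 0.5]) + rng.normal(scale=beta**-0.5, size=50)

m_N, S_N = posterior(Phi, t, alpha=2.0, beta=beta)
print("posterior mean:", m_N)
print("posterior std per weight:", np.sqrt(np.diag(S_N)))
```

The square roots of the diagonal of `S_N` give a per-weight posterior standard deviation, which is the uncertainty information a point estimate alone does not provide.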

Connection to Ridge Regression

The MAP estimate $\mathbf{m}_N$ under the isotropic Gaussian prior is identical to the ridge regression solution with $\lambda = \alpha/\beta$ — established in Lecture 3.5. The Bayesian framework additionally yields the full posterior covariance $\mathbf{S}_N$, which quantifies uncertainty in the weights.
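
The ridge correspondence can be confirmed numerically. The sketch below assumes the ridge objective is written as $\lVert\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\rVert^2 + \lambda\lVert\mathbf{w}\rVert^2$ (a common convention; the exact form used in Lecture 3.5 is not reproduced here) and uses random data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(30, 4))   # arbitrary design matrix for the demo
t = rng.normal(size=30)
alpha, beta = 2.0, 25.0

# Posterior mean: solve (alpha I + beta Phi^T Phi) m_N = beta Phi^T t.
m_N = np.linalg.solve(alpha * np.eye(4) + beta * Phi.T @ Phi, beta * Phi.T @ t)

# Ridge solution with lambda = alpha / beta: solve (lambda I + Phi^T Phi) w = Phi^T t.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(4) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))
```

Dividing the posterior-mean system by $\beta$ gives the ridge system exactly, which is why only the ratio $\alpha/\beta$ affects the point estimate, while $\alpha$ and $\beta$ separately still set the scale of $\mathbf{S}_N$.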

4. Limiting Cases

Flat Prior Recovers MLE ($\alpha \to 0$)

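As $\alpha \to 0$, the prior becomes infinitely broad and carries no information about the weights. Provided $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is invertible (i.e. $\boldsymbol{\Phi}$ has full column rank), $\mathbf{S}_N^{-1} \to \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and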

$$\mathbf{m}_N \to \beta \cdot (\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = \mathbf{w}_{\mathrm{MLE}}.$$

This is the pseudoinverse solution from Lecture 3.2. With no prior information, the posterior mean agrees with maximum likelihood.
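
A short numerical check of this limit is given below. It is a sketch on random data, assuming a full-column-rank design matrix; `posterior_mean` is an illustrative helper and `np.linalg.lstsq` stands in for the maximum likelihood solution.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(40, 3))   # random Gaussian matrix: full column rank almost surely
t = rng.normal(size=40)
beta = 25.0

def posterior_mean(alpha):
    """m_N for the isotropic prior with precision alpha."""
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    return np.linalg.solve(A, beta * Phi.T @ t)

# Least-squares (maximum likelihood) weights.
w_mle, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Distance between the posterior mean and the MLE as the prior flattens.
alphas = (1.0, 1e-3, 1e-6)
gaps = [np.linalg.norm(posterior_mean(a) - w_mle) for a in alphas]
for a, g in zip(alphas, gaps):
    print(f"alpha={a:g}  ||m_N - w_MLE||={g:.3e}")
```

The gap shrinks as $\alpha$ decreases, in line with the limit derived above.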

Narrow Prior Forces Weights to Zero ($\alpha \to \infty$)

As $\alpha \to \infty$, the $\alpha\mathbf{I}$ term dominates $\mathbf{S}_N^{-1}$, so $\mathbf{S}_N \approx \alpha^{-1}\mathbf{I} \to \mathbf{0}$ and $\mathbf{m}_N \approx (\beta/\alpha)\,\boldsymbol{\Phi}^\top\mathbf{t} \to \mathbf{0}$. The posterior collapses to a delta function at zero: the model ignores the data entirely in favor of the prior belief that the weights are zero.
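
The opposite limit can be illustrated the same way. The sketch below uses random data and an illustrative helper `posterior`; it tracks the norm of the posterior mean and the trace of the posterior covariance as the prior precision grows.

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(40, 3))   # arbitrary design matrix for the demo
t = rng.normal(size=40)
beta = 25.0

def posterior(alpha):
    """Return (m_N, S_N) for the isotropic prior with precision alpha."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

alphas = (1.0, 1e3, 1e6)
mean_norms, cov_traces = [], []
for a in alphas:
    m_N, S_N = posterior(a)
    mean_norms.append(np.linalg.norm(m_N))
    cov_traces.append(np.trace(S_N))
    print(f"alpha={a:g}  ||m_N||={mean_norms[-1]:.3e}  tr(S_N)={cov_traces[-1]:.3e}")
```

Both quantities decrease toward zero as $\alpha$ grows, reflecting a posterior that concentrates at the prior mean regardless of the data.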