Lecture 4.3
Gaussian Posteriors
Bayesian linear regression: the posterior over weights is Gaussian, enabling exact inference in closed form.
- Write the Gaussian likelihood for a single data point and for the full dataset.
- State the Gaussian prior over weights and explain the conjugacy property.
- Give the closed-form Gaussian posterior for the isotropic prior case.
- Recover the MLE solution as the limit of the MAP estimate when the prior becomes flat.
1. Probabilistic Setting
We model the target $t'$ for a new input $x'$ as a Gaussian centered on a linear prediction in feature space:
$$p(t' \mid x', \mathbf{w}, \beta) = \mathcal{N}(t' \mid \boldsymbol{\phi}(x')^\top \mathbf{w},\, \beta^{-1}),$$
where $\boldsymbol{\phi}(x')$ is the feature vector (basis functions evaluated at $x'$), $\mathbf{w}$ are the model weights, and $\beta$ is the noise precision (a hyperparameter).
Assuming the $N$ observations are i.i.d., the joint likelihood over all targets $\mathbf{t} = (t_1,\dots,t_N)^\top$ is the product of individual likelihoods. This product of Gaussians is itself a multivariate Gaussian:
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w},\, \beta^{-1}\mathbf{I}),$$
where $\boldsymbol{\Phi}$ is the design matrix introduced in Lecture 3.2.
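The equivalence between the product of per-point likelihoods and the single multivariate Gaussian can be checked numerically. The following is a minimal sketch, assuming NumPy, a quadratic polynomial basis, and synthetic data; the function names, the basis, and all numerical values are illustrative choices, not part of the lecture.

```python
import numpy as np

def log_likelihood_pointwise(Phi, t, w, beta):
    """Sum of N univariate Gaussian log-densities N(t_n | phi_n^T w, 1/beta)."""
    resid = t - Phi @ w
    return np.sum(0.5 * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid**2)

def log_likelihood_joint(Phi, t, w, beta):
    """Log-density of the multivariate Gaussian N(t | Phi w, (1/beta) I)."""
    N = t.shape[0]
    resid = t - Phi @ w
    return 0.5 * N * np.log(beta / (2 * np.pi)) - 0.5 * beta * (resid @ resid)

# Synthetic data (assumed for illustration): quadratic basis, known weights.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.column_stack([np.ones_like(x), x, x**2])  # design matrix, N x 3
w = np.array([0.5, -1.0, 2.0])
beta = 25.0                                        # noise precision
t = Phi @ w + rng.normal(scale=beta**-0.5, size=x.shape)

ll_pointwise = log_likelihood_pointwise(Phi, t, w, beta)
ll_joint = log_likelihood_joint(Phi, t, w, beta)
print(ll_pointwise, ll_joint)  # the two values agree up to rounding
```

The two functions differ only in whether the normalizing constant and squared residuals are accumulated per point or once for the whole vector, which is exactly what the diagonal covariance $\beta^{-1}\mathbf{I}$ expresses.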
2. Gaussian Prior and Conjugacy
We place a Gaussian prior over the weights:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$
A prior is conjugate to a likelihood if the resulting posterior belongs to the same family of distributions as the prior. For a Gaussian likelihood, the Gaussian prior is conjugate: the posterior $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$ is again Gaussian.
This holds because each factor is the exponential of a quadratic form in $\mathbf{w}$. Multiplying the factors adds their exponents, and a sum of quadratic forms in $\mathbf{w}$ is again a quadratic form in $\mathbf{w}$, so the product is another (unnormalized) Gaussian.
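As a worked sketch of this argument for the general prior $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$, write the log posterior and collect terms in $\mathbf{w}$. Here $\mathbf{m}_N$ and $\mathbf{S}_N$ denote the posterior mean and covariance, and "const" collects every term that does not depend on $\mathbf{w}$:
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^\top(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) - \frac{1}{2}(\mathbf{w} - \mathbf{m}_0)^\top\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0) + \text{const}$$
$$= -\frac{1}{2}\mathbf{w}^\top\left(\mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\right)\mathbf{w} + \mathbf{w}^\top\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^\top\mathbf{t}\right) + \text{const}.$$
Matching this against the expansion of a Gaussian exponent, $-\tfrac{1}{2}(\mathbf{w} - \mathbf{m}_N)^\top\mathbf{S}_N^{-1}(\mathbf{w} - \mathbf{m}_N) = -\tfrac{1}{2}\mathbf{w}^\top\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^\top\mathbf{S}_N^{-1}\mathbf{m}_N + \text{const}$, identifies
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}, \qquad \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^\top\mathbf{t}\right).$$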
3. Closed-Form Posterior
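The posterior is $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$. For the isotropic prior $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ (zero-mean, precision $\alpha$), completing the square in $\mathbf{w}$ gives:
$$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}.$$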
The MAP estimate (the mode of the posterior) equals the mean $\mathbf{m}_N$, because the mode of a Gaussian is its mean.
Under the isotropic Gaussian prior, $\mathbf{m}_N$ is identical to the ridge regression solution with $\lambda = \alpha/\beta$, as established in Lecture 3.5. The Bayesian framework additionally yields the full posterior covariance $\mathbf{S}_N$, which quantifies uncertainty in the weights.
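The correspondence with ridge regression can be verified on a small example. This is a minimal sketch, assuming NumPy and the standard isotropic-prior expressions $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}$; the function names, the quadratic basis, and the values of $\alpha$ and $\beta$ are illustrative assumptions.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior mean and covariance for the prior N(w | 0, (1/alpha) I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi  # posterior precision
    S_N = np.linalg.inv(S_N_inv)                      # posterior covariance
    m_N = beta * S_N @ Phi.T @ t                      # posterior mean (= MAP)
    return m_N, S_N

def ridge(Phi, t, lam):
    """Ridge solution: solve (lam I + Phi^T Phi) w = Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
Phi = np.column_stack([np.ones_like(x), x, x**2])
alpha, beta = 2.0, 25.0
t = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=beta**-0.5, size=x.shape)

m_N, S_N = posterior(Phi, t, alpha, beta)
w_ridge = ridge(Phi, t, lam=alpha / beta)

print(m_N)
print(w_ridge)                  # matches m_N up to rounding
print(np.sqrt(np.diag(S_N)))    # posterior standard deviation of each weight
```

The ridge solver returns only a point estimate, whereas the diagonal of `S_N` gives a per-weight uncertainty that has no counterpart in the regularized least-squares view. For larger problems one would typically avoid the explicit inverse and work with a Cholesky factor of the precision instead.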
4. Limiting Cases
As $\alpha \to 0$, the prior becomes infinitely broad — no prior information. Then $\mathbf{S}_N^{-1} \approx \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and
$$\mathbf{m}_N \to \beta \cdot (\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = \mathbf{w}_{\mathrm{MLE}},$$
assuming $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is invertible. This is the pseudoinverse solution from Lecture 3.2. With no prior information, the posterior mean agrees with maximum likelihood.
As $\alpha \to \infty$, the $\alpha\mathbf{I}$ term dominates $\mathbf{S}_N^{-1}$, giving $\mathbf{S}_N \to \mathbf{0}$ and $\mathbf{m}_N \to \mathbf{0}$. The posterior collapses to a delta function at zero — the model ignores the data entirely in favor of the prior belief that weights are zero.
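Both limits can be observed numerically by sweeping $\alpha$. The sketch below assumes NumPy, synthetic data with a full-rank design matrix, and the isotropic-prior formulas for $\mathbf{m}_N$ and $\mathbf{S}_N$; the specific values $10^{-8}$ and $10^{12}$ are arbitrary stand-ins for "very small" and "very large".

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior mean and covariance for the prior N(w | 0, (1/alpha) I)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Synthetic data (assumed for illustration); Phi has full column rank.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
Phi = np.column_stack([np.ones_like(x), x, x**2])
beta = 25.0
t = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=beta**-0.5, size=x.shape)

# Maximum-likelihood (least-squares) weights for comparison.
w_mle = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Nearly flat prior: posterior mean approaches the MLE.
m_flat, S_flat = posterior(Phi, t, alpha=1e-8, beta=beta)

# Extremely tight prior: posterior mean and covariance shrink toward zero.
m_tight, S_tight = posterior(Phi, t, alpha=1e12, beta=beta)

print(np.max(np.abs(m_flat - w_mle)))   # small: agrees with MLE
print(np.linalg.norm(m_tight))          # small: weights pulled to zero
print(np.trace(S_tight))                # small: posterior nearly a point mass
```

Intermediate values of $\alpha$ interpolate between these two extremes, which is the shrinkage behavior of ridge regression seen from the Bayesian side.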