Lecture 4.4
Sequential Bayesian Learning
Online Bayesian updating: the posterior after seeing $N$ data points becomes the prior for the $(N+1)$-th, yielding a principled incremental learner.
- Derive the sequential update rule from Bayes' theorem and the i.i.d. assumption.
- Explain why the current posterior can serve as the prior for the next observation.
- Show analytically that the posterior covariance shrinks to zero as $N \to \infty$.
- Show that the posterior mean converges to the MLE solution as $N \to \infty$.
1. The Sequential Update Rule
Bayes' theorem with i.i.d. data lets us factor the full posterior recursively. For two observations:
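$$p(\mathbf{w} \mid x_1, x_2) \propto p(x_2 \mid \mathbf{w})\, p(x_1 \mid \mathbf{w})\, p(\mathbf{w}) \propto p(x_2 \mid \mathbf{w}) \cdot \underbrace{p(\mathbf{w} \mid x_1)}_{\text{previous posterior}}.$$
The second step holds only up to a constant, because $p(x_1 \mid \mathbf{w})\, p(\mathbf{w})$ equals $p(\mathbf{w} \mid x_1)$ times the normalizer $p(x_1)$, which does not depend on $\mathbf{w}$. The posterior after $x_1$ therefore plays the role of the prior for the update on $x_2$. Generalizing to $N$ observations: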
Given the posterior after $n-1$ observations, the posterior after the $n$-th observation $(x_n, t_n)$ is:
$$p(\mathbf{w} \mid \mathcal{D}_n) \propto p(t_n \mid x_n, \mathbf{w})\, \cdot\, p(\mathbf{w} \mid \mathcal{D}_{n-1}),$$
where $p(\mathbf{w} \mid \mathcal{D}_{n-1})$ is the posterior from the previous step, now acting as the prior.
Since both the likelihood and the prior are Gaussian, the updated posterior is also Gaussian. The closed-form updates for $\mathbf{m}_N$ and $\mathbf{S}_N$ from Lecture 4.3 apply at each step, with the mean and covariance of the previous posterior substituted for those of the prior.
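Written out for a single new observation, the per-step update takes the following form. This is a sketch that assumes the notation of Lecture 4.3: $\beta$ is the noise precision, $\boldsymbol{\phi}_n = \boldsymbol{\phi}(x_n)$ is the feature vector of the $n$-th input (the $n$-th row of $\boldsymbol{\Phi}$), and the recursion starts from the prior, $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$.
$$\mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top, \qquad \mathbf{m}_n = \mathbf{S}_n\left(\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\,\boldsymbol{\phi}_n t_n\right).$$
Each observation adds one rank-1 term to the precision matrix, so unrolling the recursion from $n = 1$ to $N$ gives $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$.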
Processing $N$ observations one at a time yields exactly the same final posterior as the batch formula from Lecture 4.3. The sequential form is useful in online settings where data arrives incrementally and reprocessing all past data is expensive.
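The equivalence between sequential and batch processing can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the function names `batch_posterior` and `sequential_update`, the synthetic data, and the values $\alpha = 2$, $\beta = 25$ are illustrative assumptions.

```python
import numpy as np

def batch_posterior(Phi, t, alpha, beta):
    """Batch posterior N(m_N, S_N) under the prior N(0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ t
    return m, S

def sequential_update(m_prev, S_prev, phi, t_n, beta):
    """One update step: the previous posterior acts as the prior."""
    S_prev_inv = np.linalg.inv(S_prev)
    S = np.linalg.inv(S_prev_inv + beta * np.outer(phi, phi))
    m = S @ (S_prev_inv @ m_prev + beta * phi * t_n)
    return m, S

# Synthetic data (assumed for illustration): noisy samples of a line.
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1.0, 1.0, size=20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=20)
Phi = np.column_stack([np.ones_like(x), x])  # features (1, x)

# Start from the prior and absorb one observation at a time.
m_seq, S_seq = np.zeros(2), np.eye(2) / alpha
for phi_n, t_n in zip(Phi, t):
    m_seq, S_seq = sequential_update(m_seq, S_seq, phi_n, t_n, beta)

m_batch, S_batch = batch_posterior(Phi, t, alpha, beta)
print("means agree:", np.allclose(m_seq, m_batch))
print("covariances agree:", np.allclose(S_seq, S_batch))
```

The sequential loop stores only the current mean and covariance, so past observations never need to be revisited. A practical implementation would likely track the precision matrix directly rather than inverting the covariance at every step.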
2. A Visual Example
Suppose the data are generated by $t = a_0 + a_1 x + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\, 0.2^2)$ and true parameters $a_0 = -0.3$, $a_1 = 0.5$, so the noise precision is $\beta = 1/0.2^2 = 25$. We fit the model $y = w_0 + w_1 x$ and place an isotropic Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0},\, \alpha^{-1}\mathbf{I})$ with $\alpha = 2$ on the weights.
- Before any data: the prior is broad; samples of $(w_0, w_1)$ produce wildly different lines.
- After 1 observation: the posterior narrows; sampled lines pass near the observed point but still vary greatly.
- After 2 observations: the posterior sharpens further; lines begin to converge on the true relationship.
- After 20 observations: the posterior is tightly concentrated; sampled lines are nearly indistinguishable and closely match the true line $-0.3 + 0.5x$.
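The four stages above can be reproduced numerically without plotting. The sketch below assumes inputs drawn uniformly from $[-1, 1]$ and a fixed random seed (neither is specified in the lecture), and uses $\sqrt{\operatorname{tr}\,\mathbf{S}_N}$ as a rough scalar measure of posterior spread.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0              # prior precision
beta = 1.0 / 0.2**2      # noise precision (= 25)
a0, a1 = -0.3, 0.5       # true parameters

# Assumed input distribution: uniform on [-1, 1].
x = rng.uniform(-1.0, 1.0, size=20)
t = a0 + a1 * x + rng.normal(0.0, 0.2, size=20)
Phi = np.column_stack([np.ones_like(x), x])

def posterior(Phi_n, t_n):
    """Posterior mean and covariance given the first n observations."""
    S = np.linalg.inv(alpha * np.eye(2) + beta * Phi_n.T @ Phi_n)
    m = beta * S @ Phi_n.T @ t_n
    return m, S

spreads = {}
for n in (0, 1, 2, 20):
    m, S = posterior(Phi[:n], t[:n])   # n = 0 recovers the prior
    lines = rng.multivariate_normal(m, S, size=5)  # five sampled (w0, w1)
    spreads[n] = np.sqrt(np.trace(S))
    print(f"N={n:2d}  mean={np.round(m, 3)}  spread={spreads[n]:.3f}  "
          f"sampled slopes in [{lines[:, 1].min():.2f}, {lines[:, 1].max():.2f}]")
```

With no data the spread equals that of the prior, and it decreases with every added observation. Each row of `lines` defines one sampled line $w_0 + w_1 x$ that could be drawn over the data.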
3. Convergence as $N \to \infty$
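From Lecture 4.3, $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$. The term $\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi} = \beta\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^\top$ is a sum of $N$ rank-1 positive semidefinite matrices. As $N \to \infty$, all of its eigenvalues grow without bound, provided the inputs keep covering every direction in feature space (for example, they are not all identical). Therefore: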
$$\mathbf{S}_N \to \mathbf{0} \quad \text{as } N \to \infty.$$
The posterior concentrates on a single point: the learner becomes certain about the weights.
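A one-dimensional special case makes the rate explicit. Assume, purely for illustration, a model with a single constant feature $\phi(x) = 1$, so that $\boldsymbol{\Phi}^\top\boldsymbol{\Phi} = N$ and the posterior covariance is a scalar:
$$S_N = \frac{1}{\alpha + \beta N} \;\xrightarrow{\;N \to \infty\;}\; 0.$$
With $\alpha = 2$ and $\beta = 25$, this gives $S_0 = 1/2$ for the prior, $S_1 = 1/27 \approx 0.037$ after one observation, and $S_{20} = 1/502 \approx 0.002$ after twenty. The variance decays like $1/(\beta N)$ once $\beta N \gg \alpha$.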
In the same limit, $\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ dominates $\alpha\mathbf{I}$. Substituting into $\mathbf{m}_N = \beta\,\mathbf{S}_N\,\boldsymbol{\Phi}^\top\mathbf{t}$:
$$\mathbf{m}_N \to (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t} = \mathbf{w}_{\mathrm{MLE}} \quad \text{as } N \to \infty.$$
Regardless of the prior (as long as it is not degenerate), the posterior mean converges to the maximum likelihood solution with enough data. The prior matters only when data is scarce.
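Both limits can be observed numerically on growing prefixes of one dataset. The sketch below reuses the setup of the visual example, with inputs assumed uniform on $[-1, 1]$ and an arbitrary seed. It computes the maximum likelihood solution with `np.linalg.lstsq`, which for Gaussian noise coincides with least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0

# One large synthetic dataset (assumed setup); prefixes play the role of growing N.
N_max = 10_000
x = rng.uniform(-1.0, 1.0, size=N_max)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=N_max)
Phi = np.column_stack([np.ones_like(x), x])

results = []
for N in (10, 100, 1_000, 10_000):
    P, y = Phi[:N], t[:N]
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * P.T @ P)
    m_N = beta * S_N @ P.T @ y
    w_mle, *_ = np.linalg.lstsq(P, y, rcond=None)  # least-squares / MLE weights
    gap = np.linalg.norm(m_N - w_mle)
    results.append((N, np.trace(S_N), gap))
    print(f"N={N:6d}  trace(S_N)={np.trace(S_N):.2e}  ||m_N - w_MLE||={gap:.2e}")
```

The trace of $\mathbf{S}_N$ shrinks roughly in proportion to $1/N$, and the gap between $\mathbf{m}_N$ and $\mathbf{w}_{\mathrm{MLE}}$ shrinks with it, since $\mathbf{m}_N - \mathbf{w}_{\mathrm{MLE}} = -\alpha\,\mathbf{S}_N\,\mathbf{w}_{\mathrm{MLE}}$.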