Lecture 4.4

Sequential Bayesian Learning

Online Bayesian updating: the posterior after seeing $N$ data points becomes the prior for the $(N+1)$-th, yielding a principled incremental learner.

Learning Objectives

Derive the sequential update rule from Bayes' theorem and the i.i.d. assumption.
Explain why the current posterior can serve as the prior for the next observation.
Show analytically that the posterior covariance shrinks to zero as $N \to \infty$.
Show that the posterior mean converges to the MLE solution as $N \to \infty$.

1. The Sequential Update Rule

Bayes' theorem with i.i.d. data lets us factor the full posterior recursively. For two observations:

$$p(\mathbf{w} \mid x_1, x_2) \propto p(x_2 \mid \mathbf{w})\, p(x_1 \mid \mathbf{w})\, p(\mathbf{w}) \propto p(x_2 \mid \mathbf{w}) \cdot \underbrace{p(\mathbf{w} \mid x_1)}_{\text{previous posterior}}.$$

The posterior after $x_1$ plays the role of a prior for the update on $x_2$. Generalizing to $N$ observations:

Sequential Bayesian Update

Given the posterior after $n-1$ observations, the posterior after the $n$-th observation $(x_n, t_n)$ is:

$$p(\mathbf{w} \mid \mathcal{D}_n) \propto p(t_n \mid x_n, \mathbf{w})\, \cdot\, p(\mathbf{w} \mid \mathcal{D}_{n-1}),$$

where $p(\mathbf{w} \mid \mathcal{D}_{n-1})$ is the posterior from the previous step, now acting as the prior.

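Since the likelihood is Gaussian in $\mathbf{w}$ and the prior is Gaussian, the updated posterior is also Gaussian, and so is every posterior that follows. The closed-form updates for $\mathbf{m}_N$ and $\mathbf{S}_N$ from Lecture 4.3 apply at each step, with the previous posterior taking the place of the prior.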
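As a sketch of what one such step looks like, assume the Lecture 4.3 formulas are available for a general Gaussian prior, write $\beta$ for the noise precision, and write $\boldsymbol{\phi}_n$ for the feature vector of $x_n$ (a symbol introduced here for illustration: the $n$-th row of $\boldsymbol{\Phi}$ as a column). With prior $\mathcal{N}(\mathbf{m}_{n-1}, \mathbf{S}_{n-1})$ and a single observation $(x_n, t_n)$, the update would read:

$$\mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top, \qquad \mathbf{m}_n = \mathbf{S}_n\left(\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\,\boldsymbol{\phi}_n t_n\right).$$

Each observation adds one rank-1 term to the precision matrix. Unrolling the recursion from $\mathbf{S}_0^{-1} = \alpha\mathbf{I}$, $\mathbf{m}_0 = \mathbf{0}$ gives a precision of $\alpha\mathbf{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ after $N$ steps.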

Equivalence to Batch Inference

Processing $N$ observations one at a time yields exactly the same final posterior as the batch formula from Lecture 4.3. The sequential form is useful in online settings where data arrives incrementally and reprocessing all past data is expensive.
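The equivalence can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes NumPy, features $(1, x)$, inputs drawn uniformly from $[-1, 1]$, and the zero-mean isotropic prior $\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$. The function names `sequential_update` and `batch_posterior` are hypothetical, chosen for illustration.

```python
import numpy as np

def sequential_update(m_prev, S_prev, phi, t, beta):
    """One Bayesian update for a single observation (phi, t).

    m_prev, S_prev: mean and covariance of the current Gaussian posterior.
    phi: feature vector of the new input; t: its target; beta: noise precision.
    """
    S_prev_inv = np.linalg.inv(S_prev)
    # Precision gains one rank-1 term per observation.
    S_new = np.linalg.inv(S_prev_inv + beta * np.outer(phi, phi))
    m_new = S_new @ (S_prev_inv @ m_prev + beta * phi * t)
    return m_new, S_new

def batch_posterior(Phi, t, alpha, beta):
    """Batch posterior for the prior N(0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # illustrative values
x = rng.uniform(-1.0, 1.0, size=50)          # assumed input distribution
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=50)
Phi = np.column_stack([np.ones_like(x), x])  # design matrix with rows (1, x_n)

# Start from the prior and absorb one observation at a time.
m_seq, S_seq = np.zeros(2), np.eye(2) / alpha
for phi_n, t_n in zip(Phi, t):
    m_seq, S_seq = sequential_update(m_seq, S_seq, phi_n, t_n, beta)

m_batch, S_batch = batch_posterior(Phi, t, alpha, beta)
print("means agree:      ", np.allclose(m_seq, m_batch))
print("covariances agree:", np.allclose(S_seq, S_batch))
```

The two results should agree up to floating-point round-off. The sequential loop inverts a small matrix at every step, which is acceptable for two weights; for many weights a rank-1 update of the inverse would be a more economical design.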

2. A Visual Example

Consider synthetic data generated by $t = a_0 + a_1 x + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\, 0.2^2)$ and true parameters $a_0 = -0.3$, $a_1 = 0.5$, so the noise precision is $\beta = 1/0.2^2 = 25$. The model we fit is $y = w_0 + w_1 x$, and we place an isotropic Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0},\, \alpha^{-1}\mathbf{I})$ on the weights with $\alpha = 2$.

Posterior Evolution

Before any data: the prior is broad; samples of $(w_0, w_1)$ produce wildly different lines.
After 1 observation: the posterior narrows; sampled lines pass near the observed point but still vary greatly.
After 2 observations: the posterior sharpens further; lines begin to converge on the true relationship.
After 20 observations: the posterior is tightly concentrated; sampled lines are nearly indistinguishable and closely match the true line $-0.3 + 0.5x$.
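The stages above can be reproduced with a short simulation. This is a sketch under stated assumptions: NumPy is available, the inputs are drawn uniformly from $[-1, 1]$ (the example does not specify the input distribution), and the noise precision is $\beta = 1/0.2^2 = 25$. The helper `posterior` and the dictionaries `spreads` and `samples` are hypothetical names used only here.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0              # prior precision, from the example
beta = 1.0 / 0.2**2      # noise precision implied by sigma = 0.2
a0, a1 = -0.3, 0.5       # true parameters, from the example

# Assumption: inputs uniform on [-1, 1]; the example does not say.
x = rng.uniform(-1.0, 1.0, size=20)
t = a0 + a1 * x + rng.normal(0.0, 0.2, size=20)
Phi = np.column_stack([np.ones_like(x), x])

def posterior(Phi_n, t_n):
    """Posterior mean and covariance given the observations in Phi_n, t_n."""
    S = np.linalg.inv(alpha * np.eye(2) + beta * Phi_n.T @ Phi_n)
    return beta * S @ Phi_n.T @ t_n, S

spreads, samples = {}, {}
for n in (0, 1, 2, 20):
    m, S = posterior(Phi[:n], t[:n])   # n = 0 reduces to the prior
    # Five sampled (w0, w1) pairs; each pair defines one line w0 + w1 * x.
    samples[n] = rng.multivariate_normal(m, S, size=5)
    spreads[n] = float(np.trace(S))    # total posterior variance
    print(f"N={n:2d}  mean={np.round(m, 3)}  trace(S)={spreads[n]:.4f}")
```

The trace of the covariance is used here as a single-number summary of how broad the posterior is. Plotting the lines in `samples[n]` for each stage gives the kind of picture described in the list.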

3. Convergence as $N \to \infty$

Posterior Covariance Shrinks to Zero

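From Lecture 4.3, $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$. The term $\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is a sum of $N$ rank-1 positive semidefinite matrices, one per observation. As $N \to \infty$ it grows without bound in every direction, provided the inputs are varied enough that $\tfrac{1}{N}\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ stays positive definite. Therefore: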

$$\mathbf{S}_N \to \mathbf{0} \quad \text{as } N \to \infty.$$

The posterior concentrates on a single point: in the limit, no uncertainty about the weights remains.
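A scalar special case makes the rate concrete. Assume, purely for illustration, a single constant feature $\phi(x) = 1$, so that $\boldsymbol{\Phi}$ is a column of $N$ ones and $\boldsymbol{\Phi}^\top\boldsymbol{\Phi} = N$. The covariance is then a single number:

$$S_N = \frac{1}{\alpha + \beta N} \approx \frac{1}{\beta N} \quad \text{for large } N.$$

With $\alpha = 2$ from the example in Section 2 and the noise precision $\beta = 1/0.2^2 = 25$ implied by its noise level, this gives $S_1 = 1/27 \approx 0.037$, $S_{20} = 1/502 \approx 0.0020$, and $S_{100} = 1/2502 \approx 0.0004$. The variance decays like $1/N$, so the posterior standard deviation decays like $1/\sqrt{N}$.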

Posterior Mean Converges to MLE

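In the same limit, $\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ dominates $\alpha\mathbf{I}$. Substituting $\mathbf{S}_N$ into $\mathbf{m}_N = \beta\,\mathbf{S}_N\,\boldsymbol{\Phi}^\top\mathbf{t}$ gives $\mathbf{m}_N = \left(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \tfrac{\alpha}{\beta}\mathbf{I}\right)^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$, and the $\tfrac{\alpha}{\beta}\mathbf{I}$ term becomes negligible next to $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$:

$$\mathbf{m}_N - \mathbf{w}_{\mathrm{MLE}} \to \mathbf{0} \quad \text{as } N \to \infty, \qquad \mathbf{w}_{\mathrm{MLE}} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}.$$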

Regardless of the prior (as long as it is not degenerate), the posterior mean converges to the maximum likelihood solution with enough data. The prior matters only when data is scarce.
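A quick numerical check of this convergence follows. It is a minimal sketch assuming NumPy, features $(1, x)$, inputs uniform on $[-1, 1]$, and the values $\alpha = 2$, $\beta = 25$. It uses the identity $\beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \tfrac{\alpha}{\beta}\mathbf{I})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$ so that both estimates come from one linear solve each.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0   # illustrative prior and noise precisions

# Assumption: inputs uniform on [-1, 1]; one long dataset, reused by prefix.
x = rng.uniform(-1.0, 1.0, size=10_000)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=10_000)
Phi = np.column_stack([np.ones_like(x), x])

sizes = [10, 100, 1_000, 10_000]
gaps = []
for N in sizes:
    A = Phi[:N].T @ Phi[:N]
    b = Phi[:N].T @ t[:N]
    # Posterior mean: (Phi^T Phi + (alpha/beta) I)^{-1} Phi^T t
    m_N = np.linalg.solve(A + (alpha / beta) * np.eye(2), b)
    # Maximum likelihood (least squares): (Phi^T Phi)^{-1} Phi^T t
    w_mle = np.linalg.solve(A, b)
    gaps.append(float(np.linalg.norm(m_N - w_mle)))
    print(f"N={N:6d}  ||m_N - w_MLE|| = {gaps[-1]:.2e}")
```

The gap is governed by the ratio $\alpha/\beta$ relative to the size of $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$, so a stronger prior (larger $\alpha$) delays the convergence but does not prevent it.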