Lecture 2.5

Maximum A Posteriori Estimation

Learning Objectives

After this lecture you should be able to:

Explain the shift from MLE to MAP: instead of maximizing $p(\mathcal{D}|\mathbf{w})$, we maximize the posterior $p(\mathbf{w}|\mathcal{D})$.
Use Bayes' theorem to express the posterior as likelihood $\times$ prior, and explain why the evidence can be dropped when optimizing over $\mathbf{w}$.
State the Gaussian prior on weights, explain what it encodes (weights should be small), and identify the role of the precision parameter $\alpha$.
Derive the MAP objective — log likelihood plus log prior — and show it corresponds to regularized least squares with an L2 penalty.
Compare MLE and MAP: same predictive distribution form, but MAP weights are found by penalized least squares rather than plain least squares.

MLE finds the parameters that make the data most probable. But it ignores any prior knowledge we might have about what reasonable parameters look like. Maximum A Posteriori (MAP) estimation incorporates such prior beliefs: instead of maximizing $p(\mathcal{D}|\mathbf{w})$, we maximize the posterior probability of the parameters given the data, $p(\mathbf{w}|\mathcal{D})$.

1. From MLE to MAP

The key conceptual shift is a change of viewpoint — from a distribution over data to a distribution over parameters:

MLE vs. MAP

MLE: $\mathbf{w}^*_{ML} = \arg\max_\mathbf{w}\; p(\mathcal{D} \mid \mathbf{w})$ — parameters that make the observed data most probable.

MAP: $\mathbf{w}^*_{MAP} = \arg\max_\mathbf{w}\; p(\mathbf{w} \mid \mathcal{D})$ — parameters that are most probable given the observed data.

We recover the posterior using Bayes' theorem:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\; p(\mathbf{w})}{p(\mathcal{D})}$$

The evidence $p(\mathcal{D})$ does not depend on $\mathbf{w}$, so for optimization we only need the numerator:

$$p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\; p(\mathbf{w})$$

In words: posterior $\propto$ likelihood $\times$ prior. The prior $p(\mathbf{w})$ encodes what we believe about the weights before seeing any data.
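The proportionality can be checked numerically. The following is a minimal sketch on a toy one-dimensional problem; the model $t = wx + \text{noise}$, the data values, the precisions, and the grid are all illustrative assumptions, not part of the lecture. It evaluates likelihood $\times$ prior on a grid of candidate weights and confirms that dividing by the evidence does not move the maximum.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions): t = w * x + Gaussian noise.
x = np.array([0.5, 1.0, 1.5, 2.0])
t = np.array([0.6, 0.9, 1.7, 1.9])
beta = 4.0    # assumed noise precision
alpha = 1.0   # assumed prior precision

# Grid of candidate scalar weights.
w_grid = np.linspace(-3.0, 3.0, 6001)

# Likelihood p(D | w) at every grid point (i.i.d. Gaussian targets).
residuals = t[None, :] - w_grid[:, None] * x[None, :]
likelihood = np.prod(
    np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * residuals**2), axis=1
)

# Prior p(w) = N(w; 0, 1/alpha).
prior = np.sqrt(alpha / (2 * np.pi)) * np.exp(-0.5 * alpha * w_grid**2)

# Numerator of Bayes' theorem, and the evidence as a Riemann sum over the grid.
unnormalized = likelihood * prior
dw = w_grid[1] - w_grid[0]
evidence = np.sum(unnormalized) * dw
posterior = unnormalized / evidence

w_ml = w_grid[np.argmax(likelihood)]
w_map_unnormalized = w_grid[np.argmax(unnormalized)]
w_map_posterior = w_grid[np.argmax(posterior)]

print("MLE:", w_ml)
print("MAP (numerator only):", w_map_unnormalized)
print("MAP (normalized posterior):", w_map_posterior)
```

Both MAP estimates coincide because the evidence is a positive constant, and in this toy setup the MAP weight sits slightly closer to zero than the MLE weight because the prior is centered at zero.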

2. Log MAP and the Prior

As with MLE, we work with the logarithm: because $\ln$ is monotonically increasing, it does not change the location of the optimum. Taking the log of the posterior:

$$\ln p(\mathbf{w} \mid \mathcal{D}) = \ln p(\mathcal{D} \mid \mathbf{w}) + \ln p(\mathbf{w}) - \underbrace{\ln p(\mathcal{D})}_{\text{const w.r.t. }\mathbf{w}}$$

Maximizing the log posterior is therefore equivalent to maximizing:

$$\ln p(\mathcal{D} \mid \mathbf{w}) + \ln p(\mathbf{w})$$

Gaussian Prior on the Weights

We model the prior as a zero-mean isotropic Gaussian — our belief that weights should be small and uncorrelated:

Gaussian Weight Prior

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w}\,;\, \mathbf{0},\, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{M/2} \exp\!\left(-\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}\right)$$

Here $M$ is the number of weights and $\alpha$ is a precision (inverse variance) parameter: large $\alpha$ concentrates the prior tightly around zero (weights must stay small); small $\alpha$ allows larger weights. Each weight $w_j$ is assumed independent and drawn from $\mathcal{N}(0, \alpha^{-1})$.
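To see what the precision does in practice, here is a small sketch that draws weight vectors from the prior for two values of $\alpha$. The helper name `sample_prior`, the dimension, and the chosen precisions are hypothetical and only for illustration.

```python
import numpy as np

def sample_prior(alpha, M, n_samples, rng):
    """Draw n_samples weight vectors from N(0, alpha^{-1} I)."""
    # Each weight is independent with standard deviation alpha^{-1/2}.
    return rng.normal(loc=0.0, scale=alpha ** -0.5, size=(n_samples, M))

rng = np.random.default_rng(0)
M = 5  # assumed number of weights

samples_tight = sample_prior(alpha=100.0, M=M, n_samples=10_000, rng=rng)
samples_loose = sample_prior(alpha=1.0, M=M, n_samples=10_000, rng=rng)

# Empirical spread of individual weights; should be close to alpha^{-1/2}.
std_tight = samples_tight.std()
std_loose = samples_loose.std()

# Average squared norm of a sampled weight vector; should be close to M / alpha.
mean_sq_norm_tight = np.mean(np.sum(samples_tight**2, axis=1))

print("std of weights, alpha=100:", std_tight)
print("std of weights, alpha=1:  ", std_loose)
print("mean squared norm, alpha=100:", mean_sq_norm_tight)
```

A hundredfold increase in $\alpha$ shrinks the typical weight magnitude by a factor of ten, since the standard deviation scales as $\alpha^{-1/2}$.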

Taking the log of the prior (the prefactor becomes an additive constant):

$$\ln p(\mathbf{w} \mid \alpha) = \frac{M}{2}\ln\frac{\alpha}{2\pi} - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}$$

Only the quadratic term $-\tfrac{\alpha}{2}\mathbf{w}^\top\mathbf{w}$ depends on $\mathbf{w}$.
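The two properties used here, factorization over independent weights and a constant term that drops out of comparisons, can be verified with a short sketch. The function names and numerical values below are illustrative assumptions.

```python
import numpy as np

def log_gaussian_prior(w, alpha):
    """ln N(w; 0, alpha^{-1} I) for a weight vector w with M entries."""
    w = np.asarray(w, dtype=float)
    M = w.size
    return 0.5 * M * np.log(alpha / (2 * np.pi)) - 0.5 * alpha * (w @ w)

def log_univariate_gaussian(wj, alpha):
    """ln N(w_j; 0, alpha^{-1}) for a single weight."""
    return 0.5 * np.log(alpha / (2 * np.pi)) - 0.5 * alpha * wj**2

alpha = 2.0  # illustrative precision
w_small = np.array([0.1, -0.2, 0.05])
w_large = np.array([1.0, -2.0, 0.5])

# The isotropic Gaussian factorizes into independent per-weight terms.
joint = log_gaussian_prior(w_small, alpha)
factorized = sum(log_univariate_gaussian(wj, alpha) for wj in w_small)

# The constant cancels in differences, leaving only the quadratic term.
diff = log_gaussian_prior(w_small, alpha) - log_gaussian_prior(w_large, alpha)
quadratic_only = -0.5 * alpha * (w_small @ w_small - w_large @ w_large)

print("joint vs. factorized:", joint, factorized)
print("difference vs. quadratic term only:", diff, quadratic_only)
```

Because the difference is positive, the prior prefers `w_small` over `w_large`, and the preference is determined entirely by $-\tfrac{\alpha}{2}\mathbf{w}^\top\mathbf{w}$.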

3. The MAP Objective: Regularized Least Squares

Standard recipe. Formulate the (negative) log posterior as the objective; set its derivative to zero; solve for $\mathbf{w}$.

Substituting the Gaussian regression likelihood from Lecture 2.4 and the Gaussian prior, negating to obtain a minimization problem, and dropping terms that do not depend on $\mathbf{w}$:

MAP Objective

$$\mathbf{w}^*_{MAP} = \arg\min_\mathbf{w}\; \underbrace{\frac{\beta}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2}_{\text{negative log likelihood}} + \underbrace{\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}}_{\text{negative log prior}}$$

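This is regularized least squares (ridge regression). The prior adds an L2 penalty that discourages large weights. Dividing the objective by $\beta$ leaves the minimizer unchanged and shows that only the ratio $\lambda = \alpha/\beta$ matters: it sets the strength of regularization relative to the data fit.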

The prior thus acts as a regularizer — a term that prevents overfitting by penalizing model complexity. This is the probabilistic justification for L2 regularization: it corresponds exactly to placing a zero-mean Gaussian prior on the weights.
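As a worked instance of the recipe, assume a model that is linear in the weights, $y(\mathbf{x};\mathbf{w}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x})$, with design matrix $\boldsymbol{\Phi}$ whose rows are $\boldsymbol{\phi}(\mathbf{x}_i)^\top$ (this model form and the symbol $\boldsymbol{\Phi}$ are assumptions made for this example). Setting the gradient of the objective to zero gives

$$\beta\,\boldsymbol{\Phi}^\top(\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}) + \alpha\,\mathbf{w} = \mathbf{0} \quad\Longrightarrow\quad \mathbf{w}^*_{MAP} = \left(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \tfrac{\alpha}{\beta}\mathbf{I}\right)^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$$

The sketch below implements this closed form on synthetic data and checks that the gradient vanishes at the solution. The function names, data, and parameter values are hypothetical.

```python
import numpy as np

def map_weights(Phi, t, alpha, beta):
    """Minimize beta/2 * ||Phi w - t||^2 + alpha/2 * w^T w in closed form."""
    lam = alpha / beta  # effective regularization strength
    M = Phi.shape[1]
    # Solve (Phi^T Phi + lam I) w = Phi^T t instead of forming an explicit inverse.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

def neg_log_posterior_grad(w, Phi, t, alpha, beta):
    """Gradient of the MAP objective with respect to w."""
    return beta * Phi.T @ (Phi @ w - t) + alpha * w

# Synthetic data (illustrative assumptions throughout).
rng = np.random.default_rng(0)
N, M = 20, 4
Phi = rng.normal(size=(N, M))
w_true = np.array([0.5, -1.0, 0.0, 2.0])
beta, alpha = 25.0, 1.0
t = Phi @ w_true + rng.normal(scale=beta ** -0.5, size=N)

w_map = map_weights(Phi, t, alpha, beta)
grad_at_map = neg_log_posterior_grad(w_map, Phi, t, alpha, beta)

# Scaling alpha and beta together leaves the ratio, and hence the solution, unchanged.
w_map_rescaled = map_weights(Phi, t, 2 * alpha, 2 * beta)

print("w_MAP:", w_map)
print("max |gradient| at w_MAP:", np.max(np.abs(grad_at_map)))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice; for $\alpha > 0$ the matrix being solved is positive definite, so the system always has a unique solution.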

4. Predictive Distribution

The predictive distribution takes the same form as in the MLE case, but with MAP weights:

$$p(t \mid \mathbf{x}_0, \mathbf{w}^*_{MAP}, \beta) = \mathcal{N}\!\left(t\,;\; y(\mathbf{x}_0;\mathbf{w}^*_{MAP}),\; \beta^{-1}\right)$$

A point prediction is again the mean of this distribution: $\mathbb{E}[t \mid \mathbf{x}_0] = y(\mathbf{x}_0;\mathbf{w}^*_{MAP})$.
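A minimal sketch of evaluating this predictive distribution follows. It assumes a polynomial basis and already-fitted weights; the helper names `poly_features` and `predictive`, the weight values, and the precision are hypothetical.

```python
import numpy as np

def poly_features(x, degree):
    """Hypothetical basis: [1, x, x^2, ..., x^degree] for each input."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vander(x, degree + 1, increasing=True)

def predictive(x0, w, beta, degree):
    """Mean and standard deviation of N(t; y(x0; w), 1/beta)."""
    mean = poly_features(x0, degree) @ w
    std = np.full_like(mean, beta ** -0.5)
    return mean, std

w_map = np.array([0.2, 1.5, -0.3])  # assumed already-fitted MAP weights
beta = 4.0                           # assumed noise precision

mean, std = predictive([0.0, 1.0, 2.0], w_map, beta, degree=2)

# Approximate 95% interval under the Gaussian: mean +/- 1.96 * std.
lower, upper = mean - 1.96 * std, mean + 1.96 * std

print("predictive mean:", mean)
print("predictive std: ", std)
```

Note that the standard deviation is $\beta^{-1/2}$ at every input: this plug-in predictive distribution accounts for observation noise only, not for uncertainty in the weights.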

5. MLE vs. MAP: Summary

Comparison

	MLE	MAP
Objective	Maximize $p(\mathcal{D} \mid \mathbf{w})$	Maximize $p(\mathbf{w} \mid \mathcal{D})$
Prior	None (uniform)	$\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$
Regression solution	Least squares	Regularized least squares (ridge)
Overfitting	Can overfit with small $N$	Prior shrinks weights, reduces overfitting
Predictive distribution	$\mathcal{N}(y(\mathbf{x};\mathbf{w}^*_{ML}), \beta^{-1})$	$\mathcal{N}(y(\mathbf{x};\mathbf{w}^*_{MAP}), \beta^{-1})$
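The comparison can be illustrated with a small experiment: a flexible polynomial fitted to a handful of noisy points, once by plain least squares and once by ridge regression. This is a sketch under stated assumptions; the sine target, the polynomial degree, and the values of $\alpha$ and $\beta$ are chosen only for illustration.

```python
import numpy as np

# Few data points and a flexible model (illustrative assumptions).
rng = np.random.default_rng(1)
N, degree = 10, 7
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, degree + 1, increasing=True)

beta = 25.0   # assumed noise precision (noise std 0.2)
alpha = 0.1   # assumed prior precision
lam = alpha / beta

# MLE: plain least squares.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# MAP: regularized least squares (ridge).
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)

norm_ml = np.linalg.norm(w_ml)
norm_map = np.linalg.norm(w_map)
sse_ml = np.sum((Phi @ w_ml - t) ** 2)
sse_map = np.sum((Phi @ w_map - t) ** 2)

print("weight norm  MLE / MAP:", norm_ml, norm_map)
print("training SSE MLE / MAP:", sse_ml, sse_map)
```

The MAP weights have a smaller norm, while the MLE weights achieve the lower training error. That trade of training fit for smaller weights is what the "Overfitting" row of the table describes.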