Lecture 2.4
Maximum Likelihood: An Example
After this lecture you should be able to:
- Write down the Gaussian noise model for regression and define the precision parameter $\beta$.
- Derive the log likelihood for a regression dataset under Gaussian noise.
- Show that maximizing the log likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the mean squared error — i.e. that least squares has a probabilistic justification.
- Derive the MLE for $\beta$ and interpret it as the average squared residual of the fitted model.
- State the predictive distribution for a new input and explain how to obtain a point prediction from it.
Lecture 2.3 applied MLE to fit a Gaussian to data with no structure. Here we move to a proper regression setting: data comes in input-target pairs, we have a parametric model $y(\mathbf{x}; \mathbf{w})$, and we want to use MLE to find both the model weights $\mathbf{w}$ and the noise level. The key result: under Gaussian noise, maximizing the likelihood is exactly equivalent to minimizing the mean squared error.
1. The Generative Model
We assume the targets are generated by a deterministic function plus additive Gaussian noise:
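$$t = y(\mathbf{x};\, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\, \sigma^2)$$
This means that given $\mathbf{x}$ and $\mathbf{w}$, the target $t$ is a Gaussian random variable centered on the model prediction:
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\bigl(t\,;\, y(\mathbf{x};\mathbf{w}),\, \beta^{-1}\bigr)$$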
where $\beta = 1/\sigma^2$ is the precision — the inverse of the noise variance. High precision means low noise and tightly concentrated targets; low precision means high noise and broadly spread targets.
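To make the role of the precision concrete, the following is a minimal sketch that samples targets from this generative model. It assumes a straight-line model $y(x; \mathbf{w}) = w_0 + w_1 x$ purely for illustration; the function name `sample_targets`, the weights, and the two $\beta$ values are hypothetical choices, not part of the lecture.

```python
import numpy as np

def sample_targets(x, w, beta, rng):
    """Draw t = y(x; w) + eps with eps ~ N(0, 1/beta).

    Assumes the illustrative linear model y(x; w) = w[0] + w[1] * x.
    """
    y = w[0] + w[1] * x            # deterministic part y(x; w)
    sigma = 1.0 / np.sqrt(beta)    # noise standard deviation from the precision
    return y + rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 2000)
w_true = np.array([0.5, 2.0])      # hypothetical "true" weights

t_low_noise = sample_targets(x, w_true, beta=100.0, rng=rng)   # sigma = 0.1
t_high_noise = sample_targets(x, w_true, beta=1.0, rng=rng)    # sigma = 1.0

# Empirical spread of the targets around the model prediction.
y_true = w_true[0] + w_true[1] * x
resid_std_low = np.std(t_low_noise - y_true)
resid_std_high = np.std(t_high_noise - y_true)
print(resid_std_low, resid_std_high)
```

The two printed spreads should be close to $1/\sqrt{\beta}$ in each case: high precision gives targets that hug the model curve, low precision gives targets that scatter widely around it.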
2. The Log Likelihood for Regression
Given $N$ i.i.d. input-target pairs $\mathcal{D} = \{(\mathbf{x}_i, t_i)\}$, the log likelihood under this model is:
$$\ln p(\mathcal{D} \mid \mathbf{w}, \beta) = \sum_{i=1}^{N} \ln \mathcal{N}(t_i\,;\, y(\mathbf{x}_i;\mathbf{w}),\, \beta^{-1})$$
3. MLE for $\mathbf{w}$: Least Squares
Standard recipe. To find the MLE for any parameter: (1) formulate the log likelihood as a function of that parameter; (2) set its derivative to zero; (3) solve for the parameter. We apply this recipe twice below — once for $\mathbf{w}$, once for $\beta$.
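Substituting the Gaussian density $\mathcal{N}(t;\, \mu,\, \beta^{-1}) = \sqrt{\beta / 2\pi}\,\exp\bigl(-\tfrac{\beta}{2}(t - \mu)^2\bigr)$ into the sum, the log likelihood expands to:
$$\ln p(\mathcal{D} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2$$
Only the last term depends on $\mathbf{w}$. Maximizing the log likelihood with respect to $\mathbf{w}$ is therefore equivalent to minimizing:
$$\frac{\beta}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2$$
Since $\beta > 0$ is a positive constant with respect to $\mathbf{w}$, we can drop it:
$$\mathbf{w}^*_{ML} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2$$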
Maximizing the likelihood under Gaussian noise is exactly minimizing the sum of squared residuals. Least squares is not an arbitrary choice: it is the maximum likelihood solution whenever the noise is additive, zero-mean Gaussian with the same variance for every data point.
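The equivalence can be checked numerically. The sketch below assumes a model that is linear in its parameters, $y(x; \mathbf{w}) = w_0 + w_1 x$, so that the least squares problem can be solved with `np.linalg.lstsq`; the helper `log_likelihood`, the synthetic data, and the fixed value of $\beta$ are illustrative assumptions rather than anything prescribed by the lecture.

```python
import numpy as np

def log_likelihood(w, beta, Phi, t):
    """Gaussian log likelihood ln p(D | w, beta) for the model y = Phi @ w."""
    n = t.shape[0]
    sq_err = np.sum((Phi @ w - t) ** 2)
    return (0.5 * n * np.log(beta)
            - 0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * beta * sq_err)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x])    # design matrix for w0 + w1 * x
t = Phi @ np.array([0.5, 2.0]) + rng.normal(0.0, 0.3, size=x.shape)

# Least squares solution: minimizes the sum of squared residuals.
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# For any fixed beta > 0, moving away from w_ls should lower the log likelihood.
beta = 4.0
ll_at_ls = log_likelihood(w_ls, beta, Phi, t)
ll_perturbed = [log_likelihood(w_ls + d, beta, Phi, t)
                for d in rng.normal(0.0, 0.1, size=(50, 2))]
print(ll_at_ls, max(ll_perturbed))
```

Every perturbed weight vector should score a lower log likelihood than the least squares solution, and the same holds for any other positive value of `beta`, which is why the precision can be ignored when solving for the weights.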
4. MLE for $\beta$: Noise Level
With $\mathbf{w}^*_{ML}$ in hand, differentiate the log likelihood with respect to $\beta$ and set to zero:
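$$\frac{\partial}{\partial \beta}\ln p(\mathcal{D} \mid \mathbf{w}^*_{ML}, \beta) = \frac{N}{2\beta} - \frac{1}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}^*_{ML}) - t_i\bigr)^2 = 0$$
Solving for $\beta$:
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}^*_{ML}) - t_i\bigr)^2$$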
The MLE noise variance is the mean squared residual of the fitted model — the average squared distance between predictions and targets.
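As a rough illustration, the noise estimate can be computed directly from the residuals of the fitted model. The sketch assumes the same illustrative linear model $y(x; \mathbf{w}) = w_0 + w_1 x$ and synthetic data generated with a known precision; all variable names and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.uniform(-1.0, 1.0, size=n)
Phi = np.column_stack([np.ones_like(x), x])    # design matrix for w0 + w1 * x
beta_true = 25.0                               # noise std = 1 / sqrt(25) = 0.2
t = Phi @ np.array([0.5, 2.0]) + rng.normal(0.0, beta_true ** -0.5, size=n)

# Step 1: weights from least squares.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Step 2: 1 / beta_ML is the mean squared residual of the fitted model.
residuals = Phi @ w_ml - t
sigma2_ml = np.mean(residuals ** 2)
beta_ml = 1.0 / sigma2_ml

# Sanity check: the derivative of the log likelihood w.r.t. beta vanishes at beta_ml.
grad_beta = n / (2.0 * beta_ml) - 0.5 * np.sum(residuals ** 2)
print(beta_ml, grad_beta)
```

With this much data the estimate should land near the precision used to generate the targets, and the derivative should be zero up to floating-point error. The two steps are sequential: the weights do not depend on $\beta$, but the estimate of $\beta$ depends on the fitted weights.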
5. The Predictive Distribution
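Once the parameters are estimated, we have a full probabilistic model for new predictions. For a new input $\mathbf{x}_0$:
$$p(t_0 \mid \mathbf{x}_0, \mathbf{w}^*_{ML}, \beta_{ML}) = \mathcal{N}\bigl(t_0\,;\, y(\mathbf{x}_0;\mathbf{w}^*_{ML}),\, \beta_{ML}^{-1}\bigr)$$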
This assigns a distribution over possible target values, not just a single number. The width reflects the estimated noise level.
If a single point prediction is needed, take the expected value of this distribution. Since it is Gaussian, the expected value equals the mean:
$$\mathbb{E}[t_0] = y(\mathbf{x}_0;\, \mathbf{w}^*_{ML})$$
The point prediction is just the model evaluated at $\mathbf{x}_0$, which comes from solving the least squares problem. The probabilistic framework adds uncertainty quantification on top for free.
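A minimal sketch of how the predictive distribution might be used in practice follows. It again assumes the illustrative linear model $y(x; \mathbf{w}) = w_0 + w_1 x$; the function `predictive` and the parameter values are hypothetical stand-ins for estimates obtained as described above. The interval uses the standard fact that about 95% of a Gaussian's mass lies within 1.96 standard deviations of its mean, and it reflects only the estimated noise level, not any uncertainty in the fitted parameters themselves.

```python
import numpy as np

def predictive(x0, w_ml, beta_ml):
    """Mean and standard deviation of N(t0; y(x0; w_ml), 1 / beta_ml).

    Assumes the illustrative linear model y(x; w) = w[0] + w[1] * x.
    """
    mean = w_ml[0] + w_ml[1] * x0      # point prediction E[t0]
    std = 1.0 / np.sqrt(beta_ml)       # same width for every input x0
    return mean, std

w_ml = np.array([0.5, 2.0])   # hypothetical fitted weights
beta_ml = 25.0                # hypothetical fitted precision

mean, std = predictive(1.5, w_ml, beta_ml)
interval = (mean - 1.96 * std, mean + 1.96 * std)   # approx. 95% of the mass
print(mean, std, interval)
```

Because the noise variance is a single number, the width of the predictive distribution is the same for every input; only the mean moves with $\mathbf{x}_0$.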
Key takeaway. The Gaussian noise assumption provides a principled justification for least squares. It also extends the framework: instead of returning a single prediction, the model returns a distribution — capturing not just what it predicts, but how confident it is.