Lecture 2.4

Maximum Likelihood: An Example

Learning Objectives

After this lecture you should be able to:

  • Write down the Gaussian noise model for regression and define the precision parameter $\beta$.
  • Derive the log likelihood for a regression dataset under Gaussian noise.
  • Show that maximizing the log likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the mean squared error — i.e. that least squares has a probabilistic justification.
  • Derive the MLE for $\beta$ and interpret its inverse, the estimated noise variance, as the mean squared residual of the fitted model.
  • State the predictive distribution for a new input and explain how to obtain a point prediction from it.

Lecture 2.3 applied MLE to fit a Gaussian to data with no input-target structure. Here we move to a proper regression setting: data come in input-target pairs, we have a parametric model $y(\mathbf{x}; \mathbf{w})$, and we want to use MLE to find both the model weights $\mathbf{w}$ and the noise level. The key result: under Gaussian noise, maximizing the likelihood with respect to $\mathbf{w}$ is exactly equivalent to minimizing the sum of squared errors, and hence the mean squared error, since the two differ only by the constant factor $1/N$.

1. The Generative Model

We assume the targets are generated by a deterministic function plus additive Gaussian noise:

$$t = y(\mathbf{x};\, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\, \sigma^2)$$

This means that given $\mathbf{x}$ and $\mathbf{w}$, the target $t$ is a Gaussian random variable centered on the model prediction:

Conditional Target Distribution $$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left(t\,;\; y(\mathbf{x};\mathbf{w}),\; \beta^{-1}\right)$$

where $\beta = 1/\sigma^2$ is the precision — the inverse of the noise variance. High precision means low noise and tightly concentrated targets; low precision means high noise and broadly spread targets.
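To make the role of $\beta$ concrete, here is a minimal sketch that samples targets from the generative model. It assumes a straight-line model $y(x; \mathbf{w}) = w_0 + w_1 x$ with scalar inputs; the function names `model` and `sample_targets`, the weights, and the precision values are illustrative choices, not part of the lecture. Both datasets use the same seed, so they share the same underlying noise draws and differ only in scale.

```python
import math
import random

def model(x, w):
    """Hypothetical model y(x; w): a straight line w[0] + w[1] * x."""
    return w[0] + w[1] * x

def sample_targets(xs, w, beta, seed=0):
    """Draw t = y(x; w) + eps with eps ~ N(0, 1/beta) for each input x."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(beta)  # standard deviation implied by the precision
    return [model(x, w) + rng.gauss(0.0, sigma) for x in xs]

xs = [i / 10 for i in range(50)]
w_true = (1.0, 2.0)  # illustrative "true" weights

low_noise = sample_targets(xs, w_true, beta=100.0)  # sigma = 0.1: targets hug the line
high_noise = sample_targets(xs, w_true, beta=1.0)   # sigma = 1.0: targets spread widely
```

Plotting `low_noise` and `high_noise` against `xs` should show the same line with a tight and a wide scatter, respectively.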

2. The Log Likelihood for Regression

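Given $N$ i.i.d. input-target pairs $\mathcal{D} = \{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$, the likelihood factorizes over data points, so the log likelihood under this model is a sum:

$$\ln p(\mathcal{D} \mid \mathbf{w}, \beta) = \sum_{i=1}^{N} \ln \mathcal{N}(t_i\,;\, y(\mathbf{x}_i;\mathbf{w}),\, \beta^{-1})$$

Substituting the Gaussian density $\mathcal{N}(t\,;\, \mu,\, \beta^{-1}) = \sqrt{\beta / (2\pi)}\, \exp\!\bigl(-\tfrac{\beta}{2}(t - \mu)^2\bigr)$ and taking the logarithm of each term gives:

Regression Log Likelihood $$\ln p(\mathcal{D} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2$$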
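As a quick numerical check, the sketch below evaluates the log likelihood both as a sum of per-point Gaussian log densities and via the closed form, and the two should agree up to rounding. The predictions, targets, and value of $\beta$ are made-up numbers for illustration, and the function names are hypothetical.

```python
import math

def gaussian_log_pdf(t, mean, beta):
    """ln N(t; mean, 1/beta), written in terms of the precision beta."""
    return 0.5 * math.log(beta) - 0.5 * math.log(2 * math.pi) - 0.5 * beta * (t - mean) ** 2

def log_likelihood_sum(preds, targets, beta):
    """Log likelihood as a sum of per-point log densities."""
    return sum(gaussian_log_pdf(t, y, beta) for y, t in zip(preds, targets))

def log_likelihood_closed(preds, targets, beta):
    """Closed form: N/2 ln(beta) - N/2 ln(2 pi) - beta/2 * (sum of squared residuals)."""
    n = len(targets)
    sse = sum((y - t) ** 2 for y, t in zip(preds, targets))
    return 0.5 * n * math.log(beta) - 0.5 * n * math.log(2 * math.pi) - 0.5 * beta * sse

preds = [1.0, 2.1, 2.9, 4.2]    # made-up model outputs y(x_i; w)
targets = [1.1, 1.9, 3.0, 4.0]  # made-up targets t_i
beta = 4.0                      # made-up precision

ll_sum = log_likelihood_sum(preds, targets, beta)
ll_closed = log_likelihood_closed(preds, targets, beta)
```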

3. MLE for $\mathbf{w}$: Least Squares

Standard recipe. To find the MLE for any parameter: (1) formulate the log likelihood as a function of that parameter; (2) set its derivative to zero; (3) solve for the parameter. We apply this recipe twice below — once for $\mathbf{w}$, once for $\beta$.

Only the last term depends on $\mathbf{w}$. Maximizing the log likelihood with respect to $\mathbf{w}$ is therefore equivalent to minimizing:

$$\frac{\beta}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2$$

Since $\beta > 0$ does not depend on $\mathbf{w}$, multiplying the sum by $\beta/2$ rescales the objective without moving its minimizer, so we can drop the factor:

MLE for $\mathbf{w}$ = Least Squares $$\mathbf{w}^*_{ML} = \arg\min_{\mathbf{w}}\; \sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}) - t_i\bigr)^2$$

Maximizing the likelihood under Gaussian noise is exactly minimizing the sum of squared residuals. Least squares is not an arbitrary choice — it is the probabilistically correct thing to do when measurement noise is Gaussian.
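The equivalence can be checked numerically. The sketch below assumes a straight-line model $y(x; \mathbf{w}) = w_0 + w_1 x$, for which the least squares solution has a well-known closed form, and compares the log likelihood at the fitted weights with nearby weights. The data, the model choice, and the function names are illustrative assumptions; for a general $y(\mathbf{x}; \mathbf{w})$ the minimization would need a numerical optimizer.

```python
import math

def fit_line(xs, ts):
    """Least-squares fit of y(x; w) = w0 + w1 * x using the closed-form solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    t_bar = sum(ts) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxt = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    w0 = t_bar - w1 * x_bar
    return w0, w1

def log_likelihood(w, xs, ts, beta):
    """Regression log likelihood for the line model under Gaussian noise."""
    n = len(xs)
    sse = sum((w[0] + w[1] * x - t) ** 2 for x, t in zip(xs, ts))
    return 0.5 * n * math.log(beta) - 0.5 * n * math.log(2 * math.pi) - 0.5 * beta * sse

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up inputs
ts = [1.1, 2.9, 5.2, 6.8, 9.1]  # made-up targets
w_ml = fit_line(xs, ts)

# Whatever positive beta we pick, moving away from w_ml lowers the log likelihood.
for beta in (0.5, 5.0, 50.0):
    nudged = (w_ml[0] + 0.1, w_ml[1])
    print(beta, log_likelihood(w_ml, xs, ts, beta) > log_likelihood(nudged, xs, ts, beta))
```

Note that `fit_line` never sees $\beta$: the maximizing weights are the same for every positive precision.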

4. MLE for $\beta$: Noise Level

With $\mathbf{w}^*_{ML}$ in hand (it does not depend on $\beta$, so it can be found first), differentiate the log likelihood with respect to $\beta$, evaluate it at $\mathbf{w} = \mathbf{w}^*_{ML}$, and set the result to zero:

$$\frac{\partial}{\partial \beta}\ln p(\mathcal{D} \mid \mathbf{w}^*_{ML}, \beta) = \frac{N}{2\beta} - \frac{1}{2}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}^*_{ML}) - t_i\bigr)^2 = 0$$

Solving for $\beta$:

MLE for Precision and Noise Variance $$\frac{1}{\beta^*_{ML}} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y(\mathbf{x}_i;\mathbf{w}^*_{ML}) - t_i\bigr)^2$$

The MLE noise variance is the mean squared residual of the fitted model — the average squared distance between predictions and targets.
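A short sketch of this step, assuming the fitted predictions are already available as a list. The numbers below are made-up fitted values and targets, and the function names are hypothetical. The `score_beta` helper evaluates the derivative from the stationarity condition, so it should be numerically zero at the estimate.

```python
def sum_squared_residuals(preds, targets):
    """Sum over i of (y(x_i; w) - t_i)^2."""
    return sum((y - t) ** 2 for y, t in zip(preds, targets))

def beta_ml(preds, targets):
    """MLE of the precision: N divided by the sum of squared residuals.

    Raises ZeroDivisionError if the fit is perfect (all residuals zero).
    """
    return len(targets) / sum_squared_residuals(preds, targets)

def score_beta(beta, preds, targets):
    """Derivative of the log likelihood with respect to beta."""
    return len(targets) / (2.0 * beta) - 0.5 * sum_squared_residuals(preds, targets)

preds = [1.04, 3.03, 5.02, 7.01, 9.00]  # made-up fitted values y(x_i; w_ML)
targets = [1.1, 2.9, 5.2, 6.8, 9.1]     # made-up targets t_i

beta_hat = beta_ml(preds, targets)
noise_variance_hat = 1.0 / beta_hat     # mean squared residual
```

The derivative is positive for precisions below `beta_hat` and negative above it, which confirms that the stationary point is a maximum.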

5. The Predictive Distribution

Once the parameters are estimated, we have a full probabilistic model for new predictions. For a new input $\mathbf{x}_0$:

Predictive Distribution $$p(t_0 \mid \mathbf{x}_0, \mathbf{w}^*_{ML}, \beta^*_{ML}) = \mathcal{N}\!\left(t_0\,;\; y(\mathbf{x}_0;\mathbf{w}^*_{ML}),\; (\beta^*_{ML})^{-1}\right)$$

This assigns a distribution over possible target values, not just a single number. The width reflects the estimated noise level.

If a single point prediction is needed, take the expected value of this distribution. For a Gaussian, the expected value is its location parameter, which here is the model output:

$$\mathbb{E}[t_0 \mid \mathbf{x}_0, \mathbf{w}^*_{ML}, \beta^*_{ML}] = y(\mathbf{x}_0;\, \mathbf{w}^*_{ML})$$

The point prediction is just the model evaluated at $\mathbf{x}_0$ — which comes from solving the least squares problem. The probabilistic framework adds uncertainty quantification on top for free.
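The sketch below builds the plug-in predictive distribution for a new input, using `statistics.NormalDist` from the Python standard library, and reads off a point prediction and a central 95% interval. The line model, the fitted values `w_ml` and `beta_ml`, and the new input are illustrative assumptions, not results from the lecture.

```python
import math
from statistics import NormalDist

def predictive(x0, w, beta):
    """Plug-in predictive distribution N(y(x0; w), 1/beta) for a line model."""
    mean = w[0] + w[1] * x0
    return NormalDist(mu=mean, sigma=1.0 / math.sqrt(beta))

w_ml = (1.04, 1.99)     # illustrative fitted weights
beta_ml = 1.0 / 0.0214  # illustrative fitted precision (inverse noise variance)

dist = predictive(2.5, w_ml, beta_ml)
point = dist.mean               # point prediction: the model evaluated at x0
lower = dist.inv_cdf(0.025)     # lower end of the central 95% interval
upper = dist.inv_cdf(0.975)     # upper end of the central 95% interval
```

Because the variance is $1/\beta^*_{ML}$ for every input, the interval has the same width wherever `x0` lies; only its center moves with the model output.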

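Key takeaway. The Gaussian noise assumption provides a principled justification for least squares. It also extends the framework: instead of returning a single prediction, the model returns a distribution, capturing not just what it predicts but how much scatter to expect around that prediction. In this plug-in form the spread is the estimated noise level $1/\beta^*_{ML}$, which is the same for every input.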