Lecture 2.3

Maximum Likelihood Estimation

Learning Objectives

After this lecture you should be able to:

State the maximum likelihood principle and explain intuitively what it means to maximize the likelihood of the data.
Explain why optimizing the log likelihood is equivalent to optimizing the likelihood (monotonicity of the logarithm, and the chain rule for stationary points), and why it is numerically preferable (avoids underflow; converts products to sums).
Derive the log likelihood for a dataset of $N$ i.i.d. Gaussian observations.
Derive the MLE estimates $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$ by differentiating the log likelihood and setting the derivatives to zero.
Show that $\hat{\mu}_{ML}$ is an unbiased estimator and that $\hat{\sigma}^2_{ML}$ underestimates $\sigma^2$ by a factor $\tfrac{N-1}{N}$.
State the bias-corrected variance estimator and explain intuitively why fitting a Gaussian to a small dataset underestimates the true spread on average.

We now have the probabilistic tools in place. This lecture puts them to work: given a dataset, how do we find the parameters of a distribution that best explain it? The answer is the maximum likelihood principle, one of the most widely used estimation techniques in machine learning.

1. The Maximum Likelihood Principle

Suppose we have a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$ of $N$ observations, and we want to fit a parametric model $p(\mathcal{D} \mid \mathbf{w})$ to it. This function, the probability (or probability density) of observing the dataset given parameters $\mathbf{w}$, is called the likelihood. Note that for fixed $\mathbf{w}$ it is a distribution over the data, but for fixed data it is a function of the parameters and need not integrate to one over $\mathbf{w}$.

Maximum Likelihood Principle

Choose the parameters that make the observed data as probable as possible:

$$\mathbf{w}^*_{ML} = \arg\max_{\mathbf{w}}\; p(\mathcal{D} \mid \mathbf{w})$$

The IID Assumption

We assume the data points are independent and identically distributed (i.i.d.): each $x_i$ is drawn independently from the same distribution $p(x \mid \mathbf{w})$. Under this assumption the joint likelihood factorizes into a product of individual likelihoods:

$$p(\mathcal{D} \mid \mathbf{w}) = \prod_{i=1}^{N} p(x_i \mid \mathbf{w})$$

2. The Log Likelihood

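Directly maximizing the product above is problematic: each factor $p(x_i \mid \mathbf{w})$ is typically less than 1 (always so for discrete probabilities; a continuous density can exceed 1, but in practice most factors are small), so the product of many such terms becomes vanishingly small, causing numerical underflow. Instead we maximize the log likelihood: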

$$\ln p(\mathcal{D} \mid \mathbf{w}) = \sum_{i=1}^{N} \ln p(x_i \mid \mathbf{w})$$

The product becomes a sum — far more convenient to work with analytically and numerically.
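The underflow problem can be seen in a few lines. The sketch below is only an illustration: the dataset and the helper `gaussian_pdf` are made up for this example, and a standard Gaussian is assumed as the model.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x (illustrative helper)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Hypothetical dataset: 2000 values cycling through -1.5, -1.0, ..., 1.5.
data = [(i % 7 - 3) * 0.5 for i in range(2000)]

# Direct product of the individual likelihoods under N(0, 1).
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Sum of the individual log likelihoods under the same model.
log_likelihood = sum(math.log(gaussian_pdf(x, 0.0, 1.0)) for x in data)

print(likelihood)       # the product has underflowed in double precision
print(log_likelihood)   # the sum is a finite negative number
```

Every factor here is below 0.4, so the product falls below the smallest representable double long before the loop ends, while the log likelihood stays comfortably in range.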

Why Log Does Not Change the Solution

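The logarithm is a monotonically increasing function, so $p(\mathcal{D} \mid \mathbf{w}_1) > p(\mathcal{D} \mid \mathbf{w}_2)$ exactly when $\ln p(\mathcal{D} \mid \mathbf{w}_1) > \ln p(\mathcal{D} \mid \mathbf{w}_2)$. Both objectives therefore rank parameter values identically and share the same maximizer. The stationarity condition gives the same conclusion. At an optimum $\mathbf{w}^*$ the derivative of the objective is zero. Applying the chain rule to the log likelihood: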

$$\frac{d}{d\mathbf{w}} \ln p(\mathcal{D} \mid \mathbf{w}) = \frac{1}{p(\mathcal{D} \mid \mathbf{w})} \cdot \frac{d}{d\mathbf{w}} p(\mathcal{D} \mid \mathbf{w}) = 0$$

Wherever the likelihood is positive (where it is zero the logarithm is undefined, and such parameter values cannot be maximizers of a likelihood that is positive elsewhere), the factor $\tfrac{1}{p(\mathcal{D} \mid \mathbf{w})}$ is finite and nonzero, so it never shifts the location of the optimum. The zeros of the derivative are determined entirely by $\tfrac{d}{d\mathbf{w}} p(\mathcal{D} \mid \mathbf{w})$, which is the same condition as for maximizing the original likelihood.
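As a quick numerical sanity check of this equivalence, one can maximize both objectives over a grid of candidate parameters. This is a sketch under assumptions: a Gaussian model with known variance 1, a small made-up dataset, and a brute-force grid search instead of calculus.

```python
import math

def likelihood(mu, data, sigma2=1.0):
    """Product of Gaussian densities N(x_i; mu, sigma2) over the dataset."""
    p = 1.0
    for x in data:
        p *= math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return p

data = [1.2, 0.7, 2.1, 1.6, 0.9]            # small hypothetical dataset
grid = [i / 100 for i in range(0, 301)]     # candidate values of mu in [0, 3]

# Maximize the likelihood and the log likelihood separately.
best_by_likelihood = max(grid, key=lambda m: likelihood(m, data))
best_by_log = max(grid, key=lambda m: math.log(likelihood(m, data)))

print(best_by_likelihood, best_by_log)      # the two maximizers coincide
```

With only five data points the raw likelihood does not underflow, which is why it can be compared directly with its logarithm here.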

Convention: in machine learning we typically minimize error functions rather than maximize objectives. Maximizing the log likelihood is equivalent to minimizing the negative log likelihood (NLL).

3. MLE for a Gaussian: Deriving the Log Likelihood

Assume the data are drawn i.i.d. from $\mathcal{N}(x\,;\,\mu, \sigma^2)$. The log likelihood is:

$$\begin{aligned} \ln p(\mathcal{D} \mid \mu, \sigma^2) &= \sum_{i=1}^{N} \ln \mathcal{N}(x_i\,;\,\mu,\sigma^2) \\ &= \sum_{i=1}^{N} \left[ -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right] \end{aligned}$$

Gaussian Log Likelihood

$$\ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$
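The closed form can be checked against the term-by-term sum it was derived from. The following is a minimal sketch; the function names, the dataset, and the parameter values are chosen only for illustration.

```python
import math

def gaussian_log_likelihood(data, mu, sigma2):
    """Closed form: -(N/2) ln(2 pi sigma2) - (1 / (2 sigma2)) * sum (x_i - mu)^2."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

def log_density_sum(data, mu, sigma2):
    """Term-by-term sum of ln N(x_i; mu, sigma2), used only for comparison."""
    total = 0.0
    for x in data:
        density = math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
        total += math.log(density)
    return total

data = [1.2, 0.7, 2.1, 1.6, 0.9]            # small hypothetical dataset
closed = gaussian_log_likelihood(data, mu=1.0, sigma2=0.5)
summed = log_density_sum(data, mu=1.0, sigma2=0.5)

print(closed, summed)                       # equal up to floating-point rounding
```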

4. MLE for the Mean

Standard recipe. To find the MLE for any parameter: (1) formulate the log likelihood as a function of that parameter; (2) set its derivative to zero; (3) solve for the parameter. We apply this recipe twice below — once for $\mu$, once for $\sigma^2$.

Differentiate with respect to $\mu$ and set to zero. The first term does not depend on $\mu$:

$$\frac{\partial}{\partial \mu}\ln p(\mathcal{D} \mid \mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0$$

Solving:

$$\sum_{i=1}^{N} x_i = N\mu$$

MLE Estimate: Mean

$$\hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The maximum likelihood estimate of the mean is the sample mean.
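The stationarity condition for $\mu$ can be verified numerically. In this sketch the helper `score_mu` (a name chosen here, not taken from the lecture) evaluates the derivative $\tfrac{1}{\sigma^2}\sum_i (x_i - \mu)$ on a made-up dataset, with $\sigma^2 = 1$ assumed for convenience.

```python
def score_mu(mu, data, sigma2=1.0):
    """Derivative of the Gaussian log likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma2

data = [1.2, 0.7, 2.1, 1.6, 0.9]        # small hypothetical dataset
mu_hat = sum(data) / len(data)          # sample mean, the claimed MLE

at_mle = score_mu(mu_hat, data)         # approximately zero at the MLE
below = score_mu(mu_hat - 0.1, data)    # positive: log likelihood still rising
above = score_mu(mu_hat + 0.1, data)    # negative: log likelihood falling

print(at_mle, below, above)
```

The sign change from positive to negative confirms that the stationary point is a maximum rather than a minimum.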

5. MLE for the Variance

Differentiate with respect to $\sigma^2$ and set to zero:

$$\frac{\partial}{\partial \sigma^2}\ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0$$

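Multiplying through by $2\sigma^4$ gives $-N\sigma^2 + \sum_{i=1}^{N}(x_i - \mu)^2 = 0$. Solving for $\sigma^2$, and substituting $\hat{\mu}_{ML}$ for the unknown $\mu$ because both stationarity conditions must hold at the joint maximum:

MLE Estimate: Variance

$$\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu}_{ML})^2$$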

The maximum likelihood estimate of the variance is the uncorrected sample variance: the average squared deviation from the sample mean, with denominator $N$.
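A small check of this result, sketched with the Python standard library: `statistics.pvariance` uses the same $1/N$ denominator, and perturbing $\sigma^2$ away from the estimate should lower the log likelihood. The dataset and the perturbation factors 0.8 and 1.2 are arbitrary choices for illustration.

```python
import math
import statistics

def gaussian_log_likelihood(data, mu, sigma2):
    """Gaussian log likelihood in closed form."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

data = [1.2, 0.7, 2.1, 1.6, 0.9]                            # small hypothetical dataset
n = len(data)
mu_hat = sum(data) / n                                      # MLE of the mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n       # MLE of the variance

# Log likelihood at the estimate and at two perturbed variances.
at_mle = gaussian_log_likelihood(data, mu_hat, sigma2_hat)
smaller = gaussian_log_likelihood(data, mu_hat, 0.8 * sigma2_hat)
larger = gaussian_log_likelihood(data, mu_hat, 1.2 * sigma2_hat)

print(sigma2_hat, statistics.pvariance(data))   # both use the 1/N denominator
print(at_mle, smaller, larger)                  # at_mle is the largest of the three
```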

6. Bias Analysis

Are these estimators reliable? We check by computing their expected values over all possible datasets drawn from the true distribution $\mathcal{N}(\mu, \sigma^2)$.

Mean: Unbiased

$$\mathbb{E}[\hat{\mu}_{ML}] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[x_i] = \frac{1}{N}\cdot N\mu = \mu \quad \checkmark$$

The MLE mean is an unbiased estimator — on average it recovers the true mean.

Variance: Biased

The calculation is more involved. It relies on two facts: $\mathbb{E}[x_i^2] = \sigma^2 + \mu^2$ (from the definition of variance) and $\mathbb{E}[x_i x_j] = \mu^2$ for $i \neq j$ (from independence). Expanding the square in $\hat{\sigma}^2_{ML}$ and taking expectations gives:

$$\mathbb{E}[\hat{\sigma}^2_{ML}] = \frac{N-1}{N}\,\sigma^2$$

The MLE variance systematically underestimates the true variance by a factor $\tfrac{N-1}{N}$. It is a biased estimator.
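For completeness, here is one way to fill in the omitted steps, using only the two facts about $\mathbb{E}[x_i^2]$ and $\mathbb{E}[x_i x_j]$ stated above. First rewrite the sum of squared deviations, then take the expectation of each piece:

$$\begin{aligned} \sum_{i=1}^{N}(x_i - \hat{\mu}_{ML})^2 &= \sum_{i=1}^{N} x_i^2 - N\hat{\mu}_{ML}^2 \\ \mathbb{E}\left[\sum_{i=1}^{N} x_i^2\right] &= N(\sigma^2 + \mu^2) \\ \mathbb{E}[\hat{\mu}_{ML}^2] &= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathbb{E}[x_i x_j] = \frac{1}{N^2}\left[N(\sigma^2 + \mu^2) + N(N-1)\mu^2\right] = \mu^2 + \frac{\sigma^2}{N} \\ \mathbb{E}[\hat{\sigma}^2_{ML}] &= \frac{1}{N}\left[N(\sigma^2 + \mu^2) - N\left(\mu^2 + \frac{\sigma^2}{N}\right)\right] = \frac{N-1}{N}\,\sigma^2 \end{aligned}$$

The missing $\sigma^2/N$ is exactly the variance of the sample mean: the estimator loses the part of the spread that is absorbed by fitting $\hat{\mu}_{ML}$ to the same data.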

Bias-Corrected Variance Estimator

$$\tilde{\sigma}^2 = \frac{N}{N-1}\,\hat{\sigma}^2_{ML} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \hat{\mu}_{ML})^2$$

This is the familiar $N-1$ denominator seen in statistics, known as Bessel's correction. It satisfies $\mathbb{E}[\tilde{\sigma}^2] = \sigma^2$.
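The bias and its correction can also be observed empirically. The simulation below is a sketch under assumptions: the true parameters ($\mu = 1$, $\sigma = 2$), the dataset size $N = 5$, the number of trials, and the random seed are all arbitrary choices made for this example.

```python
import random

def ml_variance(sample):
    """Maximum likelihood variance: mean squared deviation from the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

random.seed(0)                      # fixed seed so the run is repeatable
true_mu, true_sigma = 1.0, 2.0      # hypothetical true parameters (sigma^2 = 4)
n, trials = 5, 20000

# Average the ML variance over many independent datasets of size n.
total_ml = 0.0
for _ in range(trials):
    sample = [random.gauss(true_mu, true_sigma) for _ in range(n)]
    total_ml += ml_variance(sample)

avg_ml = total_ml / trials                  # expected to be near (N-1)/N * sigma^2 = 3.2
avg_corrected = avg_ml * n / (n - 1)        # expected to be near sigma^2 = 4.0

print(avg_ml, avg_corrected)
```

Because the averages are Monte Carlo estimates, they match the theoretical values only approximately; increasing `trials` tightens the agreement.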

Intuition: Why Does MLE Underestimate Variance?

When fitting a Gaussian to a small dataset, the sample mean $\hat{\mu}_{ML}$ is computed from the same data as the variance. The sample mean is the least-squares center of the data: it minimizes $\sum_i (x_i - m)^2$ over all centers $m$, so the total squared deviation around it is never larger than around the true mean (individual points may still lie closer to the true mean). Measuring spread around this center therefore yields a value no larger than measuring spread around the true (unknown) mean, and strictly smaller whenever $\hat{\mu}_{ML} \neq \mu$. The bias is largest for small datasets: as $N$ grows, $\tfrac{N-1}{N} \to 1$ and the bias vanishes.
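This intuition corresponds to the identity $\sum_i (x_i - \mu)^2 = \sum_i (x_i - \hat{\mu}_{ML})^2 + N(\hat{\mu}_{ML} - \mu)^2$, which holds for any dataset. A minimal numerical check, assuming a standard Gaussian as the true distribution and an arbitrary seed and dataset size:

```python
import random

random.seed(1)                      # fixed seed so the run is repeatable
true_mu, true_sigma = 0.0, 1.0      # hypothetical true parameters
data = [random.gauss(true_mu, true_sigma) for _ in range(4)]

n = len(data)
mu_hat = sum(data) / n

ss_sample = sum((x - mu_hat) ** 2 for x in data)    # spread around the sample mean
ss_true = sum((x - true_mu) ** 2 for x in data)     # spread around the true mean
gap = n * (mu_hat - true_mu) ** 2                   # nonnegative difference between them

print(ss_sample, ss_true, ss_sample + gap)          # the last two values agree
```

Since `gap` is a square, the spread around the sample mean can never exceed the spread around the true mean, whatever data are drawn.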