Lecture 2.3
Maximum Likelihood Estimation
After this lecture you should be able to:
- State the maximum likelihood principle and explain intuitively what it means to maximize the likelihood of the data.
- Explain why optimizing the log likelihood is equivalent to optimizing the likelihood (monotonicity argument via the chain rule), and why it is numerically preferable (avoids underflow; converts products to sums).
- Derive the log likelihood for a dataset of $N$ i.i.d. Gaussian observations.
- Derive the MLE estimates $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$ by differentiating the log likelihood and setting the derivatives to zero.
- Show that $\hat{\mu}_{ML}$ is an unbiased estimator and that $\hat{\sigma}^2_{ML}$ underestimates $\sigma^2$ by a factor $\tfrac{N-1}{N}$.
- State the bias-corrected variance estimator and explain intuitively why fitting a Gaussian to a small dataset tends to underestimate the true spread.
We now have the probabilistic tools in place. This lecture puts them to work: given a dataset, how do we find the parameters of a distribution that best explain it? The answer is the maximum likelihood principle, one of the most widely used estimation techniques in machine learning.
1. The Maximum Likelihood Principle
Suppose we have a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$ of $N$ observations, and we want to fit a parametric model $p(\mathcal{D} \mid \mathbf{w})$ to it. This function — the probability of observing the dataset given parameters $\mathbf{w}$ — is called the likelihood. Note that it is a distribution over the data, but a function of the parameters.
Choose the parameters that make the observed data as probable as possible:
$$\mathbf{w}^*_{ML} = \arg\max_{\mathbf{w}}\; p(\mathcal{D} \mid \mathbf{w})$$
The IID Assumption
We assume the data points are independent and identically distributed (i.i.d.): each $x_i$ is drawn independently from the same distribution $p(x \mid \mathbf{w})$. Under this assumption the joint likelihood factorizes into a product of individual likelihoods:
$$p(\mathcal{D} \mid \mathbf{w}) = \prod_{i=1}^{N} p(x_i \mid \mathbf{w})$$
2. The Log Likelihood
Directly maximizing the product above is problematic. The individual factors $p(x_i \mid \mathbf{w})$ are typically small (for discrete data each is at most 1; a density can exceed 1 but is usually small in practice), so the product of many such terms becomes vanishingly small and underflows in floating-point arithmetic. Instead we maximize the log likelihood:
$$\ln p(\mathcal{D} \mid \mathbf{w}) = \sum_{i=1}^{N} \ln p(x_i \mid \mathbf{w})$$
The product becomes a sum, which is far more convenient to work with both analytically and numerically.
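The underflow problem is easy to reproduce. The following is a minimal sketch, assuming a standard Gaussian model and a made-up dataset of 2000 points drawn with Python's `random` module; the helper name `gaussian_pdf` is hypothetical and used only for illustration.

```python
import math
import random

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(x; mu, sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Made-up dataset: 2000 draws from a standard Gaussian (seeded for repeatability).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Naive likelihood: multiply the per-point densities.
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x)

# Log likelihood: sum the per-point log densities.
log_likelihood = sum(math.log(gaussian_pdf(x)) for x in data)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # a finite, large negative number
```

Each density here is at most about 0.4, so the product of 2000 of them is far below the smallest positive double, while the sum of logs stays comfortably within range.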
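The logarithm is a monotonically increasing function, so $p(\mathcal{D} \mid \mathbf{w}_1) > p(\mathcal{D} \mid \mathbf{w}_2)$ exactly when $\ln p(\mathcal{D} \mid \mathbf{w}_1) > \ln p(\mathcal{D} \mid \mathbf{w}_2)$: both objectives rank parameter settings identically and share the same maximizer. The stationarity condition shows the same thing. At an optimum $\mathbf{w}^*$ the derivative of the objective is zero. Applying the chain rule to the log likelihood:
$$\frac{d}{d\mathbf{w}} \ln p(\mathcal{D} \mid \mathbf{w}) = \frac{1}{p(\mathcal{D} \mid \mathbf{w})} \cdot \frac{d}{d\mathbf{w}} p(\mathcal{D} \mid \mathbf{w}) = 0$$
Since $p(\mathcal{D} \mid \mathbf{w}) > 0$ wherever the log likelihood is defined, the factor $\tfrac{1}{p(\mathcal{D}|\mathbf{w})}$ is finite and nonzero there, so it cannot create or remove zeros of the derivative. The zeros are determined entirely by $\tfrac{d}{d\mathbf{w}} p(\mathcal{D} \mid \mathbf{w})$, which is the same condition as for maximizing the original likelihood.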
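A quick numerical check of this equivalence, as a minimal sketch: the five data values, the fixed variance of 1, and the grid of candidate means are all made up for illustration. The dataset is kept tiny so that the raw likelihood does not underflow.

```python
import math

# Made-up observations; their average is 2.28.
data = [1.2, 1.9, 2.4, 2.8, 3.1]

def likelihood(mu, sigma2=1.0):
    """Product of Gaussian densities N(x_i; mu, sigma2)."""
    prod = 1.0
    for x in data:
        prod *= math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return prod

def log_likelihood(mu, sigma2=1.0):
    """Sum of Gaussian log densities ln N(x_i; mu, sigma2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
               for x in data)

# Candidate means 0.00, 0.01, ..., 4.00.
grid = [i / 100 for i in range(401)]

best_by_likelihood = max(grid, key=likelihood)
best_by_log = max(grid, key=log_likelihood)

# Both objectives pick the same grid point.
print(best_by_likelihood, best_by_log)
```

The two objectives take very different values at every grid point, but they order the candidates the same way, so the selected parameter is identical.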
Convention: in machine learning we typically minimize error functions rather than maximize objectives. Maximizing the log likelihood is equivalent to minimizing the negative log likelihood (NLL).
3. MLE for a Gaussian: Deriving the Log Likelihood
Assume the data are drawn i.i.d. from $\mathcal{N}(x\,;\,\mu, \sigma^2)$. The log likelihood is:
$$\ln p(\mathcal{D} \mid \mu, \sigma^2) = \sum_{i=1}^{N} \ln \mathcal{N}(x_i\,;\,\mu,\sigma^2)$$
$$= \sum_{i=1}^{N} \left[ -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$
$$= -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$
4. MLE for the Mean
Standard recipe. To find the MLE for any parameter: (1) formulate the log likelihood as a function of that parameter; (2) set its derivative to zero; (3) solve for the parameter. We apply this recipe twice below — once for $\mu$, once for $\sigma^2$.
Differentiate with respect to $\mu$ and set to zero. The first term does not depend on $\mu$:
$$\frac{\partial}{\partial \mu}\ln p(\mathcal{D} \mid \mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0$$
Solving for $\mu$:
$$\sum_{i=1}^{N} x_i = N\mu$$
$$\hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
The maximum likelihood estimate of the mean is the sample mean.
5. MLE for the Variance
Differentiate with respect to $\sigma^2$ and set to zero:
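$$\frac{\partial}{\partial \sigma^2}\ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0$$
Multiplying through by $2\sigma^4$ and rearranging:
$$N\sigma^2 = \sum_{i=1}^{N}(x_i - \mu)^2$$
Substituting the estimate $\hat{\mu}_{ML}$ for the unknown mean gives:
$$\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu}_{ML})^2$$
The maximum likelihood estimate of the variance is the sample variance (with a $1/N$ normalization).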
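The two closed-form estimates can be sanity-checked numerically. The sketch below assumes the forms $\hat{\mu}_{ML} = \tfrac{1}{N}\sum_i x_i$ and $\hat{\sigma}^2_{ML} = \tfrac{1}{N}\sum_i (x_i - \hat{\mu}_{ML})^2$, uses a made-up five-point dataset, and evaluates the two partial derivatives of the Gaussian log likelihood at those estimates. All variable names are illustrative.

```python
import math
import statistics

# Made-up observations.
data = [1.2, 1.9, 2.4, 2.8, 3.1]
n = len(data)

# Closed-form maximum likelihood estimates.
mu_ml = sum(data) / n
sigma2_ml = sum((x - mu_ml) ** 2 for x in data) / n

def log_likelihood(mu, sigma2):
    """Gaussian log likelihood of the dataset."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
               for x in data)

# Partial derivatives of the log likelihood, evaluated at the estimates.
d_mu = sum(x - mu_ml for x in data) / sigma2_ml
d_sigma2 = -n / (2 * sigma2_ml) + sum((x - mu_ml) ** 2 for x in data) / (2 * sigma2_ml ** 2)

print(mu_ml, sigma2_ml)
print(d_mu, d_sigma2)               # both zero up to rounding error
print(statistics.pvariance(data))   # population variance (1/N), matches sigma2_ml
```

Note that `statistics.pvariance` divides by $N$, which is the normalization the maximum likelihood estimate uses, whereas `statistics.variance` divides by $N-1$.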
6. Bias Analysis
Are these estimators reliable? We check by computing their expected values over all possible datasets drawn from the true distribution $\mathcal{N}(\mu, \sigma^2)$.
Mean: Unbiased
$$\mathbb{E}[\hat{\mu}_{ML}] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[x_i] = \frac{1}{N}\cdot N\mu = \mu \quad \checkmark$$
The MLE mean is an unbiased estimator: on average it recovers the true mean.
Variance: Biased
The calculation is more involved. The key ingredients, using $\mathbb{E}[x_i^2] = \sigma^2 + \mu^2$ (from the variance formula) and $\mathbb{E}[x_i x_j] = \mu^2$ for $i \neq j$ (from independence), give:
$$\mathbb{E}[\hat{\sigma}^2_{ML}] = \frac{N-1}{N}\,\sigma^2$$
The MLE variance systematically underestimates the true variance by a factor $\tfrac{N-1}{N}$. It is a biased estimator.
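For readers who want the intermediate steps, one way to assemble these ingredients is sketched here. It assumes the estimator has the form $\hat{\sigma}^2_{ML} = \tfrac{1}{N}\sum_i (x_i - \hat{\mu}_{ML})^2$ with $\hat{\mu}_{ML} = \tfrac{1}{N}\sum_i x_i$. Expanding the square first gives:

$$\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \hat{\mu}_{ML}^2$$

The expectation of the first term uses $\mathbb{E}[x_i^2] = \sigma^2 + \mu^2$:

$$\mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2\right] = \sigma^2 + \mu^2$$

The second term is a double sum with $N$ diagonal terms ($i = j$) and $N(N-1)$ off-diagonal terms ($i \neq j$):

$$\mathbb{E}[\hat{\mu}_{ML}^2] = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathbb{E}[x_i x_j] = \frac{1}{N^2}\left[N(\sigma^2 + \mu^2) + N(N-1)\mu^2\right] = \mu^2 + \frac{\sigma^2}{N}$$

Subtracting the two:

$$\mathbb{E}[\hat{\sigma}^2_{ML}] = (\sigma^2 + \mu^2) - \left(\mu^2 + \frac{\sigma^2}{N}\right) = \frac{N-1}{N}\,\sigma^2$$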
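Multiplying $\hat{\sigma}^2_{ML}$ by $\tfrac{N}{N-1}$ removes the bias and gives the bias-corrected estimator:
$$\tilde{\sigma}^2 = \frac{N}{N-1}\,\hat{\sigma}^2_{ML} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \hat{\mu}_{ML})^2$$
This is the familiar $N-1$ denominator seen in statistics. It satisfies $\mathbb{E}[\tilde{\sigma}^2] = \sigma^2$.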
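The bias can also be observed empirically. The following Monte Carlo sketch makes several illustrative assumptions that are not from the lecture: a true distribution $\mathcal{N}(0, 4)$, datasets of size $N = 5$, and 20000 simulated datasets. It averages the $1/N$ and $1/(N-1)$ variance estimates over all of them.

```python
import math
import random

random.seed(0)

true_mu, true_sigma2 = 0.0, 4.0   # assumed ground truth for the simulation
n = 5                             # small dataset, where the bias is most visible
trials = 20000                    # number of simulated datasets

sum_ml = 0.0
sum_corrected = 0.0
for _ in range(trials):
    sample = [random.gauss(true_mu, math.sqrt(true_sigma2)) for _ in range(n)]
    mean = sum(sample) / n
    squared_dev = sum((x - mean) ** 2 for x in sample)
    sum_ml += squared_dev / n               # maximum likelihood estimate (divides by N)
    sum_corrected += squared_dev / (n - 1)  # bias-corrected estimate (divides by N - 1)

mean_ml = sum_ml / trials
mean_corrected = sum_corrected / trials

print(mean_ml)         # close to (N-1)/N * sigma^2 = 3.2
print(mean_corrected)  # close to sigma^2 = 4.0
```

With these settings the averaged maximum likelihood estimate should land near $\tfrac{4}{5}\cdot 4 = 3.2$ rather than 4, up to Monte Carlo noise.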
When fitting a Gaussian to a small dataset, the sample mean $\hat{\mu}_{ML}$ is computed from the same data as the variance. The sample mean is the least-squares center of the data: it minimizes the sum of squared distances to the data points, so the total squared deviation around it is never larger than the total squared deviation around the true mean. Measuring spread around this center therefore yields a value that is at most, and typically smaller than, the spread measured around the true (unknown) mean. The bias grows as the dataset shrinks; for large $N$, $\tfrac{N-1}{N} \to 1$ and the bias vanishes.
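This intuition can be made precise with a standard decomposition, shown here as a short sketch in the notation used above, with $\hat{\mu}_{ML} = \tfrac{1}{N}\sum_i x_i$:

$$\sum_{i=1}^{N}(x_i - \mu)^2 = \sum_{i=1}^{N}(x_i - \hat{\mu}_{ML})^2 + N(\hat{\mu}_{ML} - \mu)^2$$

The cross term vanishes because $\sum_i (x_i - \hat{\mu}_{ML}) = 0$. The last term is non-negative, so the spread around the sample mean never exceeds the spread around the true mean. Because $\mathbb{E}[(\hat{\mu}_{ML} - \mu)^2] = \sigma^2 / N$ for i.i.d. data, dividing by $N$ and taking expectations recovers the bias factor:

$$\mathbb{E}[\hat{\sigma}^2_{ML}] = \sigma^2 - \frac{\sigma^2}{N} = \frac{N-1}{N}\,\sigma^2$$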