Lecture 6.1
Probabilistic Generative Models: Maximum Likelihood
Fitting the parameters of a Gaussian generative classifier by maximum likelihood, and showing that the shared-covariance case yields an analytical, linear decision boundary.
- Write the likelihood for a two-class Gaussian generative model using the binary-selection trick.
- Derive the MLE for the class prior $q$, class means $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$, and shared covariance $\boldsymbol{\Sigma}$.
- Interpret each MLE solution as a natural sample statistic.
- State the advantages and disadvantages of the LDA framework.
1. Model and Likelihood
We observe $N$ data points $\{(\mathbf{x}_n, t_n)\}$ where $t_n \in \{0,1\}$ encodes the class. The generative model assumes:
- Class-conditional densities (shared covariance for LDA): $p(\mathbf{x} \mid C_1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ and $p(\mathbf{x} \mid C_2) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$.
- Class priors: $p(C_1) = q$ and $p(C_2) = 1-q$.
Using the binary-selection trick, the joint likelihood of one data point is
$$p(\mathbf{x}_n, t_n) = \bigl[q\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\bigr]^{t_n} \bigl[(1-q)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\bigr]^{1-t_n}.$$
Assuming i.i.d. data, the log-likelihood is
$$\ln L = \sum_{n=1}^{N} \Bigl[ t_n \ln q + (1-t_n)\ln(1-q) + t_n \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1,\boldsymbol{\Sigma}) + (1-t_n)\ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2,\boldsymbol{\Sigma}) \Bigr].$$
For each parameter: (1) identify the terms in $\ln L$ that depend on it, (2) take the derivative and set it to zero, (3) solve. Each parameter separates cleanly.
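The log-likelihood above can be evaluated numerically as a sanity check on the derivations that follow. The sketch below assumes NumPy, 0/1 labels in `t`, and data rows in `X`; the function names `gaussian_logpdf` and `log_likelihood` are illustrative choices, not part of the lecture.

```python
import numpy as np

def gaussian_logpdf(X, mu, Sigma):
    """Row-wise ln N(x | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    diff = X - mu
    # Mahalanobis term via a linear solve rather than an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

def log_likelihood(X, t, q, mu1, mu2, Sigma):
    """ln L for the two-class shared-covariance model; t holds 0/1 labels."""
    t = np.asarray(t, dtype=float)
    return np.sum(
        t * (np.log(q) + gaussian_logpdf(X, mu1, Sigma))
        + (1 - t) * (np.log(1 - q) + gaussian_logpdf(X, mu2, Sigma))
    )
```

Because `t` is binary, each data point contributes only the terms of its own class, which mirrors the binary-selection trick.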
2. MLE for the Class Prior $q$
Only the first two terms in $\ln L$ involve $q$. Setting $\partial(\ln L)/\partial q = 0$ gives
$$\sum_{n=1}^{N} \left[\frac{t_n}{q} - \frac{1-t_n}{1-q}\right] = 0.$$
Solving yields
$$\hat{q} = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N},$$
where $N_1 = \sum_{n} t_n$ is the number of class-1 examples. The MLE is simply the empirical fraction of class-1 observations.
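As a small numerical illustration, the sketch below computes the prior estimate as the fraction of class-1 labels and checks that the derivative of $\ln L$ with respect to $q$ vanishes there. The label vector is made up for the example.

```python
import numpy as np

t = np.array([1, 0, 1, 1, 0])  # made-up 0/1 class labels

N = t.size
N1 = t.sum()          # number of class-1 examples
q_hat = N1 / N        # empirical fraction of class 1

# Derivative of ln L with respect to q, evaluated at q_hat; should be ~0.
score = np.sum(t / q_hat - (1 - t) / (1 - q_hat))
print(q_hat, score)
```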
3. MLE for the Class Means
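Only the Gaussian terms for class $C_1$ involve $\boldsymbol{\mu}_1$. Setting the gradient to zero, $\boldsymbol{\Sigma}^{-1}\sum_{n} t_n(\mathbf{x}_n - \boldsymbol{\mu}_1) = \mathbf{0}$, and using positive-definiteness of $\boldsymbol{\Sigma}$ (so that $\boldsymbol{\Sigma}^{-1}$ is invertible and can be cancelled) yields
$$\hat{\boldsymbol{\mu}}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n\,\mathbf{x}_n, \qquad \hat{\boldsymbol{\mu}}_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1-t_n)\,\mathbf{x}_n,$$
where $N_2 = N - N_1$ is the number of class-2 examples and the result for $\boldsymbol{\mu}_2$ follows by the same argument. Each is the sample mean over the corresponding class.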
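A minimal sketch of the class-mean estimates, using the labels as weights so that each sum picks out one class. The data array `X` and labels `t` are made up for illustration.

```python
import numpy as np

# Made-up 2-D data: rows are points, t gives the 0/1 class label.
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0], [1.0, 3.0]])
t = np.array([1, 1, 0, 0, 1])

N1 = t.sum()
N2 = t.size - N1

# Label-weighted sums: t selects class 1, (1 - t) selects class 2.
mu1_hat = (t[:, None] * X).sum(axis=0) / N1
mu2_hat = ((1 - t)[:, None] * X).sum(axis=0) / N2

# Equivalent form: plain per-class sample means.
same_mu1 = X[t == 1].mean(axis=0)
same_mu2 = X[t == 0].mean(axis=0)
```

The two forms agree; the label-weighted one matches how the sums arise in the derivation, while the masked one is how it is usually written in practice.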
4. MLE for the Shared Covariance
Both Gaussian terms involve $\boldsymbol{\Sigma}$. Differentiating the log-likelihood with respect to $\boldsymbol{\Sigma}^{-1}$ (see Bishop §4.2.2) and setting it to zero gives
$$\hat{\boldsymbol{\Sigma}} = \frac{N_1}{N}\,\mathbf{S}_1 + \frac{N_2}{N}\,\mathbf{S}_2,$$
where $N_k$ is the number of examples in class $C_k$ (so $N_1 + N_2 = N$) and $\mathbf{S}_k = \frac{1}{N_k}\displaystyle\sum_{n \in C_k}(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)^\top$ is the within-class sample covariance. The MLE is a data-weighted average of the two within-class covariances.
If $\mathbf{S}_1$ is elongated horizontally and $\mathbf{S}_2$ is elongated vertically, their weighted average tends toward isotropy. In general $\hat{\boldsymbol{\Sigma}}$ need not be isotropic (this depends on the individual $\mathbf{S}_k$), but its form is always the weighted blend above.
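The weighted blend of within-class covariances can be sketched as follows. The helper name `shared_covariance` and the four data points are assumptions made for the example; the points are chosen so that one class is spread horizontally and the other vertically.

```python
import numpy as np

def shared_covariance(X, t):
    """Data-weighted average of the two within-class covariances."""
    t = np.asarray(t).astype(bool)
    N, D = X.shape
    Sigma = np.zeros((D, D))
    for mask in (t, ~t):
        Xk = X[mask]
        Nk = Xk.shape[0]
        diff = Xk - Xk.mean(axis=0)
        S_k = diff.T @ diff / Nk      # divides by N_k (MLE), not N_k - 1
        Sigma += (Nk / N) * S_k
    return Sigma

# Class 1 varies only horizontally, class 2 only vertically (made-up data).
X = np.array([[-2.0, 0.0], [2.0, 0.0], [5.0, -2.0], [5.0, 2.0]])
t = np.array([1, 1, 0, 0])
Sigma_hat = shared_covariance(X, t)
print(Sigma_hat)
```

With equal class sizes and these mirror-image spreads, the blend comes out as a multiple of the identity, which is the isotropic special case described above.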
5. The LDA Decision Rule
Plugging the MLE parameters into the posterior (Lecture 5.6) and expanding the log-odds for $K=2$ confirms the linear discriminant:
$$\ln\frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \mathbf{w}^\top\mathbf{x} + w_0,$$
where $\mathbf{w} = \hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2)$ and $w_0 = -\tfrac{1}{2}\hat{\boldsymbol{\mu}}_1^\top\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}_1 + \tfrac{1}{2}\hat{\boldsymbol{\mu}}_2^\top\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}_2 + \ln\!\bigl(\hat{q}/(1-\hat{q})\bigr)$. The quadratic terms $\mathbf{x}^\top\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{x}$ cancel because both classes share the same covariance. Assigning $\mathbf{x}$ to $C_1$ when $\mathbf{w}^\top\mathbf{x} + w_0 > 0$ gives a linear decision boundary.
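The decision rule can be sketched in a few lines. The function names `lda_discriminant` and `posterior_c1` and the parameter values are illustrative assumptions. The posterior is taken to be the logistic sigmoid of the log-odds, as is standard for two classes.

```python
import numpy as np

def lda_discriminant(q, mu1, mu2, Sigma):
    """Return (w, w0) of the linear log-odds w^T x + w0."""
    # Linear solves stand in for multiplication by the inverse of Sigma.
    w = np.linalg.solve(Sigma, mu1 - mu2)
    w0 = (-0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
          + 0.5 * mu2 @ np.linalg.solve(Sigma, mu2)
          + np.log(q / (1 - q)))
    return w, w0

def posterior_c1(X, w, w0):
    """p(C1 | x) for each row of X: logistic sigmoid of the log-odds."""
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

# Made-up parameters: equal priors, identity covariance, mirrored means.
w, w0 = lda_discriminant(0.5, np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.eye(2))
X_new = np.array([[0.0, 5.0], [1.0, 0.0], [-3.0, 2.0]])
p1 = posterior_c1(X_new, w, w0)
labels = (X_new @ w + w0 > 0).astype(int)  # 1 means assign to C1
```

In this symmetric setup the boundary is the vertical axis, so the first point lies exactly on it (posterior 0.5 and, with the strict inequality, not assigned to $C_1$).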
6. Advantages and Limitations of LDA
Advantages: all parameters have closed-form MLE solutions; the result is a full probabilistic model; the decision rule is a simple linear threshold.
Limitations:
- Sensitivity to outliers. A single distant point pulls $\hat{\boldsymbol{\mu}}_k$ and $\hat{\boldsymbol{\Sigma}}$ substantially, distorting the decision boundary.
- Reliance on handcrafted features. In high dimensions, appropriate basis functions must be chosen manually.
- Prone to overfitting. No regularization is included in the MLE objective.
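The outlier sensitivity listed above is easy to see in one dimension. The sketch below uses made-up samples for a single class and compares the MLE mean and variance before and after adding one distant point.

```python
import numpy as np

# Made-up 1-D samples for a single class.
clean = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
with_outlier = np.append(clean, 100.0)   # one distant point

# np.var divides by N by default, matching the maximum-likelihood estimate.
mean_clean, var_clean = clean.mean(), clean.var()
mean_out, var_out = with_outlier.mean(), with_outlier.var()

print(mean_clean, var_clean)
print(mean_out, var_out)
```

Here a single added point moves the mean from 2 to roughly 18.3 and inflates the variance from 2 to roughly 1336. Both estimates feed directly into $\mathbf{w}$ and $w_0$, so the boundary shifts accordingly.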