Lecture 6.1

Probabilistic Generative Models: Maximum Likelihood

Fitting the parameters of a Gaussian generative classifier by maximum likelihood, and showing that the shared-covariance case yields an analytical, linear decision boundary.

Learning Objectives

Write the likelihood for a two-class Gaussian generative model using the binary-selection trick.
Derive the MLE for the class prior $q$, class means $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$, and shared covariance $\boldsymbol{\Sigma}$.
Interpret each MLE solution as a natural sample statistic.
State the advantages and disadvantages of the LDA framework.

1. Model and Likelihood

We observe $N$ data points $\{(\mathbf{x}_n, t_n)\}$ where $t_n \in \{0,1\}$ encodes the class. The generative model assumes:

Class-conditional densities (shared covariance for LDA): $p(\mathbf{x} \mid C_1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ and $p(\mathbf{x} \mid C_2) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$.
Class priors: $p(C_1) = q$ and $p(C_2) = 1-q$.

Using the binary-selection trick, the joint likelihood of one data point is

$$p(\mathbf{x}_n, t_n) = \bigl[q\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\bigr]^{t_n} \bigl[(1-q)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\bigr]^{1-t_n}.$$

Assuming i.i.d. data, the log-likelihood is

$$\ln L = \sum_{n=1}^{N} \Bigl[ t_n \ln q + (1-t_n)\ln(1-q) + t_n \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1,\boldsymbol{\Sigma}) + (1-t_n)\ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2,\boldsymbol{\Sigma}) \Bigr].$$
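The log-likelihood above can be evaluated directly, which is handy for checking the closed-form estimates derived later. The following is a minimal NumPy sketch, not a reference implementation; the function names `gaussian_logpdf` and `log_likelihood` and the toy data are assumptions introduced for illustration.

```python
import numpy as np

def gaussian_logpdf(X, mu, Sigma):
    """Row-wise ln N(x | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    diff = X - mu
    # Mahalanobis terms via a linear solve instead of an explicit inverse.
    maha = np.einsum("nd,dn->n", diff, np.linalg.solve(Sigma, diff.T))
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def log_likelihood(X, t, q, mu1, mu2, Sigma):
    """ln L for the two-class shared-covariance model; t[n] = 1 marks class C_1."""
    ll1 = np.log(q) + gaussian_logpdf(X, mu1, Sigma)        # selected when t_n = 1
    ll2 = np.log(1.0 - q) + gaussian_logpdf(X, mu2, Sigma)  # selected when t_n = 0
    return float(np.sum(t * ll1 + (1 - t) * ll2))

# Toy data, made up for illustration only.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
t = np.array([1, 1, 0, 0])
ll = log_likelihood(X, t, 0.5, np.array([0.5, 0.0]), np.array([3.5, 3.0]), np.eye(2))
```

The exponents $t_n$ and $1-t_n$ in the likelihood become the multiplicative masks `t` and `1 - t` in the sum, so each data point contributes only the term for its own class.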

Standard MLE Recipe

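For each parameter: (1) identify the terms in $\ln L$ that depend on it, (2) take the derivative and set it to zero, (3) solve. The terms for $q$ and for each class mean separate cleanly, so those parameters can be solved independently; the covariance solution then uses the estimated means.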

2. MLE for the Class Prior $q$

Only the first two terms in $\ln L$ involve $q$. Setting $\partial(\ln L)/\partial q = 0$ gives

$$\sum_{n=1}^{N} \left( \frac{t_n}{q} - \frac{1-t_n}{1-q} \right) = 0.$$

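Multiplying through by $q(1-q)$ collapses the sum to $\sum_{n=1}^{N}(t_n - q) = \sum_{n=1}^{N} t_n - Nq = 0$, so solving yields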

MLE for the Class Prior $$\hat{q} = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N},$$

where $N_1$ is the number of class-1 examples. The MLE is simply the empirical fraction of class-1 observations.
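As a quick sanity check, the closed-form estimate can be compared against a brute-force search over the prior terms of the log-likelihood. This is only an illustrative sketch: the label vector and the helper `prior_loglik` are assumptions, not part of the lecture.

```python
import numpy as np

# Made-up binary labels: 6 of the 10 points belong to class C_1.
t = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

# Closed-form MLE: the empirical fraction of class-1 labels.
q_hat = t.mean()

def prior_loglik(q, t):
    """The part of ln L that depends on q."""
    return np.sum(t * np.log(q) + (1 - t) * np.log(1 - q))

# Grid search over q in steps of 0.01 as an independent check.
grid = np.linspace(0.01, 0.99, 99)
q_grid = grid[np.argmax([prior_loglik(q, t) for q in grid])]
```

Both `q_hat` and `q_grid` land on the same value here because the prior terms are concave in $q$ with a single maximum at $N_1/N$.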

3. MLE for the Class Means

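Only the Gaussian terms for class $C_1$ involve $\boldsymbol{\mu}_1$. Setting the gradient to zero gives $\boldsymbol{\Sigma}^{-1}\sum_{n=1}^{N} t_n(\mathbf{x}_n - \boldsymbol{\mu}_1) = \mathbf{0}$; because $\boldsymbol{\Sigma}$ is positive definite, $\boldsymbol{\Sigma}^{-1}$ is invertible and can be cancelled. The same argument applies to $\boldsymbol{\mu}_2$, which yields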

MLE for Class Means $$\hat{\boldsymbol{\mu}}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n\,\mathbf{x}_n, \qquad \hat{\boldsymbol{\mu}}_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1-t_n)\,\mathbf{x}_n.$$

Here $N_2 = N - N_1$ is the number of class-2 examples. Each estimate is the sample mean of the corresponding class.
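The masked sums in the formula translate almost literally into NumPy. The sketch below uses made-up data and assumed variable names; it also shows the equivalent boolean-indexing form.

```python
import numpy as np

# Toy data, made up for illustration: two points per class.
X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 3.0], [5.0, 5.0]])
t = np.array([1, 1, 0, 0])

N1 = t.sum()
N2 = (1 - t).sum()

# Literal form of the MLE: t_n and (1 - t_n) act as masks inside the sum.
mu1_hat = (t[:, None] * X).sum(axis=0) / N1
mu2_hat = ((1 - t)[:, None] * X).sum(axis=0) / N2

# Equivalent form: select each class and take its sample mean.
mu1_alt = X[t == 1].mean(axis=0)
mu2_alt = X[t == 0].mean(axis=0)
```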

4. MLE for the Shared Covariance

Both Gaussian terms involve $\boldsymbol{\Sigma}$. Differentiating the log-likelihood with respect to $\boldsymbol{\Sigma}^{-1}$ (see Bishop §4.2.2) and setting it to zero gives

MLE for the Shared Covariance $$\hat{\boldsymbol{\Sigma}} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2,$$

where $\mathbf{S}_k = \frac{1}{N_k}\displaystyle\sum_{n \in C_k}(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_k)^\top$ is the within-class sample covariance. The MLE is a data-weighted average of the two within-class covariances.
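A minimal sketch of the pooled estimate, assuming labels with `t == 1` for $C_1$ and `t == 0` for $C_2$; the helper names `within_class_cov` and `shared_cov` are hypothetical. Note the $1/N_k$ normalization, which matches the MLE rather than the unbiased $1/(N_k - 1)$ estimator.

```python
import numpy as np

def within_class_cov(Xk):
    """S_k: covariance of one class with 1/N_k normalization (the MLE form)."""
    diff = Xk - Xk.mean(axis=0)
    return diff.T @ diff / len(Xk)

def shared_cov(X, t):
    """Sigma_hat = (N1/N) S_1 + (N2/N) S_2."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    N = N1 + N2
    return (N1 / N) * within_class_cov(X1) + (N2 / N) * within_class_cov(X2)

# Toy data, made up for illustration.
X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 3.0], [5.0, 5.0]])
t = np.array([1, 1, 0, 0])
Sigma_hat = shared_cov(X, t)
```

With only two points per class, each $\mathbf{S}_k$ in this toy example is rank one; it is the blend of the two that makes $\hat{\boldsymbol{\Sigma}}$ invertible here.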

Isotropy from Averaging

If $\mathbf{S}_1$ is elongated horizontally and $\mathbf{S}_2$ is elongated vertically, their weighted average is closer to isotropic than either one. In general $\hat{\boldsymbol{\Sigma}}$ need not be isotropic, since it depends on the individual $\mathbf{S}_k$ and on the class counts, but its form is always the weighted blend above.
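A small numeric illustration of this effect, using two made-up diagonal covariances and a hypothetical helper `blend`: with balanced classes the elongations cancel exactly, while unbalanced classes leave the blend tilted toward the larger class.

```python
import numpy as np

S1 = np.diag([4.0, 1.0])  # elongated horizontally (assumed for illustration)
S2 = np.diag([1.0, 4.0])  # elongated vertically (assumed for illustration)

def blend(N1, N2):
    """Weighted average (N1/N) S1 + (N2/N) S2."""
    return (N1 * S1 + N2 * S2) / (N1 + N2)

balanced = blend(50, 50)  # diag(2.5, 2.5): isotropic
skewed = blend(75, 25)    # diag(3.25, 1.75): still elongated horizontally
```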

5. The LDA Decision Rule

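Plugging the MLE parameters into the posterior (Lecture 5.6) and expanding the log-odds for $K=2$, the quadratic terms $\mathbf{x}^\top\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{x}$ cancel because both classes share the same covariance, which leaves a linear discriminant: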

Linear Discriminant Analysis (LDA) Decision Rule $$p(C_1 \mid \mathbf{x}) = \sigma\!\bigl(\mathbf{w}^\top \mathbf{x} + w_0\bigr),$$

where $\mathbf{w} = \hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_2)$ and $w_0 = -\tfrac{1}{2}\hat{\boldsymbol{\mu}}_1^\top\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}_1 + \tfrac{1}{2}\hat{\boldsymbol{\mu}}_2^\top\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}_2 + \ln\!\bigl(\hat{q}/(1-\hat{q})\bigr)$. Assigning $\mathbf{x}$ to $C_1$ when $\mathbf{w}^\top\mathbf{x} + w_0 > 0$ gives a linear decision boundary.
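Putting the estimates together, the whole fit reduces to a few lines. The following is a sketch under stated assumptions rather than a definitive implementation: `fit_lda` and `posterior_c1` are hypothetical names, the data are made up, and $\hat{\boldsymbol{\Sigma}}$ is assumed to be invertible.

```python
import numpy as np

def fit_lda(X, t):
    """Return (w, w0) from the closed-form MLEs; t[n] = 1 marks class C_1."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    q = N1 / (N1 + N2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    d1, d2 = X1 - mu1, X2 - mu2
    # (N1/N) S_1 + (N2/N) S_2, written as pooled scatter divided by N.
    Sigma = (d1.T @ d1 + d2.T @ d2) / (N1 + N2)
    # Linear solves stand in for multiplying by the inverse covariance.
    w = np.linalg.solve(Sigma, mu1 - mu2)
    w0 = (-0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
          + 0.5 * mu2 @ np.linalg.solve(Sigma, mu2)
          + np.log(q / (1 - q)))
    return w, w0

def posterior_c1(X, w, w0):
    """p(C_1 | x) = sigma(w^T x + w0) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

# Toy data, made up for illustration: two unit squares of four points each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0]])
t = np.array([1, 1, 1, 1, 0, 0, 0, 0])

w, w0 = fit_lda(X, t)
p_mid = posterior_c1(np.array([[2.0, 2.0]]), w, w0)  # midpoint between the means
```

For this symmetric toy data the classes are balanced, so the prior term vanishes and the boundary $\mathbf{w}^\top\mathbf{x} + w_0 = 0$ is the line $x_1 + x_2 = 4$, which passes through the midpoint of the two class means.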

6. Advantages and Limitations of LDA

Advantages: all parameters have closed-form MLE solutions; the result is a full probabilistic model; the decision rule is a simple linear threshold.

Limitations:

Sensitivity to outliers. A single distant point pulls $\hat{\boldsymbol{\mu}}_k$ and $\hat{\boldsymbol{\Sigma}}$ substantially, distorting the decision boundary.
Reliance on handcrafted features. In high dimensions, appropriate basis functions must be chosen manually.
Prone to overfitting. No regularization is included in the MLE objective.
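The outlier sensitivity noted above is easy to see numerically. In this sketch, with made-up data and a hypothetical helper `mean_and_cov`, a single distant point is appended to a tight cluster and the sample statistics are recomputed.

```python
import numpy as np

# Made-up class sample: four points clustered near the origin.
X_clean = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# The same sample with one hypothetical outlier appended.
X_dirty = np.vstack([X_clean, [100.0, 0.5]])

def mean_and_cov(Xk):
    """Sample mean and 1/N_k-normalized covariance of one class."""
    mu = Xk.mean(axis=0)
    diff = Xk - mu
    return mu, diff.T @ diff / len(Xk)

mu_clean, S_clean = mean_and_cov(X_clean)
mu_dirty, S_dirty = mean_and_cov(X_dirty)
```

The first coordinate of the mean moves from 0.5 to 20.4, and the corresponding variance grows from 0.25 to roughly 1584, so both $\mathbf{w}$ and $w_0$ computed from these statistics would change substantially.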