Lecture 9.4
Gaussian Mixture Models & EM
Gaussian Mixture Models (GMMs) are the probabilistic counterpart of K-means, replacing hard cluster assignments with posterior probabilities. Fitting a GMM with the Expectation-Maximization (EM) algorithm yields closed-form parameter updates that closely mirror the assignment and update steps of K-means, while handling variable cluster shapes and uncertain assignments.
- Write the GMM density as a mixture $p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ and define the mixing coefficients $\pi_k$.
- Define the responsibility $\gamma(z_{nk})$ as the posterior probability that cluster $k$ generated $\mathbf{x}_n$.
- Explain why the log-likelihood of a GMM cannot be optimized in closed form (log of a sum).
- State the EM E-step and M-step update rules for $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$.
- Compare GMMs to K-means: soft vs. hard assignments, variable cluster shapes, but slower convergence.
1. The Gaussian Mixture Model
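Assume each data point $\mathbf{x}_n$ is generated by first drawing a latent one-hot vector $\mathbf{z}$ (components $z_k \in \{0,1\}$ with $\sum_k z_k = 1$) from a categorical (generalized Bernoulli) prior, then drawing $\mathbf{x}_n$ from the corresponding Gaussian. Marginalizing over $\mathbf{z}$ gives the mixture density
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$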
where the mixing coefficients satisfy $\pi_k \geq 0$ and $\sum_{k=1}^K \pi_k = 1$. Each $\pi_k = p(z_k = 1)$ is the prior probability of belonging to cluster $k$, and $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the class-conditional density.
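The two-stage generative process can be sketched as ancestral sampling. This is a minimal illustration, assuming one-dimensional data so that each $\boldsymbol{\Sigma}_k$ reduces to a scalar standard deviation; the function name `sample_gmm` and the parameter values are made up for the example.

```python
import random

def sample_gmm(pis, mus, sigmas, n, seed=0):
    """Draw n points from a 1-D Gaussian mixture by ancestral sampling."""
    rng = random.Random(seed)
    components = list(range(len(pis)))
    samples, labels = [], []
    for _ in range(n):
        # Step 1: draw the latent component index k with probability pi_k.
        k = rng.choices(components, weights=pis)[0]
        # Step 2: draw x from the k-th Gaussian.
        samples.append(rng.gauss(mus[k], sigmas[k]))
        labels.append(k)
    return samples, labels

# Illustrative parameters (assumed, not from the lecture).
xs, zs = sample_gmm(pis=[0.3, 0.7], mus=[-2.0, 3.0], sigmas=[0.5, 1.0], n=5000)
frac_0 = zs.count(0) / len(zs)  # empirical share of component 0
```

With enough samples, `frac_0` should land close to the mixing coefficient $\pi_0$, which is what "prior probability of belonging to cluster $k$" means operationally.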
2. Responsibilities
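Given an observation $\mathbf{x}_n$, the posterior probability that cluster $k$ "is responsible" for it follows from Bayes' theorem:
$$\gamma(z_{nk}) \equiv p(z_k = 1 \mid \mathbf{x}_n) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$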
Unlike K-means, a point can have nonzero responsibility for multiple clusters simultaneously — this is the soft assignment.
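As a small sketch of the Bayes computation, the following evaluates responsibilities for a single point in a one-dimensional mixture (an assumption made to keep the example dependency-free; the helper names are illustrative).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, pis, mus, sigmas):
    """Posterior probability of each component given x (Bayes' theorem)."""
    weighted = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
    total = sum(weighted)  # this is the mixture density p(x)
    return [w / total for w in weighted]

# A point exactly halfway between two identical, equally weighted components.
gamma = responsibilities(0.0, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0])
```

The responsibilities always sum to 1 over components. This direct form can underflow for points far from every component; working with log-densities avoids that.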
3. The Log-Likelihood and Its Difficulty
Under the i.i.d. assumption, the log-likelihood of the observed data is
$$\ln p(\mathbf{X}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$
Taking derivatives and setting to zero does not yield a closed-form solution because the log acts on a sum of Gaussians. Setting $\partial \ln p / \partial \boldsymbol{\mu}_k = 0$ gives an expression for $\boldsymbol{\mu}_k$ that still depends on the responsibilities $\gamma(z_{nk})$, which themselves depend on all parameters. The EM algorithm exploits this circular dependence iteratively.
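Although the log-likelihood has no closed-form maximizer, it is cheap to evaluate, which is what EM uses to monitor convergence. The sketch below assumes one-dimensional data and strictly positive mixing coefficients, and uses the log-sum-exp trick for the inner sum; the function names are illustrative.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log-density of a 1-D Gaussian with standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def gmm_log_likelihood(xs, pis, mus, sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    total = 0.0
    for x in xs:
        # Per-component log terms: ln pi_k + ln N(x | mu_k, sigma_k^2).
        terms = [math.log(pi) + log_normal_pdf(x, mu, s)
                 for pi, mu, s in zip(pis, mus, sigmas)]
        # Log-sum-exp: factor out the largest term before exponentiating.
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

ll_single = gmm_log_likelihood([0.0], [1.0], [0.0], [1.0])
ll_far = gmm_log_likelihood([1000.0], [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Subtracting the maximum keeps the value finite even for a point such as `1000.0`, where every individual density underflows to zero in floating point.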
4. The EM Algorithm for GMMs
- Initialize $\pi_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ (e.g., run K-means to initialize $\boldsymbol{\mu}_k$).
- E-step: compute responsibilities $\gamma(z_{nk})$ for all $n, k$ using the current parameters.
- M-step: compute the effective number of points per cluster $N_k = \sum_n \gamma(z_{nk})$ and update: $$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,\mathbf{x}_n,$$ $$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^\top,$$ $$\pi_k^{\text{new}} = \frac{N_k}{N}.$$
- Repeat until the log-likelihood converges.
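The steps above can be sketched end to end as follows. This is a minimal illustration rather than a production implementation: it assumes one-dimensional data (so $\boldsymbol{\Sigma}_k$ becomes a scalar variance), takes the initial parameters as arguments instead of running K-means, and does not guard against collapsing variances or numerical underflow. The function name `em_gmm_1d`, the tolerance, and the synthetic data are all assumptions for the example.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, pis, mus, variances, max_iter=200, tol=1e-8):
    """Fit a 1-D GMM by EM; returns (pis, mus, variances, log_likelihood)."""
    pis, mus, variances = list(pis), list(mus), list(variances)
    K, N = len(pis), len(xs)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: responsibilities under the current parameters.
        gammas, ll = [], 0.0
        for x in xs:
            weighted = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(weighted)
            ll += math.log(total)  # accumulate ln p(X) as a by-product
            gammas.append([w / total for w in weighted])
        # Stop once the log-likelihood no longer improves.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: closed-form updates weighted by the responsibilities.
        for k in range(K):
            Nk = sum(g[k] for g in gammas)  # effective number of points
            mus[k] = sum(g[k] * x for g, x in zip(gammas, xs)) / Nk
            variances[k] = sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, xs)) / Nk
            pis[k] = Nk / N
    return pis, mus, variances, ll

# Synthetic data from two well-separated components (illustrative values).
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(700)]
pis, mus, variances, ll = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Note that the variance update uses the freshly updated mean, matching $\boldsymbol{\mu}_k^{\text{new}}$ in the covariance formula. The returned log-likelihood is the value computed in the last E-step.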
5. Deriving the M-Step Updates
The update for $\boldsymbol{\mu}_k$ comes from differentiating $\ln p(\mathbf{X})$ w.r.t. $\boldsymbol{\mu}_k$, using $\nabla_{\boldsymbol{\mu}_k}\ln\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) = \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)$ and recognizing the responsibility in the resulting expression.
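Spelled out, and assuming $\boldsymbol{\Sigma}_k$ is nonsingular, the stationarity condition reads:

$$
\begin{aligned}
\frac{\partial \ln p(\mathbf{X})}{\partial \boldsymbol{\mu}_k}
&= \sum_{n=1}^{N} \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\,\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) \\
&= \boldsymbol{\Sigma}_k^{-1} \sum_{n=1}^{N} \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k) = \mathbf{0}.
\end{aligned}
$$

Multiplying by $\boldsymbol{\Sigma}_k$ and writing $N_k = \sum_n \gamma(z_{nk})$ gives $\sum_n \gamma(z_{nk})\,\mathbf{x}_n = N_k\,\boldsymbol{\mu}_k$, that is, $\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_n \gamma(z_{nk})\,\mathbf{x}_n$. This is not a closed-form solution, because $\gamma(z_{nk})$ still depends on $\boldsymbol{\mu}_k$; EM holds the responsibilities fixed during the M-step.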
The update for $\pi_k$ uses Lagrange multipliers (Lecture 9.3) to enforce $\sum_k \pi_k = 1$. Setting up the Lagrangian $\mathcal{L} = \ln p(\mathbf{X}) + \lambda(\sum_k \pi_k - 1)$, differentiating w.r.t. $\pi_k$, and using $\sum_k \gamma(z_{nk}) = 1$ to solve for $\lambda = -N$ yields $\pi_k = N_k / N$.
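Written out step by step, with $N_k = \sum_n \gamma(z_{nk})$:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \pi_k}
&= \sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} + \lambda = 0 \\
\text{(multiply by } \pi_k\text{)} \quad
&\sum_{n=1}^{N} \gamma(z_{nk}) + \lambda\,\pi_k = N_k + \lambda\,\pi_k = 0 \\
\text{(sum over } k\text{)} \quad
&\sum_{k=1}^{K} N_k + \lambda \sum_{k=1}^{K} \pi_k = N + \lambda = 0
\;\;\Rightarrow\;\; \lambda = -N \\
&\Rightarrow\;\; \pi_k = \frac{N_k}{N}.
\end{aligned}
$$

The step $\sum_k N_k = N$ is where $\sum_k \gamma(z_{nk}) = 1$ is used.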
K-means assigns each point to exactly one cluster: its responsibility is 1 for the nearest centroid and 0 for all others. A GMM assigns fractional responsibilities that sum to 1 over the clusters: a point deep inside one of two well-separated clusters gets a responsibility near 1 for that cluster and near 0 for the other, whereas a point near the boundary gets substantial probability for each. A GMM also learns a per-cluster covariance matrix, allowing elongated or otherwise non-spherical clusters that K-means cannot capture.
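To make the contrast concrete, the sketch below compares a K-means-style one-hot assignment with GMM responsibilities at a few points. It assumes one-dimensional data, fixed illustrative parameters, and hypothetical helper names; nothing is fitted here.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def soft_assign(x, pis, mus, sigmas):
    """GMM responsibilities: fractional and summing to 1."""
    weighted = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
    total = sum(weighted)
    return [w / total for w in weighted]

def hard_assign(x, centers):
    """K-means style: one-hot vector marking the nearest center."""
    nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
    return [1 if k == nearest else 0 for k in range(len(centers))]

centers = [-3.0, 3.0]
for x in (-3.0, -0.1, 0.1, 3.0):
    soft = soft_assign(x, [0.5, 0.5], centers, [1.0, 1.0])
    print(x, hard_assign(x, centers), [round(g, 3) for g in soft])
```

The hard assignment flips abruptly between `-0.1` and `0.1`, while the responsibilities change smoothly across the boundary and saturate only near the cluster centers.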
6. GMM vs. K-means: Advantages and Limitations
| Property | K-means | GMM |
|---|---|---|
| Assignments | Hard (0/1) | Soft (probabilities) |
| Cluster shape | Spherical (Euclidean) | Ellipsoidal ($\boldsymbol{\Sigma}_k$ per cluster) |
| Cluster sizes | Approximately equal | Variable (via $\pi_k$) |
| Convergence speed | Fast | Slower (more iterations, each more costly) |
| Initialization | Random | Often K-means then EM |
| Optimum reached | Local minimum of $J$ | Local maximum of log-likelihood |