Lecture 9.4

Gaussian Mixture Models & EM

Gaussian Mixture Models (GMMs) are the probabilistic counterpart of K-means, replacing hard cluster assignments with posterior probabilities. Fitting a GMM with the Expectation-Maximization (EM) algorithm yields closed-form parameter updates at each iteration that closely mirror the K-means assignment and update steps, while accommodating variable cluster shapes and uncertain assignments.

Learning Objectives

Write the GMM density as a mixture $p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ and define the mixing coefficients $\pi_k$.
Define the responsibility $\gamma(z_{nk})$ as the posterior probability that cluster $k$ generated $\mathbf{x}_n$.
Explain why the log-likelihood of a GMM cannot be optimized in closed form (log of a sum).
State the EM E-step and M-step update rules for $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$.
Compare GMMs to K-means: soft vs. hard assignments, variable cluster shapes, but slower convergence.

1. The Gaussian Mixture Model

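Assume each data point $\mathbf{x}_n$ was generated by first drawing a latent one-hot vector $\mathbf{z}$, with components $z_k \in \{0,1\}$ and $\sum_k z_k = 1$, from a categorical (generalized Bernoulli) prior, and then drawing $\mathbf{x}_n$ from the Gaussian of the selected component. Marginalizing over $\mathbf{z}$ gives the mixture density: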

GMM Density $$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

where the mixing coefficients satisfy $\pi_k \geq 0$ and $\sum_{k=1}^K \pi_k = 1$. Each $\pi_k = p(z_k = 1)$ is the prior probability of belonging to cluster $k$, and $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the class-conditional density.
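The generative story above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`gaussian_pdf`, `gmm_pdf`, `sample_gmm`) and all parameter values are made up for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    # Solve Sigma^{-1} diff without forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_pdf(x, pis, mus, Sigmas):
    """Mixture density p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

def sample_gmm(n, pis, mus, Sigmas, rng):
    """Ancestral sampling: draw the cluster index first, then x given the index."""
    ks = rng.choice(len(pis), size=n, p=pis)
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

# Illustrative (made-up) parameters for a two-component mixture in 2-D.
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.stack([np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])])

rng = np.random.default_rng(0)
X, ks = sample_gmm(1000, pis, mus, Sigmas, rng)
density_at_origin = gmm_pdf(np.zeros(2), pis, mus, Sigmas)
```

At the origin the density is dominated by the first component, so `density_at_origin` is close to $0.3 / (2\pi)$; roughly 70% of the sampled indices `ks` should equal 1, matching $\pi_2 = 0.7$.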

2. Responsibilities

Given an observation $\mathbf{x}_n$, the posterior probability that cluster $k$ "is responsible" for it follows from Bayes' theorem:

Responsibility $$\gamma(z_{nk}) \;=\; p(z_k=1 \mid \mathbf{x}_n) \;=\; \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\displaystyle\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$

Unlike in K-means, a point can have nonzero responsibility for several clusters at once, and its responsibilities sum to 1 over $k$; this is the soft assignment.
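As a rough sketch, the responsibility formula can be evaluated for a whole data set at once. The helper names (`log_gaussian`, `responsibilities`) and the toy parameters are assumptions made for illustration; working in log space is a common precaution against underflow, not something the formula itself requires.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x_n | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    y = np.linalg.solve(L, (X - mu).T)          # shape (D, N)
    quad = np.sum(y ** 2, axis=0)               # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + quad)

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = p(z_k = 1 | x_n) via Bayes' theorem, computed in log space."""
    log_w = np.stack(
        [np.log(pi) + log_gaussian(X, mu, S) for pi, mu, S in zip(pis, mus, Sigmas)],
        axis=1,
    )                                           # shape (N, K)
    log_w -= log_w.max(axis=1, keepdims=True)   # guard against underflow
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)     # normalize over clusters

# Toy setup: two equally weighted unit-covariance components at (0,0) and (4,0).
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 0.0]])
Sigmas = np.stack([np.eye(2), np.eye(2)])
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])

gamma = responsibilities(X, pis, mus, Sigmas)
```

In this symmetric setup the midpoint $(2, 0)$ receives responsibility 0.5 from each component, while the two points sitting on a mean are assigned almost entirely to that component.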

3. The Log-Likelihood and Its Difficulty

Under the i.i.d. assumption, the log-likelihood of the observed data is

$$\ln p(\mathbf{X}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$

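Taking derivatives and setting them to zero does not yield a closed-form solution, because the logarithm acts on a sum of Gaussians rather than on a single Gaussian and so cannot cancel the exponentials. Setting $\partial \ln p / \partial \boldsymbol{\mu}_k = 0$ gives an expression for $\boldsymbol{\mu}_k$ that still depends on the responsibilities $\gamma(z_{nk})$, which themselves depend on all parameters. The EM algorithm resolves this circular dependence by alternating: it holds the parameters fixed to compute the responsibilities, then holds the responsibilities fixed to update the parameters.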
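Although the log-likelihood has no closed-form maximizer, it is cheap to evaluate, which is what makes it usable as a convergence monitor. The sketch below assumes NumPy and uses hypothetical helper names; the inner sum over $k$ is computed with a log-sum-exp reduction so that very small densities do not underflow.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x_n | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    y = np.linalg.solve(L, (X - mu).T)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + np.sum(y ** 2, axis=0))

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    log_w = np.stack(
        [np.log(pi) + log_gaussian(X, mu, S) for pi, mu, S in zip(pis, mus, Sigmas)],
        axis=1,
    )                                                   # shape (N, K)
    # Log-sum-exp over components (inner sum), then plain sum over data points.
    return np.logaddexp.reduce(log_w, axis=1).sum()

# Sanity check: one standard Gaussian evaluated at the origin.
X = np.zeros((1, 2))
I2 = np.eye(2)
ll_single = gmm_log_likelihood(X, [1.0], [np.zeros(2)], [I2])

# Splitting that Gaussian into two identical halves must not change the value.
ll_split = gmm_log_likelihood(X, [0.5, 0.5], [np.zeros(2)] * 2, [I2, I2])
```

Both values equal $-\ln(2\pi) \approx -1.8379$, the log-density of a 2-D standard normal at its mean.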

4. The EM Algorithm for GMMs

EM for Gaussian Mixture Models

Initialize $\pi_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ (e.g., run K-means to initialize $\boldsymbol{\mu}_k$).
E-step: compute responsibilities $\gamma(z_{nk})$ for all $n, k$ using the current parameters.
M-step: compute the effective number of points per cluster $N_k = \sum_n \gamma(z_{nk})$ and update: $$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,\mathbf{x}_n,$$ $$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^\top,$$ $$\pi_k^{\text{new}} = \frac{N_k}{N}.$$
Repeat the E-step and M-step until the log-likelihood (or the parameters) changes by less than a small tolerance.
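The steps above can be sketched as a short NumPy loop. This is one possible arrangement under stated assumptions, not a definitive implementation: the name `em_gmm`, the `mus0` argument (initial means, e.g. taken from a K-means run), the uniform initial $\pi_k$, the shared initial covariance, and the small ridge `reg` added to each covariance to keep it invertible are all choices made for this example.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x_n | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    y = np.linalg.solve(L, (X - mu).T)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + np.sum(y ** 2, axis=0))

def em_gmm(X, mus0, n_iter=200, tol=1e-8, reg=1e-6):
    """EM for a GMM. mus0: initial means of shape (K, D); reg: assumed ridge term."""
    N, D = X.shape
    mus = np.array(mus0, dtype=float)
    K = mus.shape[0]
    pis = np.full(K, 1.0 / K)                              # uniform initial weights
    Sigmas = np.stack([np.cov(X.T) + reg * np.eye(D) for _ in range(K)])
    history = []                                           # log-likelihood per iteration
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters.
        log_w = np.stack(
            [np.log(pis[k]) + log_gaussian(X, mus[k], Sigmas[k]) for k in range(K)],
            axis=1,
        )
        log_px = np.logaddexp.reduce(log_w, axis=1)        # ln p(x_n)
        gamma = np.exp(log_w - log_px[:, None])
        history.append(log_px.sum())
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break                                          # converged
        # M-step: closed-form updates given the responsibilities.
        Nk = gamma.sum(axis=0)                             # effective counts N_k
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
        pis = Nk / N
    # gamma holds the responsibilities from the last E-step that was run.
    return pis, mus, Sigmas, gamma, history

# Synthetic (made-up) data: two well-separated 2-D clusters of 200 points each.
rng = np.random.default_rng(0)
A = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200)
B = rng.multivariate_normal([6.0, 6.0], [[2.0, 0.8], [0.8, 1.0]], size=200)
X = np.vstack([A, B])

pis, mus, Sigmas, gamma, history = em_gmm(X, mus0=[[1.0, 1.0], [5.0, 5.0]])
```

On data like this, `history` should be non-decreasing up to floating-point noise, which is a useful check when debugging an EM implementation: a log-likelihood that drops noticeably between iterations usually signals a bug in one of the two steps.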

5. Deriving the M-Step Updates

The update for $\boldsymbol{\mu}_k$ comes from differentiating $\ln p(\mathbf{X})$ w.r.t. $\boldsymbol{\mu}_k$, using $\nabla_{\boldsymbol{\mu}_k}\ln\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) = \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)$ and recognizing the responsibility in the resulting expression.
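Written out, the steps might look as follows (a sketch, assuming each $\boldsymbol{\Sigma}_k$ is nonsingular). Differentiating the log-likelihood and applying the chain rule to the inner sum gives

$$\frac{\partial \ln p(\mathbf{X})}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\,\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) = \sum_{n=1}^{N} \gamma(z_{nk})\,\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k).$$

Setting this to zero and multiplying from the left by $\boldsymbol{\Sigma}_k$ leaves

$$\sum_{n=1}^{N} \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k) = \mathbf{0} \quad\Longrightarrow\quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N}\gamma(z_{nk}).$$

This is a closed-form update only if the responsibilities are treated as fixed, which is exactly what the M-step does.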

The update for $\pi_k$ uses Lagrange multipliers (Lecture 9.3) to enforce $\sum_k \pi_k = 1$. Setting up the Lagrangian $\mathcal{L} = \ln p(\mathbf{X}) + \lambda(\sum_k \pi_k - 1)$, differentiating w.r.t. $\pi_k$, and using $\sum_k \gamma(z_{nk}) = 1$ to solve for $\lambda = -N$ yields $\pi_k = N_k / N$.
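Spelled out, the intermediate steps might read as follows. Differentiating the Lagrangian with respect to $\pi_k$ and setting the result to zero gives

$$\sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} + \lambda = 0.$$

Multiplying both sides by $\pi_k$ turns each summand into a responsibility:

$$\sum_{n=1}^{N} \gamma(z_{nk}) + \lambda\,\pi_k = N_k + \lambda\,\pi_k = 0.$$

Summing over $k$ and using $\sum_k \gamma(z_{nk}) = 1$ together with $\sum_k \pi_k = 1$ gives $N + \lambda = 0$, so $\lambda = -N$. Substituting back,

$$\pi_k = \frac{N_k}{N}.$$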

Soft vs. Hard: K-means and GMM Compared

K-means assigns each point to exactly one cluster, so its responsibilities are 0 or 1 (a step function of the distance to the nearest centroid). A GMM assigns fractional responsibilities that sum to 1 over the clusters: a point deep inside one of two well-separated clusters gets a responsibility near 1 for that cluster and near 0 for the other, while a point near the boundary between them gets substantial probability for each. A GMM also learns a per-cluster covariance matrix, allowing elongated or rotated ellipsoidal clusters that K-means cannot capture.
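A small 1-D numerical comparison may make the contrast concrete. The points, centers, variances, and weights below are made-up values chosen for illustration, and the variable names are arbitrary.

```python
import numpy as np

x = np.array([0.0, 1.9, 2.1, 4.0])   # 1-D points (illustrative values)
mu = np.array([0.0, 4.0])            # two cluster centers
sigma2 = np.array([1.0, 1.0])        # component variances
pi = np.array([0.5, 0.5])            # mixing coefficients

# Squared distance from every point to every center, shape (N, K).
dist2 = (x[:, None] - mu[None, :]) ** 2

# K-means style hard assignment: 1 for the nearest center, 0 elsewhere.
hard = np.zeros_like(dist2)
hard[np.arange(len(x)), dist2.argmin(axis=1)] = 1.0

# GMM soft assignment: posterior over clusters, normalized per point.
log_w = np.log(pi) - 0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * dist2 / sigma2
log_w -= log_w.max(axis=1, keepdims=True)
soft = np.exp(log_w)
soft /= soft.sum(axis=1, keepdims=True)
```

The points 1.9 and 2.1 lie on either side of the boundary at 2.0, so `hard` flips from cluster 1 to cluster 2 between them, while `soft` changes only from about 0.60/0.40 to 0.40/0.60. The points at 0.0 and 4.0 get responsibilities above 0.999 for their own cluster under both schemes.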

6. GMM vs. K-means: Advantages and Limitations

Property	K-means	GMM
Assignments	Hard (0/1)	Soft (probabilities)
Cluster shape	Spherical (Euclidean)	Ellipsoidal ($\boldsymbol{\Sigma}_k$ per cluster)
Cluster sizes	Approximately equal	Variable (via $\pi_k$)
Convergence speed	Fast	Slower (more parameters, more iterations, and more computation per iteration)
Initialization	Random	Often K-means then EM
Optimum reached	Local minimum of $J$	Local maximum of log-likelihood