Lecture 9.2
K-Means Clustering
K-means clustering is a simple, widely used algorithm for partitioning unlabelled data into $K$ groups. It alternates between assigning points to the nearest centroid and recomputing the centroids, and can be viewed as a non-probabilistic instance of the EM pattern.
- Write the K-means objective $J$ in terms of cluster assignments $z_{nk}$ and cluster means $\boldsymbol{\mu}_k$.
- Derive the E-step (assignment) and M-step (centroid update) and show that neither step can increase $J$.
- Derive the M-step solution by differentiating $J$ w.r.t. $\boldsymbol{\mu}_k$.
- State the convergence guarantee and its limitation (local minimum).
- List the structural limitations of K-means: spherical clusters, equal-size bias, scale sensitivity, fixed $K$.
1. The K-Means Objective
Given $N$ unlabelled observations $\{\mathbf{x}_n\}$ and a target number of clusters $K$, we encode the (unknown) cluster membership of point $n$ as a one-hot vector $\mathbf{z}_n = (z_{n1}, \dots, z_{nK})^\top$ where $z_{nk} \in \{0,1\}$ and $\sum_k z_{nk} = 1$.
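The K-means objective is $$J = \sum_{n=1}^N \sum_{k=1}^K z_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2,$$ the total squared distance from each point to its assigned cluster centroid $\boldsymbol{\mu}_k$. We minimize $J$ jointly over the assignments $z_{nk}$ and centroids $\boldsymbol{\mu}_k$.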
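As a quick illustration of the objective, the sketch below evaluates the sum of squared distances $J = \sum_n \sum_k z_{nk}\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$ with NumPy. The function name `kmeans_objective` and the toy data are made up for this example; they are not part of the lecture.

```python
import numpy as np

def kmeans_objective(X, Z, mu):
    """J = sum_n sum_k z_nk * ||x_n - mu_k||^2.

    X: (N, D) data, Z: (N, K) one-hot assignments, mu: (K, D) centroids.
    """
    # Squared distance from every point to every centroid, shape (N, K).
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # The one-hot mask keeps only the distance to the assigned centroid.
    return float((Z * sq_dist).sum())

# Toy data (hypothetical): four 2-D points, two clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
mu = np.array([[0.0, 1.0], [10.0, 1.0]])

J = kmeans_objective(X, Z, mu)
print(J)  # every point lies 1 unit from its centroid, so J = 4.0
```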
2. The E-Step and M-Step
Finding the global minimum of $J$ over both $z_{nk}$ and $\boldsymbol{\mu}_k$ is NP-hard in general, and the joint problem is non-convex, but alternating optimization gives an efficient algorithm for reaching a local minimum.
- Initialize cluster centroids $\boldsymbol{\mu}_k$ (e.g., randomly from data points).
- E-step (assignment): assign each point to the nearest centroid, $$z_{nk} = \begin{cases}1 & k = \arg\min_{k'}\|\mathbf{x}_n - \boldsymbol{\mu}_{k'}\|^2 \\ 0 & \text{otherwise.}\end{cases}$$ This minimizes $J$ for fixed $\boldsymbol{\mu}_k$: $J$ is a sum of one term per point, and each term is smallest when the point is assigned to its nearest centroid.
- M-step (centroid update): recompute each centroid as the mean of its assigned points, $$\boldsymbol{\mu}_k = \frac{\displaystyle\sum_{n=1}^N z_{nk}\,\mathbf{x}_n}{\displaystyle\sum_{n=1}^N z_{nk}} = \frac{1}{N_k}\sum_{n \in C_k}\mathbf{x}_n,$$ where $C_k = \{n : z_{nk} = 1\}$ is the set of points assigned to cluster $k$ and $N_k = |C_k|$ is its size (assumed non-zero). This minimizes $J$ for fixed assignments (derived in Section 3).
- Repeat the E-step and M-step until assignments no longer change.
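The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `max_iter` cap, the seed handling, and the rule that an empty cluster keeps its previous centroid are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels, J)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialization: pick K distinct data points as starting centroids.
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    labels = np.full(N, -1)
    for _ in range(max_iter):
        # E-step: assign each point to its nearest centroid.
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the centroids would not move either
        labels = new_labels
        # M-step: move each centroid to the mean of its assigned points.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                mu[k] = members.mean(axis=0)
    J = float(((X - mu[labels]) ** 2).sum())
    return mu, labels, J

# Synthetic data (made up for illustration): two Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])
mu, labels, J = kmeans(X, K=2)
print(mu, J)
```

The integer vector `labels` plays the role of the one-hot $z_{nk}$: `labels[n] == k` corresponds to $z_{nk} = 1$.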
3. Deriving the M-Step
For fixed assignments $z_{nk}$, $J$ is a sum of independent quadratic terms in each $\boldsymbol{\mu}_k$. Taking the gradient and setting it to zero:
$$\frac{\partial J}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^N z_{nk}\,2(\boldsymbol{\mu}_k - \mathbf{x}_n)^\top = \mathbf{0}.$$
Solving for $\boldsymbol{\mu}_k$ yields the sample mean over cluster $k$, confirming the M-step formula above.
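For completeness, the algebra between the stationarity condition and the sample mean can be written out as follows. This is a short worked step using only the symbols already defined, with $N_k = \sum_n z_{nk}$ and assuming the cluster is non-empty ($N_k > 0$):

$$\sum_{n=1}^N z_{nk}\,(\boldsymbol{\mu}_k - \mathbf{x}_n) = \mathbf{0} \;\Longrightarrow\; \boldsymbol{\mu}_k \sum_{n=1}^N z_{nk} = \sum_{n=1}^N z_{nk}\,\mathbf{x}_n \;\Longrightarrow\; \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N z_{nk}\,\mathbf{x}_n.$$

The second derivative of $J$ with respect to $\boldsymbol{\mu}_k$ is $2N_k\mathbf{I}$, which is positive definite for $N_k > 0$, so this stationary point is a minimum and not a maximum or saddle point.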
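Each step (E and M) either reduces $J$ or leaves it unchanged. Since $J \ge 0$ and there are only finitely many possible assignments (at most $K^N$), the algorithm must stop after a finite number of iterations. However, because $J$ is non-convex in $(\{\boldsymbol{\mu}_k\}, \{z_{nk}\})$ jointly, the result is only guaranteed to be a local minimum, and which one is reached depends on the initialization. To mitigate this, run K-means multiple times from different random initializations and keep the solution with the lowest $J$.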
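The restart strategy can be sketched as below. The helper names `kmeans` and `kmeans_restarts`, the use of consecutive integer seeds, and the four-point toy set are assumptions for illustration. The toy set was chosen because it has two fixed points: a poor one with $J = 100$ (bottom row vs. top row) and the global minimum with $J = 1$ (left pair vs. right pair).

```python
import numpy as np

def kmeans(X, K, seed, max_iter=100):
    """Compact K-means sketch: returns (centroids, labels, J)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)          # E-step
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(K):                           # M-step
            if (labels == k).any():
                mu[k] = X[labels == k].mean(axis=0)
    J = float(((X - mu[labels]) ** 2).sum())
    return mu, labels, J

def kmeans_restarts(X, K, n_restarts=10):
    """Run K-means from several seeds and keep the run with the lowest J."""
    runs = [kmeans(X, K, seed=s) for s in range(n_restarts)]
    best = min(runs, key=lambda run: run[2])
    return best, [run[2] for run in runs]

# Toy data (hypothetical): corners of a wide, short rectangle.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
(best_mu, best_labels, best_J), all_J = kmeans_restarts(X, K=2)
print(all_J)   # J reached by each individual run
print(best_J)  # lowest J over all restarts
```

Comparing `all_J` with `best_J` shows how much individual runs can differ depending on where the centroids start.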
4. Application: Image Compression
Treat each pixel's RGB triple as a 3-dimensional data point $\mathbf{x}_n \in \mathbb{R}^3$. K-means clusters all pixel colors into $K$ representative colors (centroids). Each pixel is then stored as its cluster index (an integer) rather than its full RGB triple. For an $H \times W$ image, storage drops from $3HW$ values to $HW$ integers plus $3K$ centroid values, a significant saving when $K \ll HW$. The saving is larger still in bits, because each index needs only $\lceil \log_2 K \rceil$ bits.
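The storage arithmetic can be made concrete with a small calculation. The sketch assumes 8 bits per color channel and a fixed-length code of $\lceil \log_2 K \rceil$ bits per pixel index; neither assumption is stated in the lecture, and the function name `storage_bits` is hypothetical.

```python
def storage_bits(H, W, K, bits_per_channel=8):
    """Return (raw_bits, compressed_bits) for an H x W RGB image quantized to K colors."""
    # Uncompressed: three channels per pixel.
    raw = 3 * bits_per_channel * H * W
    # Bits needed for a fixed-length code over K indices (at least 1 bit).
    index_bits = max(1, (K - 1).bit_length())
    # Compressed: one index per pixel plus the table of K RGB centroids.
    compressed = H * W * index_bits + 3 * bits_per_channel * K
    return raw, compressed

raw, compressed = storage_bits(512, 512, 16)
print(raw, compressed, compressed / raw)
# raw = 6291456 bits, compressed = 1048960 bits, about one sixth of the original
```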
5. Limitations of K-Means
- Spherical clusters only. The Euclidean distance metric means all points equidistant from a centroid form a sphere. K-means cannot capture elongated or non-convex cluster shapes (e.g., two interlocking crescents).
- Equal-size bias. Voronoi regions produced by K-means tend toward equal volume, so clusters of very different sizes are split or merged incorrectly.
- Scale sensitivity. Features with large numerical range dominate the Euclidean distance. Pre-processing (centering, whitening via PCA — Lecture 10.1) is often necessary.
- $K$ must be specified. The number of clusters is a hyperparameter chosen before running the algorithm. Various heuristics (elbow method, silhouette score) exist but are not fool-proof.
- Hard assignments. Each point belongs to exactly one cluster with certainty. This is problematic near cluster boundaries. Gaussian Mixture Models (Lecture 9.4) address this with soft assignments.
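To make the scale-sensitivity point above concrete, the sketch below measures how much each feature contributes to a squared Euclidean distance before and after rescaling. It uses per-feature z-score standardization, a simpler substitute for the PCA whitening mentioned in the list; the data and the helper `feature_share` are made up for this example.

```python
import numpy as np

# Made-up data: feature 0 is in the thousands, feature 1 is in single digits.
X = np.array([[1000.0, 1.0], [2000.0, 2.0], [1500.0, 9.0], [3000.0, 8.0]])

def feature_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Raw features: the large-range feature accounts for almost all of the distance.
raw_share = feature_share(X[0], X[2])

# Z-score standardization: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_share = feature_share(X_std[0], X_std[2])

print(raw_share)  # feature 0 dominates
print(std_share)  # feature 1 now carries most of the distance for this pair
```

Because the E-step depends only on these distances, the same rescaling changes which centroid a point is assigned to.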