Lecture 9.2

K-Means Clustering

K-means clustering is a simple, widely used algorithm for partitioning unlabelled data into $K$ groups. It alternates between assigning points to the nearest centroid and recomputing centroids, which makes it a non-probabilistic instance of the EM pattern.

Learning Objectives

Write the K-means objective $J$ in terms of cluster assignments $z_{nk}$ and cluster means $\boldsymbol{\mu}_k$.
Derive the E-step (assignment) and M-step (centroid update) and show that neither step increases $J$.
Derive the M-step solution by differentiating $J$ w.r.t. $\boldsymbol{\mu}_k$.
State the convergence guarantee and its limitation (local minimum).
List the structural limitations of K-means: spherical clusters, equal-size bias, scale sensitivity, fixed $K$.

1. The K-Means Objective

Given $N$ unlabelled observations $\{\mathbf{x}_n\}$ and a target number of clusters $K$, we encode the (unknown) cluster membership of point $n$ as a one-hot vector $\mathbf{z}_n = (z_{n1}, \dots, z_{nK})^\top$ where $z_{nk} \in \{0,1\}$ and $\sum_k z_{nk} = 1$.

K-Means Objective $$J(\{\boldsymbol{\mu}_k\}, \{z_{nk}\}) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2.$$

$J$ is the total squared distance from each point to its assigned cluster centroid $\boldsymbol{\mu}_k$. We minimize $J$ jointly over the assignments $z_{nk}$ and centroids $\boldsymbol{\mu}_k$.
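To make the objective concrete, the following sketch evaluates $J$ directly from the definition. The function name `kmeans_objective` and the four-point toy data set are made up for illustration; only the formula itself comes from the notes.

```python
import numpy as np

def kmeans_objective(X, mu, z):
    """J = sum_n sum_k z_nk * ||x_n - mu_k||^2.

    X:  (N, D) data points
    mu: (K, D) centroids
    z:  (N, K) one-hot assignment matrix
    """
    # Squared distance from every point to every centroid, shape (N, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # The one-hot mask keeps only the distance to the assigned centroid.
    return float((z * sq_dists).sum())

# Toy data (hypothetical): four 2-D points, K = 2.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
mu = np.array([[0.0, 1.0], [10.0, 1.0]])
z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Every point lies at squared distance 1 from its centroid, so J = 4.0.
print(kmeans_objective(X, mu, z))
```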

2. The E-Step and M-Step

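Finding the global minimum of $J$ over both $z_{nk}$ and $\boldsymbol{\mu}_k$ is NP-hard in general: the assignments are discrete and the joint problem is non-convex. Alternating optimization, which minimizes over one set of variables while holding the other fixed, gives an efficient algorithm for finding a local solution.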

K-Means Algorithm (Lloyd's Algorithm)

Initialize cluster centroids $\boldsymbol{\mu}_k$ (e.g., randomly from data points).
E-step (assignment): assign each point to the nearest centroid, $$z_{nk} = \begin{cases}1 & k = \arg\min_{k'}\|\mathbf{x}_n - \boldsymbol{\mu}_{k'}\|^2 \\ 0 & \text{otherwise.}\end{cases}$$ Because $J$ is a sum over points and each point contributes only the squared distance to its assigned centroid, this choice minimizes $J$ for fixed $\boldsymbol{\mu}_k$. Ties can be broken arbitrarily, for example by taking the smallest index.
M-step (centroid update): recompute each centroid as the mean of its assigned points, $$\boldsymbol{\mu}_k = \frac{\displaystyle\sum_{n=1}^N z_{nk}\,\mathbf{x}_n}{\displaystyle\sum_{n=1}^N z_{nk}} = \frac{1}{N_k}\sum_{n \in C_k}\mathbf{x}_n.$$ Here $C_k = \{n : z_{nk} = 1\}$ is the set of points assigned to cluster $k$ and $N_k = |C_k|$. This minimizes $J$ for fixed assignments (shown below).
Repeat the E-step and M-step until assignments no longer change.
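The loop above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `max_iters` cap, the `seed` argument, and the choice to leave an empty cluster's centroid where it was are all assumptions made here, not part of the notes.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Alternate E-step and M-step until assignments stop changing.

    Returns (mu, labels, J), where labels[n] is the index k with z_nk = 1.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialization: K distinct data points chosen at random.
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    labels = np.full(N, -1)

    for _ in range(max_iters):
        # E-step: assign each point to its nearest centroid.
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the centroids would not move either
        labels = new_labels

        # M-step: move each centroid to the mean of its assigned points.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                mu[k] = members.mean(axis=0)

    J = float(((X - mu[labels]) ** 2).sum())
    return mu, labels, J

# Usage on synthetic data (hypothetical): two Gaussian blobs in 2-D.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 0.5, size=(50, 2)),
               data_rng.normal(5.0, 0.5, size=(50, 2))])
mu, labels, J = kmeans(X, K=2)
print(mu, J)
```

Storing assignments as an integer label per point instead of a one-hot matrix is only a convenience; `labels[n] == k` is equivalent to $z_{nk} = 1$.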

3. Deriving the M-Step

For fixed assignments $z_{nk}$, $J$ is a sum of independent quadratic terms in each $\boldsymbol{\mu}_k$. Taking the gradient and setting it to zero:

$$\frac{\partial J}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^N z_{nk}\,2(\boldsymbol{\mu}_k - \mathbf{x}_n)^\top = \mathbf{0}.$$

Solving for $\boldsymbol{\mu}_k$ yields the sample mean over cluster $k$, confirming the M-step formula above.
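
Spelling out the algebra behind that last step, using only the symbols already defined (with $N_k = \sum_n z_{nk}$ the number of points in cluster $k$):

$$\sum_{n=1}^N z_{nk}\,(\boldsymbol{\mu}_k - \mathbf{x}_n) = \mathbf{0} \;\Longrightarrow\; \boldsymbol{\mu}_k \sum_{n=1}^N z_{nk} = \sum_{n=1}^N z_{nk}\,\mathbf{x}_n \;\Longrightarrow\; \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N z_{nk}\,\mathbf{x}_n.$$

The stationary point is a minimum because the Hessian of $J$ with respect to $\boldsymbol{\mu}_k$ is $2N_k\mathbf{I}$, which is positive definite whenever cluster $k$ is non-empty.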

Convergence to a Local Minimum

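Each step (E and M) either decreases $J$ or leaves it unchanged. Since $J \ge 0$ and there are only finitely many (at most $K^N$) possible assignments, the algorithm terminates after a finite number of iterations. However, because $J$ is non-convex in $(\{\boldsymbol{\mu}_k\}, \{z_{nk}\})$ jointly, the result is in general only a local minimum, and which one is reached depends on the initialization. To mitigate this, run K-means multiple times from different random initializations and keep the solution with the lowest $J$.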
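
The effect of initialization can be seen on a tiny example. The sketch below is illustrative only: the helper names `lloyd` and `best_of_restarts`, the four-point data set, and the default of ten restarts are assumptions made here. On this data, starting both centroids on the same side leads to a poor fixed point, while starting one on each side reaches the better solution.

```python
import numpy as np

def lloyd(X, mu0, max_iters=100):
    """Run the E-step/M-step loop from initial centroids mu0; return (mu, labels, J)."""
    mu = np.array(mu0, dtype=float)
    labels = None
    for _ in range(max_iters):
        # E-step: nearest centroid for every point.
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # M-step: mean of each non-empty cluster.
        for k in range(len(mu)):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return mu, labels, float(((X - mu[labels]) ** 2).sum())

def best_of_restarts(X, K, n_restarts=10, seed=0):
    """Run lloyd from several random initializations and keep the lowest-J result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        mu0 = X[rng.choice(len(X), size=K, replace=False)]
        result = lloyd(X, mu0)
        if best is None or result[2] < best[2]:
            best = result
    return best

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

_, _, J_bad = lloyd(X, [[0.0, 0.0], [0.0, 2.0]])     # both centroids start on the left
_, _, J_good = lloyd(X, [[0.0, 0.0], [10.0, 0.0]])   # one centroid starts on each side
print(J_bad, J_good)  # 100.0 4.0

mu, labels, J = best_of_restarts(X, K=2)
print(J)
```

In the poor run the clusters end up as the top pair and the bottom pair of points, with centroids at $(5, 0)$ and $(5, 2)$; neither step can improve on this, even though the left/right split has a far lower $J$.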

4. Application: Image Compression

Treat each pixel's RGB triple as a 3-dimensional data point $\mathbf{x}_n \in \mathbb{R}^3$. K-means clusters all pixel colors into $K$ representative colors (centroids). Each pixel is then stored as its cluster index (an integer) rather than its full RGB triple. For an $H \times W$ image, storage drops from $3HW$ values to $HW$ integers plus $3K$ centroid values. The saving is significant when $K \ll HW$, so that the centroid table is negligible, and when each index needs far fewer bits than an RGB triple (an index takes $\lceil \log_2 K \rceil$ bits).
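
The storage count can be worked through numerically. The sketch below assumes 8 bits per color channel and indices packed into $\lceil \log_2 K \rceil$ bits each; neither assumption is stated in the notes, and the function name `storage_bits` is hypothetical.

```python
def storage_bits(H, W, K, bits_per_channel=8):
    """Return (raw_bits, compressed_bits) for an H x W RGB image quantized to K colors."""
    raw = 3 * bits_per_channel * H * W
    # (K - 1).bit_length() equals ceil(log2(K)) for K >= 2; use at least 1 bit.
    index_bits = max(1, (K - 1).bit_length())
    # One index per pixel plus a table of K centroid colors.
    compressed = H * W * index_bits + 3 * bits_per_channel * K
    return raw, compressed

raw, compressed = storage_bits(512, 512, 16)
print(raw, compressed)  # 6291456 1048960
print(raw / compressed)  # roughly a 6x reduction
```

For a 512 by 512 image with $K = 16$, each pixel needs 4 bits instead of 24, and the 384-bit centroid table is negligible next to the index map.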

5. Limitations of K-Means

Spherical clusters only. Euclidean distance treats all directions equally, so each cluster is implicitly modelled as a sphere around its centroid. K-means cannot capture elongated or non-convex cluster shapes (e.g., two interlocking crescents).
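Equal-size bias. The boundary between two clusters is the perpendicular bisector of the segment joining their centroids, regardless of how many points each cluster contains or how spread out it is. Clusters of very different sizes therefore tend to be split or merged incorrectly.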
Scale sensitivity. Features with large numerical range dominate the Euclidean distance. Pre-processing (standardizing each feature, or whitening via PCA, covered in Lecture 10.1) is often necessary.
$K$ must be specified. The number of clusters is a hyperparameter chosen before running the algorithm. Various heuristics (elbow method, silhouette score) exist but are not fool-proof.
Hard assignments. Each point belongs to exactly one cluster with certainty. This is problematic near cluster boundaries. Gaussian Mixture Models (Lecture 9.4) address this with soft assignments.
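
The scale-sensitivity point can be illustrated with a small calculation. The data below are hypothetical (income in dollars and age in years), and the helper names `standardize` and `feature_shares` are made up for this sketch; standardizing to unit variance is one common pre-processing choice, not the only one.

```python
import numpy as np

# Hypothetical data: column 0 is income in dollars, column 1 is age in years.
X = np.array([[30000.0, 25.0],
              [31000.0, 60.0],
              [60000.0, 26.0],
              [61000.0, 61.0]])

def standardize(X):
    """Rescale each feature to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def feature_shares(a, b):
    """Fraction of the squared Euclidean distance between a and b due to each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Points 0 and 1 have similar incomes but very different ages.
raw_shares = feature_shares(X[0], X[1])
Xs = standardize(X)
std_shares = feature_shares(Xs[0], Xs[1])

print(raw_shares)  # income accounts for almost all of the raw squared distance
print(std_shares)  # after standardizing, age accounts for almost all of it
```

On the raw data the 35-year age gap is nearly invisible next to a 1000-dollar income gap, so K-means would effectively cluster on income alone.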