Lecture 9.1
Unsupervised Learning: Latent Variable Models
Unsupervised learning drops the requirement of labelled targets and instead seeks the hidden structure that explains how observations were generated. The key concept is the latent variable — an unobserved quantity that influences what we measure.
- Contrast supervised and unsupervised learning in terms of goals and available data.
- Name three motivations for unsupervised learning: density estimation, clustering, and dimensionality reduction.
- Define a latent variable and explain how it factorizes the joint distribution $p(\mathbf{x}, z)$.
- Distinguish discrete latent variables (clustering) from continuous ones (dimensionality reduction).
1. From Supervised to Unsupervised Learning
In supervised learning we work with input-output pairs $\{(\mathbf{x}_n, t_n)\}$. Unsupervised learning discards the targets and works only with observations $\{\mathbf{x}_n\}$. The goal is to discover structure in the data rather than to predict a given target.
Three core tasks arise in unsupervised learning:
- Density estimation. Recover the probability distribution $p(\mathbf{x})$ that generated the data. Useful for outlier detection (low-probability inputs are anomalies) and for generating new synthetic samples.
- Clustering. Infer discrete hidden class labels that explain why data points group together.
- Dimensionality reduction. Find a compact lower-dimensional representation that captures the intrinsic structure of high-dimensional data.
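As a minimal sketch of the density-estimation task, the following fits a single multivariate Gaussian to unlabelled data, flags low-density inputs as outliers, and draws new synthetic samples. The single-Gaussian model, the synthetic data, the 1st-percentile threshold, and the helper names `fit_gaussian` and `log_density` are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance of the rows of X."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)  # bias=True gives the ML estimate
    return mu, cov

def log_density(X, mu, cov):
    """Log of the multivariate Gaussian density at each row of X."""
    d = mu.shape[0]
    diff = X - mu
    # Squared Mahalanobis distance, using solve() rather than an explicit inverse.
    maha = np.einsum("ni,ni->n", diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Synthetic unlabelled observations (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))

# Density estimation: recover an approximation to p(x).
mu, cov = fit_gaussian(X)

# Outlier detection: call a point anomalous if its log-density falls below
# the 1st percentile of the training log-densities (an arbitrary choice).
threshold = np.percentile(log_density(X, mu, cov), 1)
x_typical = np.array([[0.1, -0.2]])
x_far = np.array([[6.0, 6.0]])
is_outlier_typical = bool(log_density(x_typical, mu, cov)[0] < threshold)
is_outlier_far = bool(log_density(x_far, mu, cov)[0] < threshold)

# Generation: draw new synthetic samples from the fitted density.
new_samples = rng.multivariate_normal(mu, cov, size=5)
```

A single Gaussian is rarely flexible enough for real data; the latent variable models introduced next give richer densities by the same recipe.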
2. Latent Variable Models
A latent variable $z$ is an unobserved quantity that influences how observations $\mathbf{x}$ are generated. The generative model is defined by:
- $p(z)$: a prior over the latent variable.
- $p(\mathbf{x} \mid z)$: the class-conditional (or latent-conditional) density of $\mathbf{x}$ given $z$.
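Together these two factors define the joint distribution $p(\mathbf{x}, z) = p(\mathbf{x} \mid z)\, p(z)$. The marginal density of the observation is obtained by marginalizing out the latent variable: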
$$p(\mathbf{x}) = \int p(\mathbf{x} \mid z)\, p(z)\, dz \quad \text{(continuous } z\text{)}, \qquad p(\mathbf{x}) = \sum_z p(\mathbf{x} \mid z)\, p(z) \quad \text{(discrete } z\text{)}.$$
Given an observation $\mathbf{x}$, Bayes' theorem gives the posterior over the latent variable:
$$p(z \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid z)\, p(z)}{p(\mathbf{x})}.$$
3. Discrete Latent Variables: Clustering
When $z$ takes values in a finite set $\{1, \dots, K\}$, each value corresponds to a cluster. The prior $p(z = k)$ gives the base rate of cluster $k$, and $p(\mathbf{x} \mid z = k)$ describes the distribution of observations within cluster $k$.
Suppose we measure height and weight of animals, observing two natural clusters. The latent variable $z \in \{\text{cat}, \text{dog}\}$ is never observed, but we can assume it exists and model $p(\mathbf{x} \mid \text{cat})$ and $p(\mathbf{x} \mid \text{dog})$ separately — each as a Gaussian with its own mean. Marginalizing over $z$ then gives $p(\mathbf{x})$. Given a new animal measurement $\mathbf{x}'$, the posterior $p(z \mid \mathbf{x}')$ tells us which species it likely belongs to.
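The cat/dog example can be sketched numerically as follows. The code evaluates the marginal $p(\mathbf{x}) = \sum_z p(\mathbf{x} \mid z)\, p(z)$ and the posterior $p(z \mid \mathbf{x})$ for a two-cluster Gaussian model, and samples from the model by first drawing $z$ and then $\mathbf{x}$. All numbers (priors, means, covariances, the query point) are made up for illustration, and here the parameters are simply assumed known rather than learned from data.

```python
import numpy as np

# Hypothetical model parameters; x = (height in cm, weight in kg).
prior = {"cat": 0.5, "dog": 0.5}                                # p(z)
mean = {"cat": np.array([25.0, 4.0]), "dog": np.array([55.0, 25.0])}
cov = {"cat": np.diag([9.0, 1.0]), "dog": np.diag([100.0, 64.0])}

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density p(x | z) for one cluster."""
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

def marginal(x):
    """p(x) = sum_z p(x | z) p(z)."""
    return sum(prior[k] * gaussian_pdf(x, mean[k], cov[k]) for k in prior)

def posterior(x):
    """p(z | x) = p(x | z) p(z) / p(x), by Bayes' theorem."""
    joint = {k: prior[k] * gaussian_pdf(x, mean[k], cov[k]) for k in prior}
    total = sum(joint.values())  # equals marginal(x)
    return {k: v / total for k, v in joint.items()}

def sample(rng, n):
    """Draw n points from the model: first z ~ p(z), then x ~ p(x | z)."""
    names = list(prior)
    idx = rng.choice(len(names), size=n, p=[prior[k] for k in names])
    labels = [names[i] for i in idx]
    xs = np.stack([rng.multivariate_normal(mean[k], cov[k]) for k in labels])
    return labels, xs

x_new = np.array([27.0, 5.0])      # a new, unlabelled measurement
post = posterior(x_new)            # soft assignment over {cat, dog}
labels, xs = sample(np.random.default_rng(0), 10)
```

With these assumed parameters the query point lies close to the cat mean, so nearly all posterior mass falls on `cat`. Discarding `labels` and keeping only `xs` reproduces the unsupervised setting, in which $z$ is never observed.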
4. Continuous Latent Variables: Dimensionality Reduction
When $z$ is continuous and low-dimensional ($z \in \mathbb{R}^M$ with $M \ll d$, where $d$ is the dimensionality of the observation $\mathbf{x}$), the latent variable model describes a lower-dimensional manifold embedded in the high-dimensional observation space.
A $100 \times 100$ image has $10{,}000$ pixel dimensions, yet a database of images of one tree, each rotated and translated, is governed by only 3 parameters (two translations, one rotation). The data therefore lie on a 3-dimensional manifold, and almost all of the $10{,}000$-dimensional pixel space is never visited. PCA and its generalizations (Lectures 10.1–10.4) recover this low-dimensional structure.
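The idea of a few continuous latent variables driving many observed dimensions can be illustrated with a small sketch. Note the simplifying assumption: the manifold traced out by rotating and translating an image is nonlinear, whereas the code below uses a linear map `W` from latent to observed space plus small Gaussian noise, which is the setting where plain PCA applies. The sizes `d`, `M`, `N` and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, N = 50, 3, 2000                  # observed dim, latent dim, sample count

# Generative model (assumed linear): z ~ N(0, I), x | z ~ N(W z, 0.05^2 I).
W = rng.normal(size=(d, M))
Z = rng.normal(size=(N, M))
X = Z @ W.T + 0.05 * rng.normal(size=(N, d))

# Eigen-decomposition of the sample covariance of the observations.
x_mean = X.mean(axis=0)
Xc = X - x_mean
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / N)
order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance captured by the top M directions.
explained = eigvals[:M].sum() / eigvals.sum()

# Project onto the top M directions and map back to observation space.
Z_hat = Xc @ eigvecs[:, :M]            # N x M low-dimensional representation
X_rec = Z_hat @ eigvecs[:, :M].T + x_mean
rel_error = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
```

Because only `M = 3` latent coordinates generate the data, the first three eigenvalues dominate and the remaining 47 sit at the noise level, so the 3-dimensional projection reconstructs the 50-dimensional observations almost exactly.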
Lectures 9.2–9.4 focus on discrete latent variables: K-means (hard assignments) and Gaussian Mixture Models (soft, probabilistic assignments). Lectures 10.1–10.4 focus on continuous latent variables: Principal Component Analysis in its maximum-variance, minimum-reconstruction-error, probabilistic, and nonlinear forms.