Lecture 10.3

Probabilistic PCA

Probabilistic PCA (PPCA) casts dimensionality reduction as a Gaussian latent variable model. Its maximum-likelihood solution recovers the same principal subspace as classical PCA, but now within a generative framework that supports uncertainty quantification, missing-data imputation, and Bayesian model selection of the latent dimension $M$.

Learning Objectives
  • Define the PPCA generative model: prior $p(\mathbf{z})$, linear forward model, noise $\boldsymbol{\varepsilon}$, and conditional $p(\mathbf{x} \mid \mathbf{z})$.
  • Derive the mean and covariance of the marginal $p(\mathbf{x})$ and identify $\mathbf{C} = \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}$.
  • State the MLE solutions for $\boldsymbol{\mu}$, $\mathbf{W}$, and $\sigma^2$.
  • Explain how $\sigma^2$ captures average discarded variance.
  • Discuss advantages of the probabilistic framing over classical PCA.

1. The Generative Model

PPCA posits that each observation $\mathbf{x}_n \in \mathbb{R}^d$ was produced by its own latent variable $\mathbf{z}_n \in \mathbb{R}^M$ (with $M < d$) through a linear map plus isotropic Gaussian noise. Dropping the index $n$, the model is:

PPCA Generative Model

Latent prior: $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I}_M)$.

Observation model: $\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\varepsilon}$, where $\mathbf{W} \in \mathbb{R}^{d \times M}$ is the factor loading matrix, $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean (offset) vector, and $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_d)$ is isotropic noise drawn independently of $\mathbf{z}$.

Conditional: $p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x} \mid \mathbf{W}\mathbf{z} + \boldsymbol{\mu},\; \sigma^2\mathbf{I}_d)$.
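The three definitions above translate directly into ancestral sampling: draw $\mathbf{z}$ from the prior, then draw $\mathbf{x}$ from the conditional. The following is a minimal NumPy sketch; the function name `sample_ppca` and the particular values of `W`, `mu`, and `sigma2` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sample_ppca(W, mu, sigma2, n_samples, rng):
    """Draw samples x = W z + mu + eps from the PPCA generative model."""
    d, M = W.shape
    # Latent prior: z_n ~ N(0, I_M), one row per sample.
    Z = rng.standard_normal((n_samples, M))
    # Isotropic noise: eps_n ~ N(0, sigma2 * I_d).
    eps = np.sqrt(sigma2) * rng.standard_normal((n_samples, d))
    # Observation model applied row-wise.
    X = Z @ W.T + mu + eps
    return X, Z

rng = np.random.default_rng(0)
W = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # illustrative loadings, d = 3, M = 2
mu = np.array([1.0, -1.0, 0.5])     # illustrative offset
X, Z = sample_ppca(W, mu, sigma2=0.1, n_samples=5, rng=rng)
print(X.shape, Z.shape)  # (5, 3) (5, 2)
```

Storing samples as rows means the map $\mathbf{W}\mathbf{z}$ becomes `Z @ W.T`.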

2. The Marginal Distribution $p(\mathbf{x})$

Marginalizing out $\mathbf{z}$ (using standard Gaussian calculus), we obtain:

PPCA Marginal $$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu},\; \mathbf{C}), \quad \mathbf{C} = \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}_d.$$

Derivation of mean: $\mathbb{E}[\mathbf{x}] = \mathbf{W}\,\mathbb{E}[\mathbf{z}] + \boldsymbol{\mu} + \mathbb{E}[\boldsymbol{\varepsilon}] = \boldsymbol{\mu}$ (since $\mathbf{z}$ and $\boldsymbol{\varepsilon}$ are zero-mean).

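Derivation of covariance: Since $\mathbf{x} - \boldsymbol{\mu} = \mathbf{W}\mathbf{z} + \boldsymbol{\varepsilon}$, and $\mathbf{z}$ and $\boldsymbol{\varepsilon}$ are independent and zero-mean, the cross terms $\mathbf{W}\,\mathbb{E}[\mathbf{z}\boldsymbol{\varepsilon}^\top]$ and its transpose vanish. Hence $\operatorname{Cov}[\mathbf{x}] = \mathbb{E}[(\mathbf{W}\mathbf{z} + \boldsymbol{\varepsilon})(\mathbf{W}\mathbf{z} + \boldsymbol{\varepsilon})^\top] = \mathbf{W}\,\mathbb{E}[\mathbf{z}\mathbf{z}^\top]\mathbf{W}^\top + \mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\top] = \mathbf{W}\mathbf{I}_M\mathbf{W}^\top + \sigma^2\mathbf{I}_d = \mathbf{C}$.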

Intuitively, $\mathbf{C}$ decomposes the observed covariance into a low-rank signal component $\mathbf{W}\mathbf{W}^\top$ and an isotropic noise component $\sigma^2\mathbf{I}$.
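The marginal mean and covariance can be sanity-checked by simulation: sample many points from the generative model and compare the empirical moments with $\boldsymbol{\mu}$ and $\mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}_d$. The sketch below does this with NumPy; the parameter values and sample size are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000                                  # large sample so Monte Carlo error is small
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, -0.5]])                  # illustrative loadings, d = 4, M = 2
d, M = W.shape
mu = np.array([0.5, -1.0, 2.0, 0.0])         # illustrative mean
sigma2 = 0.5

# Sample from the generative model: x = W z + mu + eps.
Z = rng.standard_normal((N, M))
eps = np.sqrt(sigma2) * rng.standard_normal((N, d))
X = Z @ W.T + mu + eps

# Model covariance versus empirical covariance.
C = W @ W.T + sigma2 * np.eye(d)
C_hat = np.cov(X, rowvar=False)
mean_err = np.max(np.abs(X.mean(axis=0) - mu))
cov_err = np.max(np.abs(C_hat - C))
print(f"max |mean error| = {mean_err:.4f}, max |cov error| = {cov_err:.4f}")
```

Both errors shrink at the usual $1/\sqrt{N}$ Monte Carlo rate as the sample size grows.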

3. MLE Solutions

Maximizing $\ln p(\mathbf{X}) = \sum_n \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{C})$ over $\boldsymbol{\mu}$, $\mathbf{W}$, and $\sigma^2$ gives closed-form solutions:

PPCA MLE Solutions $$\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}},$$ $$\hat{\mathbf{W}} = \mathbf{U}_M\,(\boldsymbol{\Lambda}_M - \hat{\sigma}^2\mathbf{I}_M)^{1/2}\,\mathbf{R},$$ $$\hat{\sigma}^2 = \frac{1}{d - M}\sum_{i=M+1}^{d} \lambda_i,$$

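where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ are the eigenvalues of the sample covariance $\mathbf{S}$, $\mathbf{U}_M = [\mathbf{u}_1, \dots, \mathbf{u}_M]$ collects the eigenvectors belonging to the $M$ largest eigenvalues, $\boldsymbol{\Lambda}_M = \operatorname{diag}(\lambda_1, \dots, \lambda_M)$, and $\mathbf{R}$ is an arbitrary $M \times M$ orthogonal matrix (rotational freedom in the latent space). The likelihood depends on $\mathbf{W}$ only through $\mathbf{W}\mathbf{W}^\top$, which is unchanged by $\mathbf{R}$, so $\mathbf{R} = \mathbf{I}_M$ is a convenient choice.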
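The closed-form solution amounts to one eigendecomposition of the sample covariance. The sketch below is one way to write it in NumPy, under two assumptions not fixed by the text: $\mathbf{S}$ uses the maximum-likelihood normalization $1/N$, and $\mathbf{R} = \mathbf{I}_M$. The function name `ppca_mle` and the synthetic data are illustrative.

```python
import numpy as np

def ppca_mle(X, M):
    """Closed-form PPCA maximum-likelihood fit (assumes M < d and R = I)."""
    N, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / N                       # sample covariance with 1/N normalization
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
    lam = eigvals[::-1]                     # reorder to descending
    U = eigvecs[:, ::-1]
    sigma2 = lam[M:].mean()                 # average of the d - M discarded eigenvalues
    # U_M (Lambda_M - sigma2 I)^{1/2}: scale column i of U_M by sqrt(lambda_i - sigma2).
    W = U[:, :M] * np.sqrt(lam[:M] - sigma2)
    return mu, W, sigma2, lam

# Illustrative synthetic data: 2 latent dimensions embedded in 4 observed ones.
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 1.0]])
X = rng.standard_normal((500, 2)) @ A + 0.5 * rng.standard_normal((500, 4))

mu_hat, W_hat, sigma2_hat, lam = ppca_mle(X, M=2)
C_hat = W_hat @ W_hat.T + sigma2_hat * np.eye(4)
print("eigenvalues of S:    ", np.round(lam, 3))
print("eigenvalues of C_hat:", np.round(np.linalg.eigvalsh(C_hat)[::-1], 3))
```

The fitted $\hat{\mathbf{C}}$ reproduces the top $M$ eigenvalues of $\mathbf{S}$ exactly and replaces the remaining $d - M$ with their average $\hat{\sigma}^2$.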

Interpreting $\hat{\sigma}^2$

The noise variance $\hat{\sigma}^2$ is the average eigenvalue of the discarded principal components, i.e., the average variance in the directions we chose to ignore. A small $\hat{\sigma}^2$ means the retained $M$ directions capture most of the variance; a large $\hat{\sigma}^2$ means the model attributes a large share of the variance to isotropic noise. Because the eigenvalues are sorted in decreasing order, $\hat{\sigma}^2$ is non-increasing in $M$: choosing a small $M$ inflates it, while choosing $M = d-1$ makes it equal to the single smallest eigenvalue $\lambda_d$.
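A small numeric example makes the dependence on $M$ concrete. The eigenvalue spectrum below is hypothetical, chosen only so the averages are easy to verify by hand.

```python
import numpy as np

# Hypothetical eigenvalues of S, sorted in decreasing order (d = 5).
lam = np.array([5.0, 3.0, 1.0, 0.5, 0.5])
d = lam.size

sigma2_by_M = {}
for M in range(1, d):
    # Average variance over the d - M discarded directions.
    sigma2_by_M[M] = lam[M:].mean()
    print(f"M = {M}: sigma2_hat = {sigma2_by_M[M]:.3f}")
# M = 1: sigma2_hat = 1.250
# M = 2: sigma2_hat = 0.667
# M = 3: sigma2_hat = 0.500
# M = 4: sigma2_hat = 0.500
```

For $M = 1$ the discarded eigenvalues are $3, 1, 0.5, 0.5$, whose average is $1.25$; for $M = 4$ only $\lambda_5 = 0.5$ is discarded.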

4. Connection to Classical PCA

As $\sigma^2 \to 0$, the PPCA model collapses to classical PCA: the covariance $\mathbf{C} \to \mathbf{W}\mathbf{W}^\top$ becomes exactly rank-$M$, and the model places all probability on the $M$-dimensional subspace. For any $\sigma^2 > 0$, PPCA is a proper full-rank probability model over $\mathbb{R}^d$.
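The limit can be illustrated numerically: $\mathbf{W}\mathbf{W}^\top$ has rank $M$, so the $d - M$ smallest eigenvalues of $\mathbf{C}$ equal $\sigma^2$ and shrink to zero with it. The loading matrix in this sketch is an arbitrary illustrative choice.

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])                  # illustrative loadings, d = 3, M = 2
d, M = W.shape
signal = W @ W.T                            # low-rank signal covariance
rank_signal = np.linalg.matrix_rank(signal)
print("rank of W W^T:", rank_signal)

min_eigs = []
for sigma2 in [1.0, 1e-2, 1e-4]:
    C = signal + sigma2 * np.eye(d)
    eigs = np.linalg.eigvalsh(C)            # ascending order
    min_eigs.append(eigs[0])
    print(f"sigma2 = {sigma2:g}: smallest eigenvalue of C = {eigs[0]:.6f}")
```

For every $\sigma^2 > 0$ the matrix $\mathbf{C}$ stays invertible, which is what keeps the Gaussian density well defined on all of $\mathbb{R}^d$.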

5. Advantages of the Probabilistic Framing

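  • Generative model. New data points can be generated by sampling $\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I}_M)$ and $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \hat{\sigma}^2\mathbf{I}_d)$, then computing $\mathbf{x} = \hat{\mathbf{W}}\mathbf{z} + \hat{\boldsymbol{\mu}} + \boldsymbol{\varepsilon}$. Omitting the noise term gives only the conditional mean $\hat{\mathbf{W}}\mathbf{z} + \hat{\boldsymbol{\mu}}$, which lies exactly on the $M$-dimensional subspace.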
  • Missing data. The posterior $p(\mathbf{z} \mid \mathbf{x}_\text{obs})$ can be evaluated even when some components of $\mathbf{x}$ are missing, enabling principled imputation.
  • Bayesian $M$ selection. Placing a prior over $\mathbf{W}$ and computing (in practice, approximating) the model evidence allows the latent dimension $M$ to be determined from the training data alone, without cross-validation.
  • EM algorithm. Although closed-form MLE exists, an EM algorithm (E-step: compute the posterior $p(\mathbf{z}_n \mid \mathbf{x}_n)$; M-step: update $\mathbf{W}$, $\sigma^2$) is sometimes preferred for computational reasons when $d$ is large, since it avoids forming and eigendecomposing the $d \times d$ sample covariance, or for extensions (such as missing data) where closed forms are unavailable.
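The EM iteration mentioned in the last bullet can be sketched as follows. The updates use the standard PPCA posterior $p(\mathbf{z}_n \mid \mathbf{x}_n) = \mathcal{N}(\mathbf{M}^{-1}\mathbf{W}^\top(\mathbf{x}_n - \boldsymbol{\mu}),\; \sigma^2\mathbf{M}^{-1})$ with $\mathbf{M} = \mathbf{W}^\top\mathbf{W} + \sigma^2\mathbf{I}_M$ (here $\mathbf{M}$ is a matrix, not the latent dimension). The function name `ppca_em`, the random initialization, the fixed iteration count, and the synthetic data are assumptions made for illustration; a practical implementation would monitor the log-likelihood for convergence.

```python
import numpy as np

def ppca_em(X, M, n_iter=500, seed=0):
    """EM for PPCA on complete data; returns (mu, W, sigma2). A sketch only."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X.mean(axis=0)                       # the MLE of mu is the sample mean
    Xc = X - mu
    W = rng.standard_normal((d, M))           # arbitrary initial guess
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of z_n given x_n.
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                    # row n is E[z_n]; Minv is symmetric
        sum_Ezz = N * sigma2 * Minv + Ez.T @ Ez   # sum over n of E[z_n z_n^T]
        # M-step: update W first, then sigma2 using the new W.
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W) * Ez)
                  + np.trace(sum_Ezz @ W.T @ W)) / (N * d)
    return mu, W, sigma2

# Illustrative synthetic data: d = 4, two latent dimensions, unit noise variance.
rng = np.random.default_rng(1)
N, d, M = 2000, 4, 2
W_true = np.array([[2.0, 0.0], [0.0, 1.5], [0.0, 0.0], [0.0, 0.0]])
X = rng.standard_normal((N, M)) @ W_true.T + rng.standard_normal((N, d))

mu_em, W_em, sigma2_em = ppca_em(X, M)

# Closed-form reference computed from the eigendecomposition of S (1/N normalization).
S = np.cov(X, rowvar=False, bias=True)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]                # descending order
sigma2_cf = lam[M:].mean()
WWt_cf = (U[:, :M] * (lam[:M] - sigma2_cf)) @ U[:, :M].T

print("sigma2 (EM) vs sigma2 (closed form):", sigma2_em, sigma2_cf)
print("max |W W^T difference|:", np.max(np.abs(W_em @ W_em.T - WWt_cf)))
```

The comparison is made on $\mathbf{W}\mathbf{W}^\top$ rather than on $\mathbf{W}$ itself because EM converges to the closed-form solution only up to the orthogonal matrix $\mathbf{R}$.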