Lecture 10.1
PCA: Maximum Variance
Principal Component Analysis (PCA) finds a low-dimensional linear subspace that captures the maximum variance in the data. In this formulation, the principal components emerge as the eigenvectors of the sample covariance matrix with the largest eigenvalues.
- State the PCA objective: maximize projected variance subject to an orthonormality constraint.
- Derive that the optimal projection direction satisfies the eigenvalue problem $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$.
- Show that the variance of the projected data equals the corresponding eigenvalue $\lambda$.
- Generalize to $M$-dimensional projections and state the total-variance result.
- Use the scree plot to choose $M$.
- List applications: dimensionality reduction, compression, visualization, decorrelation, and whitening.
1. Setting and Motivation
We have $N$ observations $\mathbf{x}_n \in \mathbb{R}^d$ and want to project them to $\mathbb{R}^M$ with $M \ll d$, preserving as much information as possible. Define the sample mean $\bar{\mathbf{x}} = \frac{1}{N}\sum_n \mathbf{x}_n$ and sample covariance
$$\mathbf{S} = \frac{1}{N}\sum_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top.$$
$\mathbf{S}$ is symmetric and positive semi-definite.
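The definitions above can be checked numerically. The following is a minimal NumPy sketch, assuming the observations are stored as the rows of an array `X`; the function name `sample_mean_and_covariance` and the synthetic data are illustrative choices, not part of the lecture.

```python
import numpy as np

def sample_mean_and_covariance(X):
    """Return the sample mean and the 1/N-normalized covariance of the rows of X."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)          # sample mean, shape (d,)
    Xc = X - x_bar                  # centered data, shape (N, d)
    S = Xc.T @ Xc / X.shape[0]      # sum of outer products divided by N
    return x_bar, S

# Synthetic correlated data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
x_bar, S = sample_mean_and_covariance(X)

# S should be symmetric with non-negative eigenvalues (up to round-off).
is_symmetric = np.allclose(S, S.T)
min_eigenvalue = np.linalg.eigvalsh(S).min()
```

Note that `np.cov` normalizes by $N-1$ by default; passing `bias=True` reproduces the $1/N$ convention used here.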
2. The 1D Case: Maximum Variance Direction
Project the (centered) data onto a unit vector $\mathbf{u}_1 \in \mathbb{R}^d$ to get scalar values $z_{n1} = \mathbf{u}_1^\top \mathbf{x}_n$. The variance of these projections is
$$\text{Var}[z_1] = \frac{1}{N}\sum_n (z_{n1} - \bar{z}_1)^2 = \mathbf{u}_1^\top \mathbf{S}\,\mathbf{u}_1.$$
We maximize this subject to $\|\mathbf{u}_1\| = 1$, a constrained problem solved via Lagrange multipliers (Lecture 9.3):
The Lagrangian is $\mathcal{L} = \mathbf{u}_1^\top\mathbf{S}\mathbf{u}_1 + \lambda_1(1 - \mathbf{u}_1^\top\mathbf{u}_1)$. Setting $\nabla_{\mathbf{u}_1}\mathcal{L} = \mathbf{0}$ gives
$$\mathbf{S}\,\mathbf{u}_1 = \lambda_1\,\mathbf{u}_1.$$
The projection direction $\mathbf{u}_1$ is an eigenvector of $\mathbf{S}$ and the Lagrange multiplier $\lambda_1$ is the corresponding eigenvalue. Left-multiplying by $\mathbf{u}_1^\top$ and using $\mathbf{u}_1^\top\mathbf{u}_1 = 1$ shows that the variance of the projected data is $\mathbf{u}_1^\top\mathbf{S}\mathbf{u}_1 = \lambda_1$.
To maximize variance, choose $\mathbf{u}_1$ to be the eigenvector with the largest eigenvalue. This is called the first principal component.
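As a numerical sanity check of the 1D result, the sketch below computes the top eigenvector with `numpy.linalg.eigh` and compares the variance of the projections with $\lambda_1$. The helper name `first_principal_component` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def first_principal_component(X):
    """Return (u1, lambda1, S): top eigenvector, top eigenvalue, covariance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1], S

# Synthetic data with a different scale along each axis (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])
u1, lam1, S = first_principal_component(X)

# Variance of the projections onto u1 (np.var normalizes by N by default).
z1 = (X - X.mean(axis=0)) @ u1
projected_variance = z1.var()

# Any other unit direction should give a variance no larger than lambda1.
v = rng.normal(size=3)
v /= np.linalg.norm(v)
random_direction_variance = v @ S @ v
```

Here `projected_variance` should agree with `lam1` up to round-off, and `random_direction_variance` should not exceed it.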
3. $M$-Dimensional Projection
For an $M$-dimensional projection, select the $M$ eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_M$ with the $M$ largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$. The projection of a centered data point is
$$\mathbf{z}_n = \mathbf{U}_M^\top (\mathbf{x}_n - \bar{\mathbf{x}}) \in \mathbb{R}^M,$$
where $\mathbf{U}_M = [\mathbf{u}_1, \dots, \mathbf{u}_M]$. The total variance captured is $\sum_{j=1}^M \lambda_j$.
The total variance of the original data is $\operatorname{tr}(\mathbf{S}) = \sum_{j=1}^d \lambda_j$ (using $\mathbf{S} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$, the cyclic property of the trace, and $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$). The fraction of variance explained by $M$ components is
$$\text{Explained variance ratio} = \frac{\sum_{j=1}^M \lambda_j}{\sum_{j=1}^d \lambda_j}.$$
The projected features $\mathbf{z}_n$ are uncorrelated: their covariance matrix is the diagonal matrix $\operatorname{diag}(\lambda_1, \dots, \lambda_M)$.
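The $M$-dimensional projection, the explained variance ratio, and the decorrelation claim can be illustrated together. This is a minimal sketch under the same row-wise data assumption; `pca_project` is a hypothetical helper name, and the random data only serve as an example.

```python
import numpy as np

def pca_project(X, M):
    """Project the rows of X onto the top-M eigenvectors of the sample covariance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)        # ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U_M = eigvecs[:, :M]                        # d x M matrix of top eigenvectors
    Z = Xc @ U_M                                # row n is z_n = U_M^T (x_n - x_bar)
    ratio = eigvals[:M].sum() / eigvals.sum()   # explained variance ratio
    return Z, eigvals, U_M, ratio

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Z, eigvals, U_M, ratio = pca_project(X, M=2)

# Covariance of the projected features; expected to be diag(lambda_1, lambda_2).
cov_Z = Z.T @ Z / Z.shape[0]
```

The off-diagonal entries of `cov_Z` should vanish up to round-off, and its diagonal should match the two largest eigenvalues.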
4. Choosing $M$: The Scree Plot
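A scree plot shows the eigenvalues $\lambda_j$ (or each component's share of the variance) in decreasing order against the component index $j$; look for the "elbow" where the curve flattens. A common companion is the cumulative explained variance ratio plotted against $M$: choose the smallest $M$ that crosses a target threshold (e.g., 90%). Often a small number of principal components captures most of the variance, because the remaining directions have small $\lambda_j$ and frequently reflect noise.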
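The threshold rule can be sketched in a few lines. The function name `choose_num_components`, the default threshold, and the example eigenvalues below are all illustrative assumptions rather than values from the lecture.

```python
import numpy as np

def choose_num_components(eigvals, threshold=0.90):
    """Smallest M whose cumulative explained variance ratio reaches the threshold."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # First index where cumulative >= threshold, converted to a count.
    M = int(np.searchsorted(cumulative, threshold) + 1)
    # Guard against round-off when the threshold is (numerically) 1.0.
    return min(M, eigvals.size), cumulative

# Hypothetical eigenvalues; cumulative ratios are 0.60, 0.85, 0.95, 0.98, 1.00.
eigvals = [6.0, 2.5, 1.0, 0.3, 0.2]
M, cumulative = choose_num_components(eigvals, threshold=0.90)
```

With these made-up eigenvalues, two components explain 85% of the variance and three explain 95%, so the 90% threshold selects `M = 3`.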
5. Whitening (Sphering)
After projecting to the principal-component basis, the data can be further transformed to have identity covariance by dividing each component $z_{ni}$ by $\sqrt{\lambda_i}$:
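$$\tilde{\mathbf{z}}_n = \boldsymbol{\Lambda}_M^{-1/2}\,\mathbf{U}_M^\top (\mathbf{x}_n - \bar{\mathbf{x}}),$$
where $\boldsymbol{\Lambda}_M = \operatorname{diag}(\lambda_1, \dots, \lambda_M)$. The result has zero mean and identity covariance, so the data cloud has the same spread in every direction. This is called whitening. It requires $\lambda_i > 0$ for every retained component, and it puts all retained components on a common scale, which can help algorithms such as K-means that rely on Euclidean distances.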
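A minimal whitening sketch follows, assuming all retained eigenvalues are strictly positive. The helper name `pca_whiten` and the optional `eps` stabilizer are illustrative additions; `eps` is a common practical safeguard against very small eigenvalues and is not part of the derivation above.

```python
import numpy as np

def pca_whiten(X, M, eps=0.0):
    """Project onto the top-M principal directions and rescale to unit variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending order
    order = np.argsort(eigvals)[::-1][:M]         # indices of the M largest
    lam, U_M = eigvals[order], eigvecs[:, order]
    # Divide component i by sqrt(lambda_i); eps > 0 would bias the result slightly.
    return (Xc @ U_M) / np.sqrt(lam + eps)

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 4))
Z_white = pca_whiten(X, M=3)

# Covariance of the whitened data; expected to be the 3 x 3 identity.
cov_white = Z_white.T @ Z_white / Z_white.shape[0]
```

With `eps=0.0` the covariance of `Z_white` should equal the identity up to round-off; a positive `eps` trades exactness for numerical stability.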
6. Applications
- Dimensionality reduction. Reduce computation and memory for downstream models.
- Compression. Store $M$ latent coordinates instead of $d$ pixel values.
- Visualization. Project to $M=2$ or $3$ for scatter plots of high-dimensional data (e.g., MNIST digits).
- Decorrelation. Remove feature correlations before training models that assume independent inputs.
- Regularization. Using fewer features reduces the number of model parameters and the risk of overfitting.
A full eigendecomposition of $\mathbf{S}$ costs $O(d^3)$. When only the top $M$ components are needed, truncated singular value decomposition (SVD) is $O(Md^2)$ or better, which is crucial for high-dimensional data. In Python: numpy.linalg.eigh (for symmetric matrices) or sklearn.decomposition.PCA.
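The link between the eigendecomposition and the SVD can be illustrated as follows: if the centered data matrix has singular values $s_j$, the eigenvalues of $\mathbf{S}$ are $s_j^2 / N$ and the right singular vectors are the principal directions (up to sign). Note that `numpy.linalg.svd` computes a full (thin) SVD, not a truncated one, so this sketch demonstrates the equivalence rather than the cost saving; truncated solvers such as `scipy.sparse.linalg.svds` would be the place to look for the latter. The data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))
N = X.shape[0]
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the sample covariance matrix.
S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)                  # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # flip to descending

# Route 2: thin SVD of the centered data matrix (singular values descending).
_, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_from_svd = singular_values**2 / N

# First principal direction from each route; they agree up to a sign flip.
alignment = abs(Vt[0] @ eigvecs[:, 0])
```

Both routes should give the same eigenvalues, and `alignment` should be close to 1.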