Lecture 10.1

PCA: Maximum Variance

Principal Component Analysis (PCA) finds a low-dimensional linear subspace that captures the maximum variance in the data. In this formulation, the principal components emerge as the eigenvectors of the sample covariance matrix with the largest eigenvalues.

Learning Objectives

State the PCA objective: maximize projected variance subject to an orthonormality constraint.
Derive that the optimal projection direction satisfies the eigenvalue problem $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$.
Show that the variance of the projected data equals the corresponding eigenvalue $\lambda$.
Generalize to $M$-dimensional projections and state the total-variance result.
Use the scree plot to choose $M$.
List applications: dimensionality reduction, compression, visualization, decorrelation, and whitening.

1. Setting and Motivation

We have $N$ observations $\mathbf{x}_n \in \mathbb{R}^d$ and want to project them to $\mathbb{R}^M$ with $M \ll d$, preserving as much information as possible. Define the sample mean $\bar{\mathbf{x}} = \frac{1}{N}\sum_n \mathbf{x}_n$ and sample covariance

$$\mathbf{S} = \frac{1}{N}\sum_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top.$$

$\mathbf{S}$ is symmetric and positive semi-definite.
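
As a quick numerical check of the definitions above, the following minimal sketch computes $\bar{\mathbf{x}}$ and $\mathbf{S}$ with NumPy. The data matrix `X` (rows are observations), the seed, and the variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed; synthetic data for illustration only
X = rng.normal(size=(200, 5))           # N = 200 observations, d = 5 features

x_bar = X.mean(axis=0)                  # sample mean, shape (d,)
Xc = X - x_bar                          # centered data
S = Xc.T @ Xc / X.shape[0]              # sample covariance with the 1/N convention

# S should be symmetric with non-negative eigenvalues (up to round-off).
eigvals = np.linalg.eigvalsh(S)
print(np.allclose(S, S.T), eigvals.min() >= -1e-12)
```

Note that `np.cov` divides by $N-1$ by default; passing `bias=True` reproduces the $1/N$ convention used here.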

2. The 1D Case: Maximum Variance Direction

Project the (centered) data onto a unit vector $\mathbf{u}_1 \in \mathbb{R}^d$ to get scalar values $z_{n1} = \mathbf{u}_1^\top \mathbf{x}_n$. The variance of these projections is

$$\text{Var}[z_1] = \frac{1}{N}\sum_n (z_{n1} - \bar{z}_1)^2 = \mathbf{u}_1^\top \mathbf{S}\,\mathbf{u}_1.$$
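
The identity $\text{Var}[z_1] = \mathbf{u}_1^\top\mathbf{S}\,\mathbf{u}_1$ can be checked numerically. The sketch below uses synthetic correlated data and an arbitrary unit vector `u`; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                  # 1/N sample covariance

u = rng.normal(size=4)
u /= np.linalg.norm(u)                  # arbitrary unit vector, not yet optimized

z = X @ u                               # projections z_n = u^T x_n
var_direct = np.mean((z - z.mean()) ** 2)   # variance computed from the projections
var_quadratic = u @ S @ u                   # quadratic form u^T S u
print(var_direct, var_quadratic)
```

The two printed values should agree up to floating-point round-off for any unit vector, which is what makes the variance a function of the direction alone.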

We maximize this subject to $\|\mathbf{u}_1\| = 1$, a constrained problem solved via Lagrange multipliers (Lecture 9.3):

PCA as an Eigenvalue Problem

The Lagrangian is $\mathcal{L} = \mathbf{u}_1^\top\mathbf{S}\mathbf{u}_1 + \lambda_1(1 - \mathbf{u}_1^\top\mathbf{u}_1)$. Because $\mathbf{S}$ is symmetric, $\nabla_{\mathbf{u}_1}\mathcal{L} = 2\mathbf{S}\mathbf{u}_1 - 2\lambda_1\mathbf{u}_1$, and setting this to $\mathbf{0}$ gives

$$\mathbf{S}\,\mathbf{u}_1 = \lambda_1\,\mathbf{u}_1.$$

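The projection direction $\mathbf{u}_1$ is therefore an eigenvector of $\mathbf{S}$, and the Lagrange multiplier $\lambda_1$ is the corresponding eigenvalue. Left-multiplying the eigenvalue equation by $\mathbf{u}_1^\top$ and using $\mathbf{u}_1^\top\mathbf{u}_1 = 1$ shows that the variance of the projected data is $\mathbf{u}_1^\top\mathbf{S}\mathbf{u}_1 = \lambda_1$.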

To maximize variance, choose $\mathbf{u}_1$ to be the eigenvector with the largest eigenvalue. This is called the first principal component.
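
A minimal sketch of this result, assuming synthetic data with unequal spreads: `numpy.linalg.eigh` returns eigenvalues in ascending order, so the first principal component is the last column of the eigenvector matrix. The comparison against random unit vectors is an illustrative sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.3])  # unequal spreads per axis
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)    # symmetric solver; eigenvalues ascending
lam1 = eigvals[-1]                      # largest eigenvalue
u1 = eigvecs[:, -1]                     # first principal component (sign is arbitrary)

var_u1 = np.var(Xc @ u1)                # np.var uses the 1/N convention by default

# Evaluate u^T S u for many random unit vectors; none should exceed lam1.
trials = rng.normal(size=(1000, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = np.max(np.einsum("ij,jk,ik->i", trials, S, trials))
print(lam1, var_u1, best_random)
```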

3. $M$-Dimensional Projection

For an $M$-dimensional projection, select the $M$ eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_M$ with the $M$ largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$. The projection of a centered data point is

$$\mathbf{z}_n = \mathbf{U}_M^\top (\mathbf{x}_n - \bar{\mathbf{x}}) \in \mathbb{R}^M,$$

where $\mathbf{U}_M = [\mathbf{u}_1, \dots, \mathbf{u}_M]$. The total variance captured is $\sum_{j=1}^M \lambda_j$.
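
The projection can be sketched as a small function. The name `pca_project` and its return values are hypothetical choices for this example; the data is synthetic.

```python
import numpy as np

def pca_project(X, M):
    """Hypothetical helper: project the rows of X onto the top-M principal components."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]                   # 1/N sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]        # indices of the M largest eigenvalues
    U_M = eigvecs[:, order]                      # d x M matrix [u_1, ..., u_M]
    Z = Xc @ U_M                                 # row n holds z_n^T = (U_M^T (x_n - x_bar))^T
    return Z, U_M, eigvals[order], x_bar

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Z, U_M, lams, x_bar = pca_project(X, M=2)
print(Z.shape, Z.var(axis=0).sum(), lams.sum())
```

The summed variance of the columns of `Z` should match $\sum_{j=1}^M \lambda_j$ up to round-off.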

Total Variance and Eigendecomposition

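The total variance of the original data is $\operatorname{tr}(\mathbf{S}) = \sum_{j=1}^d \lambda_j$. This follows from $\mathbf{S} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$ with orthogonal $\mathbf{U}$ and the cyclic property of the trace: $\operatorname{tr}(\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top) = \operatorname{tr}(\boldsymbol{\Lambda}\mathbf{U}^\top\mathbf{U}) = \operatorname{tr}(\boldsymbol{\Lambda})$. The fraction of variance explained by $M$ components is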

$$\text{Explained variance ratio} = \frac{\sum_{j=1}^M \lambda_j}{\sum_{j=1}^d \lambda_j}.$$

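The projected features $\mathbf{z}_n$ are uncorrelated: their covariance matrix is $\mathbf{U}_M^\top\mathbf{S}\,\mathbf{U}_M = \operatorname{diag}(\lambda_1, \dots, \lambda_M)$, because the columns of $\mathbf{U}_M$ are orthonormal eigenvectors of $\mathbf{S}$.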
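
Both claims in this subsection can be verified on toy data. In the sketch below, the data, the seed, and the choice $M = 2$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder to descending eigenvalues

M = 2
ratio = eigvals[:M].sum() / eigvals.sum()    # explained variance ratio for M components
Z = Xc @ eigvecs[:, :M]                      # projected features
cov_Z = Z.T @ Z / len(Z)                     # covariance of z (Z already has zero mean)

print(np.trace(S), eigvals.sum())            # total variance, computed two ways
print(ratio)
print(np.round(cov_Z, 6))                    # off-diagonal entries should be ~0
```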

4. Choosing $M$: The Scree Plot

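A scree plot shows the eigenvalues $\lambda_j$ in decreasing order against the component index $j$; a common heuristic is to keep the components before the curve flattens (the "elbow"). A closely related plot shows the cumulative explained variance ratio against $M$, from which one chooses the smallest $M$ that reaches a target threshold (e.g., 90%). In many real datasets a small number of principal components captures most of the variance, because the remaining directions correspond largely to noise with near-zero $\lambda_j$.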
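
The threshold rule can be sketched without plotting. The helper name `choose_M` and the example eigenvalues are hypothetical; the 90% default is only the illustrative threshold mentioned above.

```python
import numpy as np

def choose_M(eigvals, threshold=0.90):
    """Hypothetical helper: smallest M whose cumulative explained variance reaches threshold."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending eigenvalues
    cumulative = np.cumsum(lam) / lam.sum()                 # cumulative explained variance ratio
    # searchsorted returns the first index whose cumulative ratio is >= threshold;
    # add 1 to turn the zero-based index into a component count.
    M = int(np.searchsorted(cumulative, threshold)) + 1
    return min(M, len(lam)), cumulative                     # clip in case of round-off at the end

M, cumulative = choose_M([6.0, 2.5, 1.0, 0.3, 0.2], threshold=0.90)
print(M, cumulative)
```

Plotting `cumulative` against `1..d` (for instance with matplotlib) gives the cumulative curve described above; plotting the sorted eigenvalues themselves gives the scree plot.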

5. Whitening (Sphering)

After projecting to the principal-component basis, the data can be further transformed to have identity covariance by dividing each component $z_{ni}$ by $\sqrt{\lambda_i}$ (which requires $\lambda_i > 0$ for every retained component):

$$\tilde{\mathbf{z}}_n = \boldsymbol{\Lambda}_M^{-1/2}\,\mathbf{U}_M^\top (\mathbf{x}_n - \bar{\mathbf{x}}).$$

The result has zero mean and identity covariance, i.e., an isotropic, spherical distribution. This is called whitening. Because every retained direction now has unit variance, no single high-variance direction dominates Euclidean distances, which benefits algorithms like K-means that rely on them.
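
A minimal whitening sketch on synthetic data, assuming the retained eigenvalues are strictly positive (true here, but worth checking on real data, where tiny eigenvalues amplify noise):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]          # descending order

M = 3
Z = Xc @ eigvecs[:, :M]                  # projected data, covariance diag(lambda_1..lambda_M)
Z_white = Z / np.sqrt(eigvals[:M])       # divide each component by sqrt(lambda_i)

cov_white = Z_white.T @ Z_white / len(Z_white)
print(np.round(cov_white, 6))            # should be close to the M x M identity
```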

6. Applications

Dimensionality reduction. Reduce computation and memory for downstream models.
Compression. Store $M$ latent coordinates instead of $d$ pixel values.
Visualization. Project to $M=2$ or $3$ for scatter plots of high-dimensional data (e.g., MNIST digits).
Decorrelation. Remove linear correlations between features before training models that assume uncorrelated inputs (note that uncorrelated does not imply independent).
Regularization. Using fewer features reduces the number of model parameters and the risk of overfitting.

Computational Note

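A full eigendecomposition of $\mathbf{S}$ costs $O(d^3)$. When only the top $M$ components are needed, iterative methods (e.g., power iteration or a truncated singular value decomposition, SVD) scale roughly as $O(Md^2)$ or better, which is crucial for high-dimensional data. In Python, numpy.linalg.eigh handles symmetric matrices and returns eigenvalues in ascending order, so the leading components are the last columns; sklearn.decomposition.PCA provides a ready-made implementation.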
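
The link between the two routes can be sketched as follows. For clarity this example uses a full thin SVD from `numpy.linalg.svd` rather than a truncated solver, so it illustrates the equivalence and not the cost saving; the data and names are assumptions. If the centered data matrix factors as $\mathbf{X}_c = \mathbf{W}\,\operatorname{diag}(s)\,\mathbf{V}^\top$, then $\mathbf{S} = \frac{1}{N}\mathbf{X}_c^\top\mathbf{X}_c$ has eigenvectors given by the columns of $\mathbf{V}$ and eigenvalues $s_j^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 8)) @ rng.normal(size=(8, 8))
Xc = X - X.mean(axis=0)
N = len(X)

# Route 1: eigendecomposition of the covariance matrix.
S = Xc.T @ Xc / N
w, V = np.linalg.eigh(S)                 # ascending eigenvalues
eigvals = w[::-1]                        # descending
u1 = V[:, -1]                            # first principal component

# Route 2: thin SVD of the centered data matrix, never forming S.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # singular values come out descending
eigvals_svd = s**2 / N                   # eigenvalues of S under the 1/N convention

print(np.allclose(eigvals, eigvals_svd))
print(abs(Vt[0] @ u1))                   # ~1: same direction up to sign
```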