Lecture 10.2

PCA: Minimum Reconstruction Error

PCA can also be derived by asking a different question: which low-dimensional subspace allows the best reconstruction of the original data? This minimum-reconstruction-error formulation arrives at the same eigenvectors, and motivates modern autoencoder architectures.

Learning Objectives

Write the linear reconstruction model $\hat{\mathbf{x}}_n = \sum_{i=1}^M z_{ni}\mathbf{u}_i + \sum_{i=M+1}^d b_i\mathbf{u}_i$ and identify each term.
Show that the shared offsets $b_i = \mathbf{u}_i^\top\bar{\mathbf{x}}$ minimize the reconstruction error over all data points.
Show that the reconstruction error, averaged over the data points, simplifies to $\sum_{i=M+1}^d \mathbf{u}_i^\top\mathbf{S}\mathbf{u}_i$, i.e., the variance in the discarded directions.
Conclude that minimizing reconstruction error is equivalent to maximizing the variance in the retained directions, so both formulations yield the same PCA solution.
Connect to compression and autoencoders.

1. The Reconstruction Model

Suppose we expand each data point $\mathbf{x}_n$ in an orthonormal basis $\{\mathbf{u}_1, \dots, \mathbf{u}_d\}$. In a complete basis, $\mathbf{x}_n = \sum_{i=1}^d (\mathbf{u}_i^\top \mathbf{x}_n)\,\mathbf{u}_i$ exactly. For dimensionality reduction, we keep only $M < d$ basis vectors and allow the remaining coefficients to be shared (not data-specific):

$$\hat{\mathbf{x}}_n = \underbrace{\sum_{i=1}^{M} z_{ni}\,\mathbf{u}_i}_{\text{data-specific}} + \underbrace{\sum_{i=M+1}^{d} b_i\,\mathbf{u}_i}_{\text{shared offset}},$$

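where $z_{ni} = \mathbf{u}_i^\top\mathbf{x}_n$ are the per-point latent variables (the projection coefficients, which are also the error-minimizing choice for an orthonormal basis) and $b_i$ are fixed coefficients, shared by all data points, to be chosen.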
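The model can be sketched numerically as follows. This is a minimal illustration, not part of the lecture: the toy data, the sizes `N`, `d`, `M`, and the randomly chosen orthonormal basis (obtained from a QR factorization) are all assumptions, and the offsets `b` are left arbitrary because they are only optimized in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 200, 5, 2                      # hypothetical sizes for illustration

# Toy data: rows of X are the data points x_n.
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d)) + 3.0

# Any orthonormal basis fits the model; here one is taken from a QR factorization.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # columns are u_1, ..., u_d

Z = X @ U[:, :M]                         # z_ni = u_i^T x_n, shape (N, M)
b = rng.normal(size=d - M)               # shared offsets b_i (arbitrary, not yet optimal)

# Data-specific part plus shared offset; the offset broadcasts over all N rows.
X_hat = Z @ U[:, :M].T + b @ U[:, M:].T

# With all d basis vectors retained, the expansion is exact.
X_full = (X @ U) @ U.T
```

Whatever value `b` takes, the residual `X - X_hat` has no component along the retained directions, so all of the error lives in the discarded ones.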

2. Optimal Shared Offsets

Minimizing the total squared reconstruction error $E = \sum_n\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2$ over $b_i$ (with $\mathbf{u}_{M+1}, \dots, \mathbf{u}_d$ fixed) yields:

Optimal Shared Offset $$b_i^* = \mathbf{u}_i^\top \bar{\mathbf{x}}, \quad i = M+1, \dots, d.$$

The shared coefficient for each discarded direction is simply the projection of the data mean onto that direction — intuitively, the best single representative for all points in that direction.
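A short derivation sketch, using only the definitions above: with $z_{ni} = \mathbf{u}_i^\top\mathbf{x}_n$ the retained components cancel, and orthonormality leaves a sum of squared coefficient differences in the discarded directions,

$$E = \sum_{n=1}^N \sum_{i=M+1}^{d} \left(\mathbf{u}_i^\top\mathbf{x}_n - b_i\right)^2.$$

Setting the derivative with respect to each $b_j$ to zero gives

$$\frac{\partial E}{\partial b_j} = -2\sum_{n=1}^N \left(\mathbf{u}_j^\top\mathbf{x}_n - b_j\right) = 0 \quad\Longrightarrow\quad b_j^* = \frac{1}{N}\sum_{n=1}^N \mathbf{u}_j^\top\mathbf{x}_n = \mathbf{u}_j^\top\bar{\mathbf{x}},$$

where $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^N \mathbf{x}_n$ is the sample mean. Because $E$ is a convex quadratic in each $b_j$, this stationary point is the minimum.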

3. The Reconstruction Error

Substituting $b_i^*$ and $z_{ni} = \mathbf{u}_i^\top\mathbf{x}_n$ into the error, and using orthonormality ($\mathbf{u}_i^\top\mathbf{u}_j = \delta_{ij}$), the reconstruction error simplifies to:

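Reconstruction Error as Discarded Variance $$\frac{1}{N}E = \frac{1}{N}\sum_{n=1}^N \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{i=M+1}^{d} \mathbf{u}_i^\top\mathbf{S}\,\mathbf{u}_i,$$

where $\mathbf{S} = \frac{1}{N}\sum_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$ is the sample covariance matrix. The factor $1/N$ makes the right-hand side a variance and does not change which basis minimizes $E$.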
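The intermediate steps can be sketched as follows, assuming $\mathbf{S}$ denotes the sample covariance with $1/N$ normalization, $\mathbf{S} = \frac{1}{N}\sum_n (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$ (the Lecture 10.1 definition is not reproduced here). With the optimal offsets, the residual lies entirely in the discarded directions:

$$\mathbf{x}_n - \hat{\mathbf{x}}_n = \sum_{i=M+1}^{d} \left[\mathbf{u}_i^\top(\mathbf{x}_n - \bar{\mathbf{x}})\right]\mathbf{u}_i.$$

Orthonormality turns its squared norm into a sum of squared coefficients:

$$\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{i=M+1}^{d} \mathbf{u}_i^\top(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top\mathbf{u}_i.$$

Averaging over $n$ moves the sum inside the quadratic form and produces $\mathbf{S}$:

$$\frac{1}{N}\sum_{n=1}^N \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{i=M+1}^{d} \mathbf{u}_i^\top\mathbf{S}\,\mathbf{u}_i.$$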

Minimizing $E$ over the basis $\{\mathbf{u}_{M+1}, \dots, \mathbf{u}_d\}$ means choosing the discarded directions to have minimum variance. The total variance $\sum_{i=1}^d \mathbf{u}_i^\top\mathbf{S}\,\mathbf{u}_i = \operatorname{tr}(\mathbf{S})$ is the same for every orthonormal basis, so this is equivalent to requiring the retained directions $\mathbf{u}_1, \dots, \mathbf{u}_M$ to have maximum variance. This is exactly the Lecture 10.1 formulation: the retained basis vectors are the eigenvectors of $\mathbf{S}$ with the $M$ largest eigenvalues.

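With the eigenvector choice, and with $\mathbf{S}$ the sample covariance matrix, the minimum achievable mean reconstruction error per data point is $\sum_{i=M+1}^d \lambda_i$, the sum of the discarded eigenvalues, where $\lambda_1 \ge \dots \ge \lambda_d$ are the eigenvalues of $\mathbf{S}$.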
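A quick numerical check of this identity, as a sketch under stated assumptions: the toy data and sizes are hypothetical, and `S` is taken to be the sample covariance with $1/N$ normalization, so the quantity compared is the mean squared error per data point.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, M = 500, 6, 2                      # hypothetical sizes for illustration
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d)) + 1.0

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N                        # sample covariance, 1/N normalization (assumed)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Reconstruct from the top-M eigenvectors with the optimal shared offsets.
U_M = eigvecs[:, :M]
X_hat = x_bar + Xc @ U_M @ U_M.T
mean_err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded = eigvals[M:].sum()

# For comparison: the same reconstruction with a random orthonormal M-dimensional basis.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Q_M = Q[:, :M]
X_hat_q = x_bar + Xc @ Q_M @ Q_M.T
mean_err_q = np.mean(np.sum((X - X_hat_q) ** 2, axis=1))

print(mean_err, discarded, mean_err_q)
```

The first two printed numbers should agree to floating-point precision, and the third should be no smaller than the first.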

4. Application: Compression of Images

To compress a set of images, store each image $\mathbf{x}_n \in \mathbb{R}^d$ as its $M$-dimensional coefficient vector $\mathbf{z}_n$; the mean $\bar{\mathbf{x}}$ and the top-$M$ eigenvectors are shared by all images and stored once. For $28\times28$ digit images ($d = 784$), reconstruction quality improves with $M$:

$M = 10$: recognizable but blurry digits.
$M = 50$: clearly distinguishable digits.
$M = 200$ (out of 784 for $28\times28$ images): near-perfect reconstruction — substantial redundancy in pixel space.
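The storage saving can be made concrete with a small count of stored numbers. This is an illustrative sketch: the helper `pca_storage` and the dataset size of 10,000 images are hypothetical and not taken from the lecture.

```python
def pca_storage(n_images, d, M):
    """Count stored numbers before and after PCA compression."""
    raw = n_images * d                       # every pixel of every image
    compressed = n_images * M + d + M * d    # codes z_n + mean + top-M eigenvectors
    return raw, compressed

# Hypothetical example: 10,000 images of 28 x 28 = 784 pixels each.
for M in (10, 50, 200):
    raw, compressed = pca_storage(10_000, 784, M)
    print(M, raw, compressed, round(raw / compressed, 2))
```

The shared cost $d + Md$ is paid once, so the saving approaches the ratio $d/M$ as the number of images grows.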

The eigenvectors themselves are often interpretable: for face images they correspond to directions of variation such as lighting, smile, or gender (the classic "eigenfaces").

5. Connection to Autoencoders

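With the optimal offsets $b_i^* = \mathbf{u}_i^\top\bar{\mathbf{x}}$, the reconstruction model can be written as a two-stage pipeline, where $\mathbf{U}_M = [\mathbf{u}_1, \dots, \mathbf{u}_M]$ is the $d \times M$ matrix of retained basis vectors:

Encoder: $\mathbf{z}_n = \mathbf{U}_M^\top(\mathbf{x}_n - \bar{\mathbf{x}})$, a linear map of the centered data to the latent space. These are the coefficients $\mathbf{u}_i^\top\mathbf{x}_n$ of Section 1 measured relative to the mean.
Decoder: $\hat{\mathbf{x}}_n = \mathbf{U}_M\mathbf{z}_n + \bar{\mathbf{x}}$, a linear map back to input space. Adding $\bar{\mathbf{x}}$ supplies both the shared offsets and the mean's components along the retained directions.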
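The encoder and decoder can be sketched as two small functions. This is a minimal illustration under stated assumptions: the function names `fit_pca`, `encode`, and `decode` and the toy data are hypothetical, and the basis is obtained from an eigendecomposition of the sample covariance with $1/N$ normalization.

```python
import numpy as np

def fit_pca(X, M):
    """Return the data mean and the d x M matrix of top-M eigenvectors."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]               # sample covariance (1/N, assumed)
    _, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    U_M = eigvecs[:, ::-1][:, :M]            # columns with the M largest eigenvalues
    return x_bar, U_M

def encode(X, x_bar, U_M):
    """z_n = U_M^T (x_n - x_bar), applied to every row of X."""
    return (X - x_bar) @ U_M

def decode(Z, x_bar, U_M):
    """x_hat_n = U_M z_n + x_bar, applied to every row of Z."""
    return Z @ U_M.T + x_bar

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # toy data
x_bar, U_M = fit_pca(X, M=2)
Z = encode(X, x_bar, U_M)
X_hat = decode(Z, x_bar, U_M)
```

Encoding a reconstruction returns the same code, `encode(X_hat, x_bar, U_M) == Z` up to rounding, because $\mathbf{U}_M^\top\mathbf{U}_M = \mathbf{I}$.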

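This has exactly the form of a two-layer linear neural network (no activation functions) with a reconstruction-error objective. Training such a network by minimizing squared reconstruction error recovers the same $M$-dimensional principal subspace, although the learned weights need not be the orthonormal eigenvectors themselves. Replacing the linear maps with deep nonlinear networks gives an autoencoder (Lecture 10.4), which can capture curved, nonlinear manifolds that PCA cannot.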