Lecture 10.4

Nonlinear PCA

Linear PCA fails when the intrinsic data structure is curved. Two extensions address this: Kernel PCA applies the kernel trick to perform PCA implicitly in a high-dimensional (possibly infinite-dimensional) feature space, while autoencoders use neural networks to learn a nonlinear encoder and decoder, trained end-to-end by minimizing reconstruction error.

Learning Objectives

Identify when linear PCA is inadequate (nonlinear manifolds).
Explain Kernel PCA: replace the feature-space covariance matrix with the (centered) $N \times N$ kernel (Gram) matrix and compute projections from its eigenvectors.
Describe the autoencoder architecture: encoder $f_{\mathbf{W}_1}(\mathbf{x}) = \mathbf{z}$, decoder $g_{\mathbf{W}_2}(\mathbf{z}) = \hat{\mathbf{x}}$, trained by minimizing $\|\mathbf{x} - \hat{\mathbf{x}}\|^2$.
Show that a two-layer linear autoencoder (no activation functions) recovers the principal subspace found by classical PCA.
Outline the variational autoencoder (VAE) idea and its generative capabilities.

1. Motivation: When Linear PCA Fails

Consider data concentrated on a curved 1D manifold in $\mathbb{R}^2$ (e.g., a spiral or a circle). Linear PCA looks for the direction of greatest variance — a straight line — which cannot faithfully represent the intrinsic structure. Projecting onto the first principal component squashes the curved manifold into a line, destroying the true 1D organization.
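The failure can be made concrete with a small numerical sketch. The toy data (200 evenly spaced points on the unit circle) and all variable names are assumptions chosen for illustration, not part of the lecture material.

```python
import numpy as np

# Assumed toy data: 200 evenly spaced points on the unit circle in R^2.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Linear PCA: eigendecomposition of the sample covariance.
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
explained = eigvals[::-1] / eigvals.sum()   # variance fraction per component

# Project onto the first principal component and reconstruct.
u1 = eigvecs[:, -1]
z = Xc @ u1
X_hat = np.outer(z, u1) + x_bar
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print("explained variance fractions:", explained)
print("mean squared reconstruction error:", mse)
```

For this symmetric data each component carries half of the variance, so keeping one linear component discards half of the total variance even though the data are intrinsically one-dimensional.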

Two Concentric Circles

A dataset of two concentric rings consists of two one-dimensional manifolds, one per ring. Standard PCA with two components merely rotates the 2D scatter, so no single principal component separates the rings. Kernel PCA with a radial basis function (RBF) kernel of suitable width can map the inner and outer rings to opposite sides of the first principal component, separating them nonlinearly.

2. Kernel PCA

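The idea is to first map data to a (possibly infinite-dimensional) feature space via $\boldsymbol{\phi}(\mathbf{x})$, then apply PCA there. Assuming the mapped features have zero mean, $\sum_n \boldsymbol{\phi}(\mathbf{x}_n) = \mathbf{0}$, the feature-space covariance is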

$$\mathbf{C}_\phi = \frac{1}{N}\sum_n \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^\top.$$

The key insight is that the principal-component projections can be expressed entirely through the $N \times N$ kernel (Gram) matrix:

Kernel PCA

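Define the kernel matrix $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \boldsymbol{\phi}(\mathbf{x}_n)^\top\boldsymbol{\phi}(\mathbf{x}_m)$. Compute the eigenvectors $\mathbf{a}_i$ of $\mathbf{K}$ (not $\mathbf{C}_\phi$), i.e. solve $\mathbf{K}\mathbf{a}_i = \lambda_i N \mathbf{a}_i$, where $\lambda_i$ is the corresponding eigenvalue of $\mathbf{C}_\phi$, and scale each eigenvector so that $\lambda_i N\,\mathbf{a}_i^\top\mathbf{a}_i = 1$, which gives the associated feature-space direction unit length. If the mapped features are not centered, first replace $\mathbf{K}$ by $\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N\mathbf{K}\mathbf{1}_N$, where $\mathbf{1}_N$ is the $N \times N$ matrix with every entry equal to $1/N$, and use the centered kernel values in the projection. The $i$-th principal component projection of a data point $\mathbf{x}_n$ is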

$$z_{ni} = \sum_{m=1}^N a_{im}\,k(\mathbf{x}_m, \mathbf{x}_n).$$

The kernel function $k$, chosen by the practitioner, implicitly defines a feature space that may be infinite-dimensional; $\boldsymbol{\phi}$ itself is never computed explicitly. This is the kernel trick applied to PCA (Bishop §12.3).
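The procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names rbf_kernel and kernel_pca, the ring radii (0.3 and 1.0), and the kernel width gamma = 2.0 are all choices made for this example. The Gram matrix is centered explicitly because RBF features do not have zero mean, and each eigenvector is rescaled so that the corresponding feature-space direction has unit length.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all row pairs of A and B."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def kernel_pca(X, gamma, n_components):
    """Return training projections Z (N x n_components) and Gram eigenvalues."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the Gram matrix: equivalent to centering phi(x) in feature space.
    one_N = np.full((N, N), 1.0 / N)
    K_c = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    eigvals, eigvecs = np.linalg.eigh(K_c)
    top = np.argsort(eigvals)[::-1][:n_components]
    lam, U = eigvals[top], eigvecs[:, top]
    # Rescale unit eigenvectors so the feature-space directions have unit norm.
    A = U / np.sqrt(lam)
    # z_{ni} = sum_m a_{im} k(x_m, x_n), using the centered kernel values.
    Z = K_c @ A
    return Z, lam

# Assumed toy data: two concentric rings with 100 evenly spaced points each.
n = 100
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
X = np.vstack([0.3 * ring, 1.0 * ring])

Z, lam = kernel_pca(X, gamma=2.0, n_components=2)
inner, outer = Z[:n, 0], Z[n:, 0]
print("first component, inner ring range:", inner.min(), inner.max())
print("first component, outer ring range:", outer.min(), outer.max())
```

With these particular settings the first component takes one sign on the inner ring and the opposite sign on the outer ring. Whether the leading component is the ring-separating one depends on the radii and on gamma, so other settings may place it in a later component.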

Why the Kernel Matrix, Not the Covariance?

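When $\boldsymbol{\phi}(\mathbf{x})$ has very high or infinite dimension (e.g., RBF kernel), $\mathbf{C}_\phi$ cannot be stored or decomposed. But $\mathbf{K}$ is always $N \times N$, regardless of the feature-space dimension, so the cost depends only on the number of data points: $O(N^2)$ memory and $O(N^3)$ time for a full eigendecomposition, which is tractable for moderate $N$ but expensive for very large datasets. The eigenvectors of $\mathbf{K}$ encode the same principal directions in a compact form, since each direction can be written as $\mathbf{v}_i = \sum_n a_{in}\boldsymbol{\phi}(\mathbf{x}_n)$.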
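One way to check this equivalence numerically is to use the linear kernel $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top\mathbf{y}$, for which the feature space is the input space and both routes can be computed. The sketch below uses assumed random data and illustrative variable names; it compares the covariance route with the Gram-matrix route.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy data: 50 points in R^3 with unequal variances per axis.
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)
N = len(Xc)

# Route 1: eigenvectors of the 3 x 3 covariance matrix.
C = Xc.T @ Xc / N
c_vals, c_vecs = np.linalg.eigh(C)
c_vals, c_vecs = c_vals[::-1], c_vecs[:, ::-1]   # descending order
Z_cov = Xc @ c_vecs

# Route 2: eigenvectors of the 50 x 50 Gram matrix of the linear kernel.
# Because Xc is already centered, K is the centered Gram matrix.
K = Xc @ Xc.T
k_vals, k_vecs = np.linalg.eigh(K)
k_vals, k_vecs = k_vals[::-1][:3], k_vecs[:, ::-1][:, :3]
A = k_vecs / np.sqrt(k_vals)    # rescaled coefficient vectors a_i
Z_gram = K @ A

print("Gram eigenvalues / N:", k_vals / N)
print("covariance eigenvalues:", c_vals)
print("max |projection| difference:",
      np.max(np.abs(np.abs(Z_gram) - np.abs(Z_cov))))
```

The nonzero eigenvalues of $\mathbf{K}$ equal $N$ times those of the covariance matrix, and the projections agree up to the arbitrary sign of each eigenvector, which is why the comparison uses absolute values.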

3. Autoencoders

An autoencoder is a neural network trained to reconstruct its input through a low-dimensional bottleneck:

Autoencoder Architecture

Encoder $f_{\mathbf{W}_1}: \mathbb{R}^d \to \mathbb{R}^M$ — maps input $\mathbf{x}$ to latent code $\mathbf{z} = f(\mathbf{x})$.
Decoder $g_{\mathbf{W}_2}: \mathbb{R}^M \to \mathbb{R}^d$ — maps latent code $\mathbf{z}$ back to reconstruction $\hat{\mathbf{x}} = g(\mathbf{z})$.
Loss: minimize $\sum_n \|\mathbf{x}_n - g(f(\mathbf{x}_n))\|^2$ over all parameters $\mathbf{W}_1, \mathbf{W}_2$ via stochastic gradient descent.

When encoder and decoder are linear (no activation functions), the autoencoder is a two-layer linear network, and its optimal solution spans the same subspace as PCA. Nonlinear activations in a multi-layer encoder and decoder allow the network to represent curved manifolds.
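A minimal nonlinear autoencoder can be sketched in plain NumPy with hand-written backpropagation. Everything specific here is an assumption made for illustration: the toy data (a noisy parabolic arc in $\mathbb{R}^2$), the hidden width of 8, the tanh activations, the learning rate, and the parameter names. A practical implementation would normally use an automatic-differentiation library instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: a noisy parabolic arc in R^2 (d = 2), compressed to M = 1.
t = rng.uniform(-1.0, 1.0, size=(256, 1))
X = np.hstack([t, t ** 2]) + 0.01 * rng.normal(size=(256, 2))

d, H, M = 2, 8, 1   # input dim, hidden width (assumption), latent dim
params = {
    "enc1": rng.normal(scale=0.5, size=(d, H)), "b_enc1": np.zeros(H),
    "enc2": rng.normal(scale=0.5, size=(H, M)), "b_enc2": np.zeros(M),
    "dec1": rng.normal(scale=0.5, size=(M, H)), "b_dec1": np.zeros(H),
    "dec2": rng.normal(scale=0.5, size=(H, d)), "b_dec2": np.zeros(d),
}

def forward(X, p):
    h1 = np.tanh(X @ p["enc1"] + p["b_enc1"])
    z = h1 @ p["enc2"] + p["b_enc2"]          # latent code z = f(x)
    h2 = np.tanh(z @ p["dec1"] + p["b_dec1"])
    x_hat = h2 @ p["dec2"] + p["b_dec2"]      # reconstruction g(z)
    return h1, z, h2, x_hat

def loss(X, p):
    x_hat = forward(X, p)[-1]
    return np.mean(np.sum((X - x_hat) ** 2, axis=1))

def gradients(X, p):
    """Backpropagate the mean squared reconstruction error."""
    h1, z, h2, x_hat = forward(X, p)
    g_out = 2.0 * (x_hat - X) / len(X)
    g_a2 = (g_out @ p["dec2"].T) * (1.0 - h2 ** 2)   # through decoder tanh
    g_z = g_a2 @ p["dec1"].T
    g_a1 = (g_z @ p["enc2"].T) * (1.0 - h1 ** 2)     # through encoder tanh
    return {
        "dec2": h2.T @ g_out, "b_dec2": g_out.sum(axis=0),
        "dec1": z.T @ g_a2, "b_dec1": g_a2.sum(axis=0),
        "enc2": h1.T @ g_z, "b_enc2": g_z.sum(axis=0),
        "enc1": X.T @ g_a1, "b_enc1": g_a1.sum(axis=0),
    }

initial_loss = loss(X, params)
lr, batch_size = 0.02, 32
for step in range(3000):
    # Stochastic gradient descent: one random minibatch per step.
    batch = X[rng.choice(len(X), size=batch_size, replace=False)]
    grads = gradients(batch, params)
    for name in params:
        params[name] -= lr * grads[name]
final_loss = loss(X, params)

print("reconstruction loss before training:", initial_loss)
print("reconstruction loss after training: ", final_loss)
```

The bottleneck has a single unit, so any reduction in reconstruction error must come from the network learning a one-dimensional coordinate along the arc.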

4. PCA as a Linear Autoencoder

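With linear encoder $f(\mathbf{x}) = \mathbf{W}_1(\mathbf{x} - \bar{\mathbf{x}})$ and linear decoder $g(\mathbf{z}) = \mathbf{W}_2\mathbf{z} + \bar{\mathbf{x}}$, the reconstruction error is minimized when the columns of $\mathbf{W}_2$ span the same subspace as the top-$M$ principal components and $\mathbf{W}_1$ projects onto that subspace. The solution is not unique: for any invertible $M \times M$ matrix $\mathbf{A}$, the pair $(\mathbf{A}\mathbf{W}_1, \mathbf{W}_2\mathbf{A}^{-1})$ gives the same reconstruction, so the learned weights need not be orthonormal or ordered by variance as the PCA eigenvectors are. This is the Lecture 10.2 minimum-reconstruction-error derivation expressed as a neural network.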
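The relationship can be checked numerically without training anything. The sketch below uses assumed random data and illustrative names. It evaluates the linear autoencoder's reconstruction error for three weight choices: the PCA eigenvectors, an invertibly re-mixed version of them, and an arbitrary orthonormal basis of some other 2D subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed toy data: 200 points in R^4 with unequal variances per axis.
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])
x_bar = X.mean(axis=0)
Xc = X - x_bar
M = 2

# PCA: top-M eigenvectors of the sample covariance.
S = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
U = eigvecs[:, ::-1][:, :M]              # d x M, leading eigenvectors
discarded = eigvals[::-1][M:].sum()      # sum of the discarded eigenvalues

def recon_error(W1, W2):
    """Mean squared error of the linear autoencoder with weights W1, W2."""
    Z = Xc @ W1.T                 # encoder: z = W1 (x - x_bar)
    X_hat = Z @ W2.T + x_bar      # decoder: x_hat = W2 z + x_bar
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# (a) Weights taken directly from PCA.
err_pca = recon_error(U.T, U)

# (b) Same subspace, re-mixed by an arbitrary invertible matrix A.
A = np.array([[2.0, 1.0], [0.0, 0.5]])
err_mixed = recon_error(A @ U.T, U @ np.linalg.inv(A))

# (c) Orthonormal basis of a different, randomly chosen 2D subspace.
Q, _ = np.linalg.qr(rng.normal(size=(4, M)))
err_other = recon_error(Q.T, Q)

print("PCA weights:          ", err_pca)
print("re-mixed PCA weights: ", err_mixed)
print("other subspace:       ", err_other)
print("sum of discarded eigenvalues:", discarded)
```

Cases (a) and (b) give the same error, equal to the sum of the discarded eigenvalues, while case (c) cannot do better. This is the sense in which a linear autoencoder identifies the principal subspace but not the individual principal directions.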

5. Variational Autoencoders

A variational autoencoder (VAE) (Kingma & Welling, 2014) places a probabilistic interpretation on the latent space: the encoder maps $\mathbf{x}$ to a distribution $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x})))$, and the decoder defines $p_\theta(\mathbf{x} \mid \mathbf{z})$. Here $\phi$ denotes the encoder parameters, not the feature map of Section 2. Training maximizes a lower bound on the log-likelihood (the ELBO), which trades off reconstruction quality against the divergence of $q_\phi$ from a prior $p(\mathbf{z})$, typically $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The key benefit is a structured latent space from which new samples can be drawn and decoded to generate new data points.
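For a single data point the ELBO can be written as $\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - \mathrm{KL}(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}))$. The sketch below illustrates two ingredients used when optimizing it, assuming a standard normal prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ and an encoder that outputs the log-variance (a common parameterization, assumed here rather than taken from the lecture). The encoder outputs are hypothetical hard-coded values, and the function names are illustrative.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (diagonal Gaussian)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for two inputs, latent dimension 2.
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.zeros((2, 2))          # sigma^2 = 1 in every dimension

z = reparameterize(mu, log_var, rng)      # one latent sample per input
kl = kl_to_standard_normal(mu, log_var)   # one KL value per input

print("sampled latent codes:\n", z)
print("KL terms:", kl)
```

The first input's posterior coincides with the prior, so its KL term is zero; the second is shifted away from the origin and is penalized by $\tfrac{1}{2}(1^2 + 2^2) = 2.5$. Writing the sample as a deterministic function of $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, and independent noise is what allows gradients to flow through the sampling step.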

Generative Capabilities

Sample a random $\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ and pass it through a trained decoder to generate a new image never seen during training. Interpolating between two latent codes $\mathbf{z}_1$ and $\mathbf{z}_2$ produces a smooth morphing between the corresponding images. Applications include face generation, molecule discovery for drug design, and data augmentation — as well as ethically problematic deepfakes.
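Sampling and interpolation can be sketched as follows. The decoder here is a stand-in with fixed random weights, used only so that the example runs; it is hypothetical and does not represent a trained VAE. The names decode, W_dec, latent_dim, and data_dim are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5

# Stand-in for a trained decoder (hypothetical): fixed random weights.
W_dec = rng.normal(size=(latent_dim, data_dim))

def decode(z):
    """Map latent codes (rows of z) to data space."""
    return np.tanh(z @ W_dec)

# Generation: draw latent codes from the prior N(0, I) and decode them.
new_samples = decode(rng.standard_normal((3, latent_dim)))

# Interpolation: decode points on the straight line between two codes.
z1 = rng.standard_normal(latent_dim)
z2 = rng.standard_normal(latent_dim)
ts = np.linspace(0.0, 1.0, 7)
path = np.stack([(1.0 - t) * z1 + t * z2 for t in ts])
frames = decode(path)   # frames[0] decodes z1, frames[-1] decodes z2

print("generated samples shape:", new_samples.shape)
print("interpolation frames shape:", frames.shape)
```

With a real trained decoder, each row of frames would be one intermediate image in the morph between the two endpoints.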

Ethical Note

Generative models that can synthesize realistic images, audio, or text raise serious ethical concerns around misinformation and privacy (deepfakes). When applying these techniques, consider carefully the intended use, potential for misuse, and broader societal impact.