Lecture 8.2
Neural Networks: Universal Approximation
The Universal Approximation Theorem gives neural networks their theoretical foundation: a single hidden layer with enough units can approximate any continuous function on a compact domain to arbitrary precision. Deeper networks can achieve this more efficiently, producing exponentially more linear regions per parameter than shallow networks.
- State the Universal Approximation Theorem and identify its key conditions.
- Use ReLU networks to give an intuitive piecewise-linear proof of the theorem.
- Explain why depth is more parameter-efficient than width for representing complex functions.
- Connect the theorem to regression and classification as function approximation problems.
1. The Universal Approximation Theorem
Let $f : K \to \mathbb{R}$ be any continuous function on a compact domain $K \subset \mathbb{R}^d$, and let $h$ be any continuous non-polynomial activation function. Then for every $\varepsilon > 0$, there exists a two-layer neural network $\hat{f}$ (one hidden layer with activation $h$, followed by a linear output) with a finite number of hidden units $M$ such that
$$\sup_{\mathbf{x} \in K} |f(\mathbf{x}) - \hat{f}(\mathbf{x})| < \varepsilon.$$
A smaller approximation error $\varepsilon$ generally requires more hidden units $M$.
The theorem applies to any non-polynomial activation (sigmoid, tanh, ReLU, etc.). It is an existence result: the architecture (two layers with enough width) is expressive enough, but the theorem says neither how large $M$ must be nor how to find the right weights.
2. Intuition via ReLU Networks
Consider a 1D input. Each hidden unit in a ReLU network computes
$$z_m(x) = \max(0,\, w_m x + b_m),$$
a ramp function with a kink at $x = -b_m/w_m$. For $w_m > 0$ it is zero to the left of the kink and linear with slope $w_m$ to the right; for $w_m < 0$ the two sides are mirrored. Taking linear combinations $\hat{f}(x) = \sum_m v_m z_m(x)$ produces a piecewise-linear function. The transition points sit at $-b_m/w_m$, so they are set by the biases and hidden weights together, and the slope of $\hat{f}$ changes by $v_m |w_m|$ at the kink of unit $m$.
With $M=3$ ReLU units, each contributing one kink, one can represent a function with up to 4 distinct linear pieces, which is enough for a coarse approximation of a parabola or a half-sine. With $M=9$, we can approximate a full sine wave within a tighter error band $\varepsilon$. As $M \to \infty$, the piecewise-linear approximation can be made to converge uniformly to any continuous function on a bounded interval.
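The piecewise-linear construction can be checked numerically. The sketch below is one possible construction, not the only one: it assumes equally spaced kinks, fixes every hidden weight to $w_m = 1$, and adds an output bias $c = f(a)$ that is not part of the sum $\sum_m v_m z_m(x)$ in the text. The function names (`relu_interpolant`, `predict`, `sup_error`) are hypothetical, and the supremum is only estimated on a finite grid.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_interpolant(f, a, b, M):
    """Build a one-hidden-layer ReLU net with M units that linearly
    interpolates f at M + 1 equally spaced knots on [a, b].

    Returns (kinks, v, c): kink locations, output weights, output bias.
    """
    knots = np.linspace(a, b, M + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope on each of the M segments
    # v[0] is the first slope; v[k] is the slope change at kink k
    v = np.diff(slopes, prepend=0.0)
    return knots[:-1], v, values[0]

def predict(x, kinks, v, c):
    # Hidden unit m computes relu(x - kinks[m]), i.e. w_m = 1, b_m = -kinks[m]
    return c + relu(x[:, None] - kinks[None, :]) @ v

def sup_error(f, a, b, M, n_grid=10_001):
    """Grid estimate of sup |f - f_hat| on [a, b]."""
    kinks, v, c = relu_interpolant(f, a, b, M)
    x = np.linspace(a, b, n_grid)
    return float(np.max(np.abs(f(x) - predict(x, kinks, v, c))))

if __name__ == "__main__":
    for M in (3, 9, 27, 81):
        print(f"M = {M:3d}  estimated sup error = {sup_error(np.sin, 0.0, 2 * np.pi, M):.5f}")
```

Running the loop for a full sine period should show the estimated error shrinking as $M$ grows, which is the behavior the theorem promises in this 1D setting.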
3. Depth vs. Width: Efficiency
The theorem guarantees approximation with a single hidden layer, but at what cost? Approximating a function computed by a deep network (depth $L$) with a shallow one can require exponentially more hidden units. For suitable function classes, the number of units needed as $\varepsilon \to 0$ grows exponentially in the input dimension $d$ for shallow networks but only polynomially for deep ones:
$$M_{\text{shallow}} = \Omega\!\bigl((\tfrac{1}{\varepsilon})^{d}\bigr), \quad \text{vs.} \quad M_{\text{deep}} = O\!\bigl(\text{poly}(\tfrac{1}{\varepsilon})\bigr).$$
Concretely, for ReLU networks with $L$ layers and width $W$:
- The number of linear regions (piecewise-linear pieces) scales as $O(W^L)$ — exponential in depth, polynomial in width.
- For a fixed parameter budget $P$, width and depth satisfy roughly $W^2 L = P$. Deeper, narrower networks create exponentially more regions than shallower, wider ones with the same parameter count.
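The trade-off under a fixed budget can be tabulated with a few lines of arithmetic. This is a rough sketch only: it takes the rule $W^2 L = P$ at face value, treats $W^L$ as the scaling of the region count (an upper-bound scaling, not the number of regions a trained network actually realizes), and uses hypothetical helper names. It reports $\log_{10} W^L$ to keep the numbers readable.

```python
import math

def width_for_budget(P, L):
    """Largest integer width W with W**2 * L <= P."""
    return math.isqrt(P // L)

def log10_region_bound(P, L):
    """log10 of W**L for the width allowed by budget P at depth L."""
    W = width_for_budget(P, L)
    if W < 1:
        return float("-inf")   # budget too small for even one unit per layer
    return L * math.log10(W)

if __name__ == "__main__":
    P = 10_000  # illustrative parameter budget
    for L in (1, 4, 16, 25):
        W = width_for_budget(P, L)
        print(f"L = {L:2d}  W = {W:3d}  log10(W^L) = {log10_region_bound(P, L):.2f}")
```

Under these assumptions the exponent grows quickly with depth even though the width shrinks. The bound degenerates once the network becomes extremely narrow ($W = 1$ gives $W^L = 1$), so depth cannot be traded for width without limit.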
This exponential scaling in depth is one explanation offered for the empirical success of deep learning: for a fixed parameter budget, going deeper can be far more parameter-efficient than going wider. It is also part of the theoretical motivation for the shift from shallow models such as SVMs to deep neural networks in the 2010s.
4. Function Approximation in Practice
Both regression and classification can be cast as function approximation:
- Regression: approximate the unknown mapping $f : \mathbf{x} \mapsto t$ from noisy samples.
- Classification: approximate the posterior $p(C_k|\mathbf{x})$, whose level sets give the decision boundaries.
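To make the regression case concrete, the sketch below fits a one-hidden-layer ReLU network to noisy 1D samples. It deliberately sidesteps gradient-based training: the hidden weights are fixed at random signs with random kink locations (an assumption made purely for illustration), and only the output weights and an output bias are fitted by least squares. The function name `fit_random_relu_regressor` and the choice of a noisy sine as target are hypothetical.

```python
import numpy as np

def fit_random_relu_regressor(x, t, M=30, seed=0):
    """Fit output weights of a ReLU net with M fixed random hidden units.

    x, t: 1D arrays of inputs and noisy targets.
    Returns a function predict(x_new) -> predictions.
    """
    rng = np.random.default_rng(seed)
    w = rng.choice([-1.0, 1.0], size=M)            # hidden weights: random signs (assumption)
    kinks = rng.uniform(x.min(), x.max(), size=M)  # kink location of each unit
    b = -w * kinks                                 # so that w*x + b = 0 at the kink

    def features(x_in):
        H = np.maximum(0.0, np.outer(x_in, w) + b)       # hidden activations, shape (N, M)
        return np.column_stack([H, np.ones_like(x_in)])  # extra column for the output bias

    # Least-squares fit of the output layer only
    v, *_ = np.linalg.lstsq(features(x), t, rcond=None)

    def predict(x_new):
        return features(x_new) @ v

    return predict

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 2 * np.pi, 200)
    t = np.sin(x) + 0.1 * rng.normal(size=x.size)   # noisy samples of the unknown mapping
    predict = fit_random_relu_regressor(x, t)
    x_grid = np.linspace(0.0, 2 * np.pi, 1000)
    rmse = np.sqrt(np.mean((predict(x_grid) - np.sin(x_grid)) ** 2))
    print(f"RMSE against the noise-free target: {rmse:.4f}")
```

Fitting only the output layer turns the problem into linear regression on fixed features, which shows the capacity of the architecture without saying anything about how well training all the weights would work.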
The theorem guarantees that, given sufficient capacity, a neural network can represent the target function. Whether gradient-based training actually finds it is a separate, harder question addressed in Lectures 8.3–8.5.