Lecture 4.2
Bias-Variance Decomposition
Decomposing generalization error into bias, variance, and irreducible noise — a formal framework for understanding overfitting.
- Derive the regression function as the minimizer of the expected squared loss.
- Decompose the expected loss into bias², variance, and irreducible noise.
- Connect bias and variance to underfitting and overfitting.
- Explain qualitatively how the regularization parameter $\lambda$ trades off bias and variance.
1. The Expected Loss
So far we have evaluated loss on a fixed dataset. Here we ask: what loss should we expect a model to incur, averaged over all possible datasets drawn from the true data distribution?
Assume data pairs $(x, t)$ are drawn i.i.d. from a joint distribution $p(x, t)$. Under the squared loss, the expected loss is
$$\mathbb{E}[\ell] = \iint (y(x) - t)^2 \, p(x, t) \, dx \, dt.$$
In practice we observe only one dataset, but in theory we can imagine infinitely many datasets drawn from $p(x,t)$. For a fixed predictor $y$, the integral above averages the squared error over the data distribution. Averaging it further over all such training datasets gives the expected loss of the learning procedure rather than of a single run.
2. The Regression Function
Because $p(x, t) = p(t|x)\,p(x)$, the expected loss can be minimized separately for each $x$: the optimal prediction $y(x)$ is the value that minimizes
$$\int (y(x) - t)^2 \, p(t|x) \, dt.$$
Differentiating with respect to $y(x)$ and setting the derivative to zero gives
$$y(x) = \int t \, p(t|x) \, dt = \mathbb{E}[t \mid x].$$
The function $h(x) = \mathbb{E}[t \mid x]$ is called the regression function. It is the best possible predictor under squared loss: no other function of $x$ achieves a lower expected loss under the true data distribution.
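The differentiation step can be spelled out as follows, using only the symbols already defined. For a fixed $x$,
$$\frac{\partial}{\partial y(x)} \int (y(x) - t)^2 \, p(t|x) \, dt = 2 \int (y(x) - t) \, p(t|x) \, dt = 0.$$
Since $\int p(t|x) \, dt = 1$, this rearranges to
$$y(x) \int p(t|x) \, dt = \int t \, p(t|x) \, dt \quad\Longrightarrow\quad y(x) = \mathbb{E}[t \mid x].$$
The second derivative with respect to $y(x)$ equals $2 > 0$, so this stationary point is a minimum.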
In the noisy sine example, $t = \sin(2\pi x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \beta^{-1})$, so $h(x) = \sin(2\pi x)$.
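The claim that $h(x) = \sin(2\pi x)$ is the best predictor here can be checked numerically. The sketch below is illustrative only: it assumes $x \sim \mathrm{Uniform}[0,1]$ and a noise precision $\beta = 25$ (neither is fixed by the text at this point), and the helper names `sample_pairs` and `expected_loss` are hypothetical. It estimates the expected squared loss of three predictors by sampling from $p(x,t)$.

```python
import numpy as np

def sample_pairs(n, beta, rng):
    """Draw n pairs (x, t) with x ~ Uniform[0, 1] and t = sin(2*pi*x) + noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, size=n)
    return x, t

def expected_loss(predict, x, t):
    """Monte Carlo estimate of E[(y(x) - t)^2] from samples of p(x, t)."""
    return float(np.mean((predict(x) - t) ** 2))

rng = np.random.default_rng(0)
beta = 25.0  # assumed noise precision, so the noise variance is 1 / beta = 0.04
x, t = sample_pairs(200_000, beta, rng)

# Regression function h(x) = sin(2*pi*x).
loss_h = expected_loss(lambda x: np.sin(2 * np.pi * x), x, t)
# The same curve shifted upward by 0.1 (a systematically wrong predictor).
loss_shifted = expected_loss(lambda x: np.sin(2 * np.pi * x) + 0.1, x, t)
# A constant predictor that ignores x.
loss_zero = expected_loss(lambda x: np.zeros_like(x), x, t)

print(f"h(x):        {loss_h:.4f}")
print(f"h(x) + 0.1:  {loss_shifted:.4f}")
print(f"y(x) = 0:    {loss_zero:.4f}")
```

Under these assumptions the loss of $h(x)$ should sit close to the noise variance $\beta^{-1}$, and the other two predictors should do worse, since each adds a squared deviation from $h(x)$ on top of the noise.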
3. Bias-Variance-Noise Decomposition
Writing the error relative to the regression function $h(x)$ and taking the expectation over datasets $\mathcal{D}$ (each training a model $y_\mathcal{D}(x)$), the expected loss decomposes into three terms. Let $\bar{y}(x) = \mathbb{E}_\mathcal{D}[y_\mathcal{D}(x)]$ denote the mean model. The cross term between bias and variance vanishes, yielding:
- Bias² — squared error of the mean model $\bar{y}(x)$ relative to the ground truth $h(x)$. Measures systematic error.
- Variance — how much individual models $y_\mathcal{D}(x)$ scatter around $\bar{y}(x)$. Measures sensitivity to the training set.
- Noise — irreducible variance of $t$ given $x$. Cannot be reduced by any model.
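Written out, the decomposition takes the following form, where $p(x)$ denotes the marginal of $p(x,t)$. First, adding and subtracting $h(x)$ inside the squared loss splits the expected loss of any fixed predictor $y(x)$ into a model-dependent term and a noise term:
$$\mathbb{E}[\ell] = \int (y(x) - h(x))^2 \, p(x) \, dx + \iint (h(x) - t)^2 \, p(x, t) \, dx \, dt.$$
Next, for a model trained on a dataset $\mathcal{D}$, adding and subtracting $\bar{y}(x)$ and averaging over datasets gives
$$\mathbb{E}_\mathcal{D}[(y_\mathcal{D}(x) - h(x))^2] = (\bar{y}(x) - h(x))^2 + \mathbb{E}_\mathcal{D}[(y_\mathcal{D}(x) - \bar{y}(x))^2],$$
because the cross term $2\,(\bar{y}(x) - h(x))\,\mathbb{E}_\mathcal{D}[y_\mathcal{D}(x) - \bar{y}(x)]$ is zero by the definition of $\bar{y}(x)$. Combining the two steps:
$$\mathbb{E}_\mathcal{D}\big[\mathbb{E}[\ell]\big] = \underbrace{\int (\bar{y}(x) - h(x))^2 \, p(x) \, dx}_{\text{bias}^2} + \underbrace{\int \mathbb{E}_\mathcal{D}[(y_\mathcal{D}(x) - \bar{y}(x))^2] \, p(x) \, dx}_{\text{variance}} + \underbrace{\iint (h(x) - t)^2 \, p(x, t) \, dx \, dt}_{\text{noise}}.$$
In the noisy sine example the noise term equals $\beta^{-1}$.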
4. The Bias-Variance Tradeoff
In ridge regression, the regularization parameter $\lambda$ directly controls the tradeoff:
- Large $\lambda$ — solutions are pulled toward zero; all models look similar (low variance) but systematically differ from the ground truth (high bias). This is underfitting.
- Small $\lambda$ — models closely fit each individual dataset (low bias of the mean model) but vary wildly across datasets (high variance). This is overfitting.
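The pull toward zero can be seen on a single dataset. The sketch below is a minimal illustration, not a reference implementation: it assumes the closed-form ridge solution $w = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$, nine Gaussian basis functions of width 0.1 plus a constant column, a noise standard deviation of 0.3, and a penalty that also applies to the constant weight. The helper names `gaussian_design` and `fit_ridge` are hypothetical.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix with a constant column plus one Gaussian bump per center."""
    bumps = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), bumps])

def fit_ridge(Phi, t, lam):
    """Closed-form ridge solution w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=25)  # assumed noise level

centers = np.linspace(0.0, 1.0, 9)  # assumed basis layout
Phi = gaussian_design(x, centers, width=0.1)

lams = [1e-3, 1e-1, 1e1]
norms, train_mses = [], []
for lam in lams:
    w = fit_ridge(Phi, t, lam)
    norms.append(float(np.linalg.norm(w)))
    train_mses.append(float(np.mean((Phi @ w - t) ** 2)))
    print(f"lambda={lam:g}  ||w||={norms[-1]:.3f}  train MSE={train_mses[-1]:.4f}")
```

As $\lambda$ grows, the weight norm shrinks and the training error rises: the fit becomes stiffer and less responsive to the particular dataset, which is the mechanism behind lower variance and higher bias.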
Generate 100 datasets, each with 25 observations from $t = \sin(2\pi x) + \epsilon$, $x \sim \mathrm{Uniform}[0,1]$. Fit a Gaussian-basis ridge model to each dataset. Because the ground truth $h(x) = \sin(2\pi x)$ is known, bias² and variance can be estimated by Monte Carlo integration over $M$ evaluation points $x_m$:
$$\mathrm{Bias}^2 \approx \frac{1}{M}\sum_{m=1}^{M}(\bar{y}(x_m) - \sin(2\pi x_m))^2, \qquad \mathrm{Variance} \approx \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_\mathcal{D}[(y_\mathcal{D}(x_m) - \bar{y}(x_m))^2].$$
Here $\bar{y}$ and $\mathbb{E}_\mathcal{D}$ are themselves approximated by averages over the 100 fitted models. Plotting bias² and variance against $\lambda$ shows the two curves crossing near the $\lambda$ that minimizes the true test error, which is also roughly where their sum is smallest. The gap between bias² + variance and the test error is the irreducible noise, $\beta^{-1}$ in this example.
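The experiment can be sketched in a few lines. This is one possible setup under stated assumptions, not the lecture's own code: 24 Gaussian basis functions of width 0.1 plus a constant column, a noise standard deviation of 0.3, $M = 500$ evaluation points drawn uniformly from $[0,1]$, and a closed-form ridge fit that also penalizes the constant weight. The function names are hypothetical.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix with a constant column plus one Gaussian bump per center."""
    bumps = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), bumps])

def fit_ridge(Phi, t, lam):
    """Closed-form ridge solution w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

def bias_variance(lam, n_datasets=100, n_points=25, n_eval=500,
                  noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of the ridge fit at one value of lambda."""
    rng = np.random.default_rng(seed)
    centers = np.linspace(0.0, 1.0, 24)  # assumed basis layout
    width = 0.1                          # assumed basis width

    # Evaluation points x_m and the known ground truth h(x_m).
    x_eval = rng.uniform(0.0, 1.0, size=n_eval)
    Phi_eval = gaussian_design(x_eval, centers, width)
    h = np.sin(2 * np.pi * x_eval)

    # One row of predictions per training dataset.
    preds = np.empty((n_datasets, n_eval))
    for d in range(n_datasets):
        x = rng.uniform(0.0, 1.0, size=n_points)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)
        w = fit_ridge(gaussian_design(x, centers, width), t, lam)
        preds[d] = Phi_eval @ w

    y_bar = preds.mean(axis=0)                       # mean model at each x_m
    bias2 = float(np.mean((y_bar - h) ** 2))         # average over x_m
    variance = float(np.mean(preds.var(axis=0)))     # spread around y_bar
    return bias2, variance

lams = [1e-3, 1e-1, 1e1]
results = [bias_variance(lam) for lam in lams]
for lam, (bias2, variance) in zip(lams, results):
    print(f"lambda={lam:g}  bias^2={bias2:.4f}  variance={variance:.4f}  "
          f"sum={bias2 + variance:.4f}")
```

Because the seed is fixed, every $\lambda$ is evaluated on the same 100 datasets, so differences between rows reflect the regularization strength rather than sampling luck. Adding `noise_std ** 2` to the sum gives an estimate of the expected test error.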
This decomposition is a theoretical tool. In practice, the ground truth $h(x)$ and the distribution $p(x,t)$ are unknown, so bias and variance cannot be computed directly. The lesson is qualitative: increasing model complexity (or decreasing $\lambda$) tends to reduce bias but increase variance. Cross-validation (Lecture 4.1) is the practical tool for finding the right balance.