Lecture 4.2

Bias-Variance Decomposition

Decomposing generalization error into bias, variance, and irreducible noise — a formal framework for understanding overfitting.

Learning Objectives

Derive the regression function as the minimizer of the expected squared loss.
Decompose the expected loss into bias², variance, and irreducible noise.
Connect bias and variance to underfitting and overfitting.
Explain qualitatively how the regularization parameter $\lambda$ trades off bias and variance.

1. The Expected Loss

So far we have evaluated the loss on a fixed dataset. Here we ask: what loss should we expect a model to incur, averaged over all possible datasets drawn from the true data distribution?

Assume data pairs $(x, t)$ are drawn i.i.d. from a joint distribution $p(x, t)$. Under the squared loss, the expected loss is

$$\mathbb{E}[\ell] = \iint (y(x) - t)^2 \, p(x, t) \, dx \, dt.$$

Why the Joint Distribution?

In practice we observe only one dataset, but in theory we can imagine infinitely many datasets drawn from $p(x,t)$. The expected loss is the average error our model achieves over all such datasets: a property of the learning procedure, not of a single run.

2. The Regression Function

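Since $p(x, t) = p(t|x)\,p(x)$ and the prediction $y(x)$ can be chosen independently at each $x$, the expected loss is minimized by minimizing, for every fixed $x$, the conditional expected loss over $t$: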

$$\int (y(x) - t)^2 \, p(t|x) \, dt.$$

Differentiating with respect to $y(x)$, setting the derivative to zero, and using $\int p(t|x)\, dt = 1$ gives

$$y(x) = \int t \, p(t|x) \, dt = \mathbb{E}[t \mid x].$$
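For readers who want the intermediate algebra, the differentiation step can be written out as follows. This is a sketch that assumes differentiation and integration can be exchanged, which holds for well-behaved $p(t|x)$.

$$\frac{\partial}{\partial y(x)} \int (y(x) - t)^2 \, p(t|x) \, dt = 2 \int (y(x) - t) \, p(t|x) \, dt = 2\, y(x) - 2 \int t \, p(t|x) \, dt.$$

Setting this to zero yields the conditional mean. The second derivative is $2 \int p(t|x)\, dt = 2 > 0$, so the stationary point is a minimum.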

Regression Function

The function $h(x) = \mathbb{E}[t \mid x]$ is called the regression function. It is the best possible predictor under squared loss — the model that minimizes the expected loss over the true data distribution.

In the noisy sine example, $t = \sin(2\pi x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \beta^{-1})$, so $h(x) = \sin(2\pi x)$.
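The claim that $h(x)$ is the best predictor can be checked numerically. The sketch below estimates the expected squared loss by sampling from the noisy sine model and compares three fixed predictors. The value `beta = 25`, the choice $x \sim \mathrm{Uniform}[0,1]$, and the two alternative predictors are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 25.0        # assumed noise precision; the noise variance is 1 / beta
n = 200_000        # number of Monte Carlo samples drawn from p(x, t)

# Draw (x, t) pairs from the noisy sine model, assuming x ~ Uniform[0, 1].
x = rng.uniform(0.0, 1.0, size=n)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, size=n)

def expected_loss(predict):
    """Monte Carlo estimate of E[(y(x) - t)^2] for a fixed predictor y."""
    return float(np.mean((predict(x) - t) ** 2))

# The regression function and two hypothetical competitors.
loss_regression_fn = expected_loss(lambda u: np.sin(2 * np.pi * u))
loss_shifted = expected_loss(lambda u: np.sin(2 * np.pi * u) + 0.1)
loss_constant = expected_loss(lambda u: np.zeros_like(u))

print(f"noise variance 1/beta : {1 / beta:.4f}")
print(f"h(x) = sin(2 pi x)    : {loss_regression_fn:.4f}")
print(f"h(x) + 0.1            : {loss_shifted:.4f}")
print(f"y(x) = 0              : {loss_constant:.4f}")
```

The loss of the regression function should sit close to the noise variance $\beta^{-1}$, and both competitors should do worse by roughly their mean squared distance from $h(x)$.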

3. Bias-Variance-Noise Decomposition

Writing the error relative to the regression function $h(x)$ and taking the expectation over datasets $\mathcal{D}$ (each training a model $y_\mathcal{D}(x)$), the expected loss, averaged over datasets, decomposes into three terms. Let $\bar{y}(x) = \mathbb{E}_\mathcal{D}[y_\mathcal{D}(x)]$ denote the mean model. The cross terms vanish because $\mathbb{E}_\mathcal{D}[y_\mathcal{D}(x) - \bar{y}(x)] = 0$ and $\mathbb{E}[t - h(x) \mid x] = 0$, yielding:

Bias-Variance Decomposition

$$\mathbb{E}[\ell] = \underbrace{\int \bigl(\bar{y}(x) - h(x)\bigr)^2 p(x)\, dx}_{\mathrm{Bias}^2} + \underbrace{\int \mathbb{E}_\mathcal{D}\!\left[(y_\mathcal{D}(x) - \bar{y}(x))^2\right] p(x)\, dx}_{\mathrm{Variance}} + \underbrace{\int \mathrm{Var}[t \mid x]\, p(x)\, dx}_{\mathrm{Noise}}.$$

Bias² — squared error of the mean model $\bar{y}(x)$ relative to the ground truth $h(x)$. Measures systematic error.
Variance — how much individual models $y_\mathcal{D}(x)$ scatter around $\bar{y}(x)$. Measures sensitivity to the training set.
Noise — irreducible variance of $t$ given $x$. Cannot be reduced by any model.
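The decomposition can be derived in two steps. The sketch below uses the notation above and assumes the training set $\mathcal{D}$ is independent of the test pair $(x, t)$.

First, for a fixed dataset, add and subtract $h(x)$ inside the square and average over $t$ given $x$:

$$\mathbb{E}_{t \mid x}\!\left[(y_\mathcal{D}(x) - t)^2\right] = (y_\mathcal{D}(x) - h(x))^2 + 2\,(y_\mathcal{D}(x) - h(x))\,\underbrace{\mathbb{E}_{t \mid x}[h(x) - t]}_{=\,0} + \mathrm{Var}[t \mid x].$$

Second, add and subtract $\bar{y}(x)$ in the first term and average over datasets:

$$\mathbb{E}_\mathcal{D}\!\left[(y_\mathcal{D}(x) - h(x))^2\right] = \mathbb{E}_\mathcal{D}\!\left[(y_\mathcal{D}(x) - \bar{y}(x))^2\right] + 2\,(\bar{y}(x) - h(x))\,\underbrace{\mathbb{E}_\mathcal{D}[y_\mathcal{D}(x) - \bar{y}(x)]}_{=\,0} + (\bar{y}(x) - h(x))^2.$$

Integrating both results against $p(x)$ gives the three terms: bias², variance, and noise.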

4. The Bias-Variance Tradeoff

In ridge regression, the regularization parameter $\lambda$ directly controls the tradeoff:

Large $\lambda$ — solutions are pulled toward zero; all models look similar (low variance) but systematically differ from the ground truth (high bias). This is underfitting.
Small $\lambda$ — models closely fit each individual dataset (low bias of the mean model) but vary wildly across datasets (high variance). This is overfitting.
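The pull toward zero can be seen directly in the closed-form ridge solution $w = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$. The sketch below fits one noisy sine dataset at several values of $\lambda$ and prints the norm of the weight vector. The helper names `gaussian_design` and `ridge_fit`, the nine basis centers, the width 0.1, the noise level 0.3, and the $\lambda$ grid are all hypothetical choices for illustration. The constant term is penalized along with the other weights for simplicity.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix: a constant column followed by Gaussian basis functions."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=25)  # assumed noise level

centers = np.linspace(0.0, 1.0, 9)  # assumed basis centers
Phi = gaussian_design(x, centers, width=0.1)

lambdas = [1e-3, 1e-1, 1e1, 1e3]    # assumed grid
weight_norms = [float(np.linalg.norm(ridge_fit(Phi, t, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, weight_norms):
    print(f"lambda = {lam:9.3f}   ||w|| = {norm:.4f}")
```

The weight norm shrinks as $\lambda$ grows, so at large $\lambda$ every fitted model collapses toward the same near-zero function regardless of the dataset.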

Simulated Experiment

Generate 100 datasets, each with 25 observations from $t = \sin(2\pi x) + \epsilon$, $x \sim \mathrm{Uniform}[0,1]$. Fit a Gaussian-basis ridge model to each dataset. Because the ground truth $h(x) = \sin(2\pi x)$ is known, bias² and variance can be estimated by Monte Carlo integration over $M$ points $x_m$ drawn from $p(x)$, with $\mathbb{E}_\mathcal{D}$ replaced by the average over the 100 fitted models:

$$\mathrm{Bias}^2 \approx \frac{1}{M}\sum_{m=1}^{M}(\bar{y}(x_m) - \sin(2\pi x_m))^2, \qquad \mathrm{Variance} \approx \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_\mathcal{D}[(y_\mathcal{D}(x_m) - \bar{y}(x_m))^2].$$

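Plotting bias² and variance against $\lambda$ shows bias² rising and variance falling as $\lambda$ grows. Since the noise term does not depend on $\lambda$, the true test error is minimized where the sum bias² + variance is smallest; in this experiment that point lies near the crossing of the two curves. The gap between bias² + variance and the expected test error is the irreducible noise floor, which equals $\beta^{-1}$ for the Gaussian noise model above.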
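The experiment can be sketched in a few lines of NumPy. The dataset count and size follow the description above. Everything else is an assumption made for this illustration: the helper names, the 24 basis centers, the width 0.1, the noise standard deviation 0.3, the three $\lambda$ values, and the number of evaluation points $M = 1000$.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix: a constant column followed by Gaussian basis functions."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
n_datasets, n_points = 100, 25        # from the experiment description
noise_std = 0.3                       # assumed noise standard deviation
centers = np.linspace(0.0, 1.0, 24)   # assumed basis centers
width = 0.1                           # assumed basis width
lambdas = np.exp([-2.4, -0.31, 2.6])  # assumed small, medium, large lambda

# Evaluation points x_m drawn from p(x) for the Monte Carlo integral.
x_eval = rng.uniform(0.0, 1.0, size=1000)
Phi_eval = gaussian_design(x_eval, centers, width)
h_eval = np.sin(2 * np.pi * x_eval)

# Draw all datasets once so that every lambda sees the same data.
X = rng.uniform(0.0, 1.0, size=(n_datasets, n_points))
T = np.sin(2 * np.pi * X) + rng.normal(0.0, noise_std, size=X.shape)

results = {}
for lam in lambdas:
    # Row d holds the predictions of the model trained on dataset d.
    preds = np.stack([
        Phi_eval @ ridge_fit(gaussian_design(x, centers, width), t, lam)
        for x, t in zip(X, T)
    ])
    y_bar = preds.mean(axis=0)                      # mean model at each x_m
    bias2 = float(np.mean((y_bar - h_eval) ** 2))   # average squared bias
    variance = float(np.mean(preds.var(axis=0)))    # average spread around y_bar
    results[float(lam)] = (bias2, variance)
    print(f"lambda = {lam:8.4f}  bias^2 = {bias2:.4f}  variance = {variance:.4f}")
```

Sweeping a finer grid of $\lambda$ values and plotting the two columns would give the curves described above. Both quantities are Monte Carlo estimates, so they change slightly with the random seed.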

Remark: Theory vs. Practice

This decomposition is a theoretical tool. In practice, the ground truth $h(x)$ and the distribution $p(x,t)$ are unknown, so bias and variance cannot be computed directly. The lesson is qualitative: increasing model complexity (or decreasing $\lambda$) tends to reduce bias but increase variance. Cross-validation (Lecture 4.1) is the practical tool for finding the right balance.
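As a reminder of how this works when $h(x)$ is unknown, the sketch below picks $\lambda$ by 5-fold cross-validation on a single dataset, using only the observed pairs $(x, t)$. The sine data is used here only to have something to fit. The helper names, basis settings, noise level, fold count, and $\lambda$ grid are assumptions for illustration and are not taken from Lecture 4.1.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix: a constant column followed by Gaussian basis functions."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

def cv_error(x, t, lam, centers, width, n_folds=5):
    """Mean held-out squared error of ridge regression under k-fold CV."""
    indices = np.arange(x.size)
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        w = ridge_fit(gaussian_design(x[train], centers, width), t[train], lam)
        pred = gaussian_design(x[held_out], centers, width) @ w
        errors.append(np.mean((pred - t[held_out]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=25)   # already in random order, so contiguous folds are fine
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=25)  # assumed noise level

centers = np.linspace(0.0, 1.0, 24)        # assumed basis centers
lambdas = np.exp(np.linspace(-6, 4, 11))   # assumed grid, log-spaced
cv_errors = [cv_error(x, t, lam, centers, width=0.1) for lam in lambdas]

best_lambda = float(lambdas[int(np.argmin(cv_errors))])
print(f"lambda with lowest CV error: {best_lambda:.4f}")
```

With only 25 points the cross-validation estimate is itself noisy, so the selected $\lambda$ can shift from one dataset to the next.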