Lecture 3.4
Underfitting & Overfitting
After this lecture you should be able to:
- Define underfitting and overfitting in terms of training and test error, and give an example of each for polynomial regression.
- Explain why overfitting produces large weight values and why this is a diagnostic sign of an over-complex model.
- Sketch the training error and test error as functions of model complexity (order $M$), and explain the shape of each curve.
- Define the generalization gap and explain what a large gap implies about model reliability.
- Describe the two characteristic patterns in the (train error, test error) plane that signal underfitting and overfitting respectively.
- Explain intuitively why gathering more data reduces overfitting.
Linear regression with basis functions gives us control over model complexity: a polynomial of order 1 is very rigid, while one of order 9 is flexible enough to follow far more intricate curves. This flexibility is a double-edged sword. Choosing $M$ too small or too large leads to two distinct failure modes: underfitting and overfitting.
1. Two Failure Modes
Consider fitting polynomials of increasing order $M$ to a small dataset drawn from a sine wave (the "true" function). The MLE solution minimizes the sum of squared errors on the training data.
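Written out, the objective looks roughly as follows. The notation is an assumption made here for illustration, since the lecture text does not fix it: $x_n$ and $t_n$ are the $N$ training inputs and targets, and $w_j$ are the polynomial weights.

$$
y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j, \qquad
E(\mathbf{w}) = \sum_{n=1}^{N} \bigl( y(x_n, \mathbf{w}) - t_n \bigr)^2, \qquad
E_{\mathrm{RMS}} = \sqrt{E(\mathbf{w}) / N}
$$

Here $E(\mathbf{w})$ is the sum of squared errors that MLE minimizes, and $E_{\mathrm{RMS}}$ is the root-mean-square error, which puts errors from datasets of different sizes on the same scale.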
| | Underfitting ($M$ too small) | Overfitting ($M$ too large) |
|---|---|---|
| Training error | High | Near zero |
| Test error | High | Very high |
| Generalization gap | Small | Large |
| Model behaviour | Too rigid; cannot capture data structure | Memorizes noise; does not generalize |
| Weight magnitudes | Moderate | Very large |
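The two columns of the table can be reproduced with a small experiment. The following is a minimal sketch, not the lecture's own code: the noise level, the seed, the dataset sizes, and the helper names `design`, `fit`, and `rmse` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary choice)
NOISE_STD = 0.3                 # assumed noise level, not from the lecture

def true_fn(x):
    # The "true" function: one period of a sine wave on [0, 1].
    return np.sin(2 * np.pi * x)

# Small training set and a larger held-out test set.
x_train = np.linspace(0.0, 1.0, 10)
t_train = true_fn(x_train) + rng.normal(0.0, NOISE_STD, size=x_train.shape)
x_test = rng.uniform(0.0, 1.0, size=1000)
t_test = true_fn(x_test) + rng.normal(0.0, NOISE_STD, size=x_test.shape)

def design(x, order):
    # Design matrix with columns x**0, x**1, ..., x**order.
    return np.vander(x, order + 1, increasing=True)

def fit(x, t, order):
    # Least-squares weights, i.e. the minimizer of the training SSE.
    w, *_ = np.linalg.lstsq(design(x, order), t, rcond=None)
    return w

def rmse(w, x, t):
    order = len(w) - 1
    return float(np.sqrt(np.mean((design(x, order) @ w - t) ** 2)))

# results[order] = (training RMSE, test RMSE)
results = {}
for order in (1, 9):
    w = fit(x_train, t_train, order)
    results[order] = (rmse(w, x_train, t_train), rmse(w, x_test, t_test))
    print(f"M={order}: train RMSE={results[order][0]:.3g}, "
          f"test RMSE={results[order][1]:.3g}")
```

With 10 training points, the order-9 model has as many weights as data points, so its training error should be numerically zero while its test error is not. The order-1 model should show two errors of similar size.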
2. Why Overfitting Produces Large Weights
An order-9 polynomial could represent the same smooth function as an order-3 polynomial, because the order-3 basis functions are a subset of the order-9 basis (setting the six highest-order weights to zero recovers the order-3 model). But MLE minimizes the training SSE, not the true error. When the number of parameters ($M + 1$) is at least as large as the number of data points, the optimizer can find a solution that passes exactly through every training point by using extreme weight values whose large contributions nearly cancel at the training points. The result is a wildly oscillating function that has zero training error but terrible generalization.
Diagnostic. Inspecting the weight vector is a useful sanity check: if fitted weights are orders of magnitude larger than expected given the scale of the data, this is a strong signal of overfitting — even before looking at any test data.
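The weight check can be sketched in a few lines. This is an illustrative example under assumed settings (10 equally spaced points, noise standard deviation 0.3, a fixed seed), not a prescribed procedure; `max_abs_weight` and `data_scale` are names introduced here.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (arbitrary choice)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

# Largest absolute fitted weight for each polynomial order.
max_abs_weight = {}
for order in (1, 3, 9):
    Phi = np.vander(x, order + 1, increasing=True)  # columns x**0 .. x**order
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares fit
    max_abs_weight[order] = float(np.max(np.abs(w)))
    print(f"M={order}: largest |w_j| = {max_abs_weight[order]:.3g}")

# Scale of the targets, for comparison with the weight magnitudes.
data_scale = float(np.max(np.abs(t)))
print(f"largest |t_n| = {data_scale:.3g}")
```

Comparing the printed weight magnitudes against `data_scale` is the sanity check described above: the targets are of order 1, so weights that are several orders of magnitude larger suggest the fit is relying on near-cancellation.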
3. Training and Test Error Curves
Plotting RMSE as a function of model order $M$ reveals a characteristic pattern:
- Training error never increases with $M$, and typically decreases: a more flexible model can always fit the training data at least as well as a less flexible one.
- Test error follows a U-shape: it decreases as the model gains enough capacity to capture the true signal, then increases once the model is complex enough to fit the noise.
The generalization gap — the difference between test error and training error — grows on the right side of this plot. A large gap means the training error is no longer representative of real-world performance.
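The two curves and the gap can be computed directly by sweeping the model order. The sketch below assumes the same sine-plus-noise setup (noise level, seed, and dataset sizes are illustrative choices) and prints the numbers rather than plotting them.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed (arbitrary choice)
NOISE_STD = 0.3                 # assumed noise level

x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, NOISE_STD, size=10)
x_test = rng.uniform(0.0, 1.0, size=1000)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, NOISE_STD, size=1000)

orders = np.arange(0, 10)  # M = 0, 1, ..., 9
train_rmse = np.empty(len(orders))
test_rmse = np.empty(len(orders))

for i, order in enumerate(orders):
    Phi_train = np.vander(x_train, order + 1, increasing=True)
    Phi_test = np.vander(x_test, order + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_train, t_train, rcond=None)
    train_rmse[i] = np.sqrt(np.mean((Phi_train @ w - t_train) ** 2))
    test_rmse[i] = np.sqrt(np.mean((Phi_test @ w - t_test) ** 2))

# Generalization gap: test error minus training error, per model order.
gap = test_rmse - train_rmse

for order, tr, te, g in zip(orders, train_rmse, test_rmse, gap):
    print(f"M={order}: train={tr:.3f}  test={te:.3f}  gap={g:.3f}")
```

Because each model class contains the previous one, the training column cannot go up as $M$ grows. The test column is computed on data the fit never saw, so it is free to rise again.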
4. Diagnosing from the (Train, Test) Plane
Underfitting: both train and test error are high, and the gap between them is small. The model performs poorly everywhere, but at least the training error gives an honest estimate of test performance. The problem is capacity — the model class is too restricted.
Overfitting: train error is low but test error is high, and the gap is large. The training error is misleading — it does not reflect how the model will perform on new data. The problem is that the model has fitted to the noise rather than the signal.
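The two patterns can be turned into a rough rule of thumb. The function below is a hypothetical helper, not a standard method: `acceptable_err` (what counts as "high" error for the task, for example the known noise level) and `small_gap` (the relative gap treated as small) are assumed thresholds that must be chosen per problem.

```python
def diagnose(train_err, test_err, acceptable_err, small_gap=0.25):
    """Classify a (train error, test error) pair.

    acceptable_err and small_gap are illustrative thresholds,
    not values given in the lecture.
    """
    # Gap relative to the test error, so the rule is scale-free.
    rel_gap = (test_err - train_err) / max(test_err, 1e-12)

    # Underfitting: both errors high, gap small.
    if (train_err > acceptable_err and test_err > acceptable_err
            and rel_gap <= small_gap):
        return "underfitting"

    # Overfitting: train error low, test error high, gap large.
    if (train_err <= acceptable_err and test_err > acceptable_err
            and rel_gap > small_gap):
        return "overfitting"

    return "inconclusive"


# Illustrative inputs (made-up numbers, not measured results).
print(diagnose(0.60, 0.65, acceptable_err=0.35))
print(diagnose(0.01, 1.20, acceptable_err=0.35))
print(diagnose(0.28, 0.32, acceptable_err=0.35))
```

Pairs that match neither pattern are reported as inconclusive rather than forced into a category. That covers well-fitted models as well as ambiguous cases.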
5. More Data as a Remedy
With more training data, overfitting becomes harder: the model can no longer pass through every point exactly and must instead find weights that make reasonable compromises. A model that seemed badly overfit with $N = 10$ points may fit well with $N = 100$. The training error rises slightly (the model no longer memorizes every point) while the test error drops substantially.
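A quick way to see this effect is to fit the same order-9 polynomial to training sets of different sizes and score both fits on one shared test set. This is a sketch under assumed settings (noise level, seed, equally spaced training inputs); the helper `make_training_set` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed (arbitrary choice)
NOISE_STD = 0.3                 # assumed noise level
ORDER = 9                       # same model complexity for both fits

def make_training_set(n):
    # n equally spaced inputs with noisy sine targets.
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, NOISE_STD, size=n)
    return x, t

# One shared test set, so the two fits are compared on equal terms.
x_test = rng.uniform(0.0, 1.0, size=1000)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, NOISE_STD, size=1000)
Phi_test = np.vander(x_test, ORDER + 1, increasing=True)

# rmse_by_n[n] = (training RMSE, test RMSE)
rmse_by_n = {}
for n in (10, 100):
    x, t = make_training_set(n)
    Phi = np.vander(x, ORDER + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    train = float(np.sqrt(np.mean((Phi @ w - t) ** 2)))
    test = float(np.sqrt(np.mean((Phi_test @ w - t_test) ** 2)))
    rmse_by_n[n] = (train, test)
    print(f"N={n}: train RMSE={train:.3g}, test RMSE={test:.3g}")
```

With $N = 100$ the model has far fewer weights than data points, so it can no longer interpolate. Its training error should settle near the noise level instead of zero.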
This is why, in machine learning, data is often the most valuable resource. More data generally improves generalization without any changes to the model or the optimizer. In practice, data collection can be expensive, which motivates the regularization approach of the next lecture.