Lecture 3.5
Regularized Least Squares
After this lecture you should be able to:
- Write the ridge regression objective (regularized SSE) and explain intuitively why the L2 penalty suppresses overfitting.
- Explain why the bias term is typically excluded from the regularization penalty.
- Derive the connection between ridge regression and MAP estimation: show that $\lambda = \alpha/\beta$ and interpret $\alpha$ and $\beta$ in this context.
- Sketch the effect of varying $\lambda$ on training error and test error, and identify the regime of good generalization.
- Describe the lasso (L1) penalty, explain its sparsification effect, and compare it with ridge (L2) geometrically using the constraint-region picture.
- Explain the practical value of sparsity: automatic feature selection.
Instead of manually choosing the number of basis functions to prevent overfitting, we can use a large, expressive basis and add a penalty term that discourages large weights. This is regularization — controlling model complexity through the objective, not through the architecture.
1. Ridge Regression (L2 Regularization)
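Ridge regression minimizes the sum-of-squares error plus a quadratic penalty on every weight except the bias $w_0$:
$$E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^N\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^{M-1}w_j^2$$
$\lambda \geq 0$ is the regularization parameter. Large $\lambda$ strongly penalizes large weights; $\lambda = 0$ recovers ordinary least squares.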
The intuition: overfitting requires large, cancelling weights to pass through every training point. The L2 penalty makes large weights expensive, so the optimizer is forced to find a smoother solution that accepts small residuals rather than eliminating them entirely.
Why exclude the bias? The bias $w_0$ allows the model to shift its predictions up or down without changing the function shape. Penalizing it would bias predictions toward zero regardless of the data. Moreover, the bias does not contribute to model complexity in the sense of producing oscillations — so there is no reason to suppress it.
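As a minimal sketch of the two points above, the following NumPy code solves the ridge problem in closed form while leaving $w_0$ unpenalized. It assumes the objective is half the SSE plus $\frac{\lambda}{2}\sum_{j\geq 1} w_j^2$ (the scaling used for the penalty family later in these notes). The function name `fit_ridge`, the polynomial basis, and the toy data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimize 0.5*||t - Phi w||^2 + 0.5*lam*sum_{j>=1} w_j^2.

    Phi is the N x M design matrix whose first column is all ones
    (the bias feature); the bias weight w_0 is left unpenalized.
    """
    penalty = lam * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0  # do not penalize the bias w_0
    # Setting the gradient to zero gives (Phi^T Phi + penalty) w = Phi^T t.
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ t)

# Hypothetical toy data: noisy samples of sin(2*pi*x), degree-9 polynomial basis.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
Phi = np.vander(x, 10, increasing=True)  # columns 1, x, x^2, ..., x^9

w_weak = fit_ridge(Phi, t, 1e-8)    # almost no regularization
w_strong = fit_ridge(Phi, t, 1e-3)  # moderate regularization
print("weight norm, lam=1e-8:", np.linalg.norm(w_weak[1:]))
print("weight norm, lam=1e-3:", np.linalg.norm(w_strong[1:]))

# Shifting every target by a constant only moves the unpenalized bias.
w_shifted = fit_ridge(Phi, t + 5.0, 1e-3)
print("bias change:", w_shifted[0] - w_strong[0])
print("largest change in other weights:", np.max(np.abs(w_shifted[1:] - w_strong[1:])))
```

Because $w_0$ is unpenalized, adding a constant to all targets changes only the bias. If $w_0$ were penalized, the fitted function would depend on the arbitrary origin chosen for the targets.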
2. Connection to MAP Estimation
This is not a heuristic — it is exactly what we derived in lecture 2.5 from the probabilistic MAP framework. Recall that maximizing the log posterior with a Gaussian prior $p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ gives:
$$\mathbf{w}_{MAP} = \arg\min_\mathbf{w}\;\frac{\beta}{2}\sum_{i=1}^N\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2 + \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}$$
Dividing through by $\beta$, this matches the ridge objective with:
$$\lambda = \frac{\alpha}{\beta}$$
$\alpha$ is the prior precision on the weights (how certain we are that weights are small); $\beta$ is the noise precision (how much we trust the data). A large $\alpha/\beta$ ratio means a strong prior relative to the data, and therefore more regularization. Ridge regression is therefore MAP estimation under a Gaussian weight prior.
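Spelled out, the step from the MAP objective to the ridge form is a rescaling by the positive constant $1/\beta$, which leaves the minimizer unchanged. Note that the prior $\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ covers every component of $\mathbf{w}$, so in this form the penalty includes $w_0$ as well:

$$\begin{aligned}
\mathbf{w}_{MAP} &= \arg\min_\mathbf{w}\;\frac{\beta}{2}\sum_{i=1}^N\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2 + \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} \\
&= \arg\min_\mathbf{w}\;\frac{1}{2}\sum_{i=1}^N\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2 + \frac{\alpha}{2\beta}\mathbf{w}^\top\mathbf{w} \\
&= \arg\min_\mathbf{w}\;\frac{1}{2}\sum_{i=1}^N\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2 + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}, \qquad \lambda = \frac{\alpha}{\beta}
\end{aligned}$$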
3. Effect of $\lambda$: The Regularization Curve
Plotting train and test RMSE against $\ln\lambda$ gives a picture analogous to the model-order curve from lecture 3.4. Test error traces the same U-shape, while training error only rises as $\lambda$ grows:
- $\lambda \approx 0$: little regularization → large weights → overfitting. Large generalization gap.
- Intermediate $\lambda$: weights are controlled → good generalization. Train and test errors are close and both low.
- Large $\lambda$: weights pushed toward zero → all predictions approach a constant → underfitting. Gap is small again, but both errors are high.
The regularization parameter $\lambda$ is a hyperparameter that must be chosen by model selection (covered in Week 4).
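The three regimes can be reproduced with a short sweep over $\ln\lambda$. This is a sketch under stated assumptions: the sinusoidal toy data, the degree-9 polynomial basis, the noise level, and the helper names `design`, `fit_ridge`, `rmse`, and `sample` are all hypothetical choices made for illustration, and the held-out set here is only used to display the curve, not to select $\lambda$.

```python
import numpy as np

def design(x, degree=9):
    """Polynomial design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(Phi, t, lam):
    """Closed-form ridge solution with an unpenalized bias (first column)."""
    penalty = lam * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ t)

def rmse(Phi, t, w):
    return float(np.sqrt(np.mean((t - Phi @ w) ** 2)))

rng = np.random.default_rng(0)

def sample(n):
    """Hypothetical data generator: sin(2*pi*x) plus Gaussian noise."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_train, t_train = sample(15)
x_test, t_test = sample(200)
Phi_train, Phi_test = design(x_train), design(x_test)

results = []  # rows of (ln lambda, train RMSE, test RMSE)
for ln_lam in range(-18, 5, 2):
    w = fit_ridge(Phi_train, t_train, np.exp(ln_lam))
    results.append((ln_lam, rmse(Phi_train, t_train, w), rmse(Phi_test, t_test, w)))

for ln_lam, train_err, test_err in results:
    print(f"ln(lambda) = {ln_lam:4d}   train RMSE = {train_err:.3f}   test RMSE = {test_err:.3f}")
```

Reading the printed table from top to bottom corresponds to moving left to right along the regularization curve.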
4. Beyond Ridge: The L$q$ Penalty Family
The L2 penalty is one member of a family parameterized by $q$:
$$\frac{\lambda}{2}\sum_{j=1}^{M-1}|w_j|^q$$
Two cases matter most:
Ridge ($q=2$): L2 penalty. Corresponds to MAP with a Gaussian prior. Shrinks all weights toward zero smoothly — none are set exactly to zero. Every feature keeps a nonzero coefficient.
Lasso ($q=1$): L1 penalty on the absolute values of the weights. Produces sparse solutions: many weights are driven to exactly zero. The surviving weights identify the most predictive features.
Why Does L1 Produce Sparsity?
Two complementary views:
Algebraic view. For the L2 penalty, the gradient of $w_j^2$ is $2w_j$: it is large for large $w_j$ and vanishes as $w_j \to 0$. So the optimizer has the most to gain by reducing the largest weights first; the pull on small weights fades away, and they rarely reach exactly zero. For the L1 penalty, the gradient of $|w_j|$ is $\operatorname{sign}(w_j) = \pm 1$ for $w_j \neq 0$, the same regardless of the weight's magnitude. Reducing a large weight and reducing a small weight are equally attractive, so a weight that contributes little to reducing the SSE gets pushed all the way to zero.
Geometric view. Minimizing the regularized objective is equivalent to minimizing the SSE subject to $\sum_j |w_j|^q \leq \eta$ for some $\eta$ that depends on $\lambda$. In two dimensions the feasible region for $q=2$ is a disk; for $q=1$ it is a diamond with corners on the coordinate axes. When the SSE minimum lies outside this region, the constrained solution sits on the boundary, at the point where an elliptical SSE contour first touches the region. Because the diamond's corners protrude, that first contact is often at a corner, which lies on an axis and sets one or more weights to exactly zero. The disk has no corners, so its contact point generally has all weights nonzero.
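The contrast between the two penalties is easiest to see for a single weight. Assume the unregularized least-squares value of the weight is $a$, so the objective is $\frac{1}{2}(w-a)^2$ plus the penalty (this is the situation for each weight separately when the features are orthonormal). The function names below are hypothetical. Setting the derivative to zero gives $w = a/(1+\lambda)$ for the penalty $\frac{\lambda}{2}w^2$, and a soft-threshold with threshold $\lambda/2$ for the penalty $\frac{\lambda}{2}|w|$.

```python
def ridge_1d(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1.0 + lam)

def lasso_1d(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*abs(w): soft-thresholding.

    The weight is moved toward zero by lam/2 and clipped at zero.
    """
    shrunk = max(abs(a) - lam / 2.0, 0.0)
    return shrunk if a >= 0 else -shrunk

lam = 1.0
for a in (3.0, 0.4, -0.4):
    print(f"a = {a:5.2f}   ridge: {ridge_1d(a, lam):6.3f}   lasso: {lasso_1d(a, lam):6.3f}")
```

Ridge rescales every weight by the same factor, so a nonzero $a$ never becomes exactly zero. Lasso subtracts a fixed amount, so any weight with $|a| \leq \lambda/2$ is set to exactly zero.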
A model is trained to predict prostate-specific antigen (PSA) levels from 8 clinical measurements. With ridge regression, all 8 weights shrink as $\lambda$ increases but none reach zero — all features remain in the model. With lasso, as $\lambda$ increases, weights are eliminated one by one. The optimal lasso model (chosen by test set performance) retains only 3 of the 8 measurements — cancer volume, prostate weight, and one other — and sets the rest to zero. The sparse model is both accurate and interpretable: it automatically identifies the most predictive features.
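The qualitative behavior described in this example can be imitated on synthetic data. The sketch below does not use the prostate data: it generates a hypothetical problem with 8 features of which only 3 influence the target, fits lasso by cyclic coordinate descent (one common solver, chosen here for brevity), and compares the result with ridge. The objective scaling $\frac{1}{2}\text{SSE} + \frac{\lambda}{2}\sum_j|w_j|$, the function names, and all numerical settings are assumptions.

```python
import numpy as np

def soft_threshold(rho, thresh):
    """Shrink rho toward zero by thresh, clipping at zero."""
    return np.sign(rho) * max(abs(rho) - thresh, 0.0)

def fit_lasso(X, t, lam, n_sweeps=200):
    """Cyclic coordinate descent for 0.5*||t - X w||^2 + 0.5*lam*sum|w_j|.

    X and t are assumed centered, so no bias term is needed.
    """
    w = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution removed.
            r_j = t - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return w

def fit_ridge(X, t, lam):
    """Closed-form ridge on centered data (no bias column)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

# Synthetic data (hypothetical): 8 features, only the first 3 matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
t = X @ true_w + 0.5 * rng.standard_normal(100)
X = X - X.mean(axis=0)
t = t - t.mean()

for lam in (0.0, 20.0, 100.0, 2000.0):
    w_lasso = fit_lasso(X, t, lam)
    w_ridge = fit_ridge(X, t, lam)
    print(f"lambda = {lam:7.1f}   nonzero lasso weights: {np.count_nonzero(w_lasso)}"
          f"   nonzero ridge weights: {np.count_nonzero(w_ridge)}")
```

As $\lambda$ grows, the lasso column of the output typically drops from 8 toward 0, while the ridge weights shrink but stay nonzero.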