Lecture 4.1

Model Selection

Cross-validation and practical tools for choosing among competing models without overfitting to the test set.

Learning Objectives

Explain why generalization performance must be estimated on data the model has never seen.
Describe the roles of the training, validation, and test sets in a supervised learning pipeline.
Define $k$-fold cross-validation and compute the cross-validation error.
Use cross-validation to select hyperparameters such as $\alpha$ (prior precision) or $\lambda$ (regularization strength).
Apply nested cross-validation to obtain an unbiased estimate of generalization performance.

1. Two Goals of Error Evaluation

When training a supervised model, we evaluate error for two distinct reasons:

Estimating generalization performance. How well does the model perform on unseen data? A low training error alone tells us little; we need an estimate of error on data the model was never optimized on.
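Choosing hyperparameters. Hyperparameters such as $\alpha$ (the prior precision in MAP estimation) or $\lambda$ (the regularization strength in ridge regression) cannot be chosen by minimizing the training loss: the training error is lowest with no regularization at all, so that criterion would always favor the most flexible model. We need a separate criterion to select them.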

Both goals require evaluating error on data that was not used to fit the model parameters $\mathbf{w}$.
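
A minimal sketch of why training error alone is uninformative, assuming a synthetic sine target, a noise level of 0.2, and a degree-9 polynomial model (all illustrative choices, not taken from the lecture). With 10 training points, the polynomial has enough coefficients to fit them almost exactly, yet its error on fresh data is much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_targets(x):
    """Noisy samples of a smooth function (the sine target is an assumption)."""
    return np.sin(np.pi * x) + 0.2 * rng.normal(size=x.shape)

x_train = np.linspace(-1.0, 1.0, 10)
t_train = make_targets(x_train)
x_new = rng.uniform(-1.0, 1.0, size=200)   # data the model is never fit on
t_new = make_targets(x_new)

# Degree 9 means 10 coefficients: enough to pass through all 10 training points.
coeffs = np.polyfit(x_train, t_train, deg=9)

train_mse = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
new_mse = np.mean((np.polyval(coeffs, x_new) - t_new) ** 2)
print(f"training MSE:    {train_mse:.2e}")
print(f"unseen-data MSE: {new_mse:.2e}")
```

The gap between the two numbers is the reason both goals above need held-out data.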

2. Train / Validation / Test Split

The standard approach is to partition the dataset $\mathcal{D}$ into three disjoint subsets.

Three-Way Data Split

Given a dataset of $N$ samples, partition it (typically at random) into:

Training set (~80%): used to optimize the model parameters $\mathbf{w}$ by minimizing the training loss, $$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i \in \mathcal{D}_{\mathrm{train}}} \ell\!\left(\hat{y}(\mathbf{x}_i;\mathbf{w}),\, t_i\right).$$
Validation set (~10%): used to select hyperparameters. For each candidate setting, train a model and evaluate on the validation set; keep the setting with the lowest validation error.
Test set (~10%): used once, at the very end, to report an unbiased estimate of generalization performance.
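
The three-way split can be sketched with a random permutation of sample indices. The function name `three_way_split`, the default fractions, and the seed are illustrative assumptions; any disjoint random partition serves the same purpose.

```python
import numpy as np

def three_way_split(n_samples, frac_train=0.8, frac_val=0.1, seed=0):
    """Return disjoint index arrays (train, val, test) covering range(n_samples)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)            # random order of all indices
    n_train = int(round(frac_train * n_samples))
    n_val = int(round(frac_val * n_samples))
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]                # whatever remains
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting indices rather than the arrays themselves keeps inputs and targets aligned: the same index array selects rows from both.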

The Test Set Must Never Be Touched During Model Selection

Once the validation set is used to select hyperparameters, it has influenced the model. Any performance number computed on data that was part of this process is optimistically biased. The test set must remain completely independent of all training and hyperparameter tuning decisions — report performance on it exactly once.

3. $k$-Fold Cross-Validation

When the dataset is small, a single validation set may be too small to give reliable error estimates — a handful of outliers or easy samples can swing the result substantially. Cross-validation addresses this by reusing data efficiently.

$k$-Fold Cross-Validation

Split the dataset into $k$ equal (or near-equal) folds. Let $\kappa : \{1,\dots,N\} \to \{1,\dots,k\}$ be an indexing function that maps each data point $i$ to the fold in which it serves as a validation point.

Train $k$ models: model $\hat{y}^{(-j)}$ is trained on all folds except fold $j$. The cross-validation error is

$$E_{\mathrm{CV}} = \frac{1}{N} \sum_{i=1}^{N} \ell\!\left(\hat{y}^{(-\kappa(i))}(\mathbf{x}_i),\, t_i\right).$$

Every data point is used for validation exactly once, always by a model that was trained without it.
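
The definition above can be sketched in NumPy as follows. The model (closed-form ridge regression), the squared-error loss, the synthetic data, and the helper names `fit_ridge`, `fold_assignment`, and `cv_error` are assumptions made for illustration. Folds are numbered from 0 here rather than from 1.

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T t."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

def fold_assignment(n_samples, k, seed=0):
    """kappa: a random fold label in {0, ..., k-1} per sample, near-equal fold sizes."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.arange(n_samples) % k)

def cv_error(X, t, k, lam, seed=0):
    """E_CV: squared-error loss averaged over all N held-out predictions."""
    kappa = fold_assignment(len(t), k, seed)
    losses = np.empty(len(t))
    for j in range(k):
        held_out = kappa == j
        # Model y^(-j): trained on every fold except fold j.
        w = fit_ridge(X[~held_out], t[~held_out], lam)
        losses[held_out] = (X[held_out] @ w - t[held_out]) ** 2
    return losses.mean()

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

e_cv = cv_error(X, t, k=5, lam=0.1)
print(f"5-fold CV error: {e_cv:.4f}")
```

Each entry of `losses` is written exactly once, by the model that did not see that point, which mirrors the sum over $i$ in $E_{\mathrm{CV}}$.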

Example: 5-Fold Cross-Validation

With $k = 5$, split the data into five folds of equal size. In round $j$ ($j = 1,\dots,5$), fold $j$ is held out for validation and the remaining four folds form the training set, giving model $\hat{y}^{(-j)}$. The CV error averages the validation loss of each model on its held-out fold, so every sample contributes to the error estimate exactly once.

Choice of $k$ and Computational Cost

Typical values are $k = 5$ or $k = 10$. The extreme case $k = N$ is called leave-one-out cross-validation (LOOCV): each model is trained on all but one point and validated on that single point. LOOCV maximizes training set size but requires $N$ training runs.
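
As a small sketch of the $k = N$ case, the loop below performs one training run per data point. The ridge model, squared-error loss, synthetic data, and the name `loocv_error` are assumptions chosen for illustration.

```python
import numpy as np

def loocv_error(X, t, lam):
    """Leave-one-out CV: N training runs, each validated on a single point."""
    n, d = X.shape
    losses = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # all points except i
        A = X[keep].T @ X[keep] + lam * np.eye(d)
        w = np.linalg.solve(A, X[keep].T @ t[keep])    # ridge fit without point i
        losses[i] = (X[i] @ w - t[i]) ** 2
    return losses.mean()

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

e_loo = loocv_error(X, t, lam=0.1)
print(f"LOOCV error over {len(t)} training runs: {e_loo:.4f}")
```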

Cross-validation is most valuable when the dataset is small. With a large dataset, a single sufficiently large validation set gives reliable estimates at far lower cost.

4. Model Selection with Cross-Validation

Cross-validation gives an objective criterion for hyperparameter selection: train with each candidate setting, compute the CV error, and choose the setting that minimizes it.

Hyperparameter Selection via Cross-Validation

For a hyperparameter $\alpha$ drawn from a candidate set $\mathcal{A}$:

For each $\alpha \in \mathcal{A}$, run $k$-fold cross-validation to obtain $E_{\mathrm{CV}}(\alpha)$.
Select $\alpha^* = \arg\min_{\alpha \in \mathcal{A}}\, E_{\mathrm{CV}}(\alpha)$.
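
The two steps above can be sketched as a loop over candidates. Here the hyperparameter is taken to be a ridge penalty `lam`, standing in for the generic $\alpha$; the candidate grid, the synthetic data, and the helper names are illustrative assumptions. The fold assignment is fixed by a seed so that every candidate is scored on the same folds.

```python
import numpy as np

def cv_error(X, t, lam, k=5, seed=0):
    """k-fold CV error of ridge regression with penalty lam (squared-error loss)."""
    rng = np.random.default_rng(seed)
    kappa = rng.permutation(np.arange(len(t)) % k)   # same folds for every lam
    sq_err = np.empty(len(t))
    for j in range(k):
        val = kappa == j
        A = X[~val].T @ X[~val] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[~val].T @ t[~val])
        sq_err[val] = (X[val] @ w - t[val]) ** 2
    return sq_err.mean()

def select_by_cv(X, t, candidates, k=5):
    """Return the candidate with the lowest CV error, plus all CV errors."""
    errors = {lam: cv_error(X, t, lam, k) for lam in candidates}
    best = min(errors, key=errors.get)
    return best, errors

# Few samples, many features: a setting where regularization matters.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
t = X @ rng.normal(size=10) + 1.0 * rng.normal(size=40)

candidates = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
best_lam, errors = select_by_cv(X, t, candidates)
for lam, err in errors.items():
    print(f"lam={lam:<8g} E_CV={err:.3f}")
print("selected:", best_lam)
```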

With two hyperparameters $\alpha \in \mathcal{A}$ and $\beta \in \mathcal{B}$ (e.g., prior precision and noise precision), a full grid search requires $|\mathcal{A}| \times |\mathcal{B}| \times k$ training runs. To reduce cost, one can optimize each hyperparameter sequentially — optimize one while holding the other fixed — at the cost of ignoring interactions between them.
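
As a rough worked count, assume (hypothetically) $|\mathcal{A}| = |\mathcal{B}| = 10$ candidates and $k = 5$ folds, and assume the sequential strategy makes a single pass: tune $\alpha$ with $\beta$ fixed, then tune $\beta$ with $\alpha$ fixed.

$$\text{grid search: } |\mathcal{A}| \times |\mathcal{B}| \times k = 10 \times 10 \times 5 = 500 \text{ training runs},$$

$$\text{sequential search: } (|\mathcal{A}| + |\mathcal{B}|) \times k = (10 + 10) \times 5 = 100 \text{ training runs}.$$

The sequential pass evaluates at most 20 of the 100 grid points, which is why it can miss a good combination when the best $\alpha$ depends on $\beta$.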

5. Nested Cross-Validation

After hyperparameter selection via cross-validation, we still need a reliable, unbiased estimate of generalization performance. The inner validation folds used for selection have indirectly influenced the chosen model, so they cannot serve as an independent test set. Nested cross-validation resolves this with two nested loops.

Nested Cross-Validation

Outer loop (generalization estimate): split the data into $k_{\mathrm{out}}$ folds. In round $j$, fold $j$ is set aside as the test set.

Inner loop (hyperparameter selection): on the remaining $k_{\mathrm{out}} - 1$ folds, run inner $k_{\mathrm{in}}$-fold cross-validation to select hyperparameters $\alpha^*_j$.

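Train the model for round $j$ with $\alpha^*_j$ on all $k_{\mathrm{out}} - 1$ remaining folds (the data available to the inner loop) and evaluate it on outer fold $j$ to get the test error $E_j$, the average loss over the points in that fold. The nested CV generalization estimate is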

$$E_{\mathrm{nested}} = \frac{1}{k_{\mathrm{out}}} \sum_{j=1}^{k_{\mathrm{out}}} E_j.$$
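
A minimal sketch of the two loops, under several assumptions: ridge regression as the model, squared-error loss, a single hyperparameter `lam`, and a refit on all $k_{\mathrm{out}} - 1$ non-test folds once the inner loop has chosen `lam`. All function names and the synthetic data are illustrative.

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

def folds(n, k, rng):
    """Random fold label in {0, ..., k-1} for each of n samples."""
    return rng.permutation(np.arange(n) % k)

def inner_cv_select(X, t, candidates, k_in, rng):
    """Inner loop: pick the candidate with the lowest k_in-fold CV error."""
    kappa = folds(len(t), k_in, rng)

    def e_cv(lam):
        sq = np.empty(len(t))
        for j in range(k_in):
            val = kappa == j
            w = fit_ridge(X[~val], t[~val], lam)
            sq[val] = (X[val] @ w - t[val]) ** 2
        return sq.mean()

    return min(candidates, key=e_cv)

def nested_cv(X, t, candidates, k_out=5, k_in=4, seed=0):
    """Outer loop: estimate generalization error of the whole tuning procedure."""
    rng = np.random.default_rng(seed)
    outer = folds(len(t), k_out, rng)
    test_errors, chosen = [], []
    for j in range(k_out):
        test = outer == j
        X_dev, t_dev = X[~test], t[~test]            # outer fold j is never used here
        lam_j = inner_cv_select(X_dev, t_dev, candidates, k_in, rng)
        w = fit_ridge(X_dev, t_dev, lam_j)           # refit on all non-test folds
        test_errors.append(float(np.mean((X[test] @ w - t[test]) ** 2)))   # E_j
        chosen.append(lam_j)
    return float(np.mean(test_errors)), chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
t = X @ rng.normal(size=5) + 0.5 * rng.normal(size=60)

candidates = [0.01, 0.1, 1.0, 10.0]
e_nested, chosen = nested_cv(X, t, candidates)
print(f"nested CV error: {e_nested:.3f}")
print("selected per outer round:", chosen)
```

The selected value can differ between outer rounds. The estimate therefore describes the tuning procedure as a whole rather than any single hyperparameter setting.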

When to Use Nested Cross-Validation

Nested CV is most useful when the dataset is small: every point serves as a test point exactly once in the outer loop, and within each outer round the held-out test fold plays no part in the inner hyperparameter-tuning loop. The tradeoff is computational: a full grid search over two hyperparameters costs $k_{\mathrm{out}} \times |\mathcal{A}| \times |\mathcal{B}| \times k_{\mathrm{in}}$ inner training runs, plus one refit per outer round.
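
To make the cost concrete, assume (hypothetically) $k_{\mathrm{out}} = k_{\mathrm{in}} = 5$ and $|\mathcal{A}| = |\mathcal{B}| = 10$. Counting only the inner-loop training runs of a full grid search:

$$k_{\mathrm{out}} \times |\mathcal{A}| \times |\mathcal{B}| \times k_{\mathrm{in}} = 5 \times 10 \times 10 \times 5 = 2500.$$

If each outer round also refits once with its selected hyperparameters, that adds $k_{\mathrm{out}} = 5$ further runs. With a single hyperparameter and 10 candidates, the inner-loop count drops to $5 \times 10 \times 5 = 250$.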

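For deployment, one typically reports the nested CV error as the performance estimate, then retrains a final model on all available data. Because each outer round may select a different $\alpha^*_j$, the final hyperparameters are usually chosen by running the selection procedure (the inner cross-validation) once more on the full dataset.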