Lecture 1.2

What Is Machine Learning?

Learning Objectives

After this lecture you should be able to:

  • State and parse the formal T/P/E definition of machine learning: task $T$, performance $P$, and experience $E$.
  • Give concrete examples of $T$, $P$, and $E$ for classification, regression, and clustering problems.
  • Distinguish classification (discrete output), regression (continuous output), and clustering (no labels) as the three core task types.
  • Write down the accuracy metric for classification and the mean squared error for regression.
  • Describe the $K$-means algorithm and the within-cluster distance objective it minimizes.
  • Explain overfitting using the polynomial curve-fitting example and state why performance must always be reported on a held-out test set.

This lecture establishes the conceptual vocabulary of machine learning: what it means for a program to "learn", what kinds of tasks are addressed, how performance is measured, and why generalization, not training accuracy, is the real goal.

1. What Is Machine Learning?

Nowadays almost any algorithm that uses data gets called "machine learning." To be precise, we use the following widely-cited definition (Mitchell, 1997):

Definition: Machine Learning (Mitchell, 1997)

A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

The three components:

  • Task $T$: the process we want to automate, such as classifying images, predicting values, or grouping data points.
  • Performance $P$: a quantitative criterion for how well the algorithm is doing its job. We want to optimize $P$.
  • Experience $E$: the data the algorithm draws on to improve. This is the defining ingredient: it is what distinguishes machine learning from hard-coded rule-based programming.

2. Tasks

The lecture covers three fundamental task types, each illustrated with a running example.

2.1 Classification

Given an input $\mathbf{x}$, assign it to one of $K$ discrete classes. The output is a class label $\hat{y} \in \{C_1, \ldots, C_K\}$, or, taking a probabilistic view, a distribution over classes that also captures uncertainty.

Example: Digit Recognition (MNIST)

Input: a $28 \times 28$ grayscale image. Target: the digit in $\{0, \ldots, 9\}$. Experience: image–label pairs. A trained classifier assigns a new image to a class, ideally the correct one, or expresses uncertainty: "80% chance this is an 8, 20% a 9."

Example: Tumour Classification from Gene Expression

Input: a vector of mRNA activity levels across thousands of genes (visualized as a heat map: green = active, red = inactive). Target: tumour type (breast cancer, leukemia, melanoma, …). Experience: previously profiled and labelled tumours. Patterns in the activation profile encode the tumour class; a classifier automates what a pathologist would otherwise read by eye.

2.2 Regression

Given an input $x$, predict a continuous output $y \in \mathbb{R}$. The goal is to learn a function $f(x; \mathbf{w})$ that captures the underlying signal.

A canonical toy example: the ground truth is $\sin(2\pi x)$, but we only observe noisy measurements $$ t_n = \sin(2\pi x_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2). $$ We model $f$ as a polynomial of degree $M$: $$ f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M $$ and tune $\mathbf{w}$ to fit the observed data points. The choice of $M$ matters: too small and the model cannot capture the shape; too large and it memorises the noise (see §4).
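The toy problem can be sketched in a few lines of Python with NumPy. The evenly spaced inputs on $[0, 1]$, the noise level `sigma=0.3`, the seed, and the helper names `make_sine_data`, `fit_polynomial`, and `predict` are illustrative assumptions, not part of the lecture. The fit uses ordinary least squares, which is one common way to tune $\mathbf{w}$.

```python
import numpy as np

def make_sine_data(n, sigma=0.3, seed=0):
    """Return n evenly spaced inputs in [0, 1] and noisy targets
    t_n = sin(2*pi*x_n) + eps_n with eps_n ~ N(0, sigma^2).
    The spacing, sigma and seed are assumptions made for illustration."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)
    return x, t

def fit_polynomial(x, t, degree):
    """Least-squares fit of a degree-M polynomial.
    Returns w = (w_0, ..., w_M), lowest power first."""
    # Design matrix with columns 1, x, x^2, ..., x^M.
    X = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def predict(x, w):
    """Evaluate f(x; w) = w_0 + w_1 x + ... + w_M x^M."""
    X = np.vander(np.atleast_1d(x), len(w), increasing=True)
    return X @ w

x_train, t_train = make_sine_data(10)
w = fit_polynomial(x_train, t_train, degree=3)
print("fitted weights:", w)
print("prediction at x = 0.25:", predict(0.25, w))
```

Changing `degree` in the last call is enough to explore the effect of $M$ on the fitted curve.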

2.3 Clustering

Clustering is an unsupervised task: no target labels are given. The goal is to discover natural groupings in the data. In the tumour example we may not know how many tumour types exist; clustering can reveal them from the gene profiles alone.

Algorithm: $K$-Means Clustering

Given $N$ points $\{\mathbf{x}_i\}$ and a chosen number of clusters $K$:

  1. Randomly assign each $\mathbf{x}_i$ to one of $K$ clusters.
  2. Compute the cluster mean $\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i$.
  3. Reassign each point to the nearest mean: $k^* = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$.
  4. Repeat steps 2–3 until assignments no longer change.

Each iteration reduces (or preserves) the total within-cluster distance, and there are only finitely many ways to assign $N$ points to $K$ clusters, so the algorithm converges, although possibly to a local rather than a global minimum. The cluster means $\boldsymbol{\mu}_k$ serve as compact descriptors of each group, which is useful for tasks like treatment selection: find the cluster a new tumour belongs to, then draw on outcomes from similar cases.
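The four steps above can be sketched in NumPy as follows. The function name `k_means`, the `seed` and `max_iter` parameters, and the handling of empty clusters (re-seeding the mean at a randomly chosen data point) are assumptions added for this sketch; the lecture does not say what to do when a cluster loses all its points.

```python
import numpy as np

def k_means(X, K, seed=0, max_iter=100):
    """Cluster the rows of X into K groups.
    Returns (labels, means) with labels[i] in {0, ..., K-1}."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = X.shape[0]

    # Step 1: randomly assign each point to one of K clusters.
    labels = rng.integers(0, K, size=N)

    for _ in range(max_iter):
        # Step 2: compute each cluster mean.
        # Assumption: an empty cluster is re-seeded at a random data point.
        means = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(0, N)]
            for k in range(K)
        ])

        # Step 3: reassign each point to the nearest mean
        # (squared Euclidean distance, shape N x K).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, means

# Two small, well-separated groups of 2-D points (made-up data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, means = k_means(points, K=2)
print(labels)
print(means)
```

Because the initial assignment is random, different seeds can end in different local minima; a common remedy is to run the algorithm several times and keep the result with the lowest within-cluster distance.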

3. Performance Measures

For each task type there is a natural choice of performance measure $P$.

3.1 Accuracy (Classification)

Count the fraction of correct predictions: $$ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i] $$ where the indicator $\mathbf{1}[\cdot]$ is 1 when prediction matches label, 0 otherwise. Accuracy = 1 means everything correct; accuracy = 0 means everything wrong.
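As a small illustration, the accuracy formula translates directly into NumPy. The function name `accuracy` and the example labels are made up for this sketch.

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of positions where the prediction equals the label."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    # (y_pred == y_true) is the indicator 1[y_hat_i = y_i]; its mean is the accuracy.
    return float(np.mean(y_pred == y_true))

y_true = [8, 9, 3, 8]   # hypothetical digit labels
y_pred = [8, 8, 3, 8]   # hypothetical predictions, one mistake
print(accuracy(y_pred, y_true))  # 0.75 (3 of 4 correct)
```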

3.2 Mean Squared Error (Regression)

Average the squared residuals between predictions and targets: $$ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl(f(x_i; \mathbf{w}) - y_i\bigr)^2 $$ where $y_i$ is the observed target for input $x_i$ (written $t_n$ in §2.2). MSE = 0 only when every prediction is exact. A degree-9 polynomial fitted to 10 training points achieves MSE $= 0$ on those points, but that turns out to be a problem (§4).
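A minimal sketch of the same formula in NumPy, with made-up numbers; the function name `mean_squared_error` is an assumption for illustration.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of the squared residuals (prediction minus target)."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((predictions - targets) ** 2))

# Residuals are 0, -1 and 2, so the MSE is (0 + 1 + 4) / 3.
print(mean_squared_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))
```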

3.3 Within-Cluster Distance (Clustering)

Minimise the total squared distance from each point to its assigned cluster center: $$ J = \sum_{i=1}^{N} \min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 $$ Lower $J$ means tighter, more cohesive clusters. $K$-means alternately updates the assignments and the means, and neither step can increase $J$.
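The objective $J$ can be evaluated for any set of candidate means, which makes it easy to compare clusterings. The sketch below uses made-up 2-D points; the function name `within_cluster_distance` is an assumption.

```python
import numpy as np

def within_cluster_distance(X, means):
    """J = sum over points of the squared distance to the nearest mean."""
    X = np.asarray(X, dtype=float)
    means = np.asarray(means, dtype=float)
    # Squared distance from every point to every mean, shape (N, K).
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

X = [[0, 0], [0, 2], [10, 0], [10, 2]]

# Two means, one per group: every point is at squared distance 1.
print(within_cluster_distance(X, [[0, 1], [10, 1]]))  # 4.0

# A single mean in the middle: every point is at squared distance 26.
print(within_cluster_distance(X, [[5, 1]]))           # 104.0
```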

4. Generalisation and Overfitting

A high performance measure on the training set can give a false sense of success. Return to the polynomial regression example with $N = 10$ noisy observations of $\sin(2\pi x)$:

  • $M = 0$: A constant, which cannot capture any variation. High error everywhere.
  • $M = 1$: A straight line, slightly better but missing all curvature.
  • $M = 3$: Captures the broad shape well. Both training and test MSE are low. ✓
  • $M = 9$: Passes exactly through all 10 training points (MSE = 0), but oscillates wildly between them. On new data, it performs terribly.

Definition: Overfitting

A model overfits when it performs much better on training data than on new, unseen data. It has learned the noise of the training set rather than the true underlying signal.

Rule: performance must always be reported on a held-out test set, that is, data never seen during training. Training accuracy gives an over-optimistic, biased estimate; test accuracy estimates real-world behavior.
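The gap between training and test error can be checked numerically. The sketch below fits polynomials of degree $M \in \{0, 1, 3, 9\}$ to 10 noisy points and evaluates each on 200 freshly drawn points. The noise level, the seed, the test-set size, and the helper names `fit` and `mse` are assumptions for illustration, and the exact numbers depend on them.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3  # assumed noise standard deviation

def fit(x, t, degree):
    """Least-squares polynomial fit; weights ordered lowest power first."""
    X = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def mse(x, t, w):
    """Mean squared error of the polynomial with weights w on (x, t)."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - t) ** 2))

# Training set: N = 10 evenly spaced inputs with noisy targets.
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, SIGMA, size=10)

# Held-out test set: never used for fitting.
x_test = rng.uniform(0.0, 1.0, size=200)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, SIGMA, size=200)

results = {}
for M in (0, 1, 3, 9):
    w = fit(x_train, t_train, M)
    results[M] = (mse(x_train, t_train, w), mse(x_test, t_test, w))
    print(f"M={M}: train MSE={results[M][0]:.4f}, test MSE={results[M][1]:.4f}")
```

Training MSE can only go down as $M$ grows, because each polynomial family contains the previous one; only the test column reveals whether the extra flexibility helped.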

More data helps: with more observations, even a high-degree polynomial is "pinned down" and generalizes better. This tension between model complexity, data size, and generalization is a thread that runs through the entire course.
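A rough way to see the effect of data size is to fit the same degree-9 polynomial to 10 and to 100 training points and compare the error on a common test set. As before, the noise level, seed, sample sizes, and helper names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3   # assumed noise standard deviation
DEGREE = 9

def noisy_sine(x):
    """Targets t = sin(2*pi*x) + Gaussian noise."""
    return np.sin(2 * np.pi * x) + rng.normal(0.0, SIGMA, size=x.shape)

def fit(x, t, degree):
    """Least-squares polynomial fit; weights ordered lowest power first."""
    X = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def mse(x, t, w):
    """Mean squared error of the polynomial with weights w on (x, t)."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - t) ** 2))

# One shared held-out test set.
x_test = rng.uniform(0.0, 1.0, size=500)
t_test = noisy_sine(x_test)

test_mse = {}
for n in (10, 100):
    x_train = np.linspace(0.0, 1.0, n)
    t_train = noisy_sine(x_train)
    w = fit(x_train, t_train, DEGREE)
    test_mse[n] = mse(x_test, t_test, w)
    print(f"N={n}: test MSE of degree-{DEGREE} fit = {test_mse[n]:.4f}")
```

With 10 points the polynomial interpolates the noise; with 100 points the same model has far less freedom per observation, so its test error should be much closer to the noise floor $\sigma^2$.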

5. Looking Ahead

The lecture closes with a pointer towards the probabilistic viewpoint that underpins this course. Rather than a single hard prediction, we often want a distribution over outcomes ("80% probability class 8, 20% class 9") so that downstream decisions can account for uncertainty. Probability theory, introduced in the next lectures, is the language for doing this rigorously.