Lecture 1.3
Types of Machine Learning
After this lecture you should be able to:
- Distinguish supervised, unsupervised, semi-supervised, and reinforcement learning by the type of data and feedback each uses.
- Explain why classification outputs are discrete labels and regression outputs are continuous values.
- Describe the PCA compression idea: representing a data point as a mean plus a weighted sum of principal components, and explain the memory saving this achieves.
- Explain what semi-supervised learning is and how unlabeled data can help a classifier.
- Describe the reinforcement learning loop in terms of state, action, and reward, and explain why experience is gathered along the way rather than provided upfront.
Lecture 1.2 introduced machine learning through the T/P/E framework. This lecture zooms in on the three major paradigms (supervised, unsupervised, and reinforcement learning) and adds a fourth, semi-supervised learning, that sits between the first two.
1. The Three Paradigms
The key distinguishing factor across paradigms is the form of the experience $E$:
- Supervised learning: experience comes as input-target pairs $(\mathbf{x}_i, t_i)$. Both inputs and labels are provided upfront.
- Unsupervised learning: experience is inputs only $\{\mathbf{x}_i\}$, with no labels. The algorithm discovers structure in the data.
- Reinforcement learning: experience is not provided upfront. The agent collects it by interacting with an environment through trial and error.
2. Supervised Learning
Supervised methods always operate on paired data. What differs between the two main subtypes is the nature of the target $t$.
Classification: the target $t$ is a discrete label drawn from a finite set of classes, e.g. $t \in \{0, 1, \ldots, 9\}$ for digit recognition. Predicting $t = 2.5$ is meaningless: only valid class labels are allowed.
Regression: the target $t \in \mathbb{R}$ is a continuous numerical value. Any real number in the output range is a valid prediction.
In both cases the objective is the same: find a function $f$ that maps inputs to targets as accurately as possible, not just on the training data but on unseen data. That ability to perform well on new inputs is generalization. As we saw in Lecture 1.2, a model that overfits the training set fails to generalize: it has learned the noise, not the signal.
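The contrast between the two target types can be made concrete with a small sketch. Everything here is an illustrative assumption rather than part of the lecture: the toy data is made up, and the helpers `classify_nearest_centroid` and `fit_line` are hypothetical names for two very simple models (a nearest-class-mean classifier and a least-squares line).

```python
import numpy as np

# Toy 1-D inputs (made up for illustration).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Classification: targets are discrete class labels from a finite set.
t_class = np.array([0, 0, 0, 1, 1, 1])

# Regression: targets are continuous real values (here exactly t = 2x + 1).
t_reg = 2.0 * x_train + 1.0


def classify_nearest_centroid(x_new, x, t):
    """Predict the label whose class mean is closest to x_new."""
    labels = np.unique(t)
    centroids = np.array([x[t == c].mean() for c in labels])
    return labels[np.argmin(np.abs(centroids - x_new))]


def fit_line(x, t):
    """Least-squares fit of t ~ w * x + b; returns (w, b)."""
    A = np.column_stack([x, np.ones_like(x)])
    (w, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    return w, b


w, b = fit_line(x_train, t_reg)
x_new = 2.4  # an input that was not in the training set
label_pred = classify_nearest_centroid(x_new, x_train, t_class)
value_pred = w * x_new + b
print(label_pred)  # always one of the class labels
print(value_pred)  # any real number is a possible output
```

Both models are evaluated on an input they never saw during training, which is the situation generalization is about.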
3. Unsupervised Learning
Without labels the task changes: we look for hidden structure in the data. Two important examples are compression and clustering.
3.1 Compression via PCA
Imagine storing thumbnail images (100 × 100 pixels = 10,000 values per image) for millions of users. Storage is expensive; we want to represent each image with far fewer numbers without losing too much quality.
Principal Component Analysis (PCA) achieves this by exploiting shared structure across images. Applied to a dataset of face images, PCA computes:
- The mean image $\bar{\mathbf{x}}$: a smooth, generic face that captures what all images have in common.
- A set of principal components $\mathbf{u}_1, \mathbf{u}_2, \ldots$ (also called eigenfaces in computer vision): directions in pixel space that capture the most common variations, such as lighting, expression, hair, and skin tone.
Any face image can then be approximated by adding a weighted combination of these components to the mean:
$$ \mathbf{x} \approx \bar{\mathbf{x}} + \sum_{i=1}^{M} \alpha_i \mathbf{u}_i $$
The mean and the components are shared by every image and stored only once, so each individual image needs only its coefficients $\alpha_1, \ldots, \alpha_M$: just $M$ numbers instead of 10,000 pixels.
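A minimal NumPy sketch of this mean-plus-components approximation follows. It makes several assumptions that are not in the lecture: random low-rank data stands in for real face images, the components are obtained from the SVD of the centered data, and the coefficients are computed as $\alpha_i = \mathbf{u}_i^\top(\mathbf{x} - \bar{\mathbf{x}})$, which is valid because the components are orthonormal. The names `compress` and `reconstruct` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a dataset: N "images" flattened to D values each.
# Real face images would replace this array.
N, D, M = 200, 400, 10
latent = rng.normal(size=(N, M))
basis = rng.normal(size=(M, D))
X = 5.0 + latent @ basis + 0.01 * rng.normal(size=(N, D))

# Mean image and principal components (rows of Vt from the SVD of centered data).
x_bar = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
U = Vt[:M]  # shape (M, D): u_1 ... u_M as orthonormal rows


def compress(x):
    """Coefficients alpha_i = u_i . (x - x_bar) for the first M components."""
    return U @ (x - x_bar)


def reconstruct(alpha):
    """Approximation x_bar + sum_i alpha_i u_i."""
    return x_bar + alpha @ U


alpha = compress(X[0])  # M numbers kept instead of D
x_hat = reconstruct(alpha)
rel_error = np.linalg.norm(X[0] - x_hat) / np.linalg.norm(X[0])
print(alpha.shape, rel_error)
```

The reconstruction error is small here only because the synthetic data was built to have $M$ dominant directions; real images lose more detail as $M$ shrinks.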
Starting from the mean face and progressively adding more components:
- $M = 1$: only the dominant variation is captured; the result barely resembles the person.
- $M = 10$: eyes and rough features start to emerge.
- $M = 50$: a recognizable likeness, including expression.
- $M = 150$: a faithful reconstruction, stored as 150 coefficients instead of 10,000 pixels, roughly a 67× reduction in per-image storage.
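The storage saving can be checked with a few lines of arithmetic. The sketch below counts stored numbers with and without compression, including the one-off cost of keeping the mean image and the components themselves. The figure of one million images is an assumption chosen to match "millions of users", and `storage_values` is a hypothetical helper.

```python
def storage_values(n_images, n_pixels, n_components):
    """Count stored numbers without and with PCA compression."""
    raw = n_images * n_pixels
    # Stored once and shared: the mean image plus n_components components.
    shared = (1 + n_components) * n_pixels
    # Stored per image: one coefficient per component.
    per_image = n_images * n_components
    return raw, shared + per_image


D, M = 100 * 100, 150
per_image_ratio = D / M  # 10,000 / 150
raw, compressed = storage_values(1_000_000, D, M)
overall_ratio = raw / compressed
print(round(per_image_ratio, 1), round(overall_ratio, 1))
```

With many images the shared mean and components are a negligible overhead, so the overall saving stays close to the per-image ratio.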
Note: PCA is covered in depth in Chapter 12 and the corresponding lectures later in the course.
3.2 Clustering
Clustering groups data points by similarity without any labels; we already covered this in Lecture 1.2 with the tumor gene-expression example. The key idea: by partitioning samples into clusters and computing cluster means, we can discover natural categories in the data and use them to inform decisions (e.g. selecting a treatment protocol for a new patient based on which cluster their tumor profile falls into).
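One common way to partition samples and compute cluster means is k-means. The lecture does not name a specific algorithm, so treat the following as an illustrative choice. The data is synthetic (two well-separated 2-D groups standing in for expression profiles), and `kmeans` and `nearest_cluster` are hypothetical helper names.

```python
import numpy as np


def kmeans(X, k, n_iters=20, seed=0):
    """Plain k-means: alternate nearest-mean assignment and mean update."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Squared distance from every sample to every cluster mean: shape (N, k).
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # keep the old mean if a cluster is empty
                means[j] = X[assign == j].mean(axis=0)
    return means, assign


def nearest_cluster(x_new, means):
    """Index of the cluster whose mean is closest to a new sample."""
    return int(((means - x_new) ** 2).sum(axis=1).argmin())


# Two well-separated synthetic groups (made-up data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
means, assign = kmeans(X, k=2)

new_profile = np.array([4.8, 5.1])  # a new, unlabeled sample
cluster_id = nearest_cluster(new_profile, means)
print(means.round(1), cluster_id)
```

No labels are used anywhere: the two groups emerge from the distances alone, and a new sample is matched to whichever cluster mean it is closest to.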
4. Semi-Supervised Learning
A practical middle ground between supervised and unsupervised learning: we have $N$ inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ but labels for only a subset $\{t_1, \ldots, t_k\}$, where $k \ll N$. The goal is to exploit all available data, labeled and unlabeled, rather than discarding the unlabeled portion.
Suppose we have 1,000 images of cats and dogs, but labels for only 100. A purely supervised classifier ignores the 900 unlabeled images. A semi-supervised approach uses the unlabeled images to learn the structure of the input space (for example, what cat images look like in general) and then leverages that structure to assign labels. An unlabeled image that closely resembles labeled cat images is likely also a cat.
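One simple way to put unlabeled data to work is self-training: fit a classifier on the labeled points, pseudo-label the rest, and refit on everything. The lecture does not prescribe a method, so this is only one possible sketch. It assumes synthetic 2-D features in place of images and a nearest-centroid classifier, and the helpers `centroids` and `predict` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D features standing in for 1,000 images: class 0 = cat, 1 = dog.
X = np.vstack([rng.normal(-2.0, 1.0, size=(500, 2)),
               rng.normal(+2.0, 1.0, size=(500, 2))])
t_true = np.repeat([0, 1], 500)

# Only 100 of the 1,000 points come with a label (50 per class here).
labeled_idx = np.concatenate([rng.choice(500, 50, replace=False),
                              500 + rng.choice(500, 50, replace=False)])
is_labeled = np.zeros(len(X), dtype=bool)
is_labeled[labeled_idx] = True


def centroids(X, t):
    """Mean feature vector of each class."""
    return np.array([X[t == c].mean(axis=0) for c in (0, 1)])


def predict(X, C):
    """Assign each point to the class with the nearest centroid."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# Step 1: purely supervised start, using the 100 labeled points only.
C = centroids(X[is_labeled], t_true[is_labeled])

# Step 2: self-training. Pseudo-label the unlabeled points, refit on all 1,000.
for _ in range(5):
    t_work = predict(X, C)
    t_work[is_labeled] = t_true[is_labeled]  # known labels are never overwritten
    C = centroids(X, t_work)

# The true labels of the unlabeled points are used for evaluation only.
accuracy = (predict(X[~is_labeled], C) == t_true[~is_labeled]).mean()
print(round(accuracy, 3))
```

The refit centroids are estimated from all 1,000 points rather than 100, which is the sense in which the unlabeled data helps describe the structure of the input space.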
5. Reinforcement Learning
Reinforcement learning (RL) is conceptually distinct from all the above. There is no fixed dataset. Instead, an agent operates in an environment and learns by taking actions and observing outcomes.
At each step the agent observes the current state $s$ of the environment, takes an action $a$, and receives a scalar reward $r$ (positive for good moves, negative for bad ones). The state then transitions to $s'$. The agent's goal is to learn a policy, a mapping from states to actions, that maximizes cumulative reward over time.
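The state-action-reward loop can be sketched on a toy problem. Everything below is an assumption made for illustration: the environment is a hypothetical five-cell corridor with the goal at the right end, the reward values are made up, and the learning rule is tabular Q-learning, which the lecture does not name.

```python
import numpy as np

N_STATES, GOAL = 5, 4  # corridor cells 0..4; reaching cell 4 ends the episode
LEFT, RIGHT = 0, 1


def step(s, a):
    """Hypothetical environment: returns (next_state, reward, done)."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    if s_next == GOAL:
        return s_next, 1.0, True   # positive reward for reaching the goal
    return s_next, -0.1, False     # small penalty for every other move


rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))        # estimated return for each (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.2
experience = []                    # (s, a, r, s') tuples gathered along the way

for episode in range(500):
    s = 0
    for _ in range(100):           # cap on episode length
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        explore = rng.random() < epsilon
        a = int(rng.integers(2)) if explore else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        experience.append((s, a, r, s_next))
        # Q-learning update toward r + gamma * max_a' Q(s', a').
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

policy = Q[:GOAL].argmax(axis=1)   # greedy action in each non-terminal state
print(policy)
```

Note that `experience` starts empty and is filled only by acting in the environment; there is no dataset to load before learning begins.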
AlphaGo learned to play the game of Go by playing millions of games against itself in a simulated environment. The state is the board configuration (black and white stones); actions are stone placements; reward signals reflect whether a move gains or loses ground. Reading the rule book alone does not make a good player; experience through play does. RL is the framework that formalizes this.
RL thrives in virtual environments because mistakes are cheap: a bad move in a simulated game costs nothing. This is why most successful RL applications are in games (chess, Go, StarCraft) and simulation-based robotics. Deploying RL in the physical world is harder: mistakes can be costly or dangerous, and the real world cannot be simulated perfectly.
6. Summary
| Paradigm | Data available | Example tasks |
|---|---|---|
| Supervised | Input-label pairs, all upfront | Classification, regression |
| Unsupervised | Inputs only, no labels | Clustering, compression, density estimation |
| Semi-supervised | Some labeled, mostly unlabeled | Classification with limited annotation |
| Reinforcement | Reward signals, gathered during learning | Game playing, robotics control |