Lecture 1.4
Probability Theory & Bayes' Theorem
After this lecture you should be able to:
- Explain why probability theory is central to machine learning, giving two concrete sources of uncertainty.
- Contrast the frequentist and Bayesian interpretations of probability with the coin-flip example.
- Define a random variable, distinguish discrete from continuous cases, and state the positivity and normalization properties.
- Derive and apply the sum rule (marginalization) and the product rule for joint and conditional probabilities.
- State Bayes' theorem and identify the four named terms: prior, likelihood, evidence, and posterior.
- Write down the change-of-variables formula for continuous densities and explain the Jacobian correction.
Probability theory provides the language for reasoning under uncertainty. As Bishop puts it: probability theory provides a consistent framework for the quantification and manipulation of uncertainty. This lecture builds that framework from scratch, starting with counting and arriving at Bayes' theorem.
1. Why Probability in Machine Learning?
Two unavoidable sources of uncertainty appear in every ML problem:
- Measurement noise. We want to model some real-world phenomenon but never observe it exactly: every measurement carries noise. Sometimes we know the noise model; often we don't. Either way, we must account for it.
- Finite datasets. With limited data we cannot fully characterize the underlying distribution. The overfitting example from lecture 1.2 is a direct consequence: close to observed data points we do well, but further away our uncertainty grows and errors increase.
2. Two Interpretations of Probability
There are two philosophically distinct ways to assign a number called "probability" to an event.
Frequentist. Probability is the long-run frequency of an event in a repeated experiment. If a coin lands heads 5 times out of 5 flips, the frequentist estimate is $p(\text{heads}) = 1$: the coin is predicted to always land heads. No probability is assigned to events that have never been observed.
Bayesian. Probability is a degree of belief, a measure of plausibility. A Bayesian starts with a prior belief (e.g. the coin is fair: $p(\text{heads}) = 0.5$), then updates it as evidence arrives. After 5 heads in a row the posterior shifts toward heads being more likely, but the possibility of tails is never fully discarded. Crucially, Bayesian probability can be assigned to events that have never happened before.
This course takes a strongly Bayesian viewpoint, following Bishop. It provides a coherent way to handle uncertainty even in novel situations.
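The coin-flip contrast can be made concrete with a minimal sketch. It assumes a Beta(2, 2) prior, a choice not made in the lecture and picked here only because it is centered on 0.5. With a Beta prior and coin-flip data the posterior is again a Beta distribution, so the update reduces to adding counts. The function names are illustrative.

```python
from fractions import Fraction


def frequentist_estimate(heads, flips):
    """Relative frequency of heads in the observed flips."""
    return Fraction(heads, flips)


def bayesian_posterior_mean(heads, flips, a=2, b=2):
    """Posterior mean of p(heads) under an assumed Beta(a, b) prior.

    The posterior is Beta(a + heads, b + tails), whose mean is
    (a + heads) / (a + b + flips).
    """
    return Fraction(a + heads, a + b + flips)


heads, flips = 5, 5
freq = frequentist_estimate(heads, flips)      # 1: tails is ruled out entirely
bayes = bayesian_posterior_mean(heads, flips)  # 7/9: shifted toward heads, tails still possible
print(freq, bayes)
```

A stronger prior (larger `a` and `b`) would keep the posterior mean closer to 0.5 after the same five flips.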
3. Random Variables
A random variable $X$ takes on values from a set of possible outcomes $\mathcal{X}$. Each time we make an observation we obtain one particular value $x \in \mathcal{X}$. The variable comes equipped with a probability distribution $p(X)$ that assigns a probability to each possible outcome.
Convention: capital $X$ denotes the random variable (the full distribution); lowercase $x$ denotes one specific observed value.
Two cases:
- Discrete: $\mathcal{X}$ is a finite or countable set. Examples: a die roll $\mathcal{X} = \{1,\ldots,6\}$ with $p(x) = \tfrac{1}{6}$ each; a coin flip $\mathcal{X} = \{\text{H}, \text{T}\}$ with $p(\text{H}) = p(\text{T}) = \tfrac{1}{2}$.
- Continuous: $\mathcal{X} \subseteq \mathbb{R}$ (or $\mathbb{R}^d$). The distribution is a probability density function $p(x) \geq 0$ (see §5).
Both cases share two fundamental properties:
- Positivity: $p(x) \geq 0$ for all $x \in \mathcal{X}$.
- Normalization: $\sum_x p(x) = 1$ (discrete) or $\int p(x)\,dx = 1$ (continuous).
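For the discrete case, the two properties can be checked mechanically. The sketch below uses the die and coin from the examples above; the helper name `is_valid_distribution` and the deliberately broken third example are illustrative additions. Exact fractions are used so that the normalization test is not affected by floating-point rounding.

```python
from fractions import Fraction


def is_valid_distribution(p):
    """Check positivity and normalization for a discrete distribution.

    `p` maps each outcome to its probability.
    """
    positive = all(prob >= 0 for prob in p.values())
    normalized = sum(p.values()) == 1
    return positive and normalized


die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
broken = {"H": Fraction(1, 2), "T": Fraction(1, 3)}  # sums to 5/6, so not a distribution

print(is_valid_distribution(die), is_valid_distribution(coin), is_valid_distribution(broken))
```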
4. Joint Distributions, the Sum Rule, and the Product Rule
Suppose we have two discrete random variables $X \in \{x_1,\ldots,x_5\}$ and $Y \in \{y_1, y_2, y_3\}$. We run $N$ experiments and count how often each pair $(x_i, y_j)$ co-occurs; call this count $n_{ij}$. Let $c_i = \sum_j n_{ij}$ be the total count for $X = x_i$.
From these counts we define:
- Joint probability: $p(X\!=\!x_i,\, Y\!=\!y_j) = \dfrac{n_{ij}}{N}$
- Marginal probability: $p(X\!=\!x_i) = \dfrac{c_i}{N} = \dfrac{\sum_j n_{ij}}{N}$
- Conditional probability: $p(Y\!=\!y_j \mid X\!=\!x_i) = \dfrac{n_{ij}}{c_i}$, the fraction of times $Y=y_j$ among cases where $X=x_i$ was already observed.
From these counting definitions two fundamental identities follow:
$$p(X\!=\!x_i) = \sum_j p(X\!=\!x_i,\, Y\!=\!y_j)$$
This is the sum rule: summing the joint distribution over one variable recovers the marginal distribution of the other. We say we have "marginalized out" $Y$.
$$p(X,Y) = p(Y \mid X)\, p(X)$$
This is the product rule: the joint distribution factors into the conditional times the marginal. Equivalently, $p(X,Y) = p(X \mid Y)\, p(Y)$ by symmetry of the joint.
Scatter 60 observations in a 2D grid. The joint distribution $p(x, y)$ is the full grid of counts divided by 60. Collapsing the grid along the $y$-axis (summing each column) yields the marginal $p(x)$; collapsing along $x$ yields $p(y)$. Zooming into a single column $x = x_i$ and normalizing by that column's total gives the conditional $p(y \mid x = x_i)$. Every conditional distribution must sum to 1, a direct consequence of its definition as a ratio of counts within one column.
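The counting definitions of this section can be sketched directly. The table of counts below is invented for illustration (5 values of $X$, 3 values of $Y$, 60 observations in total); in this layout each row of `n` holds the counts for one value $x_i$. Exact fractions make the sum and product rules hold as exact equalities.

```python
from fractions import Fraction

# Hypothetical counts n[i][j]: how often the pair (x_i, y_j) co-occurred.
n = [
    [3, 2, 1],  # x_1
    [5, 4, 3],  # x_2
    [8, 6, 4],  # x_3
    [6, 5, 4],  # x_4
    [4, 3, 2],  # x_5
]
N = sum(sum(row) for row in n)  # total number of observations

# Joint p(x_i, y_j) = n_ij / N
joint = [[Fraction(nij, N) for nij in row] for row in n]

# Marginal p(x_i) = c_i / N with c_i = sum_j n_ij
c = [sum(row) for row in n]
marginal_x = [Fraction(ci, N) for ci in c]

# Conditional p(y_j | x_i) = n_ij / c_i
cond_y_given_x = [[Fraction(nij, ci) for nij in row] for row, ci in zip(n, c)]

# Sum rule: summing the joint over y recovers the marginal of x.
sum_rule_ok = all(sum(joint[i]) == marginal_x[i] for i in range(len(n)))

# Product rule: joint = conditional * marginal, entry by entry.
product_rule_ok = all(
    joint[i][j] == cond_y_given_x[i][j] * marginal_x[i]
    for i in range(len(n))
    for j in range(len(n[0]))
)

# Each conditional distribution sums to 1.
conditionals_normalized = all(sum(row) == 1 for row in cond_y_given_x)

print(N, sum_rule_ok, product_rule_ok, conditionals_normalized)
```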
5. Continuous Random Variables
When $X$ is continuous, individual points have zero probability. Instead, we work with a probability density function $p(x)$:
$$p(x \in [a, b]) = \int_a^b p(x)\, dx$$
The density $p(x)$ itself is not a probability (it can exceed 1), but $p(x)\,dx$ is the probability of $X$ falling in the infinitesimal interval $[x, x+dx]$. The sum and product rules extend naturally by replacing sums with integrals:
$$p(x) = \int p(x, y)\, dy \qquad \text{(sum rule)}$$
$$p(x, y) = p(y \mid x)\, p(x) \qquad \text{(product rule)}$$
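To see that a density may exceed 1 while probabilities stay within $[0, 1]$, consider the following sketch. It assumes a uniform density on $[0, 0.5]$, chosen only for illustration, and approximates the integrals with a simple midpoint rule; `integrate` is a hypothetical helper, not a library function.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def density(x):
    """Uniform density on [0, 0.5]: its value 2 exceeds 1."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0


total = integrate(density, 0.0, 0.5)          # normalization: close to 1
prob_interval = integrate(density, 0.1, 0.2)  # p(x in [0.1, 0.2]): close to 0.2
print(total, prob_interval)
```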
5.1 Change of Variables
If $Y = g(X)$ is a transformation (e.g. converting meters to kilometers), the density must be adjusted to preserve total probability:
$$p_Y(y) = p_X(x)\,\left|\frac{dx}{dy}\right|$$
where $x = g^{-1}(y)$, which requires $g$ to be invertible. The factor $|dx/dy|$ (the absolute value of the Jacobian) corrects for the stretching or compression of the axis. For a linear rescaling $y = cx$ with $c \neq 0$ the factor is the constant $1/|c|$; for nonlinear transforms it depends on $y$ and can produce complex-shaped distributions even from simple originals.
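As a hedged illustration of the formula, take $X$ uniform on $(0, 1)$ and $Y = X^2$, an example chosen here and not taken from the lecture. Then $x = \sqrt{y}$ and $|dx/dy| = 1/(2\sqrt{y})$, so the flat density of $X$ becomes a density for $Y$ that rises sharply near 0. The check below confirms that the same event has the same probability in both coordinates: $0.04 \le Y \le 0.25$ corresponds to $0.2 \le X \le 0.5$.

```python
import math


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def p_x(x):
    """Density of X, assumed uniform on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def p_y(y):
    """Density of Y = X**2 via p_Y(y) = p_X(x) |dx/dy| with x = sqrt(y)."""
    if not 0.0 < y < 1.0:
        return 0.0
    x = math.sqrt(y)
    jacobian = 1.0 / (2.0 * math.sqrt(y))  # |dx/dy| for x = sqrt(y)
    return p_x(x) * jacobian


prob_y = integrate(p_y, 0.04, 0.25)  # probability of the event measured in y
prob_x = integrate(p_x, 0.2, 0.5)    # the same event measured in x
print(prob_y, prob_x)
```

Without the Jacobian factor, integrating $p_X(\sqrt{y})$ over the $y$-interval would give 0.21 (the interval length) instead of 0.3, so total probability would not be preserved.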
5.2 Cumulative Distribution Function
The CDF accumulates probability up to a threshold:
$$P(x) = p(X < x) = \int_{-\infty}^{x} p(\tilde{x})\, d\tilde{x}$$
Differentiating the CDF recovers the density: $\dfrac{dP}{dx} = p(x)$.
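The relation between CDF and density can be checked numerically. The sketch below assumes an exponential density $p(x) = e^{-x}$ for $x \ge 0$, an illustrative choice with the closed-form CDF $P(x) = 1 - e^{-x}$, and differentiates the CDF with a central finite difference.

```python
import math


def density(x):
    """Exponential density p(x) = exp(-x) for x >= 0 (illustrative choice)."""
    return math.exp(-x) if x >= 0 else 0.0


def cdf(x):
    """Closed-form CDF of that density: P(x) = 1 - exp(-x) for x >= 0."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0


def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.3
slope = derivative(cdf, x0)  # should be close to density(x0)
print(slope, density(x0))
```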
6. Bayes' Theorem
Bayes' theorem follows directly from the product rule and the symmetry of the joint distribution.
Derivation. Start from the symmetry of the joint distribution:
$$p(Y, X) = p(X, Y)$$
Apply the product rule to each side separately:
$$p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y)$$
Divide both sides by $p(X)$ (assuming $p(X) > 0$):
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$
where the denominator can be expanded via the sum rule:
$$p(X) = \sum_Y p(X \mid Y)\, p(Y)$$
The denominator $p(X)$ ensures that $p(Y \mid X)$ is properly normalized; it is called the evidence (or marginal likelihood).
Bayes' theorem lets us flip a conditional: knowing $p(X \mid Y)$, we can compute $p(Y \mid X)$. This is its core power.
The Four Named Terms
In a typical ML context, $X$ is an observation and $Y$ is an unknown quantity we want to infer. The four terms each have a standard name:
- Prior $p(Y)$: our belief about $Y$ before observing $X$. Encodes background knowledge or assumptions.
- Likelihood $p(X \mid Y)$: how probable is the observation $X$ for a given value of $Y$? Note: viewed as a function of $Y$ (with $X$ fixed), this need not sum or integrate to 1 over $Y$; hence it is called a likelihood function, not a probability distribution over $Y$.
- Evidence $p(X)$: the marginal probability of the observation, summed over all possible $Y$. Acts as a normalization constant.
- Posterior $p(Y \mid X)$: our updated belief about $Y$ after observing $X$. This is what we are after.
In words: posterior $\propto$ likelihood $\times$ prior.
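The four terms can be traced in a small numerical sketch. All numbers are hypothetical: $Y$ is which of two coins we hold (a fair one, or a biased one that lands heads with probability 0.9), and $X$ is the observation "5 heads in 5 flips". The prior of 0.9 on the fair coin is likewise an assumption made for illustration.

```python
# Y: which coin we hold (unknown). X: the observation "5 heads in 5 flips".
prior = {"fair": 0.9, "biased": 0.1}    # p(Y), hypothetical belief before flipping
p_heads = {"fair": 0.5, "biased": 0.9}  # heads probability of each coin, hypothetical

# Likelihood p(X | Y): probability of 5 heads in a row for each coin.
likelihood = {y: p_heads[y] ** 5 for y in prior}

# Evidence p(X): sum rule over all values of Y.
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior p(Y | X): Bayes' theorem.
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(likelihood, evidence, posterior)
```

Two things are worth noticing in the output: the likelihood values do not sum to 1 over $Y$, and the posterior favors the biased coin even though the prior strongly favored the fair one.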
The next lecture works through concrete examples to build intuition for what each of these terms looks like in practice.