Lecture 2.1

Expectation, Variance & Covariance

Learning Objectives

After this lecture you should be able to:

Define the expectation of a function of a random variable in both the discrete and continuous settings, and state the sample-mean approximation.
State and apply the three linearity properties of the expectation operator.
Describe intuitively what variance measures; state both the definitional form $\mathbb{E}[(f - \mathbb{E}[f])^2]$ and the computational shortcut $\mathbb{E}[f^2] - \mathbb{E}[f]^2$; and derive the shortcut from the definition using linearity.
Describe intuitively what covariance measures; state both the definitional form $\mathbb{E}[(x-\mathbb{E}[x])(y-\mathbb{E}[y])]$ and the computational form $\mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$; and derive one from the other.
Extend covariance to random vectors and describe what the covariance matrix encodes.
Prove that independent random variables have zero covariance, and give a counterexample showing the converse is false.

Having defined probability distributions, we now need tools to summarize them. The expectation, variance, and covariance are the three core statistics that describe where a distribution is centered, how spread out it is, and how two variables move together.

1. Expectation

Definition: Expectation

Let $f : \mathcal{X} \to \mathbb{R}$ be a function of a random variable $x \sim p(x)$. The expectation of $f$ under $p$ is the probability-weighted average of $f$:

Discrete: $\quad\mathbb{E}[f] = \sum_{x \in \mathcal{X}} f(x)\, p(x)$

Continuous: $\quad\mathbb{E}[f] = \int f(x)\, p(x)\, dx$

When the distribution $p$ is not directly available but we can observe samples from it, we approximate the expectation with a sample mean over $N$ observations $\{x_1, \ldots, x_N\}$ drawn from $p$: $$\mathbb{E}[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$ This is the frequentist interpretation: values of $x$ that occur more often appear more often among the samples and therefore contribute more to the sum, mimicking the probability weighting.
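The sample-mean approximation can be tried out numerically. The sketch below is illustrative only: the choice $f(x) = x^2$ with $x \sim \text{Uniform}[0,1]$, the seed, and the helper name `sample_mean` are assumptions made for this example, chosen because the exact value $\int_0^1 x^2\,dx = \tfrac{1}{3}$ is known.

```python
import random

def sample_mean(f, samples):
    """Approximate E[f] by averaging f over the given samples."""
    return sum(f(x) for x in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.random() for _ in range(100_000)]  # x_n ~ Uniform[0, 1)

estimate = sample_mean(lambda x: x ** 2, samples)
exact = 1.0 / 3.0  # integral of x^2 over [0, 1]
print(f"sample mean: {estimate:.4f}, exact: {exact:.4f}")
```

With more samples the estimate should settle closer to the exact value; with only a handful it can be noticeably off.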

Expectations also exist in conditional form: $\mathbb{E}[f(x) \mid y] = \sum_x f(x)\, p(x \mid y)$ (with the sum replaced by an integral in the continuous case), i.e. the expectation of $f$ given that we already know $y$.
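As a small worked example of the conditional form, the sketch below computes $\mathbb{E}[x \mid y]$ from a joint table using $p(x \mid y) = p(x, y) / p(y)$. The joint distribution and the function name `conditional_expectation` are hypothetical, made up for illustration.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y) over x in {0, 1, 2}, y in {"a", "b"}.
joint = {
    (0, "a"): F(1, 10), (1, "a"): F(2, 10), (2, "a"): F(1, 10),
    (0, "b"): F(3, 10), (1, "b"): F(1, 10), (2, "b"): F(2, 10),
}

def conditional_expectation(f, joint, y):
    """E[f(x) | y] = sum_x f(x) p(x | y), with p(x | y) = p(x, y) / p(y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)  # marginal p(y)
    return sum(f(x) * p / p_y for (x, yy), p in joint.items() if yy == y)

e_given_a = conditional_expectation(lambda x: x, joint, "a")
e_given_b = conditional_expectation(lambda x: x, joint, "b")
print(e_given_a, e_given_b)
```

The two results differ, which shows that knowing $y$ changes what we expect of $x$ under this particular table.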

Linearity of Expectation

The expectation operator is linear. For any functions $f$, $g$ and constant $c$:

$\mathbb{E}[f + g] = \mathbb{E}[f] + \mathbb{E}[g]$
$\mathbb{E}[c \cdot f] = c\, \mathbb{E}[f]$
$\mathbb{E}[c] = c \quad$ (the expectation of a constant is the constant itself, since $\sum_x p(x) = 1$)

These properties are used repeatedly in the derivations below.
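The three properties can be checked exactly on a small discrete distribution. In the sketch below the distribution `p`, the functions `f` and `g`, and the constant `c` are arbitrary choices for illustration; exact rational arithmetic is used so that the equalities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical discrete distribution p(x) on {1, 2, 3, 4}.
p = {1: F(1, 10), 2: F(2, 10), 3: F(3, 10), 4: F(4, 10)}

def expect(f, p):
    """E[f] = sum_x f(x) p(x) for a discrete distribution."""
    return sum(f(x) * px for x, px in p.items())

f = lambda x: x ** 2
g = lambda x: 3 * x + 1
c = 5

# E[f + g] = E[f] + E[g]
additivity = expect(lambda x: f(x) + g(x), p) == expect(f, p) + expect(g, p)
# E[c f] = c E[f]
homogeneity = expect(lambda x: c * f(x), p) == c * expect(f, p)
# E[c] = c, because the probabilities sum to 1
constant = expect(lambda x: c, p) == c
print(additivity, homogeneity, constant)
```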

2. Variance

Definition: Variance

The variance of $f(x)$ measures the expected squared deviation from the mean:

$$\text{Var}[f] = \mathbb{E}\!\left[\bigl(f(x) - \mathbb{E}[f]\bigr)^2\right]$$

Expanding the square and applying linearity, where $\mathbb{E}[f]$ is a constant and can therefore be pulled out of the outer expectation:

$$\text{Var}[f] = \mathbb{E}\!\left[f^2 - 2f\,\mathbb{E}[f] + \mathbb{E}[f]^2\right]$$ $$= \mathbb{E}[f^2] - 2\,\mathbb{E}[f]\cdot\mathbb{E}[f] + \mathbb{E}[f]^2$$

Computational Form $$\text{Var}[f] = \mathbb{E}[f^2] - \mathbb{E}[f]^2$$

In words: the variance is the expected square minus the square of the expectation. This two-term form is often easier to compute than the original definition.
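To see that the definition and the computational form agree, the sketch below evaluates both for a fair six-sided die. The die is an assumed example, not one from the lecture, and the helper `expect` is a name introduced here.

```python
from fractions import Fraction as F

# Assumed example: a fair six-sided die, p(x) = 1/6 for x in {1, ..., 6}.
p = {face: F(1, 6) for face in range(1, 7)}

def expect(f, p):
    """E[f] = sum_x f(x) p(x) for a discrete distribution."""
    return sum(f(x) * px for x, px in p.items())

mean = expect(lambda x: x, p)

# Definition: expected squared deviation from the mean.
var_definition = expect(lambda x: (x - mean) ** 2, p)

# Computational form: E[f^2] - E[f]^2.
var_shortcut = expect(lambda x: x ** 2, p) - mean ** 2

print(mean, var_definition, var_shortcut)
```

Exact fractions are used here on purpose. In floating-point arithmetic the computational form subtracts two potentially large, nearly equal numbers, so it can lose precision when the mean is large relative to the spread.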

Visual Intuition

Consider $x \sim \text{Uniform}[0,1]$ and two functions $f(x)$ (highly varying) and $g(x)$ (nearly flat). Both have a well-defined mean (a horizontal line). The variance of $f$ is large because individual values deviate far from the mean; the variance of $g$ is small because $g$ stays close to its mean throughout $[0,1]$. Variance quantifies the amount of spread around the expected value.
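The same picture can be reproduced numerically. In the sketch below the two functions are assumptions chosen to fit the description ($f$ a full sine wave, $g$ a nearly flat curve), and expectations under $\text{Uniform}[0,1]$ are approximated with a midpoint rule rather than computed in closed form.

```python
import math

def expect_uniform01(f, n=10_000):
    """Midpoint-rule approximation of E[f] for x ~ Uniform[0, 1]."""
    return sum(f((k + 0.5) / n) for k in range(n)) / n

def variance(f):
    """Var[f] from the definition E[(f - E[f])^2]."""
    mean = expect_uniform01(f)
    return expect_uniform01(lambda x: (f(x) - mean) ** 2)

# Assumed illustrative functions: f varies strongly, g is nearly flat.
f = lambda x: math.sin(2 * math.pi * x)
g = lambda x: 0.5 + 0.05 * math.sin(2 * math.pi * x)

var_f = variance(f)
var_g = variance(g)
print(f"Var[f] = {var_f:.5f}, Var[g] = {var_g:.5f}")
```

Because $g$ is $f$ scaled by $0.05$ and shifted by a constant, its variance should come out $0.05^2$ times that of $f$: the shift has no effect and the scale enters squared.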

3. Covariance

Definition: Covariance (scalar)

The covariance between two random variables $x$ and $y$ (with joint distribution $p(x,y)$) measures the extent to which they vary together:

$$\text{Cov}[x, y] = \mathbb{E}\!\left[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\right]$$

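Expanding the product and applying linearity (the same steps as for variance, with $\mathbb{E}[x]$ and $\mathbb{E}[y]$ treated as constants):

$$\text{Cov}[x, y] = \mathbb{E}\!\left[xy - x\,\mathbb{E}[y] - \mathbb{E}[x]\,y + \mathbb{E}[x]\,\mathbb{E}[y]\right]$$ $$= \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y] - \mathbb{E}[x]\,\mathbb{E}[y] + \mathbb{E}[x]\,\mathbb{E}[y]$$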

Computational Form $$\text{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$$
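Both forms of the covariance can be evaluated on a small joint distribution to confirm that they agree. The joint table below is hypothetical: two binary variables that tend to take the same value, so a positive covariance is expected.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y): x and y agree 80% of the time.
joint = {
    (0, 0): F(4, 10), (0, 1): F(1, 10),
    (1, 0): F(1, 10), (1, 1): F(4, 10),
}

def expect(f, joint):
    """E[f(x, y)] = sum over (x, y) of f(x, y) p(x, y)."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x, joint)
mean_y = expect(lambda x, y: y, joint)

# Definition: E[(x - E[x]) (y - E[y])]
cov_definition = expect(lambda x, y: (x - mean_x) * (y - mean_y), joint)

# Computational form: E[xy] - E[x] E[y]
cov_computational = expect(lambda x, y: x * y, joint) - mean_x * mean_y

print(cov_definition, cov_computational)
```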

Covariance Matrix for Random Vectors

When $\mathbf{x}$ and $\mathbf{y}$ are random vectors (column vectors of dimension $D$), the covariance generalizes to a matrix:

$$\text{Cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}\!\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y} - \mathbb{E}[\mathbf{y}])^\top\right] = \mathbb{E}[\mathbf{x}\mathbf{y}^\top] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}]^\top$$

The outer product $(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y} - \mathbb{E}[\mathbf{y}])^\top$ is a $D \times D$ matrix, and entry $(i,j)$ of its expectation is the scalar covariance between the $i$-th component of $\mathbf{x}$ and the $j$-th component of $\mathbf{y}$. The notation $\text{Cov}[\mathbf{x}]$ (one argument) is shorthand for $\text{Cov}[\mathbf{x}, \mathbf{x}]$, the covariance of $\mathbf{x}$ with itself; its diagonal entries are the variances of the individual components.
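The matrix form can be sketched with plain lists by averaging outer products of centered vectors. This is a sample-based estimate that uses the $1/N$ normalization of the sample-mean approximation; the data and the helper names `mean_vector` and `cov_matrix` are made up for illustration.

```python
def mean_vector(samples):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def cov_matrix(xs, ys):
    """Sample estimate of Cov[x, y]: the average outer product of the
    centered vectors, using 1/N normalization."""
    n = len(xs)
    mx, my = mean_vector(xs), mean_vector(ys)
    return [
        [sum((x[i] - mx[i]) * (y[j] - my[j]) for x, y in zip(xs, ys)) / n
         for j in range(len(my))]
        for i in range(len(mx))
    ]

# Hypothetical observations of a 2-dimensional random vector x.
xs = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 9.0]]

cov_xx = cov_matrix(xs, xs)  # Cov[x] = Cov[x, x]
for row in cov_xx:
    print(row)
```

For $\text{Cov}[\mathbf{x}, \mathbf{x}]$ the result is symmetric: the diagonal holds the variance of each component and the off-diagonal entries hold the covariance between the two components.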

4. Independence and Covariance

Recall that $x$ and $y$ are independent if $p(x, y) = p(x)\,p(y)$. We can show that independence implies zero covariance:

$$\text{Cov}[x,y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$$

For the first term, substitute the independence assumption $p(x,y) = p(x)p(y)$:

$$\mathbb{E}[xy] = \int\!\int xy\, p(x,y)\, dx\, dy = \int\!\int xy\, p(x)\,p(y)\, dx\, dy = \int x\,p(x)\,dx \cdot \int y\,p(y)\,dy = \mathbb{E}[x]\,\mathbb{E}[y]$$

So $\text{Cov}[x,y] = \mathbb{E}[x]\mathbb{E}[y] - \mathbb{E}[x]\mathbb{E}[y] = 0$. $\checkmark$
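The same argument can be checked in the discrete setting, with sums in place of integrals. In the sketch below the two marginals are arbitrary, hypothetical choices; the joint distribution is built as their product, which is exactly the independence assumption.

```python
from fractions import Fraction as F

# Hypothetical marginals p(x) and p(y).
p_x = {-1: F(1, 4), 0: F(1, 4), 2: F(1, 2)}
p_y = {1: F(1, 3), 5: F(2, 3)}

# Independence: the joint factorizes as p(x, y) = p(x) p(y).
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}

def expect(f, joint):
    """E[f(x, y)] = sum over (x, y) of f(x, y) p(x, y)."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

e_x = expect(lambda x, y: x, joint)
e_y = expect(lambda x, y: y, joint)
e_xy = expect(lambda x, y: x * y, joint)

cov = e_xy - e_x * e_y
print(e_xy, e_x * e_y, cov)
```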

Important caveat: independence implies zero covariance, but zero covariance does not imply independence.

Counterexample: Zero Covariance Without Independence

Let $x \sim \text{Uniform}[-1, 1]$ (so $p(x) = \tfrac{1}{2}$) and define $y = x^2$. Clearly $y$ depends on $x$: knowing $x$ determines $y$ exactly. Yet:

$\mathbb{E}[xy] = \int_{-1}^{1} x \cdot x^2 \cdot \tfrac{1}{2}\, dx = \tfrac{1}{2}\int_{-1}^{1} x^3\, dx = 0$ (odd function over a symmetric interval)
$\mathbb{E}[x] = \int_{-1}^{1} x \cdot \tfrac{1}{2}\, dx = 0$ (odd function)

So $\text{Cov}[x, y] = 0 - 0 \cdot \mathbb{E}[y] = 0$, even though $x$ and $y$ are completely dependent. Covariance only captures linear dependence; it misses nonlinear relationships like $y = x^2$.
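The counterexample can also be checked numerically. The sketch below approximates the integrals over $[-1, 1]$ with a midpoint rule (an assumption of this example, not the lecture's method) and adds one extra check of dependence: if $x$ and $y$ were independent, conditioning on $|x| > 0.5$ would leave the expectation of $y$ unchanged.

```python
def expect_uniform(f, a=-1.0, b=1.0, n=20_000):
    """Midpoint-rule approximation of E[f] for x ~ Uniform[a, b]."""
    width = (b - a) / n
    return sum(f(a + (k + 0.5) * width) for k in range(n)) / n

e_x = expect_uniform(lambda x: x)
e_y = expect_uniform(lambda x: x ** 2)        # y = x^2
e_xy = expect_uniform(lambda x: x * x ** 2)   # E[x y] = E[x^3]
cov = e_xy - e_x * e_y

# Dependence check: E[y | |x| > 0.5] = E[y * 1{|x| > 0.5}] / P(|x| > 0.5).
p_large = expect_uniform(lambda x: 1.0 if abs(x) > 0.5 else 0.0)
e_y_given_large_x = (
    expect_uniform(lambda x: x ** 2 if abs(x) > 0.5 else 0.0) / p_large
)

print(f"Cov[x, y] ~ {cov:.2e}")
print(f"E[y] ~ {e_y:.4f}, E[y | |x| > 0.5] ~ {e_y_given_large_x:.4f}")
```

The covariance comes out as zero up to rounding, while the conditional expectation of $y$ is clearly larger than the unconditional one, so information about $x$ does change what we expect of $y$.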