Lecture 3.3

Stochastic Gradient Descent

Learning Objectives

After this lecture you should be able to:

  • Explain why the closed-form MLE solution can be computationally expensive, and identify the bottleneck (matrix inversion, $O(M^3)$).
  • State three key properties of the gradient: it encodes all directional derivatives via the dot product; it is perpendicular to the iso-contours of a function; it points in the direction of steepest ascent.
  • Describe the gradient descent algorithm: initialize $\mathbf{w}$, then iteratively step in the direction of the negative gradient until convergence.
  • Explain the "stochastic" part of SGD: approximate the full gradient using a single data point (or mini-batch) at a time.
  • Write down the SGD update rule for the regression SSE and explain the role of the learning rate $\eta$.
  • State the convergence guarantee: if the objective is convex and $\eta$ is small enough (in general, decreasing over time), SGD converges to the global optimum.

Lecture 3.2 derived a one-step closed-form solution for $\mathbf{w}_{ML}$. This is elegant but requires inverting an $M \times M$ matrix — an $O(M^3)$ operation that becomes prohibitive as the number of basis functions grows, and requires all data to be in memory at once. Stochastic gradient descent (SGD) is an iterative alternative that avoids both problems.
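To make the bottleneck concrete, here is a minimal sketch of the closed-form route. The polynomial basis, the synthetic data, and the helper names `design_matrix` and `closed_form_w` are assumptions made for illustration and do not come from the lecture. The sketch solves the $M \times M$ linear system rather than forming an explicit inverse; the cubic cost in $M$ is the same.

```python
import numpy as np

def design_matrix(x, M):
    """Hypothetical polynomial basis: phi_j(x) = x**j for j = 0..M-1."""
    return np.vander(x, M, increasing=True)

def closed_form_w(Phi, t):
    """Solve the normal equations (Phi^T Phi) w = Phi^T t.

    Forming Phi^T Phi costs O(N M^2); solving the M x M system costs O(M^3).
    """
    A = Phi.T @ Phi   # M x M matrix, needs the whole design matrix
    b = Phi.T @ t     # length-M vector
    return np.linalg.solve(A, b)

# Synthetic data (assumed): quadratic signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
t = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(200)

Phi = design_matrix(x, M=3)
w_ml = closed_form_w(Phi, t)
```

With $M = 3$ the solve is trivial; the point is that both `A` and the solve grow quickly with $M$, and `Phi` must be available in full.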

1. Properties of the Gradient

Before describing the algorithm, it helps to recall three fundamental properties of the gradient $\nabla_\mathbf{w} E$:

Three Properties of the Gradient
  1. Encodes directional derivatives. The rate of change of $E$ along a unit direction $\mathbf{v}$ is $(\nabla_\mathbf{w} E)\,\mathbf{v}$, the dot product of the gradient (a row vector) with the direction vector.
  2. Perpendicular to iso-contours. The gradient is always orthogonal to the level sets of $E$ (curves of constant error).
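  3. Steepest ascent. By property 1, the rate of change along a unit vector $\mathbf{v}$ is $\|\nabla_\mathbf{w} E\|\cos\theta$, where $\theta$ is the angle between $\mathbf{v}$ and the gradient, so it is largest when $\mathbf{v}$ is aligned with $\nabla_\mathbf{w} E$. The gradient therefore points in the direction of maximum increase of $E$, and the negative gradient $-\nabla_\mathbf{w} E$ points in the direction of steepest descent.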
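The three properties can be checked numerically on a small example. The sketch below uses a hypothetical two-dimensional quadratic error `E` chosen only for illustration; it is not the lecture's regression error.

```python
import numpy as np

def E(w):
    """Hypothetical convex quadratic error, used only for illustration."""
    return 0.5 * (3.0 * w[0] ** 2 + w[1] ** 2)

def grad_E(w):
    """Analytic gradient of E."""
    return np.array([3.0 * w[0], w[1]])

w = np.array([1.0, 2.0])
g = grad_E(w)

# Property 1: the directional derivative along a unit vector v equals g . v.
v = np.array([1.0, 1.0]) / np.sqrt(2.0)
h = 1e-6
finite_diff = (E(w + h * v) - E(w - h * v)) / (2 * h)
directional = g @ v

# Property 2: the level set through w is the ellipse 3*w0^2 + w1^2 = c.
# Parametrise it as (a*cos(th), b*sin(th)) and differentiate w.r.t. th
# to get a tangent vector; the gradient should be orthogonal to it.
c = 2.0 * E(w)
a, b = np.sqrt(c / 3.0), np.sqrt(c)
th = np.arctan2(w[1] / b, w[0] / a)
tangent = np.array([-a * np.sin(th), b * np.cos(th)])
tangent_rate = g @ tangent

# Property 3: among many unit directions, g . v is largest near g / |g|.
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
best = dirs[np.argmax(dirs @ g)]
steepest = g / np.linalg.norm(g)
```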

2. Gradient Descent

The idea is to follow the downhill direction of the error landscape iteratively. Starting from an initial guess $\mathbf{w}^{(0)}$, each step moves against the gradient:

Gradient Descent Update $$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla_\mathbf{w} E\big|_{\mathbf{w}^{(\tau)}}$$

$\eta > 0$ is the learning rate — it controls the step size. Too large and the iterate overshoots; too small and convergence is slow. For a convex objective (which SSE is), gradient descent with a sufficiently small $\eta$ converges to the global minimum.
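The effect of the learning rate is easiest to see in one dimension. The sketch below assumes a hypothetical error $E(w) = \tfrac{1}{2} c\, w^2$ with curvature $c = 2$, for which each update multiplies $w$ by $(1 - \eta c)$; the function name `run_gd` and all values are illustrative.

```python
def run_gd(eta, n_steps=20, w0=1.0, curvature=2.0):
    """Gradient descent on the hypothetical 1-D error E(w) = 0.5 * curvature * w**2."""
    w = w0
    for _ in range(n_steps):
        grad = curvature * w      # dE/dw
        w = w - eta * grad        # gradient descent update
    return w

w_small = run_gd(eta=0.01)   # factor 0.98 per step: converges, but slowly
w_good = run_gd(eta=0.4)     # factor 0.2 per step: converges quickly
w_large = run_gd(eta=1.1)    # factor -1.2 per step: overshoots and diverges
```

In this toy case the iterate converges only when $|1 - \eta c| < 1$, that is, when $\eta < 2/c$, which is one concrete reading of "sufficiently small".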

The full gradient of the SSE uses all $N$ data points:

$$\nabla_\mathbf{w} E = -\sum_{i=1}^{N}\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)\,\boldsymbol{\phi}(\mathbf{x}_i)^\top$$
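A minimal sketch of full-batch gradient descent using this gradient follows. It assumes $E(\mathbf{w}) = \tfrac{1}{2}\sum_i (t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i))^2$, which is the scaling consistent with the formula above. The basis $[1, x]$, the synthetic data, and the values of `eta` and `n_steps` are illustrative assumptions.

```python
import numpy as np

def sse_gradient(w, Phi, t):
    """Full-batch gradient of E(w) = 0.5 * sum_i (t_i - w^T phi_i)^2."""
    residual = t - Phi @ w       # shape (N,)
    return -Phi.T @ residual     # shape (M,), stored as a 1-D array

def gradient_descent(Phi, t, eta, n_steps):
    """Start from w = 0 and repeatedly step against the gradient."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        w = w - eta * sse_gradient(w, Phi, t)
    return w

# Synthetic data (assumed): linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
t = 0.5 - 1.5 * x + 0.1 * rng.standard_normal(100)
Phi = np.column_stack([np.ones_like(x), x])   # hypothetical basis: [1, x]

w_gd = gradient_descent(Phi, t, eta=0.005, n_steps=2000)
w_ref = np.linalg.lstsq(Phi, t, rcond=None)[0]   # least-squares reference
```

Each call to `sse_gradient` touches all $N$ rows of `Phi`, which is the per-step cost that the stochastic variant avoids.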

3. Stochastic Gradient Descent

Computing the full gradient requires a pass over the entire dataset at every step. The stochastic approximation uses just one data point at a time — replacing the exact gradient with a noisy estimate.

SGD Update Rule (single data point) $$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\,\bigl(t_i - \mathbf{w}^{(\tau)\top}\boldsymbol{\phi}(\mathbf{x}_i)\bigr)\,\boldsymbol{\phi}(\mathbf{x}_i)$$

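At each step, pick one data point $(\mathbf{x}_i, t_i)$, compute the residual $t_i - \mathbf{w}^{(\tau)\top}\boldsymbol{\phi}(\mathbf{x}_i)$, and nudge $\mathbf{w}$ in the direction that reduces the error for that point. This is the gradient descent update with the sum over all $N$ points replaced by its $i$-th term, written as a column vector to match $\mathbf{w}$.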

Why it works. Each single-point gradient is a noisy estimate of the true gradient. The noise means the iterate does not descend perfectly, but on average it moves downhill. For a convex objective and a sufficiently small learning rate, SGD converges to the global optimum; with a fixed $\eta$ the iterates typically settle into a small neighbourhood of the optimum, and removing the residual noise requires decreasing $\eta$ over time. One full pass through the dataset is called an epoch; training runs for many epochs until convergence.
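The update rule and the notion of an epoch can be sketched as follows. Shuffling the visiting order each epoch, the basis $[1, x]$, the synthetic data, and the values of `eta` and `n_epochs` are assumptions for illustration; the lecture only specifies the per-point update.

```python
import numpy as np

def sgd(Phi, t, eta, n_epochs, rng):
    """Single-point SGD for the SSE; one epoch visits every point once."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for i in rng.permutation(N):          # shuffled order: an assumption
            residual = t[i] - w @ Phi[i]      # t_i - w^T phi(x_i)
            w = w + eta * residual * Phi[i]   # O(M) work per data point
    return w

# Synthetic data (assumed): linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
t = 0.5 - 1.5 * x + 0.1 * rng.standard_normal(500)
Phi = np.column_stack([np.ones_like(x), x])   # hypothetical basis: [1, x]

w_sgd = sgd(Phi, t, eta=0.01, n_epochs=50, rng=rng)
w_ref = np.linalg.lstsq(Phi, t, rcond=None)[0]   # least-squares reference
```

With the fixed `eta` used here, `w_sgd` lands close to `w_ref` but not exactly on it, since each update still carries single-point noise.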

4. Closed Form vs. SGD

Comparison
| | Closed Form ($\boldsymbol{\Phi}^+\mathbf{t}$) | SGD |
|---|---|---|
| Steps | One | Many iterations |
| Cost per step | $O(NM^2 + M^3)$ | $O(M)$ per data point |
| Memory | All data at once | One point at a time |
| Solution | Exact (up to numerics) | Approximate, converges |
| Scales to | Small/medium $M$ | Large $M$, large $N$ |

SGD is the workhorse optimizer in modern deep learning, where models have millions of parameters and datasets have millions of examples — making the closed-form solution completely infeasible.