Lecture 8.5

Neural Networks: Backpropagation

Backpropagation is the algorithm that makes SGD practical for neural networks: it computes all weight gradients in a single backward pass through the network using the multidimensional chain rule, at a cost comparable to that of one forward pass (both scale linearly with the number of weights).

Learning Objectives
  • State the multidimensional chain rule and apply it in a network context.
  • Define node errors $\delta_j = \partial E / \partial a_j$ and derive the formula $\partial E/\partial w_{ji} = \delta_j\, z_i$.
  • Derive the backpropagation recurrence: $\delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k$.
  • Describe the full forward–backward training loop.
  • Compute node errors at the output layer for MSE and cross-entropy losses.

1. The Multidimensional Chain Rule

When $E$ depends on a parameter $x$ only through several intermediate variables $\{g_d\}$, each of which depends on $x$, the derivative sums the contributions along every path from $x$ to $E$:

Multidimensional Chain Rule $$\frac{\partial E}{\partial x} = \sum_d \frac{\partial E}{\partial g_d}\cdot\frac{\partial g_d}{\partial x}.$$

In a neural network, $x$ might be a weight $w_{ji}$ and the $g_d$ are the downstream activations that $w_{ji}$ influences.
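As a quick sanity check, the rule can be compared against a numerical derivative. The sketch below is illustrative only: the functions $g_1 = \sin x$, $g_2 = x^2$ and $E = g_1 g_2$ are assumptions chosen for the example, not part of the lecture.

```python
import math

# Hypothetical example: E(g1, g2) = g1 * g2 with g1 = sin(x), g2 = x**2.
def g1(x):
    return math.sin(x)

def g2(x):
    return x ** 2

def E(x):
    return g1(x) * g2(x)

def dE_dx_chain(x):
    # Sum over intermediate variables: dE/dg_d * dg_d/dx.
    dE_dg1 = g2(x)           # E = g1 * g2, so dE/dg1 = g2
    dE_dg2 = g1(x)           # and dE/dg2 = g1
    dg1_dx = math.cos(x)
    dg2_dx = 2 * x
    return dE_dg1 * dg1_dx + dE_dg2 * dg2_dx

def dE_dx_numeric(x, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (E(x + eps) - E(x - eps)) / (2 * eps)

x0 = 0.7
chain_value = dE_dx_chain(x0)
numeric_value = dE_dx_numeric(x0)
print(abs(chain_value - numeric_value) < 1e-6)
```

The two values agree up to finite-difference error, which is the behaviour the chain rule predicts.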

2. Node Errors

Because weight $w_{ji}$ contributes to $E$ only through activation $a_j$:

$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial a_j}\cdot\frac{\partial a_j}{\partial w_{ji}}.$$

The activation $a_j = \sum_i w_{ji}\,z_i$ is linear in $w_{ji}$, so $\partial a_j / \partial w_{ji} = z_i$. Defining the node error:

Node Error and Weight Gradient $$\delta_j \;\triangleq\; \frac{\partial E}{\partial a_j}, \qquad \frac{\partial E}{\partial w_{ji}} = \delta_j\, z_i.$$

Every weight gradient is therefore the product of the node error $\delta_j$ at the receiving unit and the output $z_i$ of the sending unit. Once all $\delta_j$ are known, the weight updates follow immediately.
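For a whole layer, the formula $\partial E/\partial w_{ji} = \delta_j\, z_i$ amounts to an outer product of the node-error vector and the input vector. A minimal sketch, with made-up values for `delta` and `z` and a hypothetical helper name `weight_gradients`:

```python
def weight_gradients(delta, z):
    """Return grad[j][i] = delta[j] * z[i] for one layer."""
    return [[d_j * z_i for z_i in z] for d_j in delta]

# Illustrative values only: two receiving units, three sending units.
delta = [0.5, -1.0]
z = [1.0, 0.5, -2.0]

grad = weight_gradients(delta, z)
print(grad)  # [[0.5, 0.25, -1.0], [-1.0, -0.5, 2.0]]
```

Row $j$ of the result holds the gradients for all weights feeding unit $j$, matching the shape of the weight matrix itself.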

3. Backpropagation Recurrence

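The activation $a_j$ affects $E$ only through the activations $\{a_k\}$ of the units $k$ that receive input from unit $j$. Since $a_k = \sum_j w_{kj}\,z_j$ with $z_j = h(a_j)$, we have $\partial a_k/\partial a_j = w_{kj}\,h'(a_j)$. By the multidimensional chain rule: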

$$\delta_j = \frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k}\cdot\frac{\partial a_k}{\partial a_j} = \sum_k \delta_k \cdot w_{kj}\, h'(a_j).$$

Factoring out the local derivative:

Backpropagation Recurrence $$\delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k.$$

The node error at layer $\ell$ is the activation-function derivative at $a_j$ times the weighted sum of node errors at layer $\ell+1$. Starting from the output errors and propagating backward gives all $\delta_j$ in one pass.
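One step of the recurrence can be sketched as a small function. The name `backprop_layer`, the choice of $\tanh$ as $h$, and the numbers are assumptions for illustration; `W_next[k][j]` plays the role of $w_{kj}$.

```python
import math

def backprop_layer(a, W_next, delta_next, h_prime):
    """Compute delta_j = h'(a_j) * sum_k W_next[k][j] * delta_next[k]."""
    deltas = []
    for j, a_j in enumerate(a):
        # Weighted sum of the node errors in the next layer.
        back_sum = sum(W_next[k][j] * delta_next[k]
                       for k in range(len(delta_next)))
        deltas.append(h_prime(a_j) * back_sum)
    return deltas

def tanh_prime(a):
    # Derivative of tanh, used here as an example activation.
    return 1.0 - math.tanh(a) ** 2

# Illustrative values: two units in this layer, two units in the next.
a_hidden = [0.0, 0.5]
W_next = [[1.0, -2.0],
          [0.5, 0.25]]
delta_next = [0.1, 0.4]

delta_hidden = backprop_layer(a_hidden, W_next, delta_next, tanh_prime)
print(delta_hidden)
```

Applying this function once per layer, from the output back toward the input, yields every node error in the network.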

4. Output Node Errors

At the output layer, $\delta_k = \partial E/\partial a_k^{\text{out}}$ can be computed directly:

  • MSE (regression, identity output), taking $E = \tfrac{1}{2}\sum_k (y_k - t_k)^2$: $\delta_k = y_k - t_k$.
  • Binary cross-entropy (sigmoid output): $\delta = y - t$ (same compact form, following from the sigmoid derivative and log-loss gradient cancellation).
  • Multi-class cross-entropy (softmax output): $\delta_k = y_k - t_k$.

In all three cases, the output node error is simply the prediction error. This is a consequence of pairing each loss with its matching output activation (the canonical link function).
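The cancellation in the binary case can be written out explicitly. The derivation below assumes the standard forms $y = \sigma(a) = 1/(1+e^{-a})$ and $E = -\left[t\ln y + (1-t)\ln(1-y)\right]$, which the lecture names but does not spell out:

$$\frac{\partial E}{\partial y} = -\frac{t}{y} + \frac{1-t}{1-y} = \frac{y - t}{y(1-y)}, \qquad \frac{\partial y}{\partial a} = y(1-y),$$

$$\delta = \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial a} = \frac{y - t}{y(1-y)}\cdot y(1-y) = y - t.$$

The factor $y(1-y)$ from the sigmoid derivative cancels the denominator of the log-loss gradient, leaving only the prediction error.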

5. The Full Training Loop

Forward–Backward Algorithm
  1. Forward pass: propagate input $\mathbf{x}_n$ through the network layer by layer, computing activations $a_j^{(\ell)}$ and hidden units $z_j^{(\ell)} = h(a_j^{(\ell)})$ at every node.
  2. Output errors: compute $\delta_k = y_{nk} - t_{nk}$ at the output layer, where $y_{nk}$ is the network output for input $\mathbf{x}_n$.
  3. Backward pass: propagate $\delta_j = h'(a_j)\sum_k w_{kj}\delta_k$ layer by layer from output to input.
  4. Gradient assembly: $\partial E_n / \partial w_{ji} = \delta_j\, z_i$ for every weight.
  5. Weight update (SGD): $w_{ji} \leftarrow w_{ji} - \eta\,\delta_j\, z_i$.

Repeat for each training example, or sum (or average) the per-example gradients over each mini-batch before updating, until convergence.
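The five steps above can be sketched in plain Python for a small network. Everything concrete here is an assumption made for illustration: the toy data ($t = \sin x$, with a constant input of 1.0 acting as a bias), the layer sizes, the learning rate, the fixed seed, and a batch size of one.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: inputs [x, 1.0] (constant 1.0 acts as a bias), targets sin(x).
data = [([x / 4.0, 1.0], [math.sin(x / 4.0)]) for x in range(-8, 9)]

n_in, n_hidden, n_out = 2, 3, 1
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
eta = 0.05  # learning rate (assumed value)

def forward(x):
    # tanh hidden layer followed by an identity output layer.
    a1 = [sum(w * x_i for w, x_i in zip(row, x)) for row in W1]
    z1 = [math.tanh(a) for a in a1]
    y = [sum(w * z for w, z in zip(row, z1)) for row in W2]
    return z1, y

def total_loss():
    # Sum-of-squares error over the whole toy data set.
    return sum(0.5 * sum((y_k - t_k) ** 2 for y_k, t_k in zip(forward(x)[1], t))
               for x, t in data)

initial_loss = total_loss()
for epoch in range(200):
    random.shuffle(data)
    for x, t in data:
        z1, y = forward(x)                                    # 1. forward pass
        d2 = [y_k - t_k for y_k, t_k in zip(y, t)]            # 2. output errors
        d1 = [(1 - z1[j] ** 2) * sum(W2[k][j] * d2[k] for k in range(n_out))
              for j in range(n_hidden)]                       # 3. backward pass
        for k in range(n_out):                                # 4-5. gradient and update
            for j in range(n_hidden):
                W2[k][j] -= eta * d2[k] * z1[j]
        for j in range(n_hidden):
            for i in range(n_in):
                W1[j][i] -= eta * d1[j] * x[i]
final_loss = total_loss()

print(initial_loss, final_loss)
```

Note that the hidden errors `d1` are computed before `W2` is updated, so both layers use gradients evaluated at the same weights.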

Two-Layer Network with tanh Hidden Units

For a two-layer network with $\tanh$ hidden units ($h(a)=\tanh(a)$, $h'(a)=1-\tanh^2(a) = 1-z^2$) and MSE output ($\delta_k = y_k - t_k$), the hidden-layer node errors are

$$\delta_j^{(1)} = (1 - z_j^2)\sum_k w_{kj}^{(2)}\,\delta_k^{(2)}.$$

Weight gradients: $\partial E/\partial w_{kj}^{(2)} = \delta_k^{(2)}\,z_j$ and $\partial E/\partial w_{ji}^{(1)} = \delta_j^{(1)}\,x_i$. Only the values stored during the forward pass ($x_i$, $z_j$, $y_k$) and the recurrence above are needed.
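A common way to validate formulas like these is a gradient check against central finite differences. The sketch below assumes $E = \tfrac{1}{2}\sum_k (y_k - t_k)^2$, no bias terms, and arbitrary illustrative weights; the helper names are hypothetical.

```python
import math

def forward(W1, W2, x):
    # tanh hidden units, identity outputs.
    z = [math.tanh(sum(w * x_i for w, x_i in zip(row, x))) for row in W1]
    y = [sum(w * z_j for w, z_j in zip(row, z)) for row in W2]
    return z, y

def error(W1, W2, x, t):
    _, y = forward(W1, W2, x)
    return 0.5 * sum((y_k - t_k) ** 2 for y_k, t_k in zip(y, t))

def backprop_gradients(W1, W2, x, t):
    z, y = forward(W1, W2, x)
    d2 = [y_k - t_k for y_k, t_k in zip(y, t)]                 # output errors
    d1 = [(1 - z[j] ** 2) * sum(W2[k][j] * d2[k] for k in range(len(d2)))
          for j in range(len(z))]                              # hidden errors
    g2 = [[d2_k * z_j for z_j in z] for d2_k in d2]            # dE/dW2
    g1 = [[d1_j * x_i for x_i in x] for d1_j in d1]            # dE/dW1
    return g1, g2

def numeric_gradient(W, j, i, loss, eps=1e-6):
    # Perturb one weight in place, then restore it.
    old = W[j][i]
    W[j][i] = old + eps
    e_plus = loss()
    W[j][i] = old - eps
    e_minus = loss()
    W[j][i] = old
    return (e_plus - e_minus) / (2 * eps)

# Illustrative weights, input, and target.
W1 = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.6]]
W2 = [[0.3, -0.2, 0.5], [-0.1, 0.4, 0.2]]
x = [0.5, -1.0]
t = [0.2, -0.4]

g1, g2 = backprop_gradients(W1, W2, x, t)
loss = lambda: error(W1, W2, x, t)
max_diff = 0.0
for W, g in ((W1, g1), (W2, g2)):
    for j in range(len(W)):
        for i in range(len(W[j])):
            max_diff = max(max_diff, abs(numeric_gradient(W, j, i, loss) - g[j][i]))

print(max_diff < 1e-6)
```

The finite-difference check needs two forward passes per weight, whereas backpropagation obtains all gradients from one forward and one backward pass, which is why the check is useful for testing but not for training.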