Lecture 8.5
Neural Networks: Backpropagation
Backpropagation is the algorithm that makes SGD tractable for neural networks: it computes all weight gradients in a single backward pass through the network using the multidimensional chain rule, at a cost of the same order as one forward pass.
- State the multidimensional chain rule and apply it in a network context.
- Define node errors $\delta_j = \partial E / \partial a_j$ and derive the formula $\partial E/\partial w_{ji} = \delta_j\, z_i$.
- Derive the backpropagation recurrence: $\delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k$.
- Describe the full forward–backward training loop.
- Compute node errors at the output layer for MSE and cross-entropy losses.
1. The Multidimensional Chain Rule
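When $E$ depends on several intermediate variables $\{g_d\}$, each of which depends on a common parameter $x$, the derivative of $E$ with respect to $x$ sums the contributions along every path:
$$\frac{\partial E}{\partial x} = \sum_d \frac{\partial E}{\partial g_d}\cdot\frac{\partial g_d}{\partial x}.$$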
In a neural network, $x$ might be a weight $w_{ji}$ and the $g_d$ are the downstream activations that $w_{ji}$ influences.
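A small numerical check can make the rule concrete. The sketch below assumes an illustrative pair of intermediate variables, $g_1 = x^2$ and $g_2 = \sin x$, with $E = g_1 g_2$. These functions are not from the lecture and were chosen only because their derivatives are easy to write down. The chain-rule sum is compared with a central finite difference.

```python
import math

def g(x):
    # Two intermediate variables that both depend on x (illustrative choice).
    return [x ** 2, math.sin(x)]

def dg_dx(x):
    # Derivatives of g_1 and g_2 with respect to x.
    return [2 * x, math.cos(x)]

def E(gs):
    # E depends on x only through g_1 and g_2.
    return gs[0] * gs[1]

def dE_dg(gs):
    # Partial derivatives of E with respect to g_1 and g_2.
    return [gs[1], gs[0]]

def chain_rule_grad(x):
    # dE/dx = sum_d (dE/dg_d) * (dg_d/dx)
    gs = g(x)
    return sum(a * b for a, b in zip(dE_dg(gs), dg_dx(x)))

def numeric_grad(x, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (E(g(x + eps)) - E(g(x - eps))) / (2 * eps)

x0 = 0.7
print(chain_rule_grad(x0), numeric_grad(x0))
```

The two printed values should agree to several decimal places. The same kind of check is a common way to validate a hand-written backward pass.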
2. Node Errors
Because weight $w_{ji}$ contributes to $E$ only through activation $a_j$:
$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial a_j}\cdot\frac{\partial a_j}{\partial w_{ji}}.$$
The activation $a_j = \sum_i w_{ji}\,z_i$ is linear in $w_{ji}$, so $\partial a_j / \partial w_{ji} = z_i$. Defining the node error $\delta_j \equiv \partial E/\partial a_j$ gives
$$\frac{\partial E}{\partial w_{ji}} = \delta_j\, z_i.$$
All weight gradients reduce to a product of the node error at the receiving unit and the activation at the sending unit. If all $\delta_j$ are known, weight updates follow immediately.
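As a rough illustration of $\partial E/\partial w_{ji} = \delta_j\, z_i$, the sketch below builds the full gradient matrix for a single layer as an outer product of node errors and incoming activations. It assumes an identity output and the loss $E = \tfrac{1}{2}\sum_j (a_j - t_j)^2$, so that $\delta_j = a_j - t_j$. The weights, inputs and targets are made-up values.

```python
def activations(W, z):
    # a_j = sum_i w_ji * z_i
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]

def error(W, z, t):
    # Assumed loss for this sketch: E = 1/2 * sum_j (a_j - t_j)^2
    return 0.5 * sum((a - tj) ** 2 for a, tj in zip(activations(W, z), t))

def weight_grads(W, z, t):
    # With the assumed loss, delta_j = dE/da_j = a_j - t_j.
    delta = [a - tj for a, tj in zip(activations(W, z), t)]
    # dE/dw_ji = delta_j * z_i (outer product of delta and z).
    return [[dj * zi for zi in z] for dj in delta]

W = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]  # made-up weights
z = [1.0, 2.0, -1.0]                      # made-up incoming activations
t = [0.5, -0.5]                           # made-up targets
grads = weight_grads(W, z, t)

# Finite-difference check on a single weight (j=1, i=2).
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[1][2] += eps
W_minus[1][2] -= eps
numeric = (error(W_plus, z, t) - error(W_minus, z, t)) / (2 * eps)
print(grads[1][2], numeric)
```

Once the vector of node errors is available, no further differentiation is needed: each gradient entry is a single multiplication.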
3. Backpropagation Recurrence
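The error $E$ depends on $a_j$ only through the activations $\{a_k\}$ of the units that receive input from unit $j$. Since $a_k = \sum_j w_{kj}\,h(a_j)$, we have $\partial a_k/\partial a_j = w_{kj}\,h'(a_j)$, and the multidimensional chain rule gives
$$\delta_j = \frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k}\cdot\frac{\partial a_k}{\partial a_j} = \sum_k \delta_k \cdot w_{kj}\, h'(a_j).$$
Factoring out the local derivative, which does not depend on $k$:
$$\delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k.$$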
The node error at layer $\ell$ is the activation-function derivative at $a_j$ times the weighted sum of node errors at layer $\ell+1$. Starting from the output errors and propagating backward gives all $\delta_j$ in one pass.
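One step of the recurrence $\delta_j = h'(a_j)\sum_k w_{kj}\,\delta_k$ might be sketched as follows. The function name `backprop_deltas`, the list-of-lists weight layout `W_next[k][j]`, and the numbers are assumptions made for illustration; $\tanh$ is used as the activation.

```python
import math

def backprop_deltas(a_hidden, W_next, delta_next, h_prime):
    """Return delta_j = h'(a_j) * sum_k W_next[k][j] * delta_next[k] for every j."""
    deltas = []
    for j, a_j in enumerate(a_hidden):
        # Weighted sum of the node errors in the next layer.
        back_sum = sum(W_next[k][j] * delta_next[k] for k in range(len(delta_next)))
        # Scale by the local activation-function derivative.
        deltas.append(h_prime(a_j) * back_sum)
    return deltas

def tanh_prime(a):
    # Derivative of tanh: 1 - tanh(a)^2
    return 1.0 - math.tanh(a) ** 2

a_hidden = [0.0, 0.5]                 # made-up pre-activations at layer l
W_next = [[1.0, -2.0], [0.5, 0.25]]   # W_next[k][j]: weight from unit j to unit k
delta_next = [0.2, 0.4]               # made-up node errors at layer l+1
deltas_hidden = backprop_deltas(a_hidden, W_next, delta_next, tanh_prime)
print(deltas_hidden)
```

Note that the sum runs over the first index of `W_next`, which is the transpose of the indexing used in the forward pass. Applying the function repeatedly, from the last hidden layer back to the first, yields every node error.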
4. Output Node Errors
At the output layer, $\delta_k = \partial E/\partial a_k^{\text{out}}$ can be computed directly:
- MSE (regression, identity output), with $E = \tfrac{1}{2}\sum_k (y_k - t_k)^2$: $\delta_k = y_k - t_k$.
- Binary cross-entropy (sigmoid output): $\delta = y - t$. The form is the same because the sigmoid derivative $y(1-y)$ cancels the denominator of the log-loss gradient $(y - t)/\bigl(y(1-y)\bigr)$.
- Multi-class cross-entropy (softmax output): $\delta_k = y_k - t_k$.
In all three cases, the output node error is simply the prediction error. This follows from pairing each output activation with its matching loss (the canonical link function).
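The softmax case can be checked numerically. The sketch below assumes the loss $E = -\sum_k t_k \log y_k$ with $y = \mathrm{softmax}(a)$ and a one-hot target, and compares $y_k - t_k$ with a central finite-difference estimate of $\partial E/\partial a_k$. The activation values are made up.

```python
import math

def softmax(a):
    m = max(a)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(a, t):
    # E = -sum_k t_k * log y_k, with y = softmax(a)
    return -sum(tk * math.log(yk) for tk, yk in zip(t, softmax(a)))

a = [1.0, -0.5, 0.3]   # made-up output activations
t = [0.0, 1.0, 0.0]    # one-hot target

# Closed-form output node errors: delta_k = y_k - t_k
analytic = [yk - tk for yk, tk in zip(softmax(a), t)]

# Finite-difference estimate of dE/da_k for each output unit.
eps = 1e-6
numeric = []
for k in range(len(a)):
    a_plus = a[:]
    a_minus = a[:]
    a_plus[k] += eps
    a_minus[k] -= eps
    numeric.append((cross_entropy(a_plus, t) - cross_entropy(a_minus, t)) / (2 * eps))

print(analytic)
print(numeric)
```

The two lists should match closely. The entries of `analytic` also sum to zero, since both $y$ and $t$ sum to one.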
5. The Full Training Loop
- Forward pass: propagate input $\mathbf{x}_n$ through the network layer by layer, computing and storing the activations $a_j^{(\ell)}$ and hidden-unit outputs $z_j^{(\ell)} = h(a_j^{(\ell)})$ at every node.
- Output errors: compute $\delta_k = y_{nk} - t_{nk}$ at the output layer, where $y_{nk}$ is the network output for input $\mathbf{x}_n$.
- Backward pass: propagate $\delta_j = h'(a_j)\sum_k w_{kj}\delta_k$ layer by layer from output to input.
- Gradient assembly: $\partial E_n / \partial w_{ji} = \delta_j\, z_i$ for every weight.
- Weight update (SGD): $w_{ji} \leftarrow w_{ji} - \eta\,\delta_j\, z_i$.
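Repeat for each training example until convergence. With mini-batches, sum or average the per-example gradients $\delta_j\, z_i$ over the batch before each update.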
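The steps above can be sketched end to end for a tiny network. This is a minimal illustration under stated assumptions, not a reference implementation: one $\tanh$ hidden layer, an identity output with the half-squared-error loss, per-example SGD, a constant input of 1.0 standing in for a bias, and made-up data, initial weights and learning rate.

```python
import math

def forward(x, W1, W2):
    # Hidden layer: a_j = sum_i w_ji x_i, z_j = tanh(a_j).
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # Output layer with identity activation: y_k = sum_j w_kj z_j.
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y

def sgd_step(x, t, W1, W2, eta):
    z, y = forward(x, W1, W2)                        # forward pass
    d2 = [yk - tk for yk, tk in zip(y, t)]           # output errors
    d1 = [(1.0 - zj ** 2) * sum(W2[k][j] * d2[k] for k in range(len(d2)))
          for j, zj in enumerate(z)]                 # backward pass (uses old W2)
    for k, row in enumerate(W2):                     # gradient delta_k * z_j, then update
        for j in range(len(row)):
            row[j] -= eta * d2[k] * z[j]
    for j, row in enumerate(W1):                     # gradient delta_j * x_i, then update
        for i in range(len(row)):
            row[i] -= eta * d1[j] * x[i]

def total_loss(data, W1, W2):
    total = 0.0
    for x, t in data:
        _, y = forward(x, W1, W2)
        total += 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return total

# Made-up regression data: target 0.5 * x; the constant 1.0 input acts as a bias.
data = [([x, 1.0], [0.5 * x]) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
W1 = [[0.5, 0.1], [-0.3, 0.2], [0.2, -0.4]]  # made-up initial weights
W2 = [[0.1, -0.2, 0.3]]

initial_loss = total_loss(data, W1, W2)
for epoch in range(500):
    for x, t in data:
        sgd_step(x, t, W1, W2, eta=0.05)
final_loss = total_loss(data, W1, W2)
print(initial_loss, final_loss)
```

The hidden-layer errors are computed before the output weights are modified, so the backward pass uses the same weights as the forward pass. Updating first would mix old and new parameters within a single gradient.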
For a two-layer network with $\tanh$ hidden units ($h(a)=\tanh(a)$, $h'(a)=1-\tanh^2(a) = 1-z^2$) and MSE output ($\delta_k = y_k - t_k$), the hidden-layer node errors are
$$\delta_j^{(1)} = (1 - z_j^2)\sum_k w_{kj}^{(2)}\,\delta_k^{(2)}.$$
Weight gradients: $\partial E/\partial w_{kj}^{(2)} = \delta_k^{(2)}\,z_j$ and $\partial E/\partial w_{ji}^{(1)} = \delta_j^{(1)}\,x_i$. The only requirements are the activations stored during the forward pass and the recurrence above.
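The two-layer formulas can be verified with a gradient check. The sketch below assumes the loss $E = \tfrac{1}{2}\sum_k (y_k - t_k)^2$ and no bias terms, and compares every backpropagated gradient with a central finite difference. The helper names (`backprop_grads`, `numeric_grad`) and all numerical values are made up for the example.

```python
import math

def forward(x, W1, W2):
    # z_j = tanh(sum_i w1_ji x_i); y_k = sum_j w2_kj z_j (identity output).
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y

def loss(x, t, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop_grads(x, t, W1, W2):
    z, y = forward(x, W1, W2)
    d2 = [yk - tk for yk, tk in zip(y, t)]                   # delta^(2)_k
    d1 = [(1.0 - zj ** 2) * sum(W2[k][j] * d2[k] for k in range(len(d2)))
          for j, zj in enumerate(z)]                         # delta^(1)_j
    gW2 = [[dk * zj for zj in z] for dk in d2]               # delta^(2)_k * z_j
    gW1 = [[dj * xi for xi in x] for dj in d1]               # delta^(1)_j * x_i
    return gW1, gW2

def numeric_grad(x, t, W1, W2, W, r, c, eps=1e-6):
    # W is either W1 or W2; perturb one entry in place, then restore it.
    old = W[r][c]
    W[r][c] = old + eps
    up = loss(x, t, W1, W2)
    W[r][c] = old - eps
    down = loss(x, t, W1, W2)
    W[r][c] = old
    return (up - down) / (2 * eps)

x = [0.5, -1.0]                    # made-up input
t = [0.2]                          # made-up target
W1 = [[0.3, -0.1], [0.2, 0.4]]     # made-up first-layer weights
W2 = [[0.5, -0.6]]                 # made-up second-layer weights

gW1, gW2 = backprop_grads(x, t, W1, W2)
max_err = 0.0
for W, g in ((W1, gW1), (W2, gW2)):
    for r in range(len(W)):
        for c in range(len(W[r])):
            diff = abs(g[r][c] - numeric_grad(x, t, W1, W2, W, r, c))
            max_err = max(max_err, diff)
print(max_err)
```

The finite-difference check needs two forward passes per weight, whereas backpropagation obtains all gradients from one forward and one backward pass. For that reason the check is normally used only for debugging.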