Lecture 8.4
Neural Networks: SGD
Neural network loss surfaces are non-convex, so SGD cannot guarantee the global optimum. Its noisy gradient estimates, however, can help it escape poor local minima. Understanding this trade-off is key to training deep networks in practice.
- Explain why neural network loss functions are non-convex and what this means for optimization.
- Describe how stochastic gradient descent (mini-batch SGD) works and why noisy gradients can help escape local minima.
- Compare full gradient descent and SGD in terms of convergence route, per-step cost, and local minima avoidance.
- Discuss the effect of the learning rate and the motivation for learning rate schedules.
- Explain why results should always be reported across multiple random initializations.
1. Non-Convexity of the Neural Network Loss
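For logistic regression the cross-entropy loss was convex in the weights. For neural networks with hidden layers, the loss is non-convex in the weights: multiple local minima exist, and gradient-based methods are not guaranteed to find the global minimum. One source of non-convexity is symmetry: swapping two hidden units, together with their incoming and outgoing weights, leaves the network function unchanged, so any minimum has equivalent copies elsewhere in weight space. The number of local minima tends to grow with the number of parameters, making the landscape of a large network very complex.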
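A minimal sketch of why hidden layers break convexity, under assumptions not in the lecture: a hypothetical one-hidden-unit network $f(x) = w_2 \tanh(w_1 x)$, a single training point, and squared error. Flipping the sign of both weights leaves the output unchanged, so two distinct weight vectors reach zero loss, yet their midpoint does not, which a convex function would not allow.

```python
import math

def loss(w1, w2, x=1.0, y=1.0):
    """Squared error of the toy network f(x) = w2 * tanh(w1 * x) on one point."""
    return (w2 * math.tanh(w1 * x) - y) ** 2

# Weights that fit the point exactly, and their sign-flipped twin.
w_a = (1.0, 1.0 / math.tanh(1.0))
w_b = (-w_a[0], -w_a[1])

# Midpoint of the straight line between the two solutions (here the origin).
w_mid = ((w_a[0] + w_b[0]) / 2, (w_a[1] + w_b[1]) / 2)

loss_a, loss_b, loss_mid = loss(*w_a), loss(*w_b), loss(*w_mid)
chord = 0.5 * (loss_a + loss_b)  # convexity would require loss_mid <= chord

print(f"loss at w_a:      {loss_a:.3f}")
print(f"loss at w_b:      {loss_b:.3f}")
print(f"loss at midpoint: {loss_mid:.3f} (chord value {chord:.3f})")
```

The midpoint loss is 1 while the chord value is essentially 0, so the convexity inequality fails along this line in weight space.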
2. Why SGD Still Works
The total loss splits into a sum of per-sample losses:
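$$E(\mathbf{w}) = \sum_{n=1}^N E_n(\mathbf{w}).$$
SGD replaces the full gradient $\nabla E$ with an estimate computed from a single data point or a mini-batch of $B$ points:
$$\tilde{\nabla} E = \frac{1}{B}\sum_{n \in \text{batch}} \nabla E_n \approx \frac{1}{N}\nabla E.$$
The batch average estimates $\nabla E$ up to the constant factor $1/N$, which can be absorbed into the learning rate. This noisy gradient estimate has two key advantages over full-batch gradient descent: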
- Efficiency. Computing $\tilde{\nabla}E$ from a mini-batch of $B \ll N$ examples costs $O(B)$ instead of $O(N)$. Early in training, when all gradients roughly point in the same direction, a single data point suffices to identify the descent direction.
- Local minima escape. The gradient varies across data points: at a point where the full gradient $\nabla E = 0$ (for example, a local minimum), individual $\nabla E_n$ may not be zero. The noise in the mini-batch gradient can therefore push the iterate out of a poor local minimum and, possibly, into a better one.
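The escape argument can be checked on a toy case. The sketch below assumes per-sample losses $E_n(w) = \tfrac{1}{2}(w - x_n)^2$ on three hypothetical data points (a convex problem, chosen only because the gradients are easy to read): at the point where the full gradient vanishes, the individual gradients do not, so a mini-batch estimate generally still moves the iterate.

```python
def grad_n(w, x_n):
    """Gradient of the per-sample loss E_n(w) = 0.5 * (w - x_n)**2."""
    return w - x_n

def minibatch_grad(w, batch):
    """Mini-batch estimate: average of the per-sample gradients in the batch."""
    return sum(grad_n(w, x_n) for x_n in batch) / len(batch)

data = [-2.0, 0.5, 1.5]          # hypothetical data points
w_star = sum(data) / len(data)   # stationary point of the full loss (here 0.0)

full_grad = sum(grad_n(w_star, x_n) for x_n in data)
per_sample = [grad_n(w_star, x_n) for x_n in data]

print("full gradient:       ", full_grad)                          # 0.0
print("per-sample gradients:", per_sample)                         # [2.0, -0.5, -1.5]
print("batch of first two:  ", minibatch_grad(w_star, data[:2]))   # 0.75
```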
Full gradient descent follows the steepest-descent path directly but can get trapped in the first local minimum it reaches. SGD takes a noisier, more roundabout route — requiring more iterations — but is more likely to escape shallow local minima. In practice, SGD's per-step cost is so much lower that it reaches a good solution faster in wall-clock time, even with more iterations.
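The update loop itself is short. The following is a minimal mini-batch SGD sketch in plain Python, assuming a hypothetical one-parameter model $y = w x$ with squared-error per-sample losses and noiseless synthetic data; the names `sgd`, `batch_size`, and `eta` are illustrative and not taken from the lecture. Each step touches only $B$ samples, which is where the $O(B)$ per-step cost comes from.

```python
import random

def sgd(xs, ys, eta=0.5, batch_size=10, epochs=10, seed=0):
    """Mini-batch SGD for the toy model y = w * x with E_n = 0.5 * (w*x_n - y_n)**2."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average per-sample gradient over the batch: cost O(B), not O(N).
            grad = sum((w * xs[n] - ys[n]) * xs[n] for n in batch) / len(batch)
            w -= eta * grad
    return w

# Synthetic data from a hypothetical ground truth w = 3.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x for x in xs]

w_hat = sgd(xs, ys)
print(f"estimated w: {w_hat:.4f}")
```

This toy problem is convex, so it illustrates only the mechanics and cost of the update, not the escape from local minima.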
3. The Learning Rate
The learning rate $\eta$ controls the step size:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\tilde{\nabla}E(\mathbf{w}^{(\tau)}).$$
- $\eta$ too small: converges extremely slowly; the steps, noise included, may also be too small to escape local minima.
- $\eta$ too large: overshoots the minimum and diverges or oscillates.
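Both failure modes show up even on the simplest possible loss. The sketch below assumes a one-dimensional quadratic $E(w) = \tfrac{1}{2}w^2$ with exact gradients (no mini-batch noise), for which the update reduces to $w \leftarrow (1-\eta)\,w$; the specific values of $\eta$ are illustrative.

```python
def run_gd(eta, w0=1.0, steps=50):
    """Gradient descent on E(w) = 0.5 * w**2, whose gradient is simply w."""
    w = w0
    for _ in range(steps):
        w = w - eta * w
    return w

for eta in (0.001, 0.5, 2.5):
    print(f"eta = {eta:<5} -> |w| after 50 steps: {abs(run_gd(eta)):.3e}")
```

With these values the smallest rate has barely moved away from $w_0 = 1$, the middle one has converged to the minimum at $w = 0$, and the largest diverges because $|1 - \eta| > 1$.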
A common strategy is to start with a large $\eta$ (fast initial descent) and decrease it over training (fine convergence). Popular schedules include step decay (halve $\eta$ every $k$ epochs), cosine annealing, and warm-up followed by decay. Modern optimizers like Adam and RMSProp maintain per-parameter adaptive learning rates.
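The schedules named above can be written as small functions of the epoch index. This is a sketch under assumptions: the function names and parameters are hypothetical, and pairing linear warm-up with cosine decay is just one possible choice of "warm-up followed by decay".

```python
import math

def step_decay(eta0, epoch, k, factor=0.5):
    """Multiply the rate by `factor` once every k epochs (factor=0.5 halves it)."""
    return eta0 * factor ** (epoch // k)

def cosine_annealing(eta0, epoch, total_epochs, eta_min=0.0):
    """Decay smoothly from eta0 to eta_min along half a cosine period."""
    progress = epoch / total_epochs
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * progress))

def warmup_then_cosine(eta0, epoch, warmup_epochs, total_epochs):
    """Linear warm-up to eta0, then cosine annealing over the remaining epochs."""
    if epoch < warmup_epochs:
        return eta0 * (epoch + 1) / warmup_epochs
    return cosine_annealing(eta0, epoch - warmup_epochs, total_epochs - warmup_epochs)

# Compare the three schedules at a few epochs of a 30-epoch run.
for epoch in (0, 5, 10, 20, 30):
    print(epoch,
          round(step_decay(0.1, epoch, k=10), 4),
          round(cosine_annealing(0.1, epoch, total_epochs=30), 4),
          round(warmup_then_cosine(0.1, epoch, warmup_epochs=5, total_epochs=30), 4))
```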
4. Effect of Initialization
Because the loss is non-convex, the solution SGD converges to depends heavily on where training starts. The same architecture, trained twice with different random initializations $\mathbf{w}^{(0)}$, can end up in very different local minima with different test errors.
Reporting a single training run — especially for a large, complex model — risks claiming a lucky result as the typical performance. Always train with several random seeds and report the mean and standard deviation of the test error. This also gives insight into how stable the training procedure is for a given model complexity.
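A minimal sketch of multi-seed reporting, assuming a hypothetical one-dimensional non-convex loss with two minima of different depth in place of a real network, and using the final training loss as a stand-in for the test error. The function `train` and its parameters are illustrative only.

```python
import random
import statistics

def loss(w):
    """Hypothetical non-convex loss with two minima of different depth."""
    return w**4 - 2 * w**2 + 0.3 * w

def grad(w):
    """Derivative of the loss above."""
    return 4 * w**3 - 4 * w + 0.3

def train(seed, eta=0.01, steps=2000):
    """Gradient descent from a seed-dependent random start; returns the final loss."""
    w = random.Random(seed).uniform(-2.0, 2.0)  # random initialization w^(0)
    for _ in range(steps):
        w -= eta * grad(w)
    return loss(w)

seeds = range(10)
final_losses = [train(seed) for seed in seeds]

print("per-seed loss:", [round(v, 3) for v in final_losses])
print(f"mean = {statistics.mean(final_losses):.3f}, "
      f"std = {statistics.stdev(final_losses):.3f}")
```

Runs that start on different sides of the central bump settle in different minima, so a single seed could report either value; the mean and standard deviation summarize the spread.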