Lecture 7.3
Logistic Regression: SGD
The cross-entropy loss for logistic regression is convex but has no closed-form minimizer. Stochastic gradient descent provides an efficient iterative solution, and the gradient takes a beautifully simple form that mirrors the perceptron update.
- Apply the chain rule to derive the gradient of the cross-entropy loss with respect to $\mathbf{w}$.
- State the logistic regression SGD update rule and interpret its form.
- Connect the update rule to the perceptron algorithm.
- Explain the role of the learning rate $\eta$ and the conditions for convergence.
1. Setup
The model is $y_n = \sigma(\mathbf{w}^\top \boldsymbol{\phi}_n)$ and the per-sample error is
$$E_n(\mathbf{w}) = -t_n \ln y_n - (1-t_n)\ln(1-y_n).$$
The total cross-entropy loss $E = \sum_n E_n$ splits into a sum of per-sample terms, making it a natural candidate for stochastic gradient descent: approximate the full gradient using a single data point (or a small mini-batch) at each step.
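As a concrete reading of the setup, the model and per-sample error can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`sigmoid`, `predict`, `per_sample_error`, `total_error`) are hypothetical, targets are assumed to be coded as 0 or 1, and feature vectors are plain lists.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, branched on the sign of a to avoid overflow in exp."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)

def predict(w, phi):
    """y_n = sigma(w^T phi_n) for a single feature vector phi."""
    return sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))

def per_sample_error(w, phi, t):
    """E_n(w) = -t ln y - (1 - t) ln(1 - y), assuming t is 0 or 1."""
    y = predict(w, phi)
    return -t * math.log(y) - (1 - t) * math.log(1 - y)

def total_error(w, Phi, T):
    """E(w) = sum_n E_n(w): the loss is a plain sum over data points."""
    return sum(per_sample_error(w, phi, t) for phi, t in zip(Phi, T))
```

With $\mathbf{w} = \mathbf{0}$ every prediction is $y_n = 0.5$, so each point contributes $\ln 2$ to the loss regardless of its target. The sketch does not guard against $y_n$ rounding to exactly 0 or 1, which would make the logarithm fail for extreme activations.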
2. Gradient Derivation
Applying the chain rule to $E_n$ with respect to the $j$-th weight:
$$\frac{\partial E_n}{\partial w_j} = \frac{\partial E_n}{\partial y_n}\cdot\frac{\partial y_n}{\partial w_j}.$$
First factor:
$$\frac{\partial E_n}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1-t_n}{1-y_n} = \frac{y_n - t_n}{y_n(1-y_n)}.$$
Second factor (using $\sigma'(a) = \sigma(a)(1-\sigma(a))$):
$$\frac{\partial y_n}{\partial w_j} = y_n(1-y_n)\,\phi_{nj}.$$
Multiplying, the factor $y_n(1-y_n)$ cancels:
$$\frac{\partial E_n}{\partial w_j} = (y_n - t_n)\,\phi_{nj}, \qquad\text{i.e.}\qquad \nabla_{\mathbf{w}} E_n = (y_n - t_n)\,\boldsymbol{\phi}_n.$$
The gradient is the prediction error $(y_n - t_n)$ scaled by the feature vector $\boldsymbol{\phi}_n$. This elegant form is not a coincidence: it holds for any generalized linear model whose activation function is the canonical link function (see Bishop §4.3.6).
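One way to gain confidence in the "prediction error times feature vector" form is to compare it against a finite-difference approximation of the per-sample error. The sketch below does this for one arbitrary point; the helper names and the test values of `w`, `phi`, and `t` are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def error(w, phi, t):
    """Per-sample cross-entropy E_n(w) for a target t in {0, 1}."""
    y = sigmoid(dot(w, phi))
    return -t * math.log(y) - (1 - t) * math.log(1 - y)

def analytic_grad(w, phi, t):
    """Closed form: (y_n - t_n) * phi_n."""
    y = sigmoid(dot(w, phi))
    return [(y - t) * pj for pj in phi]

def numeric_grad(w, phi, t, h=1e-6):
    """Central differences, perturbing one weight at a time."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        grad.append((error(w_plus, phi, t) - error(w_minus, phi, t)) / (2 * h))
    return grad

# Arbitrary illustrative values (not taken from the lecture).
w = [0.3, -0.8, 0.5]
phi = [1.0, 2.0, -1.5]
t = 1

max_gap = max(abs(a - b) for a, b in
              zip(analytic_grad(w, phi, t), numeric_grad(w, phi, t)))
print(max_gap)  # expected to be tiny, limited by floating-point error
```

Agreement at one point is a sanity check rather than a proof, but a mismatch would immediately flag a sign or indexing mistake in the derivation.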
3. The SGD Update Rule
Stochastic gradient descent picks one data point $n$ at a time and updates:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,(y_n - t_n)\,\boldsymbol{\phi}_n,$$
where $\eta > 0$ is the learning rate. The update adds a multiple of $\boldsymbol{\phi}_n$ weighted by the signed error: if the model over-predicts ($y_n > t_n$), $\mathbf{w}$ is pushed away from $\boldsymbol{\phi}_n$; if it under-predicts, $\mathbf{w}$ is pulled toward $\boldsymbol{\phi}_n$.
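The update rule translates almost line for line into code. The following is a minimal sketch under stated assumptions: a constant learning rate, one pass over a shuffled dataset per epoch, 0/1 targets, and a tiny made-up 1-D dataset whose first feature is a constant 1 acting as a bias. The names `sgd_step` and `sgd_train` are hypothetical.

```python
import math
import random

def sigmoid(a):
    """Logistic sigmoid, branched on the sign of a to avoid overflow in exp."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)

def sgd_step(w, phi, t, eta):
    """One update: w <- w - eta * (y_n - t_n) * phi_n."""
    y = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
    return [wj - eta * (y - t) * pj for wj, pj in zip(w, phi)]

def sgd_train(Phi, T, eta=0.1, epochs=200, seed=0):
    """Visit the points in a fresh random order each epoch, one at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(Phi[0])
    order = list(range(len(Phi)))
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            w = sgd_step(w, Phi[n], T[n], eta)
    return w

# Made-up, non-separable toy data: phi = [1, x], larger x tends to mean t = 1.
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
T = [0, 0, 1, 0, 1, 1]

w_hat = sgd_train(Phi, T)
print(w_hat)
```

The toy data are deliberately not linearly separable, so the minimizer has finite weights. With a constant $\eta$ the iterates hover near that minimizer rather than settling exactly on it.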
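The perceptron update (Lecture 6.5) is $\mathbf{w} \leftarrow \mathbf{w} + \eta\,t_n\,\boldsymbol{\phi}_n$, applied only to misclassified points; note that this form presumes targets coded as $t_n \in \{-1, +1\}$, whereas the cross-entropy loss here uses $t_n \in \{0, 1\}$. The logistic regression update applies to every point, weighted continuously by the prediction error $(y_n - t_n) \in (-1, 1)$, so confidently correct points barely move $\mathbf{w}$. In this sense it is a soft, probabilistic version of the perceptron.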
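The contrast between the two updates can be made concrete with a side-by-side sketch. It assumes the usual conventions: perceptron targets in $\{-1, +1\}$, logistic targets in $\{0, 1\}$, and a point with $\mathbf{w}^\top\boldsymbol{\phi}_n = 0$ counted as misclassified. The function names and numbers are illustrative only.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def perceptron_step(w, phi, t_pm, eta=1.0):
    """Hard update: change w only if the point is misclassified.
    Assumes t_pm is -1 or +1."""
    if dot(w, phi) * t_pm <= 0:
        return [wj + eta * t_pm * pj for wj, pj in zip(w, phi)]
    return list(w)

def logistic_step(w, phi, t01, eta=1.0):
    """Soft update: every point moves w, scaled by the error (y - t).
    Assumes t01 is 0 or 1."""
    y = sigmoid(dot(w, phi))
    return [wj - eta * (y - t01) * pj for wj, pj in zip(w, phi)]

w = [1.0, 0.0]
phi = [2.0, 1.0]  # w^T phi = 2, so the model leans towards the positive class

# Point truly positive: the perceptron does nothing, logistic nudges slightly.
same = perceptron_step(w, phi, +1)
nudged = logistic_step(w, phi, 1)

# Point truly negative: both move w away from phi, logistic by (y - 0) < 1.
flipped = perceptron_step(w, phi, -1)
softened = logistic_step(w, phi, 0)
```

For the correctly classified point the logistic step has size $1 - \sigma(2) \approx 0.12$ times $\boldsymbol{\phi}_n$, while for the misclassified one it is $\sigma(2) \approx 0.88$ times $\boldsymbol{\phi}_n$, approaching the perceptron's full step as the mistake becomes more confident.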
4. Convergence and the Learning Rate
Because $E(\mathbf{w})$ is convex, every local minimum is a global minimum, so SGD converges to the global minimum provided $\eta$ is chosen appropriately:
- Too small $\eta$: convergence is very slow; many iterations needed.
- Too large $\eta$: oscillates around the minimum and may never converge.
- Practical strategy: start with a moderate $\eta$ and reduce it over time (a learning rate schedule), taking large steps early for rapid descent and small steps later for fine convergence.
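The "reduce it over time" strategy can be sketched with an inverse-time decay, $\eta_k = \eta_0 / (1 + k/\tau)$, where $k$ counts updates. This particular schedule and the names `inverse_time_decay`, `eta0`, and `tau` are assumptions chosen for illustration; the lecture does not prescribe a specific schedule.

```python
def inverse_time_decay(eta0, tau):
    """Return a function k -> eta_k = eta0 / (1 + k / tau).

    eta0 is the initial learning rate; tau is the number of updates
    after which the rate has halved. Both are illustrative knobs.
    """
    def eta(k):
        return eta0 / (1.0 + k / tau)
    return eta

schedule = inverse_time_decay(eta0=0.5, tau=100.0)

# Learning rate at the start, after 100 updates, and after 1000 updates.
rates = [schedule(k) for k in (0, 100, 1000)]
print(rates)
```

In an SGD loop, the constant $\eta$ would be replaced by `schedule(k)` with `k` incremented after every update. A schedule that decays like $1/k$ shrinks slowly enough to keep making progress yet fast enough to damp the oscillation described above, which is why this family is a common default.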
Despite convexity, the cross-entropy loss has no analytical minimizer because $y_n = \sigma(\mathbf{w}^\top\boldsymbol{\phi}_n)$ is nonlinear in $\mathbf{w}$. SGD solves this iteratively. Lecture 7.4 presents Newton-Raphson, a second-order method that converges in fewer steps without requiring a learning rate.