Lecture 7.3

Logistic Regression: SGD

The cross-entropy loss for logistic regression is convex but has no closed-form minimizer. Stochastic gradient descent provides an efficient iterative solution, and the gradient takes a beautifully simple form that mirrors the perceptron update.

Learning Objectives

Apply the chain rule to derive the gradient of the cross-entropy loss with respect to $\mathbf{w}$.
State the logistic regression SGD update rule and interpret its form.
Connect the update rule to the perceptron algorithm.
Explain the role of the learning rate $\eta$ and the conditions for convergence.

1. Setup

The model is $y_n = \sigma(\mathbf{w}^\top \boldsymbol{\phi}_n)$ with binary targets $t_n \in \{0, 1\}$, and the per-sample error is

$$E_n(\mathbf{w}) = -t_n \ln y_n - (1-t_n)\ln(1-y_n).$$

The total cross-entropy loss $E = \sum_n E_n$ is a sum of per-sample terms, which makes it a natural candidate for stochastic gradient descent: at each step, approximate the full gradient using a single data point (or a mini-batch).
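As a minimal sketch of the setup, the model and the per-sample error can be written in plain Python. The function names (`sigmoid`, `predict`, `sample_error`, `total_error`) and the `eps` clipping are illustrative assumptions, not part of the lecture.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, branching on the sign of a to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)

def predict(w, phi):
    """y_n = sigma(w^T phi_n) for plain Python lists w and phi."""
    return sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))

def sample_error(w, phi, t, eps=1e-12):
    """E_n(w) = -t ln y - (1 - t) ln(1 - y); eps (an assumption) guards log(0)."""
    y = min(max(predict(w, phi), eps), 1.0 - eps)
    return -t * math.log(y) - (1 - t) * math.log(1.0 - y)

def total_error(w, Phi, T):
    """E(w) = sum_n E_n(w): the loss is a plain sum of per-sample terms."""
    return sum(sample_error(w, phi, t) for phi, t in zip(Phi, T))
```

With $\mathbf{w} = \mathbf{0}$ every prediction is $y_n = 0.5$, so each sample contributes $\ln 2$ to the total regardless of its target.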

2. Gradient Derivation

Applying the chain rule to $E_n$ with respect to the $j$-th weight:

$$\frac{\partial E_n}{\partial w_j} = \frac{\partial E_n}{\partial y_n}\cdot\frac{\partial y_n}{\partial w_j}.$$

First factor:

$$\frac{\partial E_n}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1-t_n}{1-y_n}.$$

Second factor (using $\sigma'(a) = \sigma(a)(1-\sigma(a))$):

$$\frac{\partial y_n}{\partial w_j} = y_n(1-y_n)\,\phi_{nj}.$$

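Multiplying the two factors, the $y_n(1-y_n)$ term cancels both denominators:

$$\frac{\partial E_n}{\partial w_j} = \left[-t_n(1-y_n) + (1-t_n)\,y_n\right]\phi_{nj} = (y_n - t_n)\,\phi_{nj}.$$

Gradient of the Cross-Entropy Loss

$$\frac{\partial E_n}{\partial w_j} = (y_n - t_n)\,\phi_{nj}, \qquad \nabla_{\mathbf{w}} E_n = (y_n - t_n)\,\boldsymbol{\phi}_n.$$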

The gradient is the prediction error $(y_n - t_n)$ scaled by the feature vector $\boldsymbol{\phi}_n$. This elegant form is not a coincidence: it holds for any generalized linear model whose activation function is the canonical link function (see Bishop §4.3.6).
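One way to gain confidence in the result $\nabla_{\mathbf{w}} E_n = (y_n - t_n)\,\boldsymbol{\phi}_n$ is to compare it against central finite differences. The sketch below does this for arbitrary illustrative values of $\mathbf{w}$ and $\boldsymbol{\phi}_n$; the helper names and the step size `h` are assumptions made for the example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_error(w, phi, t):
    """E_n(w) = -t ln y - (1 - t) ln(1 - y) with y = sigma(w^T phi)."""
    y = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
    return -t * math.log(y) - (1 - t) * math.log(1.0 - y)

def analytic_grad(w, phi, t):
    """(y_n - t_n) * phi_n, the closed-form gradient."""
    y = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
    return [(y - t) * pj for pj in phi]

def numeric_grad(w, phi, t, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += h
        w_minus = list(w)
        w_minus[j] -= h
        diff = sample_error(w_plus, phi, t) - sample_error(w_minus, phi, t)
        grad.append(diff / (2 * h))
    return grad

# Arbitrary illustrative values (not from the lecture).
w = [0.3, -0.7, 0.5]
phi = [1.0, 2.0, -1.5]

# Largest disagreement between the two gradients over both target values.
max_gap = max(
    abs(a - b)
    for t in (0, 1)
    for a, b in zip(analytic_grad(w, phi, t), numeric_grad(w, phi, t))
)
print(max_gap)
```

The printed gap should be tiny compared with the gradient entries themselves, which is what the cancellation in the derivation predicts.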

3. The SGD Update Rule

Stochastic gradient descent picks one data point $n$ at a time and updates:

Logistic Regression SGD Update

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,(y_n - t_n)\,\boldsymbol{\phi}_n,$$

where $\eta > 0$ is the learning rate. The update adds a multiple of $\boldsymbol{\phi}_n$ weighted by the signed error: if the model over-predicts ($y_n > t_n$), $\mathbf{w}$ is pushed away from $\boldsymbol{\phi}_n$; if it under-predicts, $\mathbf{w}$ is pulled toward $\boldsymbol{\phi}_n$.
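The update rule translates almost line for line into code. The following is a minimal sketch, assuming a fixed learning rate, a random visiting order in each epoch, and a small hypothetical data set with features $\boldsymbol{\phi}_n = (1, x_n)$; none of these choices come from the lecture.

```python
import math
import random

def sigmoid(a):
    """Logistic sigmoid, branching on the sign of a to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sgd_logistic(Phi, T, eta=0.1, epochs=200, seed=0):
    """Run the per-sample update w <- w - eta * (y_n - t_n) * phi_n."""
    rng = random.Random(seed)
    w = [0.0] * len(Phi[0])
    order = list(range(len(Phi)))
    for _ in range(epochs):
        rng.shuffle(order)                        # visit points in random order
        for n in order:
            err = sigmoid(dot(w, Phi[n])) - T[n]  # signed error y_n - t_n
            w = [wj - eta * err * pj for wj, pj in zip(w, Phi[n])]
    return w

def total_error(w, Phi, T):
    """E(w) = sum_n E_n(w), using E_n = ln(1 + e^a) - t*a with a = w^T phi."""
    total = 0.0
    for phi, t in zip(Phi, T):
        a = dot(w, phi)
        total += max(a, 0.0) + math.log1p(math.exp(-abs(a))) - t * a
    return total

# Hypothetical toy data: phi_n = (1, x_n), with overlapping classes so that
# a finite minimizer exists.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
T = [0, 0, 1, 0, 1, 1]
Phi = [[1.0, x] for x in X]

w_init = [0.0, 0.0]
w_fit = sgd_logistic(Phi, T)
print(total_error(w_init, Phi, T), total_error(w_fit, Phi, T))
```

The loss is evaluated through the equivalent form $E_n = \ln(1 + e^{a_n}) - t_n a_n$ with $a_n = \mathbf{w}^\top\boldsymbol{\phi}_n$, which avoids taking the logarithm of a value that has rounded to zero.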

Connection to the Perceptron

The perceptron update (Lecture 6.5) is $\mathbf{w} \leftarrow \mathbf{w} + \eta\,t_n\,\boldsymbol{\phi}_n$, with targets coded as $t_n \in \{-1, +1\}$, applied only to misclassified points. The logistic regression update, with $t_n \in \{0, 1\}$, applies to every point, weighted continuously by the prediction error $(y_n - t_n) \in (-1, 1)$. In that sense it is a soft, probabilistic version of the perceptron.
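The contrast can be made concrete by applying one step of each rule to the same point. This is an illustrative sketch: the function names, the convention that a point with $t_n\,\mathbf{w}^\top\boldsymbol{\phi}_n \le 0$ counts as misclassified, and the numbers are all assumptions. Note the different target codings, $\{-1, +1\}$ for the perceptron and $\{0, 1\}$ for logistic regression.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron_step(w, phi, t_pm, eta=1.0):
    """t_pm in {-1, +1}; update only when the point is misclassified."""
    if t_pm * dot(w, phi) <= 0:
        return [wj + eta * t_pm * pj for wj, pj in zip(w, phi)]
    return list(w)

def logistic_step(w, phi, t01, eta=1.0):
    """t01 in {0, 1}; every point moves w by -eta * (y - t) * phi."""
    y = 1.0 / (1.0 + math.exp(-dot(w, phi)))
    return [wj - eta * (y - t01) * pj for wj, pj in zip(w, phi)]

w = [0.2, 0.1]
phi = [1.0, 1.0]  # w^T phi > 0, so the model predicts the positive class

# Positive point, correctly classified: the perceptron leaves w alone,
# while logistic regression still pulls w toward phi.
w_perc_ok = perceptron_step(w, phi, t_pm=+1)
w_log_ok = logistic_step(w, phi, t01=1)

# Negative point, misclassified: both rules push w away from phi, the
# perceptron by a full step and logistic regression by a fraction y_n of it.
w_perc_bad = perceptron_step(w, phi, t_pm=-1)
w_log_bad = logistic_step(w, phi, t01=0)

print(w_perc_ok, w_log_ok)
print(w_perc_bad, w_log_bad)
```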

4. Convergence and the Learning Rate

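Because $E(\mathbf{w})$ is convex, it has no spurious local minima, so SGD cannot get stuck away from the global minimum. Whether it actually converges depends on $\eta$. With a fixed learning rate the iterates generally settle into a noisy neighborhood of the minimum; a decreasing schedule with $\sum_\tau \eta_\tau = \infty$ and $\sum_\tau \eta_\tau^2 < \infty$ (the Robbins-Monro conditions) is the standard requirement for convergence to the minimum itself, assuming a finite minimizer exists (the classes are not linearly separable). In practice: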

Too small $\eta$: convergence is very slow; many iterations needed.
Too large $\eta$: the iterates overshoot, oscillating around the minimum or diverging.
Practical strategy: start with a moderate $\eta$ and reduce it over time (a learning rate schedule), taking large steps early for rapid descent and small steps later for fine convergence.
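The three regimes above can be compared on a toy problem. The sketch below is only illustrative: the data set, the specific rates, and the decaying schedule $\eta_\tau = \eta_0 / (1 + \tau/\tau_0)$ are assumptions chosen for the example, and other schedules are equally reasonable.

```python
import math
import random

def sigmoid(a):
    """Logistic sigmoid, branching on the sign of a to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def total_error(w, Phi, T):
    """E(w) = sum_n [ln(1 + e^a) - t*a], a = w^T phi (stable for large |a|)."""
    total = 0.0
    for phi, t in zip(Phi, T):
        a = dot(w, phi)
        total += max(a, 0.0) + math.log1p(math.exp(-abs(a))) - t * a
    return total

def run_sgd(Phi, T, schedule, epochs=200, seed=0):
    """SGD where schedule(tau) returns the learning rate at step tau."""
    rng = random.Random(seed)
    w = [0.0] * len(Phi[0])
    order = list(range(len(Phi)))
    tau = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            err = sigmoid(dot(w, Phi[n])) - T[n]
            eta = schedule(tau)
            w = [wj - eta * err * pj for wj, pj in zip(w, Phi[n])]
            tau += 1
    return w

# Hypothetical toy data with overlapping classes, phi_n = (1, x_n).
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
T = [0, 0, 1, 0, 1, 1]
Phi = [[1.0, x] for x in X]

schedules = {
    "small": lambda tau: 1e-4,                         # fixed, very small
    "large": lambda tau: 20.0,                         # fixed, very large
    "decay": lambda tau: 0.5 / (1.0 + tau / 100.0),    # eta_0 = 0.5, tau_0 = 100
}
losses = {name: total_error(run_sgd(Phi, T, s), Phi, T)
          for name, s in schedules.items()}
for name, loss in losses.items():
    print(f"{name:6s} final loss = {loss:.4f}")
```

On this data the very small rate barely moves the loss from its starting value of $6\ln 2$ within the step budget, while the decaying schedule ends much closer to the minimum. The very large rate typically ends with a far higher loss, though its exact value depends on the visiting order.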

No Closed Form

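Despite convexity, the cross-entropy loss has no closed-form minimizer: because $y_n = \sigma(\mathbf{w}^\top\boldsymbol{\phi}_n)$ is nonlinear in $\mathbf{w}$, setting the gradient $\sum_n (y_n - t_n)\,\boldsymbol{\phi}_n$ to zero does not give a linear system that can be solved directly. SGD finds the minimizer iteratively instead. Lecture 7.4 presents Newton-Raphson, a second-order method that typically converges in far fewer iterations and does not require a learning rate.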