Lecture 7.4

Logistic Regression: Newton-Raphson

Newton-Raphson is a second-order optimization method that fits a local quadratic approximation to the error surface and jumps directly to its minimum. It typically converges in far fewer iterations than SGD and needs no learning rate. The Hessian it requires also yields a proof that the cross-entropy loss is convex.

Learning Objectives

Describe the Newton-Raphson idea: quadratic Taylor approximation, minimize the approximation, iterate.
Derive the update rule $\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \mathbf{H}^{-1}\nabla E$.
Compute the gradient $\nabla E = \boldsymbol{\Phi}^\top(\mathbf{y} - \mathbf{t})$ and Hessian $\mathbf{H} = \boldsymbol{\Phi}^\top\mathbf{R}\boldsymbol{\Phi}$ in matrix form.
Prove the cross-entropy loss is convex using the positive-definiteness of $\mathbf{H}$.
Recognise the update as Iteratively Reweighted Least Squares (IRLS).
Compare Newton-Raphson and SGD: convergence speed vs. computational cost.

1. The Idea: Quadratic Approximation

SGD uses a linear (first-order) approximation of $E(\mathbf{w})$ at the current point, then steps down it at a fixed rate $\eta$. Newton-Raphson instead fits a quadratic (second-order) approximation and jumps to the minimum of that approximation in one step — taking curvature into account and requiring no step-size selection.

Newton-Raphson Update Rule

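Given the current iterate $\mathbf{w}^{(\tau-1)}$, expand $E$ to second order around it:

$$E(\mathbf{w}^{(\tau-1)} + \Delta\mathbf{w}) \approx E(\mathbf{w}^{(\tau-1)}) + \Delta\mathbf{w}^\top\nabla E + \tfrac{1}{2}\,\Delta\mathbf{w}^\top\mathbf{H}\,\Delta\mathbf{w}.$$

Setting the derivative with respect to $\Delta\mathbf{w}$ to zero gives $\nabla E + \mathbf{H}\,\Delta\mathbf{w} = \mathbf{0}$, so when $\mathbf{H}$ is positive definite the approximation is minimized by the step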

$$\Delta\mathbf{w} = -\mathbf{H}^{-1}\nabla E,$$

so the update is $\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \mathbf{H}^{-1}\nabla E$, where $\nabla E$ is the gradient (a column vector with one entry per weight) and $\mathbf{H}$ is the Hessian matrix of second derivatives of $E$ with respect to $\mathbf{w}$, both evaluated at $\mathbf{w}^{(\tau-1)}$.
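To make the "jump to the minimum" concrete, here is a minimal sketch of a single Newton step applied to an exactly quadratic error function, where the quadratic approximation is exact and one step suffices. The function name `newton_step` and the values of `A`, `b`, and `w0` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def newton_step(w, grad, hess):
    """Return w - H^{-1} grad, solving H x = grad rather than forming H^{-1}."""
    return w - np.linalg.solve(hess, grad)

# Illustrative quadratic E(w) = 0.5 * w^T A w - b^T w (A and b are made-up values).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # symmetric positive definite, so E has a unique minimum
b = np.array([1.0, -1.0])

w0 = np.array([10.0, -7.0])      # arbitrary starting point
grad0 = A @ w0 - b               # gradient of E at w0
w1 = newton_step(w0, grad0, A)   # the Hessian of this E is A everywhere

w_star = np.linalg.solve(A, b)   # exact minimizer, where A w = b
print(np.allclose(w1, w_star))
```

For a non-quadratic error such as cross-entropy the approximation is only local, which is why the step has to be iterated.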

2. Gradient and Hessian in Matrix Form

Let $\boldsymbol{\Phi}$ be the $N\times M$ design matrix (rows are feature vectors $\boldsymbol{\phi}_n^\top$), $\mathbf{y} = (y_1,\dots,y_N)^\top$ the predictions, and $\mathbf{t} = (t_1,\dots,t_N)^\top$ the targets.

Gradient and Hessian

$$\nabla E = \boldsymbol{\Phi}^\top(\mathbf{y} - \mathbf{t}), \qquad \mathbf{H} = \boldsymbol{\Phi}^\top \mathbf{R}\, \boldsymbol{\Phi},$$

where $\mathbf{R}$ is the $N\times N$ diagonal matrix with entries $R_{nn} = y_n(1-y_n)$. The gradient is the design-matrix-weighted prediction error (same as summing $(y_n-t_n)\boldsymbol{\phi}_n$). The Hessian weights each outer product $\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top$ by $y_n(1-y_n)$, the variance of the Bernoulli prediction at point $n$.
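The matrix expressions can be checked against the per-sample sums they abbreviate. The sketch below assumes NumPy and uses small random data; the helper names `sigmoid` and `gradient_and_hessian` and all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_and_hessian(Phi, t, w):
    """Matrix-form gradient Phi^T (y - t) and Hessian Phi^T R Phi."""
    y = sigmoid(Phi @ w)               # predictions y_n
    r = y * (1.0 - y)                  # diagonal of R, stored as a vector
    grad = Phi.T @ (y - t)
    hess = Phi.T @ (r[:, None] * Phi)  # scales row n of Phi by R_nn, avoiding an N x N matrix
    return grad, hess

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))          # N = 6 samples, M = 3 features (made-up data)
t = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
w = np.array([0.5, -0.25, 0.1])

grad, hess = gradient_and_hessian(Phi, t, w)

# Cross-check against the explicit sums over data points.
y = sigmoid(Phi @ w)
grad_sum = sum((y[n] - t[n]) * Phi[n] for n in range(len(t)))
hess_sum = sum(y[n] * (1.0 - y[n]) * np.outer(Phi[n], Phi[n]) for n in range(len(t)))
print(np.allclose(grad, grad_sum), np.allclose(hess, hess_sum))
```

Storing only the diagonal of $\mathbf{R}$ is a design choice for this sketch: the full $N\times N$ matrix is never needed.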

3. Convexity: The Hessian is Positive Definite

A twice-differentiable function is convex if and only if its Hessian is positive semidefinite everywhere, and it is strictly convex if its Hessian is positive definite everywhere. For any non-zero $\mathbf{u}$:

$$\mathbf{u}^\top \mathbf{H}\, \mathbf{u} = \mathbf{u}^\top \boldsymbol{\Phi}^\top \mathbf{R}\, \boldsymbol{\Phi}\,\mathbf{u} = \|\mathbf{R}^{1/2}\boldsymbol{\Phi}\,\mathbf{u}\|^2 \geq 0.$$

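Since $y_n \in (0,1)$, every diagonal entry $R_{nn} = y_n(1-y_n)$ is strictly positive, so $\mathbf{R}^{1/2}$ is well-defined and invertible. The norm above can therefore vanish only if $\boldsymbol{\Phi}\mathbf{u} = \mathbf{0}$, which is impossible for non-zero $\mathbf{u}$ when $\boldsymbol{\Phi}$ has full column rank (the features are not linearly dependent). Under that assumption $\mathbf{u}^\top\mathbf{H}\,\mathbf{u} > 0$, so $\mathbf{H}$ is positive definite and the cross-entropy loss is strictly convex: any minimum is the unique global minimum. (A finite minimizer exists when the classes are not linearly separable.)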
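A numerical sanity check of this argument, as a sketch on made-up data: the smallest eigenvalue of $\mathbf{H}$ stays positive at several random weight vectors when $\boldsymbol{\Phi}$ has full column rank, and drops to (numerically) zero once a column is duplicated. The helper `hessian` and the data are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hessian(Phi, w):
    """H = Phi^T R Phi with R_nn = y_n (1 - y_n)."""
    y = sigmoid(Phi @ w)
    r = y * (1.0 - y)
    return (Phi.T * r) @ Phi           # Phi.T * r scales column n of Phi^T by R_nn

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 3))         # random features: full column rank in practice

# Smallest eigenvalue of H at a few random weight vectors (eigvalsh assumes symmetry).
eig_full = [np.linalg.eigvalsh(hessian(Phi, rng.normal(size=3))).min()
            for _ in range(5)]

# Degenerate case: duplicating a column makes Phi rank-deficient,
# so H is only positive semidefinite.
Phi_dup = np.hstack([Phi, Phi[:, :1]])
eig_dup = np.linalg.eigvalsh(hessian(Phi_dup, rng.normal(size=4))).min()

print(min(eig_full) > 0, abs(eig_dup) < 1e-8)
```

This checks the claim at sample points only; the proof above is what covers every $\mathbf{w}$.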

4. Iteratively Reweighted Least Squares (IRLS)

Substituting the gradient and Hessian, the Newton-Raphson update can be rewritten as:

$$\mathbf{w}^{(\tau)} = \bigl(\boldsymbol{\Phi}^\top\mathbf{R}\boldsymbol{\Phi}\bigr)^{-1}\boldsymbol{\Phi}^\top\mathbf{R}\,\mathbf{z},$$

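where $\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\tau-1)} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$ acts as a "working response." To see this, write the update as $\mathbf{w}^{(\tau)} = (\boldsymbol{\Phi}^\top\mathbf{R}\boldsymbol{\Phi})^{-1}\bigl[\boldsymbol{\Phi}^\top\mathbf{R}\boldsymbol{\Phi}\,\mathbf{w}^{(\tau-1)} - \boldsymbol{\Phi}^\top(\mathbf{y} - \mathbf{t})\bigr]$ and factor $\boldsymbol{\Phi}^\top\mathbf{R}$ out of the bracket. The result is precisely the solution to a weighted least-squares problem with weight matrix $\mathbf{R}$. Because $\mathbf{R}$ depends on $\mathbf{y}$, which depends on $\mathbf{w}^{(\tau-1)}$, the weights are recomputed at each iteration, hence the name Iteratively Reweighted Least Squares (IRLS).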
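The equivalence between the Newton step and the weighted least-squares form can be sketched in code. The working response is taken as $\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\tau-1)} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$, which is what substituting the gradient and Hessian into the update yields. The synthetic data, `w_true`, and the function names are assumptions for illustration, and the labels are sampled with noise so that the classes are not linearly separable.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_step(Phi, t, w):
    """One IRLS update: solve (Phi^T R Phi) w_new = Phi^T R z."""
    a = Phi @ w
    y = sigmoid(a)
    r = y * (1.0 - y)                  # diagonal of R
    z = a - (y - t) / r                # working response
    PhiT_R = Phi.T * r                 # Phi^T R, shape (M, N)
    return np.linalg.solve(PhiT_R @ Phi, PhiT_R @ z)

def newton_step(Phi, t, w):
    """The same update written as w - H^{-1} grad."""
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t)
    hess = (Phi.T * (y * (1.0 - y))) @ Phi
    return w - np.linalg.solve(hess, grad)

# Synthetic, noisy data (hypothetical): a bias column plus two Gaussian features.
rng = np.random.default_rng(0)
N = 200
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
w_true = np.array([0.3, -1.0, 1.5])
t = (rng.random(N) < sigmoid(Phi @ w_true)).astype(float)

w = np.zeros(3)
same_step = np.allclose(irls_step(Phi, t, w), newton_step(Phi, t, w))

for _ in range(10):                    # a handful of iterations is typically enough
    w = irls_step(Phi, t, w)

grad_norm = np.linalg.norm(Phi.T @ (sigmoid(Phi @ w) - t))
print(same_step, grad_norm)
```

Dividing by `r` is safe here because the predictions stay away from 0 and 1. A more defensive implementation would clip `r` from below.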

5. Newton-Raphson vs. SGD

Property	SGD	Newton-Raphson
Step size	Requires tuning $\eta$	No step size needed
Iterations to converge	Many (noisy gradients)	Few (quadratic convergence near the minimum)
Cost per iteration	Cheap ($O(M)$ per sample)	Expensive ($O(NM^2)$ to form $\mathbf{H}$, $O(M^3)$ to solve with it)
Applicable beyond logistic?	Yes (general)	Yes (any twice-differentiable loss)
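The iteration-count row of the table can be illustrated with a small experiment. This sketch makes several assumptions: full-batch gradient descent is used as a deterministic stand-in for SGD, the loss is averaged over the data so that a step size of `eta = 1.0` is stable, and the data, tolerance, and function names are all hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_gradient(Phi, t, w):
    return Phi.T @ (sigmoid(Phi @ w) - t) / len(t)

def mean_hessian(Phi, w):
    y = sigmoid(Phi @ w)
    return (Phi.T * (y * (1.0 - y))) @ Phi / len(y)

def count_iterations(step, w0, Phi, t, tol=1e-6, max_iter=100_000):
    """Apply `step` until the gradient norm drops below `tol`; return the step count."""
    w = w0.copy()
    for k in range(max_iter):
        if np.linalg.norm(mean_gradient(Phi, t, w)) < tol:
            return k
        w = step(w)
    return max_iter

# Synthetic, noisy data (hypothetical): a bias column plus two Gaussian features.
rng = np.random.default_rng(0)
N = 200
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
w_true = np.array([0.3, -1.0, 1.5])
t = (rng.random(N) < sigmoid(Phi @ w_true)).astype(float)

eta = 1.0  # assumed step size for gradient descent; Newton-Raphson needs none

def gd_step(w):
    return w - eta * mean_gradient(Phi, t, w)

def newton_step(w):
    return w - np.linalg.solve(mean_hessian(Phi, w), mean_gradient(Phi, t, w))

w0 = np.zeros(3)
gd_iters = count_iterations(gd_step, w0, Phi, t)
newton_iters = count_iterations(newton_step, w0, Phi, t)
print("gradient descent:", gd_iters, "Newton-Raphson:", newton_iters)
```

Iteration counts alone do not settle the comparison: each Newton step here forms and solves with an $M\times M$ matrix, which is the cost the table's last rows refer to.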

Practical Note

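Inverting $\mathbf{H}$ (or, preferably, solving the linear system $\mathbf{H}\,\Delta\mathbf{w} = -\nabla E$) scales as $O(M^3)$, and storing $\mathbf{H}$ takes $O(M^2)$ memory, which becomes prohibitive for large $M$. In practice, quasi-Newton methods (e.g., L-BFGS) build an approximation to the inverse Hessian from gradient information alone. For large-scale deep learning, SGD variants (Adam, RMSProp) dominate because of their cheap per-step cost.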