Lecture 7.4
Logistic Regression: Newton-Raphson
Newton-Raphson is a second-order optimization method that fits a local quadratic approximation to the error surface and jumps directly to its minimum. It typically converges in far fewer iterations than SGD and needs no learning rate, and the Hessian it relies on also yields a proof that the cross-entropy loss is convex.
- Describe the Newton-Raphson idea: quadratic Taylor approximation, minimize the approximation, iterate.
- Derive the update rule $\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \mathbf{H}^{-1}\nabla E$.
- Compute the gradient $\nabla E = \boldsymbol{\Phi}^\top(\mathbf{y} - \mathbf{t})$ and Hessian $\mathbf{H} = \boldsymbol{\Phi}^\top\mathbf{R}\boldsymbol{\Phi}$ in matrix form.
- Prove the cross-entropy loss is convex using the positive-definiteness of $\mathbf{H}$.
- Recognise the update as Iteratively Reweighted Least Squares (IRLS).
- Compare Newton-Raphson and SGD: convergence speed vs. computational cost.
1. The Idea: Quadratic Approximation
SGD uses a linear (first-order) approximation of $E(\mathbf{w})$ at the current point and takes a step of size $\eta$ in the downhill direction. Newton-Raphson instead fits a quadratic (second-order) approximation and jumps to the minimum of that approximation in one step, which takes curvature into account and requires no step-size selection.
Given the current iterate $\mathbf{w}^{(\tau-1)}$, the second-order Taylor expansion of $E$ is minimized by the step
$$\Delta\mathbf{w} = -\mathbf{H}^{-1}\nabla E,$$
so the update is $\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \mathbf{H}^{-1}\nabla E$, where $\nabla E$ is the gradient of $E$ (taken as a column vector) and $\mathbf{H}$ is the Hessian matrix of second derivatives of $E$ with respect to $\mathbf{w}$, both evaluated at $\mathbf{w}^{(\tau-1)}$.
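For reference, the step can be derived in two lines. This is a sketch that assumes $\mathbf{H}$ is invertible, treats $\nabla E$ as a column vector, and evaluates both at $\mathbf{w}^{(\tau-1)}$:

$$
\begin{aligned}
E(\mathbf{w}^{(\tau-1)} + \Delta\mathbf{w}) &\approx E(\mathbf{w}^{(\tau-1)}) + \nabla E^\top \Delta\mathbf{w} + \tfrac{1}{2}\,\Delta\mathbf{w}^\top \mathbf{H}\,\Delta\mathbf{w}, \\
\mathbf{0} &= \nabla E + \mathbf{H}\,\Delta\mathbf{w} \quad \text{(gradient of the right-hand side with respect to } \Delta\mathbf{w}\text{)}, \\
\Delta\mathbf{w} &= -\mathbf{H}^{-1}\nabla E .
\end{aligned}
$$

The stationary point is a minimum of the quadratic model whenever $\mathbf{H}$ is positive definite.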
2. Gradient and Hessian in Matrix Form
Let $\boldsymbol{\Phi}$ be the $N\times M$ design matrix (rows are feature vectors $\boldsymbol{\phi}_n^\top$), $\mathbf{y} = (y_1,\dots,y_N)^\top$ the predictions, and $\mathbf{t} = (t_1,\dots,t_N)^\top$ the targets. Then
$$\nabla E = \boldsymbol{\Phi}^\top(\mathbf{y} - \mathbf{t}), \qquad \mathbf{H} = \boldsymbol{\Phi}^\top\mathbf{R}\,\boldsymbol{\Phi},$$
where $\mathbf{R}$ is the $N\times N$ diagonal matrix with entries $R_{nn} = y_n(1-y_n)$. The gradient is the design-matrix-weighted prediction error (the same as the sum $\sum_n (y_n-t_n)\boldsymbol{\phi}_n$). The Hessian weights each outer product $\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top$ by $y_n(1-y_n)$, the variance of a Bernoulli variable with mean $y_n$.
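The matrix forms can be checked numerically against the per-point sums. The following is a minimal NumPy sketch; the helper names `sigmoid` and `gradient_and_hessian` and the toy data are illustrative assumptions, and the predictions are taken to be $y_n = \sigma(\mathbf{w}^\top\boldsymbol{\phi}_n)$.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def gradient_and_hessian(Phi, t, w):
    """Matrix-form gradient and Hessian of the cross-entropy error."""
    y = sigmoid(Phi @ w)             # predictions, shape (N,)
    r = y * (1.0 - y)                # diagonal of R, shape (N,)
    grad = Phi.T @ (y - t)           # Phi^T (y - t), shape (M,)
    H = Phi.T @ (r[:, None] * Phi)   # Phi^T R Phi, without forming the N x N matrix R
    return grad, H

# Toy data (made up for illustration): N = 5 points, M = 3 features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
t = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
w = rng.normal(size=3)

grad, H = gradient_and_hessian(Phi, t, w)

# The same quantities written as explicit sums over data points.
y = sigmoid(Phi @ w)
grad_sum = sum((y[n] - t[n]) * Phi[n] for n in range(5))
H_sum = sum(y[n] * (1.0 - y[n]) * np.outer(Phi[n], Phi[n]) for n in range(5))

print("gradient matches:", np.allclose(grad, grad_sum))
print("Hessian matches: ", np.allclose(H, H_sum))
```

Scaling the rows of `Phi` by `r` avoids building the dense $N\times N$ matrix $\mathbf{R}$, which matters once $N$ is large.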
3. Convexity: The Hessian is Positive Definite
A twice-differentiable function is convex if and only if its Hessian is positive semidefinite everywhere, and it is strictly convex if its Hessian is positive definite everywhere. For any non-zero $\mathbf{u}$:
$$\mathbf{u}^\top \mathbf{H}\, \mathbf{u} = \mathbf{u}^\top \boldsymbol{\Phi}^\top \mathbf{R}\, \boldsymbol{\Phi}\,\mathbf{u} = \|\mathbf{R}^{1/2}\boldsymbol{\Phi}\,\mathbf{u}\|^2 \geq 0.$$
Since $y_n \in (0,1)$, all diagonal entries $R_{nn} = y_n(1-y_n) > 0$, so $\mathbf{R}^{1/2}$ is well-defined and invertible, and the inequality already shows that the cross-entropy loss is convex. If, in addition, $\boldsymbol{\Phi}$ has full column rank (no feature is a linear combination of the others), then $\boldsymbol{\Phi}\mathbf{u} \neq \mathbf{0}$ for every non-zero $\mathbf{u}$ and therefore $\|\mathbf{R}^{1/2}\boldsymbol{\Phi}\,\mathbf{u}\|^2 > 0$. In that case $\mathbf{H}$ is positive definite and the loss is strictly convex, so a minimizer, when one exists, is the unique global minimum.
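A quick numerical illustration of the argument, as a sketch on made-up random data: it compares the quadratic form with the squared norm, inspects the smallest eigenvalue of $\mathbf{H}$, and shows what happens when a column of the design matrix is duplicated (an assumed example of a degenerate feature set).

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 4
Phi = rng.normal(size=(N, M))      # random design matrix; full column rank with probability 1
w = rng.normal(size=M)             # arbitrary weight vector

y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
r = y * (1.0 - y)                  # diagonal entries R_nn, each in (0, 0.25]
H = Phi.T @ (r[:, None] * Phi)

# Quadratic form versus the squared norm ||R^{1/2} Phi u||^2 for a random direction u.
u = rng.normal(size=M)
quad_form = u @ H @ u
sq_norm = np.sum((np.sqrt(r) * (Phi @ u)) ** 2)

eigvals = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies
print("u^T H u:", quad_form, " ||R^(1/2) Phi u||^2:", sq_norm)
print("smallest eigenvalue of H:", eigvals.min())

# Duplicating a column makes Phi rank deficient; the Hessian is then only
# positive semidefinite (the same positive weights r are reused here).
Phi_bad = np.column_stack([Phi, Phi[:, 0]])
H_bad = Phi_bad.T @ (r[:, None] * Phi_bad)
min_eig_bad = np.linalg.eigvalsh(H_bad).min()
print("smallest eigenvalue with duplicated column:", min_eig_bad)
```

With the duplicated column the smallest eigenvalue is zero up to rounding, which is why the full-column-rank condition is needed for strict convexity.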
4. Iteratively Reweighted Least Squares (IRLS)
Substituting the gradient and Hessian, the Newton-Raphson update can be rewritten as:
$$\mathbf{w}^{(\tau)} = \bigl(\boldsymbol{\Phi}^\top\mathbf{R}\boldsymbol{\Phi}\bigr)^{-1}\boldsymbol{\Phi}^\top\mathbf{R}\,\mathbf{z},$$
where $\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\tau-1)} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$ acts as a "working response." This is precisely the solution to a weighted least-squares problem with weight matrix $\mathbf{R}$. Because $\mathbf{R}$ depends on $\mathbf{y}$, which depends on $\mathbf{w}^{(\tau-1)}$, the weights are recomputed at each iteration, hence the name Iteratively Reweighted Least Squares (IRLS).
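The equivalence can be checked numerically. In the sketch below, `newton_step` and `irls_step` are hypothetical helper names and the data are random; the working response is taken as $\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\tau-1)} - \mathbf{R}^{-1}(\mathbf{y}-\mathbf{t})$, the form for which the weighted least-squares solution coincides with the Newton-Raphson step.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(Phi, t, w):
    """One Newton-Raphson update: w - H^{-1} grad."""
    y = sigmoid(Phi @ w)
    r = y * (1.0 - y)
    grad = Phi.T @ (y - t)
    H = Phi.T @ (r[:, None] * Phi)
    return w - np.linalg.solve(H, grad)   # solve H x = grad rather than inverting H

def irls_step(Phi, t, w):
    """The same update as a weighted least-squares solve with working response z."""
    y = sigmoid(Phi @ w)
    r = y * (1.0 - y)
    z = Phi @ w - (y - t) / r             # z = Phi w - R^{-1} (y - t)
    A = Phi.T @ (r[:, None] * Phi)        # Phi^T R Phi
    b = Phi.T @ (r * z)                   # Phi^T R z
    return np.linalg.solve(A, b)

# Made-up data: a bias column plus two random features, random 0/1 targets.
rng = np.random.default_rng(2)
Phi = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
t = (rng.random(40) < 0.5).astype(float)
w0 = np.zeros(3)

w_newton = newton_step(Phi, t, w0)
w_irls = irls_step(Phi, t, w0)
print("one step agrees:", np.allclose(w_newton, w_irls))

# A second step from the updated weights, where R is no longer a multiple of I.
w_newton2 = newton_step(Phi, t, w_newton)
w_irls2 = irls_step(Phi, t, w_irls)
print("two steps agree:", np.allclose(w_newton2, w_irls2))
```

Dividing by `r` is safe here because every $R_{nn}$ is strictly positive, although it can become numerically tiny when predictions saturate near 0 or 1.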
5. Newton-Raphson vs. SGD
| Property | SGD | Newton-Raphson |
|---|---|---|
| Step size | Requires tuning $\eta$ | No step size needed |
| Iterations to converge | Many (noisy gradients) | Few (quadratic convergence near the minimum) |
| Cost per iteration | Cheap ($O(M)$ per sample) | Expensive ($O(NM^2)$ to form $\mathbf{H}$, $O(M^3)$ for $\mathbf{H}^{-1}$) |
| Applicable beyond logistic? | Yes (general) | Yes (any twice-differentiable loss) |
Forming $\mathbf{H}$ costs $O(NM^2)$, and computing $\mathbf{H}^{-1}$ (or solving the equivalent linear system) scales as $O(M^3)$, which becomes prohibitive for large $M$. In practice, quasi-Newton methods (e.g., L-BFGS) approximate the Hessian inverse efficiently from successive gradients. For large-scale deep learning, SGD variants (Adam, RMSProp) dominate because of their cheap per-step cost.
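The iteration-count trade-off can be seen on a small synthetic problem. The sketch below makes several assumptions: it uses deterministic full-batch gradient descent as a stand-in for SGD, a fixed step size `eta = 1.0` on the mean loss, and made-up data generated from hypothetical weights `w_true`. The names `fit_newton` and `fit_gd` are illustrative, and both routines stop when the gradient norm falls below `tol`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_newton(Phi, t, tol=1e-8, max_iter=50):
    """Pure Newton-Raphson (no line search); returns weights and iterations used."""
    w = np.zeros(Phi.shape[1])
    for it in range(max_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)
        if np.linalg.norm(grad) < tol:
            return w, it
        r = y * (1.0 - y)
        H = Phi.T @ (r[:, None] * Phi)
        w = w - np.linalg.solve(H, grad)
    return w, max_iter

def fit_gd(Phi, t, eta=1.0, tol=1e-8, max_iter=100_000):
    """Full-batch gradient descent on the mean loss with a fixed step size eta."""
    N = Phi.shape[0]
    w = np.zeros(Phi.shape[1])
    for it in range(max_iter):
        grad = Phi.T @ (sigmoid(Phi @ w) - t)
        if np.linalg.norm(grad) < tol:
            return w, it
        w = w - eta * grad / N
    return w, max_iter

# Synthetic, non-separable data: bias + 2 features, labels sampled from the model.
rng = np.random.default_rng(3)
N = 200
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
w_true = np.array([0.5, 1.0, -1.0])       # hypothetical generating weights
t = (rng.random(N) < sigmoid(Phi @ w_true)).astype(float)

w_newton, n_newton = fit_newton(Phi, t)
w_gd, n_gd = fit_gd(Phi, t)

print("Newton iterations:          ", n_newton)
print("gradient-descent iterations:", n_gd)
print("same solution:", np.allclose(w_newton, w_gd, atol=1e-5))
```

Each Newton iteration is more expensive than a gradient step, so the comparison that matters in practice is total cost rather than iteration count. With $M = 3$ the linear solve is negligible, which is the regime where Newton-Raphson is attractive.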