Lecture 9.3
Intermezzo: Lagrange Multipliers
Many ML derivations — including PCA, GMMs, and SVMs — require optimizing a function subject to a constraint. The method of Lagrange multipliers converts such problems into unconstrained stationarity conditions.
- State the constrained optimization problem: maximize $f(\mathbf{x})$ subject to $g(\mathbf{x}) = c$.
- Explain geometrically why $\nabla f$ and $\nabla g$ must be parallel at a constrained optimum.
- Define the Lagrangian $\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\,(g(\mathbf{x}) - c)$ and show its stationary points solve the constrained problem.
- Apply the method to a concrete example.
1. The Setting
We want to find
$$\mathbf{x}^* = \arg\max_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) = c.$$
The constraint $g(\mathbf{x}) = c$ defines a level set (a curve in 2D, a surface in 3D). Our solution must lie on this surface.
2. Geometric Intuition
A key property of level sets: the gradient $\nabla g(\mathbf{x})$ is perpendicular to the level set at every point. The argument for Lagrange multipliers rests on the following observation:
At a constrained maximum $\mathbf{x}^*$, the gradient $\nabla f(\mathbf{x}^*)$ must also be perpendicular to the constraint surface. If it had a component along the surface, we could move along the constraint and increase $f$ — contradicting optimality.
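The contradiction argument can be checked numerically. The following is a minimal sketch, assuming the quadratic objective and line constraint from the worked example in Section 4; the function names and the test point $(0.9, 0.1)$ are illustrative choices, not part of the lecture.

```python
import math

def f(x1, x2):
    # Objective from the worked example: f = -x1^2 - x2^2
    return -x1**2 - x2**2

def grad_f(x1, x2):
    # Gradient of f
    return (-2.0 * x1, -2.0 * x2)

# Unit tangent of the line x1 + x2 = 1 (perpendicular to grad g = (1, 1))
TANGENT = (1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0))

def tangential_component(x1, x2):
    """Component of grad f along the constraint line at (x1, x2)."""
    gx, gy = grad_f(x1, x2)
    return gx * TANGENT[0] + gy * TANGENT[1]

def step_along_constraint(x1, x2, eps):
    """Move a signed distance eps along the line; the point stays feasible."""
    return (x1 + eps * TANGENT[0], x2 + eps * TANGENT[1])

# A feasible but non-optimal point (illustrative): grad f has a tangential part
p = (0.9, 0.1)
t_p = tangential_component(*p)

# Take a small step in the direction given by the sign of that component
q = step_along_constraint(*p, eps=math.copysign(0.05, t_p))
print("tangential component at p:", t_p)
print("f(p) =", f(*p), " f(q) =", f(*q))  # f(q) is larger than f(p)

# At the constrained maximum the tangential component vanishes
t_opt = tangential_component(0.5, 0.5)
print("tangential component at (0.5, 0.5):", t_opt)
```

Wherever the tangential component is nonzero, a small feasible step increases $f$, so such a point cannot be a constrained maximum.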
Since both $\nabla f$ and $\nabla g$ are perpendicular to the constraint surface at $\mathbf{x}^*$, they must be parallel (or anti-parallel). Hence, provided $\nabla g(\mathbf{x}^*) \neq \mathbf{0}$, there exists a scalar $\lambda$ (the Lagrange multiplier) such that
$$\nabla f(\mathbf{x}^*) + \lambda\,\nabla g(\mathbf{x}^*) = \mathbf{0}.$$
3. The Lagrangian
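Define the Lagrangian $\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\,(g(\mathbf{x}) - c)$. Its stationary points, where $\nabla_{\mathbf{x}} \mathcal{L} = \mathbf{0}$ and $\partial \mathcal{L}/\partial \lambda = 0$, are the candidates for constrained optima, because they satisfy both required conditions: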
- $\nabla_{\mathbf{x}} \mathcal{L} = \nabla f + \lambda\,\nabla g = \mathbf{0}$ recovers the Lagrange condition.
- $\partial \mathcal{L}/\partial \lambda = g(\mathbf{x}) - c = 0$ enforces the constraint.
In practice, the method reduces to four steps:
- Identify $f(\mathbf{x})$ (objective) and write the constraint as $g(\mathbf{x}) = c$.
- Form $\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda(g(\mathbf{x}) - c)$.
- Set $\nabla_{\mathbf{x}} \mathcal{L} = \mathbf{0}$ and $\partial \mathcal{L}/\partial\lambda = 0$.
- Solve the resulting system for $\mathbf{x}^*$ and $\lambda^*$.
Note: $\lambda$ itself is an auxiliary variable — only $\mathbf{x}^*$ is the final answer.
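The steps above can be sketched in code. This example assumes a hypothetical problem that is not from the lecture: maximize $f = x_1 x_2$ subject to $x_1 + x_2 = 10$. For this problem the stationarity conditions happen to be linear in $(x_1, x_2, \lambda)$, so a small Gaussian elimination routine (the helper `solve_linear` is written here for illustration) is enough; in general the system is nonlinear and needs other tools.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b] as floats
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the row with the largest pivot to avoid dividing by zero
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Steps 1-2 (hypothetical problem): f = x1*x2, g = x1 + x2, c = 10
#   L = x1*x2 + lam*(x1 + x2 - 10)
# Step 3: stationarity conditions
#   dL/dx1  = x2 + lam      = 0
#   dL/dx2  = x1 + lam      = 0
#   dL/dlam = x1 + x2 - 10  = 0
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
b = [0, 0, 10]

# Step 4: solve for (x1, x2, lam)
x1, x2, lam = solve_linear(A, b)
print(x1, x2, lam)  # x1 = x2 = 5, lam = -5
```

Stationarity alone does not say whether the point is a maximum; here substituting $x_2 = 10 - x_1$ gives $x_1(10 - x_1)$, which is concave, so the stationary point is indeed the constrained maximum.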
4. Worked Example
Problem. Maximize $f(x_1, x_2) = -x_1^2 - x_2^2$ subject to $g(x_1, x_2) = x_1 + x_2 = 1$.
Lagrangian. $\mathcal{L} = -x_1^2 - x_2^2 + \lambda(x_1 + x_2 - 1)$.
Stationarity conditions:
- $\partial\mathcal{L}/\partial x_1 = -2x_1 + \lambda = 0 \implies x_1 = \lambda/2$.
- $\partial\mathcal{L}/\partial x_2 = -2x_2 + \lambda = 0 \implies x_2 = \lambda/2$.
- $\partial\mathcal{L}/\partial\lambda = x_1 + x_2 - 1 = 0 \implies \lambda/2 + \lambda/2 = 1 \implies \lambda = 1$.
Solution. $x_1^* = x_2^* = \tfrac{1}{2}$, $\lambda^* = 1$. The constrained maximum is $f(1/2, 1/2) = -1/2$.
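Intuition: without the constraint, $f$ is maximized at the origin, but we must stay on the line $x_1 + x_2 = 1$. Since $f(\mathbf{x}) = -\|\mathbf{x}\|^2$, maximizing $f$ on the line means finding the point of the line closest to the origin. That point is $(\tfrac{1}{2}, \tfrac{1}{2})$, exactly our answer.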
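The result and the intuition can both be checked with a few lines of code. The sketch below eliminates the constraint by setting $x_2 = 1 - x_1$ and scans $x_1$ on a grid, then compares the best point with the orthogonal projection of the origin onto the line. The grid range and resolution, and the helper names, are arbitrary illustrative choices.

```python
def f(x1, x2):
    # Objective from the worked example
    return -x1**2 - x2**2

def best_on_line(num=2001, lo=-1.0, hi=2.0):
    """Scan x1 over [lo, hi], set x2 = 1 - x1, and return the best point."""
    best = None
    for i in range(num):
        x1 = lo + (hi - lo) * i / (num - 1)
        x2 = 1.0 - x1  # enforce the constraint x1 + x2 = 1
        val = f(x1, x2)
        if best is None or val > best[2]:
            best = (x1, x2, val)
    return best

def project_origin_onto_line():
    """Orthogonal projection of the origin onto the line x1 + x2 = 1."""
    # The line has normal n = (1, 1); the projection is n * c / (n . n), c = 1
    n = (1.0, 1.0)
    scale = 1.0 / (n[0] * n[0] + n[1] * n[1])
    return (n[0] * scale, n[1] * scale)

x1, x2, val = best_on_line()
proj = project_origin_onto_line()
lam = 2.0 * x1  # from the stationarity condition -2*x1 + lam = 0

print("grid maximizer:", (x1, x2), "f =", val)
print("projection of origin:", proj)
print("multiplier:", lam)
```

The grid maximizer, the projection, and the hand-derived solution $(\tfrac{1}{2}, \tfrac{1}{2})$ with $\lambda^* = 1$ should agree up to grid resolution.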
5. Role in Machine Learning
Lagrange multipliers appear throughout this course:
- PCA (Lecture 10.1): maximize projected variance $\mathbf{u}^\top \mathbf{S}\mathbf{u}$ subject to $\|\mathbf{u}\| = 1$ → eigenvalue problem $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$.
- Gaussian mixture models (Lecture 9.4): maximize log-likelihood subject to $\sum_k \pi_k = 1$ → MLE update for mixing coefficients $\pi_k$.
- SVMs (Lectures 11.3–11.4): maximize margin subject to classification constraints → dual Lagrangian with inequality constraints (KKT conditions).
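As a small illustration of the PCA item, the sketch below checks numerically that the unit vector maximizing $\mathbf{u}^\top \mathbf{S}\mathbf{u}$ satisfies $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$. The $2 \times 2$ matrix `S` is a made-up covariance matrix, and power iteration is used only because it is short; it is not necessarily the method used in Lecture 10.1. Depending on the sign convention chosen for the multiplier, the eigenvalue appears as $\lambda$ or $-\lambda$; the code simply reports the eigenvalue.

```python
import math

def mat_vec(S, u):
    """Matrix-vector product S u for a list-of-lists matrix."""
    return [sum(S[i][j] * u[j] for j in range(len(u))) for i in range(len(S))]

def normalize(u):
    """Rescale u to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in u))
    return [c / norm for c in u]

def projected_variance(S, u):
    """The objective u^T S u."""
    return sum(a * b for a, b in zip(u, mat_vec(S, u)))

def power_iteration(S, iters=200):
    """Approximate the top eigenvector of a symmetric matrix S."""
    u = normalize([1.0] + [0.0] * (len(S) - 1))
    for _ in range(iters):
        u = normalize(mat_vec(S, u))
    return u, projected_variance(S, u)

# Hypothetical 2x2 covariance matrix (symmetric, positive definite)
S = [[3.0, 1.0],
     [1.0, 2.0]]

u, lam = power_iteration(S)

# Residual of the eigenvalue equation S u = lam u
Su = mat_vec(S, u)
residual = max(abs(Su[i] - lam * u[i]) for i in range(len(u)))

# Brute-force check: scan unit vectors (cos t, sin t) over half a turn
best_scan = max(
    projected_variance(S, [math.cos(k * math.pi / 1800), math.sin(k * math.pi / 1800)])
    for k in range(1800)
)

print("u =", u, "lambda =", lam)
print("residual of S u = lambda u:", residual)
print("best u^T S u found by scanning:", best_scan)
```

No scanned unit vector should exceed the value of $\mathbf{u}^\top \mathbf{S}\mathbf{u}$ at the eigenvector, which is the largest eigenvalue of `S`.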