Lecture 9.3

Intermezzo: Lagrange Multipliers

Many ML derivations — including PCA, GMMs, and SVMs — require optimizing a function subject to a constraint. The method of Lagrange multipliers converts such problems into unconstrained stationarity conditions.

Learning Objectives

State the constrained optimization problem: maximize $f(\mathbf{x})$ subject to $g(\mathbf{x}) = c$.
Explain geometrically why $\nabla f$ and $\nabla g$ must be parallel at a constrained optimum.
Define the Lagrangian $\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\,(g(\mathbf{x}) - c)$ and show that its stationary points are the candidates for solutions of the constrained problem.
Apply the method to a concrete example.

1. The Setting

We want to find

$$\mathbf{x}^* = \arg\max_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) = c.$$

The constraint $g(\mathbf{x}) = c$ defines a level set of $g$ (a curve in 2D, a surface in 3D), referred to below as the constraint surface. Every feasible point, and in particular our solution, must lie on it.

2. Geometric Intuition

A key property of level sets: the gradient $\nabla g(\mathbf{x})$ is perpendicular to the level set at every point. The argument for Lagrange multipliers rests on the following observation:

Lagrange Condition

At a constrained maximum $\mathbf{x}^*$, the gradient $\nabla f(\mathbf{x}^*)$ must also be perpendicular to the constraint surface. If it had a component along the surface, we could move along the constraint and increase $f$ — contradicting optimality.

At $\mathbf{x}^*$ the constraint surface has a single normal direction, so provided $\nabla g(\mathbf{x}^*) \neq \mathbf{0}$, both gradients lie along it and must be parallel (or anti-parallel). Hence there exists a scalar $\lambda$ (the Lagrange multiplier), which may be positive, negative, or zero, such that

$$\nabla f(\mathbf{x}^*) + \lambda\,\nabla g(\mathbf{x}^*) = \mathbf{0}.$$
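As a small numerical illustration of this condition, the sketch below uses a hypothetical problem that is not taken from the lecture: $f(\mathbf{x}) = x_1 + x_2$ on the unit circle $g(\mathbf{x}) = x_1^2 + x_2^2 = 1$. The helper name `tangential_component` is an assumption made for this example. It projects $\nabla f$ onto the tangent of the circle: the projection should vanish at the constrained maximum $(1/\sqrt{2}, 1/\sqrt{2})$ and be nonzero at another feasible point such as $(1, 0)$, where moving along the circle still increases $f$.

```python
import numpy as np

def grad_f(x):
    # Hypothetical objective f(x) = x1 + x2, so the gradient is constant.
    return np.array([1.0, 1.0])

def grad_g(x):
    # Hypothetical constraint function g(x) = x1^2 + x2^2.
    return 2.0 * x

def tangential_component(x):
    """Part of grad f that lies along the constraint curve at x."""
    normal = grad_g(x) / np.linalg.norm(grad_g(x))
    gf = grad_f(x)
    return gf - (gf @ normal) * normal

x_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # constrained maximum
x_other = np.array([1.0, 0.0])                 # feasible, but not optimal

tan_star = tangential_component(x_star)
tan_other = tangential_component(x_other)

# Multiplier that satisfies grad f + lam * grad g = 0 at x_star.
lam = -(grad_f(x_star) @ grad_g(x_star)) / (grad_g(x_star) @ grad_g(x_star))
residual = grad_f(x_star) + lam * grad_g(x_star)

print("tangential part at x_star :", tan_star)
print("tangential part at x_other:", tan_other)
print("lambda:", lam, "residual:", residual)
```

Note that $\lambda$ comes out negative here: with the sign convention $\nabla f + \lambda \nabla g = \mathbf{0}$, gradients pointing the same way give a negative multiplier.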

3. The Lagrangian

Lagrangian Function

$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\,(g(\mathbf{x}) - c).$$

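The stationary points of $\mathcal{L}$, where $\nabla_{\mathbf{x}} \mathcal{L} = \mathbf{0}$ and $\partial \mathcal{L}/\partial \lambda = 0$, are the candidates for constrained optima. These conditions are necessary but not sufficient: a stationary point may be a constrained maximum, a minimum, or neither, so each candidate still has to be checked. The two conditions read: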

$\nabla_{\mathbf{x}} \mathcal{L} = \nabla f + \lambda\,\nabla g = \mathbf{0}$ recovers the Lagrange condition.
$\partial \mathcal{L}/\partial \lambda = g(\mathbf{x}) - c = 0$ enforces the constraint.
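The stationarity conditions do not by themselves distinguish a maximum from a minimum. The sketch below illustrates this on a hypothetical problem (not from the lecture): $f(\mathbf{x}) = x_1 + x_2$ with $g(\mathbf{x}) = x_1^2 + x_2^2 = 1$. Solving the conditions by hand gives two candidates, $(x_1, x_2, \lambda) = (r, r, -r)$ and $(-r, -r, r)$ with $r = 1/\sqrt{2}$. The code checks with a finite-difference gradient that both are stationary points of $\mathcal{L}$ and then compares their objective values. The names `lagrangian` and `numerical_gradient` are assumptions made for this example.

```python
import numpy as np

def lagrangian(z, c=1.0):
    # z packs (x1, x2, lam); L = f + lam * (g - c) for the hypothetical problem.
    x1, x2, lam = z
    return (x1 + x2) + lam * (x1**2 + x2**2 - c)

def numerical_gradient(fun, z, h=1e-6):
    """Central finite-difference gradient of fun at z."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        grad[i] = (fun(z + step) - fun(z - step)) / (2.0 * h)
    return grad

r = 1.0 / np.sqrt(2.0)
candidates = {
    "maximum": np.array([r, r, -r]),
    "minimum": np.array([-r, -r, r]),
}

for name, z in candidates.items():
    grad = numerical_gradient(lagrangian, z)
    f_value = z[0] + z[1]
    print(name, "| max |grad L| =", np.abs(grad).max(), "| f =", f_value)
```

Both candidates make the gradient of $\mathcal{L}$ vanish up to finite-difference error; only a comparison of $f$ tells them apart.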

Recipe

Identify $f(\mathbf{x})$ (objective) and write the constraint as $g(\mathbf{x}) = c$.
Form $\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda(g(\mathbf{x}) - c)$.
Set $\nabla_{\mathbf{x}} \mathcal{L} = \mathbf{0}$ and $\partial \mathcal{L}/\partial\lambda = 0$.
Solve the resulting system for $\mathbf{x}^*$ and $\lambda^*$.

Note: $\lambda$ itself is an auxiliary variable — only $\mathbf{x}^*$ is the final answer.
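The recipe can be sketched in code for one special case, chosen here because the stationarity conditions become a linear system: a quadratic objective $f(\mathbf{x}) = -\tfrac{1}{2}\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top\mathbf{x}$ with a linear constraint $\mathbf{a}^\top\mathbf{x} = c$. The symbols $\mathbf{A}$, $\mathbf{b}$, $\mathbf{a}$, the function name `solve_quadratic_linear`, and the test problem are all assumptions introduced for illustration, not part of the lecture. If $\mathbf{A}$ is positive definite, $f$ is concave and the stationary point is the constrained maximum.

```python
import numpy as np

def solve_quadratic_linear(A, b, a, c):
    """Solve the stationarity conditions of L = f + lam * (a @ x - c),
    where f(x) = -0.5 * x @ A @ x + b @ x.

    grad_x L = -A x + b + lam * a = 0   and   dL/dlam = a @ x - c = 0
    form one linear system in the unknowns (x, lam).
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    a = np.asarray(a, dtype=float)
    n = b.size

    # Block matrix [[-A, a], [a^T, 0]] acting on the stacked vector (x, lam).
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = -A
    K[:n, n] = a
    K[n, :n] = a
    rhs = np.concatenate([-b, [c]])

    solution = np.linalg.solve(K, rhs)
    return solution[:n], solution[n]

# Hypothetical problem: maximize -x1^2 - 2*x2^2 subject to x1 + x2 = 3.
A = np.diag([2.0, 4.0])
b = np.zeros(2)
a = np.array([1.0, 1.0])
x_star, lam_star = solve_quadratic_linear(A, b, a, c=3.0)
print("x* =", x_star, "lambda* =", lam_star)
```

Working the same system by hand gives $x_1 = \lambda/2$, $x_2 = \lambda/4$, and $x_1 + x_2 = 3$, so $\lambda^* = 4$ and $\mathbf{x}^* = (2, 1)$, which the numerical solve should reproduce. For general nonlinear $f$ and $g$ the system is nonlinear and needs a symbolic or iterative solver instead.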

4. Worked Example

Maximize $f$ on a Line

Problem. Maximize $f(x_1, x_2) = -x_1^2 - x_2^2$ subject to $g(x_1, x_2) = x_1 + x_2 = 1$.

Lagrangian. $\mathcal{L} = -x_1^2 - x_2^2 + \lambda(x_1 + x_2 - 1)$.

Stationarity conditions:

$\partial\mathcal{L}/\partial x_1 = -2x_1 + \lambda = 0 \implies x_1 = \lambda/2$.
$\partial\mathcal{L}/\partial x_2 = -2x_2 + \lambda = 0 \implies x_2 = \lambda/2$.
$\partial\mathcal{L}/\partial\lambda = x_1 + x_2 - 1 = 0 \implies \lambda/2 + \lambda/2 = 1 \implies \lambda = 1$.

Solution. $x_1^* = x_2^* = \tfrac{1}{2}$, $\lambda^* = 1$. The constrained maximum is $f(1/2, 1/2) = -1/2$.

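Intuition: without the constraint, $f$ is maximized at the origin, but we must stay on the line $x_1 + x_2 = 1$. Because $f(\mathbf{x}) = -\|\mathbf{x}\|^2$, maximizing $f$ on the line is the same as finding the point on the line closest to the origin. That point is $(\tfrac{1}{2}, \tfrac{1}{2})$, exactly our answer.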
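The result can be cross-checked numerically. The sketch below parametrizes the constraint as $(t, 1 - t)$, scans $f$ over a grid of $t$ values, and separately verifies the condition $\nabla f + \lambda \nabla g = \mathbf{0}$ at the analytic solution. The grid range and resolution are arbitrary choices made for this check.

```python
import numpy as np

def f(x1, x2):
    return -x1**2 - x2**2

# Every point on the line x1 + x2 = 1 can be written as (t, 1 - t).
t = np.linspace(-2.0, 3.0, 5001)
values = f(t, 1.0 - t)
best = np.argmax(values)
x_grid = np.array([t[best], 1.0 - t[best]])

# Stationarity at the analytic solution x* = (1/2, 1/2), lambda* = 1.
x_star = np.array([0.5, 0.5])
lam_star = 1.0
grad_f = -2.0 * x_star            # gradient of -x1^2 - x2^2
grad_g = np.array([1.0, 1.0])     # gradient of x1 + x2
residual = grad_f + lam_star * grad_g

print("grid maximizer:", x_grid, "f =", values[best])
print("stationarity residual:", residual)
```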

5. Role in Machine Learning

Lagrange multipliers appear throughout this course:

PCA (Lecture 10.1): maximize projected variance $\mathbf{u}^\top \mathbf{S}\mathbf{u}$ subject to $\mathbf{u}^\top\mathbf{u} = 1$ → eigenvalue problem $\mathbf{S}\mathbf{u} = \lambda\mathbf{u}$ (with the multiplier's sign flipped relative to the convention above).
Gaussian mixture models (Lecture 9.4): maximize log-likelihood subject to $\sum_k \pi_k = 1$ → MLE update for mixing coefficients $\pi_k$.
SVMs (Lectures 11.3–11.4): maximize margin subject to classification constraints → dual Lagrangian with inequality constraints (KKT conditions).
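The PCA item can be previewed numerically. The sketch below builds a covariance matrix from synthetic data (an assumption made for illustration; the data, seed, and mixing matrix are arbitrary), takes the eigenvector with the largest eigenvalue, and checks that no randomly drawn unit vector achieves a larger projected variance. It illustrates the claim and does not reproduce the derivation from Lecture 10.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data with unequal spread along different directions.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
S = np.cov(X, rowvar=False)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
u = eigvecs[:, -1]          # unit-norm eigenvector for the largest eigenvalue
top = eigvals[-1]

# Projected variance v^T S v for many random unit vectors v.
V = rng.normal(size=(1000, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)
projected = np.einsum("ij,jk,ik->i", V, S, V)

print("largest eigenvalue:", top)
print("u^T S u:", u @ S @ u)
print("best random direction:", projected.max())
```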