Lecture 3.2

Linear Regression via Maximum Likelihood

Learning Objectives

After this lecture you should be able to:

Write down the log likelihood for a Gaussian regression model and identify which terms depend on $\mathbf{w}$.
Explain why minimizing the sum of squared errors is equivalent to maximizing the log likelihood under Gaussian noise.
Explain why the sum of squared errors is a convex function of $\mathbf{w}$, and what that implies for optimization.
Define the design matrix $\boldsymbol{\Phi}$ and write the normal equations $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\,\mathbf{w} = \boldsymbol{\Phi}^\top\mathbf{t}$.
State the closed-form MLE solution $\mathbf{w}_{ML} = \boldsymbol{\Phi}^+\mathbf{t}$ and explain what the Moore–Penrose pseudoinverse is.
Define the RMSE as an interpretable error metric and explain why it is preferred over the raw SSE.

We now have a model class — basis function regression — and an optimization criterion — maximum likelihood. This lecture works out the solution explicitly, deriving the closed-form weights that minimize the sum of squared errors.

1. Setting and Log Likelihood

The model assumes targets are generated with Gaussian noise around the model prediction:

$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t\,;\; \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}),\; \beta^{-1})$$

For a dataset $\mathcal{D} = \{(\mathbf{x}_i, t_i)\}_{i=1}^N$ of i.i.d. pairs the log likelihood is (from lecture 2.4):

$$\ln p(\mathcal{D} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{i=1}^{N}\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2$$

Only the last term depends on $\mathbf{w}$, and since $\beta > 0$ it enters with a negative sign. Maximizing the log likelihood with respect to $\mathbf{w}$ is therefore equivalent to minimizing the sum of squared errors (SSE):

SSE Objective $$E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2$$

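This is a convex quadratic function of $\mathbf{w}$, so every stationary point is a global minimum: setting the gradient to zero is both necessary and sufficient for optimality. The minimizer is unique when the feature vectors $\boldsymbol{\phi}(\mathbf{x}_1),\ldots,\boldsymbol{\phi}(\mathbf{x}_N)$ span $\mathbb{R}^M$; otherwise there is a whole family of weight vectors that attain the same minimal error.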
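The equivalence between the log likelihood and the SSE can be checked numerically. The sketch below is illustrative only: the random data, the array name `feats` (row $i$ holds $\boldsymbol{\phi}(\mathbf{x}_i)^\top$), and the value of `beta` are assumptions made for the example, not part of the lecture.

```python
import numpy as np

def sse(w, feats, t):
    """E(w) = 1/2 * sum_i (t_i - w^T phi(x_i))^2."""
    residuals = t - feats @ w
    return 0.5 * residuals @ residuals

def log_likelihood(w, beta, feats, t):
    """Gaussian log likelihood; note (beta/2) * sum of squares = beta * E(w)."""
    n = len(t)
    return (0.5 * n * np.log(beta)
            - 0.5 * n * np.log(2.0 * np.pi)
            - beta * sse(w, feats, t))

# Hypothetical data: 20 points, 3 features per point.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 3))
t = rng.normal(size=20)
beta = 2.5

w_a = rng.normal(size=3)
w_b = rng.normal(size=3)

# The terms without w cancel, so the log-likelihood difference is
# exactly -beta times the SSE difference.
ll_diff = log_likelihood(w_a, beta, feats, t) - log_likelihood(w_b, beta, feats, t)
sse_diff = sse(w_a, feats, t) - sse(w_b, feats, t)
print(np.isclose(ll_diff, -beta * sse_diff))

# Convexity check at the midpoint: E((a+b)/2) <= (E(a) + E(b)) / 2.
w_mid = 0.5 * (w_a + w_b)
print(sse(w_mid, feats, t) <= 0.5 * (sse(w_a, feats, t) + sse(w_b, feats, t)))
```

Because the two objectives differ only by a negative scale factor and a constant, whichever weight vector has the lower SSE also has the higher log likelihood.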

2. The Design Matrix

Stack all feature vectors as rows to form the design matrix:

Design Matrix $\boldsymbol{\Phi}$ $$\boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(\mathbf{x}_1)^\top \\ \boldsymbol{\phi}(\mathbf{x}_2)^\top \\ \vdots \\ \boldsymbol{\phi}(\mathbf{x}_N)^\top \end{bmatrix} \in \mathbb{R}^{N \times M}, \qquad \Phi_{ij} = \phi_j(\mathbf{x}_i)$$

Row $i$ is the $M$-dimensional feature vector for the $i$-th data point. Column $j$ is the $j$-th basis function evaluated at all data points.

In matrix notation, $E(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2$, where $\mathbf{t} = [t_1,\ldots,t_N]^\top$.
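As a concrete illustration, the sketch below builds a design matrix for a polynomial basis $\phi_j(x) = x^{j}$ and confirms that the sum form and the matrix form of $E(\mathbf{w})$ agree. The polynomial basis, the helper name `polynomial_design_matrix`, and the toy data are assumptions chosen for the example; note that the code indexes columns from 0, whereas the text indexes basis functions from 1.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Return the N x M matrix with entries Phi[i, j] = x[i] ** j (j = 0..M-1)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M, increasing=True)

# Toy 1-D inputs and targets (hypothetical values).
x = np.array([0.0, 0.5, 1.0, 2.0])
t = np.array([1.0, 1.5, 2.5, 5.0])

Phi = polynomial_design_matrix(x, M=3)   # shape (N, M) = (4, 3)
w = np.array([1.0, 0.5, 0.25])           # arbitrary weights for the comparison

# Sum form: one term per data point, using row i of Phi as phi(x_i).
sse_sum = 0.5 * sum((t[i] - w @ Phi[i]) ** 2 for i in range(len(t)))

# Matrix form: half the squared Euclidean norm of the residual vector.
sse_matrix = 0.5 * np.linalg.norm(t - Phi @ w) ** 2

print(Phi)
print(sse_sum, sse_matrix)
```

Each row of `Phi` is one data point's feature vector and each column is one basis function evaluated over the whole dataset, matching the description above.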

3. Deriving the Normal Equations

Gradient convention. We define $\nabla_\mathbf{w} f = \tfrac{\partial f}{\partial \mathbf{w}}$ as a row vector — the $j$-th entry is $\tfrac{\partial f}{\partial w_j}$. Setting this row vector to zero gives the optimality condition.

Differentiating $E(\mathbf{w})$ with respect to $\mathbf{w}$ using the substitution $u_i = t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)$:

$$\nabla_\mathbf{w} E = -\sum_{i=1}^{N}\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)\,\boldsymbol{\phi}(\mathbf{x}_i)^\top$$

The key step: $\tfrac{\partial}{\partial \mathbf{w}}(\mathbf{w}^\top\boldsymbol{\phi}) = \boldsymbol{\phi}^\top$ (row vector), so the chain rule gives $-\boldsymbol{\phi}^\top$. Setting to zero and transposing both sides:

Normal Equations $$\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\,\mathbf{w} = \boldsymbol{\Phi}^\top\mathbf{t}$$

Any $\mathbf{w}$ satisfying this system minimizes the SSE globally. Written out as sums, the left-hand side is $\bigl(\sum_i \boldsymbol{\phi}(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}_i)^\top\bigr)\mathbf{w} = \boldsymbol{\Phi}^\top\boldsymbol{\Phi}\,\mathbf{w}$ and the right-hand side is $\sum_i t_i\,\boldsymbol{\phi}(\mathbf{x}_i) = \boldsymbol{\Phi}^\top\mathbf{t}$.
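One way to gain confidence in the derivation is to compare the analytic gradient with a finite-difference approximation, and to check that the per-point sums match the matrix products in the normal equations. This is a minimal sketch on random data; the function names and the step size `eps` are assumptions for illustration, and the gradient is stored as a 1-D array, so the row/column distinction does not appear in the code.

```python
import numpy as np

def sse(w, Phi, t):
    """E(w) = 1/2 * ||t - Phi w||^2."""
    r = t - Phi @ w
    return 0.5 * r @ r

def sse_gradient(w, Phi, t):
    """Analytic gradient: -sum_i (t_i - w^T phi_i) phi_i = Phi^T (Phi w - t)."""
    return Phi.T @ (Phi @ w - t)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

# Hypothetical data: 30 points, 4 basis functions.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 4))
t = rng.normal(size=30)
w = rng.normal(size=4)

g_analytic = sse_gradient(w, Phi, t)
g_numeric = numerical_gradient(lambda v: sse(v, Phi, t), w)
print(np.allclose(g_analytic, g_numeric, atol=1e-5))

# Per-point sums that make up the two sides of the normal equations.
A_sum = sum(np.outer(phi, phi) for phi in Phi)          # sum_i phi_i phi_i^T
b_sum = sum(t_i * phi for t_i, phi in zip(t, Phi))      # sum_i t_i phi_i
print(np.allclose(A_sum, Phi.T @ Phi), np.allclose(b_sum, Phi.T @ t))
```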

4. Closed-Form Solution

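When $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is invertible, which holds exactly when $\boldsymbol{\Phi}$ has full column rank (its $M$ columns are linearly independent, which requires $N \ge M$), left-multiplying both sides of the normal equations by $(\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}$ gives: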

MLE Solution for $\mathbf{w}$ $$\mathbf{w}_{ML} = \underbrace{(\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top}_{\boldsymbol{\Phi}^+}\,\mathbf{t} = \boldsymbol{\Phi}^+\mathbf{t}$$

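$\boldsymbol{\Phi}^+$ is the Moore–Penrose pseudoinverse of $\boldsymbol{\Phi}$, which generalizes the matrix inverse to non-square matrices. In the full-column-rank case it is a left inverse, $\boldsymbol{\Phi}^+\boldsymbol{\Phi} = \mathbf{I}$, although $\boldsymbol{\Phi}\boldsymbol{\Phi}^+ \neq \mathbf{I}$ in general when $N > M$. If $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is singular, the pseudoinverse is still defined (for example via the singular value decomposition), and $\boldsymbol{\Phi}^+\mathbf{t}$ is then the minimum-norm solution of the normal equations. The prediction for a new input $\mathbf{x}^*$ is: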

$$\mathbb{E}[t^*] = \mathbf{w}_{ML}^\top\boldsymbol{\phi}(\mathbf{x}^*)$$
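The closed-form solution can be computed in several numerically different but mathematically equivalent ways. The sketch below compares three NumPy routes on synthetic data: solving the normal equations directly, applying `np.linalg.pinv`, and calling `np.linalg.lstsq`. The quadratic basis, the "true" weights, and the noise level are assumptions made up for the example, and the design matrix is assumed to have full column rank.

```python
import numpy as np

# Synthetic 1-D data with a quadratic polynomial basis (hypothetical setup).
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
Phi = np.vander(x, N=3, increasing=True)          # columns: 1, x, x^2
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true + rng.normal(scale=0.1, size=50)

# Route 1: solve Phi^T Phi w = Phi^T t without forming an explicit inverse.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Route 2: apply the Moore-Penrose pseudoinverse (computed via the SVD).
w_pinv = np.linalg.pinv(Phi) @ t

# Route 3: a least-squares solver working on Phi directly.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(w_normal, w_pinv, w_lstsq)

# Prediction at a new input x* = 0.3: w_ML^T phi(x*).
x_new = 0.3
phi_new = np.vander([x_new], N=3, increasing=True)[0]
t_pred = w_lstsq @ phi_new
print(t_pred)
```

In practice, routes that work on $\boldsymbol{\Phi}$ directly (2 and 3) tend to be better conditioned than forming $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$, since the latter squares the condition number; for small, well-conditioned problems like this one all three agree to numerical precision.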

5. Reporting Error: RMSE

The raw SSE grows with the number of data points, making it hard to compare across datasets. Two normalizations give a more interpretable metric:

Root Mean Squared Error (RMSE) $$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(t_i - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i)\bigr)^2}$$

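Dividing by $N$ gives a per-data-point average; taking the square root restores the original units. If targets are in meters, RMSE is also in meters, unlike SSE, which is in meters squared. In terms of the objective above, $\text{RMSE} = \sqrt{2E(\mathbf{w})/N}$, so the weights that minimize the SSE also minimize the RMSE. RMSE is a standard error metric for reporting regression results in practice.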
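A small worked example makes the difference between the two metrics concrete. The sketch below uses a hand-picked dataset (an assumption for illustration) in which three of four predictions are exact and one is off by 2, then duplicates the dataset to show that SSE doubles while RMSE stays the same.

```python
import numpy as np

def sse(w, Phi, t):
    """E(w) = 1/2 * sum of squared residuals (grows with N)."""
    r = t - Phi @ w
    return 0.5 * r @ r

def rmse(w, Phi, t):
    """Root mean squared error, in the same units as the targets."""
    r = t - Phi @ w
    return np.sqrt(np.mean(r ** 2))

# Linear basis phi(x) = [1, x] at x = 0, 1, 2, 3 (hypothetical values).
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 9.0])
w = np.array([1.0, 2.0])   # predictions 1, 3, 5, 7; residuals 0, 0, 0, 2

print(sse(w, Phi, t))      # 0.5 * 4 = 2.0
print(rmse(w, Phi, t))     # sqrt(4 / 4) = 1.0

# Duplicate every data point: same fit quality, twice as much data.
Phi2 = np.vstack([Phi, Phi])
t2 = np.concatenate([t, t])
print(sse(w, Phi2, t2))    # 4.0, doubled
print(rmse(w, Phi2, t2))   # 1.0, unchanged
```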