Lecture 3.1

Linear Regression With Basis Functions

Learning Objectives

After this lecture you should be able to:

  • Write down the plain linear model $y(\mathbf{x};\mathbf{w}) = w_0 + \mathbf{w}^\top\mathbf{x}$ and show how the bias can be absorbed into the weight vector using an augmented input $\tilde{\mathbf{x}} = [1, x_1, \ldots, x_D]^\top$.
  • Explain why a model that is linear in the raw input cannot fit nonlinear data, and how basis functions solve this.
  • Write the basis function regression model $y(\mathbf{x};\mathbf{w}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x})$, explain what each symbol means, and describe how $\boldsymbol{\phi}$ changes the dimensionality of the input.
  • Explain why this is still called a linear model: it is linear in $\mathbf{w}$, even when it is nonlinear in $\mathbf{x}$.
  • State and describe three basis function families — polynomial, Gaussian, and logistic sigmoid — and identify the hyperparameters of each.
  • Distinguish parameters (fit automatically via optimization) from hyperparameters (chosen by the designer).

Week 2 established how to optimize a parametric model (via MLE, MAP, or Bayesian averaging) but left the model $y(\mathbf{x};\mathbf{w})$ abstract. This lecture makes it concrete: we define a family of models, specify the parameters to be optimized, and see how to handle nonlinear data with linear methods.

1. The Plain Linear Model

The simplest model for regression assigns one weight to each input dimension plus a bias:

Linear Model $$y(\mathbf{x};\mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D = w_0 + \mathbf{w}^\top\mathbf{x}$$

Here $\mathbf{x} \in \mathbb{R}^D$ is the input vector, $\mathbf{w} = [w_1,\ldots,w_D]^\top$ are the weights, and $w_0$ is the bias (intercept).

It is often convenient to absorb the bias into the weight vector. Define the augmented input and weight vectors:

$$\tilde{\mathbf{x}} = \begin{bmatrix}1 \\ x_1 \\ \vdots \\ x_D\end{bmatrix}, \qquad \tilde{\mathbf{w}} = \begin{bmatrix}w_0 \\ w_1 \\ \vdots \\ w_D\end{bmatrix}$$

Then $y(\mathbf{x};\tilde{\mathbf{w}}) = \tilde{\mathbf{w}}^\top\tilde{\mathbf{x}}$ — a single dot product with no separate bias term. (We will drop the tilde where the context is clear.)
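The equivalence of the two forms can be checked numerically. The sketch below uses NumPy and made-up values for $w_0$, $\mathbf{w}$, and $\mathbf{x}$; the helper name `augment` is hypothetical and chosen only for this illustration.

```python
import numpy as np

def augment(x):
    """Prepend a constant 1 to the input vector x of shape (D,)."""
    return np.concatenate(([1.0], x))

# Hypothetical values, chosen only for illustration.
w0 = 0.5
w = np.array([2.0, -1.0, 3.0])
x = np.array([1.0, 4.0, 2.0])

y_explicit = w0 + w @ x                # w_0 + w^T x
w_tilde = np.concatenate(([w0], w))    # [w_0, w_1, ..., w_D]
y_augmented = w_tilde @ augment(x)     # single dot product, no separate bias

print(y_explicit, y_augmented)
```

Both expressions evaluate the same sum, so the two printed values agree.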

Example: House Prices

Predict house price from floor area, age, and garden size. The input vector is $\mathbf{x} = [\text{area}, \text{age}, \text{garden}]^\top$ and the model assigns one weight to each feature: $y = w_0 + w_1\cdot\text{area} + w_2\cdot\text{age} + w_3\cdot\text{garden}$. In a 1D version (area only) the model is a straight line through the scatter plot; tuning $w_0$ shifts the intercept and $w_1$ sets the slope.

2. Why Linear Models Are Not Enough

A model that is linear in the raw input $\mathbf{x}$ can only represent straight lines (in 1D) or hyperplanes (in higher dimensions). If the data follows a nonlinear pattern — for instance, house prices that rise then saturate as floor area increases — no choice of weights $\mathbf{w}$ will give a good fit. Fitting a line to nonlinear data forces a trade-off: the fit will be poor in at least part of the input space.
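A small numerical illustration of this trade-off, as a sketch only: the saturating target $1 - e^{-x}$ is a synthetic stand-in for the house-price pattern, and ordinary least squares (`np.linalg.lstsq`) is assumed as the fitting procedure, since the lecture has not fixed one at this point.

```python
import numpy as np

# Synthetic saturating data (an assumption made for illustration).
x = np.linspace(0.0, 5.0, 50)
y = 1.0 - np.exp(-x)

# Rows are augmented inputs [1, x_n]; fit the best straight line.
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ w
max_abs_error = np.max(np.abs(residuals))

print("intercept, slope:", w)
print("largest absolute residual:", max_abs_error)
print("residual at left end, middle, right end:",
      residuals[0], residuals[25], residuals[-1])
```

The residuals are negative at both ends of the range and positive in the middle: the line overshoots where the curve is low or flat and undershoots in between, and no other choice of intercept and slope removes this pattern.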

The fix is to transform the input first, and then apply a linear model to the transformed features. This is the idea behind basis functions.

3. Basis Function Regression

Choose $M-1$ basis functions $\phi_1, \ldots, \phi_{M-1}$, where each $\phi_i : \mathbb{R}^D \to \mathbb{R}$ maps the raw input to a new scalar feature. Define $\phi_0(\mathbf{x}) = 1$ to absorb the bias. Stack them into a feature vector:

$$\boldsymbol{\phi}(\mathbf{x}) = \begin{bmatrix}\phi_0(\mathbf{x}) \\ \phi_1(\mathbf{x}) \\ \vdots \\ \phi_{M-1}(\mathbf{x})\end{bmatrix} \in \mathbb{R}^M$$

Basis Function Regression Model $$y(\mathbf{x};\mathbf{w}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) = \sum_{j=0}^{M-1} w_j\,\phi_j(\mathbf{x})$$

The model is a linear combination of $M$ basis function outputs, weighted by $\mathbf{w} \in \mathbb{R}^M$.

This is still called a linear model because it is linear in the parameters $\mathbf{w}$: for a fixed input, the output is a weighted sum of the $w_j$ with fixed coefficients $\phi_j(\mathbf{x})$, so doubling any weight $w_j$ doubles its contribution to the output. The function $y(\mathbf{x};\mathbf{w})$ can be highly nonlinear in $\mathbf{x}$, but that nonlinearity is confined to the fixed basis functions, not to the parameters being optimized. This distinction is what keeps the optimization tractable.

Note also that $\boldsymbol{\phi}$ changes the dimensionality of the representation: a $D$-dimensional input $\mathbf{x}$ becomes an $M$-dimensional feature vector $\boldsymbol{\phi}(\mathbf{x})$. $M$ is set by the number of basis functions chosen, not by $D$, so it can be larger or smaller than $D$.
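The model and its two key properties (linear in $\mathbf{w}$, generally nonlinear in $\mathbf{x}$) can be sketched in a few lines of NumPy. The four basis functions below are arbitrary choices made for illustration, not ones prescribed by the lecture, and the names `phi` and `predict` are hypothetical.

```python
import numpy as np

# Hypothetical basis functions on a D = 2 input; phi_0 is the constant 1.
basis = [
    lambda x: 1.0,
    lambda x: x[0] * x[1],
    lambda x: np.sin(x[0]),
    lambda x: x[1] ** 2,
]

def phi(x):
    """Map a raw input x in R^D to the feature vector phi(x) in R^M."""
    return np.array([f(x) for f in basis])

def predict(x, w):
    """Evaluate y(x; w) = w^T phi(x)."""
    return w @ phi(x)

x = np.array([0.5, -2.0])                # D = 2
w_a = np.array([1.0, 0.5, -1.0, 2.0])    # M = 4, illustrative weights
w_b = np.array([0.0, 3.0, 1.0, -0.5])

# Linear in w: y(x; 2 w_a + 3 w_b) equals 2 y(x; w_a) + 3 y(x; w_b).
lhs = predict(x, 2 * w_a + 3 * w_b)
rhs = 2 * predict(x, w_a) + 3 * predict(x, w_b)
print(np.isclose(lhs, rhs))

# Not linear in x: doubling the input does not double the output here.
print(predict(2 * x, w_a), 2 * predict(x, w_a))
```

Here a 2-dimensional input is mapped to a 4-dimensional feature vector, so $M > D$; dropping basis functions from the list would give $M \le D$ just as easily.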

4. Three Basis Function Families

Polynomial Basis

Polynomial Basis Functions $$\phi_i(x) = x^i, \quad i = 0, 1, 2, \ldots$$ $$\boldsymbol{\phi}(x) = \begin{bmatrix}1,\; x,\; x^2,\; x^3,\; \ldots\end{bmatrix}^\top$$

The resulting model $y = w_0 + w_1 x + w_2 x^2 + \cdots$ is a polynomial in the scalar input $x$. Tuning $w_0$ shifts the intercept, $w_1$ adds a slope, $w_2$ adds a parabolic term, and so on. Combining these terms yields increasingly flexible smooth curves as higher powers are included. Hyperparameter: the maximum degree; with $\phi_0 = 1$ included, a degree-$p$ polynomial uses $M = p + 1$ basis functions.
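As a sketch, the polynomial feature vectors for a batch of scalar inputs can be stacked into a matrix with `np.vander`. The helper name `polynomial_features`, the weights `true_w`, and the use of least squares to recover them are assumptions made for this illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return the N x (degree + 1) matrix whose row n is
    [1, x_n, x_n^2, ..., x_n^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(-1.0, 1.0, 20)
true_w = np.array([1.0, 2.0, 0.0, -0.5])   # hypothetical weights w_0..w_3

Phi = polynomial_features(x, degree=3)     # shape (20, 4)
y = Phi @ true_w                           # y_n = w^T phi(x_n), noise-free

# Because the model is linear in w, least squares recovers the weights.
w_fit, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_fit)
```

The targets are a cubic in $x$, yet the fit is an ordinary linear least-squares problem in the four weights.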

Gaussian Basis

Gaussian Basis Functions $$\phi_i(\mathbf{x}) = \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right)$$

Each basis function is a Gaussian bump centered at $\boldsymbol{\mu}_i$ with shape controlled by $\boldsymbol{\Sigma}$. (The normalization constant is omitted — it would just be absorbed into $w_i$.)

Hyperparameters: the centers $\boldsymbol{\mu}_i$ and the covariance $\boldsymbol{\Sigma}$. Key property: each basis function is localized. It equals 1 at $\boldsymbol{\mu}_i$ and decays toward zero as $\mathbf{x}$ moves away from it. Because each weight $w_i$ mainly affects the output near $\boldsymbol{\mu}_i$, the model can capture structure in different regions of input space largely independently.
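A minimal NumPy sketch of the Gaussian feature map, assuming a single covariance $\boldsymbol{\Sigma}$ shared by all bumps (as in the formula above). The function name `gaussian_features`, the centers, and the covariance are illustrative choices, not values from the lecture.

```python
import numpy as np

def gaussian_features(X, centers, Sigma):
    """X: (N, D) inputs, centers: (K, D), Sigma: (D, D).
    Returns an (N, K + 1) matrix: a constant column, then one bump per center."""
    Sigma_inv = np.linalg.inv(Sigma)
    diff = X[:, None, :] - centers[None, :, :]              # (N, K, D)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every input/center pair.
    maha = np.einsum("nkd,de,nke->nk", diff, Sigma_inv, diff)
    bumps = np.exp(-0.5 * maha)
    return np.column_stack([np.ones(len(X)), bumps])

# Hypothetical hyperparameters, chosen by hand.
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = 0.5 * np.eye(2)

X = np.array([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]])
Phi = gaussian_features(X, centers, Sigma)
print(Phi)
```

Each input that sits on a center activates that center's bump fully and the other one hardly at all, which is the localization property described above; the midpoint activates both equally.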

Logistic Sigmoid Basis

Logistic Sigmoid Basis Functions $$\phi_i(x) = \sigma\!\left(\frac{x - \mu_i}{s}\right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}$$

Hyperparameters: the offset $\mu_i$ (location of the transition) and the scale $s$ (steepness — large $s$ → gradual; small $s$ → sharp step). Each basis function acts as a smooth indicator: approximately 0 for $x \ll \mu_i$ and 1 for $x \gg \mu_i$, with a smooth transition around $\mu_i$. Linear combinations of shifted sigmoids can model piecewise-constant or threshold-type behavior.
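The step-like behavior can be seen in a short sketch. The offsets, scale, and weights below are made up for illustration, and `sigmoid_features` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_features(x, offsets, s):
    """x: (N,) scalar inputs, offsets: (K,) values of mu_i, s: shared scale.
    Returns an (N, K + 1) matrix: a constant column, then one sigmoid per offset."""
    a = (x[:, None] - offsets[None, :]) / s
    return np.column_stack([np.ones(len(x)), sigmoid(a)])

# Hypothetical hyperparameters and weights.
offsets = np.array([1.0, 3.0])
s = 0.05                          # small scale: sharp transitions
w = np.array([0.0, 2.0, -1.0])    # step up by 2 at x = 1, down by 1 at x = 3

x = np.array([0.0, 2.0, 4.0])
y = sigmoid_features(x, offsets, s) @ w
print(y)
```

With a small $s$ the output is close to a staircase: near 0 below the first offset, near 2 between the offsets, and near 1 above the second. Increasing $s$ smooths the steps into gradual ramps.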

5. Parameters vs. Hyperparameters

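Parameters vs. Hyperparameters

|             | Parameters                           | Hyperparameters                                                              |
|-------------|--------------------------------------|------------------------------------------------------------------------------|
| What        | Weights $\mathbf{w}$                 | Basis function type, $M$, $\boldsymbol{\mu}_i$, $s$, $\boldsymbol{\Sigma}$   |
| How set     | Optimized automatically (MLE, MAP, …) | Chosen by the designer                                                       |
| Flexibility | Inside the chosen model class        | Defines which model class to use                                             |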

Choosing a good set of basis functions (type and hyperparameters) is a design decision that requires domain knowledge and experimentation. It is not something the optimization handles automatically — that is the job of model selection, covered in Week 4.
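The division of labor can be sketched in code. In the example below the polynomial degree is a hyperparameter set by hand in the loop, while the weights are computed inside `fit_weights` by least squares. The function name, the synthetic data, and the use of least squares as the optimizer are all assumptions for illustration.

```python
import numpy as np

def fit_weights(x, y, degree):
    """For a fixed hyperparameter `degree`, compute the parameters w
    by least squares and report the mean squared training error."""
    Phi = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    mse = np.mean((y - Phi @ w) ** 2)
    return w, mse

# Synthetic saturating data (an assumption made for illustration).
x = np.linspace(0.0, 5.0, 50)
y = 1.0 - np.exp(-x)

results = {}
for degree in (1, 3, 5):          # hyperparameter: picked by the designer
    w, mse = fit_weights(x, y, degree)
    results[degree] = (w, mse)
    print(f"degree={degree}: {len(w)} weights, training MSE={mse:.2e}")
```

Nothing inside `fit_weights` decides which degree to use; each call only finds the best weights within the model class it is handed.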