Lecture 6.3
Discriminant Functions
Stepping away from probabilistic models to directly learn a decision boundary via a discriminant function, and understanding the geometric role of each component in the linear model.
- Define a generalized linear model and a discriminant function.
- Interpret the weight vector $\mathbf{w}$ as the orientation of the decision hyperplane.
- Interpret the bias $w_0$ as the offset of the hyperplane from the origin.
- Show that $y(\mathbf{x})$ is proportional to the signed distance from $\mathbf{x}$ to the decision surface.
- Extend discriminant functions to the multi-class setting and state the convexity property of the resulting decision regions.
1. From Generative to Discriminative
Probabilistic generative models (Lectures 5.6–6.2) first build $p(\mathbf{x} \mid C_k)$ and $p(C_k)$, then derive the posterior. A discriminant function skips the probabilistic model entirely and directly defines a mapping from input to class label. The appeal is simplicity and flexibility in the choice of decision boundary.
2. Generalized Linear Models
A generalized linear model takes the form
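$$y(\mathbf{x}) = f\!\bigl(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\bigr),$$where $\boldsymbol{\phi}(\mathbf{x})$ is a (possibly nonlinear) feature vector, $\mathbf{w}$ is the vector of learned weights, and $f$ is a fixed activation function. The argument of $f$ is linear in the parameters $\mathbf{w}$, even if $\boldsymbol{\phi}$ is nonlinear in $\mathbf{x}$; the output $y$ itself is nonlinear in $\mathbf{w}$ whenever $f$ is nonlinear. Decision boundaries of the form $y(\mathbf{x}) = \text{const}$ are therefore linear in feature space, although they may be nonlinear in the original input space.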
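To make the distinction between feature space and input space concrete, here is a minimal NumPy sketch. The feature map `phi`, the weights `w`, and the helper `glm` are illustrative assumptions, not part of the lecture: with $\boldsymbol{\phi}(\mathbf{x}) = (1, x_1^2, x_2^2)^\top$ and $\mathbf{w} = (-1, 1, 1)^\top$, the boundary $\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) = 0$ is a plane in feature space but the unit circle in input space.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for 2-D inputs: (1, x1^2, x2^2)."""
    x = np.asarray(x, dtype=float)
    return np.array([1.0, x[0] ** 2, x[1] ** 2])

def glm(x, w, f=lambda a: a):
    """Evaluate y(x) = f(w^T phi(x)); f defaults to the identity."""
    return f(w @ phi(x))

# Illustrative weights chosen so that w^T phi(x) = x1^2 + x2^2 - 1.
w = np.array([-1.0, 1.0, 1.0])

inside = glm([0.5, 0.5], w)       # point inside the unit circle
outside = glm([2.0, 0.0], w)      # point outside the unit circle
on_boundary = glm([1.0, 0.0], w)  # point on the unit circle

# A monotonic activation such as tanh changes the output scale,
# but not the location of the y = 0 boundary.
squashed = glm([2.0, 0.0], w, f=np.tanh)

print(inside, outside, on_boundary, squashed)
```

The sign of the output separates the inside of the circle from the outside, even though the model only ever takes an inner product with $\mathbf{w}$.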
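In the simplest case, $\boldsymbol{\phi}(\mathbf{x}) = (1, \mathbf{x}^\top)^\top$ (the input prepended with a 1) and $f$ is the identity. Writing the bias separately, so that the augmented weight vector is $(w_0, \mathbf{w}^\top)^\top$ and $\mathbf{w}$ from here on denotes the non-bias weights only, gives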
$$y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0.$$Class assignment uses the sign of $y(\mathbf{x})$: assign $\mathbf{x}$ to $C_1$ if $y(\mathbf{x}) \geq 0$, else to $C_2$.
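The two-class rule can be sketched in a few lines of NumPy. The weights, bias, and sample points below are made-up values for illustration, and the function names `discriminant` and `classify` are hypothetical. The sketch also checks that the explicit-bias form agrees with the augmented form in which a 1 is prepended to each input.

```python
import numpy as np

def discriminant(X, w, w0):
    """Compute y(x) = w^T x + w0 for each row of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return X @ w + w0

def classify(X, w, w0):
    """Return 1 for class C_1 (y >= 0) and 2 for class C_2 (y < 0)."""
    return np.where(discriminant(X, w, w0) >= 0, 1, 2)

# Illustrative parameters and inputs (not taken from the lecture).
w = np.array([1.0, -2.0])
w0 = 0.5
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.5, 1.0]])  # the last point lies exactly on the boundary

y = discriminant(X, w, w0)
labels = classify(X, w, w0)

# Augmented form: prepend a 1 to each input and w0 to the weights.
w_aug = np.concatenate(([w0], w))
Phi = np.hstack([np.ones((X.shape[0], 1)), X])
y_aug = Phi @ w_aug

print(y, labels)
```

Note that the point on the boundary is assigned to $C_1$ only because of the $\geq$ convention in the rule above.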
3. Geometric Interpretation
The decision boundary is the set $\{\mathbf{x} : y(\mathbf{x}) = 0\}$, i.e. the solutions of $\mathbf{w}^\top\mathbf{x} + w_0 = 0$, which form a $(d-1)$-dimensional hyperplane in $\mathbb{R}^d$.
- $\mathbf{w}$: orientation. For any two points $\mathbf{x}_A, \mathbf{x}_B$ on the decision surface, $\mathbf{w}^\top(\mathbf{x}_A - \mathbf{x}_B) = 0$, so $\mathbf{w}$ is perpendicular to the hyperplane — it defines its orientation (normal direction).
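- $w_0$: distance from origin. For any $\mathbf{x}$ on the decision surface, $\mathbf{w}^\top\mathbf{x} / \|\mathbf{w}\| = -w_0 / \|\mathbf{w}\|$, so the signed distance of the hyperplane from the origin, measured along the direction of $\mathbf{w}$, is $-w_0 / \|\mathbf{w}\|$. For a fixed $\mathbf{w}$, a larger $|w_0|$ shifts the boundary further from the origin, and the sign of $w_0$ determines on which side of the origin the boundary lies.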
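- $y(\mathbf{x})$: proportional to the signed distance to the surface. Any point $\mathbf{x}$ can be decomposed as $\mathbf{x} = \mathbf{x}_\perp + r\,\hat{\mathbf{w}}$, where $\hat{\mathbf{w}} = \mathbf{w} / \|\mathbf{w}\|$ is the unit normal, $\mathbf{x}_\perp$ is the orthogonal projection of $\mathbf{x}$ onto the hyperplane, and $r$ is the signed distance. Substituting and using $\mathbf{w}^\top\mathbf{x}_\perp + w_0 = 0$ gives $y(\mathbf{x}) = r\,\mathbf{w}^\top\hat{\mathbf{w}} = r\,\|\mathbf{w}\|$, so $r = y(\mathbf{x}) / \|\mathbf{w}\|$. For a fixed $\mathbf{w}$, points with large $|y(\mathbf{x})|$ lie far from the boundary, and the sign of $y(\mathbf{x})$ indicates the side.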
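The three geometric facts above can be checked numerically. The following is a minimal sketch with made-up values ($\mathbf{w} = (3, 4)^\top$, $w_0 = -10$, so $\|\mathbf{w}\| = 5$); all variable names are illustrative.

```python
import numpy as np

# Illustrative hyperplane in R^2: 3*x1 + 4*x2 - 10 = 0.
w = np.array([3.0, 4.0])
w0 = -10.0
norm_w = np.linalg.norm(w)  # equals 5 for this choice of w
w_hat = w / norm_w          # unit normal

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

# 1) w is orthogonal to any vector lying within the hyperplane.
x_a = (-w0 / norm_w) * w_hat        # point on the plane closest to the origin
x_b = x_a + np.array([4.0, -3.0])   # move along a direction orthogonal to w
orthogonality = w @ (x_a - x_b)

# 2) Signed distance of the hyperplane from the origin is -w0 / ||w||.
origin_dist = -w0 / norm_w

# 3) Signed distance of a point is r = y(x) / ||w||, and x = x_perp + r * w_hat.
x = np.array([4.0, 7.0])
r = y(x) / norm_w
x_perp = x - r * w_hat              # orthogonal projection onto the hyperplane

print(orthogonality, origin_dist, r, x_perp, y(x_perp))
```

Rescaling $\mathbf{w}$ and $w_0$ by the same positive constant leaves the hyperplane and $r$ unchanged while scaling $y(\mathbf{x})$, which is why $y$ is only proportional to the distance.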
4. Multi-Class Discriminant Functions
For $K > 2$ classes, assign one discriminant function per class:
$$y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}, \quad k = 1,\dots,K.$$Assign $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$. The decision boundary between $C_k$ and $C_j$ is where $y_k(\mathbf{x}) = y_j(\mathbf{x})$, i.e.
$$(\mathbf{w}_k - \mathbf{w}_j)^\top \mathbf{x} + (w_{k0} - w_{j0}) = 0,$$which is again a linear (hyperplane) boundary.
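The decision region $\mathcal{R}_k = \{\mathbf{x} : y_k(\mathbf{x}) > y_j(\mathbf{x})\,\forall j \neq k\}$ is convex: for any $\mathbf{x}_A, \mathbf{x}_B \in \mathcal{R}_k$, every point on the segment $\lambda\mathbf{x}_A + (1-\lambda)\mathbf{x}_B$ ($\lambda \in [0,1]$) also belongs to $\mathcal{R}_k$. This follows directly from the linearity of each $y_k$, which gives $y_k(\lambda\mathbf{x}_A + (1-\lambda)\mathbf{x}_B) = \lambda\,y_k(\mathbf{x}_A) + (1-\lambda)\,y_k(\mathbf{x}_B)$. Since $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and $y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$ for every $j \neq k$, and $\lambda$ and $1-\lambda$ are nonnegative and sum to one, the same strict inequality holds at every point of the segment.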
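The multi-class rule and the convexity property can be illustrated with a short NumPy sketch. The weight matrix `W`, the biases `w0`, and the random test points are assumptions made for the example, and classes are indexed from 0 rather than 1 to match NumPy's `argmax`. The loop is a numerical spot check of convexity on sampled segments, not a proof.

```python
import numpy as np

# Illustrative parameters for K = 3 classes in d = 2 dimensions.
W = np.array([[1.0, 0.0],
              [-1.0, 1.0],
              [0.0, -1.0]])      # row k holds w_k
w0 = np.array([0.0, 0.5, -0.5])  # entry k holds w_{k0}

def scores(X):
    """Return the matrix of y_k(x) values, one row per input."""
    return np.atleast_2d(np.asarray(X, dtype=float)) @ W.T + w0

def predict(X):
    """Assign each input to the class with the largest discriminant (0-based)."""
    return np.argmax(scores(X), axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = predict(X)

# Spot check: points on a segment between two points of the same region
# should receive that region's label.
violations = 0
for k in range(W.shape[0]):
    pts = X[labels == k]
    if len(pts) < 2:
        continue
    x_a, x_b = pts[:-1], pts[1:]  # consecutive pairs within region k
    for lam in np.linspace(0.0, 1.0, 11):
        segment_pts = lam * x_a + (1.0 - lam) * x_b
        violations += int(np.sum(predict(segment_pts) != k))

print(np.bincount(labels, minlength=3), violations)
```

Taking the argmax over all $K$ discriminants at once is what keeps the regions convex and avoids the ambiguous regions that arise when combining several independent two-class decisions.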
Discriminant functions make hard class assignments without providing uncertainty estimates. The upcoming lectures (6.4 Least Squares, 6.5 Perceptron) show two concrete ways to fit them. Logistic regression (Lecture 7.2) will reintroduce probabilities into the discriminative setting.