Lecture 5.6

Probabilistic Generative Models

Generative classifiers: modelling class-conditional densities $p(\mathbf{x}|C_k)$ and class priors $p(C_k)$, then applying Bayes' theorem.

Learning Objectives
  • Explain how class-conditional densities and class priors give the posterior via Bayes' theorem.
  • Define the logistic sigmoid and softmax functions as the natural forms for posterior class probabilities.
  • Show that Gaussian class-conditionals with a shared covariance matrix yield linear decision boundaries (LDA).
  • State how different covariance matrices lead to quadratic decision boundaries.

1. The Generative Model

A probabilistic generative model specifies the joint distribution $p(\mathbf{x}, C_k) = p(\mathbf{x} \mid C_k)\, p(C_k)$ via two components:

  • Class-conditional densities $p(\mathbf{x} \mid C_k)$: the distribution of inputs given class $C_k$.
  • Class priors $p(C_k)$: the base rate of each class.

Bayes' theorem then gives the posterior class probabilities used for classification:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid C_j)\, p(C_j).$$
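As a minimal sketch of the formula above, the following computes the posterior for a two-class problem with one-dimensional Gaussian class-conditionals. The function names (`gaussian_pdf`, `posterior`) and all parameter values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(x | mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior(x, means, stds, priors):
    """Return p(C_k | x) for every class k via Bayes' theorem."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * np.asarray(priors)  # p(x | C_k) p(C_k)
    evidence = joint.sum()                    # p(x) = sum_j p(x | C_j) p(C_j)
    return joint / evidence

# Made-up parameters: two classes with equal variance and unequal priors.
means, stds, priors = [0.0, 2.0], [1.0, 1.0], [0.7, 0.3]

# x = 1.0 is equidistant from both means, so the likelihoods are equal
# and the posterior is determined by the priors alone.
post = posterior(1.0, means, stds, priors)
print(post)
```

At the midpoint between the two means the data carry no information about the class, so the posterior simply returns the priors; moving `x` toward either mean shifts the posterior toward that class.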

2. The Logistic Sigmoid (Binary Case)

For $K = 2$ classes, the posterior for $C_1$ can be written entirely in terms of the log-odds:

$$a = \ln \frac{p(\mathbf{x}, C_1)}{p(\mathbf{x}, C_2)}.$$

Logistic Sigmoid

$$p(C_1 \mid \mathbf{x}) = \sigma(a) = \frac{1}{1 + e^{-a}}.$$

Key properties: (1) $\sigma(a) \in (0,1)$ for all $a$; (2) $\sigma(0) = 0.5$ (zero log-odds, i.e. equal joint probabilities, gives equal posterior probabilities); (3) $\sigma(-a) = 1 - \sigma(a)$; (4) $\frac{d\sigma}{da} = \sigma(a)(1 - \sigma(a))$.
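The four properties can be checked numerically. The sketch below is only a sanity check on a grid of values, not a proof; the grid and the finite-difference step `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)

# (1) Output lies strictly between 0 and 1 on this grid.
in_range = bool(np.all((s > 0.0) & (s < 1.0)))

# (2) Zero log-odds gives probability one half.
midpoint = sigmoid(0.0)

# (3) Symmetry: sigma(-a) = 1 - sigma(a).
symmetry = bool(np.allclose(sigmoid(-a), 1.0 - s))

# (4) Derivative: compare a central finite difference with sigma * (1 - sigma).
eps = 1e-6
numeric_grad = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
derivative_ok = bool(np.allclose(numeric_grad, s * (1.0 - s), atol=1e-6))

print(in_range, midpoint, symmetry, derivative_ok)
```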

3. The Softmax Function (General $K$)

For $K > 2$ classes, define $a_k = \ln p(\mathbf{x}, C_k)$ for each class. The posterior is:

Softmax Function

$$p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)}.$$

The softmax is a smoothed version of the maximum: if $a_k \gg a_j$ for all $j \neq k$, then $p(C_k \mid \mathbf{x}) \approx 1$ and the remaining posteriors are close to 0, hence the name "soft" maximum. For $K=2$ the softmax reduces to the logistic sigmoid with $a = a_1 - a_2$.
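A short sketch of the softmax and its reduction to the sigmoid for $K=2$ follows. Subtracting the maximum before exponentiating is an implementation choice assumed here for numerical stability; it leaves the result unchanged because the common factor cancels between numerator and denominator. The input values are made up.

```python
import numpy as np

def softmax(a):
    """Softmax over a 1-D array of activations a_k."""
    a = np.asarray(a, dtype=float)
    shifted = a - a.max()   # common shift cancels in the ratio; avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Two classes: softmax of (a_1, a_2) equals sigmoid(a_1 - a_2) for class 1.
a = np.array([1.0, -0.5])
p = softmax(a)
two_class_match = bool(np.isclose(p[0], sigmoid(a[0] - a[1])))

# A dominant activation pushes its probability toward 1.
peaked = softmax([1000.0, 0.0])

print(p, two_class_match, peaked)
```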

4. Gaussian Class-Conditionals and Linear Decision Boundaries

Model each class-conditional as a multivariate Gaussian with class-specific mean $\boldsymbol{\mu}_k$ but a shared covariance matrix $\boldsymbol{\Sigma}$:

$$p(\mathbf{x} \mid C_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}).$$

Linear Discriminant Analysis (LDA)

For $K=2$ with shared $\boldsymbol{\Sigma}$, the log-odds $a = \ln \frac{p(\mathbf{x},C_1)}{p(\mathbf{x},C_2)}$ simplifies to a linear function of $\mathbf{x}$, because the quadratic term $-\tfrac{1}{2}\mathbf{x}^\top \boldsymbol{\Sigma}^{-1}\mathbf{x}$ and the Gaussian normalizing constant appear in both log-densities and cancel in the difference:

$$a = \mathbf{w}^\top \mathbf{x} + w_0,$$

where

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}.$$

The posterior $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + w_0)$ is a generalized linear model: a logistic sigmoid applied to a linear function of $\mathbf{x}$. The decision boundary, where $p(C_1 \mid \mathbf{x}) = 0.5$ (i.e. $a = 0$), is a hyperplane. This is called linear discriminant analysis (LDA).

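For general $K$, the term $-\tfrac{1}{2}\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x}$ and the Gaussian normalizing constant are common to every class and cancel in the softmax, so each $a_k$ can be taken as $a_k = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ with $\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k$ and $w_{k0} = -\tfrac{1}{2}\boldsymbol{\mu}_k^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(C_k)$. The boundary between any two classes, where $a_k = a_j$, is therefore again linear in $\mathbf{x}$.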
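The two-class result can be checked numerically: the linear form $\mathbf{w}^\top\mathbf{x} + w_0$ should agree with the log-odds computed directly from the Gaussian log-densities and priors. The sketch below assumes known parameters (in practice they would be estimated from data); the helper names `gaussian_logpdf` and `lda_params` and all numeric values are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x | mean, cov)."""
    d = x - mean
    dim = mean.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)  # (x - mean)^T cov^{-1} (x - mean)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha)

def lda_params(mu1, mu2, cov, prior1, prior2):
    """Weights w and bias w0 of the two-class log-odds with shared covariance."""
    w = np.linalg.solve(cov, mu1 - mu2)
    w0 = (-0.5 * mu1 @ np.linalg.solve(cov, mu1)
          + 0.5 * mu2 @ np.linalg.solve(cov, mu2)
          + np.log(prior1 / prior2))
    return w, w0

# Made-up parameters for illustration.
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 0.5])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
prior1, prior2 = 0.6, 0.4
x = np.array([0.2, -0.4])

w, w0 = lda_params(mu1, mu2, cov, prior1, prior2)
a_linear = w @ x + w0

# Log-odds computed directly from the joint densities p(x, C_k).
a_direct = ((gaussian_logpdf(x, mu1, cov) + np.log(prior1))
            - (gaussian_logpdf(x, mu2, cov) + np.log(prior2)))

posterior_c1 = 1.0 / (1.0 + np.exp(-a_linear))
print(a_linear, a_direct, posterior_c1)
```

Using `np.linalg.solve` rather than forming $\boldsymbol{\Sigma}^{-1}$ explicitly is a numerical preference, not something the derivation requires.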

Quadratic Decision Boundaries

If different classes are allowed their own covariance matrices $\boldsymbol{\Sigma}_k$, the quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}_k^{-1}\mathbf{x}$ no longer cancel and $a_k$ becomes quadratic in $\mathbf{x}$. This yields quadratic discriminant analysis (QDA) with curved decision boundaries. Sharing $\boldsymbol{\Sigma}$ across all classes is the special case that linearizes the boundary.
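To make the role of the quadratic term concrete, the sketch below writes the two-class log-odds as $a(\mathbf{x}) = \mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top\mathbf{x} + c$ with $\mathbf{A} = -\tfrac{1}{2}(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})$, obtained by expanding the two Gaussian log-densities. The symbols $\mathbf{A}$, $\mathbf{b}$, $c$, the helper names, and the numeric values are introduced here for illustration and do not appear in the lecture.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x | mean, cov)."""
    d = x - mean
    dim = mean.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def log_odds_terms(mu1, cov1, prior1, mu2, cov2, prior2):
    """Coefficients (A, b, c) of the log-odds a(x) = x^T A x + b^T x + c."""
    p1, p2 = np.linalg.inv(cov1), np.linalg.inv(cov2)  # precision matrices
    A = -0.5 * (p1 - p2)
    b = p1 @ mu1 - p2 @ mu2
    c = (-0.5 * mu1 @ p1 @ mu1 + 0.5 * mu2 @ p2 @ mu2
         - 0.5 * np.linalg.slogdet(cov1)[1] + 0.5 * np.linalg.slogdet(cov2)[1]
         + np.log(prior1 / prior2))
    return A, b, c

# Made-up parameters with class-specific covariances.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
cov1 = np.array([[1.0, 0.3], [0.3, 2.0]])
cov2 = np.array([[0.5, 0.0], [0.0, 1.5]])
prior1, prior2 = 0.6, 0.4
x = np.array([0.2, -0.4])

A, b, c = log_odds_terms(mu1, cov1, prior1, mu2, cov2, prior2)
a_quadratic = x @ A @ x + b @ x + c
a_direct = ((gaussian_logpdf(x, mu1, cov1) + np.log(prior1))
            - (gaussian_logpdf(x, mu2, cov2) + np.log(prior2)))

# With a shared covariance the quadratic coefficient vanishes (the LDA case).
A_shared, _, _ = log_odds_terms(mu1, cov1, prior1, mu2, cov1, prior2)
print(a_quadratic, a_direct, A_shared)
```

When the covariances differ, `A` is nonzero and the set $a(\mathbf{x}) = 0$ is a quadric surface; when they coincide, `A_shared` is the zero matrix and the boundary reduces to a hyperplane.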