Lecture 5.6
Probabilistic Generative Models
Generative classifiers: modelling class-conditional densities $p(\mathbf{x}|C_k)$ and class priors $p(C_k)$, then applying Bayes' theorem.
- Explain how class-conditional densities and class priors give the posterior via Bayes' theorem.
- Define the logistic sigmoid and softmax functions as the natural forms for posterior class probabilities.
- Show that Gaussian class-conditionals with a shared covariance matrix yield linear decision boundaries (LDA).
- State how different covariance matrices lead to quadratic decision boundaries.
1. The Generative Model
A probabilistic generative model specifies the joint distribution $p(\mathbf{x}, C_k) = p(\mathbf{x} \mid C_k)\, p(C_k)$ via two components:
- Class-conditional densities $p(\mathbf{x} \mid C_k)$: the distribution of inputs given class $C_k$.
- Class priors $p(C_k)$: the base rate of each class.
Bayes' theorem then gives the posterior class probabilities used for classification:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid C_j)\, p(C_j).$$
2. The Logistic Sigmoid (Binary Case)
For $K = 2$ classes, the posterior for $C_1$ can be written entirely in terms of the log-odds:
$$a = \ln \frac{p(\mathbf{x}, C_1)}{p(\mathbf{x}, C_2)}.$$
Dividing the numerator and denominator of Bayes' theorem by $p(\mathbf{x}, C_1)$ then gives
$$p(C_1 \mid \mathbf{x}) = \sigma(a) = \frac{1}{1 + e^{-a}}.$$
Key properties: (1) $\sigma(a) \in (0,1)$ for all $a$; (2) $\sigma(0) = 0.5$ (zero log-odds means both classes are equally probable); (3) $\sigma(-a) = 1 - \sigma(a)$; (4) $\frac{d\sigma}{da} = \sigma(a)(1 - \sigma(a))$.
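As a quick numerical check of the binary case, the sketch below computes $p(C_1 \mid \mathbf{x})$ two ways: directly from Bayes' theorem, and as the sigmoid of the log-odds. The density values and priors are hypothetical numbers chosen only for illustration, and the variable names are not from the lecture.

```python
import math

def sigmoid(a: float) -> float:
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical values of the class-conditional densities at one point x,
# and hypothetical class priors (illustrative numbers only).
lik_c1, lik_c2 = 0.30, 0.10      # p(x | C1), p(x | C2)
prior_c1, prior_c2 = 0.4, 0.6    # p(C1), p(C2)

joint_c1 = lik_c1 * prior_c1     # p(x, C1)
joint_c2 = lik_c2 * prior_c2     # p(x, C2)

# Route 1: Bayes' theorem, with the evidence p(x) as the normaliser.
posterior_bayes = joint_c1 / (joint_c1 + joint_c2)

# Route 2: sigmoid of the log-odds a = ln[p(x, C1) / p(x, C2)].
a = math.log(joint_c1 / joint_c2)
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)
```

Both routes give the same number because the sigmoid form is only an algebraic rearrangement of Bayes' theorem; no modelling assumption is added.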
3. The Softmax Function (General $K$)
For $K > 2$ classes, define $a_k = \ln p(\mathbf{x}, C_k)$ for each class. The posterior is:
$$p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$$
When one $a_k$ is much larger than the others, the softmax pushes its probability toward 1 and the rest toward 0, which makes it a smoothed ("soft") version of the maximum and gives the function its name. For $K=2$ the softmax reduces to the logistic sigmoid with $a = a_1 - a_2$.
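A minimal NumPy sketch of the softmax is given below. It subtracts the largest activation before exponentiating, which is a common numerical-stability trick and not something the lecture prescribes; the joint probabilities and activations are hypothetical values.

```python
import numpy as np

def softmax(a):
    """Softmax over a 1-D array of activations a_k = ln p(x, C_k)."""
    a = np.asarray(a, dtype=float)
    shifted = a - np.max(a)      # a constant shift leaves the result unchanged
    exp_a = np.exp(shifted)
    return exp_a / exp_a.sum()

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical joint probabilities p(x, C_k) for K = 3 classes.
joint = np.array([0.12, 0.06, 0.02])
posterior = softmax(np.log(joint))   # same as joint / joint.sum()

# A large gap between activations pushes the output toward a one-hot vector.
peaked = softmax(np.array([10.0, 0.0, -5.0]))

# K = 2: the softmax reduces to the sigmoid of a_1 - a_2.
a1, a2 = 1.3, -0.4
two_class = softmax(np.array([a1, a2]))

print(posterior, peaked, two_class[0], sigmoid(a1 - a2))
```

Because only differences between the $a_k$ matter, any term common to all classes can be dropped from $a_k$ without changing the posterior.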
4. Gaussian Class-Conditionals and Linear Decision Boundaries
Model each class-conditional as a multivariate Gaussian with class-specific mean $\boldsymbol{\mu}_k$ but a shared covariance matrix $\boldsymbol{\Sigma}$:
$$p(\mathbf{x} \mid C_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}).$$
For $K=2$, the log-odds $a = \ln \frac{p(\mathbf{x},C_1)}{p(\mathbf{x},C_2)}$ with shared $\boldsymbol{\Sigma}$ simplifies to a linear function of $\mathbf{x}$ because the quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}^{-1}\mathbf{x}$ cancel:
$$a = \mathbf{w}^\top \mathbf{x} + w_0,$$
where
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}.$$
The posterior $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + w_0)$ is a generalized linear model. The decision boundary, where $p(C_1 \mid \mathbf{x}) = 0.5$, i.e. $a = 0$, is a hyperplane. This is called linear discriminant analysis (LDA).
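The closed-form $\mathbf{w}$ and $w_0$ can be checked numerically. The sketch below, with hypothetical means, covariance, priors, and test point, compares $\sigma(\mathbf{w}^\top\mathbf{x} + w_0)$ against the posterior obtained by evaluating the two Gaussian densities and applying Bayes' theorem directly. The helper names `gaussian_pdf` and `lda_params` are illustrative, not from the lecture.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x | mean, cov)."""
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def lda_params(mu1, mu2, cov, prior1, prior2):
    """Weights w and bias w0 of the two-class log-odds with shared covariance."""
    w = np.linalg.solve(cov, mu1 - mu2)
    w0 = (-0.5 * mu1 @ np.linalg.solve(cov, mu1)
          + 0.5 * mu2 @ np.linalg.solve(cov, mu2)
          + np.log(prior1 / prior2))
    return w, w0

# Hypothetical model parameters (illustrative values only).
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
prior1, prior2 = 0.3, 0.7
x = np.array([0.5, -0.2])

# Posterior from the linear form.
w, w0 = lda_params(mu1, mu2, cov, prior1, prior2)
post_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))

# Posterior from Bayes' theorem applied to the densities themselves.
j1 = gaussian_pdf(x, mu1, cov) * prior1
j2 = gaussian_pdf(x, mu2, cov) * prior2
post_bayes = j1 / (j1 + j2)

print(post_linear, post_bayes)
```

The two values agree up to floating-point error, since the normalisation constants and the quadratic terms are identical for both classes and cancel in the ratio.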
For general $K$, each $a_k = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ is linear in $\mathbf{x}$, and the softmax again produces linear decision boundaries.
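As a sketch of where these coefficients come from, assume the same shared-covariance Gaussian model as in the two-class case. Expanding $a_k = \ln p(\mathbf{x} \mid C_k)\, p(C_k)$ and dropping the terms that are identical for every class (they cancel in the softmax) gives
$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \qquad w_{k0} = -\tfrac{1}{2}\boldsymbol{\mu}_k^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(C_k).$$
The boundary between classes $C_k$ and $C_j$ is the set where $a_k = a_j$, that is
$$(\mathbf{w}_k - \mathbf{w}_j)^\top \mathbf{x} + (w_{k0} - w_{j0}) = 0,$$
which is a hyperplane. For $K=2$ the differences $\mathbf{w}_1 - \mathbf{w}_2$ and $w_{10} - w_{20}$ recover the two-class $\mathbf{w}$ and $w_0$.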
If different classes are allowed their own covariance matrices $\boldsymbol{\Sigma}_k$, the quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}_k^{-1}\mathbf{x}$ no longer cancel and $a_k$ becomes quadratic in $\mathbf{x}$. This yields quadratic discriminant analysis (QDA) with curved decision boundaries. Sharing $\boldsymbol{\Sigma}$ across all classes is the special case that linearizes the boundary.
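One way to see the difference numerically is to probe the curvature of the log-odds. For a function of the form $-\tfrac{1}{2}\mathbf{x}^\top\mathbf{A}\mathbf{x} + \mathbf{b}^\top\mathbf{x} + c$, the second difference $f(\mathbf{x}+\mathbf{h}) - 2f(\mathbf{x}) + f(\mathbf{x}-\mathbf{h})$ equals $-\mathbf{h}^\top\mathbf{A}\mathbf{h}$, so it vanishes exactly when the quadratic term is absent. The sketch below uses hypothetical parameters and helper names to compare the shared-covariance case with the class-specific case, where $\mathbf{A} = \boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1}$.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log of the multivariate normal density N(x | mean, cov)."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def log_odds(x, mu1, cov1, mu2, cov2, prior1, prior2):
    """a(x) = ln[p(x, C1) / p(x, C2)] for Gaussian class-conditionals."""
    return (log_gaussian(x, mu1, cov1) + np.log(prior1)
            - log_gaussian(x, mu2, cov2) - np.log(prior2))

def second_difference(f, x, h):
    """f(x + h) - 2 f(x) + f(x - h); zero for any affine function f."""
    return f(x + h) - 2.0 * f(x) + f(x - h)

# Hypothetical parameters (illustrative values only).
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
cov_a = np.array([[1.0, 0.3], [0.3, 2.0]])
cov_b = np.array([[0.5, 0.0], [0.0, 0.5]])
x, h = np.array([0.5, -0.2]), np.array([1.0, 0.5])

# Shared covariance (LDA): the log-odds is affine, so the curvature is zero.
curv_shared = second_difference(
    lambda z: log_odds(z, mu1, cov_a, mu2, cov_a, 0.5, 0.5), x, h)

# Class-specific covariances (QDA): the quadratic term survives.
curv_qda = second_difference(
    lambda z: log_odds(z, mu1, cov_a, mu2, cov_b, 0.5, 0.5), x, h)
expected = -h @ (np.linalg.inv(cov_a) - np.linalg.inv(cov_b)) @ h

print(curv_shared, curv_qda, expected)
```

The nonzero curvature in the second case is what bends the set $a = 0$ into a quadric surface instead of a hyperplane.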