Lecture 6.2
Probabilistic Generative Models: Discrete Data (Naive Bayes)
Extending probabilistic generative models to discrete input data using the Naive Bayes assumption, which replaces an exponentially large probability table with a compact factorized model.
- Explain why the number of parameters for a discrete class-conditional distribution scales as $2^d$ in the general case.
- State the Naive Bayes assumption and show how it reduces the parameter count to $d$ per class.
- Write the Bernoulli factorized class-conditional and identify its parameters $\pi_{ki}$.
- Express the posterior class probabilities in softmax form and identify the linear activations $a_k$.
1. Discrete Inputs and the Combinatorial Problem
Suppose each input $\mathbf{x}_n \in \{0,1\}^d$ is a binary vector of $d$ features (e.g., word presence in a document). A fully general class-conditional $p(\mathbf{x} \mid C_k)$ must assign a probability to every possible binary vector, of which there are $2^d$. Because probabilities must sum to 1, we need $2^d - 1$ free parameters per class — exponential in $d$.
For $d = 100$ binary features (modest for text), the general model requires $2^{100} - 1 \approx 10^{30}$ parameters per class — far more than any dataset could ever supply. A tractable assumption is needed.
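To get a feel for the gap, the short sketch below tabulates the per-class parameter count of a full probability table, $2^d - 1$, against a model that needs only one parameter per feature. The function names are illustrative and not part of the lecture.

```python
def full_table_params(d: int) -> int:
    """Free parameters for a fully general distribution over {0,1}^d."""
    return 2**d - 1


def per_feature_params(d: int) -> int:
    """Free parameters when each binary feature gets a single parameter."""
    return d


for d in (3, 10, 100):
    print(f"d={d:>3}: full table = {full_table_params(d):.3e}, "
          f"one per feature = {per_feature_params(d)}")
```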
2. The Naive Bayes Assumption
Given the class $C_k$, all feature values $x_i$ are treated as conditionally independent:
$$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} p(x_i \mid C_k).$$
This factorization is called "naive" because features are rarely truly independent in practice, yet the resulting classifier often performs surprisingly well.
3. Bernoulli Class-Conditionals
Each $x_i \in \{0,1\}$ is a binary random variable, so it is natural to model it with a Bernoulli distribution parameterized by $\pi_{ki}$ — the probability that feature $i$ equals 1 given class $C_k$:
$$p(x_i \mid C_k) = \pi_{ki}^{x_i}\,(1 - \pi_{ki})^{1-x_i}.$$
The exponents act as a selector: if $x_i = 1$ the expression reduces to $\pi_{ki}$; if $x_i = 0$ it reduces to $1 - \pi_{ki}$. Applying the Naive Bayes factorization:
$$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} \pi_{ki}^{x_i}\,(1 - \pi_{ki})^{1-x_i}.$$
The number of parameters per class is now $d$ — one $\pi_{ki}$ per feature — reduced from $2^d - 1$.
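As a minimal sketch of the factorized Bernoulli class-conditional, the function below multiplies the per-feature probabilities for a single class. The parameter values and the function name are hypothetical, chosen only for illustration.

```python
from itertools import product

import numpy as np


def bernoulli_nb_likelihood(x, pi_k) -> float:
    """p(x | C_k) = prod_i pi_ki^x_i * (1 - pi_ki)^(1 - x_i) for binary x."""
    x = np.asarray(x)
    pi_k = np.asarray(pi_k, dtype=float)
    # Select pi_ki where x_i = 1 and (1 - pi_ki) where x_i = 0.
    per_feature = np.where(x == 1, pi_k, 1.0 - pi_k)
    return float(np.prod(per_feature))


# Hypothetical parameters for one class with d = 3 features.
pi_k = np.array([0.9, 0.2, 0.5])
x = np.array([1, 0, 1])
p = bernoulli_nb_likelihood(x, pi_k)  # 0.9 * 0.8 * 0.5

# Sanity check: the factorized model sums to 1 over all 2^d binary vectors.
total = sum(bernoulli_nb_likelihood(v, pi_k) for v in product([0, 1], repeat=3))
```

The final check confirms that $d$ parameters still define a valid distribution over all $2^d$ binary vectors.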
4. Posterior Class Probabilities
The generative model still uses Bayes' theorem to form the posterior. Defining $a_k = \ln p(\mathbf{x} \mid C_k) + \ln p(C_k)$, the posterior takes softmax form (as in Lecture 5.6):
$$p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{j} \exp(a_j)}.$$
Taking the log of the Bernoulli class-conditional:
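$$a_k = \sum_{i=1}^{d} \bigl[x_i \ln \pi_{ki} + (1-x_i)\ln(1-\pi_{ki})\bigr] + \ln p(C_k).$$
This is a linear function of $\mathbf{x}$. Collecting the terms that multiply $x_i$ gives
$$a_k = \sum_{i=1}^{d} x_i \ln\frac{\pi_{ki}}{1-\pi_{ki}} + \sum_{i=1}^{d} \ln(1-\pi_{ki}) + \ln p(C_k),$$
so the log-odds $\ln\frac{\pi_{ki}}{1-\pi_{ki}}$ play the role of the weights and the remaining terms form the bias. Naive Bayes is therefore a linear classifier in this binary-feature setting.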
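The sketch below computes the activations $a_k$ in two ways, directly from the log-likelihood and as an explicit weight vector plus bias, and then applies a softmax. It assumes every $\pi_{ki}$ lies strictly between 0 and 1 so that the logarithms are finite; the two-class parameters and function names are made up for illustration.

```python
import numpy as np


def activations(x, pi, prior):
    """a_k = sum_i [x_i ln pi_ki + (1 - x_i) ln(1 - pi_ki)] + ln p(C_k).

    x: (d,) binary vector; pi: (K, d) Bernoulli parameters; prior: (K,) class priors.
    """
    x = np.asarray(x, dtype=float)
    return np.log(pi) @ x + np.log1p(-pi) @ (1.0 - x) + np.log(prior)


def linear_form(x, pi, prior):
    """The same activations written as W x + b."""
    x = np.asarray(x, dtype=float)
    W = np.log(pi) - np.log1p(-pi)                  # (K, d) log-odds weights
    b = np.log1p(-pi).sum(axis=1) + np.log(prior)   # (K,) bias terms
    return W @ x + b


def softmax(a):
    """Softmax with the maximum subtracted first for numerical stability."""
    z = np.exp(a - np.max(a))
    return z / z.sum()


# Hypothetical two-class model with d = 3 features and equal priors.
pi = np.array([[0.9, 0.2, 0.5],
               [0.1, 0.7, 0.5]])
prior = np.array([0.5, 0.5])
x = np.array([1, 0, 1])

a = activations(x, pi, prior)
posterior = softmax(a)  # p(C_k | x) for k = 0, 1
```

Both routes give the same $a_k$, which is what makes the decision boundaries between classes linear in $\mathbf{x}$.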
5. MLE for the Bernoulli Parameters
The parameters $\pi_{ki}$ are found by maximum likelihood following the standard recipe: formulate the likelihood for class $C_k$, take its log, differentiate with respect to $\pi_{ki}$, and set to zero.
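The result is
$$\hat{\pi}_{ki} = \frac{1}{N_k} \sum_{n:\, \mathbf{x}_n \in C_k} x_{ni},$$
where $N_k$ is the number of training points in class $C_k$ and $x_{ni}$ is feature $i$ of $\mathbf{x}_n$. Each $\hat{\pi}_{ki}$ is the empirical frequency of feature $i$ equaling 1 within class $k$. This derivation follows the same three-step recipe as all previous MLE problems and is left as an exercise.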
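In code, the empirical-frequency estimate amounts to averaging each feature column within a class. The following is a sketch on a tiny made-up dataset; the function name is hypothetical, and the class prior is estimated as the class fraction, which is an assumption added here rather than something derived in this section.

```python
import numpy as np


def fit_bernoulli_nb(X, y, num_classes):
    """Estimate pi_ki and class priors by counting.

    X: (N, d) binary matrix; y: (N,) integer labels in {0, ..., num_classes - 1}.
    Assumes every class appears at least once in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pi = np.zeros((num_classes, X.shape[1]))
    prior = np.zeros(num_classes)
    for k in range(num_classes):
        members = X[y == k]                 # rows belonging to class k
        pi[k] = members.mean(axis=0)        # fraction of class-k rows with x_i = 1
        prior[k] = len(members) / len(X)    # class fraction (assumed prior estimate)
    return pi, prior


# Made-up data: 5 examples, 3 binary features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([0, 0, 0, 1, 1])
pi_hat, prior_hat = fit_bernoulli_nb(X, y, num_classes=2)
```

On such a small sample some estimates come out as exactly 0 or 1, which makes the logarithms in $a_k$ infinite; smoothing the counts is a common remedy, though it goes beyond the plain maximum likelihood estimate described here.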