Lecture 6.2

Probabilistic Generative Models: Discrete Data (Naive Bayes)

Extending probabilistic generative models to discrete input data using the Naive Bayes assumption, which replaces an exponentially large probability table with a compact factorized model.

Learning Objectives
  • Explain why the number of parameters for a discrete class-conditional distribution scales as $2^d$ in the general case.
  • State the Naive Bayes assumption and show how it reduces the parameter count to $d$ per class.
  • Write the Bernoulli factorized class-conditional and identify its parameters $\pi_{ki}$.
  • Express the posterior class probabilities in softmax form and identify the linear activations $a_k$.

1. Discrete Inputs and the Combinatorial Problem

Suppose each input $\mathbf{x}_n \in \{0,1\}^d$ is a binary vector of $d$ features (e.g., word presence in a document). A fully general class-conditional $p(\mathbf{x} \mid C_k)$ must assign a probability to every possible binary vector, of which there are $2^d$. Because probabilities must sum to 1, we need $2^d - 1$ free parameters per class — exponential in $d$.

Scale of the Problem

For $d = 100$ binary features (modest for text), the general model requires $2^{100} - 1 \approx 10^{30}$ parameters per class — far more than any dataset could ever supply. A tractable assumption is needed.
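The count above can be checked directly. The following is a minimal sketch; the function name `general_param_count` is introduced here for illustration only and is not part of the lecture.

```python
from itertools import product

def general_param_count(d: int) -> int:
    """Free parameters of an unrestricted distribution over {0,1}^d."""
    # One probability per binary vector, minus one for the sum-to-1 constraint.
    return 2**d - 1

# For small d the possible inputs can be listed explicitly.
outcomes = list(product([0, 1], repeat=3))
print(len(outcomes), general_param_count(3))

# For d = 100 enumeration is hopeless, but Python integers give the exact count.
print(general_param_count(100))
print(f"{general_param_count(100):.3e}")
```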

2. The Naive Bayes Assumption

Naive Bayes Assumption

Given the class $C_k$, all feature values $x_i$ are treated as conditionally independent:

$$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} p(x_i \mid C_k).$$

This factorization is called "naive" because features are rarely truly independent in practice, yet the resulting classifier often performs surprisingly well.

3. Bernoulli Class-Conditionals

Each $x_i \in \{0,1\}$ is a binary random variable, so it is natural to model it with a Bernoulli distribution parameterized by $\pi_{ki}$ — the probability that feature $i$ equals 1 given class $C_k$:

$$p(x_i \mid C_k) = \pi_{ki}^{x_i}\,(1 - \pi_{ki})^{1-x_i}.$$

The exponents act as a switch: if $x_i = 1$ the expression reduces to $\pi_{ki}$; if $x_i = 0$ it reduces to $1 - \pi_{ki}$. Applying the Naive Bayes factorization gives:

Naive Bayes Class-Conditional (Binary Features)

$$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} \pi_{ki}^{x_i}\,(1-\pi_{ki})^{1-x_i}.$$

The number of parameters per class is now $d$ — one $\pi_{ki}$ per feature — reduced from $2^d - 1$.
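As a rough illustration, the factorized class-conditional can be evaluated in log space with NumPy. This is a sketch under stated assumptions: the parameter values in `pi_k` are made up, the function name `log_class_conditional` is hypothetical, and every $\pi_{ki}$ is assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
from itertools import product

import numpy as np

def log_class_conditional(x, pi_k):
    """ln p(x | C_k) for binary x under the Bernoulli Naive Bayes model."""
    x = np.asarray(x, dtype=float)
    pi_k = np.asarray(pi_k, dtype=float)
    # Sum over features of x_i ln(pi_ki) + (1 - x_i) ln(1 - pi_ki).
    return float(np.sum(x * np.log(pi_k) + (1.0 - x) * np.log(1.0 - pi_k)))

pi_k = np.array([0.9, 0.2, 0.5])   # hypothetical parameters for one class, d = 3
x = np.array([1, 0, 1])

# Features 1 and 3 are on, feature 2 is off: 0.9 * (1 - 0.2) * 0.5 = 0.36.
p = np.exp(log_class_conditional(x, pi_k))
print(p)

# The d parameters define a valid distribution over all 2^d binary vectors.
total = sum(np.exp(log_class_conditional(v, pi_k))
            for v in product([0, 1], repeat=3))
print(total)
```

Working with the log of the product avoids numerical underflow when $d$ is large, and the same sum reappears in the activations $a_k$.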

4. Posterior Class Probabilities

The generative model still uses Bayes' theorem to form the posterior. Defining $a_k = \ln p(\mathbf{x} \mid C_k) + \ln p(C_k)$, the posterior takes softmax form (as in Lecture 5.6):

$$p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{j} \exp(a_j)}.$$

Substituting the log of the Bernoulli class-conditional gives:

$$a_k = \sum_{i=1}^{d} \bigl[x_i \ln \pi_{ki} + (1-x_i)\ln(1-\pi_{ki})\bigr] + \ln p(C_k).$$

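This is a linear (strictly, affine) function of $\mathbf{x}$. Collecting the terms that multiply $x_i$ gives $a_k = \sum_{i=1}^{d} x_i \ln\frac{\pi_{ki}}{1-\pi_{ki}} + \sum_{i=1}^{d} \ln(1-\pi_{ki}) + \ln p(C_k)$, so the log-odds $\ln\frac{\pi_{ki}}{1-\pi_{ki}}$ play the role of the weights and the remaining terms form the bias. Naive Bayes is therefore a linear classifier in this binary-feature setting.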
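The linearity can be verified numerically. Grouping the terms of $a_k$ that multiply $x_i$ gives a weight $\ln\frac{\pi_{ki}}{1-\pi_{ki}}$ per feature plus a bias $\sum_i \ln(1-\pi_{ki}) + \ln p(C_k)$. The sketch below computes the activations both ways and passes them through a softmax. All numbers (`pi`, `prior`, `x`) and function names are hypothetical, chosen only for illustration.

```python
import numpy as np

def activations_direct(x, pi, prior):
    """a_k from the summed Bernoulli log-likelihood plus the log prior."""
    return (x * np.log(pi) + (1 - x) * np.log(1 - pi)).sum(axis=1) + np.log(prior)

def activations_linear(x, pi, prior):
    """The same a_k written as W x + b."""
    W = np.log(pi) - np.log(1 - pi)                  # (K, d) log-odds weights
    b = np.log(1 - pi).sum(axis=1) + np.log(prior)   # (K,) biases
    return W @ x + b

def softmax(a):
    """Softmax with the maximum subtracted for numerical stability."""
    e = np.exp(a - a.max())
    return e / e.sum()

pi = np.array([[0.9, 0.2, 0.5],    # hypothetical pi_{ki}: K = 2 classes,
               [0.3, 0.7, 0.5]])   # d = 3 features
prior = np.array([0.6, 0.4])       # hypothetical class priors p(C_k)
x = np.array([1, 0, 1])

a_direct = activations_direct(x, pi, prior)
a_linear = activations_linear(x, pi, prior)
posterior = softmax(a_direct)
print(a_direct, a_linear, posterior)
```

Both routes give the same activations, and the softmax output matches Bayes' theorem applied to the joint probabilities $p(\mathbf{x} \mid C_k)\,p(C_k)$.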

5. MLE for the Bernoulli Parameters

The parameters $\pi_{ki}$ are found by maximum likelihood following the standard recipe: formulate the likelihood for class $C_k$, take its log, differentiate with respect to $\pi_{ki}$, and set to zero.

MLE for Naive Bayes Parameters

$$\hat{\pi}_{ki} = \frac{\text{number of class-}k\text{ examples with } x_i = 1}{N_k}.$$

Here $N_k$ is the number of training examples in class $C_k$, so each $\hat{\pi}_{ki}$ is the empirical frequency of feature $i$ equaling 1 within class $k$. The derivation follows the same recipe as the previous MLE problems and is left as an exercise.
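The counting estimate can be sketched in a few lines of NumPy. The toy data set, the function name `fit_bernoulli_nb`, and the integer encoding of class labels as `0, ..., K-1` are assumptions made for this example. The sketch also assumes every class has at least one training example.

```python
import numpy as np

def fit_bernoulli_nb(X, y, num_classes):
    """Return (pi_hat, N) where pi_hat[k, i] is the fraction of class-k rows with x_i = 1."""
    X = np.asarray(X)
    y = np.asarray(y)
    pi_hat = np.zeros((num_classes, X.shape[1]))
    N = np.zeros(num_classes, dtype=int)
    for k in range(num_classes):
        X_k = X[y == k]               # rows belonging to class k
        N[k] = len(X_k)               # N_k
        pi_hat[k] = X_k.sum(axis=0) / N[k]   # count of x_i = 1, divided by N_k
    return pi_hat, N

# Hypothetical toy data: 5 examples, d = 3 binary features, K = 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([0, 0, 0, 1, 1])

pi_hat, N = fit_bernoulli_nb(X, y, num_classes=2)
print(N)
print(pi_hat)
```

In this toy set, class 1 never has feature 1 switched on, so $\hat{\pi}_{11}$ comes out as exactly 0 (with classes and features counted from 0, as in the code). Such an estimate makes $\ln \hat{\pi}_{ki}$ undefined in the activations, which is worth keeping in mind when the fitted parameters are used for prediction.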