Lecture 7.2

Logistic Regression

Logistic regression is the third and most natural classification strategy: instead of modeling the full data distribution (generative) or fitting a hard boundary (discriminant functions), it directly models the posterior class probabilities using a generalized linear model and the logistic sigmoid.

Learning Objectives

Place logistic regression within the three classification strategies (generative, discriminant, probabilistic discriminative).
Write the logistic regression model as a generalized linear model with sigmoid activation.
Derive the cross-entropy loss as the negative log-likelihood of the Bernoulli target distribution.
Explain why logistic regression handles outliers better than least-squares classification.
State that the cross-entropy loss is convex and describe what this implies for optimization.

1. Three Classification Strategies

Recall the three approaches introduced in Lecture 5.4:

Probabilistic generative models (Lectures 5.6–6.2): model $p(\mathbf{x}|C_k)$ and $p(C_k)$; posterior via Bayes. Requires many parameters — $O(M^2)$ for Gaussian class-conditionals.
Discriminant functions (Lectures 6.3–6.5): directly learn a boundary; no probabilistic interpretation.
Probabilistic discriminative models (this lecture): model $p(C_k|\mathbf{x})$ directly. Fewer parameters than generative, only $O(M)$, while still providing posterior class probabilities rather than just a class label.
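To make the parameter comparison concrete, here is a minimal sketch of the counts for the two-class case. It assumes the generative model uses Gaussian class-conditionals with a shared covariance matrix (an assumption, since the notes only state the $O(M^2)$ order); the function names are illustrative.

```python
def logistic_regression_params(M: int) -> int:
    """One weight per feature (the bias feature phi_0 = 1 is one of the M)."""
    return M


def shared_cov_gaussian_params(M: int) -> int:
    """Two class means (2M), one shared symmetric covariance (M(M+1)/2), one prior."""
    return 2 * M + M * (M + 1) // 2 + 1


for M in (2, 10, 100):
    print(M, logistic_regression_params(M), shared_cov_gaussian_params(M))
```

Under these assumptions the generative count grows quadratically (76 parameters at $M = 10$) while logistic regression needs only $M$.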

2. The Logistic Regression Model

For binary classification with $t \in \{0,1\}$, logistic regression models the posterior directly:

Logistic Regression

$$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma\!\bigl(\mathbf{w}^\top \boldsymbol{\phi}\bigr), \qquad p(C_2 \mid \boldsymbol{\phi}) = 1 - y(\boldsymbol{\phi}),$$

where $\boldsymbol{\phi}(\mathbf{x})$ is a feature vector (with $\phi_0=1$), $\mathbf{w}$ are learned weights, and $\sigma(a) = 1/(1+e^{-a})$ is the logistic sigmoid. The model is a generalized linear model: linear in $\mathbf{w}$, nonlinear via $\sigma$.
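The model above can be sketched in a few lines of plain Python. The weights and feature values are hypothetical, and the overflow-safe form of the sigmoid is an implementation choice rather than part of the lecture.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic sigmoid 1 / (1 + e^{-a}), written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def predict_proba(w: list[float], phi: list[float]) -> float:
    """Return p(C_1 | phi) = sigmoid(w^T phi); phi[0] is the bias feature 1."""
    a = sum(w_i * phi_i for w_i, phi_i in zip(w, phi))
    return sigmoid(a)


w = [-1.0, 2.0]    # hypothetical weights: bias and one slope
phi = [1.0, 0.5]   # phi_0 = 1, phi_1 = 0.5
p1 = predict_proba(w, phi)   # activation is -1 + 2 * 0.5 = 0, so p1 = 0.5
print(p1, 1.0 - p1)          # p(C_1 | phi) and p(C_2 | phi)
```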

The two-class target distribution can be written compactly using the binary selection trick:

$$p(t \mid \boldsymbol{\phi}, \mathbf{w}) = y^t (1-y)^{1-t}.$$
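As a quick check of the selection trick, substituting the two possible target values recovers the two posteriors from the model definition:

$$p(t=1 \mid \boldsymbol{\phi}, \mathbf{w}) = y^1 (1-y)^0 = y, \qquad p(t=0 \mid \boldsymbol{\phi}, \mathbf{w}) = y^0 (1-y)^1 = 1 - y.$$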

3. The Cross-Entropy Loss

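Assuming i.i.d. data, the likelihood is the product $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n}$, so the log-likelihood is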

$$\ln p(\mathbf{t} \mid \mathbf{w}) = \sum_{n=1}^N \bigl[ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \bigr],$$

where $y_n = \sigma(\mathbf{w}^\top \boldsymbol{\phi}_n)$. Negating the log-likelihood gives the cross-entropy loss:

Cross-Entropy Loss

$$E(\mathbf{w}) = -\sum_{n=1}^N \bigl[ t_n \ln y_n + (1-t_n)\ln(1-y_n) \bigr].$$

This measures the mismatch between the ground-truth targets $\{t_n\}$ and the model's predicted probabilities $\{y_n\}$: each term is the cross-entropy between the target distribution and the predicted Bernoulli distribution for one data point, hence the information-theoretic name. Minimizing $E(\mathbf{w})$ with respect to $\mathbf{w}$ gives the maximum likelihood estimate.
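A minimal sketch of evaluating $E(\mathbf{w})$ follows. It works from the activations $a_n = \mathbf{w}^\top\boldsymbol{\phi}_n$ rather than from $y_n$, using the identities $-\ln\sigma(a) = \ln(1+e^{-a})$ and $-\ln(1-\sigma(a)) = \ln(1+e^{a})$ to avoid taking the log of a value that has rounded to zero. That reformulation and the example data are assumptions made for illustration.

```python
import math


def softplus(a: float) -> float:
    """ln(1 + e^a), computed stably; equals -ln(sigmoid(-a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def cross_entropy(activations: list[float], targets: list[int]) -> float:
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], with y_n = sigmoid(a_n)."""
    total = 0.0
    for a, t in zip(activations, targets):
        # -ln y_n = softplus(-a_n) and -ln(1 - y_n) = softplus(a_n)
        total += t * softplus(-a) + (1 - t) * softplus(a)
    return total


a = [2.0, -1.0, 0.0]   # hypothetical activations w^T phi_n
t = [1, 0, 1]          # matching binary targets
print(cross_entropy(a, t))
```

A point with $a_n = 0$ and $t_n = 1$ contributes exactly $\ln 2$, the loss of a maximally uncertain prediction.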

4. Why Logistic Regression Handles Outliers

In least-squares classification (Lecture 6.4), the loss penalizes points far from the boundary on the correct side, since the model aims for predictions close to 1 for class-1 points. This pulls the boundary toward well-separated points.

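In logistic regression, the loss contribution of a point depends on its activation $a = \mathbf{w}^\top\boldsymbol{\phi}$. For a class-1 point the contribution is $-\ln\sigma(a)$ (class-0 points behave symmetrically with $a \to -a$). Far on the wrong side ($a \ll 0$) the penalty grows only approximately linearly, $-\ln\sigma(a) \approx |a|$, rather than quadratically as in least squares. Far on the correct side it decays exponentially, $-\ln\sigma(a) \approx e^{-a}$, so points well on the correct side of the boundary contribute nearly zero loss: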

$$-\ln \sigma(a) \approx 0 \quad \text{when } a \gg 0 \quad (\text{correct side}).$$

This means outliers that are correctly classified are essentially ignored — a significant practical advantage over least squares.
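The contrast can be checked numerically for a single class-1 point. The sketch below assumes the least-squares classifier regresses the linear output $a$ directly onto the target $t = 1$; that simplification and the chosen activation values are illustrative.

```python
import math


def squared_error(a: float, t: float = 1.0) -> float:
    """Least-squares penalty when the linear output a is regressed onto target t."""
    return (a - t) ** 2


def log_loss(a: float) -> float:
    """Cross-entropy penalty -ln(sigmoid(a)) for a class-1 point, computed stably."""
    return max(-a, 0.0) + math.log1p(math.exp(-abs(a)))


for a in (-10.0, -1.0, 1.0, 10.0):
    print(f"a = {a:6.1f}  squared = {squared_error(a):8.3f}  log = {log_loss(a):8.5f}")
```

At $a = 10$, deep on the correct side, the squared penalty is 81 while the cross-entropy penalty is below $10^{-4}$, so only the least-squares fit is pulled toward such a point.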

5. Convexity and Optimization

Convexity of the Cross-Entropy Loss

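The cross-entropy loss $E(\mathbf{w})$ is a convex function of $\mathbf{w}$ (shown in Lecture 7.4 via the Hessian, which is positive semidefinite everywhere and positive definite when the design matrix has full column rank). Convexity means every local minimum is a global minimum, and a positive definite Hessian makes the minimizer unique whenever one exists. The exception is linearly separable data, where the loss keeps decreasing as $\|\mathbf{w}\| \to \infty$ and no finite minimizer exists. No closed-form solution exists because $y_n$ is nonlinear in $\mathbf{w}$, but iterative methods such as SGD (Lecture 7.3) or Newton-Raphson (Lecture 7.4) cannot get trapped in a spurious local minimum and, with a suitable step size, converge to the global optimum.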
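As an illustration of iterative minimization, here is a sketch of plain batch gradient descent. It is a simpler stand-in for the SGD and Newton-Raphson methods of the later lectures, not a reproduction of them. It uses the standard gradient $\nabla E(\mathbf{w}) = \sum_n (y_n - t_n)\boldsymbol{\phi}_n$, which is not derived in this section; the data, step size, and iteration count are hypothetical choices.

```python
import math


def sigmoid(a: float) -> float:
    """Overflow-safe logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def gradient(w, Phi, t):
    """Gradient of the cross-entropy loss: sum_n (y_n - t_n) * phi_n."""
    g = [0.0] * len(w)
    for phi_n, t_n in zip(Phi, t):
        y_n = sigmoid(sum(w_i * p_i for w_i, p_i in zip(w, phi_n)))
        for i, p_i in enumerate(phi_n):
            g[i] += (y_n - t_n) * p_i
    return g


def fit(Phi, t, step=0.1, iters=5000):
    """Batch gradient descent with a fixed step size (an illustrative choice)."""
    w = [0.0] * len(Phi[0])
    for _ in range(iters):
        g = gradient(w, Phi, t)
        w = [w_i - step * g_i for w_i, g_i in zip(w, g)]
    return w


# Hypothetical 1-D data with overlapping classes, so a finite minimizer exists.
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 0.5]]
t = [0, 0, 1, 0, 1, 1]
w_star = fit(Phi, t)
print(w_star, gradient(w_star, Phi, t))   # gradient should be close to zero
```

Because the loss is convex, a vanishing gradient is enough to certify that the returned weights are a global minimizer for this data set.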

6. Decision Rule

Once the optimal $\mathbf{w}^*$ is found, classify new input $\mathbf{x}$ as:

$$\hat{C} = \begin{cases} C_1 & \text{if } \sigma(\mathbf{w}^{*\top}\boldsymbol{\phi}) > 0.5, \text{ equivalently } \mathbf{w}^{*\top}\boldsymbol{\phi} > 0, \\ C_2 & \text{otherwise.}\end{cases}$$

The decision boundary $\{\mathbf{x} : \mathbf{w}^{*\top}\boldsymbol{\phi}(\mathbf{x}) = 0\}$ is a hyperplane in feature space — logistic regression yields linear decision boundaries.
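The decision rule reduces to a sign test on the activation, so the sigmoid never needs to be evaluated at prediction time. A minimal sketch, with hypothetical fitted weights:

```python
def classify(w_star: list[float], phi: list[float]) -> str:
    """Return "C1" if w*^T phi > 0 (equivalently sigmoid > 0.5), otherwise "C2"."""
    a = sum(w_i * p_i for w_i, p_i in zip(w_star, phi))
    return "C1" if a > 0 else "C2"


w_star = [-1.0, 2.0]                   # hypothetical fitted weights (bias, slope)
print(classify(w_star, [1.0, 0.9]))    # activation about 0.8, positive side
print(classify(w_star, [1.0, 0.5]))    # activation 0.0, exactly on the boundary
print(classify(w_star, [1.0, 0.1]))    # activation about -0.8, negative side
```

A point lying exactly on the boundary falls into the "otherwise" branch and is assigned to $C_2$, matching the strict inequality in the rule.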