Lecture 8.3

Neural Networks: Loss Functions

The choice of output activation function and loss is not arbitrary: each follows from a probabilistic model of the targets. Gaussian targets yield MSE; Bernoulli targets yield the logistic sigmoid and cross-entropy; categorical targets yield softmax and multi-class cross-entropy.

Learning Objectives

Derive the sum-of-squares loss from a Gaussian target distribution.
Derive the binary cross-entropy loss and logistic sigmoid output from a Bernoulli target distribution.
Derive the multi-class cross-entropy loss and softmax output from the generalized Bernoulli (categorical) distribution.
Select the correct output activation and loss function given a problem type.

1. The Probabilistic Design Principle

In all cases, the recipe is: (1) assume a target distribution $p(t|\mathbf{x}, \mathbf{w})$ parameterized by the network output, (2) form the likelihood over the training set, (3) take the negative log-likelihood as the training loss and minimize it over $\mathbf{w}$. The loss and the output activation are therefore jointly determined by the assumed distribution.
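
Written out for a training set of $N$ input-target pairs, steps (2) and (3) take the following form. This assumes, as is standard but not stated explicitly above, that the targets are conditionally independent given the inputs; the symbol $E(\mathbf{w})$ for the training loss is used here for consistency with the later sections.

$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^N p(t_n|\mathbf{x}_n, \mathbf{w}), \qquad E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^N \ln p(t_n|\mathbf{x}_n, \mathbf{w}).$$

The product over samples becomes a sum after taking the logarithm, which is why each loss in the sections that follow is a sum of per-sample terms.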

2. Regression: Gaussian Targets → Sum of Squares

Assume $t \mid \mathbf{x}, \mathbf{w} \sim \mathcal{N}(y(\mathbf{x}), \sigma^2)$, where $y(\mathbf{x})$ is the network output. The negative log-likelihood is

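$$-\ln p(\mathbf{t}|\mathbf{w}) = \frac{1}{2\sigma^2}\sum_{n=1}^N (y(\mathbf{x}_n) - t_n)^2 + \frac{N}{2}\ln(2\pi\sigma^2),$$

assuming the $N$ targets are independent given the inputs. The second term does not depend on $\mathbf{w}$ and the factor $1/\sigma^2$ is a positive constant, so for fixed $\sigma^2$ minimizing the negative log-likelihood is equivalent to minimizing the sum of squared errors.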

Regression Setup

Number of output units: 1 (or $K$ for multi-output regression).
Output activation: identity (no nonlinearity, since targets are real-valued and unbounded).
Loss: sum of squared errors $E = \frac{1}{2}\sum_n (y_n - t_n)^2$.
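
As a numerical check on the link between Gaussian targets and the sum-of-squares loss, the following minimal sketch compares the two quantities on toy data. The function names, the toy arrays, and the variance value are illustrative assumptions and do not come from the lecture; the Gaussian negative log-likelihood is written out from the standard normal density, assuming independent samples.

```python
import numpy as np

def sum_of_squares(y, t):
    """Sum-of-squares loss E = 1/2 * sum_n (y_n - t_n)^2."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return 0.5 * np.sum((y - t) ** 2)

def gaussian_nll(y, t, sigma2):
    """Negative log-likelihood of t under N(y_n, sigma2), assuming i.i.d. samples."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    n = t.size
    return np.sum((y - t) ** 2) / (2.0 * sigma2) + 0.5 * n * np.log(2.0 * np.pi * sigma2)

# Hypothetical toy targets and two candidate sets of network outputs.
t = np.array([1.0, 2.0, 3.0])
y_a = np.array([1.1, 1.9, 3.2])
y_b = np.array([0.5, 2.5, 2.0])
sigma2 = 0.25  # assumed fixed noise variance

# The NLL gap between two candidates equals the sum-of-squares gap scaled by 1/sigma2,
# because the log-normalizer term cancels.
nll_gap = gaussian_nll(y_a, t, sigma2) - gaussian_nll(y_b, t, sigma2)
sse_gap = (sum_of_squares(y_a, t) - sum_of_squares(y_b, t)) / sigma2
print(np.isclose(nll_gap, sse_gap))
```

Because the two gaps agree, any set of outputs that lowers the sum of squares also lowers the Gaussian negative log-likelihood by a proportional amount, so both objectives rank candidate weights identically when the variance is held fixed.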

3. Binary Classification: Bernoulli Targets → Sigmoid + Cross-Entropy

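Assume $t \mid \mathbf{x}, \mathbf{w} \sim \text{Bernoulli}(y(\mathbf{x}))$, so $p(t|\mathbf{x},\mathbf{w}) = y^t(1-y)^{1-t}$ with $t \in \{0,1\}$. The network output $y$ must lie in $(0,1)$, which the logistic sigmoid output activation guarantees. Taking the negative logarithm of the likelihood over $N$ independent samples gives the binary cross-entropy:

$$-\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^N \bigl[t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr].$$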

Binary Classification Setup

Number of output units: 1.
Output activation: $y = \sigma(a^{\text{out}}) = 1/(1+e^{-a^{\text{out}}})$.
Loss: $E = -\displaystyle\sum_{n=1}^N \bigl[t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr]$.
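
The setup above can be sketched in a few lines of NumPy. The snippet checks that the cross-entropy sum matches the negative log of the Bernoulli likelihood product. The function names and toy values are illustrative assumptions, and the `eps` clipping is a numerical safeguard added here, not part of the lecture's derivation.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); maps any real activation into (0, 1)."""
    a = np.asarray(a, dtype=float)
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)].

    Clipping by eps (an assumption of this sketch) avoids log(0).
    """
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def bernoulli_nll(y, t):
    """Negative log of the likelihood prod_n y_n^t_n (1 - y_n)^(1 - t_n)."""
    likelihood = np.prod(y ** t * (1.0 - y) ** (1.0 - t))
    return -np.log(likelihood)

# Hypothetical output pre-activations and binary targets.
a_out = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, 0.0, 1.0])

y = sigmoid(a_out)
bce = binary_cross_entropy(y, t)
nll = bernoulli_nll(y, t)
print(np.isclose(bce, nll))
```

The direct product in `bernoulli_nll` is only for illustration; with many samples it underflows, which is one practical reason to work with the summed log form.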

4. Multi-Class Classification: Categorical Targets → Softmax + Cross-Entropy

With $K$ classes and one-hot targets $\mathbf{t}_n \in \{0,1\}^K$, assume each target follows the generalized Bernoulli (categorical) distribution:

$$p(\mathbf{t}_n|\mathbf{x}_n, \mathbf{w}) = \prod_{k=1}^{K} y_{nk}^{t_{nk}},$$

where $y_{nk} = p(C_k|\mathbf{x}_n)$ and $\sum_k y_{nk} = 1$. The softmax output activation keeps every output positive and enforces this sum constraint (the sample index $n$ is dropped for readability):

$$y_k = \frac{\exp(a_k^{\text{out}})}{\sum_{j=1}^K \exp(a_j^{\text{out}})}.$$

Multi-Class Classification Setup

Number of output units: $K$.
Output activation: softmax over all $K$ output activations.
Loss: $E = -\displaystyle\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}$.
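
A minimal NumPy sketch of this setup follows, assuming the output pre-activations are stored as an $N \times K$ array with one row per sample. The function names and toy values are hypothetical. Subtracting the row maximum before exponentiating is a common numerical safeguard that leaves the softmax value unchanged; it is an addition of this sketch, not part of the lecture.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax of an (N, K) array of output pre-activations."""
    a = np.asarray(a, dtype=float)
    # Shifting each row by its maximum cancels in the ratio but avoids overflow in exp.
    shifted = a - np.max(a, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

def multiclass_cross_entropy(y, t, eps=1e-12):
    """E = -sum_n sum_k t_nk ln y_nk for one-hot targets t.

    Clipping by eps (an assumption of this sketch) avoids log(0).
    """
    return -np.sum(t * np.log(np.clip(y, eps, 1.0)))

# Hypothetical pre-activations for N = 2 samples and K = 3 classes.
a_out = np.array([[2.0, 1.0, 0.1],
                  [0.0, 0.0, 0.0]])
# One-hot targets: sample 0 belongs to class 0, sample 1 to class 2.
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

y = softmax(a_out)
loss = multiclass_cross_entropy(y, t)
print(y.sum(axis=1))  # each row sums to 1
```

With one-hot targets only the log-probability of the correct class contributes for each sample, so the double sum reduces to one term per row.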

5. Summary Table

Problem	Target distribution	Output activation	Loss
Regression	Gaussian	Identity	Sum of squares
Binary classification	Bernoulli	Sigmoid	Binary cross-entropy
Multi-class classification	Categorical	Softmax	Multi-class cross-entropy
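
For quick reference, the table can be encoded as a small lookup. This is only an illustrative sketch: the dictionary, the function name `select_output_design`, and the key strings are hypothetical and simply mirror the rows above.

```python
# Each entry mirrors one row of the summary table.
OUTPUT_DESIGN = {
    "regression": {
        "target_distribution": "Gaussian",
        "output_activation": "identity",
        "loss": "sum of squares",
    },
    "binary classification": {
        "target_distribution": "Bernoulli",
        "output_activation": "sigmoid",
        "loss": "binary cross-entropy",
    },
    "multi-class classification": {
        "target_distribution": "categorical",
        "output_activation": "softmax",
        "loss": "multi-class cross-entropy",
    },
}

def select_output_design(problem_type):
    """Return the output activation and loss for a problem type (case-insensitive)."""
    key = problem_type.strip().lower()
    if key not in OUTPUT_DESIGN:
        raise ValueError(f"Unknown problem type: {problem_type!r}")
    return OUTPUT_DESIGN[key]

design = select_output_design("Binary classification")
print(design["output_activation"], "+", design["loss"])
```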