Lecture 8.3
Neural Networks: Loss Functions
The choice of output activation function and loss is not arbitrary: each follows from a probabilistic model of the targets. Gaussian targets yield MSE; Bernoulli targets yield the logistic sigmoid and cross-entropy; categorical targets yield softmax and multi-class cross-entropy.
- Derive the sum-of-squares loss from a Gaussian target distribution.
- Derive the binary cross-entropy loss and logistic sigmoid output from a Bernoulli target distribution.
- Derive the multi-class cross-entropy loss and softmax output from the generalized Bernoulli (categorical) distribution.
- Select the correct output activation and loss function given a problem type.
1. The Probabilistic Design Principle
In all cases, the recipe is: (1) assume a target distribution $p(t|\mathbf{x}, \mathbf{w})$ parameterized by the network output, (2) form the likelihood over the training set, (3) minimize the negative log-likelihood as the training loss. The resulting loss and output activation are jointly determined.
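Written out, steps (2) and (3) take the following form. This is a sketch that assumes the $N$ training pairs are independent given the inputs, an assumption the recipe uses implicitly; the symbol $E(\mathbf{w})$ for the training loss is introduced here to match the loss notation used later.

$$E(\mathbf{w}) = -\ln \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}) = -\sum_{n=1}^{N} \ln p(t_n|\mathbf{x}_n, \mathbf{w}).$$

Each section below substitutes a specific distribution for $p(t_n|\mathbf{x}_n, \mathbf{w})$ and simplifies the resulting sum.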
2. Regression: Gaussian Targets → Sum of Squares
Assume $t \mid \mathbf{x}, \mathbf{w} \sim \mathcal{N}(y(\mathbf{x}), \sigma^2)$, where $y(\mathbf{x})$ is the network output. The negative log-likelihood is
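$$-\ln p(\mathbf{t}|\mathbf{w}) = \frac{1}{2\sigma^2}\sum_{n=1}^N (y(\mathbf{x}_n) - t_n)^2 + \frac{N}{2}\ln(2\pi\sigma^2).$$
For fixed $\sigma^2$, neither the additive constant nor the factor $1/\sigma^2$ depends on $\mathbf{w}$, so minimizing the negative log-likelihood is equivalent to minimizing the sum of squared errors.
- Number of output units: 1 (or $K$ for multi-output regression).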
- Output activation: identity (no activation — targets are real-valued).
- Loss: sum of squared errors $E = \frac{1}{2}\sum_n (y_n - t_n)^2$.
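The equivalence between the Gaussian negative log-likelihood and the sum-of-squares loss can be checked numerically. The following is a minimal sketch in plain Python; the function names, the toy data, and the value of `sigma` are illustrative assumptions, not part of the lecture.

```python
import math

def gaussian_nll(y, t, sigma):
    """Negative log-likelihood of targets t under N(y_n, sigma^2), summed over n."""
    n = len(t)
    sse = sum((yn - tn) ** 2 for yn, tn in zip(y, t))
    return sse / (2.0 * sigma ** 2) + 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)

def sum_of_squares(y, t):
    """E = 1/2 * sum_n (y_n - t_n)^2."""
    return 0.5 * sum((yn - tn) ** 2 for yn, tn in zip(y, t))

# Hypothetical toy data: one target vector, two candidate sets of predictions.
targets = [1.0, 2.0, 3.0]
pred_a = [1.1, 1.9, 3.2]
pred_b = [0.5, 2.5, 2.0]
sigma = 0.7  # assumed fixed noise standard deviation

# The additive constant cancels in the difference, so the NLL gap between two
# predictors equals the gap in E scaled by 1 / sigma^2.
nll_gap = gaussian_nll(pred_a, targets, sigma) - gaussian_nll(pred_b, targets, sigma)
sse_gap = (sum_of_squares(pred_a, targets) - sum_of_squares(pred_b, targets)) / sigma ** 2
print(nll_gap, sse_gap)
```

Because the two gaps agree for any fixed `sigma`, both criteria rank candidate predictors identically, which is why the noise variance can be dropped from the training loss.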
3. Binary Classification: Bernoulli Targets → Sigmoid + Cross-Entropy
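Assume $t \mid \mathbf{x}, \mathbf{w} \sim \text{Bernoulli}(y(\mathbf{x}))$ with $t \in \{0,1\}$, so $p(t|\mathbf{x},\mathbf{w}) = y^t(1-y)^{1-t}$. The network output $y$ is interpreted as $p(t=1|\mathbf{x})$ and must lie in $(0,1)$, which the logistic sigmoid output activation guarantees. Taking the negative logarithm of the likelihood $\prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n}$ gives the binary cross-entropy listed as the loss below.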
- Number of output units: 1.
- Output activation: $y = \sigma(a^{\text{out}}) = 1/(1+e^{-a^{\text{out}}})$.
- Loss: $E = -\displaystyle\sum_{n=1}^N \bigl[t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr]$.
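The sigmoid output and binary cross-entropy above can be sketched in plain Python as follows. The function names and toy values are illustrative assumptions. The logit-based variant is an addition for numerical robustness, not something the lecture prescribes: it uses the identities $-\ln\sigma(a) = \ln(1+e^{-a})$ and $-\ln(1-\sigma(a)) = \ln(1+e^{a})$ to avoid taking the logarithm of a value that has rounded to 0.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, branching on the sign of a to avoid overflow in exp."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)

def softplus(a):
    """ln(1 + e^a), computed without overflow for large |a|."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def binary_cross_entropy(y, t):
    """E = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], for y_n strictly in (0, 1)."""
    return -sum(tn * math.log(yn) + (1 - tn) * math.log(1.0 - yn)
                for yn, tn in zip(y, t))

def binary_cross_entropy_from_logits(a, t):
    """Same loss computed directly from the output pre-activations a_n."""
    return sum(tn * softplus(-an) + (1 - tn) * softplus(an)
               for an, tn in zip(a, t))

# Hypothetical output pre-activations and binary targets.
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]

outputs = [sigmoid(a) for a in logits]
loss = binary_cross_entropy(outputs, targets)
loss_from_logits = binary_cross_entropy_from_logits(logits, targets)
print(outputs, loss, loss_from_logits)
```

Both functions return the same value on moderate inputs. The logit-based form stays finite when a sigmoid output saturates to exactly 0 or 1 in floating point, a case in which the direct formula would fail.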
4. Multi-Class Classification: Categorical Targets → Softmax + Cross-Entropy
With $K$ classes and one-hot targets $\mathbf{t}_n \in \{0,1\}^K$, assume each target follows the generalized Bernoulli (categorical) distribution:
$$p(\mathbf{t}_n|\mathbf{x}_n, \mathbf{w}) = \prod_{k=1}^{K} y_{nk}^{t_{nk}},$$
where $y_{nk} = p(C_k|\mathbf{x}_n)$ and $\sum_k y_{nk} = 1$. The softmax output activation enforces this sum constraint:
$$y_k = \frac{\exp(a_k^{\text{out}})}{\sum_{j=1}^K \exp(a_j^{\text{out}})}.$$
- Number of output units: $K$.
- Output activation: softmax over all $K$ output activations.
- Loss: $E = -\displaystyle\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}$.
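A minimal plain-Python sketch of the softmax output and multi-class cross-entropy follows. The function names and toy values are illustrative assumptions. Subtracting the largest pre-activation before exponentiating is a standard stability device added here; it leaves the softmax unchanged because the shift cancels between numerator and denominator.

```python
import math

def log_softmax(a):
    """ln y_k for one vector of output pre-activations, via a shifted log-sum-exp."""
    m = max(a)
    log_z = m + math.log(sum(math.exp(ak - m) for ak in a))
    return [ak - log_z for ak in a]

def softmax(a):
    """y_k = exp(a_k) / sum_j exp(a_j)."""
    return [math.exp(v) for v in log_softmax(a)]

def cross_entropy(logits_batch, targets_batch):
    """E = -sum_n sum_k t_nk ln y_nk, with one-hot rows t_n."""
    total = 0.0
    for a, t in zip(logits_batch, targets_batch):
        log_y = log_softmax(a)
        total -= sum(tk * ly for tk, ly in zip(t, log_y))
    return total

# Hypothetical batch: N = 2 examples, K = 3 classes, one-hot targets.
logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
targets = [[1, 0, 0], [0, 0, 1]]

probs = [softmax(a) for a in logits]
loss = cross_entropy(logits, targets)
print(probs, loss)
```

With one-hot targets only the log-probability of the true class contributes to each example's term, so the second example, whose pre-activations are all equal, contributes exactly $\ln 3$.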
5. Summary Table
| Problem | Target distribution | Output activation | Loss |
|---|---|---|---|
| Regression | Gaussian | Identity | Sum of squares |
| Binary classification | Bernoulli | Sigmoid | Binary cross-entropy |
| Multi-class classification | Categorical | Softmax | Multi-class cross-entropy |
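For selecting an output design programmatically, the table can be encoded as a small lookup. This is only an illustrative sketch: the names `OutputDesign`, `DESIGNS`, and `choose_output_design`, as well as the problem-type keys, are hypothetical and do not come from any library.

```python
from typing import NamedTuple

class OutputDesign(NamedTuple):
    target_distribution: str
    output_units: str
    output_activation: str
    loss: str

# Direct transcription of the summary table, plus the unit counts from each section.
DESIGNS = {
    "regression": OutputDesign(
        "Gaussian", "1 (or K for multi-output)", "identity", "sum of squares"),
    "binary": OutputDesign(
        "Bernoulli", "1", "sigmoid", "binary cross-entropy"),
    "multiclass": OutputDesign(
        "categorical", "K", "softmax", "multi-class cross-entropy"),
}

def choose_output_design(problem_type: str) -> OutputDesign:
    """Return the output-layer design for a problem type, or raise ValueError."""
    try:
        return DESIGNS[problem_type]
    except KeyError:
        known = ", ".join(sorted(DESIGNS))
        raise ValueError(
            f"unknown problem type {problem_type!r}; expected one of: {known}"
        ) from None

design = choose_output_design("binary")
print(design.output_activation, "+", design.loss)
```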