Lecture 8.1

Neural Networks

Neural networks (multi-layer perceptrons) address the limitation of fixed basis functions by learning the feature map from data. Each layer applies a linear transformation followed by a nonlinear activation, and stacking layers produces increasingly abstract representations.

Learning Objectives

Interpret a two-layer neural network as a generalized linear model with learned basis functions.
Define the components of an MLP: input units, activations, hidden units, output units.
Write the forward-pass equations for a two-layer network.
Name three common activation functions (sigmoid, tanh, ReLU) and explain why activation functions are necessary.
Describe skip connections, sparse connections, and weight sharing as design choices.

1. Learned Basis Functions

In all previous models, basis functions $\phi_m(\mathbf{x})$ were fixed before training. Neural networks parameterize each basis function by its own weight vector $\mathbf{w}_m^{(1)}$:

$$\phi_m(\mathbf{x}; \mathbf{w}_m^{(1)}) = h\!\bigl(\mathbf{w}_m^{(1)\top}\mathbf{x}\bigr),$$

where $h$ is a nonlinear activation function. Stacking $M$ such basis functions, each with its own weights, yields a first-layer output that is a vector of $M$ features. A final linear layer then maps these features to the output prediction. The weights of both layers are optimized jointly from data.
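As a minimal sketch of this idea, the snippet below evaluates $M = 3$ learned basis functions and feeds them to a final linear layer. The weight values, the choice $h = \tanh$, and the helper names `basis` and `features` are illustrative assumptions, not part of the lecture; in practice the weights would be fitted to data rather than written by hand.

```python
import math

def basis(x, w, h=math.tanh):
    """One learned basis function: phi(x; w) = h(w^T x)."""
    return h(sum(wi * xi for wi, xi in zip(w, x)))

def features(x, W1, h=math.tanh):
    """Stack M basis functions (one per row of W1) into a feature vector."""
    return [basis(x, w, h) for w in W1]

# Hypothetical first-layer weights: M = 3 basis functions on a 2-D input
# that has been prepended with x_0 = 1 for the bias.
W1 = [[0.0, 1.0, -1.0],
      [0.5, 0.5, 0.5],
      [-1.0, 2.0, 0.0]]

# Hypothetical second-layer weights: a single linear (regression) output.
w2 = [0.3, -0.2, 0.1]

x = [1.0, 0.4, -0.6]                      # x_0 = 1 carries the bias
phi = features(x, W1)                     # M learned feature values
y = sum(v * p for v, p in zip(w2, phi))   # final linear layer

print("features:", phi)
print("prediction:", y)
```

Changing any row of `W1` changes the corresponding basis function itself, which is what distinguishes this model from one with fixed $\phi_m$.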

2. Components of a Two-Layer Network

MLP Vocabulary

Input units $x_d$: components of the input vector, often prepended with $x_0 = 1$ for the bias.
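Activations $a_m^{(1)}$: pre-activation values at layer 1, computed by $\mathbf{a}^{(1)} = \mathbf{W}^{(1)}\mathbf{x}$, so that $a_m^{(1)} = \mathbf{w}_m^{(1)\top}\mathbf{x}$ with $\mathbf{w}_m^{(1)\top}$ the $m$-th row of $\mathbf{W}^{(1)}$.
Hidden units $z_m$: post-activation values, $z_m = h(a_m^{(1)})$. These are the learned features, i.e. the values of the basis functions.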
Output units $y_k$: final predictions, $\mathbf{y} = g(\mathbf{W}^{(2)}\mathbf{z})$, where $g$ is the output activation (identity for regression, sigmoid/softmax for classification).

3. Forward-Pass Equations

For a two-layer network ($L=2$) mapping input $\mathbf{x} \in \mathbb{R}^{d+1}$ (the $d$ input components plus the bias component $x_0 = 1$) to output $\mathbf{y} \in \mathbb{R}^K$:

$$\mathbf{a}^{(1)} = \mathbf{W}^{(1)}\mathbf{x}, \quad \mathbf{z} = h(\mathbf{a}^{(1)}), \quad \mathbf{a}^{(2)} = \mathbf{W}^{(2)}\mathbf{z}, \quad \mathbf{y} = g(\mathbf{a}^{(2)}),$$

where $h$ and $g$ are applied element-wise. Deeper networks simply stack more $(\mathbf{W}^{(l)}, h)$ pairs.
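The four forward-pass equations can be sketched directly in code. The version below uses plain Python lists so that each step is visible; the function names `matvec` and `forward` and all weight values are assumptions made for illustration. It follows the equations as written, so no separate bias unit is added to the hidden layer.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2, h=math.tanh, g=lambda a: a):
    """Two-layer forward pass; h and g are applied element-wise."""
    a1 = matvec(W1, x)          # a^(1) = W^(1) x
    z = [h(a) for a in a1]      # z     = h(a^(1))
    a2 = matvec(W2, z)          # a^(2) = W^(2) z
    y = [g(a) for a in a2]      # y     = g(a^(2))
    return a1, z, a2, y

# Hypothetical weights: 2 hidden units, 1 output, input includes x_0 = 1.
W1 = [[0.0, 0.5, 0.5],
      [0.1, -0.2, 0.3]]
W2 = [[1.0, -1.0]]
x = [1.0, 2.0, -1.0]

# Identity output activation (regression).
a1, z, a2, y_reg = forward(x, W1, W2)

# Sigmoid output activation (binary classification).
_, _, _, y_clf = forward(x, W1, W2, g=sigmoid)

print("a1 =", a1, "z =", z)
print("regression output:", y_reg, "classification output:", y_clf)
```

Only the output activation `g` changes between the regression and classification calls; the hidden representation `z` is the same in both.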

4. Activation Functions

Common Activation Functions

Name	Formula	Range
Logistic sigmoid	$\sigma(a)=1/(1+e^{-a})$	$(0,1)$
Hyperbolic tangent	$\tanh(a)=(e^a-e^{-a})/(e^a+e^{-a})$	$(-1,1)$
ReLU	$\max(0, a)$	$[0,\infty)$

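The sigmoid and tanh saturate for large $|a|$: their derivatives approach zero, so gradients propagated back through many such units become very small (the "vanishing gradient" problem during backpropagation). The ReLU does not saturate on its positive side, where its derivative is exactly 1, although its derivative is 0 for $a < 0$. It is among the most widely used activations in modern deep networks.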
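To make saturation concrete, the sketch below evaluates the derivative of each activation at a few inputs. The derivative formulas $\sigma'(a) = \sigma(a)(1-\sigma(a))$ and $\tanh'(a) = 1 - \tanh^2(a)$ are standard; the function names and the convention of returning 0 for the ReLU derivative at $a = 0$ are choices made here for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def d_tanh(a):
    return 1.0 - math.tanh(a) ** 2

def relu(a):
    return max(0.0, a)

def d_relu(a):
    # The ReLU is not differentiable at a = 0; returning 0 there is a convention.
    return 1.0 if a > 0 else 0.0

for a in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"a={a:6.1f}  sigmoid'={d_sigmoid(a):.2e}  "
          f"tanh'={d_tanh(a):.2e}  relu'={d_relu(a):.0f}")
```

At $a = 10$ the sigmoid and tanh derivatives are already several orders of magnitude below their maximum values (0.25 and 1, both attained at $a = 0$), while the ReLU derivative is still 1.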

Why Activation Functions Are Essential

Without a nonlinear activation, stacking multiple linear layers is equivalent to a single linear layer: $\mathbf{W}^{(2)}(\mathbf{W}^{(1)}\mathbf{x}) = (\mathbf{W}^{(2)}\mathbf{W}^{(1)})\mathbf{x}$. Nonlinear activations are what give neural networks their expressive power.
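A small numerical check of this identity, using hand-picked matrices (all values and helper names are illustrative assumptions): two linear layers applied in sequence give the same result as the single matrix $\mathbf{W}^{(2)}\mathbf{W}^{(1)}$, and inserting a ReLU between them breaks the equivalence.

```python
def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(a):
    return max(0.0, a)

W1 = [[1.0, 2.0],
      [0.0, -1.0],
      [3.0, 1.0]]            # 3 x 2
W2 = [[1.0, 0.0, -2.0],
      [2.0, 1.0, 1.0]]       # 2 x 3
x = [0.5, -1.5]

stacked = matvec(W2, matvec(W1, x))        # W2 (W1 x)
collapsed = matvec(matmul(W2, W1), x)      # (W2 W1) x
nonlinear = matvec(W2, [relu(a) for a in matvec(W1, x)])

print("two linear layers:", stacked)
print("one merged layer: ", collapsed)
print("with ReLU between:", nonlinear)
```

The first two printed vectors agree, while the third differs, since the ReLU zeroes the negative pre-activation before the second layer sees it.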

5. Architectural Choices

The standard feedforward network connects every unit to every unit in the next layer. Additional design choices include:

Skip connections: direct paths from lower-layer units to higher layers, bypassing one or more intermediate layers (e.g., ResNets).
Sparse connections: omitting certain connections to reduce parameters or impose structure.
Weight sharing: multiple connections constrained to use the same weight. The prototypical example is convolutional neural networks (CNNs), where a filter kernel slides across the input, sharing weights at every position. This encodes translation equivariance and dramatically reduces the parameter count for structured data (images, signals).
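Two of these choices can be sketched in a few lines. The code below is an illustrative assumption rather than a production layer: `conv1d` is a "valid" 1-D sliding-window filter (cross-correlation, as is conventional in CNNs) that reuses the same kernel weights at every position, and `residual` adds a layer's input back to its output as a skip connection. The signal, kernel, and helper names are hypothetical.

```python
def conv1d(x, kernel):
    """Valid 1-D sliding filter: the same kernel weights are used at every position."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def residual(x, f):
    """Skip connection: output is x + f(x), so the input bypasses the layer f."""
    return [xi + fi for xi, fi in zip(x, f(x))]

kernel = [1.0, 0.0, -1.0]                        # 3 shared weights
signal = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]                    # same signal, moved right by 1

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Translation equivariance: shifting the input shifts the output by the same amount.
print(out)
print(out_shifted)

# Parameter count: a fully connected map from 8 inputs to 6 outputs vs. the shared kernel.
dense_params = len(signal) * len(out)
conv_params = len(kernel)
print("dense weights:", dense_params, "shared weights:", conv_params)

# Skip connection around a hypothetical layer that halves its input.
print(residual([1.0, 2.0], lambda v: [0.5 * vi for vi in v]))
```

Away from the boundaries, the output for the shifted signal is the original output shifted by one position, and the shared kernel needs 3 weights where an unconstrained dense layer of the same input and output size would need 48.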