Lecture 6.5

Discriminant Functions: The Perceptron

The Perceptron is one of the oldest discriminant classifiers: a simple online update rule that, whenever the training data are linearly separable, reaches a perfect linear separator after a finite number of updates.

Learning Objectives
  • State the Perceptron model and its $\{-1, +1\}$ target encoding.
  • Formulate the Perceptron criterion and identify the set of misclassified points $\mathcal{M}$.
  • Derive the stochastic gradient descent update rule for misclassified examples.
  • Give a geometric picture of one update step.
  • State the Perceptron convergence theorem and its key condition.
  • List the limitations of the Perceptron algorithm.

1. The Perceptron Model

Binary targets are encoded as $t_n \in \{-1, +1\}$ (unlike the 0/1 encoding used elsewhere). The discriminant function is a generalized linear model with a sign activation:

$$y(\mathbf{x}) = f\!\bigl(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\bigr), \qquad f(a) = \begin{cases} +1 & a \geq 0 \\ -1 & a < 0. \end{cases}$$

The decision rule is: assign $\mathbf{x}$ to $C_1$ if $\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) \geq 0$, else to $C_2$.
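The model and decision rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `sign_activation` and `predict`, the weight values, and the use of a constant first feature as a bias are all assumptions made for the example.

```python
from typing import Sequence


def sign_activation(a: float) -> int:
    """Step function f(a): +1 for a >= 0, otherwise -1."""
    return 1 if a >= 0 else -1


def predict(w: Sequence[float], phi: Sequence[float]) -> int:
    """Return y(x) = f(w^T phi(x)) for an already-computed feature vector phi."""
    a = sum(w_i * phi_i for w_i, phi_i in zip(w, phi))
    return sign_activation(a)


# Hypothetical weights; phi[0] = 1 plays the role of a bias feature.
w = [-1.0, 2.0, 0.5]
print(predict(w, [1.0, 1.0, 0.0]))  # a = 1.0  -> 1  (class C1)
print(predict(w, [1.0, 0.0, 1.0]))  # a = -0.5 -> -1 (class C2)
print(predict(w, [1.0, 0.5, 0.0]))  # a = 0.0  -> 1  (boundary is assigned to C1)
```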

2. The Perceptron Criterion

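Write $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$. Point $n$ is correctly classified when $t_n\, \mathbf{w}^\top \boldsymbol{\phi}_n > 0$, that is, when $t_n$ and $\mathbf{w}^\top \boldsymbol{\phi}_n$ have the same sign. It is misclassified when this product is negative. A point lying exactly on the boundary (product zero) contributes nothing to the criterion below and is usually treated as misclassified, so that it still triggers an update. Let $\mathcal{M}$ be the set of misclassified indices under the current $\mathbf{w}$. The Perceptron criterion is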

Perceptron Criterion

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} t_n\, \mathbf{w}^\top \boldsymbol{\phi}_n.$$

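Every term in the sum is non-negative, because $t_n\, \mathbf{w}^\top \boldsymbol{\phi}_n \leq 0$ for each $n \in \mathcal{M}$; hence $E_P \geq 0$ always, and $E_P = 0$ when there are no misclassifications. Correctly classified points contribute nothing. Because $\mathcal{M}$ itself depends on $\mathbf{w}$, $E_P$ is piecewise linear in $\mathbf{w}$. Minimizing it drives $\mathbf{w}$ toward correct classification of the points currently in $\mathcal{M}$.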
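To make the criterion concrete, the sketch below evaluates $E_P$ and collects $\mathcal{M}$ for a fixed weight vector. The helper names (`dot`, `perceptron_criterion`) and the toy data are hypothetical, and counting boundary points (product exactly zero) as misclassified is an assumption of this example; such points add nothing to $E_P$ either way.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def perceptron_criterion(
    w: Sequence[float],
    Phi: Sequence[Sequence[float]],
    t: Sequence[int],
) -> Tuple[float, List[int]]:
    """Return (E_P, M): the criterion value and the misclassified indices."""
    misclassified: List[int] = []
    error = 0.0
    for n, (phi_n, t_n) in enumerate(zip(Phi, t)):
        margin = t_n * dot(w, phi_n)
        if margin <= 0:  # assumption: boundary points count as misclassified
            misclassified.append(n)
            error -= margin  # each term -t_n w^T phi_n is non-negative
    return error, misclassified


# Toy data (hypothetical); the first feature is a constant bias term.
Phi = [[1.0, 2.0, 1.0], [1.0, 0.0, 1.0], [1.0, -1.0, 0.0], [1.0, 1.0, 2.0]]
t = [+1, -1, +1, +1]
w = [0.0, 1.0, -1.0]

E_P, M = perceptron_criterion(w, Phi, t)
print(E_P, M)  # 2.0 [2, 3]
```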

3. Stochastic Gradient Descent Update

The gradient of $E_P$ with respect to $\mathbf{w}$, restricted to a single misclassified point $n$, is

$$\nabla_{\mathbf{w}} E_P = -t_n\, \boldsymbol{\phi}_n.$$

A stochastic gradient descent step in the direction of $-\nabla_{\mathbf{w}} E_P$ gives the update rule:

Perceptron Update Rule

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\, t_n\, \boldsymbol{\phi}_n,$$

where $\eta > 0$ is the learning rate. If the misclassified point belongs to $C_1$ ($t_n = +1$), the weight vector is nudged toward $\boldsymbol{\phi}_n$; if to $C_2$ ($t_n = -1$), it is nudged away. Only misclassified points trigger an update.
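The update rule translates directly into a training loop. What follows is a sketch under stated assumptions, not a definitive implementation: the function name `train_perceptron`, the zero initialization, the fixed visiting order, the `max_epochs` cap, and the toy data are all choices made for illustration. The test uses `<= 0` rather than `< 0` so that the all-zero initial weight vector still triggers updates.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(
    Phi: Sequence[Sequence[float]],
    t: Sequence[int],
    eta: float = 1.0,
    max_epochs: int = 100,
) -> Tuple[List[float], bool]:
    """Cycle through the data, updating w on each misclassified point.

    Returns (w, converged); converged is True once a full pass makes no update.
    """
    w = [0.0] * len(Phi[0])
    for _ in range(max_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            if t_n * dot(w, phi_n) <= 0:  # misclassified (or on the boundary)
                # w <- w + eta * t_n * phi_n
                w = [w_i + eta * t_n * p for w_i, p in zip(w, phi_n)]
                updated = True
        if not updated:
            return w, True
    return w, False


# Linearly separable toy data (hypothetical); first feature is a bias term.
Phi = [[1.0, 2.0, 2.0], [1.0, 1.0, 3.0], [1.0, -1.0, -1.0], [1.0, -2.0, 0.0]]
t = [+1, +1, -1, -1]

w, converged = train_perceptron(Phi, t)
print(w, converged)  # [1.0, 2.0, 2.0] True
```

Here the very first update already separates the four points, so the second pass makes no changes and the loop stops.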

4. Geometric Picture

The weight vector $\mathbf{w}$ is normal to the current decision boundary in feature space. When a point $\boldsymbol{\phi}_n$ is misclassified, adding $\eta\, t_n \boldsymbol{\phi}_n$ to $\mathbf{w}$ tilts the boundary so that $\boldsymbol{\phi}_n$ lies closer to, or on, the correct side. Iterating over misclassified points sweeps the boundary until all training examples are correctly classified, or forever if the data are not linearly separable.

One Iteration Visualized

Suppose $\boldsymbol{\phi}_n$ is a class-1 point on the wrong side of the boundary (the inner product $\mathbf{w}^\top\boldsymbol{\phi}_n$ is negative). After the update $\mathbf{w} \leftarrow \mathbf{w} + \eta\boldsymbol{\phi}_n$, the inner product with $\boldsymbol{\phi}_n$ increases by $\eta\|\boldsymbol{\phi}_n\|^2 > 0$, moving toward a positive value (correct classification). A single update is not guaranteed to flip the sign, and it may push other, previously correct points onto the wrong side. Repeated updates on the same point, however, eventually make its inner product positive.
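A small numeric check of this step, using made-up values for $\mathbf{w}$, $\boldsymbol{\phi}_n$ and $\eta$ (all hypothetical), shows the inner product rising by exactly $\eta\|\boldsymbol{\phi}_n\|^2$ per update and needing two updates to change sign:

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def update(w: Sequence[float], phi_n: Sequence[float], t_n: int, eta: float) -> List[float]:
    """One Perceptron step: w + eta * t_n * phi_n."""
    return [w_i + eta * t_n * p for w_i, p in zip(w, phi_n)]


eta = 0.5
phi_n = [1.0, 2.0]  # class-1 point, so t_n = +1; ||phi_n||^2 = 5
w0 = [1.0, -2.0]

w1 = update(w0, phi_n, +1, eta)
w2 = update(w1, phi_n, +1, eta)

print(dot(w0, phi_n))  # -3.0  (misclassified)
print(dot(w1, phi_n))  # -0.5  (up by eta * ||phi_n||^2 = 2.5, still misclassified)
print(dot(w2, phi_n))  # 2.0   (now on the correct side)
```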

5. Convergence Theorem

Perceptron Convergence Theorem

If the training data are linearly separable, the Perceptron algorithm converges in a finite number of steps to a weight vector $\mathbf{w}^*$ that correctly classifies all training points.

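If the data are not linearly separable, the algorithm never converges: $\mathbf{w}$ keeps changing indefinitely. In general we cannot know in advance which case applies, and until the algorithm stops, a non-separable problem is indistinguishable from a separable one that is merely slow to converge.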
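The non-separable case can be observed on the XOR pattern, a standard example of data that no linear boundary separates. The sketch below is illustrative only: the function name `count_updates_per_epoch`, the zero initialization, and the cap of 50 epochs are assumptions. It records how many updates each pass makes; on XOR every pass makes at least one, so the loop stops only because of the cap.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def count_updates_per_epoch(
    Phi: Sequence[Sequence[float]],
    t: Sequence[int],
    eta: float = 1.0,
    max_epochs: int = 50,
) -> List[int]:
    """Run the Perceptron and return the number of updates made in each pass."""
    w = [0.0] * len(Phi[0])
    history: List[int] = []
    for _ in range(max_epochs):
        n_updates = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * dot(w, phi_n) <= 0:
                w = [w_i + eta * t_n * p for w_i, p in zip(w, phi_n)]
                n_updates += 1
        history.append(n_updates)
        if n_updates == 0:  # a clean pass means convergence
            break
    return history


# XOR with a constant bias feature: not linearly separable.
Phi_xor = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
t_xor = [-1, +1, +1, -1]

history = count_updates_per_epoch(Phi_xor, t_xor)
print(len(history), min(history))  # all 50 passes ran; none was free of updates
```

Raising `max_epochs` does not help here; it only delays the point at which the run is abandoned.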

6. Limitations

  • Binary only. The algorithm does not generalize naturally to $K > 2$ classes.
  • Non-unique solution. When linearly separable, many valid weight vectors exist; the one found depends on initialization and the order in which misclassified points are visited.
  • No convergence guarantee. Without linear separability the algorithm runs forever. One cannot tell in advance which case applies.
  • No probabilities. Like all discriminant functions, the Perceptron produces hard class assignments with no uncertainty estimate.
  • Handcrafted features. The basis functions $\boldsymbol{\phi}(\mathbf{x})$ must be chosen manually, a limitation shared with all methods in this lecture. Lecture 8 (Neural Networks) addresses it by learning the basis functions themselves.
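The non-uniqueness point in the list above is easy to reproduce. In this sketch (hypothetical function names and toy data, zero initialization assumed), the same four separable points are visited in two different orders, and the Perceptron stops at two different weight vectors, both of which classify every point correctly.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(
    Phi: Sequence[Sequence[float]],
    t: Sequence[int],
    order: Sequence[int],
    eta: float = 1.0,
    max_epochs: int = 100,
) -> List[float]:
    """Perceptron training that visits the points in the given index order."""
    w = [0.0] * len(Phi[0])
    for _ in range(max_epochs):
        updated = False
        for n in order:
            if t[n] * dot(w, Phi[n]) <= 0:
                w = [w_i + eta * t[n] * p for w_i, p in zip(w, Phi[n])]
                updated = True
        if not updated:
            break
    return w


def separates(w: Sequence[float], Phi: Sequence[Sequence[float]], t: Sequence[int]) -> bool:
    """True if every point satisfies t_n * w^T phi_n > 0."""
    return all(t_n * dot(w, phi_n) > 0 for phi_n, t_n in zip(Phi, t))


Phi = [[1.0, 2.0, 2.0], [1.0, 1.0, 3.0], [1.0, -1.0, -1.0], [1.0, -2.0, 0.0]]
t = [+1, +1, -1, -1]

w_a = train_perceptron(Phi, t, order=[0, 1, 2, 3])
w_b = train_perceptron(Phi, t, order=[3, 2, 1, 0])
print(w_a, w_b)  # [1.0, 2.0, 2.0] [-1.0, 2.0, 0.0]
print(separates(w_a, Phi, t), separates(w_b, Phi, t))  # True True
```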