01.01 · Perceptrons & Activation Functions — Deep Dive

Level: Intermediate
Pre-reading: 01 · Neural Networks


The Perceptron: Foundation of Neural Networks

The perceptron is the simplest neural network: a single neuron that classifies linearly separable data.

A perceptron computes:

\[\text{output} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}\]

This draws a linear decision boundary in the input space.
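Here is a minimal NumPy sketch of that computation (the weights and bias below are hand-picked for illustration and happen to implement logical AND):

```python
import numpy as np

def perceptron(x, w, b):
    """Step-activated perceptron: fire iff w·x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked parameters that compute logical AND on {0, 1} inputs
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([0, 1]), w, b))  # 0
```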


Perceptron Learning Algorithm

Iterate over training examples:

  1. Make a prediction
  2. If the prediction is wrong, nudge the weights toward the correct answer
  3. Repeat until every example is classified correctly

\[w_{t+1} = w_t + \eta \, (y_i - \hat{y}_i) \, x_i\]

where η is the learning rate, yᵢ the true label, and ŷᵢ the prediction for input xᵢ; the bias is updated the same way. The algorithm is guaranteed to converge in finitely many updates if the data is linearly separable (the perceptron convergence theorem).
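The whole algorithm fits in a few lines. This sketch assumes labels in {0, 1}; the helper name `train_perceptron`, the learning rate, and the epoch cap are all illustrative choices:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron learning rule for labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else 0
            if y_hat != y_i:
                w += lr * (y_i - y_hat) * x_i   # the update rule above
                b += lr * (y_i - y_hat)
                mistakes += 1
        if mistakes == 0:  # every example correct: converged
            break
    return w, b

# Linearly separable toy data: logical OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
```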


The XOR Problem

A perceptron can learn any linearly separable pattern, but it fails on XOR: no single straight line separates XOR's positive examples from its negative ones.

Solution: use multiple layers with nonlinear activation functions. The nonlinearity is essential: stacking purely linear layers collapses to a single linear map, no more powerful than one perceptron.

```mermaid
graph LR
    A["Input Layer"] --> B["Hidden Layer<br/>2 perceptrons"]
    B --> C["Output Layer<br/>1 perceptron"]
```

This is why we need multi-layer networks!
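To see that two layers suffice, here is a sketch with hand-picked weights (one choice of many) that computes XOR exactly: one hidden unit detects OR, the other detects AND, and the output fires when OR holds but AND does not:

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

def xor_net(x):
    W1 = np.array([[1.0, 1.0],   # hidden unit 1: OR detector
                   [1.0, 1.0]])  # hidden unit 2: AND detector
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    w2 = np.array([1.0, -1.0])   # output: OR minus AND
    b2 = -0.5
    return step(w2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x)))  # 0, 1, 1, 0
```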


Activation Functions: Detailed Comparison

| Function | Formula | Range | Derivative | When to Use |
|----------|---------|-------|------------|-------------|
| ReLU | max(0, x) | [0, ∞) | 0 or 1 | Hidden layers |
| Sigmoid | 1/(1+e^-x) | (0, 1) | σ(x)(1−σ(x)) | Binary output |
| Tanh | (e^x−e^-x)/(e^x+e^-x) | (−1, 1) | 1−tanh²(x) | Hidden layers |
| Softmax | e^(x_i)/∑_j e^(x_j) | (0, 1), sums to 1 | Jacobian (not elementwise) | Multi-class output |
| Leaky ReLU | max(αx, x) | (−∞, ∞) | α or 1 | Fix dying ReLU |
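All five are a few lines each in NumPy. This is a sketch rather than a library implementation; α = 0.01 is a common default, and subtracting the max inside softmax is the standard trick for numerical stability:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()
```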

What's the difference between ReLU and Leaky ReLU?

ReLU outputs 0 for every negative input, which can cause neurons to "die": once a neuron's pre-activation goes negative for all inputs, its output and its gradient are both 0, so it never recovers. Leaky ReLU outputs a small scaled value α×x instead (α is typically around 0.01), allowing gradients to flow even for negative inputs.
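A quick numerical illustration of the two gradients (α = 0.01 here, the usual default):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])
alpha = 0.01

relu_grad = (x > 0).astype(float)         # exactly 0 for negative inputs
leaky_grad = np.where(x > 0, 1.0, alpha)  # small but nonzero instead

print(relu_grad)   # [0.   0.   1.   1.]   no learning signal on the left
print(leaky_grad)  # [0.01 0.01 1.   1.  ]
```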

Why is sigmoid not used in hidden layers anymore?

Sigmoid suffers from vanishing gradients: its derivative peaks at just 0.25 (at x = 0) and approaches 0 at the extremes, so gradients shrink multiplicatively as they propagate back through many layers. ReLU's gradient is exactly 1 for positive inputs, which lets deep networks train far more effectively.
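You can see the problem directly by evaluating σ'(x) = σ(x)(1 − σ(x)) at a few points; it peaks at 0.25 and collapses quickly:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(x, s * (1 - s))
# 0.0  -> 0.25       (the maximum)
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~4.5e-05   (effectively no gradient)
```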

Can I use ReLU in the output layer?

Only if your output is non-negative (e.g., image pixels, prices). For binary classification, use sigmoid. For multi-class, use softmax. For regression with any range, consider linear (no activation).
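Putting that advice together in one sketch (the logit values here are arbitrary):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])

# Binary classification: sigmoid turns one logit into a probability
p = 1 / (1 + np.exp(-logits[0]))

# Multi-class: softmax turns all logits into a distribution
e = np.exp(logits - logits.max())
p_classes = e / e.sum()

# Non-negative regression (pixels, prices): ReLU clamps below zero
price = np.maximum(0, logits[1])

# Unbounded regression: linear output, i.e. no activation at all
value = logits[2]
```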