01.02 · Backpropagation Deep Dive
Level: Intermediate
Pre-reading: 01 · Neural Networks, 00.02 · Core Concepts
What is Backpropagation?
Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in a neural network, enabling us to update the weights via gradient descent.
It works backwards from output to input, using the chain rule from calculus.
Forward Pass
First, compute predictions. For a single layer with weights \(W\), bias \(b\), input \(x\), and activation \(\sigma\):
\[z = Wx + b, \qquad \hat{y} = \sigma(z)\]
Then compute the loss, e.g. squared error against the target \(y\):
\[L = \tfrac{1}{2}(\hat{y} - y)^2\]
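A minimal sketch of this forward pass in NumPy. The layer sizes, the sigmoid activation, and all concrete values are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 inputs, 1 output.
rng = np.random.default_rng(0)
W = rng.normal(size=(1, 3))      # weights
b = np.zeros(1)                  # bias
x = np.array([0.5, -1.0, 2.0])   # input
y = np.array([1.0])              # target

z = W @ x + b                    # pre-activation
y_hat = sigmoid(z)               # prediction
loss = 0.5 * np.sum((y_hat - y) ** 2)  # squared-error loss
print(loss)
```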
Backward Pass
Use the chain rule to propagate the error backwards. With \(z = Wx + b\) and \(\hat{y} = \sigma(z)\):
\[\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W} = (\hat{y} - y)\,\sigma'(z)\,x\]
Continue the same multiplication of local derivatives for deeper layers, reusing each layer's gradient as the upstream gradient of the layer below it.
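A self-contained sketch of a full forward and backward pass, this time with two layers so the "continue for deeper layers" step is visible. The architecture and values are again arbitrary illustrations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Two sigmoid layers: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

# Forward pass, caching intermediates for reuse in the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: multiply local derivatives, reusing cached values.
dz2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2, using sigmoid'(z) = s(1-s)
dW2 = np.outer(dz2, a1)                   # dL/dW2
db2 = dz2                                 # dL/db2
da1 = W2.T @ dz2                          # propagate the error to layer 1
dz1 = da1 * a1 * (1 - a1)                 # dL/dz1
dW1 = np.outer(dz1, x)                    # dL/dW1
db1 = dz1                                 # dL/db1
```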
The Chain Rule
For \(y = f(g(h(x)))\):
\[\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)\]
This is exactly what backpropagation does — multiply partial derivatives along the computation chain.
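As a quick sanity check, here is the chain rule applied to one concrete composition (the functions \(h(x) = x^2\), \(g(u) = \sin u\), \(f(v) = e^v\) are arbitrary examples), compared against a finite-difference estimate:

```python
import numpy as np

# y = f(g(h(x))) with h(x) = x**2, g(u) = sin(u), f(v) = exp(v)
def composed(x):
    return np.exp(np.sin(x ** 2))

def analytic_grad(x):
    # Chain rule: f'(g(h(x))) * g'(h(x)) * h'(x)
    return np.exp(np.sin(x ** 2)) * np.cos(x ** 2) * 2 * x

x, eps = 0.7, 1e-6
numeric = (composed(x + eps) - composed(x - eps)) / (2 * eps)
print(analytic_grad(x), numeric)  # the two values agree closely
```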
Why is backpropagation efficient?
It reuses intermediate calculations: activations cached during the forward pass feed the backward pass, and each layer's gradient is reused by the layer below it. Computing all gradients therefore takes one forward pass plus one backward pass, the same order of cost as evaluating the loss once. A naive finite-difference approach would instead need a separate forward pass per weight.
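One way to see the efficiency claim in code: estimating gradients by finite differences costs one extra forward pass per weight, whereas backprop recovers every gradient from a single backward pass. A sketch of the finite-difference cost, on the same illustrative one-layer setup as above:

```python
import numpy as np

def loss(W, b, x, y):
    # One full forward pass: a single sigmoid layer plus squared error.
    y_hat = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return 0.5 * np.sum((y_hat - y) ** 2)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])

# Finite differences: two forward passes per weight.
eps = 1e-6
grads = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grads[i, j] = (loss(Wp, b, x, y) - loss(Wm, b, x, y)) / (2 * eps)
print(grads)
# Backprop would produce the same gradients with just one forward
# pass and one backward pass, regardless of how many weights there are.
```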
What happens if gradients are too small (vanishing gradient)?
Small gradients mean weights barely update. Deep networks suffer because each layer multiplies the gradient by an activation derivative that is often less than one (sigmoid's derivative never exceeds 0.25), so gradients shrink exponentially as they travel backwards through layers. Common mitigations: ReLU activations, batch normalization, skip (residual) connections.
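A small sketch of the shrinking effect (depth and the random pre-activations are arbitrary choices): multiplying by a chain of sigmoid derivatives drives the gradient toward zero.

```python
import numpy as np

# Gradient magnitude through a chain of sigmoid layers.
# Each layer multiplies the gradient by sigmoid'(z) <= 0.25,
# so the product shrinks exponentially with depth.
rng = np.random.default_rng(0)
grad = 1.0
for layer in range(1, 31):
    z = rng.normal()
    s = 1.0 / (1.0 + np.exp(-z))
    grad *= s * (1 - s)
    if layer % 10 == 0:
        print(f"layer {layer}: |grad| ~ {grad:.2e}")
# ReLU's derivative is exactly 1 on active units, so it avoids this
# per-layer shrink factor, which is part of why it helps here.
```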
What's the difference between backprop and gradient descent?
Backprop computes the gradients. Gradient descent uses those gradients to update the weights via \(w \leftarrow w - \eta \, \frac{\partial L}{\partial w}\). In short: backprop answers how the loss changes with each weight; gradient descent actually takes the step.
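A minimal sketch of that division of labor, reusing the one-layer setup from earlier (the learning rate and step count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])
lr = 0.1  # learning rate

for step in range(100):
    # Backprop's job: compute the gradients.
    z = W @ x + b
    y_hat = sigmoid(z)
    dz = (y_hat - y) * y_hat * (1 - y_hat)
    dW, db = np.outer(dz, x), dz

    # Gradient descent's job: use the gradients to update the weights.
    W -= lr * dW
    b -= lr * db

print(0.5 * np.sum((sigmoid(W @ x + b) - y) ** 2))  # loss has decreased
```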