03 · Optimization & Training
Level: Intermediate
Pre-reading: 00.02 · Core Concepts
Training Deep Networks: Practical Strategies
Training deep networks requires careful attention to hyperparameter choices and a handful of standard techniques:
Learning Rate Scheduling
Instead of using a constant learning rate, decay it over time (each strategy is sketched in code below):
| Strategy | How It Works |
|---|---|
| Step decay | Divide by 2 every N epochs |
| Exponential decay | Multiply by 0.9 each epoch |
| Cosine annealing | Follow cosine curve from high to low |
| Warm-up | Start low, increase to max, then decay |
Benefits:
- Larger initial steps allow faster early progress
- Smaller final steps allow fine-grained convergence
- Often improves final model quality
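A minimal sketch of these schedules as plain Python functions of the epoch number (the decay factors, periods, and warm-up length below are illustrative defaults, not prescriptions):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Divide the rate by 2 (drop=0.5) every `every` epochs.
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, gamma=0.9):
    # Multiply by a constant factor each epoch.
    return lr0 * gamma ** epoch

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    # Follow a cosine curve from lr0 at epoch 0 down to lr_min at the end.
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * progress))

def linear_warmup(lr0, epoch, warmup_epochs=5):
    # Ramp linearly from 0 up to lr0, then hold (combine with a decay above).
    return lr0 * min(1.0, epoch / warmup_epochs)
```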
Batch Size Effects
Batch size affects both training and generalization:
- Small batches: noisier gradient estimates, often better generalization, slower per epoch (less parallelism per step)
- Large batches: smoother gradient estimates, sometimes worse generalization, faster per epoch (better hardware utilization)
Sweet spot: 32–256 for most problems.
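When you move away from that range, a common heuristic is the linear scaling rule: scale the learning rate in proportion to the batch size. A minimal sketch (the base values are assumptions for illustration):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    # Linear scaling rule: keep lr / batch_size roughly constant.
    return base_lr * batch_size / base_batch

print(scaled_lr(1024))  # 0.4   -- 4x the batch, 4x the learning rate
print(scaled_lr(64))    # 0.025 -- smaller batch, smaller rate
```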
Weight Initialization
How you initialize weights matters:
| Method | Formula | When to Use |
|---|---|---|
| Xavier/Glorot | Uniform in [-√(6/(n+m)), +√(6/(n+m))], where n = fan-in, m = fan-out | Dense layers with tanh |
| He | Normal with std = √(2/n), where n = fan-in | Dense layers with ReLU |
| Zero | All values = 0 | Biases only; zero weights never break symmetry |
| Random normal | Normal with std = 0.01 | Risky: too-small weights can make signals vanish in deep nets |
Poor initialization can cause training to stall or fail.
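A minimal PyTorch sketch applying these initializers to a single layer (the layer sizes are arbitrary):

```python
from torch import nn

layer = nn.Linear(512, 256)  # arbitrary fan-in / fan-out

# He initialization for ReLU layers: normal with std = sqrt(2 / fan_in).
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# Xavier/Glorot alternative for tanh layers (use instead of the line above):
# nn.init.xavier_uniform_(layer.weight)

# Zero is fine for biases, but never for weights: every unit in the
# layer would compute the same function and receive the same gradient.
nn.init.zeros_(layer.bias)
```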
Interview Q&A
What happens if learning rate is too high?
The loss oscillates or increases, and the weights diverge instead of converging; reduce the learning rate.
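A tiny demonstration on a 1-D quadratic, f(w) = w², where gradient descent is stable only for learning rates below 1 (step counts and rates chosen for illustration):

```python
def run(lr, w=1.0, steps=5):
    # Gradient of f(w) = w**2 is 2*w, so each step multiplies w by (1 - 2*lr).
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(run(lr=0.1))  # ~0.33  -- shrinks toward the minimum at 0
print(run(lr=1.1))  # ~-2.49 -- magnitude grows every step: divergence
```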
What's the effect of large batch size?
Smoother gradient estimates and faster wall-clock training (more samples processed per second), but sometimes worse generalization. In practice the learning rate is usually scaled up with the batch size (the linear scaling rule), often combined with warm-up to keep early steps stable.
Should I warm up the learning rate?
Yes, for most modern training. Start at 0, increase linearly to the target rate, then decay; this prevents instability in early training.
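A minimal sketch of wiring linear warm-up plus cosine decay into a PyTorch optimizer via LambdaLR (the model, peak rate, and step budget are assumptions for illustration):

```python
import math
from torch import nn, optim

model = nn.Linear(128, 10)                         # stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1)  # 0.1 = peak (target) rate
warmup_steps, total_steps = 500, 10_000            # assumed step budget

def warmup_then_cosine(step):
    # Phase 1: ramp linearly from 0 up to the peak rate.
    if step < warmup_steps:
        return step / warmup_steps
    # Phase 2: cosine-decay from the peak rate toward 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = optim.lr_scheduler.LambdaLR(optimizer, warmup_then_cosine)

# In the training loop, call optimizer.step() and then scheduler.step()
# once per batch, so `step` counts batches rather than epochs.
```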