03 · Optimization & Training

Level: Intermediate
Pre-reading: 00.02 · Core Concepts


Training Deep Networks: Practical Strategies

Training deep networks requires careful attention to hyperparameters and techniques:


Learning Rate Scheduling

Instead of using a constant learning rate, decay it over time:

| Strategy | How It Works |
| --- | --- |
| Step decay | Divide the learning rate by 2 every N epochs |
| Exponential decay | Multiply the learning rate by 0.9 each epoch |
| Cosine annealing | Follow a cosine curve from high to low |
| Warm-up | Start low, increase to the maximum, then decay |

Benefits:

  • Allows larger initial steps (faster progress)
  • Smaller final steps (fine-grained convergence)
  • Often improves final model quality
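A minimal sketch of these schedules as plain Python functions. The function names and parameters (base_lr, drop_every, warmup_epochs, total_epochs) are illustrative choices, not from any particular library:

```python
import math

def step_decay(base_lr, epoch, drop_every=10, factor=0.5):
    # Divide the learning rate by 2 every `drop_every` epochs.
    return base_lr * (factor ** (epoch // drop_every))

def exponential_decay(base_lr, epoch, gamma=0.9):
    # Multiply the learning rate by 0.9 each epoch.
    return base_lr * (gamma ** epoch)

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    # Follow a cosine curve from base_lr down to min_lr.
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def warmup_then_cosine(base_lr, epoch, warmup_epochs, total_epochs):
    # Start low, ramp linearly up to base_lr, then decay with cosine annealing.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return cosine_annealing(base_lr, epoch - warmup_epochs,
                            total_epochs - warmup_epochs)

# Example: print the warm-up + cosine schedule for a 30-epoch run.
for e in range(30):
    print(e, round(warmup_then_cosine(0.1, e, warmup_epochs=5, total_epochs=30), 4))
```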


Batch Size Effects

Batch size affects both training and generalization:

  • Small batches: noisier gradient estimates, often better generalization, slower per epoch
  • Large batches: smoother gradient estimates, sometimes worse generalization, faster per epoch

Sweet spot: 32–256 for most problems.
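To see the "noisier gradients" effect concretely, here is a toy illustration (not from the original text) that measures the spread of mini-batch gradients on a synthetic linear-regression problem. All names and values are made up for the example; only the qualitative trend matters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 3x + noise, with a single weight w.
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=x.shape)
w = 1.0  # current (deliberately wrong) weight estimate

def batch_gradient(batch_size):
    # Gradient of mean squared error w.r.t. w on one random mini-batch.
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    return np.mean(2 * (w * xb - yb) * xb)

# Smaller batches give gradient estimates with larger standard deviation.
for bs in (8, 32, 256, 2048):
    grads = [batch_gradient(bs) for _ in range(500)]
    print(f"batch size {bs:5d}: grad std = {np.std(grads):.3f}")
```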


Weight Initialization

How you initialize weights matters:

| Method | Formula | When to Use |
| --- | --- | --- |
| Xavier/Glorot | Uniform in [−√(6/(n+m)), +√(6/(n+m))], where n = fan-in, m = fan-out | Dense layers with tanh |
| He | Normal with std = √(2/n), where n = fan-in | Dense layers with ReLU |
| Zero | All weights = 0 | Only for biases |
| Random normal | N(0, 0.01) | Use with care; the std can be too small |

Poor initialization can cause training to stall or fail.
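A minimal sketch of the two main schemes from the table, written with NumPy. The function names and layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Uniform in [-sqrt(6/(fan_in + fan_out)), +sqrt(6/(fan_in + fan_out))].
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # Normal with std = sqrt(2/fan_in), suited to ReLU layers.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: initialize a 512 -> 256 dense layer; biases start at zero.
W = he_normal(512, 256)
b = np.zeros(256)
print(W.std())  # should be close to sqrt(2/512) ≈ 0.0625
```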


Interview Q&A

What happens if the learning rate is too high?

The loss oscillates or increases, and the weights diverge instead of converging. Reduce the learning rate.

What's the effect of a large batch size?

Smoother gradient estimates and faster training (more samples per second), but sometimes worse generalization, and it may require a smaller learning rate.

Should I warm up the learning rate?

Yes, for most modern training: start at (or near) 0, increase linearly to the target rate, then decay. This prevents instability early in training.
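If you train with PyTorch, one way to wire this recipe into an optimizer is torch.optim.lr_scheduler.LambdaLR, which scales the base learning rate by a factor you supply. This is a sketch under assumed settings; the model, step counts, and base rate below are placeholders:

```python
import math
import torch

model = torch.nn.Linear(128, 10)                     # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 500, 10_000              # assumed schedule lengths

def lr_factor(step):
    # Multiplier on the base lr: linear warm-up from ~0, then cosine decay.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# In the training loop, call optimizer.step() and then scheduler.step() each step.
for step in range(3):
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```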