05.01 · Data Preprocessing
Level: Beginner to Intermediate
Pre-reading: 05 · ML Pipeline
The Art of Data Preprocessing
Data preprocessing often takes around half of the work in an ML project. Garbage in, garbage out: a model can only be as good as the data it is trained on.
Common Data Problems
Missing Values
data:
Age: [25, 30, NaN, 45] # <- How do we handle this?
Solutions:
1. Drop row: Fast but loses data
2. Mean imputation: Simple but ignores relationships between features
3. Predictive imputation: Use other features to predict the missing value
4. Forward fill (time series): Use the previous value
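The first two options can be sketched in plain Python on the Age example above. This is a minimal illustration, not a library API: the helper names `is_missing`, `drop_missing`, and `mean_impute` are hypothetical, and a real project would more likely use a dataframe library.

```python
import math
from statistics import mean

ages = [25, 30, float("nan"), 45]

def is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_missing(values):
    """Option 1: drop entries that are missing."""
    return [v for v in values if not is_missing(v)]

def mean_impute(values):
    """Option 2: replace missing entries with the mean of the observed ones."""
    fill = mean(drop_missing(values))
    return [fill if is_missing(v) else v for v in values]

dropped = drop_missing(ages)  # [25, 30, 45]
imputed = mean_impute(ages)   # NaN replaced by (25 + 30 + 45) / 3
```

Dropping shrinks the dataset from four rows to three, while mean imputation keeps all four rows but gives the missing person an "average" age regardless of their other features.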
Categorical Variables
Models usually need numerical input. Convert categories:
One-Hot Encoding:
Color: [Red, Green, Blue, Red]
→ Red: [1, 0, 0, 1]
Green: [0, 1, 0, 0]
Blue: [0, 0, 1, 0]
Label Encoding:
Color: [Red, Green, Blue] → [0, 1, 2]
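Use one-hot encoding for linear models and neural networks when the categories have no natural order, because label encoding would imply a ranking (Blue > Green > Red) that does not exist. Label encoding is generally fine for tree models and for categories that are genuinely ordered.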
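Both encodings shown above can be reproduced in a few lines of plain Python. This is a sketch for illustration only; `label_encode` and `one_hot_encode` are hypothetical helper names, and integer codes are assigned in order of first appearance, which is an assumption rather than a standard.

```python
colors = ["Red", "Green", "Blue", "Red"]

def label_encode(values):
    """Map each category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 column per category, keyed by category name."""
    categories = list(dict.fromkeys(values))  # unique values, order preserved
    return {c: [1 if v == c else 0 for v in values] for c in categories}

labels, mapping = label_encode(colors)
# labels  -> [0, 1, 2, 0]
# mapping -> {"Red": 0, "Green": 1, "Blue": 2}

one_hot = one_hot_encode(colors)
# one_hot -> {"Red": [1, 0, 0, 1], "Green": [0, 1, 0, 0], "Blue": [0, 0, 1, 0]}
```

Note that one-hot encoding adds one column per category, so features with many distinct values can make the dataset very wide.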
Scaling & Normalization
Different features often have different ranges:
Age: [20–80]
Income: [20000–200000]
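Without scaling, Income dominates any model that relies on distances or gradient magnitudes, simply because its numeric range is thousands of times larger than that of Age.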
Min-Max Scaling: $$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad \in [0, 1]$$
Z-Score Standardization: $$x_{\text{scaled}} = \frac{x - \mu}{\sigma} \quad \text{(mean 0, std 1)}$$
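The two formulas above translate directly into code. The sketch below uses only the standard library; the sample values are made up for illustration, the function names are hypothetical, and the population standard deviation is assumed for σ.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Illustrative values within the ranges mentioned above
ages = [20, 35, 50, 80]
incomes = [20000, 60000, 110000, 200000]

ages_minmax = min_max_scale(ages)        # [0.0, 0.25, 0.5, 1.0]
incomes_minmax = min_max_scale(incomes)  # now on the same [0, 1] range as ages
ages_z = z_score_scale(ages)             # mean 0, standard deviation 1
```

In practice, the minimum, maximum, mean, and standard deviation should be computed on the training split and then reused for validation and test data.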
Handling Outliers
Extreme values can skew models.
Detect: Flag values with |z-score| > 3, or use the IQR method (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR)
Handle:
- Remove (if error)
- Clip (cap at threshold)
- Transform (log, sqrt)
- Keep (if real and important)
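As a rough sketch of detection plus one handling strategy, the code below flags outliers with the IQR method and then clips them. The sample ages are invented, the helper names are hypothetical, and the conventional multiplier of 1.5 is assumed; quartiles come from `statistics.quantiles` with its default method, so other tools may give slightly different cut points.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip(values, lower, upper):
    """Cap every value to the [lower, upper] interval."""
    return [min(max(v, lower), upper) for v in values]

# Illustrative data: one implausibly large age
ages = [22, 25, 27, 30, 31, 33, 35, 120]

lower, upper = iqr_bounds(ages)
outliers = [v for v in ages if v < lower or v > upper]  # [120]
clipped = clip(ages, lower, upper)                      # 120 is capped at the upper fence
```

Whether to clip, remove, or keep the flagged value still depends on whether it is a data-entry error or a real observation.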
Class Imbalance
Skewed class distribution:
Class 0: 9,900 examples (99%)
Class 1: 100 examples (1%)
Solutions:
- Oversampling: Duplicate minority class examples
- Undersampling: Remove majority class examples
- Class weights: Give the minority class a higher weight
- Different metrics: Use F1, AUC instead of accuracy
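Two of these options, class weights and random oversampling, can be sketched on the 9,900 / 100 split above. The weighting rule used here, `n_samples / (n_classes * class_count)`, is one common "balanced" heuristic and is an assumption, as are the function names and the use of row indices as stand-in data.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - count)  # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

labels = [0] * 9900 + [1] * 100
rows = list(range(len(labels)))  # stand-in for real feature rows

weights = balanced_class_weights(labels)  # class 0 ≈ 0.505, class 1 = 50.0
new_rows, new_labels = random_oversample(rows, labels)
balanced_counts = Counter(new_labels)     # 9,900 examples of each class
```

Oversampling should be applied to the training split only; duplicating rows before splitting would put copies of the same example in both train and test sets.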
Data Quality Checks
Before training, verify:
- ✅ No NaN values (or handled)
- ✅ Proper data types (numeric vs categorical)
- ✅ No extreme outliers (unless expected)
- ✅ Features have variation (not all same value)
- ✅ Representative train/val/test splits (similar class distribution in each; stratify if classes are imbalanced)
- ✅ No data leakage (features don't contain future or target information; scalers and imputers are fit on the training split only)
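A few of the checks above (missing values, data types, and variation) are easy to automate. The sketch below assumes the dataset is a plain dictionary of column name to list of values; `quality_report` and the example columns are hypothetical, and checks such as leakage or split quality still need manual review.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def quality_report(columns):
    """Summarise missing values, numeric type, and variation per column."""
    report = {}
    for name, values in columns.items():
        observed = [v for v in values if not is_missing(v)]
        report[name] = {
            "missing": len(values) - len(observed),
            "numeric": all(isinstance(v, (int, float)) for v in observed),
            "constant": len(set(observed)) <= 1,  # no variation at all
        }
    return report

# Illustrative dataset
columns = {
    "age": [25, 30, float("nan"), 45],
    "income": [20000, 52000, 61000, 200000],
    "country": ["US", "US", "US", "US"],
}

report = quality_report(columns)
# report["age"]     -> 1 missing value, numeric, has variation
# report["country"] -> no missing values, not numeric, constant
```

A constant column such as `country` here carries no information for the model and can usually be dropped.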
Should I scale all features?
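Not necessarily. Tree models split on thresholds, so they are unaffected by feature scale and don't need scaling. Neural networks and distance-based models (KNN, SVM) are sensitive to scale and usually need it.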
How do I handle missing values in time series?
Use forward fill (carry the last value forward) or interpolation. For seasonal data, consider seasonal decomposition so that filled values follow the seasonal pattern.
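Forward fill and linear interpolation can be sketched as follows. The readings are invented and the function names are hypothetical. In this sketch, gaps at the very start of the series are left as `None` by forward fill, and leading or trailing gaps are left untouched by interpolation, since there is no neighbour on one side to work from.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def forward_fill(values):
    """Replace each missing entry with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if not is_missing(v):
            last = v
        filled.append(last)
    return filled

def linear_interpolate(values):
    """Fill gaps between two observed points along a straight line."""
    result = list(values)
    known = [i for i, v in enumerate(values) if not is_missing(v)]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

readings = [10.0, None, None, 16.0, None, 20.0]

ffilled = forward_fill(readings)             # [10.0, 10.0, 10.0, 16.0, 16.0, 20.0]
interpolated = linear_interpolate(readings)  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Forward fill uses only past values, which makes it safe for forecasting setups. Interpolation looks at the next observed value as well, so it can leak future information if applied before a time-based train/test split.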