05.01 · Data Preprocessing

Level: Beginner to Intermediate
Pre-reading: 05 · ML Pipeline

The Art of Data Preprocessing

Data preprocessing often accounts for around half of the work in an ML project. Garbage in, garbage out: a model can only be as good as the data it is trained on.

Common Data Problems

Missing Values

data:
Age: [25, 30, NaN, 45]  # <- How do we handle this?

Solutions:

1. Drop row: Fast but loses data
2. Mean imputation: Simple but ignores relationships
3. Predictive imputation: Use other features to predict the missing value
4. Forward fill (time series): Use the previous value
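As a rough illustration of the first two options, here is a minimal sketch in plain Python applied to the Age column above. The helper names `drop_missing` and `mean_impute` are made up for this example; in practice a data-frame or ML library would normally do this work.

```python
import math
from statistics import mean

def drop_missing(values):
    """Option 1: drop entries that are NaN."""
    return [v for v in values if not math.isnan(v)]

def mean_impute(values):
    """Option 2: replace NaN with the mean of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]

age = [25, 30, float("nan"), 45]

print(drop_missing(age))  # [25, 30, 45]
print(mean_impute(age))   # the NaN becomes the mean of 25, 30 and 45
```

Dropping shortens the column, so with several features the whole row has to go. Mean imputation keeps the row but pulls the filled value toward the average regardless of what the other features say.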

Categorical Variables

Models usually need numerical input. Convert categories:

One-Hot Encoding:

Color: [Red, Green, Blue, Red]

→ Red:   [1, 0, 0, 1]
  Green: [0, 1, 0, 0]
  Blue:  [0, 0, 1, 0]

Label Encoding:

Color: [Red, Green, Blue] → [0, 1, 2]

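Use one-hot encoding for nominal categories (no inherent order), especially with linear models and neural networks, which would otherwise treat the integer codes as an ordering (Blue > Green > Red). Label encoding suits ordered categories, and tree-based models can usually tolerate it because they split on thresholds rather than using the codes arithmetically.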
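Both encodings can be sketched in a few lines of plain Python. The function names `label_encode` and `one_hot_encode` are illustrative only, and the integer codes are assigned in order of first appearance, which is an assumption of this sketch rather than a rule.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    return {c: [1 if v == c else 0 for v in values] for c in categories}

color = ["Red", "Green", "Blue", "Red"]

codes, mapping = label_encode(color)
print(codes)    # [0, 1, 2, 0]
print(mapping)  # {'Red': 0, 'Green': 1, 'Blue': 2}

columns = one_hot_encode(color)
print(columns["Red"])  # [1, 0, 0, 1]
```

Note that the label codes imply Red < Green < Blue, an ordering that does not exist in the data. One-hot encoding avoids that at the cost of one extra column per category.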

Scaling & Normalization

Different features often have different ranges:

Age: [20–80]
Income: [20000–200000]

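Without scaling, Income dominates any model that relies on distances or gradient magnitudes, simply because its values are orders of magnitude larger than Age.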

Min-Max Scaling: $$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad \in [0, 1]$$

Z-Score Standardization: $$x_{scaled} = \frac{x - \mu}{\sigma} \quad \text{(mean 0, std 1)}$$
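The two formulas translate directly into code. The sketch below uses plain Python with made-up Age and Income samples chosen to fall inside the ranges above; the population standard deviation (`pstdev`) is one reasonable choice for sigma, not the only one.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale to [0, 1] using the column's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant column: min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Shift to mean 0 and rescale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("constant column: z-score scaling is undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical samples for illustration
age = [20, 35, 50, 80]
income = [20000, 65000, 110000, 200000]

print(min_max_scale(age))     # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(income))  # [0.0, 0.25, 0.5, 1.0]
print(z_score_scale(income))
```

After min-max scaling the two columns are directly comparable even though the raw values differ by orders of magnitude. In a real pipeline the min, max, mean and standard deviation would be computed on the training split only and then reused for validation and test data.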

Handling Outliers

Extreme values can skew models.

Detect: Flag points with |z-score| > 3, or use the IQR method (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR)

Handle:

- Remove (if error)
- Clip (cap at threshold)
- Transform (log, sqrt)
- Keep (if real and important)
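A minimal sketch of IQR-based detection followed by clipping, in plain Python. The income figures are invented for illustration, the multiplier `k=1.5` is the conventional default rather than something this page prescribes, and the quartiles come from `statistics.quantiles` with its default method, so other tools may give slightly different bounds.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip(values, lo, hi):
    """Cap every value to the range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

# Hypothetical incomes in thousands; the last value is suspicious
income = [32, 35, 38, 40, 41, 43, 45, 47, 50, 400]

lo, hi = iqr_bounds(income)
outliers = [v for v in income if v < lo or v > hi]
clipped = clip(income, lo, hi)

print(outliers)  # [400]
print(clipped)
```

The IQR fences are based on quartiles, so the extreme value barely affects them. A z-score rule can miss the same point in a small sample, because the outlier itself inflates the standard deviation.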

Class Imbalance

Skewed class distribution:

Class 0: 9,900 examples (99%)
Class 1: 100 examples (1%)

Solutions:

- Oversampling: Duplicate (or synthesize) minority-class examples
- Undersampling: Remove some majority-class examples
- Class weights: Give the minority class a higher weight in the loss
- Different metrics: Use F1 or AUC instead of accuracy
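Two of these options can be sketched with the 9,900 / 100 split above. The weight formula used here, n_samples / (n_classes × class_count), is the common "balanced" heuristic and is an assumption of this sketch; the function names and the integer stand-in rows are likewise illustrative.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for c, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == c]
        out_rows.extend(rng.choices(pool, k=target - cnt))
        out_labels.extend([c] * (target - cnt))
    return out_rows, out_labels

labels = [0] * 9900 + [1] * 100
rows = list(range(len(labels)))  # stand-in for real feature rows

weights = balanced_class_weights(labels)
print(weights[1])  # 50.0

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))  # both classes now have 9900 examples
```

Resampling should be applied to the training split only. Duplicating rows before splitting would place copies of the same example in both training and test data.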

Data Quality Checks

Before training, verify:

✅ No NaN values (or handled)
✅ Proper data types (numeric vs categorical)
✅ No extreme outliers (unless expected)
✅ Features have variation (not all same value)
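✅ Representative train/val/test splits (similar class proportions in each)
✅ No data leakage (features contain no future or target-derived information; scalers and imputers are fit on training data only)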
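Some of these checks are easy to automate. The sketch below covers only the missing-value, data-type and variation items for a single numeric column; `check_column` is a hypothetical helper, and leakage or split quality still need a manual review.

```python
import math

def check_column(name, values):
    """Return a list of human-readable problems found in one numeric column."""
    problems = []
    present = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    if len(present) < len(values):
        problems.append(f"{name}: {len(values) - len(present)} missing value(s)")
    if any(not isinstance(v, (int, float)) for v in present):
        problems.append(f"{name}: non-numeric entries")
    elif len(set(present)) <= 1:
        problems.append(f"{name}: no variation")
    return problems

print(check_column("age", [25, 30, float("nan"), 45]))  # ['age: 1 missing value(s)']
print(check_column("flag", [1, 1, 1]))                  # ['flag: no variation']
print(check_column("income", [20000, 65000, 110000]))   # []
```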

Should I scale all features?

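Not necessarily. Tree-based models split on thresholds, so they are insensitive to feature scale. Neural networks and distance- or margin-based models (KNN, SVM) generally need scaled features to work well.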

How do I handle missing values in time series?

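Use forward fill (carry the last value forward) or interpolation between the neighbouring observations. For seasonal data, an approach that accounts for the seasonal pattern, such as seasonal decomposition, usually fits better than a straight carry-forward.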
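The first two options can be sketched in plain Python as follows. The temperature readings are invented, the samples are assumed to be evenly spaced, and the function names are illustrative only.

```python
import math

def forward_fill(values):
    """Carry the last observed value forward; leading NaNs stay NaN."""
    filled, last = [], float("nan")
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled

def linear_interpolate(values):
    """Fill interior NaN runs on a straight line between known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out

nan = float("nan")
temps = [10.0, nan, nan, 16.0, nan, 20.0]  # hypothetical readings

print(forward_fill(temps))        # [10.0, 10.0, 10.0, 16.0, 16.0, 20.0]
print(linear_interpolate(temps))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Forward fill uses only past values, so it is safe for forecasting setups. Interpolation looks at the next observation as well, which leaks future information if the filled series is then used to predict that same future.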
