05.02 · Feature Engineering
Level: Intermediate
Pre-reading: 05 · ML Pipeline · 05.01 · Data Preprocessing
Feature Engineering: Art & Science
Feature engineering is the process of creating new features from raw data to improve model performance.
It's often more important than model choice!
Types of Feature Engineering
1. Domain-Driven Features
Use domain knowledge to create meaningful features:
Example: Predicting house prices
Raw features: sqft, bedrooms, location
Domain features:
- neighborhood_price_per_sqft = median price per sqft of past sales nearby (not this listing's own price, which is the target)
- density = bedrooms / sqft
- proximity_to_transit = distance to nearest transit
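In pandas, domain features like these are one-line column operations. A minimal sketch on made-up housing data (the column names and reference year are illustrative):

```python
import pandas as pd

# Hypothetical raw housing data
df = pd.DataFrame({
    "sqft": [1500, 2400, 900],
    "bedrooms": [3, 4, 2],
    "year_built": [1990, 2005, 1978],
})

# Domain-driven features derived from raw columns
df["density"] = df["bedrooms"] / df["sqft"]   # bedrooms per square foot
df["house_age"] = 2024 - df["year_built"]     # age at an assumed reference year

print(df[["density", "house_age"]])
```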
2. Interaction Features
Combine features:
age × income = combined purchasing power
hours_studied × previous_scores = learning trend
Captures non-additive relationships.
3. Polynomial Features
Add powers of features:
x → [x, x², x³]
Can capture non-linear relationships without deep networks.
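The same transformer handles the x → [x, x², x³] expansion; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])

# Expand a single column into [x, x^2, x^3]
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)

print(x_poly)  # row for x=2: [2., 4., 8.]
```

A linear model fit on `x_poly` can then represent cubic relationships in the original `x`.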
4. Binning/Discretization
Convert continuous to categorical:
Age: [5, 15, 25, 35, 45, ...]
→ AgeGroup: [child, teen, adult, middle-age, senior]
Useful for:
- Non-linear relationships
- Reducing outlier impact
- Creating categorical features
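`pd.cut` does this binning directly. A sketch; the bin edges and labels below are illustrative choices, not standard cutoffs:

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35, 45, 70])

# Bins are right-inclusive by default: (0,12], (12,19], (19,34], (34,54], (54,120]
groups = pd.cut(
    ages,
    bins=[0, 12, 19, 34, 54, 120],
    labels=["child", "teen", "adult", "middle-age", "senior"],
)
print(groups.tolist())
```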
5. Time-Based Features (for time series)
Date: 2024-03-15
→ DayOfWeek: 5 (Friday)
Month: 3
Quarter: 1
IsWeekend: 0
DaysSinceStart: 803
Captures temporal patterns.
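With pandas, these come from the `.dt` accessor. One caveat: pandas numbers days Monday=0 through Sunday=6, so a Friday is 4 (the ISO convention used above counts Friday as 5). A sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2024-03-15", "2024-03-16"])})

# pandas convention: Monday=0 ... Sunday=6
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["quarter"] = df["date"].dt.quarter
df["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)
df["days_since_start"] = (df["date"] - df["date"].min()).dt.days
```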
6. Aggregation Features
Aggregate over groups or windows:
Rolling average (7-day, 30-day)
Sum of purchases (last week, month, year)
Frequency of events (per day, hour)
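Grouped rolling and total aggregates are a `groupby` + `transform` pattern in pandas. A sketch on made-up per-user purchase amounts (the 3-row window is illustrative):

```python
import pandas as pd

# Hypothetical purchase amounts, ordered in time within each user
df = pd.DataFrame({
    "user": ["a"] * 5 + ["b"] * 5,
    "amount": [10, 20, 30, 40, 50, 5, 5, 5, 5, 5],
})

# Rolling average over the last 3 rows, computed within each user
df["rolling_3"] = (
    df.groupby("user")["amount"]
      .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)

# Total spend per user, broadcast back onto every row as a feature
df["user_total"] = df.groupby("user")["amount"].transform("sum")
```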
Feature Selection
Not all features are useful. Too many features = overfitting, slow training.
Methods
Statistical:
- Correlation with target (for regression)
- Information gain (for classification)
- Mutual information

Model-based:
- Feature importance from tree models
- Coefficients from linear models
- Permutation importance
Domain expert: "Which features make sense?"
Rule of Thumb
Features should correlate with the target but not with each other; strongly correlated features (multicollinearity) carry redundant information.
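Both checks can be sketched with scikit-learn on synthetic data: mutual information ranks features against the target, and a pairwise correlation matrix flags features that are redundant with each other.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic classification data: 10 features, only 3 informative
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Mutual information with the target: higher = more predictive
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:3]   # indices of the 3 strongest features
print("top features:", top, "scores:", mi[top])

# Pairwise feature correlations: large off-diagonal entries
# indicate multicollinearity
corr = np.corrcoef(X, rowvar=False)
```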
Common Feature Engineering Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Too many features | Overfitting, slow | Select top features |
| Leakage | Using future info | Only use past data |
| Not scaling | Large features dominate | Normalize/standardize |
| Ignoring domain | Random features | Consult experts |
| Over-engineering | Complex features that overfit | Start simple |
Feature Leakage: The Silent Killer
Data leakage: Using information that wouldn't be available at prediction time.
Example: Predicting loan default using whether loan was already paid off.
Problem: Model works perfectly in development but fails in production.
Solution: Only use features available before making prediction.
Should I engineer features before or after splitting train/test?
After splitting! Compute means/stds/thresholds on training set only, apply to validation/test. Otherwise test set influences training.
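The pattern looks like this with a scaler (the same fit-on-train-only rule applies to binning thresholds, target encodings, and imputation values):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics come from train only
X_test_s = scaler.transform(X_test)        # test reuses the train mean/std
```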
How many features should I have?
No fixed rule. Start with 10–50, use feature selection to reduce. Tree models handle more features; neural networks prefer fewer.
Is feature engineering needed with deep learning?
Less critical than traditional ML. Deep networks learn features automatically. But domain features can still help.