05.02 · Feature Engineering

Level: Intermediate
Pre-reading: 05 · ML Pipeline · 05.01 · Data Preprocessing


Feature Engineering: Art & Science

Feature engineering is the process of creating new features from raw data to improve model performance.

It's often more important than model choice!


Types of Feature Engineering

1. Domain-Driven Features

Use domain knowledge to create meaningful features:

Example: Predicting house prices

Raw features: sqft, bedrooms, location

Domain features:
- density = bedrooms / sqft
- proximity_to_transit = distance to nearest transit stop
- neighborhood_price_per_sqft = median sale price per sqft of nearby comparable homes
  (careful: price / sqft of the house itself uses the target and would leak it)
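A minimal pandas sketch of derived columns like these (all values and column names are illustrative):

```python
import pandas as pd

# Toy listing data; all values are illustrative.
homes = pd.DataFrame({
    "sqft": [1200, 2400, 900],
    "bedrooms": [3, 4, 2],
    "transit_dist_km": [0.4, 2.1, 0.8],
})

# Domain-driven columns derived from the raw features.
homes["density"] = homes["bedrooms"] / homes["sqft"]                  # bedrooms per sqft
homes["near_transit"] = (homes["transit_dist_km"] < 1.0).astype(int)  # within 1 km
```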


2. Interaction Features

Combine features:

age × income = combined purchasing power
hours_studied × previous_scores = learning trend

Captures non-additive relationships.
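For example, the studying interaction above is just an element-wise product of two columns (data is illustrative):

```python
import pandas as pd

# Illustrative student records.
df = pd.DataFrame({
    "hours_studied": [2, 10, 5],
    "previous_score": [60, 90, 75],
})

# Multiplying the two columns captures their joint (non-additive) effect,
# which a purely linear model on the raw columns cannot represent.
df["study_x_prev"] = df["hours_studied"] * df["previous_score"]
```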


3. Polynomial Features

Add powers of features:

x → [x, x², x³]

Can capture non-linear relationships without deep networks.
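One way to build this expansion is scikit-learn's `PolynomialFeatures` (a sketch, assuming a single raw feature):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])   # single raw feature, one column

# degree=3 expands x into [x, x^2, x^3]; include_bias=False drops the constant term.
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)        # columns: x, x^2, x^3
```

A linear model fit on `x_poly` can then represent cubic relationships in the original feature.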


4. Binning/Discretization

Convert continuous to categorical:

Age: [5, 15, 25, 35, 45, ...] 
→ AgeGroup: [child, teen, adult, middle-age, senior]

Useful for:
- Non-linear relationships
- Reducing outlier impact
- Creating categorical features
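In pandas this is `pd.cut` (bin edges and labels here are illustrative, not a standard):

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35, 45, 70])

# Discretize into ordered groups; pd.cut uses half-open (lo, hi] bins.
age_group = pd.cut(
    ages,
    bins=[0, 12, 19, 34, 54, 120],
    labels=["child", "teen", "adult", "middle-age", "senior"],
)
```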


5. Time-Based Features (for time series)

Date: 2024-03-15

→ DayOfWeek: 5 (Friday)
  Month: 3
  Quarter: 1
  IsWeekend: 0
  DaysSinceStart: 803

Captures temporal patterns.
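The pandas `.dt` accessor extracts these directly (a sketch; the weekday convention here is Monday=1 ... Sunday=7 to match the example above):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-03-15"]))  # a Friday

feats = pd.DataFrame({
    "day_of_week": dates.dt.dayofweek + 1,                # Monday=1 ... Sunday=7
    "month": dates.dt.month,
    "quarter": dates.dt.quarter,
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),  # Saturday/Sunday only
})
```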


6. Aggregation Features

Aggregate over groups or windows:

Rolling average (7-day, 30-day)
Sum of purchases (last week, month, year)
Frequency of events (per day, hour)
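Rolling and cumulative aggregates like these are one-liners in pandas (sales figures are illustrative):

```python
import pandas as pd

daily_sales = pd.Series(
    [10, 12, 9, 14, 20, 18, 11, 15],
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

# 7-day rolling average; min_periods=1 yields values for the first few days too.
rolling_avg_7d = daily_sales.rolling(window=7, min_periods=1).mean()

# Cumulative purchases to date.
total_to_date = daily_sales.cumsum()
```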

Feature Selection

Not all features are useful. Too many features = overfitting, slow training.

Methods

Statistical:
- Correlation with target (for regression)
- Information gain (for classification)
- Mutual information

Model-based:
- Feature importance from tree models
- Coefficients from linear models
- Permutation importance

Domain expert: "Which features make sense?"
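The model-based route can be sketched with a tree ensemble's built-in importances (synthetic data, where only feature 0 determines the label):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)   # label depends on feature 0 only

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_   # feature 0 should dominate
```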

Rule of Thumb

Features should correlate with target but not with each other (multicollinearity).
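A quick multicollinearity check is to scan the absolute correlation matrix for highly correlated pairs (the 0.9 threshold below is a common rule of thumb, not a fixed standard):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),  # near-duplicate of a
    "c": rng.normal(size=100),                      # independent noise
})

# Flag feature pairs whose absolute correlation exceeds the threshold.
corr = df.corr().abs()
redundant = [(i, j) for i in corr.columns for j in corr.columns
             if i < j and corr.loc[i, j] > 0.9]
```

One of each flagged pair can usually be dropped with little loss of information.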


Common Feature Engineering Mistakes

Mistake           | Problem                       | Fix
------------------|-------------------------------|----------------------
Too many features | Overfitting, slow training    | Select top features
Leakage           | Using future info             | Only use past data
Not scaling       | Large features dominate       | Normalize/standardize
Ignoring domain   | Random features               | Consult experts
Over-engineering  | Complex features that overfit | Start simple

Feature Leakage: The Silent Killer

Data leakage: Using information that wouldn't be available at prediction time.

Example: Predicting loan default using whether loan was already paid off.

Problem: Model works perfectly in development but fails in production.

Solution: Only use features available before making prediction.


Should I engineer features before or after splitting train/test?

After splitting! Compute means/stds/thresholds on training set only, apply to validation/test. Otherwise test set influences training.
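The fit-on-train-only pattern looks like this with a scikit-learn scaler (synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5.0, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # reuses train means/stds; no test info leaks in
```

An sklearn `Pipeline` enforces the same discipline automatically during cross-validation.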

How many features should I have?

No fixed rule. Start with 10–50, use feature selection to reduce. Tree models tolerate extra features well; distance-based models like k-NN are hurt most by irrelevant ones.

Is feature engineering needed with deep learning?

Less critical than traditional ML. Deep networks learn features automatically. But domain features can still help.