05 · ML Pipeline in Practice

Level: Intermediate to Advanced
Pre-reading: All previous sections

End-to-End ML Project

Building a real ML system requires:

graph TD A["1. Problem Statement"] --> B["2. Data Collection"] B --> C["3. Exploratory Analysis"] C --> D["4. Data Preprocessing"] D --> E["5. Feature Engineering"] E --> F["6. Model Selection"] F --> G["7. Baseline Model"] G --> H["8. Iterate & Improve"] H --> I{"Performance<br/>meets goal?"} I -->|No| H I -->|Yes| J["9. Final Evaluation"] J --> K["10. Deploy"] K --> L["11. Monitor & Retrain"]

1. Problem Statement

Define clearly: - What are we predicting? - What data do we have? - What's success? - What are constraints?

2. Data Collection

Gather data: - Internal logs and databases - APIs and web scraping - Crowdsourcing - Synthetic data generation

Key questions: - How much data? - Is it representative? - Are there biases?

3. Exploratory Data Analysis (EDA)

Understand your data:

Summary statistics: Mean, median, std, min, max
Distributions: Histograms, box plots
Correlations: Which features relate to target?
Missing values: How many? Random or systematic?
Outliers: Errors or real extreme cases?

4. Data Preprocessing

Clean and prepare data:

Handling Missing Values

Strategy	When to Use
Drop	Few missing values (<5%), not important feature
Fill with mean/median	Numerical, missing randomly
Fill with mode	Categorical
Forward/backward fill	Time-series data
Predictive filling	Predict from other features

Handling Outliers

Strategy	When to Use
Remove	Clear errors, few outliers
Clip	Few extreme values
Transform	Use log or sqrt (reduces impact)
Keep	Real extreme values (fraud signals, etc.)

Data Normalization/Standardization

Technique	Formula	When
Min-Max	\((x - \min) / (\max - \min)\)	Scale to [0,1]
Z-score	\((x - \mu) / \sigma\)	Scale to mean 0, std 1
Log transform	\(\log(x)\)	Reduce skew, outliers

5. Feature Engineering

Create meaningful features:

Domain knowledge: Business-relevant features
Interactions: Multiply/divide features
Polynomials: Square, cube features
Binning: Convert continuous to categorical
Aggregations: Time windows, counts

6. Model Selection

Choose model(s) to try:

Model Type	When to Use
Linear Regression	Simple, interpretable, fast
Decision Trees	Non-linear, handles categories, interpretable
Random Forest	Robust, handles interactions, less tuning
Neural Networks	Complex non-linearity, lots of data
Gradient Boosting	State-of-the-art, competitive models (XGBoost, LightGBM)

Start simple, increase complexity if needed.

7. Baseline Model

Create simple baseline (median prediction, simple rule):

Why? - Provides reference point - Quick reality check - Highlights value of complex model

8. Iterate & Improve

If Problem	Try
Underfitting (high training loss)	Add features, increase model complexity, train longer
Overfitting (high validation loss)	Add regularization, more data, simpler model
Slow training	Reduce data size, simpler model, use GPU
Poor on specific examples	Analyze failing examples, add relevant features

9. Final Evaluation

Evaluate on held-out test set: - Report accuracy, precision, recall, F1 - Create confusion matrix - Analyze failure cases - Compare to baseline

10. Deployment

Put model in production: - REST API endpoint - Batch processing - Embedded model - A/B testing

11. Monitoring & Retraining

Important: Monitor model performance over time:

Track prediction accuracy
Detect data drift (new patterns)
Retrain periodically
Set up alerts for performance drop

How much data do I need?

Depends on model complexity and problem difficulty. Rules of thumb: 1,000 for simple models, 10,000+ for deep learning. More data always helps.

What's the train/val/test split?

70/15/15 is common. For huge datasets, 99/0.5/0.5 is okay. Never use test set during model development.

How do I handle imbalanced classes?

Oversampling, undersampling, class weights, different metrics (F1, AUC instead of accuracy).

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search