RAG Practical Labs: Hands-On Learning
Welcome to the labs track. These notebooks are structured as college-level exercises that move from foundations to advanced RAG system design.
At a Glance
- Format: Jupyter notebooks with guided exercises and checkpoints
- Time per lab: about 30 minutes (Lab 0) to 3 hours (Lab 7)
- Difficulty path: Foundations -> Intermediate -> Advanced
- Skills covered: Python, vector math, embeddings, vector databases, retrieval, evaluation
Learning Roadmap
| Lab | Topic | Level | Time | Primary Outcome |
|---|---|---|---|---|
| 0 | Environment Setup | Foundations | 30 min | Working local environment |
| 1 | Vector Math and Distance Metrics | Foundations | 1.5 hr | Core similarity math intuition |
| 2 | Text Embeddings and Semantic Similarity | Foundations | 1.5 hr | Embedding analysis workflow |
| 3 | Vector Databases with Chroma | Intermediate | 2 hr | Persisted vector search system |
| 4 | Real Data Ingestion (MongoDB -> Chroma) | Intermediate | 2.5 hr | End-to-end ingestion pipeline |
| 5 | Exact vs Semantic Match Problem | Intermediate | 2 hr | Failure analysis and diagnosis |
| 6 | Hybrid Search Implementation | Intermediate-Advanced | 2.5 hr | Dense+sparse retrieval pipeline |
| 7 | Complete RAG Pipeline | Advanced | 3 hr | Full system with evaluation |
Lab Details
Lab 0: Environment Setup and Fundamentals
- Notebook: lab_0_environment.ipynb
- You will:
  - Create and activate a Python virtual environment
  - Install the required packages and validate their imports
  - Check basic CPU/GPU runtime availability
  - Run your first embedding examples
- Deliverable: Working environment and successful setup checks
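The import-validation step above can be sketched as a short script. This is a minimal illustration, not the notebook's actual setup check: the helper name `check_imports` is hypothetical, and the module list is an assumption inferred from the skills this track covers, so adjust it to the lab's real requirements.

```python
import importlib.util
import sys


def check_imports(module_names):
    """Return {module_name: bool}, True when the module can be located."""
    results = {}
    for name in module_names:
        # find_spec locates a top-level module without importing it
        results[name] = importlib.util.find_spec(name) is not None
    return results


# Assumed module list (not taken from the lab); edit to match your setup.
ASSUMED_MODULES = ["numpy", "pandas", "sklearn", "sentence_transformers", "chromadb"]

# The prerequisites checklist asks for Python 3.9 or newer.
python_ok = sys.version_info >= (3, 9)
print("Python", sys.version.split()[0], "OK" if python_ok else "TOO OLD")

for name, found in check_imports(ASSUMED_MODULES).items():
    print(f"{name:25s} {'OK' if found else 'MISSING'}")
```

Any module reported as MISSING should be installed inside the virtual environment before moving on to Lab 1.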
Lab 1: Vector Math and Distance Metrics
- Notebook: lab_1_vector_math.ipynb
- You will:
  - Implement the dot product, vector norms, and normalization
  - Compare cosine similarity with Euclidean and Manhattan distance
  - Build geometric intuition in 2D and 3D
- Deliverable: Working vector ops and metric comparison visuals
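The operations listed for this lab can be sketched in plain Python. The function names below are illustrative choices, not necessarily the ones used in the notebook, which may rely on NumPy instead.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) norm."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in a]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0: identical direction
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
```

The example pair shows why metric choice matters: cosine similarity treats `a` and `b` as identical because they point the same way, while both distance metrics report them as far apart.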
Lab 2: Text Embeddings and Semantic Similarity
- Notebook: lab_2_embeddings.ipynb
- You will:
  - Generate embeddings with SentenceTransformer
  - Inspect embedding dimensions and value distributions
  - Analyze semantic similarity and nearest neighbors
  - Visualize clusters with dimensionality reduction
- Deliverable: Embedding corpus analysis and similarity outputs
Lab 3: Vector Databases with Chroma
- Notebook: lab_3_chroma_basics.ipynb
- You will:
  - Create and populate a Chroma collection
  - Store vectors with metadata
  - Query, update, delete, and persist data
  - Understand approximate nearest neighbor (ANN) search and HNSW behavior at a practical level
- Deliverable: Working vector database with metadata-aware search
Lab 4: Real Data Ingestion (MongoDB to Chroma)
- Notebook: lab_4_mongodb_ingestion.ipynb
- You will:
  - Load open datasets and preprocess records
  - Apply chunking strategies
  - Preserve metadata through ingestion
  - Embed and insert records in batches at scale
- Deliverable: Ingested dataset with chunked vectors and metadata
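One possible shape for the chunking, metadata, and batching steps is sketched below. It assumes fixed-size character chunks with overlap, which is only one of several chunking strategies; the helper names, the default sizes, and the sample record fields are all hypothetical and not taken from the lab.

```python
def chunk_record(text, metadata, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks, copying metadata onto each."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            # Keep the source metadata and record where the chunk came from
            "metadata": {**metadata, "chunk_index": index, "start_char": start},
        })
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# Hypothetical record; real field names depend on the dataset.
record = {"text": "abcdefghij", "metadata": {"source_id": "r1", "cuisine": "Thai"}}
chunks = chunk_record(record["text"], record["metadata"], chunk_size=4, overlap=2)

for batch in batched(chunks, batch_size=3):
    # In the lab, each batch would be embedded and inserted here.
    print([c["text"] for c in batch])
```

Carrying `chunk_index` and `start_char` in the metadata makes it possible to trace any retrieved chunk back to its position in the source record.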
Lab 5: Exact vs Semantic Match (Core RAG Failure Mode)
- Notebook: lab_5_exact_match_problem.ipynb
- You will:
  - Reproduce exact-match failure cases (for example, near-identical IDs)
  - Visualize why semantic similarity can return the wrong record when an exact match is required
  - Identify when dense-only retrieval is insufficient
- Deliverable: Failure analysis and decision criteria for hybrid search
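The close-ID problem can be illustrated without an embedding model. The sketch below uses character-bigram cosine similarity as a crude stand-in for a fuzzy similarity score; it is not a real dense embedding, and actual models behave differently in detail. The ID format matched by `ID_PATTERN` and the helper `needs_exact_match` are hypothetical examples of a decision criterion.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (a toy stand-in for an embedding)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(c1, c2):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * c2[gram] for gram, count in c1.items())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


# Hypothetical ID format: two or more capital letters, a hyphen, then digits.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b")


def needs_exact_match(query):
    """Flag queries that contain an ID-like token."""
    return ID_PATTERN.search(query) is not None


similarity = cosine(char_ngrams("ORD-10234"), char_ngrams("ORD-10243"))
print(round(similarity, 2))                          # 0.75: the IDs look alike
print("ORD-10234" == "ORD-10243")                    # False: different records
print(needs_exact_match("status of ORD-10234"))      # True
print(needs_exact_match("what is a vector database"))  # False
```

Two IDs that differ only by swapped digits score as highly similar yet refer to different records, which is the situation where a query should be routed to exact or sparse matching instead of dense-only retrieval.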
Lab 6: Hybrid Search Implementation
- Notebook: lab_6_hybrid_search.ipynb
- You will:
  - Implement sparse retrieval (BM25/TF-IDF)
  - Combine dense and sparse rankings with reciprocal rank fusion (RRF)
  - Compare retrieval quality across methods
  - Evaluate with ranking metrics (for example, MRR and NDCG)
- Deliverable: Hybrid retrieval pipeline with comparative evaluation
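Rank fusion and one of the ranking metrics can be sketched as follows. RRF scores each document as the sum of 1 / (k + rank) over the input rankings; k = 60 is a commonly used default, and the lab may choose a different value. The document IDs and rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """MRR with one relevant doc per query; a miss contributes 0."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Made-up results for a single query whose relevant document is "d1".
dense = ["d2", "d1", "d3"]
sparse = ["d1", "d4", "d2"]
fused = reciprocal_rank_fusion([dense, sparse])

print(fused)                                  # ['d1', 'd2', 'd4', 'd3']
print(mean_reciprocal_rank([dense], ["d1"]))  # 0.5
print(mean_reciprocal_rank([fused], ["d1"]))  # 1.0
```

Because RRF uses only rank positions, it sidesteps the problem that dense similarity scores and sparse scores such as BM25 live on different scales.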
Lab 7: Complete RAG Pipeline
- Notebook: lab_7_complete_rag.ipynb
- You will:
  - Connect ingestion, retrieval, and generation
  - Implement multi-stage retrieval and ranking
  - Evaluate relevance, faithfulness, and latency
  - Discuss production concerns (monitoring, cost, performance)
- Deliverable: End-to-end RAG prototype with evaluation results
Data Sources
| Lab | Dataset | Approx Size | Source |
|---|---|---|---|
| 4 | MongoDB Restaurants | 25K documents | MongoDB Atlas sample data |
| 4 | MongoDB Movies | 23K documents | MongoDB Atlas sample data |
| 5 | Exact-match synthetic set | 1K documents | Generated in lab |
| 6 | Hybrid evaluation set | 5K documents | Mixed generated + real data |
All datasets are generated in-lab or available through free-tier public/sample sources.
Prerequisites Checklist
Before starting labs:
- [ ] Python 3.9+ installed
- [ ] Basic comfort with pandas, NumPy, and scikit-learn
- [ ] Basic machine learning familiarity
- [ ] Prerequisites section read
- [ ] Jupyter installed, with notebooks opening and running locally
How to Use These Labs
- Start with Lab 0 and verify setup.
- Progress in order (each lab builds on earlier concepts).
- Complete TODOs and checkpoint sections.
- Compare outputs and write short observations.
- Re-run selected labs after finishing Lab 7 to reinforce intuition.
Academic Evaluation Rubric (Optional)
- Correctness (50%): Solution works and satisfies the prompt
- Understanding (25%): Concepts are explained clearly
- Analysis (15%): Results are interpreted with reasoning
- Code Quality (10%): Code is readable, structured, and reproducible
Tips for Success
- Do not skip the math intuition in Lab 1; every later lab builds on it.
- Execute every cell and modify parameters to see behavior changes.
- Keep brief notes on errors, fixes, and metric changes.
- Prefer small, reproducible experiments before scaling up.
Ready to begin? Start with Lab 0: Environment Setup (lab_0_environment.ipynb).