Lab 0: Environment Setup & First Steps with RAG
Level: Foundations | Duration: 30 minutes
Objective
Set up a complete development environment for learning RAG and verify that every tool works correctly. By the end of this lab, you'll have all dependencies installed and will have created your first embedding.
What You'll Learn
- Install essential Python packages for RAG
- Understand the difference between dense and sparse retrieval
- Create your first text embedding
- Verify GPU/CPU availability
- Understand why each library matters
Prerequisites
- Python 3.9 or higher
- pip package manager
- ~2GB free disk space for models
Section 1: Core Libraries for RAG
These labs build on the following core libraries:
| Library | Purpose | Why It Matters |
|---|---|---|
| numpy | Numerical computing | Vectors and matrix operations |
| pandas | Data manipulation | Loading, exploring datasets |
| sentence-transformers | Text embeddings | Converting text to vectors |
| scikit-learn | Machine learning | Distance metrics, evaluation |
| chromadb | Vector database | Storing and searching vectors |
| rank-bm25 | Sparse retrieval | Keyword-based ranking (BM25) |
| matplotlib/seaborn | Visualization | Understanding vectors and results |
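Installation Command:
pip install numpy pandas scipy sentence-transformers scikit-learn chromadb rank-bm25 matplotlib seaborn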
# Exercise 1.1: Verify Core Libraries Installation
# Run this cell to check that all dependencies are installed
import importlib
import sys
print(f"Python Version: {sys.version}")
print("-" * 60)  # prints a separator line of 60 dashes
# Map of import name -> (pip package name, purpose).
# The import name and the pip package name differ for some libraries.
libraries = {
    "numpy": ("numpy", "Numerical computing"),
    "pandas": ("pandas", "Data manipulation"),
    "sklearn": ("scikit-learn", "Machine learning utilities"),
    "sentence_transformers": ("sentence-transformers", "Text embeddings"),
    "chromadb": ("chromadb", "Vector database"),
    "rank_bm25": ("rank-bm25", "BM25 ranking algorithm"),
    "matplotlib": ("matplotlib", "Plotting"),
}
failed = []  # pip package names of libraries that could not be imported
for lib, (pip_name, purpose) in libraries.items():
    try:
        importlib.import_module(lib)
        print(f"✓ {lib:25} ({purpose})")
    except ImportError:
        print(f"✗ {lib:25} MISSING!")
        failed.append(pip_name)
print("-" * 60)
if failed:
    print(f"\n⚠️ Missing libraries: {', '.join(failed)}")
    print(f"Install with: pip install {' '.join(failed)}")
else:
    print("\n✅ All libraries installed successfully!")
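The learning goals also mention verifying GPU/CPU availability. A minimal sketch of such a check follows, assuming PyTorch is present (it is installed as a dependency of sentence-transformers). The helper name `detect_device` is hypothetical, and the function falls back to "cpu" if PyTorch cannot be imported.

```python
def detect_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch  # normally installed alongside sentence-transformers
    except ImportError:
        return "cpu"  # assumption: without PyTorch, treat the machine as CPU-only
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    # Older PyTorch builds have no torch.backends.mps, so look it up defensively
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU
    return "cpu"


device = detect_device()
print(f"Compute device: {device}")
```

Everything in this lab runs fine on a CPU. If a GPU is detected, the result can be passed along when loading the model, for example `SentenceTransformer('all-MiniLM-L6-v2', device=device)`.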
Section 2: Your First Embedding
Now let's create your first text embedding using a pretrained transformer model!
Key Concept: Embedding models convert text into numerical vectors that capture semantic meaning. Similar texts will have similar vectors.
# Exercise 2.1: Load Embedding Model and Create First Embedding
# Note: The first run downloads the model (~90MB); later runs load it from the local cache
from sentence_transformers import SentenceTransformer
import numpy as np
print("Loading embedding model: all-MiniLM-L6-v2")
print("(This will download ~90MB on first run...)\n")
# Load a lightweight but effective embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✓ Model loaded successfully!")
print(" Model name: all-MiniLM-L6-v2")
print(" Embedding dimension: 384 (each text → 384 numbers)")
print(" Large enough to capture meaning, small enough for speed\n")
sample_text = "The cat is sitting on the mat"
# Create an embedding for the sample sentence
embedding = model.encode(sample_text)
print(f"Text: '{sample_text}'")
print(f"Embedding shape: {embedding.shape}")
print(f"First 10 values: {embedding[:10]}")
print(f"Min value: {embedding.min():.4f}, Max value: {embedding.max():.4f}")
Output:
Loading embedding model: all-MiniLM-L6-v2
(This will download ~90MB on first run...)

✓ Model loaded successfully!
 Model name: all-MiniLM-L6-v2
 Embedding dimension: 384 (each text → 384 numbers)
 Large enough to capture meaning, small enough for speed

Text: 'The cat is sitting on the mat'
Embedding shape: (384,)
First 10 values: [ 0.12767275 -0.04230328 -0.02484117 0.03399577 -0.03703767 0.04820585 0.02150124 0.04595408 -0.00081577 0.03168121]
Min value: -0.1656, Max value: 0.1600
# Exercise 2.2: Embeddings Capture Semantic Similarity
# Test whether similar sentences have similar embeddings
from scipy.spatial.distance import cosine
# Create embeddings for related sentences
sentences = [
    "The cat is sitting on the mat",        # Reference sentence
    "A feline is resting on the rug",       # Same meaning, different words
    "The dog is running in the park",       # Different topic
    "Cats and dogs are different animals",  # Related but different
]
embeddings = model.encode(sentences, show_progress_bar=True)
print("Sentence Similarity Analysis")
print("=" * 70)
print(f"Reference: '{sentences[0]}'\n")
# Compare every other sentence against the reference, sentences[0]
for i, (sent, emb) in enumerate(zip(sentences[1:], embeddings[1:]), 1):
    # Cosine similarity = 1 - cosine distance
    similarity = 1 - cosine(embeddings[0], emb)
    print(f"{i}. '{sent}'")
    print(f" Similarity: {similarity:.4f}\n")
print("🔍 Key Insight:")
print(" - Reference vs. 1: HIGHEST similarity (same meaning, different words)")
print(" - Reference vs. 2: LOWEST similarity (different topic)")
print(" - Reference vs. 3: IN BETWEEN (related but distinct)")
Output:
Sentence Similarity Analysis
======================================================================
Reference: 'The cat is sitting on the mat'

1. 'A feline is resting on the rug'
 Similarity: 0.5514

2. 'The dog is running in the park'
 Similarity: 0.0801

3. 'Cats and dogs are different animals'
 Similarity: 0.2852

🔍 Key Insight:
 - Reference vs. 1: HIGHEST similarity (same meaning, different words)
 - Reference vs. 2: LOWEST similarity (different topic)
 - Reference vs. 3: IN BETWEEN (related but distinct)
Section 3: First Vector Database Search
Let's create a tiny vector database and search it. This demonstrates the retrieval step at the core of RAG:
- Store document embeddings
- Embed user query
- Find most similar documents
- Return relevant results
# Exercise 3.1: Create a Simple In-Memory Vector Database
# Reuses `model` from Exercise 2.1
from scipy.spatial.distance import cosine
# Sample documents (knowledge base)
documents = [
    "Python is a programming language used for data science",
    "Machine learning models can predict future values",
    "Natural language processing helps computers understand text",
    "Neural networks are inspired by the human brain",
    "Databases store and organize structured data",
    "Vector databases are optimized for similarity search",
]
# Embed all documents
doc_embeddings = model.encode(documents, show_progress_bar=False)
print(f"Created vector database with {len(documents)} documents")
print(f"Each embedding is {doc_embeddings.shape[1]} dimensions\n")
# Now search the database
def search_database(query, top_k=3):
    """Return the top_k (index, similarity, text) tuples most similar to the query."""
    # Embed the query so it can be compared with the document embeddings
    query_embedding = model.encode(query, show_progress_bar=False)
    # Calculate similarity with all documents
    similarities = []
    for i, doc_embedding in enumerate(doc_embeddings):
        similarity = 1 - cosine(query_embedding, doc_embedding)
        # if similarity > 0.5:  # Optional minimum similarity threshold (see Section 4)
        similarities.append((i, similarity, documents[i]))
    # Sort by similarity (highest first) and return the top-k
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]
# Test the search
test_queries = [
    "How do machines learn?",
    "What stores data?",
    "Tell me about vectors",
]
for query in test_queries:
    print(f"Query: '{query}'")
    print("-" * 70)
    results = search_database(query)
    for rank, (idx, sim, text) in enumerate(results, 1):
        print(f"{rank}. (similarity: {sim:.4f}) {text}")
    print()
Output:
Created vector database with 6 documents
Each embedding is 384 dimensions

Query: 'How do machines learn?'
----------------------------------------------------------------------
1. (similarity: 0.5072) Neural networks are inspired by the human brain
2. (similarity: 0.4105) Natural language processing helps computers understand text
3. (similarity: 0.4092) Machine learning models can predict future values

Query: 'What stores data?'
----------------------------------------------------------------------
1. (similarity: 0.5760) Databases store and organize structured data
2. (similarity: 0.2531) Python is a programming language used for data science
3. (similarity: 0.2487) Vector databases are optimized for similarity search

Query: 'Tell me about vectors'
----------------------------------------------------------------------
1. (similarity: 0.4354) Vector databases are optimized for similarity search
2. (similarity: 0.1974) Neural networks are inspired by the human brain
3. (similarity: 0.1537) Natural language processing helps computers understand text
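For contrast with the dense (embedding-based) search above, sparse retrieval ranks documents purely by keyword overlap. The sketch below hand-rolls a simplified Okapi BM25 scorer using only the standard library, so the mechanics are visible. The function names, the naive whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions made for illustration, and the IDF uses the smoothed, non-negative variant. In practice, the rank-bm25 package listed in Section 1 provides a ready-made implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF that never goes negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Same documents as Exercise 3.1
documents = [
    "Python is a programming language used for data science",
    "Machine learning models can predict future values",
    "Natural language processing helps computers understand text",
    "Neural networks are inspired by the human brain",
    "Databases store and organize structured data",
    "Vector databases are optimized for similarity search",
]

scores = bm25_scores("vector databases", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
print(f"Best keyword match: {documents[best]}")

# No document contains "how", "do", "machines", or "learn?" verbatim
no_overlap = bm25_scores("How do machines learn?", documents)
print(f"Scores for 'How do machines learn?': {no_overlap}")
```

The second query shows the trade-off: with no exact keyword overlap, every document scores zero here, whereas the embedding search in Exercise 3.1 still returned related documents for the same question.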
Section 4: TODO Exercise - Create Your Own Search
Your Task: Modify the search function above to:
- Let users specify the number of results (search_database already accepts top_k; pass different values when you call it)
- Add a minimum similarity threshold (skip results below the threshold)
- Show which documents ranked highest
Example Challenge: Find documents related to "learning" with a similarity of at least 0.5
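One possible shape for the threshold step is sketched below; try the exercise yourself first. The helper `filter_and_rank` and its `min_similarity` parameter are hypothetical names, and the helper works on the same `(index, similarity, text)` tuples that `search_database` builds. To keep the sketch runnable without loading the model, the scores are copied from the Exercise 3.1 output for the query "How do machines learn?".

```python
def filter_and_rank(scored, top_k=3, min_similarity=0.5):
    """Keep (index, similarity, text) tuples at or above min_similarity, best first."""
    kept = [item for item in scored if item[1] >= min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Scores copied from the Exercise 3.1 output for "How do machines learn?"
scored = [
    (1, 0.4092, "Machine learning models can predict future values"),
    (2, 0.4105, "Natural language processing helps computers understand text"),
    (3, 0.5072, "Neural networks are inspired by the human brain"),
]

strict = filter_and_rank(scored, top_k=3, min_similarity=0.5)
relaxed = filter_and_rank(scored, top_k=2, min_similarity=0.4)

print("Threshold 0.5:")
for rank, (idx, sim, text) in enumerate(strict, 1):
    print(f"{rank}. (similarity: {sim:.4f}) {text}")
print("Threshold 0.4, top 2:")
for rank, (idx, sim, text) in enumerate(relaxed, 1):
    print(f"{rank}. (similarity: {sim:.4f}) {text}")
```

With these scores a 0.5 threshold keeps only one document, so a sensible threshold depends on the embedding model and the data rather than on a fixed rule.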
Checkpoint ✓
By now you should be able to:
- ✓ Import all RAG libraries without errors
- ✓ Create text embeddings using sentence-transformers
- ✓ Understand that similar texts have similar vectors
- ✓ Search a vector database by similarity
- ✓ Understand the basic RAG pipeline
Key Concepts Summary
| Concept | What It Is | Why It Matters |
|---|---|---|
| Embedding | Text converted to vector (numbers) | Enables similarity comparison |
| Similarity | How close two vectors are | Finds relevant documents |
| Vector DB | Stores and searches vectors efficiently | Core of RAG systems |
| Cosine Similarity | Angle-based similarity measure | Works well for text (range -1 to 1; higher means more similar) |
| Semantic Search | Finding by meaning, not keywords | Captures intent better than keyword search |
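To make the cosine similarity row concrete, here is a small from-scratch sketch in plain Python. The helper `cosine_similarity` is written out by hand for illustration and is not a library function. Because the measure is the cosine of the angle between two vectors, it depends only on their direction, not their length, and in general falls between -1 and 1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(v, [2.0, 4.0, 6.0])   # parallel, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # at right angles
opposite = cosine_similarity(v, [-1.0, -2.0, -3.0])      # pointing the other way

print(f"same direction: {same_direction:.4f}")
print(f"orthogonal:     {orthogonal:.4f}")
print(f"opposite:       {opposite:.4f}")
```

The three cases come out at roughly 1, 0, and -1. The expression `1 - cosine(a, b)` used throughout this lab computes the same quantity, since SciPy's `cosine` returns the cosine distance.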
🚀 Next Steps
- Lab 1: Build vector math from scratch (understand the fundamentals)
- Lab 2: Explore embedding space and visualizations
- Lab 3: Work with a real Chroma vector database
- Lab 4: Ingest real MongoDB data
Command Cheat Sheet
# Load embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embedding
embedding = model.encode("your text here")
# Calculate similarity between two embeddings
from scipy.spatial.distance import cosine
embedding1 = model.encode("first text")
embedding2 = model.encode("second text")
similarity = 1 - cosine(embedding1, embedding2)
Lab 0 Complete! ✅
You now have a working RAG environment and understand the basic flow of semantic search. Move on to Lab 1 to understand the math behind vectors.
import numpy as np
import matplotlib.pyplot as plt
# Generate random samples from normal distribution
data = np.random.normal(loc=0, scale=1, size=10000)
plt.hist(data, bins=50)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Normal Distribution (μ=0, σ=1)")
plt.show()