Lab 0: Environment Setup & First Steps with RAG¶

Level: Foundations | Duration: 30 minutes

Objective¶

Set up a complete development environment for learning retrieval-augmented generation (RAG) and verify that every tool works. By the end of this lab, you'll have all dependencies installed and will have created your first embedding.

What You'll Learn¶

Install essential Python packages for RAG
Understand the difference between dense and sparse retrieval
Create your first text embedding
Verify GPU/CPU availability
Understand why each library matters

Prerequisites¶

Python 3.9 or higher
pip package manager
~2GB free disk space for models

Section 1: Core Libraries for RAG¶

The labs in this series rely on the following core libraries:

Library	Purpose	Why It Matters
numpy	Numerical computing	Vectors and matrix operations
pandas	Data manipulation	Loading, exploring datasets
sentence-transformers	Text embeddings	Converting text to vectors
scikit-learn	Machine learning	Distance metrics, evaluation
chromadb	Vector database	Storing and searching vectors
rank-bm25	Sparse retrieval	Keyword-based ranking (BM25)
matplotlib/seaborn	Visualization	Understanding vectors and results

Installation Command:

pip install numpy pandas sentence-transformers scikit-learn chromadb rank-bm25 matplotlib seaborn

In [ ]:

# Exercise 1.1: Verify Core Libraries Installation
# Run this cell to check all dependencies are installed

import importlib
import sys

print(f"Python Version: {sys.version}")
print("-" * 60)  # prints a divider line of 60 dashes

# Map of import name -> (pip package name, purpose).
# The two names differ for some packages (e.g. import sklearn, pip install scikit-learn).
libraries = {
    "numpy": ("numpy", "Numerical computing"),
    "pandas": ("pandas", "Data manipulation"),
    "sklearn": ("scikit-learn", "Machine learning utilities"),
    "scipy": ("scipy", "Distance functions (cosine)"),
    "sentence_transformers": ("sentence-transformers", "Text embeddings"),
    "chromadb": ("chromadb", "Vector database"),
    "rank_bm25": ("rank-bm25", "BM25 ranking algorithm"),
    "matplotlib": ("matplotlib", "Plotting"),
}

failed = []  # pip names of the packages that could not be imported
for lib, (pip_name, purpose) in libraries.items():
    try:
        importlib.import_module(lib)
        print(f"✓ {lib:25} ({purpose})")
    except ImportError:
        print(f"✗ {lib:25} MISSING!")
        failed.append(pip_name)

print("-" * 60)
if failed:
    print(f"\n⚠️ Missing libraries: {', '.join(failed)}")
    print(f"Install with: pip install {' '.join(failed)}")
else:
    print("\n✅ All libraries installed successfully!")
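
The learning goals list verifying GPU/CPU availability, which the import check above does not cover. A minimal sketch, assuming PyTorch (`torch`) is present because sentence-transformers depends on it. The helper name `detect_device` is hypothetical, and the function falls back to "cpu" if `torch` cannot be imported:

```python
def detect_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch  # normally installed as a dependency of sentence-transformers
    except ImportError:
        return "cpu"  # assumption: without torch, treat the machine as CPU-only

    if torch.cuda.is_available():  # NVIDIA GPU with a working CUDA setup
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # Apple Silicon backend (newer torch only)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = detect_device()
print(f"Compute device: {device}")
```

The result can be passed to the model constructor, for example `SentenceTransformer('all-MiniLM-L6-v2', device=device)`. Everything in this lab is small enough to run on CPU as well.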

Section 2: Your First Embedding¶

Now let's create your first text embedding using a pretrained transformer model!

Key Concept: Embedding models convert text into numerical vectors that capture semantic meaning. Similar texts will have similar vectors.

In [14]:

# Exercise 2.1: Load Embedding Model and Create First Embedding
# Note: First run downloads the model (~500MB); later runs load it from the local cache

from sentence_transformers import SentenceTransformer
import numpy as np

print("Loading embedding model: all-MiniLM-L6-v2")
print("(This will download ~500MB on first run...)\n")

# Load a lightweight but effective embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("✓ Model loaded successfully!")
print("  Model name: all-MiniLM-L6-v2")
print("  Embedding dimension: 384 (each text → 384 numbers)")
print("  Large enough to capture meaning, small enough for speed\n")

sample_text = "The cat is sitting on the mat"

# Create an embedding for the sample sentence (returns a 1-D NumPy array)
embedding = model.encode(sample_text)

print(f"Text: '{sample_text}'")
print(f"Embedding shape: {embedding.shape}")
print(f"First 10 values: {embedding[:10]}")
print(f"Min value: {embedding.min():.4f}, Max value: {embedding.max():.4f}")

Loading embedding model: all-MiniLM-L6-v2
(This will download ~500MB on first run...)

✓ Model loaded successfully!
  Model name: all-MiniLM-L6-v2
  Embedding dimension: 384 (each text → 384 numbers)
  Large enough to capture meaning, small enough for speed

Text: 'The cat is sitting on the mat'
Embedding shape: (384,)
First 10 values: [ 0.12767275 -0.04230328 -0.02484117  0.03399577 -0.03703767  0.04820585
  0.02150124  0.04595408 -0.00081577  0.03168121]
Min value: -0.1656, Max value: 0.1600
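
The small component values above (all within about ±0.17) are what you would expect from a unit-length vector spread over 384 dimensions. The sketch below illustrates that arithmetic on a random stand-in vector, not on a real embedding. It assumes the model L2-normalizes its output, which is believed to hold for all-MiniLM-L6-v2 but is not verified anywhere in this lab:

```python
import math
import random

DIM = 384  # embedding dimension reported above

# Stand-in for an embedding: random values scaled to unit length (L2 norm = 1)
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
norm = math.sqrt(sum(v * v for v in raw))
unit = [v / norm for v in raw]

unit_norm = math.sqrt(sum(v * v for v in unit))
rms = math.sqrt(sum(v * v for v in unit) / DIM)  # typical size of one component

print(f"L2 norm after scaling: {unit_norm:.4f}")
print(f"1/sqrt({DIM}): {1 / math.sqrt(DIM):.4f}")
print(f"RMS of components: {rms:.4f}")
print(f"Largest component magnitude: {max(abs(v) for v in unit):.4f}")
```

To check the real embedding, print `np.linalg.norm(embedding)`; a value close to 1.0 would confirm the assumption.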

In [15]:

# Exercise 2.2: Embeddings Capture Semantic Similarity
# Test if similar sentences have similar embeddings

from scipy.spatial.distance import cosine  # returns cosine *distance* (1 - cosine similarity)

# Sentence 1 is the reference; sentences 2-4 are compared against it
# and appear as items 1-3 in the printed output
sentences = [
    "The cat is sitting on the mat",  # Sentence 1: reference
    "A feline is resting on the rug",  # Sentence 2: same meaning, different words
    "The dog is running in the park",  # Sentence 3: different topic
    "Cats and dogs are different animals",  # Sentence 4: related but distinct
]

embeddings = model.encode(sentences, show_progress_bar=True)

print("Sentence Similarity Analysis")
print("=" * 70)
print(f"Reference: '{sentences[0]}'\n")

for i, (sent, emb) in enumerate(zip(sentences[1:], embeddings[1:]), 1):
    # Cosine similarity = 1 - cosine distance
    similarity = 1 - cosine(embeddings[0], emb)
    print(f"{i}. '{sent}'")
    print(f"   Similarity: {similarity:.4f}\n")

print("🔍 Key Insight:")
print("   - Sentence 1 & 2: HIGH similarity (same meaning, different words)")
print("   - Sentence 1 & 3: LOW similarity (different topic)")
print("   - Sentence 1 & 4: MEDIUM similarity (related but distinct)")

Sentence Similarity Analysis
======================================================================
Reference: 'The cat is sitting on the mat'

1. 'A feline is resting on the rug'
   Similarity: 0.5514

2. 'The dog is running in the park'
   Similarity: 0.0801

3. 'Cats and dogs are different animals'
   Similarity: 0.2852

🔍 Key Insight:
   - Sentence 1 & 2: HIGH similarity (same meaning, different words)
   - Sentence 1 & 3: LOW similarity (different topic)
   - Sentence 1 & 4: MEDIUM similarity (related but distinct)
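
The `1 - cosine(...)` expression in the cell above works because SciPy's `cosine` returns a distance, not a similarity. A minimal pure-Python sketch of the underlying formula, (a · b) / (‖a‖ ‖b‖), follows. `cosine_similarity` is a hypothetical helper name and the vectors are toy values, not real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D vectors (made up for illustration)
v = [1.0, 2.0]
same_direction = [2.0, 4.0]  # v scaled by 2: length differs, direction identical
orthogonal = [-2.0, 1.0]     # perpendicular to v
opposite = [-1.0, -2.0]      # points the other way

print(cosine_similarity(v, same_direction))  # approximately 1
print(cosine_similarity(v, orthogonal))      # approximately 0
print(cosine_similarity(v, opposite))        # approximately -1
```

Because only the angle matters, scaling a vector does not change the score. The scores in this lab happen to be positive, but negative values are possible in general.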

Section 3: First Vector Database Search¶

Let's create a tiny vector database and search it. This demonstrates the retrieval step at the heart of RAG:

Store document embeddings
Embed user query
Find most similar documents
Return relevant results

In [18]:

# Exercise 3.1: Create a Simple In-Memory Vector Database
# Reuses `model` from Exercise 2.1 and `cosine` from Exercise 2.2

# Sample documents (knowledge base)
documents = [
    "Python is a programming language used for data science",
    "Machine learning models can predict future values",
    "Natural language processing helps computers understand text",
    "Neural networks are inspired by the human brain",
    "Databases store and organize structured data",
    "Vector databases are optimized for similarity search",
]

# Embed all documents (one row per document)
doc_embeddings = model.encode(documents, show_progress_bar=False)

print(f"Created vector database with {len(documents)} documents")
print(f"Each embedding is {doc_embeddings.shape[1]} dimensions\n")

# Now search the database
def search_database(query, top_k=3):
    """Return the top_k (index, similarity, text) tuples most similar to the query."""
    # Embed the query with the same model so it shares a vector space with the documents
    query_embedding = model.encode(query, show_progress_bar=False)
    
    # Calculate similarity with all documents
    similarities = []
    for i, doc_embedding in enumerate(doc_embeddings):
        similarity = 1 - cosine(query_embedding, doc_embedding)
        # Section 4 hint: a minimum-similarity check here (e.g. similarity > 0.5) would skip weak matches
        similarities.append((i, similarity, documents[i]))
    
    # Sort by similarity and return top-k
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

# Test the search
test_queries = [
    "How do machines learn?",
    "What stores data?",
    "Tell me about vectors"
]

for query in test_queries:
    print(f"Query: '{query}'")
    print("-" * 70)
    results = search_database(query)
    for rank, (idx, sim, text) in enumerate(results, 1):
        print(f"{rank}. (similarity: {sim:.4f}) {text}")
    print()

Created vector database with 6 documents
Each embedding is 384 dimensions

Query: 'How do machines learn?'
----------------------------------------------------------------------
1. (similarity: 0.5072) Neural networks are inspired by the human brain
2. (similarity: 0.4105) Natural language processing helps computers understand text
3. (similarity: 0.4092) Machine learning models can predict future values

Query: 'What stores data?'
----------------------------------------------------------------------
1. (similarity: 0.5760) Databases store and organize structured data
2. (similarity: 0.2531) Python is a programming language used for data science
3. (similarity: 0.2487) Vector databases are optimized for similarity search

Query: 'Tell me about vectors'
----------------------------------------------------------------------
1. (similarity: 0.4354) Vector databases are optimized for similarity search
2. (similarity: 0.1974) Neural networks are inspired by the human brain
3. (similarity: 0.1537) Natural language processing helps computers understand text
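
The searches above are dense retrieval: queries and documents are compared as embedding vectors. For contrast, sparse retrieval scores documents by exact keyword overlap. The sketch below is a simplified BM25 written from scratch for illustration; it is not the rank-bm25 package's implementation. The `tokenize` and `bm25_scores` helpers, the naive whitespace tokenizer, and the `k1=1.5`, `b=0.75` defaults are all assumptions:

```python
import math
from collections import Counter

documents = [
    "Python is a programming language used for data science",
    "Machine learning models can predict future values",
    "Natural language processing helps computers understand text",
    "Neural networks are inspired by the human brain",
    "Databases store and organize structured data",
    "Vector databases are optimized for similarity search",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [tok.strip(".,?!'\"") for tok in text.lower().split()]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact keyword match, so no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores


for query in ["What stores data?", "How do machines learn?"]:
    print(f"Query: '{query}'")
    for doc, score in zip(documents, bm25_scores(query, documents)):
        print(f"  {score:.3f}  {doc}")
    print()
```

With this naive tokenizer, "How do machines learn?" scores zero against every document because none contains those exact tokens ("machine" and "learning" do not match "machines" and "learn"), whereas the embedding search above still surfaced related documents.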

Section 4: TODO Exercise - Create Your Own Search¶

Your Task: Modify the search function above to:

Allow users to specify the number of results (top_k parameter)
Add a minimum similarity threshold (skip results below threshold)
Show which documents ranked highest

Example Challenge: Find documents related to "learning" with at least 0.5 similarity
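
One possible shape for this exercise, sketched on precomputed results so it runs without the model. `filter_and_rank` and `min_similarity` are hypothetical names; in your own solution the same check would live inside `search_database`:

```python
def filter_and_rank(scored_docs, top_k=3, min_similarity=0.0):
    """Keep results at or above min_similarity, best first, at most top_k of them."""
    kept = [item for item in scored_docs if item[1] >= min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# (index, similarity, text) tuples copied from the "How do machines learn?" output above
scored = [
    (3, 0.5072, "Neural networks are inspired by the human brain"),
    (2, 0.4105, "Natural language processing helps computers understand text"),
    (1, 0.4092, "Machine learning models can predict future values"),
]

results = filter_and_rank(scored, top_k=2, min_similarity=0.4)
for rank, (idx, sim, text) in enumerate(results, 1):
    print(f"{rank}. (similarity: {sim:.4f}) {text}")
```

Raising `min_similarity` to 0.5 on this sample leaves only the first result, and a threshold that nothing meets returns an empty list, so callers should handle the "no results" case.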

Checkpoint ✓¶

By now you should be able to:

✓ Import all RAG libraries without errors
✓ Create text embeddings using sentence-transformers
✓ Understand that similar texts have similar vectors
✓ Search a vector database by similarity
✓ Understand the basic RAG pipeline

Key Concepts Summary¶

Concept	What It Is	Why It Matters
Embedding	Text converted to vector (numbers)	Enables similarity comparison
Similarity	How close two vectors are	Finds relevant documents
Vector DB	Stores and searches vectors efficiently	Core of RAG systems
Cosine Similarity	Cosine of the angle between two vectors	Ignores vector length, works well for text (range -1 to 1)
Semantic Search	Finding by meaning, not keywords	Captures intent better than keyword search

🚀 Next Steps¶

Lab 1: Build vector math from scratch (understand the fundamentals)
Lab 2: Explore embedding space and visualizations
Lab 3: Work with a real Chroma vector database
Lab 4: Ingest real MongoDB data

Command Cheat Sheet¶

# Load embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embedding
embedding = model.encode("your text here")

# Calculate similarity
from scipy.spatial.distance import cosine
similarity = 1 - cosine(embedding1, embedding2)

Lab 0 Complete! ✅

You now have a working RAG environment and understand the basic flow of semantic search. Move on to Lab 1 to understand the math behind vectors.

In [2]:

import numpy as np
import matplotlib.pyplot as plt

# Generate random samples from normal distribution
data = np.random.normal(loc=0, scale=1, size=10000)

plt.hist(data, bins=50)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Normal Distribution (μ=0, σ=1)")
plt.show()

[Figure: histogram titled "Normal Distribution (μ=0, σ=1)"; x-axis: Value, y-axis: Frequency]