Lab 1: Vector Math & Distance Metrics
Level: Foundations | Duration: 1.5 hours
Objective
Understand vectors and distance metrics from first principles. Build geometric intuition for why embeddings work.
What You'll Learn
- Vector operations: magnitude, dot product, normalization
- Distance metrics: Euclidean, Manhattan, Cosine similarity
- Why cosine similarity is well suited to text embeddings
- Visualizing vectors in 2D and 3D space
- Computational complexity of different metrics
Prerequisites
- Lab 0 completed
- Basic Python knowledge
- Comfortable with mathematical notation
Core Concepts Refresher
A vector is an ordered list of numbers:
- v = [3, 4] (2D vector)
- v = [1, 2, 3] (3D vector)
- v = [v₁, v₂, ..., vₙ] (n-dimensional vector)
# Exercise 1.1: Implement Vector Operations from Scratch (No NumPy!)
import math


class Vector:
    """A simple n-dimensional vector implementation from scratch."""

    def __init__(self, components):
        # Copy into a list first so any iterable (tuple, generator) works
        self.components = list(components)
        self.dim = len(self.components)

    def magnitude(self):
        """Calculate the length (magnitude) of the vector."""
        return math.sqrt(sum(x**2 for x in self.components))

    def dot_product(self, other):
        """Calculate the dot product with another vector."""
        if self.dim != other.dim:
            raise ValueError("Vectors must have same dimension")
        return sum(a * b for a, b in zip(self.components, other.components))

    def normalize(self):
        """Return a normalized copy (magnitude = 1)."""
        mag = self.magnitude()
        if mag == 0:
            raise ValueError("Cannot normalize zero vector")
        return Vector([x / mag for x in self.components])

    def cosine_similarity(self, other):
        """Calculate cosine similarity with another vector."""
        dot = self.dot_product(other)
        mag1 = self.magnitude()
        mag2 = other.magnitude()
        # The angle is undefined for a zero vector; return 0.0 by convention
        if mag1 == 0 or mag2 == 0:
            return 0.0
        return dot / (mag1 * mag2)

    def __repr__(self):
        return f"Vector({self.components})"


# Test the implementation
v1 = Vector([3, 4])
v2 = Vector([1, 0])
print("Vector Operations Demo")
print("=" * 50)
print(f"v1 = {v1}")
print(f"v2 = {v2}\n")
print(f"Magnitude of v1: {v1.magnitude()}")
print(" (This is the length of the vector, √(3² + 4²) = 5)\n")
print(f"Dot product (v1 · v2): {v1.dot_product(v2)}")
print(" (This is 3*1 + 4*0 = 3)\n")
print(f"v1 normalized: {v1.normalize()}")
print(" (Unit vector pointing in same direction)\n")
print(f"Cosine similarity: {v1.cosine_similarity(v2):.4f}")
print(" (Ranges from -1 to 1, measures angle between vectors)")
Section 2: Distance Metrics Comparison
Different ways to measure distance between vectors, each with different properties:
| Metric | Formula | Best For | Range |
|---|---|---|---|
| Euclidean | √(Σ(aᵢ - bᵢ)²) | Physical distances, clustering | [0, ∞) |
| Manhattan | Σ\|aᵢ - bᵢ\| | Grid-like spaces, robust to outliers | [0, ∞) |
| Cosine Similarity | (a · b) / (‖a‖ × ‖b‖) | Text embeddings, high-dimensional data | [-1, 1] |
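Key Insight: For text embeddings (for example, 384-dimensional vectors), cosine similarity is usually the default choice because it:
- Depends only on direction, not magnitude
- Is insensitive to vector length, so long and short documents on the same topic can still score as similar
- Is cheap to compute: for unit-normalized vectors it reduces to a single dot product
- Gives scores that are easy to read (as a rough guide, 0.9 = very similar, 0.5 = somewhat related, 0.1 = unrelated; exact thresholds depend on the embedding model)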
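The direction-only and efficiency points above can be checked numerically. The following is a minimal sketch, assuming plain Python lists as vectors; the helper names `dot`, `cosine`, and `unit` are chosen here for illustration and are not part of the lab's code. It checks that scaling a vector leaves cosine similarity unchanged, and that for unit-length vectors cosine similarity equals a plain dot product.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def unit(a):
    """Scale a vector to length 1; assumes it is not all zeros."""
    mag = math.sqrt(dot(a, a))
    return [x / mag for x in a]

a = [3.0, 4.0]
b = [1.0, 2.0]
scaled_b = [10 * x for x in b]  # same direction as b, ten times longer

# Scaling b does not change the angle, so the similarity is unchanged.
same_direction = math.isclose(cosine(a, b), cosine(a, scaled_b))

# After normalization, the dot product alone gives the cosine similarity.
dot_matches = math.isclose(cosine(a, b), dot(unit(a), unit(b)))

print(same_direction, dot_matches)  # True True
```

This is why vector stores often normalize embeddings once at indexing time: each later comparison then needs only a dot product.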
# Exercise 2.1: Implement Distance Metrics
import math


def euclidean_distance(v1, v2):
    """L2 distance"""
    if len(v1) != len(v2):
        raise ValueError("Vectors must have same dimension")
    return math.sqrt(sum((a - b)**2 for a, b in zip(v1, v2)))


def manhattan_distance(v1, v2):
    """L1 distance"""
    if len(v1) != len(v2):
        raise ValueError("Vectors must have same dimension")
    return sum(abs(a - b) for a, b in zip(v1, v2))


def cosine_similarity(v1, v2):
    """Range [-1, 1] in general, [0, 1] when all components are non-negative"""
    # zip() silently truncates, so check dimensions like the other metrics do
    if len(v1) != len(v2):
        raise ValueError("Vectors must have same dimension")
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = math.sqrt(sum(a**2 for a in v1))
    mag2 = math.sqrt(sum(b**2 for b in v2))
    # The angle is undefined for a zero vector; return 0.0 by convention
    if mag1 == 0 or mag2 == 0:
        return 0.0
    return dot / (mag1 * mag2)


# Compare metrics on sample vectors
test_pairs = [
    ([1, 0], [0, 1], "Orthogonal (perpendicular)"),
    ([1, 0], [1, 0.1], "Nearly same direction"),
    ([1, 0], [2, 0], "Same direction, different magnitude"),
    ([1, 0], [-1, 0], "Opposite direction"),
]
print("Distance Metrics Comparison")
print("=" * 80)
for v1, v2, desc in test_pairs:
    print(f"\n{desc}")
    print(f" v1 = {v1}, v2 = {v2}")
    print(f" Euclidean: {euclidean_distance(v1, v2):.4f}")
    print(f" Manhattan: {manhattan_distance(v1, v2):.4f}")
    print(f" Cosine Simil.: {cosine_similarity(v1, v2):+.4f}")
    # Convert cosine similarity to a distance (1 - similarity)
    print(f" Cosine Distance: {1 - cosine_similarity(v1, v2):.4f}")
Distance Metrics Comparison
================================================================================

Orthogonal (perpendicular)
 v1 = [1, 0], v2 = [0, 1]
 Euclidean: 1.4142
 Manhattan: 2.0000
 Cosine Simil.: +0.0000
 Cosine Distance: 1.0000

Nearly same direction
 v1 = [1, 0], v2 = [1, 0.1]
 Euclidean: 0.1000
 Manhattan: 0.1000
 Cosine Simil.: +0.9950
 Cosine Distance: 0.0050

Same direction, different magnitude
 v1 = [1, 0], v2 = [2, 0]
 Euclidean: 1.0000
 Manhattan: 1.0000
 Cosine Simil.: +1.0000
 Cosine Distance: 0.0000

Opposite direction
 v1 = [1, 0], v2 = [-1, 0]
 Euclidean: 2.0000
 Manhattan: 2.0000
 Cosine Simil.: -1.0000
 Cosine Distance: 2.0000
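The output above shows Euclidean distance and cosine similarity disagreeing when magnitudes differ. Once vectors are scaled to unit length, the two are tied together by the identity ‖a − b‖² = 2(1 − cos θ), so they rank neighbors in the same order. The sketch below checks that identity on three of the test pairs; it is an illustration only, and the helper names `unit`, `euclidean`, and `cosine` are assumptions introduced here rather than the lab's functions.

```python
import math

def unit(v):
    """Scale a vector to length 1; assumes it is not all zeros."""
    mag = math.sqrt(sum(x * x for x in v))
    return [x / mag for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

pairs = [([1, 0], [0, 1]), ([1, 0], [1, 0.1]), ([1, 0], [-1, 0])]
identity_holds = []
for a, b in pairs:
    ua, ub = unit(a), unit(b)
    lhs = euclidean(ua, ub) ** 2       # squared Euclidean distance
    rhs = 2 * (1 - cosine(ua, ub))     # twice the cosine distance
    identity_holds.append(math.isclose(lhs, rhs, abs_tol=1e-12))

print(identity_holds)  # [True, True, True]
```

In practice this means that for normalized embeddings the choice between Euclidean and cosine affects the score values but not the ordering of results.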
# Exercise 2.2: Visualize Vectors in 2D Space
# Reuses cosine_similarity and euclidean_distance from Exercise 2.1
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Basic vectors
ax = axes[0]
v1, v2 = [1, 2], [2, 1]
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='red', label='v1')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='blue', label='v2')
ax.set_xlim(-0.5, 3)
ax.set_ylim(-0.5, 3)
ax.grid(True)
ax.set_aspect('equal')
ax.legend()
ax.set_title(f'Vectors\ncos_similarity = {cosine_similarity(v1, v2):.3f}')
ax.set_xlabel('x')
ax.set_ylabel('y')

# Plot 2: Distance illustration
ax = axes[1]
points = [(0, 0), (3, 4), (4, 2), (1, 3)]
# 'blue' rather than 'white' so the v1 marker is visible on a white background
colors = ['red', 'blue', 'green', 'orange']
for (x, y), c, label in zip(points, colors, ['Origin', 'v1', 'v2', 'v3']):
    ax.plot(x, y, 'o', color=c, markersize=10, label=label)
# Dashed line showing the Euclidean distance between v1 and v2
ax.plot([points[1][0], points[2][0]], [points[1][1], points[2][1]], 'k--', alpha=0.5)
ax.set_xlim(-1, 5)
ax.set_ylim(-1, 5)
ax.grid(True)
ax.set_aspect('equal')
ax.legend()
ax.set_title(f'Distances\nEuclidean = {euclidean_distance([3, 4], [4, 2]):.2f}')
ax.set_xlabel('x')
ax.set_ylabel('y')

# Plot 3: Cosine similarity visualization
ax = axes[2]
angles = np.linspace(0, 2 * np.pi, 100)
v_ref = [1, 0]  # reference vector along the x-axis
similarity_scores = []
for angle in angles:
    v_angle = [np.cos(angle), np.sin(angle)]
    similarity_scores.append(cosine_similarity(v_ref, v_angle))
ax.plot(np.degrees(angles), similarity_scores, 'o-', markersize=3)
ax.axhline(y=0, color='k', linestyle='-', alpha=0.3)
ax.axhline(y=0.5, color='g', linestyle='--', alpha=0.5, label='0.5 threshold')
ax.fill_between(np.degrees(angles), 0.5, 1, alpha=0.2, color='green')
ax.set_xlabel('Angle (degrees)')
ax.set_ylabel('Cosine Similarity')
ax.set_title('Cosine Similarity vs Angle')
ax.grid(True)
ax.legend()

plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print(" - Cosine similarity is 1 when vectors point in same direction (0°)")
print(" - Cosine similarity is 0 when vectors are perpendicular (90°)")
print(" - Cosine similarity is -1 when vectors point opposite (180°)")
📊 Key Observations:
 - Cosine similarity is 1 when vectors point in same direction (0°)
 - Cosine similarity is 0 when vectors are perpendicular (90°)
 - Cosine similarity is -1 when vectors point opposite (180°)
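These observations can also be read in reverse: given a cosine similarity, the angle between the vectors follows from the inverse cosine. A minimal sketch, assuming plain list vectors with no zero vector among them; `angle_degrees` is an illustrative helper, not part of the lab's code.

```python
import math

def angle_degrees(a, b):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    cos_theta = dot / (mag_a * mag_b)
    # Clamp to [-1, 1]: rounding error can push the ratio slightly outside,
    # which would make math.acos raise a ValueError.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

reference = [1, 0]
for other in ([2, 0], [0, 3], [-1, 0], [1, 1]):
    print(other, round(angle_degrees(reference, other), 1))
```

The first three vectors correspond to the 0°, 90°, and 180° cases listed above; the last one sits halfway between the first two.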
Challenge Exercise (Optional)
Problem: Given vectors v1 = [1, 2, 3] and v2 = [4, 5, 6], calculate:
- Magnitude of each vector
- Normalized versions of each vector
- Dot product
- All three metrics (Euclidean distance, Manhattan distance, cosine similarity)
Then predict which metric would work best for finding similar documents.
Summary & Key Takeaways
✓ Vectors are ordered lists of numbers (coordinates in space)
✓ Magnitude is the length of a vector: √(sum of squares)
✓ Dot product measures how aligned two vectors are, scaled by their magnitudes
✓ Cosine similarity is the usual choice for text embeddings because it:
- Depends only on direction (angle), not scale
- Is insensitive to differences in text length
- Reduces to a single dot product for unit-normalized vectors
✓ Why embeddings work:
- Similar texts → similar vectors
- Dissimilar texts → distant vectors
- A distance or similarity metric quantifies how close two texts are
Real-World Application
When you ask your RAG system:
"What's the capital of France?"
The system:
- Converts the question into an embedding (for example, a 384-dimensional vector)
- Compares it to the stored document embeddings using cosine similarity
- Returns the documents with the highest similarity scores
- Passes the retrieved texts to the LLM, which writes the answer
The embedding model places texts with similar meaning close together, so cosine similarity can surface related documents even when their wording differs from the question.
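The retrieval steps above can be sketched in a few lines. This is a toy illustration, not a real RAG pipeline: the 3-dimensional "embeddings" and document names below are hand-written, hypothetical values standing in for the output of an embedding model, and `top_k` is a helper name assumed here.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot / (mag_a * mag_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (score, doc_id) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Hypothetical hand-written vectors; a real system would get these
# from an embedding model and would use far more dimensions.
docs = {
    "paris_is_capital": [0.9, 0.1, 0.0],
    "french_cuisine": [0.6, 0.7, 0.1],
    "python_tutorial": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # stand-in for the embedded question

results = top_k(query, docs)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

Scanning every document like this costs one similarity computation per stored vector, which is fine for small collections; larger systems typically rely on an approximate nearest-neighbor index instead.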
Lab 1 Complete! ✅
You now understand the mathematical foundation of embeddings and why cosine similarity works. Ready for Lab 2: Creating real text embeddings.