Lab 3: Vector Database Basics with Chroma & MongoDB Atlas Vector Search
Level: Intermediate | Duration: 2 hours
Objective
Learn how to store, index, and search vectors using a production-grade vector database.
What You'll Learn
- Create and configure a Chroma vector database
- Store embeddings with metadata
- Perform similarity search
- Understand HNSW indexing
- Handle update and delete operations
- Persist database to disk
- Replicate the same workflow with MongoDB Atlas Vector Search
Why Chroma?
- ✅ Easy to learn (Python API)
- ✅ Built-in embeddings support
- ✅ Metadata filtering
- ✅ Persistent storage
- ✅ Works locally (no server setup)
- ✅ Perfect for prototyping and learning
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize Chroma client (persistent storage on local disk)
client = chromadb.PersistentClient(path="./chroma_data")

# Create the collection, or reuse it if it already exists.
# "hnsw:space": "cosine" makes the HNSW index rank by cosine distance.
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)
print("Created Chroma collection: documents")
print("Storage space: cosine (best for embeddings)")
print("HNSW index: Hierarchical Navigable Small World\n")

# Load embedding model (produces 384-dimensional vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample documents (same set reused for MongoDB section below)
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing helps computers understand text",
    "Computer vision analyzes and interprets images",
    "Data science combines statistics and programming"
]

# Add documents to collection, one embedding per document
for i, doc in enumerate(documents):
    embedding = model.encode(doc)
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding.tolist()],
        documents=[doc],
        metadatas=[{"source": "sample", "index": i}]
    )
print(f"Added {len(documents)} documents to Chroma\n")

# Search the collection
query = "What is deep learning?"
query_embedding = model.encode(query)
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=3
)
print(f"Query: '{query}'")
print("Top-3 Most Similar Documents:")
print("-" * 60)
# results[...][0] holds the hits for the first (and only) query vector
for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
    similarity = 1 - dist  # Convert cosine distance to cosine similarity
    print(f"{i}. (similarity: {similarity:.4f})")
    print(f" {doc}\n")
print("✅ Chroma section complete!")
Created Chroma collection: documents
Storage space: cosine (best for embeddings)
HNSW index: Hierarchical Navigable Small World

Added 5 documents to Chroma

Query: 'What is deep learning?'
Top-3 Most Similar Documents:
------------------------------------------------------------
1. (similarity: 0.6180)
 Deep learning uses neural networks with multiple layers

2. (similarity: 0.5231)
 Machine learning is a subset of artificial intelligence

3. (similarity: 0.2908)
 Natural language processing helps computers understand text

✅ Chroma section complete!
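To make the 1 - dist conversion in the cell above concrete, here is a minimal, dependency-free sketch of what a cosine-space query computes. It scans every stored vector exhaustively, whereas Chroma answers queries through an approximate HNSW index. The function names and the toy 3-dimensional vectors are illustrative assumptions, not part of Chroma's API.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, so 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(query_vec, vectors, n_results=3):
    """Exhaustive top-k search: score every stored vector, then sort by distance."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Toy vectors (hypothetical; real all-MiniLM-L6-v2 embeddings have 384 dimensions)
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

for doc_id, dist in brute_force_query([1.0, 0.0, 0.0], toy_vectors, n_results=2):
    print(f"{doc_id}: distance={dist:.4f}, similarity={1 - dist:.4f}")
```

An exhaustive scan costs one distance computation per stored vector for every query, which is the cost an HNSW index is designed to avoid on larger collections.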
Part 2: Same Workflow on MongoDB Atlas Vector Search
MongoDB Atlas (free tier M0) supports vector search natively using the same cosine similarity approach.
Prerequisites
- Create a free Atlas account if you haven't already.
- Create a free M0 cluster (any region).
- Under Security → Database Access, create a user with read/write access.
- Under Security → Network Access, allow access from your IP (or 0.0.0.0/0 for testing).
- Copy your connection string from Connect → Drivers (Python 3.12+).
- Install the driver and python-dotenv (used below to load ATLAS_URI from a .env file):
pip install pymongo python-dotenv
How Atlas Vector Search works vs Chroma
| Step | Chroma | MongoDB Atlas |
|---|---|---|
| Store vectors | collection.add(embeddings=...) | Insert documents with an embedding field |
| Create index | Automatic (HNSW built-in) | Create a Search Index in Atlas UI or API |
| Query | collection.query(query_embeddings=...) | $vectorSearch aggregation stage |
| Result score | Cosine distance (0 = identical) | score field via $meta: "vectorSearchScore" |
Key difference: In Atlas you create the vector search index once in the UI, then query it with an aggregation pipeline.
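The Result score row hides a difference in scale. Chroma's cosine space reports a distance d = 1 - cos, while Atlas documents its cosine score as normalized to the range 0 to 1 via (1 + cos) / 2. The sketch below converts one to the other; the function name is an assumption made for illustration. The outputs recorded in this lab are consistent with the mapping: the top hit has a Chroma similarity of 0.6180 and an Atlas score of 0.8090.

```python
def atlas_score_from_chroma_distance(distance):
    """Map a Chroma cosine distance (1 - cos) to an Atlas cosine score ((1 + cos) / 2)."""
    cosine_similarity = 1.0 - distance
    return (1.0 + cosine_similarity) / 2.0


# Chroma similarities recorded in this lab (similarity = 1 - distance)
for similarity in (0.6180, 0.5231, 0.2908):
    distance = 1.0 - similarity
    score = atlas_score_from_chroma_distance(distance)
    print(f"Chroma similarity {similarity:.4f} -> Atlas score {score:.4f}")
```

Because the mapping is monotonic, both systems rank the documents in the same order; only the absolute numbers differ.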
import os
from pathlib import Path

from dotenv import load_dotenv
from pymongo import MongoClient

# Load ATLAS_URI from the first .env found while walking up directories.
for _base in [Path.cwd(), *Path.cwd().parents]:
    _candidate = _base / ".env"
    if _candidate.exists():
        load_dotenv(dotenv_path=_candidate, override=False)
        break

# ─────────────────────────────────────────────────────────────
# STEP 1: Set your connection string (keep this private!)
# Copy it from Atlas → Connect → Drivers → Python
# ─────────────────────────────────────────────────────────────
ATLAS_URI = os.getenv("ATLAS_URI", "").strip()
if not ATLAS_URI:
    raise RuntimeError("ATLAS_URI is not set; add it to a .env file or the environment")

# ─────────────────────────────────────────────────────────────
# STEP 2: Connect and insert documents with their embeddings
# ─────────────────────────────────────────────────────────────
# tlsAllowInvalidCertificates=True bypasses corporate TLS inspection.
# ⚠ Use ONLY for local dev/testing; never in production.
# If you are on a personal network remove this option and use:
# mongo_client = MongoClient(ATLAS_URI)
mongo_client = MongoClient(
    ATLAS_URI,
    tls=True,
    tlsAllowInvalidCertificates=True,  # workaround for corporate TLS proxy
    serverSelectionTimeoutMS=30000,
)

# Quick ping to confirm connection before doing anything else
mongo_client.admin.command("ping")
print("✅ Connected to MongoDB Atlas")
✅ Connected to MongoDB Atlas
DB_NAME = "rag_lab"
COLL_NAME = "documents"
db = mongo_client[DB_NAME]
collection_mg = db[COLL_NAME]

# Clear any previous run data so re-running is safe.
# delete_many({}) is used instead of drop() because dropping the
# collection would also remove a search index already defined on it.
collection_mg.delete_many({})

# Build records: each document gets a text field + its embedding vector
records = []
for i, doc in enumerate(documents):
    embedding = model.encode(doc).tolist()  # list[float], as Atlas requires
    records.append({
        "_id": f"doc_{i}",
        "text": doc,
        "source": "lab3_sample",
        "embedding": embedding  # 384-dim vector (all-MiniLM-L6-v2)
    })
collection_mg.insert_many(records)
print(f"Inserted {len(records)} documents into Atlas collection '{COLL_NAME}'")
print("Sample record keys:", list(records[0].keys()))
Inserted 5 documents into Atlas collection 'documents'
Sample record keys: ['_id', 'text', 'source', 'embedding']
Retrieve one document
doc_id = "doc_0"
# Project only the fields we want to inspect
doc = collection_mg.find_one({"_id": doc_id}, {"_id": 1, "embedding": 1, "text": 1})
if not doc:
    print(f"Document not found: {doc_id}")
else:
    emb = doc.get("embedding")
    print("Found:", doc["_id"])
    print("Text:", (doc.get("text") or "")[:120], "...")
    print("Embedding length:", len(emb) if emb else 0)
    print("First 10 values:", emb[:10] if emb else None)
Found: doc_0
Text: Machine learning is a subset of artificial intelligence ...
Embedding length: 384
First 10 values: [-0.04610738530755043, -0.004260687157511711, 0.0698365792632103, 0.035535287111997604, 0.048502057790756226, -0.030225230380892754, 0.001603968907147646, -0.009542404673993587, -0.05142451077699661, -0.0038602121639996767]
STEP 3: Create the Vector Search Index in Atlas UI
Before querying you need to create a Search Index once in the Atlas web console:
- Open your cluster → Atlas Search tab → Create Search Index.
- Choose JSON Editor and paste:
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
- Name the index vector_index and click Create.
- Wait ~1 minute for the index to build (status changes from Building to Ready).
Then run the next cell.
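As an alternative to the UI steps above, the index can also be created from Python. The sketch below is a hedged illustration, assuming PyMongo 4.7 or later (for SearchIndexModel with a type argument) and a cluster that permits search index management from the driver. The helper names are hypothetical; the definition mirrors the JSON above.

```python
def build_vector_index_definition(path="embedding", num_dimensions=384, similarity="cosine"):
    """Return the same index definition as the JSON pasted into the Atlas UI."""
    if similarity not in {"cosine", "euclidean", "dotProduct"}:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name="vector_index"):
    """Create the index on a PyMongo collection (assumes PyMongo >= 4.7)."""
    # Imported lazily so the definition helper works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)


print(build_vector_index_definition())
# With this lab's collection: create_vector_index(collection_mg)
```

The index still needs time to build after the call returns, so the Ready status remains the signal to wait for before querying.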
MongoDB
Allow the IP Address of your machine to access the MongoDB Atlas cluster by adding it to the IP Access List in the MongoDB Atlas dashboard.

After the database is set up, you can inspect the collection and its search index in the Atlas UI to see how documents are stored and embedded for retrieval.


# ─────────────────────────────────────────────────────────────
# STEP 4: Query Atlas with $vectorSearch
# Mirrors the Chroma query above: same model, same query sentence
# ─────────────────────────────────────────────────────────────
query = "What is deep learning?"
query_embedding = model.encode(query).tolist()
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",  # name you gave in Atlas UI
            "path": "embedding",  # field holding the vector
            "queryVector": query_embedding,
            "numCandidates": 20,  # candidates examined (>= limit)
            "limit": 3  # top-K to return
        }
    },
    {
        # Retrieve the text + a relevance score. In inclusion projections,
        # only _id may be explicitly excluded.
        "$project": {
            "_id": 0,
            "text": 1,
            "source": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    }
]
results_mg = list(collection_mg.aggregate(pipeline))
print(f"Query: '{query}'")
print("Top-3 Most Similar Documents (Atlas Vector Search):")
print("-" * 60)
for i, r in enumerate(results_mg, 1):
    print(f"{i}. (score: {r['score']:.4f})")
    print(f" {r['text']}\n")
print("-" * 60)
print("✅ Atlas Vector Search section complete!")
print("Next: Lab 4 - Ingest real MongoDB data at scale")
Query: 'What is deep learning?'
Top-3 Most Similar Documents (Atlas Vector Search):
------------------------------------------------------------
1. (score: 0.8090)
 Deep learning uses neural networks with multiple layers

2. (score: 0.7615)
 Machine learning is a subset of artificial intelligence

3. (score: 0.6454)
 Natural language processing helps computers understand text

------------------------------------------------------------
✅ Atlas Vector Search section complete!
Next: Lab 4 - Ingest real MongoDB data at scale
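For reuse across queries, the pipeline from STEP 4 can be wrapped in a small builder. This is only a sketch: the function name and defaults are assumptions, and the optional filter would only work if the filtered field were also declared in the index as a field of type filter, which the index definition in this lab does not do.

```python
def build_vector_search_pipeline(query_vector, limit=3, num_candidates=20,
                                 index="vector_index", path="embedding",
                                 filter_expr=None):
    """Build a $vectorSearch + $project pipeline like the one used in STEP 4."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if num_candidates < limit:
        raise ValueError("numCandidates must be >= limit")

    stage = {
        "index": index,
        "path": path,
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if filter_expr is not None:
        # Assumption: the index declares the filtered field with type "filter".
        stage["filter"] = filter_expr

    return [
        {"$vectorSearch": stage},
        {"$project": {"_id": 0, "text": 1, "source": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]


# Placeholder vector standing in for model.encode(query).tolist()
demo_pipeline = build_vector_search_pipeline([0.0] * 384, limit=3, num_candidates=20)
print(sorted(demo_pipeline[0]["$vectorSearch"].keys()))
# With this lab's objects: list(collection_mg.aggregate(demo_pipeline))
```

Validating numCandidates against limit in the builder surfaces a misconfigured query locally instead of as a server-side error.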