Lab 3: Vector Database Basics with Chroma & MongoDB Atlas Vector Search

Level: Intermediate | Duration: 2 hours

Objective

Learn how to store, index, and search vectors, first in a local Chroma database and then in MongoDB Atlas Vector Search.

What You'll Learn

Create and configure a Chroma vector database
Store embeddings with metadata
Perform similarity search
Understand HNSW indexing
Handle update and delete operations
Persist database to disk
Replicate the same workflow with MongoDB Atlas Vector Search

Why Chroma?

✅ Easy to learn (Python API)
✅ Built-in embeddings support
✅ Metadata filtering
✅ Persistent storage
✅ Works locally (no server setup)
✅ Perfect for prototyping and learning

import chromadb
from sentence_transformers import SentenceTransformer

# Initialize Chroma client (persistent storage)
client = chromadb.PersistentClient(path="./chroma_data")

# Create collection
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

print(f"Created Chroma collection: documents")
print(f"Storage space: cosine (best for embeddings)")
print(f"HNSW index: Hierarchical Navigable Small World\n")

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample documents (same set reused for MongoDB section below)
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing helps computers understand text",
    "Computer vision analyzes and interprets images",
    "Data science combines statistics and programming"
]

# Add documents to collection
for i, doc in enumerate(documents):
    embedding = model.encode(doc)
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding.tolist()],
        documents=[doc],
        metadatas=[{"source": "sample", "index": i}]
    )

print(f"Added {len(documents)} documents to Chroma\n")

# Search the collection
query = "What is deep learning?"
query_embedding = model.encode(query)

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=3
)

print(f"Query: '{query}'")
print("Top-3 Most Similar Documents:")
print("-" * 60)
for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
    similarity = 1 - dist  # Convert distance to similarity
    print(f"{i}. (similarity: {similarity:.4f})")
    print(f"   {doc}\n")

print("✅ Chroma section complete!")

Created Chroma collection: documents
Storage space: cosine (best for embeddings)
HNSW index: Hierarchical Navigable Small World

Added 5 documents to Chroma

Query: 'What is deep learning?'
Top-3 Most Similar Documents:
------------------------------------------------------------
1. (similarity: 0.6180)
   Deep learning uses neural networks with multiple layers

2. (similarity: 0.5231)
   Machine learning is a subset of artificial intelligence

3. (similarity: 0.2908)
   Natural language processing helps computers understand text

✅ Chroma section complete!
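
The learning objectives list update and delete operations, which the cell above does not exercise. The following is a minimal sketch of how they might look, assuming the `collection` and `model` objects created above. The function name `refresh_and_prune`, the `embed` callable, and the extra `doc_5` record are hypothetical and exist only for illustration; `upsert`, `update`, and `delete` are methods of a Chroma collection.

```python
from typing import Callable, Sequence


def refresh_and_prune(collection, embed: Callable[[str], Sequence[float]]) -> None:
    """Exercise upsert, update, and delete on a Chroma collection.

    `collection` is expected to be a chromadb collection; `embed` is a
    hypothetical helper that maps a string to a list of floats.
    """
    # upsert: inserts the record if the ID is new, overwrites it otherwise,
    # so re-running the call with the same ID is safe.
    new_text = "Reinforcement learning trains agents through rewards"
    collection.upsert(
        ids=["doc_5"],
        embeddings=[list(embed(new_text))],
        documents=[new_text],
        metadatas=[{"source": "sample", "index": 5}],
    )

    # update: modifies an existing record. Only the metadata changes here,
    # so the stored text and embedding stay consistent with each other.
    collection.update(
        ids=["doc_1"],
        metadatas=[{"source": "sample", "index": 1, "reviewed": True}],
    )

    # delete: removes records by ID, restoring the original five documents.
    collection.delete(ids=["doc_5"])
```

In the notebook this could be called as `refresh_and_prune(collection, lambda text: model.encode(text).tolist())`. Because the collection was opened with `PersistentClient`, the changes are written to `./chroma_data` and remain visible after a kernel restart.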

Part 2: Same Workflow on MongoDB Atlas Vector Search

MongoDB Atlas, including the free M0 tier, supports vector search natively and can rank results by cosine similarity, the same metric used in the Chroma section.

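Prerequisites

1. Create a free Atlas account if you haven't already.
2. Create a free M0 cluster (any region).
3. Under Security → Database Access, create a user with read/write access.
4. Under Security → Network Access, allow access from your IP (or 0.0.0.0/0 for short-lived testing only, since it opens the cluster to any address).
5. Copy your connection string from Connect → Drivers (Python 3.12+) and store it as `ATLAS_URI` in a `.env` file, which the connection cell below reads.
6. Install the driver and the `.env` loader used by the connection cell:
```
pip install pymongo python-dotenv
```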

How Atlas Vector Search works vs Chroma

| Step | Chroma | MongoDB Atlas |
| --- | --- | --- |
| Store vectors | `collection.add(embeddings=...)` | Insert documents with an `embedding` field |
| Create index | Automatic (HNSW built-in) | Create a Search Index in the Atlas UI or via the API |
| Query | `collection.query(query_embeddings=...)` | `$vectorSearch` aggregation stage |
| Result score | Cosine distance (0 = identical) | `score` field via `$meta: "vectorSearchScore"` |

Key difference: In Atlas you create the vector search index once in the UI, then query it with an aggregation pipeline.
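
To make the Atlas side of the comparison concrete, here is a minimal sketch of a helper that assembles the aggregation pipeline as plain Python dictionaries. The function name `build_vector_search_pipeline` is hypothetical; the defaults `vector_index`, `embedding`, and `text` are the names this lab uses for the index and document fields. The checks reflect the `$vectorSearch` constraints that `numCandidates` must be at least `limit` and at most 10000.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    index_name: str = "vector_index",
    path: str = "embedding",
    limit: int = 3,
    num_candidates: int = 20,
) -> List[Dict[str, Any]]:
    """Return a two-stage pipeline: $vectorSearch followed by $project."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if not limit <= num_candidates <= 10000:
        raise ValueError("num_candidates must be between limit and 10000")

    return [
        {
            "$vectorSearch": {
                "index": index_name,               # name of the Atlas index
                "path": path,                      # field holding the vector
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,   # neighbours examined
                "limit": limit,                    # top-K documents returned
            }
        },
        {
            # Keep the text and expose the relevance score as `score`.
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

The returned list can be passed directly to a PyMongo collection's `aggregate` method once a vector index exists on the collection.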

import os
from pathlib import Path

from dotenv import load_dotenv
from pymongo import MongoClient

# Load ATLAS_URI from the first .env file found while walking up from the
# current working directory.
for _base in [Path.cwd(), *Path.cwd().parents]:
    _candidate = _base / ".env"
    if _candidate.exists():
        load_dotenv(dotenv_path=_candidate, override=False)
        break

# ─────────────────────────────────────────────────────────────
# STEP 1: Set your connection string (keep this private!)
# Copy it from Atlas → Connect → Drivers → Python
# ─────────────────────────────────────────────────────────────
ATLAS_URI = os.getenv("ATLAS_URI", "").strip()
if not ATLAS_URI:
    raise RuntimeError("ATLAS_URI is not set; add it to your .env file or environment.")

# ─────────────────────────────────────────────────────────────
# STEP 2: Connect to the cluster (documents are inserted in the next cell)
# ─────────────────────────────────────────────────────────────
# tlsAllowInvalidCertificates=True disables certificate validation so the
# connection works behind a corporate TLS-inspecting proxy.
# ⚠ Use ONLY for local dev/testing, never in production.
# If you are on a personal network, remove this option and use:
#   mongo_client = MongoClient(ATLAS_URI)
mongo_client = MongoClient(
    ATLAS_URI,
    tls=True,
    tlsAllowInvalidCertificates=True,   # workaround for corporate TLS proxy
    serverSelectionTimeoutMS=30000,     # give up after 30 s if unreachable
)

# Quick ping to confirm the connection before doing anything else
mongo_client.admin.command("ping")
print("✅ Connected to MongoDB Atlas")

✅ Connected to MongoDB Atlas

DB_NAME       = "rag_lab"
COLL_NAME     = "documents"
db            = mongo_client[DB_NAME]
collection_mg = db[COLL_NAME]

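# Remove documents from any previous run so re-running is safe.
# delete_many keeps the collection itself; drop() would also remove any
# search index already defined on it.
collection_mg.delete_many({})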

# Build records: each document gets a text field + its embedding vector
records = []
for i, doc in enumerate(documents):
    embedding = model.encode(doc).tolist()   # list[float] — required by Atlas
    records.append({
        "_id":       f"doc_{i}",
        "text":      doc,
        "source":    "lab3_sample",
        "embedding": embedding               # 384-dim vector (all-MiniLM-L6-v2)
    })

collection_mg.insert_many(records)
print(f"Inserted {len(records)} documents into Atlas collection '{COLL_NAME}'")
print("Sample record keys:", list(records[0].keys()))

Inserted 5 documents into Atlas collection 'documents'
Sample record keys: ['_id', 'text', 'source', 'embedding']

Retrieve one document

doc_id = "doc_0"
# Fetch a single record by _id, projecting only the fields printed below
doc = collection_mg.find_one({"_id": doc_id}, {"_id": 1, "embedding": 1, "text": 1})
if not doc:
    print(f"Document not found: {doc_id}")
else:
    emb = doc.get("embedding")
    print("Found:", doc["_id"])
    print("Text:", (doc.get("text") or "")[:120], "...")
    print("Embedding length:", len(emb) if emb else 0)
    print("First 10 values:", emb[:10] if emb else None)

Found: doc_0
Text: Machine learning is a subset of artificial intelligence ...
Embedding length: 384
First 10 values: [-0.04610738530755043, -0.004260687157511711, 0.0698365792632103, 0.035535287111997604, 0.048502057790756226, -0.030225230380892754, 0.001603968907147646, -0.009542404673993587, -0.05142451077699661, -0.0038602121639996767]

STEP 3: Create the Vector Search Index in Atlas UI

Before querying, create a Search Index once in the Atlas web console:

1. Open your cluster → Atlas Search tab → Create Search Index.
2. Choose JSON Editor, select the `rag_lab.documents` collection, and paste:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}

3. Name the index `vector_index` and click Create.
4. Wait about a minute for the index to build (status changes from Building to Ready).

Then run the next cell.
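
The same index can in principle be created from Python instead of the UI. The sketch below assumes PyMongo 4.7 or later, where `SearchIndexModel` accepts a `type` argument; the helper names `vector_index_definition` and `create_vector_index` are hypothetical. Whether a given cluster tier accepts programmatic index creation is not confirmed by this lab, so treat it as an alternative to try rather than a guaranteed path.

```python
from typing import Any, Dict

_ALLOWED_SIMILARITIES = ("cosine", "euclidean", "dotProduct")


def vector_index_definition(
    num_dimensions: int = 384,
    path: str = "embedding",
    similarity: str = "cosine",
) -> Dict[str, Any]:
    """Build the same JSON definition that is pasted into the Atlas UI."""
    if similarity not in _ALLOWED_SIMILARITIES:
        raise ValueError(f"similarity must be one of {_ALLOWED_SIMILARITIES}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,  # must match the model output
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on a PyMongo collection and return its name."""
    # Imported here so the definition helper works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    index_model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=index_model)
```

Calling `create_vector_index(collection_mg)` only submits the request. The index still needs time to build before `$vectorSearch` queries return results.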

MongoDB

Make sure your machine's IP address is on the IP Access List in the MongoDB Atlas dashboard; otherwise the connection will fail.

Once the index status is Ready, you can inspect it in the Atlas UI to confirm that it covers the `embedding` field with 384 dimensions and cosine similarity. This lab stores each sample sentence as a single document, so no chunking is involved.

# ─────────────────────────────────────────────────────────────
# STEP 4: Query Atlas with $vectorSearch
# Mirrors the Chroma query above — same model, same query sentence
# ─────────────────────────────────────────────────────────────
query = "What is deep learning?"
query_embedding = model.encode(query).tolist()

pipeline = [
    {
        "$vectorSearch": {
            "index":        "vector_index",   # name you gave in Atlas UI
            "path":         "embedding",      # field holding the vector
            "queryVector":  query_embedding,
            "numCandidates": 20,              # candidates examined (>= limit)
            "limit":        3                 # top-K to return
        }
    },
    {
        # Retrieve the text + a relevance score. In inclusion projections,
        # only _id may be explicitly excluded.
        "$project": {
            "_id":     0,
            "text":    1,
            "source":  1,
            "score":   {"$meta": "vectorSearchScore"}
        }
    }
]

results_mg = list(collection_mg.aggregate(pipeline))

print(f"Query: '{query}'")
print("Top-3 Most Similar Documents (Atlas Vector Search):")
print("-" * 60)
for i, r in enumerate(results_mg, 1):
    print(f"{i}. (score: {r['score']:.4f})")
    print(f"   {r['text']}\n")
print("-" * 60)
print("✅ Atlas Vector Search section complete!")
print("Next: Lab 4 - Ingest real MongoDB data at scale")

Query: 'What is deep learning?'
Top-3 Most Similar Documents (Atlas Vector Search):
------------------------------------------------------------
1. (score: 0.8090)
   Deep learning uses neural networks with multiple layers

2. (score: 0.7615)
   Machine learning is a subset of artificial intelligence

3. (score: 0.6454)
   Natural language processing helps computers understand text

------------------------------------------------------------
✅ Atlas Vector Search section complete!
Next: Lab 4 - Ingest real MongoDB data at scale
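
The Atlas scores above are higher than the Chroma similarities for the same three documents, even though both systems use cosine similarity on identical embeddings. The reason is scaling. Chroma reports cosine distance, which the Chroma cell converts back with `1 - dist`, whereas Atlas normalizes cosine similarity into the range 0 to 1 as `(1 + cosine) / 2`. The sketch below uses only the standard library, and its function names are hypothetical helpers for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def chroma_cosine_distance(similarity: float) -> float:
    """Distance reported by Chroma for the cosine space (0 = identical)."""
    return 1.0 - similarity


def atlas_cosine_score(similarity: float) -> float:
    """Normalized score reported by Atlas for cosine similarity."""
    return (1.0 + similarity) / 2.0


# Map the Chroma similarities printed earlier onto the Atlas scale.
for sim in (0.6180, 0.5231, 0.2908):
    print(f"cosine {sim:.4f} -> Atlas score {atlas_cosine_score(sim):.3f}")
```

Applied to the Chroma similarities 0.6180, 0.5231, and 0.2908, the conversion agrees with the Atlas scores 0.8090, 0.7615, and 0.6454 to within rounding. The two systems return the same ranking on different scales.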