Lab Data Sources & Setup Guide

Open-Source MongoDB Datasets

All MongoDB sample datasets are publicly available and free to use for educational purposes.

MongoDB Atlas Sample Datasets

MongoDB provides free sample datasets that you can use directly. No credit card needed for local development.

1. Restaurants Dataset (25K documents)

Size: ~24,000 documents
Fields: name, address, borough, cuisine, grades, phone, website
Perfect for: Lab 4 (ingestion) and Lab 6 (hybrid search)
Use Case: Restaurant search and recommendation

Quick Start:

# Option 1: Download JSON directly (see script below)
# Option 2: Use MongoDB Atlas (free tier)
# Option 3: Load from provided restaurants.json in /data folder

2. Movies Dataset (23K documents)

Size: ~23,000 documents
Fields: title, plot, genres, runtime, imdb rating, cast, directors
Perfect for: Lab 4 and semantic search on descriptions
Use Case: Movie/media recommendation system

Data Download Instructions

For Local Development (Recommended)

Option A: Download Pre-sampled Data from This Project

cd docs/labs/data/
# Files will be provided (restaurants_sample.json, movies_sample.json)

Option B: Download from MongoDB via Python

import json
import urllib.request

# Restaurants dataset
url = "https://raw.githubusercontent.com/mongodb/mongo-python-driver/master/bson/json_util.py"
# Or use MongoDB Atlas sample datasets directly

# Instructions in Lab 4

Option C: Use MongoDB Atlas (Free Account)

Create free MongoDB Atlas account: https://www.mongodb.com/cloud/atlas
Create free cluster (512MB storage)
Load sample dataset from Atlas UI
Connect via connection string in python

How Laboratory Data is Used

Lab 4: MongoDB Ingestion Pipeline

Dataset: Restaurants (or Movies)
Process:
Load ~1000 documents
Extract text fields (name, cuisine, reviews)
Chunk documents (fixed-size or semantic)
Generate embeddings
Store in Chroma with metadata

Expected Output: Vector database with 1000-5000 chunks

Lab 5: Exact Match Problem

Dataset: Generated custom dataset with order IDs
Challenge: Order #1766 vs #1767 issue
Solution: Hybrid search

Lab 6: Hybrid Search Evaluation

Dataset: Restaurants reviews + structured metadata
Comparison: Semantic vs Keyword vs Hybrid
Metric: How many top-3 results are truly relevant

Sample Data Included

This project includes smaller sample datasets (500 documents each) in: - docs/labs/data/restaurants_sample.json (500 restaurants) - docs/labs/data/exact_match_sample.json (100 order records) - docs/labs/data/hybrid_test_set.json (200 documents with references)

These are perfect for testing locally without downloading large files.

Data Schema Reference

Restaurants JSON Structure

{
  "_id": "ObjectId",
  "name": "Restaurant Name",
  "address": {
    "street": "123 Main St",
    "zipcode": "10001",
    "borough": "Manhattan",
    "coord": [-73.9, 40.7]
  },
  "cuisine": "Italian",
  "phone": "212-555-1234",
  "grades": [
    {
      "date": "2021-01-15",
      "grade": "A",
      "score": 13
    }
  ],
  "website": "http://example.com"
}

Movies JSON Structure

{
  "_id": "ObjectId",
  "title": "Movie Title",
  "plot": "Full plot description...",
  "genres": ["Action", "Drama"],
  "runtime": 120,
  "rated": "PG-13",
  "imdb": {
    "rating": 8.5,
    "votes": 250000
  },
  "cast": ["Actor 1", "Actor 2"],
  "directors": ["Director 1"],
  "year": 2021
}

License & Attribution

MongoDB Sample Datasets: Creative Commons License - see https://www.mongodb.com/docs/atlas/sample-data/
Custom Datasets: Created for educational purposes

Troubleshooting Data Issues

Issue: "File not found" - Solution: Download sample data using Lab 4 instructions

Issue: "Encoding errors reading JSON" - Solution: Specify UTF-8 encoding: open(file, encoding='utf-8')

Issue: "Out of memory with full dataset" - Solution: Use provided sample datasets or limit documents: json_data[:1000]

Next Steps

Lab 4: Will guide you through loading this data
Create vector embeddings: Convert text fields to vectors
Store in database: Chunk and embed at scale
Evaluate results: Measure quality of retrieval

See Lab 4: MongoDB Ingestion for detailed walkthrough.