8.04 · Deep Dive: MongoDB for Document Analytics
Level: Intermediate Time to read: 15 min Pre-reading: 08 · Tools Ecosystem After reading: You'll understand MongoDB's strengths for semi-structured data and when it's suitable for analytics.
MongoDB: Document Database
MongoDB is a document-oriented NoSQL database that stores data as JSON-like documents. It's optimized for flexible schemas and operational workloads, not analytics.
Architecture
Traditional RDBMS: MongoDB:
┌─────────────────┐ ┌──────────────────────────┐
│ Customers Table │ │ customers Collection │
├─────────────────┤ ├──────────────────────────┤
│ id │ name │ age│ │ {_id: 1, name: "John", │
├─────┼──────┼────┤ │ age: 30, │
│ 1 │ John │ 30 │ │ orders: [{id: 101, ...}]│
│ 2 │ Jane │ 28 │ │ } │
└─────┴──────┴────┘ │ {_id: 2, name: "Jane", │
│ age: 28, ...} │
└──────────────────────────┘
Strengths
✅ Flexible schema - Add fields without migrations ✅ Nested data - Embed related documents ✅ Developer-friendly - JSON-like format ✅ Scalable - Built-in horizontal scaling (sharding) ✅ Fast writes - Optimized for insert/update operations
Weaknesses for Analytics
❌ Complex aggregations - Denormalized data is hard to analyze ❌ No JOINs - Relationships are embedded, limiting query flexibility ❌ High storage - Denormalization increases data size ❌ Limited SQL - MongoDB Query Language is less powerful than SQL ❌ No columnar compression - Row-like storage, not optimized for analytics scans
When to Use MongoDB
Good Fit
- ✅ Product catalogs (varied attributes per product)
- ✅ User profiles (flexible, evolving schemas)
- ✅ Content management systems
- ✅ Real-time operational analytics
- ✅ IoT sensor data (document per sensor reading)
Poor Fit
- ❌ Complex analytics queries (JOINs across collections)
- ❌ Ad-hoc business intelligence
- ❌ Time-series data (use Cassandra or TimescaleDB)
- ❌ Large-scale aggregations (millions of groups)
- ❌ Financial/audit data (need strict schemas)
MongoDB for Analytics: The Pattern
Instead of building analytics directly in MongoDB, follow this pattern:
MongoDB (Operational)
↓ Export/CDC
Parquet Files (Cloud Storage)
↓ Load
BigQuery / Snowflake (DWH)
↓ Transform (dbt)
Gold Layer
↓
BI Tools / Dashboards
MongoDB Aggregation Pipeline (Limited Analytics)
// MongoDB aggregation pipeline (limited capability)
db.customers.aggregate([
{
$match: {
created_at: { $gte: new Date('2024-01-01') }
}
},
{
$group: {
_id: "$city",
num_customers: { $sum: 1 },
avg_lifetime_value: { $avg: "$lifetime_value" }
}
},
{
$sort: { avg_lifetime_value: -1 }
},
{
$limit: 10
}
]);
// Equivalent SQL (more powerful):
// SELECT city,
// COUNT(*) as num_customers,
// AVG(lifetime_value) as avg_lifetime_value
// FROM customers
// WHERE created_at >= '2024-01-01'
// GROUP BY city
// ORDER BY avg_lifetime_value DESC
// LIMIT 10;
Note: Aggregation pipelines work but are less flexible and slower than SQL queries on data warehouses.
MongoDB Atlas for Analytics
MongoDB Atlas = Managed MongoDB cloud + BI connectors
MongoDB Atlas (cloud)
↓
Option 1: Aggregation pipelines (in-DB analytics, limited)
OR
Option 2: BI Connectors → Tableau, Power BI (using aggregation pipeline)
OR
Option 3: Export to Data Warehouse → BigQuery/Snowflake
Example: Export to BigQuery
// Use MongoDB's native connectors or tools like Stitch/Fivetran
// to periodically export collections to BigQuery
// MongoDB collection
db.orders.find({})
// Becomes BigQuery table
SELECT * FROM `project.dataset.orders`
// Now run analytics
SELECT
customer_id,
COUNT(*) as num_orders,
SUM(amount) as total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;
Comparison: MongoDB vs. Data Warehouses
| Feature | MongoDB | BigQuery | Snowflake |
|---|---|---|---|
| Flexibility | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
| Query Speed (analytics) | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Aggregations | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| JOINs | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Schema evolution | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Write performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Cost for analytics | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
Key Takeaways
- MongoDB is operational, not analytical - Designed for OLTP, not OLAP
- Export to data warehouse for serious analytics (BigQuery, Snowflake)
- Aggregation pipelines have limited functionality vs. SQL
- Use MongoDB for product data, not analytical data
- Combine with BigQuery/Snowflake for complete analytics stack
- For analytics, use cloud DWH as single source of truth