Domain Architecture – ML / AI Services
Last Updated: December 29, 2025
1. Purpose
Centralize model serving, feature retrieval, experimentation, and inference APIs consumed by other domains.
Module Location: Distributed across domains (feature store, model serving)
2. Capabilities
| Capability | Description |
|---|---|
| Feature Store API | Retrieve feature vectors (Redis or Cassandra) |
| Recommendation Service | Given userId, return recommended products/posts |
| Fraud Detection | Score order events |
| Anomaly Scoring | Evaluate telemetry metrics |
| Moderation (optional) | Classify chat/social content |
3. Data Sources
- Consumes domain events (orders, telemetry, posts, chat messages)
- Curated warehouse tables (fact_orders, dim_product)
- Feature registry YAML (versioned)
4. Feature Store
- Redis key pattern: feature:{featureGroup}:{entityId}
- TTL applied to volatile features (e.g., recent activity).
- A nightly batch loader populates the store from the warehouse.
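The key pattern and TTL behaviour above can be sketched as follows. This is a minimal illustration, not the planned implementation: `InMemoryFeatureStore` is a hypothetical dictionary-backed stand-in for Redis, and the feature group name `recent_activity` and the injectable `clock` parameter are assumptions made so the example runs without a Redis server.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def feature_key(feature_group: str, entity_id: str) -> str:
    """Build a key following the documented pattern feature:{featureGroup}:{entityId}."""
    return f"feature:{feature_group}:{entity_id}"


class InMemoryFeatureStore:
    """Hypothetical stand-in for Redis, used only to illustrate TTL expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (feature vector, expiry timestamp or None for "never expires")
        self._data: Dict[str, Tuple[Dict[str, float], Optional[float]]] = {}

    def put(
        self,
        feature_group: str,
        entity_id: str,
        features: Dict[str, float],
        ttl_seconds: Optional[float] = None,
    ) -> None:
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[feature_key(feature_group, entity_id)] = (dict(features), expires_at)

    def get(self, feature_group: str, entity_id: str) -> Optional[Dict[str, float]]:
        key = feature_key(feature_group, entity_id)
        entry = self._data.get(key)
        if entry is None:
            return None
        features, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            # Expired volatile feature: drop it and report a miss.
            del self._data[key]
            return None
        return features
```

Volatile groups would pass `ttl_seconds`, while features loaded by the nightly batch job could omit it and simply be overwritten on the next run.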
5. Inference APIs
REST:
- GET /api/v1/recommendations/products?userId=U1
- POST /api/v1/fraud/score with JSON body { orderId }
gRPC (optional future):
service Recommendations {
  rpc GetProductRecommendations(UserId) returns (ProductList);
}
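As a rough sketch of how the two REST endpoints above might validate their inputs, the helpers below parse the `userId` query parameter and the `orderId` body field using only the standard library. The function names are hypothetical, and treating `orderId` as a non-empty string is an assumption; the source does not specify its type.

```python
import json
from urllib.parse import parse_qs, urlsplit


def parse_recommendation_request(url: str) -> str:
    """Extract userId from GET /api/v1/recommendations/products?userId=..."""
    parts = urlsplit(url)
    if parts.path != "/api/v1/recommendations/products":
        raise ValueError(f"unexpected path: {parts.path}")
    user_ids = parse_qs(parts.query).get("userId", [])
    if len(user_ids) != 1:
        raise ValueError("exactly one non-empty userId query parameter is required")
    return user_ids[0]


def parse_fraud_score_request(body: str) -> str:
    """Extract orderId from the JSON body of POST /api/v1/fraud/score."""
    payload = json.loads(body)
    order_id = payload.get("orderId") if isinstance(payload, dict) else None
    # Assumption: orderId is a non-empty string.
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("orderId must be a non-empty string")
    return order_id
```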
6. Model Management
Metadata store (Postgres):
- models(id, name, version, status, created_at)
- model_metrics(model_id, metric_name, value, recorded_at)
Deployment Strategy:
- Blue/green rollout: each model version is exposed on its own versioned endpoint
- Model version header: X-Model-Version
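One way the metadata tables and the `X-Model-Version` header could work together is sketched below. SQLite stands in for Postgres so the example runs locally, and several details are assumptions rather than part of the design: the column types, the status values `active` and `candidate`, and the choice to fall back to the newest active version when the requested version is not registered.

```python
import sqlite3
from typing import Optional

# Column names follow the documented tables; the types are assumptions.
SCHEMA = """
CREATE TABLE models (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE model_metrics (
    model_id INTEGER NOT NULL REFERENCES models(id),
    metric_name TEXT NOT NULL,
    value REAL NOT NULL,
    recorded_at TEXT NOT NULL
);
"""


def resolve_model_version(
    conn: sqlite3.Connection, name: str, requested_version: Optional[str] = None
) -> Optional[str]:
    """Pick the version to serve for a model.

    requested_version is the value of the X-Model-Version request header, if any.
    """
    if requested_version is not None:
        row = conn.execute(
            "SELECT version FROM models WHERE name = ? AND version = ?",
            (name, requested_version),
        ).fetchone()
        if row is not None:
            return row[0]
    # Fallback (assumption): newest model whose status is 'active'.
    row = conn.execute(
        "SELECT version FROM models WHERE name = ? AND status = 'active' "
        "ORDER BY created_at DESC, id DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0] if row is not None else None
```

The resolved version could then be echoed back in the response header and attached to traces, so that a blue/green switch amounts to changing which row is marked active.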
7. Events Produced
| Event | Purpose |
|---|---|
| ml.fraud.order.scored.v1 | Downstream actions (manual review) |
| ml.recommendation.served.v1 | Observability / A/B tracking |
| ml.anomaly.detected.v1 | IoT / Inventory reaction |
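To make the event names concrete, here is a hedged sketch of building an `ml.fraud.order.scored.v1` event. Only the event type comes from the table above; the envelope fields (`id`, `occurredAt`, `data`), the payload field names, and the assumption that scores lie in [0, 1] are illustrative.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def build_fraud_scored_event(order_id: str, score: float, model_version: str) -> Dict[str, Any]:
    """Build a JSON-serializable event for downstream consumers such as manual review."""
    # Assumption: fraud scores are probabilities in [0, 1].
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")
    return {
        "type": "ml.fraud.order.scored.v1",
        "id": str(uuid.uuid4()),
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "data": {
            "orderId": order_id,
            "score": score,
            "modelVersion": model_version,
        },
    }
```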
8. Testing
| Type | Focus |
|---|---|
| Unit | Feature extraction logic |
| Integration | Event ingestion to feature store |
| Performance | Recommendation latency p95 |
| Drift Monitoring (manual) | Compare feature distributions over time |
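The manual drift check could be supported by a small script. The source does not name a drift metric, so the sketch below uses the population stability index (PSI) as one common choice; the bin edges and the `eps` smoothing constant are assumptions, and values outside the edges are ignored.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin [edges[i], edges[i+1]); the last bin also includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = histogram_fractions(baseline, edges)
    actual = histogram_fractions(current, edges)
    psi = 0.0
    for e, a in zip(expected, actual):
        # Clamp empty bins to eps so the logarithm stays defined.
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Comparing last week's values of a feature against this week's with the same edges gives a single number to track over time; identical distributions yield 0, and larger values indicate a larger shift.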
9. Observability
Metrics:
- inference_latency_ms
- model_version_request_count
- feature_cache_hit_ratio
- fraud_score_distribution (histogram buckets)
Tracing:
- Each inference call produces a span carrying the model version as an attribute.
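Two of the metrics above involve a small amount of arithmetic, sketched here without any metrics library. The bucket boundaries are assumptions, and the counts are per bucket rather than cumulative, which differs from how some metrics systems export histograms.

```python
from bisect import bisect_left
from typing import Dict, Sequence


def cache_hit_ratio(hits: int, misses: int) -> float:
    """feature_cache_hit_ratio: hits divided by total lookups (0.0 when there are none)."""
    total = hits + misses
    return hits / total if total else 0.0


def bucket_counts(scores: Sequence[float], upper_bounds: Sequence[float]) -> Dict[str, int]:
    """fraud_score_distribution: count scores per bucket.

    upper_bounds must be sorted ascending; each bound is inclusive, and
    scores above the last bound land in the "+Inf" bucket.
    """
    labels = [str(b) for b in upper_bounds] + ["+Inf"]
    counts = {label: 0 for label in labels}
    for score in scores:
        # bisect_left finds the first bound that is >= score.
        counts[labels[bisect_left(upper_bounds, score)]] += 1
    return counts
```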
10. Security
- Auth required for inference (JWT).
- Rate limiting per client key.
- Model artifacts stored in object storage with signed URLs (future).
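Per-client rate limiting can be implemented in several ways; the source does not pick one. The sketch below uses a token bucket as an illustrative choice, with hypothetical `capacity` and `refill_per_second` parameters and an injectable clock so the behaviour is easy to test.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Token-bucket rate limiter keyed by client key (an assumed algorithm choice)."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = float(capacity)
        self._refill = refill_per_second
        self._clock = clock
        # client key -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_key: str) -> bool:
        """Return True and consume one token if the client is within its limit."""
        now = self._clock()
        tokens, last = self._buckets.get(client_key, (self._capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._refill)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[client_key] = (tokens, now)
        return allowed
```

A rejected call would typically map to an HTTP 429 response on the inference endpoints.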
11. Implementation Order
- Feature store scaffold + Redis integration
- Simple rule-based recommendation (no ML yet)
- Fraud scoring stub (random score)
- Inference REST endpoints + tracing
- Event emission for consumption audit
- Replace stub with lightweight ML (e.g., collaborative filtering mock)
- Add gRPC service
- Introduce model metadata & A/B version routing
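The early steps in this order (rule-based recommendation and a random fraud stub) might look like the sketch below. The popularity rule, the `purchases` input shape, and the function names are assumptions chosen for illustration; the source only states that no ML is involved at this stage.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def recommend_products(user_id: str, purchases: Dict[str, List[str]], limit: int = 3) -> List[str]:
    """Rule-based recommendation: most-purchased products the user has not bought yet."""
    popularity = Counter(product for items in purchases.values() for product in items)
    owned = set(purchases.get(user_id, []))
    # Sort by descending popularity, then by product id for a stable order.
    ranked = sorted(popularity.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked if product not in owned][:limit]


def stub_fraud_score(order_id: str, rng: Optional[random.Random] = None) -> float:
    """Fraud scoring stub: returns a random score in [0, 1) and ignores the order."""
    rng = rng or random.Random()
    return rng.random()
```

Keeping the same function signatures when the stub is later replaced by a lightweight model would let the REST endpoints and emitted events stay unchanged.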
12. Interoperability Checklist
- [ ] Consumes only public domain event types
- [ ] Does not introduce direct DB coupling to other domains
- [ ] Features versioned & documented
- [ ] Recommendation response stable & backward compatible
13. Exit Criteria
- Recommendation latency p95 < 150 ms (measured locally)
- Fraud scoring event produced for each paid order
- Feature cache hit ratio > 80% after warmup
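These criteria can be checked mechanically once measurements are collected. The sketch below assumes the nearest-rank definition of a percentile (the source does not say which definition applies) and hypothetical inputs: a list of latency samples, hit and miss counters, and sets of paid and scored order IDs.

```python
import math
from typing import Sequence, Set


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_exit_criteria(
    latencies_ms: Sequence[float],
    cache_hits: int,
    cache_misses: int,
    paid_orders: Set[str],
    scored_orders: Set[str],
) -> bool:
    """Check the three exit criteria against collected measurements."""
    lookups = cache_hits + cache_misses
    hit_ratio = cache_hits / lookups if lookups else 0.0
    return (
        percentile_nearest_rank(latencies_ms, 95) < 150
        and hit_ratio > 0.80
        and paid_orders <= scored_orders  # every paid order has a scoring event
    )
```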
Related Documentation
Architecture References
- System Overview - Platform architecture
- Data Platform - Feature engineering
- Data Architecture - OLAP integration
- E-Commerce Domain - Fraud detection
- IoT Domain - Anomaly detection
Implementation Guides
- API Protocols Guide - gRPC patterns
- Microservices Patterns
- Testing Strategy
ADRs
Current Status: Planned 📋
Last Review: December 29, 2025