03 - Evals, Observability, and Production Readiness
This module converts "it works on my laptop" into a repeatable production process.
Production Readiness Stack
flowchart LR
A[Golden Dataset] --> B[Pipeline Run]
B --> C[Automated Metrics]
B --> D[Trace Collection]
C --> E[Quality Gate]
D --> E
E --> F{Pass Thresholds?}
F -- Yes --> G[Deploy]
F -- No --> H[Fix + Re-test]
Minimum Evaluation Kit
- Golden dataset with representative and failure cases
- Regression run before each prompt/model/retriever change
- Metrics split by stage (retrieval, generation, tool use)
Baseline Evaluation Types
- Unit tests for deterministic logic
- Prompt regression tests
- Retrieval quality tests
- LLM output quality tests
- Agent task completion tests
- Human review for high-risk paths
- Online experiments when applicable
Suggested Metrics
| Layer | Metrics |
|---|---|
| Retrieval | hit rate, context precision, MRR |
| Generation | faithfulness, relevance, citation accuracy |
| Agent | task success rate, tool-call accuracy, retry rate |
| Operations | latency p95, error rate, cost per successful task |
Day-by-Day Alignment
| Day | Use this page for | Deliverable |
|---|---|---|
| Day 15 | Eval strategy and dataset design | JSONL eval starter set |
| Day 16 | Automated eval gates | Pass-fail gate config |
| Day 17 | Tracing and observability | Structured log sample |
| Day 18 | Safety and policy controls | Middleware checklist |
| Day 19 | SLOs and rollout rules | Deployment checklist |
| Day 20 | Postmortem and improvement loop | Incident review note |
| Day 21 | Production readiness review | Scorecard with gaps |
Step-by-Step Production Readiness Flow
| Step | Action | Output |
|---|---|---|
| 1 | Create a small representative eval set | Baseline dataset |
| 2 | Define pass thresholds before changes | Explicit release bar |
| 3 | Log quality, latency, and cost together | Comparable run history |
| 4 | Gate deploys on measurable thresholds | Safer shipping process |
| 5 | Turn failures into backlog actions | Continuous improvement loop |
Example Code: Tiny Eval Dataset and Gate
eval_set = [
{
"question": "How do I reset an expired API key?",
"expected_source": "security-runbook",
"must_include": ["identity verification", "security portal"],
},
{
"question": "When should a request be escalated?",
"expected_source": "support-policy",
"must_include": ["high risk", "manual review"],
},
]
def release_gate(metrics: dict) -> bool:
return (
metrics["faithfulness"] >= 0.85
and metrics["retrieval_hit_rate"] >= 0.80
and metrics["latency_p95_ms"] <= 3500
and metrics["cost_per_success"] <= 0.08
)
baseline_metrics = {
"faithfulness": 0.88,
"retrieval_hit_rate": 0.82,
"latency_p95_ms": 3100,
"cost_per_success": 0.06,
}
print({"ship": release_gate(baseline_metrics), "metrics": baseline_metrics})
Example Code: Structured Trace Record
{
"trace_id": "trace-1042",
"prompt_version": "rag-v3",
"model": "gpt-4.1-mini",
"retrieved_chunks": [
"policy-0",
"runbook-1"
],
"tool_calls": [],
"faithfulness": 0.9,
"latency_ms": 1840,
"status": "pass"
}
Interview Q: What should block an LLM release?
Model Answer: I block the release when key quality, safety, latency, or cost metrics miss agreed thresholds, or when high-risk cases lack review coverage. The exact gate should be explicit before the change is tested.
Why this matters: It shows you can operationalize quality, not just observe it.
Interview Q: Why is tracing as important as evaluation?
Model Answer: Evaluation tells me that quality moved, but tracing tells me why it moved. Without traces of prompts, retrieved context, and tool behavior, I can detect regressions but not diagnose them quickly.
Why this matters: Strong candidates connect observability to faster iteration and safer production support.
Observability Tools to Know
- LangSmith and Langfuse for LLM traces
- OpenTelemetry for unified instrumentation
- MLflow or Weights and Biases for experiment tracking
- Custom dashboards for latency, cost, and failure taxonomy
Observability Checklist
- [ ] Prompt version logged
- [ ] Model version logged
- [ ] Retrieved context logged
- [ ] Tool inputs/outputs logged
- [ ] User feedback captured
- [ ] Alerting for latency/cost spikes
Deployment Concerns to Explain in Interviews
- Secrets management and least-privilege access
- Timeout, retry, and fallback strategy
- Config separation from code
- Rollback plan and release gates
Baseline Reliability Checklist
- [ ] Timeout and retry policies are explicit
- [ ] Fallback model/provider is defined
- [ ] Secrets are stored in a secret manager
- [ ] Idempotency is enforced for side-effecting workflows
- [ ] Alert thresholds exist for cost and latency spikes
Quick Lab (20 min)
Eval and observability micro-lab
- Build a tiny dataset with 10 questions.
- Define pass thresholds for 3 metrics.
- Run one baseline and one modified pipeline version.
- Decide deploy/no-deploy based on your gate.
Next: 04 STAR Story System