Production Considerations

Production readiness is not a single feature.

It is a consistent operating model across reliability, security, cost, and governance.

The current code is a useful baseline because it shows the main moving parts clearly, but it also reveals where production safeguards are still missing: authentication is a placeholder, routing is hard-coded, memory does not survive a process restart, and traces are printed instead of being emitted as structured telemetry.

Readiness flow

flowchart TB
    D[Design Review] --> T[Load and Failure Tests]
    T --> S[Security Controls]
    S --> O[Observability Baseline]
    O --> R[Release Gates]
    R --> P[Post-Release Monitoring]

    style D fill:#1976d2,color:#fff
    style T fill:#1976d2,color:#fff
    style S fill:#ff9800,color:#fff
    style O fill:#ff9800,color:#fff
    style R fill:#1976d2,color:#fff
    style P fill:#ff9800,color:#fff

Production checklist

| Domain | Must-have controls | Practical baseline |
| --- | --- | --- |
| Reliability | Timeouts, retries, fallback responses | Track P95 latency and tool error rate |
| Security | AuthN, AuthZ, secret hygiene | Separate read-only and write-capable tools |
| Cost | Token and tool usage budgeting | Add request-level cost attribution |
| Governance | Audit trail and release process | Versioned prompts and tool contracts |
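
The reliability row can be illustrated with a minimal sketch of retries plus a fallback response. Everything named here (`call_with_retries`, the attempt count, the backoff delays) is a hypothetical illustration and not code from the repository; the wrapped callable is assumed to enforce its own timeout.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    fallback: T,
    max_attempts: int = 3,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to max_attempts times, then return the fallback value.

    Assumption: fn applies its own timeout and raises when it expires.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Illustrative policy: retry on any exception. A real system
            # would usually retry only errors known to be transient.
            if attempt < max_attempts - 1:
                # Exponential backoff: 0.2 s, 0.4 s, ... with the defaults.
                sleep(base_delay_s * (2 ** attempt))
    # All attempts failed: degrade gracefully instead of raising.
    return fallback
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. Counting how often the fallback is returned is one way to feed the "tool error rate" baseline from the table.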

Production controls mapped to the repository

| Control | Applied to | Why it matters |
| --- | --- | --- |
| Request validation | `src/api/server.py` | Prevents malformed prompts from entering the pipeline |
| Tool allowlist | `src/llm/router.py` | Keeps the router from calling unsafe tools |
| Persistent storage | `src/memory/store.py` | Preserves memory across process restarts |
| Structured telemetry | `src/observability/tracer.py` | Makes operational analysis searchable and machine-readable |
| Authentication | `src/security/auth.py` | Protects API entry points and privileged tools |
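
As one illustration of the tool allowlist control, the sketch below shows a registry that rejects unregistered tools and gates write-capable tools behind an explicit flag. The names `Tool`, `ToolRegistry`, and `ToolNotAllowed` are assumptions made for this example; the actual interface in `src/llm/router.py` may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable[[str], str]
    writes: bool  # True if the tool can modify state


class ToolNotAllowed(Exception):
    """Raised when a tool is unknown or the caller lacks write permission."""


class ToolRegistry:
    """Hypothetical allowlist: only registered tools can be called."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, arg: str, allow_writes: bool = False) -> str:
        tool = self._tools.get(name)
        if tool is None:
            raise ToolNotAllowed(f"tool {name!r} is not on the allowlist")
        if tool.writes and not allow_writes:
            raise ToolNotAllowed(f"tool {name!r} requires write permission")
        return tool.handler(arg)
```

Defaulting `allow_writes` to `False` means a caller has to opt in to write-capable tools, which matches the checklist advice to separate read-only and write-capable tools.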

Capacity planning formula

A simple planning formula estimates required workers under expected load.

\[ W = \frac{Q \cdot L}{U} \]

| Symbol | Meaning |
| --- | --- |
| W | Minimum worker count |
| Q | Target requests per second |
| L | Average processing latency in seconds |
| U | Desired utilization per worker |

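Example: for Q = 20 requests per second, L = 0.3 seconds, and U = 0.6, the formula gives W = (20 × 0.3) / 0.6 = 10, so you need at least 10 workers to stay within target utilization. When the result is not a whole number, round up.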
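
The formula can be expressed as a small helper. This is a sketch under stated assumptions: the function name `required_workers` is hypothetical, utilization is treated as a fraction in (0, 1], and the result is rounded up to a whole worker.

```python
import math


def required_workers(q_rps: float, latency_s: float, utilization: float) -> int:
    """Return the minimum worker count W = Q * L / U, rounded up."""
    if q_rps < 0 or latency_s < 0:
        raise ValueError("request rate and latency must be non-negative")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be a fraction in (0, 1]")
    raw = q_rps * latency_s / utilization
    # Round before taking the ceiling so floating-point noise
    # (for example 10.000000000000002) does not add a spurious worker.
    return math.ceil(round(raw, 9))


print(required_workers(20, 0.3, 0.6))  # prints 10
```

Because L is an average, the result is a floor for planning and not a guarantee; bursts and tail latency typically call for headroom on top of it.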

What is the most common production mistake?

Teams launch without route-level and tool-level telemetry. Without this, failures appear as generic model errors and are hard to isolate.
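
A minimal sketch of route-level and tool-level telemetry is shown here, assuming events are emitted as one JSON object per line. The function `trace_event` and its field names are illustrative assumptions and do not reflect the existing tracer in `src/observability/tracer.py`.

```python
import json
import time
from typing import Any, Dict, Optional


def trace_event(
    route: str,
    tool: Optional[str],
    status: str,
    latency_ms: float,
    **extra: Any,
) -> str:
    """Build one machine-readable trace record as a JSON string."""
    event: Dict[str, Any] = {
        "ts": time.time(),          # wall-clock timestamp in seconds
        "route": route,             # which route handled the request
        "tool": tool,               # which tool was called, if any
        "status": status,           # e.g. "ok" or "error"
        "latency_ms": latency_ms,
    }
    event.update(extra)             # optional fields such as error_type
    return json.dumps(event, sort_keys=True)


# Example: a tool timeout is attributed to a specific route and tool.
print(trace_event("search", "web_lookup", "error", 812.3, error_type="TimeoutError"))
```

With `route` and `tool` on every record, a failure can be filtered by either field instead of surfacing as a generic model error.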

How do we roll out safely?

Use staged releases with shadow traffic and automated rollback triggers. Promote a release to the next stage only once its SLO and error budget checks have stayed stable at the current stage.
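
A promotion gate of this kind might be sketched as follows. The `StageMetrics` type, the `should_promote` function, and every threshold below are hypothetical placeholders for illustration; real values would come from the team's own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageMetrics:
    requests: int
    errors: int
    p95_latency_ms: float


def should_promote(
    m: StageMetrics,
    slo_error_rate: float = 0.01,        # assumed SLO: at most 1% errors
    slo_p95_latency_ms: float = 1500.0,  # assumed SLO: P95 under 1.5 s
    min_requests: int = 500,             # assumed minimum sample size
) -> bool:
    """Return True only if the current stage meets both SLO checks."""
    if m.requests < min_requests:
        # Too little traffic to judge the stage; hold the release.
        return False
    error_rate = m.errors / m.requests
    return error_rate <= slo_error_rate and m.p95_latency_ms <= slo_p95_latency_ms
```

The same check, run continuously after promotion, can double as an automated rollback trigger when it starts returning `False`.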

What should be productionized first in this codebase?

Add structured observability and real authentication before expanding routing logic. Those two changes reduce operational and security risk immediately.

--8<-- "_abbreviations.md"