Skip to content

Module 19 — Cost & Latency: what does this actually cost to run?

Every capability added in the previous modules — RAG retrieval, attribution scoring, adversarial checks, draft generation — consumes time and money. Module 19 is where the engineering decisions of the whole course get priced.

The question is not whether these capabilities are worth having. They are. The question is: at what point does the cost of running them exceed the value they deliver, and how do you find that point before it finds you?


Token cost: the Monday ticket end to end

At full pipeline capability (Modules 1–18 active), the Monday ticket consumes:

Component                           Tokens (input)    Tokens (output)
────────────────────────────────────────────────────────────────────
System prompt                            600               —
RAG retrieved documents (k=3)            600               —
Conversation history                      80               —
Ticket text                               70               —
Classification output                     —                30
Attribution scores                        —                80
Draft responses (×3)                      —               240
Resolution time estimate                  —                20
────────────────────────────────────────────────────────────────────
Total                                  1,350              370
Grand total: ~1,720 tokens per ticket

At current API pricing, that's a known cost per ticket. Multiply by volume:

Daily ticket volume:          500
Tokens per ticket:          1,720
Daily token consumption:  860,000

Monthly (22 working days): ~18,900,000 tokens

That number needs to sit in front of whoever approves the infrastructure budget before the system goes live — not after.


Latency: where the time goes

Token cost is the budget question. Latency is the user experience question. Your analyst needs a routed ticket with draft responses before their 10 AM call. The pipeline has to finish in time.

Step                          Typical latency
────────────────────────────────────────────
Tokenization [M01]                  <10ms
RAG retrieval [M18]               80–200ms
Model inference (classification)  400–900ms
Attribution scoring [M14]         100–300ms
Draft generation (×3) [M06]       600–1200ms
Resolution estimate [M07]           <10ms
────────────────────────────────────────────
Total pipeline:               ~1.2–2.6 seconds

For a ticket arriving at 9:47 AM with a 10 AM deadline, 2.6 seconds is fine. For a high-volume queue processing 500 tickets simultaneously, 2.6 seconds per ticket with sequential processing means the last ticket waits 21 minutes. Parallelise or the SLA breaks.


The cost-capability trade-off matrix

Not every ticket needs the full pipeline. Build a tiered execution model:

Tier 1 — Fast path (high confidence tickets):
  Conditions: top_prob ≥ 0.80, H_norm ≤ 0.35, no adversarial flags
  Steps run:  M01 tokenize → M03 logits → M02/M04 checks → M06 auto-route
  Skipped:    RAG, attribution, draft generation, resolution estimate
  Tokens:     ~750 per ticket
  Latency:    ~500ms
  % of volume: ~45% of tickets

Tier 2 — Standard path (clarification queue):
  Conditions: top_prob < 0.80 OR H_norm > 0.35
  Steps run:  Full pipeline
  Tokens:     ~1,720 per ticket
  Latency:    ~2.0s
  % of volume: ~48% of tickets

Tier 3 — Full path (escalation + adversarial flags):
  Conditions: H_norm ≥ 0.80 OR adversarial flag OR high-FNR segment
  Steps run:  Full pipeline + extended RAG (k=5) + senior analyst alert
  Tokens:     ~2,200 per ticket
  Latency:    ~2.8s
  % of volume: ~7% of tickets

Blended cost across tiers:

(0.45 × 750) + (0.48 × 1,720) + (0.07 × 2,200)
= 337.5 + 825.6 + 154
= ~1,317 tokens per ticket (average)

vs 1,720 tokens if every ticket ran the full pipeline. A 23% cost reduction by routing tickets to the appropriate processing tier.


The latency budget for the Monday ticket

The Monday ticket is Tier 2 — standard path. At 9:47 AM:

09:47:00.000  Ticket arrives
09:47:00.010  Tokenization complete
09:47:00.190  RAG retrieval complete (3 documents returned)
09:47:00.890  Classification complete (logits → softmax → routing decision)
09:47:01.140  Attribution scoring complete (flags generated)
09:47:02.340  Draft responses generated (×3)
09:47:02.350  Resolution estimate calculated
09:47:02.350  Ticket appears in analyst queue with full context

Total: 2.35 seconds from arrival to analyst view. Analyst has 12 minutes and 57 seconds before the 10 AM call.


Where costs grow unexpectedly

Conversation history accumulation (Module 1): as users reply to tickets, history grows. A ticket with 10 back-and-forth messages adds ~400 tokens of history. At scale, implement the summarisation strategy from Module 1 — don't let history grow unbounded.

RAG document length: if policy documents are retrieved in full rather than as chunks, a single document can be 800–1,200 tokens. Three such documents fill the context window. Enforce chunk size at indexing time (Module 18).

Draft generation at temperature 1.0: stochastic generation is inherently slower than deterministic inference. Generating three drafts costs approximately 3× the output token budget. If latency is critical, generate one draft deterministically and offer "generate alternatives" as an on-demand action.

Concurrent volume spikes: Monday mornings, post-patch deployments, and incident-adjacent periods all cause volume spikes. A 3× spike with sequential processing triples your latency. Design for burst capacity, not average load.


Monitoring cost and latency in production

These are not set-and-forget metrics. Add them to your weekly operations review alongside accuracy and FNR:

Weekly cost and latency report:

Metric                          Target      This week
──────────────────────────────────────────────────────
Avg tokens per ticket            ≤1,400      1,317 ✅
Monthly token projection      ≤20M/mo    ~14.5M ✅
Tier 1 ticket share              ≥40%        45% ✅
P95 latency (Tier 2)              ≤3.0s       2.4s ✅
P95 latency (Tier 3)              ≤4.0s       3.1s ✅
Latency SLA breaches              0            0 ✅
RAG retrieval timeout rate        <0.5%       0.3% ✅

If Tier 1 share drops significantly, your ticket population is getting more ambiguous — investigate whether that's real (harder tickets) or a model degradation signal (Module 20).


Checklist

  • [ ] Have you calculated total token consumption at expected daily and monthly volume?
  • [ ] Is the tiered execution model implemented — are high-confidence tickets skipping expensive steps?
  • [ ] Have you parallelised pipeline execution where steps are independent?
  • [ ] Are conversation histories being summarised to prevent unbounded token growth?
  • [ ] Are RAG documents chunked to enforce maximum retrieval token cost?
  • [ ] Are cost and latency metrics in your weekly operations review?
  • [ ] Have you capacity-planned for volume spikes (Monday mornings, post-patch periods)?

Cost and latency are not obstacles to building the right system. They are constraints that force you to be precise about which capabilities matter most, for which tickets, under which conditions.

Module 20 addresses what happens over time: ticket patterns shift, language evolves, new incident types emerge — and a model that was well-calibrated in week one starts to drift.