Module 18 — Retrieval-Augmented Generation: giving the model live knowledge¶
Fine-tuning (Module 17) updates the model's weights to reflect patterns in your data. But weights are fixed at training time. By the time a fine-tuned model is deployed, the world has already moved on: new patches have been released, new policies written, new incident patterns emerged.
Retrieval-Augmented Generation (RAG) solves a different problem: it gives the model access to information it could not have been trained on, at the moment it needs it.
The gap that RAG fills¶
On the Monday ticket, the model had to classify based entirely on the ticket text and whatever patterns it had learned during training. It didn't know:
- What the Friday patch notice actually said
- Whether other users had submitted similar tickets that morning
- What the current policy is for MFA-related lockouts
- Whether there's an active incident already open for the VPN issue
A model without RAG is reasoning in a vacuum. With RAG, it reasons with context.
How RAG works¶
At inference time — when the ticket arrives, before classification — a retrieval step runs:
1. Ticket arrives: "locked out + VPN dropping + patch notice last Friday"
2. Retrieval step:
→ convert ticket to embedding vector
→ search knowledge base for nearest neighbours
→ return top-3 most relevant documents
3. Retrieved documents:
[Doc 1] Patch notice from Friday:
"Security patch KB-2847 addresses authentication flow.
Known side effect: MFA tokens may require re-enrollment.
Estimated affected users: ~12%"
[Doc 2] Incident log opened 08:23 Monday:
"Multiple VPN disconnections reported — under investigation
by network team. Likely related to KB-2847."
[Doc 3] Policy: Account lockout + patch correlation:
"If lockout coincides with a patch deployment within 72h,
treat as potential security incident pending investigation."
4. Augmented prompt sent to classifier:
[System] + [Retrieved docs] + [Ticket text] → classification
The model now knows what the patch was, that an incident is already open, and that policy mandates escalation. The ticket that previously scored 0.039 on security_incident now scores much higher.
The knowledge base: what goes in it¶
RAG is only as good as what you retrieve. For a service desk copilot, the knowledge base should contain:
Document type Update frequency Priority
────────────────────────────────────────────────────────
Active incident logs Real-time Critical
Patch and change notices Per deployment High
Security bulletins Per release High
Routing policies Per policy update High
Past resolved tickets Daily Medium
Known error patterns Weekly Medium
User account history Real-time Situational
The key insight: documents that change frequently are exactly the ones the model can't have learned in training. Real-time incident logs, today's patch notice, this week's security bulletin — these are the highest-value RAG inputs.
Retrieval quality: the precision-recall trade-off¶
Retrieving too few documents risks missing the relevant one. Retrieving too many floods the context window (Module 1) with noise.
For the Monday ticket, test retrieval quality:
Retrieval experiment — top-k documents:
k=1: Retrieved patch notice (correct) — missed incident log
k=3: Retrieved patch notice + incident log + policy (all correct)
k=5: Retrieved patch notice + incident log + policy +
unrelated ticket from 3 weeks ago +
general VPN troubleshooting guide (noise added)
k=10: Context window fills with low-relevance documents,
classifier performance degrades
For this ticket type, k=3 is optimal. This will vary by ticket category — run the experiment for each major ticket type and set k per category.
Token budget impact¶
RAG has a direct cost in Module 1 terms. Each retrieved document consumes tokens:
Without RAG:
System prompt: ~600 tokens
Conversation history: ~80 tokens
Current ticket: ~70 tokens
Reserved output: ~500 tokens
────────────────────────────────────
Total: ~1,250 / 4,000 tokens
With RAG (k=3, avg doc length 200 tokens):
System prompt: ~600 tokens
Retrieved documents: ~600 tokens ← new
Conversation history: ~80 tokens
Current ticket: ~70 tokens
Reserved output: ~500 tokens
────────────────────────────────────
Total: ~1,850 / 4,000 tokens
Still within budget. But if retrieved documents are long (full policy documents, detailed incident logs), they can push you past the limit. Chunk documents into retrievable segments of ~150-200 tokens before indexing. Retrieve chunks, not whole documents.
RAG and the Monday ticket: before and after¶
Without RAG:
Ticket text only → logits [3.2, 2.8, 0.5] → probabilities [0.576, 0.386, 0.039]
H_norm = 0.74 → clarification queue
Analyst required
With RAG (k=3 retrieval):
Ticket + patch notice + incident log + policy →
logits [2.1, 2.6, 3.4] → probabilities [0.198, 0.328, 0.474]
Top class: security_incident (0.474)
H_norm = 0.81 → escalate immediately
Pipeline decision: escalate, do not wait for clarification
Analyst notified: high-entropy security escalation with supporting documents
The model didn't get smarter. It got better information. The classification shifted from uncertain-account-unlock to probable-security-incident because the context now included the policy that says patch + lockout within 72h = treat as security incident.
What RAG does not fix¶
RAG retrieves — it does not reason. If the retrieved documents are:
- Outdated: retrieval will surface wrong information confidently
- Irrelevant: noise in context degrades classification
- Contradictory: the model may average across conflicting signals incorrectly
- Missing: if the relevant document isn't in the knowledge base, RAG adds nothing
Maintain the knowledge base as carefully as you maintain the model. A stale knowledge base is worse than no RAG — it gives the model false confidence.
Checklist¶
- [ ] Is your knowledge base segmented by document type with appropriate update frequencies?
- [ ] Have you chunked documents into retrieval-appropriate sizes (≤200 tokens per chunk)?
- [ ] Have you measured optimal k per ticket category?
- [ ] Have you recalculated token budgets after adding retrieved documents?
- [ ] Do you have a staleness policy — how old can a document be before it's excluded from retrieval?
- [ ] Are retrieved documents shown to analysts alongside the classification, for transparency?
- [ ] Are retrieval misses (no relevant document found) logged for knowledge base gap analysis?
RAG gives the model knowledge it couldn't have been trained on. Fine-tuning gives the model patterns it kept getting wrong. Together they address two different kinds of gap — and both leave the cost and latency questions for Module 19.