03.02 · Advanced RAG Patterns
Level: Advanced
Pre-reading: 03 · RAG · 03.01 · RAG Pipeline
Naive RAG Limitations
Basic RAG (embed → search → inject) degrades in several scenarios:
| Scenario | Problem | Advanced Solution |
|---|---|---|
| Query is vague | Embeddings don't match relevant code | HyDE — generate hypothetical answer first |
| Multi-hop reasoning | Answer requires reading 3+ connected files | Multi-hop retrieval / graph RAG |
| Code generates wrong query | Identifier mismatch (camelCase vs spaced) | Self-query with structured filters |
| Retrieved chunks are too noisy | Low signal-to-noise ratio | Contextual compression |
| Agent needs to verify a fact | Single chunk insufficient | Iterative retrieval |
HyDE — Hypothetical Document Embeddings
Instead of embedding the raw query, ask the LLM to generate what the ideal answer might look like, then embed and search for that.
graph LR
A[Query: Where is order total calculation?] --> B[LLM: generate hypothetical code snippet]
B --> C["hypothetical: 'BigDecimal calculateTotal(List items)...'"]
C --> D[Embed hypothetical → vector search]
D --> E[Much better retrieval match]
Why it works: the hypothetical document shares vocabulary (class names, method names) with the actual code, so its embedding lands much closer to the real files than the natural-language query's would.
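A minimal sketch of the HyDE step. The `llm_complete`, `embed`, and `vector_search` callables are hypothetical stand-ins for your LLM client, embedding model, and vector store:

```python
# HyDE sketch: embed a hypothetical answer instead of the raw query.
from typing import Callable, Sequence

def hyde_retrieve(
    query: str,
    llm_complete: Callable[[str], str],        # your LLM call
    embed: Callable[[str], Sequence[float]],   # your embedding model
    vector_search: Callable[..., list[str]],   # your vector-store query
    k: int = 5,
) -> list[str]:
    # 1. Ask the LLM to write what the ideal answer might look like.
    hypothetical = llm_complete(
        f"Write a plausible code snippet that would answer:\n{query}\n"
        "Use realistic class and method names. Do not explain."
    )
    # 2. Embed the hypothetical document, not the original query.
    vector = embed(hypothetical)
    # 3. Search the index with the hypothetical's embedding.
    return vector_search(vector, top_k=k)
```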
Multi-Hop Retrieval
Some bug investigations require following a chain of references:
graph LR
A[Ticket: Bug in checkout] --> B[Retrieve: CheckoutService]
B --> C[Found reference to PaymentGateway.charge]
C --> D[Retrieve: PaymentGateway]
D --> E[Found external API call to StripeClient]
E --> F[Retrieve: StripeClient]
F --> G[Full context: 3 hops deep]
The agent iteratively retrieves, reads, identifies cross-references, and retrieves again. LangGraph makes this loop easy to express as a graph with conditional edges.
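A sketch of that loop as a LangGraph graph, assuming the current `StateGraph` API; `retrieve` and `find_references` are trivial stand-ins you would wire to a real vector store and a cross-reference extractor:

```python
# Multi-hop retrieval loop with a conditional edge and a hop budget.
from typing import TypedDict
from langgraph.graph import StateGraph, END

MAX_HOPS = 3  # stop following references after three retrievals

def retrieve(symbol: str) -> str:
    # Stand-in for a vector-store lookup; returns the chunk for a symbol.
    return f"<code chunk for {symbol}>"

def find_references(chunk: str) -> list[str]:
    # Stand-in for cross-reference extraction (e.g., regex or AST scan).
    return []

class HopState(TypedDict):
    queue: list[str]    # symbols still to retrieve, e.g. ["CheckoutService"]
    context: list[str]  # accumulated chunks
    hops: int

def retrieve_node(state: HopState) -> dict:
    symbol, *rest = state["queue"]
    chunk = retrieve(symbol)
    return {
        "queue": rest + find_references(chunk),  # follow cross-references
        "context": state["context"] + [chunk],
        "hops": state["hops"] + 1,
    }

def should_continue(state: HopState) -> str:
    # Conditional edge: loop while references remain and budget allows.
    return "retrieve" if state["queue"] and state["hops"] < MAX_HOPS else "done"

graph = StateGraph(HopState)
graph.add_node("retrieve", retrieve_node)
graph.set_entry_point("retrieve")
graph.add_conditional_edges("retrieve", should_continue,
                            {"retrieve": "retrieve", "done": END})
app = graph.compile()
# app.invoke({"queue": ["CheckoutService"], "context": [], "hops": 0})
```

The hop budget (`MAX_HOPS`) is the important design choice: without it, a densely connected codebase can send the loop arbitrarily deep.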
Self-Query Retrieval
For structured codebases with rich metadata, let the LLM generate a structured filter query instead of relying purely on vector search.
User query: "Find all Kafka consumer methods in the notification service"
LLM generates:
{
"filter": { "service": "notification-service", "annotation": "@KafkaListener" },
"semantic_query": "Kafka consumer message processing"
}
The filter drastically narrows the search space; the semantic query handles the content match.
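A sketch of the self-query step, assuming hypothetical `llm_complete` and `vector_search` helpers; the vector store is assumed to accept a metadata filter alongside the semantic query:

```python
# Self-query sketch: the LLM turns a natural-language request into a
# structured metadata filter plus a free-text semantic query.
import json
from typing import Callable

PROMPT = """Convert the user query into JSON with two keys:
"filter" -- metadata constraints (service, annotation, language, ...)
"semantic_query" -- free text for embedding search.
Query: {query}
JSON:"""

def self_query(query: str,
               llm_complete: Callable[[str], str],
               vector_search: Callable[..., list[str]]) -> list[str]:
    parsed = json.loads(llm_complete(PROMPT.format(query=query)))
    # The filter prunes the index; the semantic query ranks what survives.
    return vector_search(parsed["semantic_query"],
                         metadata_filter=parsed["filter"])
```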
Contextual Compression
Retrieved chunks often contain boilerplate that adds noise. Before injecting into the prompt, compress each chunk to keep only the relevant portion.
graph LR
A[Query: fix null check in createOrder] --> B[Retrieved: full OrderController.java]
B --> C[Compression LLM: extract only createOrder method]
C --> D[Compressed chunk: 100 tokens instead of 800]
D --> E[Inject into main prompt]
Use a cheap, fast model (e.g., GPT-4o mini) for compression. This cuts the tokens sent to the expensive model while improving signal quality.
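A sketch of the compression pass; `cheap_llm` is a hypothetical helper wrapping the inexpensive model:

```python
# Contextual compression sketch: a cheap model trims each retrieved chunk
# down to the lines relevant to the query before prompt assembly.
from typing import Callable

def compress_chunks(query: str, chunks: list[str],
                    cheap_llm: Callable[[str], str]) -> list[str]:
    compressed = []
    for chunk in chunks:
        result = cheap_llm(
            f"Question: {query}\n\nExtract only the lines of the following "
            "code that are relevant to the question. If nothing is relevant, "
            f"reply NOT_RELEVANT.\n\n{chunk}"
        )
        if result.strip() != "NOT_RELEVANT":
            compressed.append(result)  # e.g. ~100 tokens instead of ~800
    return compressed
```

Dropping chunks that come back `NOT_RELEVANT` doubles as a relevance filter, so compression also reduces the chunk count, not just the chunk size.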
RAPTOR — Recursive Abstractive Processing for Tree-Organized Retrieval
For very large codebases (1000+ files), standard flat retrieval can miss high-level patterns.
RAPTOR builds a tree of summaries:
Level 0: Raw code chunks (leaf nodes)
Level 1: Method-level summaries
Level 2: Class-level summaries
Level 3: Service-level summaries
Level 4: Domain/architecture summary
Retrieval can happen at any level, then drill down. Useful for answering "what service handles payments?" (level 3–4) vs. "how is the payment amount validated?" (level 0–1).
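A heavily simplified sketch of the tree build. Real RAPTOR clusters nodes by embedding similarity before summarising; this version just groups neighbours, and `summarize` is a hypothetical LLM helper:

```python
# RAPTOR-style summary tree: summarise groups of nodes level by level
# until a single root summary remains.
from typing import Callable

def build_summary_tree(chunks: list[str],
                       summarize: Callable[[list[str]], str],
                       group_size: int = 4) -> list[list[str]]:
    levels = [chunks]                      # level 0: raw code chunks
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = [
            summarize(current[i:i + group_size])
            for i in range(0, len(current), group_size)
        ]
        levels.append(parents)             # level n+1: summaries of level n
    return levels                          # root summary = levels[-1][0]
```

Indexing every level in the same vector store with a `level` metadata field lets retrieval target architecture-level questions and code-level questions separately.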
Agentic RAG
The natural evolution: instead of RAG being a fixed pre-processing step, the agent decides when and what to retrieve as part of its ReAct loop.
graph TD
A[Agent: Thought] --> B{Do I have enough context?}
B -->|No| C[Action: retrieve_context query]
C --> D[Observation: relevant chunks]
D --> A
B -->|Yes| E[Action: generate_solution]
This is more flexible but requires careful loop budgeting to avoid excessive retrieval calls.
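A sketch of that budgeted loop in plain Python; `decide`, `retrieve`, and `generate` are hypothetical helpers standing in for the agent's thought step, the retrieval tool, and the final generation:

```python
# Agentic RAG sketch: the agent decides per iteration whether to retrieve
# more context, under a hard retrieval budget.
from typing import Callable

MAX_RETRIEVALS = 4  # loop budget: cap retrieval calls per task

def agentic_rag(task: str,
                decide: Callable[[str, list[str]], str],  # query or "ENOUGH"
                retrieve: Callable[[str], list[str]],
                generate: Callable[[str, list[str]], str]) -> str:
    context: list[str] = []
    for _ in range(MAX_RETRIEVALS):
        decision = decide(task, context)   # "Do I have enough context?"
        if decision == "ENOUGH":
            break
        context += retrieve(decision)      # the decision doubles as the query
    return generate(task, context)         # Action: generate_solution
```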
When should you use HyDE vs. standard embedding search?
Use HyDE when your users ask natural language questions that are semantically far from technical code (e.g., "where is the billing logic?" vs. BillingService.calculateInvoice()). In practice, HyDE adds ~1 LLM call of latency — acceptable for interactive queries, potentially too slow for batch pipeline steps.
How do you evaluate RAG quality?
Use RAGAS (Retrieval-Augmented Generation Assessment) — a framework that measures: Faithfulness (does the answer match the retrieved context?), Answer Relevancy (is the answer relevant to the query?), Context Recall (did retrieval find everything needed?). Set minimum threshold scores as a CI gate on your RAG pipeline.
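A sketch of such a CI gate, assuming the classic RAGAS `evaluate()` API with question/answer/contexts/ground_truth columns (newer releases restructure this around `EvaluationDataset`, and `evaluate()` needs an LLM configured, e.g. via `OPENAI_API_KEY`); the sample data and threshold values are illustrative:

```python
# RAG quality gate: fail the CI job if any RAGAS score drops below threshold.
import sys
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

samples = Dataset.from_dict({
    "question":     ["Where is the order total calculated?"],
    "answer":       ["In OrderService.calculateTotal(...)"],
    "contexts":     [["public BigDecimal calculateTotal(List<Item> items)"]],
    "ground_truth": ["OrderService.calculateTotal"],
})

result = evaluate(samples,
                  metrics=[faithfulness, answer_relevancy, context_recall])
scores = result.to_pandas().mean(numeric_only=True)

THRESHOLDS = {"faithfulness": 0.8, "answer_relevancy": 0.7,
              "context_recall": 0.7}
failures = {m: float(scores[m]) for m in THRESHOLDS if scores[m] < THRESHOLDS[m]}
if failures:
    sys.exit(f"RAG quality gate failed: {failures}")  # non-zero exit fails CI
```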