03.01 · RAG Pipeline Deep Dive

Level: Intermediate
Pre-reading: 03 · RAG · 01.02 · Embeddings & Vector Search


Indexing Pipeline

The indexing pipeline runs offline (or on change events) to keep the vector index fresh.

graph TD
    A[Source: GitHub Repos] --> B[File Loader]
    C[Source: Confluence] --> B
    D[Source: JIRA] --> B
    B --> E[Preprocessor · sanitize, extract metadata]
    E --> F[Chunker · split by language-aware boundaries]
    F --> G[Embedding Model · nomic-embed-code]
    G --> H[Vector Store · Weaviate or pgvector]
    H --> I[Metadata Store · file path, service, last modified]
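The stages above can be sketched end-to-end in a few lines. This is an illustrative skeleton, not a production indexer: `preprocess` and the blank-line chunker are naive stand-ins, and `embed`/`store` represent whatever embedding model and vector store (Weaviate, pgvector) you actually use.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str
    metadata: dict = field(default_factory=dict)

def preprocess(raw: str) -> str:
    # Sanitize: strip trailing whitespace from every line.
    return "\n".join(line.rstrip() for line in raw.splitlines())

def chunk(text: str, meta: dict) -> list[Chunk]:
    # Placeholder chunker: one chunk per blank-line-separated block.
    return [Chunk(block, dict(meta)) for block in text.split("\n\n") if block.strip()]

def index_document(path: str, raw: str, embed, store) -> int:
    """Run one file through preprocess -> chunk -> embed -> store."""
    chunks = chunk(preprocess(raw), {"file": path})
    for c in chunks:
        store.append((embed(c.content), c))  # vector plus chunk with metadata
    return len(chunks)
```

In a real pipeline each stage would be swapped for the component in the diagram (Tree-sitter chunker, nomic-embed-code, a Weaviate client), but the data flow stays the same.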

Trigger strategies:

Trigger | When | Tradeoff
Full re-index | Weekly scheduled job | Simple but stale between runs
PR merge webhook | On every merged PR | Always fresh, more complex
File change watcher | On local save during development | Only useful for local dev agents
On-demand | Agent requests fresh index for a service | Accurate but adds latency to first run

Chunking Strategies in Practice

Code Chunking — Java / Spring Boot

Parse .java files with Tree-sitter and emit one chunk per class member:

Chunk 1:
  content: "public class OrderController { ... }"
  metadata: { file: "OrderController.java", class: "OrderController", service: "order-service" }

Chunk 2:
  content: "@PostMapping createOrder(@RequestBody OrderRequest req) { ... }"
  metadata: { file: "OrderController.java", method: "createOrder", line: 34 }
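A rough sketch of member-level chunking is below. It is a simplified brace-counting stand-in that assumes conventionally indented Java; real indexing should use Tree-sitter's Java grammar, which correctly handles strings, comments, and nested types.

```python
import re

# Matches the start of a top-level class member (method, annotation) in a
# file indented with four spaces. A crude proxy for a real parse tree.
MEMBER_START = re.compile(r"^\s{4}(public |private |protected |@)")

def chunk_java_members(source: str, file_name: str) -> list[dict]:
    chunks: list[dict] = []
    current: list[str] = []
    start_line = None
    depth = 0  # brace nesting depth; 1 = directly inside the class body
    for lineno, line in enumerate(source.splitlines(), start=1):
        if depth == 1 and MEMBER_START.match(line) and start_line is None:
            start_line = lineno
        if start_line is not None:
            current.append(line)
        depth += line.count("{") - line.count("}")
        # A member ends when its closing brace returns us to class level.
        if start_line is not None and depth == 1 and "}" in line:
            chunks.append({
                "content": "\n".join(current),
                "metadata": {"file": file_name, "line": start_line},
            })
            current, start_line = [], None
    return chunks
```

Each emitted chunk carries its file and starting line as metadata, matching the chunk examples above.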

Documentation Chunking — Confluence / Markdown

Use recursive character splitting with 512-token chunks and 64-token overlap:

Chunk 1: First 512 tokens of the page
Chunk 2: Tokens 448–960 (64-token overlap with Chunk 1)
...

Overlap ensures that context at chunk boundaries is not lost.
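The sliding window above can be sketched directly. This version operates on a pre-tokenized list; a real splitter would count model tokens (e.g. with a tokenizer matched to the embedding model) rather than whatever units you pass in.

```python
def split_with_overlap(tokens: list, chunk_size: int = 512, overlap: int = 64) -> list:
    """Sliding-window splitter: each chunk repeats the last `overlap`
    tokens of the previous chunk, so boundary context appears in both."""
    step = chunk_size - overlap  # advance 448 tokens per chunk by default
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```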


Metadata Filtering

Before running the vector similarity search, filter the corpus on metadata to reduce noise:

graph LR
    A[Query: fix bug in order-service] --> B[Extract: service=order-service]
    B --> C[Pre-filter: WHERE service = 'order-service']
    C --> D[Vector search on filtered subset]
    D --> E[Top-K results]

Metadata Is Critical

In a large codebase with 50+ microservices, searching all embeddings returns irrelevant results from other services. Always tag chunks with service, language, file_type, and last_modified for effective pre-filtering.
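The pre-filter-then-search flow can be sketched in pure Python. In practice the filter is a `WHERE` clause in pgvector or a `where` filter in Weaviate executed before the ANN search; this in-memory version just shows the ordering of the two steps.

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filtered_search(query_vec: list, index: list, service: str, k: int = 5) -> list:
    """Pre-filter by metadata, then rank only the surviving chunks."""
    candidates = [e for e in index if e["metadata"].get("service") == service]
    candidates.sort(key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return candidates[:k]
```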


Hybrid Search

Combine vector similarity (semantic) with keyword matching (BM25) for better recall.

Query Type | Best Search
"fix NullPointerException in payment flow" | Vector (semantic)
"PaymentServiceImpl.processRefund" | Keyword (exact identifier)
"slow checkout caused by DB lock" | Hybrid (semantic description + technical term)

LangChain's EnsembleRetriever and Weaviate's hybrid search mode both support this out of the box.
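One common way to combine the two result lists is reciprocal rank fusion (RRF), which is also what Weaviate uses under the hood for hybrid search. A minimal sketch, operating on document IDs:

```python
def reciprocal_rank_fusion(ranked_lists: list, k: int = 60) -> list:
    """Fuse multiple rankings: score(d) = sum over lists of 1 / (k + rank).

    Documents that appear high in either the vector or the BM25 ranking
    surface near the top of the fused list; k damps the effect of rank 1.
    """
    scores: dict = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```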


Reranking

After retrieval, a cross-encoder reranker re-scores each chunk against the query with higher accuracy than embedding similarity alone.

graph LR
    A[Query] --> B[Initial Retrieval: top-20 by vector similarity]
    B --> C[Reranker: score each of 20 against query]
    C --> D[Top-5 by reranker score injected into prompt]

Reranker | Provider | Notes
Rerank-3 | Cohere | Best-in-class, API-based
cross-encoder/ms-marco | HuggingFace | Open source, self-hosted
ColBERT | HuggingFace | Token-level matching, strong on code

The cost of reranking ~20 documents is small; the quality improvement is large.
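The retrieve-then-rerank shape is simple regardless of which reranker you pick. In this sketch `score_fn` is a placeholder for a real cross-encoder call (for example sentence-transformers' `CrossEncoder.predict`, or Cohere's rerank endpoint):

```python
def rerank(query: str, candidates: list, score_fn, keep: int = 5) -> list:
    """Re-score retrieved chunks against the query and keep the best few.

    score_fn(query, chunk) stands in for a cross-encoder; it sees the
    query and chunk together, unlike the bi-encoder used for retrieval.
    """
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]
```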


Prompt Augmentation Pattern

Once chunks are retrieved, inject them into the prompt with clear delimiters:

You are a Java developer fixing a bug in the order-service.

=== RELEVANT CODE CONTEXT ===
File: order-service/src/main/java/.../OrderController.java (lines 28-55)
[chunk content]

File: order-service/src/test/.../OrderControllerTest.java (lines 10-45)
[chunk content]
=== END CONTEXT ===

Using only the code above, diagnose and fix the following bug:
[JIRA ticket description]

The explicit delimiters help the model distinguish retrieved context from instructions.
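Assembling that prompt is straightforward string work; a minimal builder, assuming each retrieved chunk carries `file`, `lines`, and `content` fields:

```python
def build_prompt(role: str, chunks: list, task: str) -> str:
    """Assemble the augmented prompt with explicit context delimiters."""
    parts = [role, "", "=== RELEVANT CODE CONTEXT ==="]
    for c in chunks:
        parts.append(f"File: {c['file']} (lines {c['lines']})")
        parts.append(c["content"])
        parts.append("")
    parts.append("=== END CONTEXT ===")
    parts.append("")
    parts.append(f"Using only the code above, diagnose and fix the following bug:\n{task}")
    return "\n".join(parts)
```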


How do you handle stale embeddings when code changes?

Use a PR merge webhook to trigger re-indexing of only the changed files. Store last_modified as metadata on each chunk and periodically run a staleness check. For critical services, consider re-indexing on every commit to main.
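The webhook handler reduces to: drop stale chunks for the changed files, then re-embed only those files. In this sketch `reindex_file` is a hypothetical hook that re-chunks and re-embeds one file and returns fresh entries.

```python
def reindex_changed(changed_files: list, index: list, reindex_file) -> list:
    """On a PR-merge webhook, replace chunks belonging to changed files.

    Untouched files keep their existing embeddings, so the cost scales
    with the size of the diff rather than the size of the repo.
    """
    changed = set(changed_files)
    kept = [e for e in index if e["metadata"]["file"] not in changed]
    for path in changed:
        kept.extend(reindex_file(path))
    return kept
```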

What is the right chunk size for Java code?

For method-level chunks: aim for 200–800 tokens. Include the method signature, body, and immediately relevant context (class-level annotations, fields referenced). For controller classes, one chunk per endpoint method works well. Avoid chunking in the middle of a method body.