03.01 · RAG Pipeline Deep Dive

Level: Intermediate
Pre-reading: 03 · RAG · 01.02 · Embeddings & Vector Search


Indexing Pipeline

The indexing pipeline runs offline (or on change events) to keep the vector index fresh.

graph TD
    A[Source: GitHub Repos] --> B[File Loader]
    C[Source: Confluence] --> B
    D[Source: JIRA] --> B
    B --> E[Preprocessor · sanitize, extract metadata]
    E --> F[Chunker · split by language-aware boundaries]
    F --> G[Embedding Model · nomic-embed-code]
    G --> H[Vector Store · Weaviate or pgvector]
    H --> I[Metadata Store · file path, service, last modified]
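The stages above can be sketched end-to-end in a few lines. This is an illustrative skeleton, not a production indexer: `preprocess` and the blank-line chunker are naive stand-ins, and `embed`/`store` represent whatever embedding model and vector store (Weaviate, pgvector) you actually use.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str
    metadata: dict = field(default_factory=dict)

def preprocess(raw: str) -> str:
    # Sanitize: strip trailing whitespace from every line.
    return "\n".join(line.rstrip() for line in raw.splitlines())

def chunk(text: str, meta: dict) -> list[Chunk]:
    # Placeholder chunker: one chunk per blank-line-separated block.
    return [Chunk(block, dict(meta)) for block in text.split("\n\n") if block.strip()]

def index_document(path: str, raw: str, embed, store) -> int:
    """Run one file through preprocess -> chunk -> embed -> store."""
    chunks = chunk(preprocess(raw), {"file": path})
    for c in chunks:
        store.append((embed(c.content), c))  # vector plus chunk with metadata
    return len(chunks)
```

In a real pipeline each stage would be swapped for the component in the diagram (Tree-sitter chunker, nomic-embed-code, a Weaviate client), but the data flow stays the same.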

Trigger strategies:

Trigger | When | Tradeoff
Full re-index | Weekly scheduled job | Simple but stale between runs
PR merge webhook | On every merged PR | Always fresh, more complex
File change watcher | On local save during development | Only useful for local dev agents
On-demand | Agent requests fresh index for a service | Accurate but adds latency to first run

Chunking Strategies in Practice

Code Chunking — Java / Spring Boot

Parse .java files with Tree-sitter and emit one chunk per class member:

Chunk 1:
  content: "public class OrderController { ... }"
  metadata: { file: "OrderController.java", class: "OrderController", service: "order-service" }

Chunk 2:
  content: "@PostMapping createOrder(@RequestBody OrderRequest req) { ... }"
  metadata: { file: "OrderController.java", method: "createOrder", line: 34 }
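A rough sketch of member-level chunking is below. It is a simplified brace-counting stand-in that assumes conventionally indented Java; real indexing should use Tree-sitter's Java grammar, which correctly handles strings, comments, and nested types.

```python
import re

# Matches the start of a top-level class member (method, annotation) in a
# file indented with four spaces. A crude proxy for a real parse tree.
MEMBER_START = re.compile(r"^\s{4}(public |private |protected |@)")

def chunk_java_members(source: str, file_name: str) -> list[dict]:
    chunks: list[dict] = []
    current: list[str] = []
    start_line = None
    depth = 0  # brace nesting depth; 1 = directly inside the class body
    for lineno, line in enumerate(source.splitlines(), start=1):
        if depth == 1 and MEMBER_START.match(line) and start_line is None:
            start_line = lineno
        if start_line is not None:
            current.append(line)
        depth += line.count("{") - line.count("}")
        # A member ends when its closing brace returns us to class level.
        if start_line is not None and depth == 1 and "}" in line:
            chunks.append({
                "content": "\n".join(current),
                "metadata": {"file": file_name, "line": start_line},
            })
            current, start_line = [], None
    return chunks
```

Each emitted chunk carries its file and starting line as metadata, matching the chunk examples above.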

Documentation Chunking — Confluence / Markdown

Use recursive character splitting with 512-token chunks and 64-token overlap:

Chunk 1: First 512 tokens of the page
Chunk 2: Tokens 448–960 (64-token overlap with Chunk 1)
...

Overlap ensures that context at chunk boundaries is not lost.
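The sliding window above can be sketched directly. This version operates on a pre-tokenized list; a real splitter would count model tokens (e.g. with a tokenizer matched to the embedding model) rather than whatever units you pass in.

```python
def split_with_overlap(tokens: list, chunk_size: int = 512, overlap: int = 64) -> list:
    """Sliding-window splitter: each chunk repeats the last `overlap`
    tokens of the previous chunk, so boundary context appears in both."""
    step = chunk_size - overlap  # advance 448 tokens per chunk by default
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```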


Metadata Filtering

Before running the vector similarity search, filter the corpus on metadata to reduce noise:

graph LR
    A[Query: fix bug in order-service] --> B[Extract: service=order-service]
    B --> C[Pre-filter: WHERE service = 'order-service']
    C --> D[Vector search on filtered subset]
    D --> E[Top-K results]

Metadata Is Critical

In a large codebase with 50+ microservices, searching all embeddings returns irrelevant results from other services. Always tag chunks with service, language, file_type, and last_modified for effective pre-filtering.
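The pre-filter-then-search flow can be sketched in pure Python. In practice the filter is a `WHERE` clause in pgvector or a `where` filter in Weaviate executed before the ANN search; this in-memory version just shows the ordering of the two steps.

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filtered_search(query_vec: list, index: list, service: str, k: int = 5) -> list:
    """Pre-filter by metadata, then rank only the surviving chunks."""
    candidates = [e for e in index if e["metadata"].get("service") == service]
    candidates.sort(key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return candidates[:k]
```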


Hybrid Search

Combine vector similarity (semantic) with keyword matching (BM25) for better recall.

Query Type | Best Search
"fix NullPointerException in payment flow" | Vector (semantic)
"PaymentServiceImpl.processRefund" | Keyword (exact identifier)
"slow checkout caused by DB lock" | Hybrid (semantic description + technical term)

LangChain's EnsembleRetriever and Weaviate's hybrid search mode both support this out of the box.
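One common way to combine the two result lists is reciprocal rank fusion (RRF), which is also what Weaviate uses under the hood for hybrid search. A minimal sketch, operating on document IDs:

```python
def reciprocal_rank_fusion(ranked_lists: list, k: int = 60) -> list:
    """Fuse multiple rankings: score(d) = sum over lists of 1 / (k + rank).

    Documents that appear high in either the vector or the BM25 ranking
    surface near the top of the fused list; k damps the effect of rank 1.
    """
    scores: dict = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```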


Reranking

After retrieval, a cross-encoder reranker re-scores each chunk against the query with higher accuracy than embedding similarity alone.

graph LR
    A[Query] --> B[Initial Retrieval: top-20 by vector similarity]
    B --> C[Reranker: score each of 20 against query]
    C --> D[Top-5 by reranker score injected into prompt]

Reranker | Provider | Notes
Rerank-3 | Cohere | Best-in-class, API-based
cross-encoder/ms-marco | HuggingFace | Open source, self-hosted
ColBERT | HuggingFace | Token-level matching, strong on code

The cost of reranking ~20 documents is small; the quality improvement is large.
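The retrieve-then-rerank shape is simple regardless of which reranker you pick. In this sketch `score_fn` is a placeholder for a real cross-encoder call (for example sentence-transformers' `CrossEncoder.predict`, or Cohere's rerank endpoint):

```python
def rerank(query: str, candidates: list, score_fn, keep: int = 5) -> list:
    """Re-score retrieved chunks against the query and keep the best few.

    score_fn(query, chunk) stands in for a cross-encoder; it sees the
    query and chunk together, unlike the bi-encoder used for retrieval.
    """
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]
```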


Prompt Augmentation Pattern

Once chunks are retrieved, inject them into the prompt with clear delimiters:

You are a Java developer fixing a bug in the order-service.

=== RELEVANT CODE CONTEXT ===
File: order-service/src/main/java/.../OrderController.java (lines 28-55)
[chunk content]

File: order-service/src/test/.../OrderControllerTest.java (lines 10-45)
[chunk content]
=== END CONTEXT ===

Using only the code above, diagnose and fix the following bug:
[JIRA ticket description]

The explicit delimiters help the model distinguish retrieved context from instructions.
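Assembling that prompt is straightforward string work; a minimal builder, assuming each retrieved chunk carries `file`, `lines`, and `content` fields:

```python
def build_prompt(role: str, chunks: list, task: str) -> str:
    """Assemble the augmented prompt with explicit context delimiters."""
    parts = [role, "", "=== RELEVANT CODE CONTEXT ==="]
    for c in chunks:
        parts.append(f"File: {c['file']} (lines {c['lines']})")
        parts.append(c["content"])
        parts.append("")
    parts.append("=== END CONTEXT ===")
    parts.append("")
    parts.append(f"Using only the code above, diagnose and fix the following bug:\n{task}")
    return "\n".join(parts)
```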


How do you handle stale embeddings when code changes?

Use a PR merge webhook to trigger re-indexing of only the changed files. Store last_modified as metadata on each chunk and periodically run a staleness check. For critical services, consider re-indexing on every commit to main.
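The webhook handler reduces to: drop stale chunks for the changed files, then re-embed only those files. In this sketch `reindex_file` is a hypothetical hook that re-chunks and re-embeds one file and returns fresh entries.

```python
def reindex_changed(changed_files: list, index: list, reindex_file) -> list:
    """On a PR-merge webhook, replace chunks belonging to changed files.

    Untouched files keep their existing embeddings, so the cost scales
    with the size of the diff rather than the size of the repo.
    """
    changed = set(changed_files)
    kept = [e for e in index if e["metadata"]["file"] not in changed]
    for path in changed:
        kept.extend(reindex_file(path))
    return kept
```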

What is the right chunk size for Java code?

For method-level chunks: aim for 200–800 tokens. Include the method signature, body, and immediately relevant context (class-level annotations, fields referenced). For controller classes, one chunk per endpoint method works well. Avoid chunking in the middle of a method body.