01.02 · Embeddings & Vector Search — Deep Dive

Level: Intermediate
Pre-reading: 01 · AI & LLM Foundations · 01.01 · How LLMs Work


What Are Embeddings?

An embedding is a dense numerical vector that represents text (or code) in a high-dimensional space, where texts with similar meaning map to vectors pointing in similar directions.

```mermaid
graph LR
    A["'fix NullPointerException'"] --> B[Embedding Model]
    B --> C["[0.12, -0.83, 0.41, ...]  1536-dim vector"]
    D["'handle null pointer error'"] --> B
    B --> E["[0.13, -0.81, 0.44, ...]  very similar!"]
```

Two sentences that are semantically related produce vectors with high cosine similarity, even if they share no exact words.


Why Embeddings Matter for Dev Automation

| Use Case | How Embeddings Help |
|---|---|
| Find relevant code | Embed entire codebase, search by natural language query |
| JIRA ticket → microservice | Embed all service READMEs, find closest match to ticket description |
| Similar bug lookup | Find past bugs that are semantically similar to a new one |
| Semantic test deduplication | Identify test cases testing the same behaviour |
| Documentation retrieval | Pull the right Confluence page into the agent's context |

Embedding Models

| Model | Dims | Strengths | Provider |
|---|---|---|---|
| text-embedding-3-large | 3072 | Best quality, code + text | OpenAI |
| text-embedding-3-small | 1536 | Cost-efficient, still strong | OpenAI |
| embed-english-v3.0 | 1024 | Best-in-class for RAG retrieval | Cohere |
| all-MiniLM-L6-v2 | 384 | Fast, small, self-hosted | Sentence Transformers |
| nomic-embed-code | 768 | Optimised for source code | Nomic AI |

Code Embeddings

For indexing a Java/Spring Boot codebase, use a code-specific embedding model like nomic-embed-code, or a strong general-purpose model like OpenAI's text-embedding-3-large. These produce more discriminative similarity scores for function signatures and class names than small general-purpose models.
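As a hedged sketch, embedding a batch of code chunks with the OpenAI SDK might look like the following. The `embed_code_chunks` wrapper and the example chunk are illustrative, and the code assumes the `openai` package is installed and `OPENAI_API_KEY` is set:

```python
import math

def embed_code_chunks(chunks, model="text-embedding-3-large"):
    """Illustrative wrapper around OpenAI's embeddings endpoint.
    Assumes the `openai` package and an OPENAI_API_KEY environment variable."""
    from openai import OpenAI  # imported lazily so the pure helper below works offline
    client = OpenAI()
    response = client.embeddings.create(model=model, input=chunks)
    return [item.embedding for item in response.data]

def normalize(vec):
    """Scale a vector to unit length; on unit vectors, cosine similarity
    reduces to a plain dot product, which simplifies downstream search."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

# Usage (requires network access and an API key):
#   vectors = [normalize(v) for v in
#              embed_code_chunks(["public User findById(Long id) { ... }"])]
```

Normalising at index time is a common choice: it lets the vector store rank by inner product instead of recomputing norms per query.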


Vector Databases

Vector databases store embeddings and support approximate nearest neighbour (ANN) search at scale.

| Database | Hosting | Strengths | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Simple API, serverless tier | SaaS products, fast start |
| Weaviate | Self-hosted or cloud | GraphQL API, multi-tenancy | Enterprise, complex schemas |
| Qdrant | Self-hosted or cloud | Rust-based, very fast | High-perf, self-hosted |
| ChromaDB | Self-hosted | Embeddable, zero config | Prototypes, local dev |
| pgvector | Postgres extension | Stays in existing DB | When you already use PostgreSQL |
| Redis Vector | Redis Stack | Low-latency, in-memory | Session-level search |

How Similarity Search Works

```mermaid
graph TD
    A[Query: fix login bug] --> B[Embed query → vector Q]
    C[Code Corpus] --> D[Embed each file/chunk → vectors V1..Vn stored in DB]
    B --> E[ANN Search: top-k vectors closest to Q]
    D --> E
    E --> F[Return top 5 relevant code chunks]
    F --> G[Inject into LLM context as RAG]
```

Cosine similarity is the standard similarity metric (vector databases typically index on its complement, cosine distance):

\[ \text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|} \]

The score ranges from -1 (opposite) to 1 (identical). In practice, relevant results typically score above ~0.7, though the useful threshold varies by embedding model and domain.
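A minimal pure-Python sketch of the formula, plus a brute-force top-k search over a toy corpus (real vector databases replace the linear scan with ANN indexes such as HNSW, and the 3-dim "embeddings" here are illustrative only):

```python
import math

def cosine_similarity(a, b):
    # similarity(A, B) = (A · B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=5):
    """Brute-force nearest neighbours; ANN indexes make this sub-linear at scale."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in corpus.items()]
    return sorted(scored, reverse=True)[:k]

# Toy 3-dim vectors standing in for real embeddings
corpus = {
    "AuthService.java": [0.9, 0.1, 0.0],
    "UserRepo.java":    [0.1, 0.9, 0.2],
    "LoginFilter.java": [0.8, 0.2, 0.1],
}
print(top_k([0.85, 0.15, 0.05], corpus, k=2))
```

With this query vector, the two auth-related files score near 1.0 while `UserRepo.java` scores far lower, which is exactly the behaviour the diagram above relies on.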


Chunking Strategy

How you split documents before embedding significantly affects retrieval quality.

| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens | Simple, predictable |
| Sentence | Split on sentence boundaries | Prose documentation |
| Recursive character | Split on `\n\n`, `\n`, `. ` in order of preference | Mixed content |
| Semantic | Group sentences by meaning similarity | High quality, expensive |
| Code-aware | Split on class/method boundaries | Source code indexing |
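The recursive-character strategy can be sketched in a few lines. This is a simplified illustration (inspired by, but not identical to, LangChain's RecursiveCharacterTextSplitter), and it drops the separators themselves for brevity:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator first; recurse with finer separators
    for pieces that are still too long; hard-cut as a last resort."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(recursive_split(piece, max_len, separators))
            return [c for c in chunks if c.strip()]
    # No separator left to try: hard cut at max_len characters
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Coarse separators (paragraph breaks) are tried first so that chunks follow the document's natural structure, and character-level cutting only happens when nothing else works.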

Chunk Size Trade-off

Too small → relevant context split across chunks, loses coherence.
Too large → a chunk covers many topics, retrieval becomes noisy.
For Java code: split at class boundaries (~500–1500 tokens per class).
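A class-boundary chunker for Java can be sketched with a regex, under the assumption that declarations start at the beginning of a line. Real indexers typically use a proper parser (e.g. tree-sitter) instead, since a regex will false-match on `class` inside comments or strings:

```python
import re

# Matches top-level type declarations at the start of a line, e.g.
# "public class Foo", "abstract class Bar", "record Baz", "interface Qux".
CLASS_RE = re.compile(
    r"^(?:public\s+|final\s+|abstract\s+)*(?:class|interface|enum|record)\s+\w+",
    re.MULTILINE,
)

def split_java_by_class(source):
    """One chunk per top-level type; package/import lines join the first chunk."""
    starts = [m.start() for m in CLASS_RE.finditer(source)]
    if not starts:
        return [source]
    starts[0] = 0  # fold the preamble (package, imports) into chunk 1
    bounds = starts + [len(source)]
    return [source[a:b] for a, b in zip(bounds, bounds[1:])]
```

Keeping the package and import lines with the first class preserves context that embedding models can use to disambiguate short class bodies.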


When would you use pgvector instead of Pinecone?

When your application already uses PostgreSQL and the corpus is under ~10M vectors. pgvector avoids a separate service dependency and keeps retrieval inside your existing ACID transaction boundary. For larger scale or serverless deployments, a dedicated vector DB is better.
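As a sketch of that setup, the SQL below uses pgvector's documented `vector` type, HNSW index, and `<=>` cosine-distance operator; the `code_chunks` table and its columns are hypothetical:

```python
# pgvector setup sketch; `<=>` is cosine distance (1 - cosine similarity),
# so ORDER BY ... <=> ascending returns the most similar rows first.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE code_chunks (
    id        bigserial PRIMARY KEY,
    path      text NOT NULL,
    content   text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX ON code_chunks USING hnsw (embedding vector_cosine_ops);
"""

# Parameterised top-k query, e.g. with psycopg:
#   cur.execute(QUERY_SQL, (query_vec, query_vec, 5))
QUERY_SQL = """
SELECT path, content, 1 - (embedding <=> %s) AS similarity
FROM code_chunks
ORDER BY embedding <=> %s
LIMIT %s;
"""
```

Because the chunks live in a regular table, inserts and deletes participate in the same transactions as the rest of your application data.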

What is the difference between embedding similarity and keyword search?

Keyword search (BM25, Elasticsearch) matches exact or stemmed terms — fast but brittle to paraphrasing. Embedding similarity matches by meaning — "fix null pointer" matches "handle missing reference" regardless of word overlap. Production RAG systems often use hybrid search (keyword + vector) for the best recall.
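One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A minimal sketch (the file names are toy data):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g. one from BM25, one from vector
    search) using RRF: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["LoginFilter.java", "AuthService.java", "UserRepo.java"]
vector_hits  = ["AuthService.java", "SessionStore.java", "LoginFilter.java"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that rank well in both lists float to the top, while a document that appears in only one list can still surface; the constant `k` (60 is the value from the original RRF paper) damps the advantage of rank-1 hits.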