01.02 · Embeddings & Vector Search — Deep Dive
Level: Intermediate
Pre-reading: 01 · AI & LLM Foundations · 01.01 · How LLMs Work
What Are Embeddings?
An embedding is a dense numerical vector that represents text (or code) in a high-dimensional space where semantically similar inputs map to vectors pointing in similar directions.
```mermaid
graph LR
A["'fix NullPointerException'"] --> B[Embedding Model]
B --> C["[0.12, -0.83, 0.41, ...] 1536-dim vector"]
D["'handle null pointer error'"] --> B
B --> E["[0.13, -0.81, 0.44, ...] very similar!"]
```
Two sentences that are semantically related produce vectors with high cosine similarity, even if they share no exact words.
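This can be made concrete with the truncated vectors from the diagram above. Only the first three of the 1536 dimensions are used here, so the numbers are illustrative, but the geometry is the same:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# First three dimensions of the two example embeddings above
fix_npe = [0.12, -0.83, 0.41]
handle_null = [0.13, -0.81, 0.44]

print(round(cosine_similarity(fix_npe, handle_null), 3))  # → 0.999
```

The two vectors point in nearly the same direction, so their cosine similarity is close to 1 even though the source sentences share no exact words.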
Why Embeddings Matter for Dev Automation
| Use Case | How Embeddings Help |
|---|---|
| Find relevant code | Embed entire codebase, search by natural language query |
| JIRA ticket → microservice | Embed all service READMEs, find closest match to ticket description |
| Similar bug lookup | Find past bugs that are semantically similar to a new one |
| Semantic test deduplication | Identify test cases testing the same behaviour |
| Documentation retrieval | Pull the right Confluence page into the agent's context |
Embedding Models
| Model | Dims | Strengths | Provider |
|---|---|---|---|
| text-embedding-3-large | 3072 | Best quality, code + text | OpenAI |
| text-embedding-3-small | 1536 | Cost-efficient, still strong | OpenAI |
| embed-english-v3.0 | 1024 | Best-in-class for RAG retrieval | Cohere |
| all-MiniLM-L6-v2 | 384 | Fast, small, self-hosted | Sentence Transformers |
| nomic-embed-code | 768 | Optimised for source code | Nomic AI |
Code Embeddings
For indexing a Java/Spring Boot codebase, use a code-specific embedding model like nomic-embed-code or OpenAI's text-embedding-3-large. They produce better similarity scores for function signatures and class names than general-purpose models.
Vector Databases
Vector databases store embeddings and support approximate nearest neighbour (ANN) search at scale.
| Database | Hosting | Strengths | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Simple API, serverless tier | SaaS products, fast start |
| Weaviate | Self-hosted or cloud | GraphQL API, multi-tenancy | Enterprise, complex schemas |
| Qdrant | Self-hosted or cloud | Rust-based, very fast | High-perf, self-hosted |
| ChromaDB | Self-hosted | Embeddable, zero config | Prototypes, local dev |
| pgvector | Postgres extension | Stays in existing DB | When you already use PostgreSQL |
| Redis Vector | Redis Stack | Low-latency, in-memory | Session-level search |
How Similarity Search Works
```mermaid
graph TD
A["Query: 'fix login bug'"] --> B["Embed query → vector Q"]
C[Code Corpus] --> D["Embed each file/chunk → vectors V1..Vn stored in DB"]
B --> E["ANN Search: top-k vectors closest to Q"]
D --> E
E --> F[Return top 5 relevant code chunks]
F --> G[Inject into LLM context as RAG]
```
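The flow above can be sketched as a brute-force top-k search. Real vector databases use ANN indexes (e.g. HNSW) instead of scanning every vector, and the chunk names and 3-dimensional embeddings here are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, corpus, k=5):
    """Rank stored chunk vectors by cosine similarity to the query vector."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Hypothetical pre-computed chunk embeddings (3-dim for readability)
corpus = {
    "LoginController.authenticate()": [0.9, 0.1, 0.2],
    "PaymentService.charge()":        [0.1, 0.8, 0.3],
    "SessionFilter.validateToken()":  [0.7, 0.3, 0.4],
}

query = [0.85, 0.15, 0.25]  # embedding of "fix login bug"
for score, chunk in top_k(query, corpus, k=2):
    print(f"{score:.3f}  {chunk}")  # LoginController ranks first
```

The two login-related chunks come back first; the payment chunk, pointing in a different direction, scores far lower.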
Cosine similarity is the standard distance metric:

cos(Q, V) = (Q · V) / (‖Q‖ ‖V‖)

The score ranges from -1 (opposite) to 1 (identical). In practice, relevant results typically score above 0.7, though the exact threshold varies by embedding model.
Chunking Strategy
How you split documents before embedding significantly affects retrieval quality.
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens | Simple, predictable |
| Sentence | Split on sentence boundaries | Prose documentation |
| Recursive character | Split on \n\n, \n, . in order of preference | Mixed content |
| Semantic | Group sentences by meaning similarity | High quality, expensive |
| Code-aware | Split on class/method boundaries | Source code indexing |
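A minimal sketch of the first strategy, fixed-size chunking with overlap. Tokens are approximated by whitespace-split words; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Split text into windows of `chunk_size` words, overlapping by `overlap`
    words so that context straddling a boundary appears in both chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = ("word " * 500).strip()  # stand-in for a 500-word documentation page
chunks = fixed_size_chunks(doc, chunk_size=200, overlap=50)
print(len(chunks))  # → 3
```

The overlap is what mitigates the "context split across chunks" failure mode described below: a sentence that straddles a boundary is embedded in both neighbouring chunks.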
Chunk Size Trade-off
Too small → relevant context split across chunks, loses coherence.
Too large → a chunk covers many topics, retrieval becomes noisy.
For Java code: split at class boundaries (~500–1500 tokens per class).
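A code-aware splitter along these lines can be sketched with a regex on top-level `class` declarations. This is an illustrative simplification: a robust version would use a real parser (e.g. tree-sitter) and also handle interfaces, enums, records, and annotations:

```python
import re

# Zero-width match at the start of a top-level class declaration (simplified).
# Indented (nested) classes don't match because ^ requires column 0.
CLASS_BOUNDARY = re.compile(r"(?m)^(?=(?:public\s+)?(?:final\s+|abstract\s+)?class\s+\w+)")

def split_java_classes(source):
    """Chunk a Java file at class boundaries; each chunk starts at a declaration.
    Relies on re.split supporting zero-width patterns (Python 3.7+)."""
    return [c for c in CLASS_BOUNDARY.split(source) if c.strip()]

java_file = """\
public class UserService {
    void save() {}
}
class UserRepository {
    void find() {}
}
"""
for chunk in split_java_classes(java_file):
    print(chunk.splitlines()[0])
```

Each resulting chunk is one complete class, which keeps method bodies and their signatures together in a single embedding.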
When would you use pgvector instead of Pinecone?
When your application already uses PostgreSQL and the corpus is under ~10M vectors. pgvector avoids a separate service dependency and keeps retrieval inside your existing ACID transaction boundary. For larger scale or serverless deployments, a dedicated vector DB is better.
What is the difference between embedding similarity and keyword search?
Keyword search (BM25, Elasticsearch) matches exact or stemmed terms — fast but brittle to paraphrasing. Embedding similarity matches by meaning — "fix null pointer" matches "handle missing reference" regardless of word overlap. Production RAG systems often use hybrid search (keyword + vector) for the best recall.
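A toy illustration of the hybrid idea: blend a keyword-overlap score (a crude stand-in for BM25) with a cosine score via a weighted sum. Production systems more often use reciprocal rank fusion, and the embeddings and `alpha` weight here are made up:

```python
import math

def keyword_score(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    """Weighted blend: alpha * keyword match + (1 - alpha) * semantic match."""
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)

query = "fix null pointer"
doc = "handle missing reference in auth filter"
# Hypothetical embeddings: semantically close despite zero word overlap
score = hybrid_score(query, doc, [0.9, 0.1], [0.85, 0.2], alpha=0.5)
print(round(score, 3))
```

Here the keyword component contributes nothing (no shared terms), yet the document still scores well on the vector component alone, which is exactly the paraphrase case keyword search misses.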