Embedding Models: From Word2Vec to Sentence Transformers
Not all embedding models are created equal. This section traces the evolution and explains the trade-offs.
Word2Vec (2013): The Beginning
Word2Vec was a breakthrough: train a simple neural network to predict nearby words, and use the hidden layer weights as embeddings.
The Skip-gram Model
Given a word, predict surrounding words:
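In the skip-gram formulation from the original Word2Vec paper, training maximizes the average log-probability of the context words within a window of size \(c\) around each position \(t\):

\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)
\]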
By training on billions of words, the model learns, for example, that:

- "King" and "Queen" should have similar hidden states (they appear in similar contexts)
- "Good" and "Bad" should be far apart
- "Paris" - "France" + "Germany" ≈ "Berlin"
Limitations
- Single embedding per word (no context sensitivity)
- "Bank" (river) and "Bank" (financial) get the same embedding
- Trained only on small context windows
Code Example
# Using gensim (pre-trained Word2Vec)
from gensim.models import KeyedVectors
# Load pre-trained model
model = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)
# Get embedding for a word
embedding = model['king']
print(embedding.shape) # (300,)
# Find similar words
similar = model.most_similar('king')
# [('queen', 0.9), ('prince', 0.87), ...]
# Perform analogy
result = model.most_similar(positive=['king', 'woman'], negative=['man'])
# Returns something like 'queen'
GloVe (2014): Combining Frequency & Semantics
GloVe (Global Vectors) improves on Word2Vec by explicitly incorporating global word frequency statistics.
The key insight: global co-occurrence counts, not just local context windows, carry meaning.

Concretely, GloVe minimizes a weighted least-squares objective over the co-occurrence matrix:

\[
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
\]

where \(X_{ij}\) is the number of times word \(i\) appears near word \(j\), \(w_i\) and \(\tilde{w}_j\) are word and context vectors, and \(b_i\), \(\tilde{b}_j\) are biases.
- Words that co-occur frequently should have similar embeddings
- Common words (like "the") are downweighted by function \(f\)
Properties
- Better than Word2Vec at capturing global patterns
- Still single embedding per word
- Performed very well on benchmarks (2014-2018 era)
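A quick way to try GloVe is gensim's downloader (a sketch; it assumes the "glove-wiki-gigaword-300" package name and involves a one-time download of a few hundred MB):

import gensim.downloader as api

# Downloads and caches pre-trained GloVe vectors on first call
glove = api.load("glove-wiki-gigaword-300")

print(glove["king"].shape)             # (300,)
print(glove.most_similar("king")[:3])  # nearest neighbours by cosine similarity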
BERT and Contextual Embeddings (2018)
BERT introduced context-dependent embeddings: the same word gets different vectors depending on surrounding context.
Why This Matters
"The bank approved my loan" → "bank" embedding A
"I sat on the bank of the river" → "bank" embedding B
BERT produces different embeddings for the same word based on context!
How BERT Works (Simplified)
- Break text into tokens (subword pieces)
- Use a transformer neural network with attention layers
- Output: one embedding per token that incorporates context
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
text = "The bank is on the river."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# outputs.last_hidden_state shape: (1, sequence_length, 768)
# Each word has a 768-dimensional context-aware embedding
word_embeddings = outputs.last_hidden_state[0]
Challenge: Which Token to Use?
For a full document, we need one vector, not one per token.
Options:
1. Use the special [CLS] token (designed for classification)
2. Average all token embeddings
3. Use max pooling
This gets messy for document-level tasks.
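For illustration, here is a minimal sketch of option 2 (mean pooling) with the same bert-base-uncased model as above. The attention mask keeps padding tokens out of the average; with a single sentence there is no padding, but the pattern generalizes to batches.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank is on the river.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Zero out padding positions, sum, and divide by the number of real tokens
mask = inputs["attention_mask"].unsqueeze(-1)            # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (1, 768)
sentence_embedding = summed / mask.sum(dim=1)            # (1, 768)
print(sentence_embedding.shape)                          # torch.Size([1, 768])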
Sentence Transformers (2019): Purpose-Built for RAG
Sentence-BERT (SBERT) adds a pooling layer specifically for creating sentence/document embeddings:
Input: "This is a sentence."
↓
[Tokenize]
↓
[BERT encoder: transformer layers with attention]
↓
[Token embeddings from BERT: 8 vectors of 768-dim each]
↓
[Mean pooling: average the 8 vectors]
↓
Output: Single 768-dimensional sentence embedding
This is perfect for RAG because:

- One embedding per document/paragraph
- Trained on sentence/semantic similarity tasks
- Fast to compute
Popular Models
| Model | Dimensions | Performance | Speed |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 96.3 (benchmark) | ⚡⚡⚡ Fast |
| all-mpnet-base-v2 | 768 | 97.8 | ⚡⚡ Medium |
| paraphrase-MiniLM-L6-v2 | 384 | Good | ⚡⚡⚡ Fast |
| multi-qa-MiniLM-L6-cos-v1 | 384 | Great for Q&A | ⚡⚡⚡ Fast |
Code Example
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
"The quick brown fox jumps over the lazy dog",
"A fast auburn fox leaps above a sleepy hound",
"Python is a programming language"
]
embeddings = model.encode(documents)
print(embeddings.shape) # (3, 384)
# Compute similarity between docs 1 and 2
similarity = embeddings[0] @ embeddings[1] / (
np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(f"Similarity: {similarity:.3f}") # 0.89
Modern Approaches: BGE, E5, and Beyond
Recent models push performance further:
BGE (BAAI General Embedding)
- Developed by BAAI (Beijing Academy of Artificial Intelligence)
- Excellent multilingual support
- Larger variants (768-1024 dims) for better quality
- Good balance of speed and quality
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-base-en-v1.5') # 768-dim
embedding = model.encode("Query text")
E5 (EmbEddings from bidirEctional Encoder rEpresentations)
- Multilingual by design
- Trained on diverse tasks
- Very strong performance on benchmarks
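A minimal usage sketch (assuming the intfloat/multilingual-e5-base checkpoint on Hugging Face; E5 models expect "query: " and "passage: " prefixes on their inputs):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-base')

# E5 is trained with role prefixes, so add them at encode time
query_emb = model.encode("query: how do neural embeddings work?", normalize_embeddings=True)
doc_emb = model.encode("passage: Embeddings map text to dense vectors.", normalize_embeddings=True)

print(float(query_emb @ doc_emb))  # cosine similarity, since both vectors are normalized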
OpenAI's text-embedding Models
- text-embedding-3-small (1536 dims by default, can be shortened via the dimensions parameter; very strong)
- text-embedding-3-large (3072 dims, state of the art)
- Cost: $0.02 per 1M tokens (for -small; -large costs more)
- Benefit: Highest quality, OpenAI's latest research
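A short sketch using the official openai Python package (v1-style client; it reads OPENAI_API_KEY from the environment):

from openai import OpenAI

client = OpenAI()  # API key is taken from the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Query text",
)
embedding = response.data[0].embedding
print(len(embedding))  # 1536 by default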
Comparison Table
| Model | Year | Dims | Quality | Speed | Cost | Notes |
|---|---|---|---|---|---|---|
| Word2Vec | 2013 | 300 | Basic | Fast | Free | No context |
| GloVe | 2014 | 300 | Good | Fast | Free | No context |
| BERT | 2018 | 768 | Good | Slow | Free | Context-aware, needs pooling |
| SBERT | 2019 | 384-768 | Excellent | Fast | Free | Best for RAG |
| BGE | 2023 | 768 | Excellent | Fast | Free | Strong multilingual |
| OpenAI | 2024 | 512-3072 | State-of-art | Medium | $$ | Best quality |
How to Choose a Model for Your RAG
Decision Tree:
Do you have budget and care about best quality?
├─ YES → Use OpenAI text-embedding-3-small or -large
└─ NO →
Is speed critical (need <10ms)?
├─ YES → Use all-MiniLM-L6-v2 (384-dim)
└─ NO →
Do you need multilingual support?
├─ YES → Use BGE or E5
└─ NO → Use all-mpnet-base-v2 (768-dim)
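One way to make the decision tree executable (a sketch; the choices mirror the tree above, with intfloat/multilingual-e5-base standing in for the multilingual branch):

def choose_embedding_model(has_budget: bool, speed_critical: bool, multilingual: bool) -> str:
    """Return an embedding model identifier following the decision tree above."""
    if has_budget:
        return "text-embedding-3-small"         # or text-embedding-3-large for maximum quality
    if speed_critical:
        return "all-MiniLM-L6-v2"               # 384-dim, fastest
    if multilingual:
        return "intfloat/multilingual-e5-base"  # or a BGE checkpoint
    return "all-mpnet-base-v2"                  # 768-dim, best local quality

print(choose_embedding_model(has_budget=False, speed_critical=True, multilingual=False))
# all-MiniLM-L6-v2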
Code: Comparing Embedding Models
from sentence_transformers import SentenceTransformer
import numpy as np
import time
models = {
'MiniLM': 'all-MiniLM-L6-v2',
'MPN': 'all-mpnet-base-v2',
'BGE': 'BAAI/bge-base-en-v1.5'
}
text = "The quick brown fox jumps over the lazy dog"
for name, model_id in models.items():
    model = SentenceTransformer(model_id)
    start = time.time()
    emb = model.encode(text)
    elapsed = time.time() - start
    print(f"{name:10} | Dims: {len(emb):4} | Time: {elapsed*1000:.1f}ms")
Output:
Important Limitation (for RAG)
All neural embedding models capture semantic similarity, not exact match:
- Order #1766 and Order #1767 are similar in embedding space
- Customer names "John" and "Joan" are similar
- Product SKU "SKU-001A" and "SKU-001B" are similar
This is the core problem RAG systems must solve with hybrid search!
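You can see this directly by comparing two order IDs (a sketch reusing all-MiniLM-L6-v2; the exact score varies by model but is typically high):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
a = model.encode("Order #1766", convert_to_tensor=True)
b = model.encode("Order #1767", convert_to_tensor=True)

# Cosine similarity is high even though these are different records
print(float(util.cos_sim(a, b)))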
Summary
| Model Family | Key Innovation | Trade-off |
|---|---|---|
| Word2Vec | Simple, effective | No context |
| GloVe | Frequency + context | No context |
| BERT | Context-aware | Complex pooling needed |
| SBERT | Sentence-level pooling | Still semantic only |
| BGE/E5 | Modern, multilingual | Need to download |
| OpenAI | Best quality | Cost per API call |
Next Steps
Now that you understand embeddings, let's learn how to search with them: