Sparse Retrieval: BM25 and Keyword Search
Sparse retrieval uses term frequencies to find documents that contain the query's exact keywords, complementing dense retrieval's semantic matching.
This is the approach at the core of classic full-text search engines like Lucene and Elasticsearch.
Why "Sparse"?
A sparse vector has mostly zeros and a few non-zero values.
Example: a corpus with 100 unique words, so each vector has 100 dimensions (one per word)
- Dense embedding: [-0.23, 0.45, 0.12, ..., -0.08] (every dimension non-zero)
- Sparse vector: [0, 0, 5, 0, 0, ..., 3, 0, ...] (a non-zero entry is the count of that word in the document: the 5 and 3 mean those words appear 5 and 3 times)

This representation falls straight out of term counts, as the sketch below shows.
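A minimal sketch of building such a vector from term counts (the whitespace tokenizer and corpus-derived vocabulary are illustrative choices, not a standard):

from collections import Counter

documents = [
    "Order #1766 has been confirmed",
    "Order #1767 is pending",
]

# One dimension per unique token in the corpus
vocab = sorted({tok for doc in documents for tok in doc.lower().split()})
tok_to_dim = {tok: i for i, tok in enumerate(vocab)}

def sparse_count_vector(text: str) -> list[int]:
    # Place each token's count at its dimension; all other
    # dimensions stay zero, so most entries are 0
    vec = [0] * len(vocab)
    for tok, n in Counter(text.lower().split()).items():
        if tok in tok_to_dim:
            vec[tok_to_dim[tok]] = n
    return vec

print(sparse_count_vector(documents[0]))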
BM25: The Standard Algorithm
BM25 (Best Matching 25) is the industry-standard sparse retrieval algorithm. It scores documents based on:
- Term Frequency (TF): How often the term appears
- Inverse Document Frequency (IDF): How rare the term is
- Document Length Normalization: Longer docs don't automatically win
BM25 Formula
For a term \(t\) in document \(d\):

\[
\text{score}(t, d) = \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{L_{avg}}\right)}
\]

where:
- \(f(t, d)\) = frequency of \(t\) in \(d\)
- \(k_1\) ≈ 1.5 (controls TF saturation)
- \(b\) ≈ 0.75 (controls length normalization)
- \(|d|\) = document length
- \(L_{avg}\) = average document length
Don't memorize this! The key insight: rare terms are worth more, repeated terms saturate instead of growing linearly, and unusually long documents are penalized.
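To make the formula concrete, here is a direct transcription into Python (a sketch for intuition, not a tuned implementation; the IDF variant with the +1 inside the log is the Lucene-style one that avoids negative values):

import math

def bm25_score(term: str, doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    # IDF: terms appearing in fewer documents are worth more
    n_docs = len(corpus)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # TF with saturation (k1) and document-length normalization (b)
    f = doc.count(term)
    avg_len = sum(len(d) for d in corpus) / n_docs
    return idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avg_len))

corpus = [doc.lower().split() for doc in
          ["order #1766 confirmed", "order #1767 pending"]]
print(bm25_score("#1766", corpus[0], corpus))  # only doc 0 contains the term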
Why BM25 Solves the Order #1766 Problem
Remember the problem:
Dense: "Order #1766" and "Order #1767" are similar
Sparse: "Order #1766" and "Order #1767" are completely different
Why? Because the exact match "1766" is unique:
- Appears in Order #1766 docs: 1-2 times
- Appears in ALL docs: 1 time total
- IDF("1766") = log(total_docs / 1) = huge!
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Order #1766 has been confirmed",
    "Order #1767 is pending",
    "Order #1765 is shipped",
    "Cat food is on sale",
]

vectorizer = TfidfVectorizer()  # default tokenizer lowercases and drops "#"
tfidf_matrix = vectorizer.fit_transform(documents)

# Get per-term scores for the first document
feature_names = vectorizer.get_feature_names_out()
scores = tfidf_matrix[0].toarray().flatten()

print("Document: 'Order #1766 has been confirmed'")
for name, score in zip(feature_names, scores):
    if score > 0:
        print(f"  {name}: {score:.3f}")

# "1766" scores far higher than "order", which appears in three documents
BM25 in Python
Using the rank-bm25 library (pure Python):
from rank_bm25 import BM25Okapi
import numpy as np

documents = [
    "Order #1766 has been confirmed",
    "Order #1767 is pending",
    "Order #1765 is shipped",
]

# Tokenize documents (naive whitespace split; see the note after this block)
corpus = [doc.split() for doc in documents]

# Build the BM25 index
bm25 = BM25Okapi(corpus)

# Score every document against the query
query = "Order #1766".split()
scores = bm25.get_scores(query)
print("Scores:", scores)
# Document 0 scores far higher than the others: it is the only one
# containing the token "#1766"

# Rank documents by score, best first
top_indices = np.argsort(-scores)[:3]
for i in top_indices:
    print(f"{documents[i]}: {scores[i]:.2f}")
BM25 correctly puts the exact match first.
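One caveat: the whitespace tokenizer above is case-sensitive and keeps punctuation, so "Order" and "order" would be different terms. A simple normalizing tokenizer helps (my own sketch; rank-bm25 leaves tokenization entirely to you):

import re

def tokenize(text: str) -> list[str]:
    # Lowercase, then split on anything that is not a letter or digit,
    # so "Order #1766" -> ["order", "1766"]
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

corpus = [tokenize(doc) for doc in documents]
bm25 = BM25Okapi(corpus)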
Elasticsearch: Production BM25
For production, use Elasticsearch, which scores match queries with BM25 by default:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index documents
for i, doc in enumerate(documents):
    es.index(index="orders", id=i, document={"text": doc})

# Make the newly indexed documents visible to search
es.indices.refresh(index="orders")

# Search with BM25
query = "Order #1766"
results = es.search(
    index="orders",
    query={"match": {"text": query}},
    size=10,
)

for hit in results["hits"]["hits"]:
    print(f"{hit['_source']['text']}: {hit['_score']:.2f}")
Pros of BM25
✅ Perfect for exact matches: "1766" matches only Order #1766
✅ Fast: Inverted indices are highly optimized (see the sketch after this list)
✅ Interpretable: Scores directly relate to term frequency
✅ Handles stop words: "the", "a" are downweighted automatically
✅ Scalable: The default ranking in Lucene, Elasticsearch, OpenSearch, and most full-text engines
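The speed comes from the inverted index: a map from each term to the IDs of the documents containing it, so a query only scores documents that share at least one term with it. A minimal sketch (illustrative; real engines store compressed posting lists on disk):

from collections import defaultdict

documents = [
    "Order #1766 has been confirmed",
    "Order #1767 is pending",
    "Cat food is on sale",
]

# term -> set of IDs of documents containing that term
inverted_index: dict[str, set[int]] = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for term in doc.lower().split():
        inverted_index[term].add(doc_id)

# Candidate documents are the union of the query terms' posting lists;
# everything else is skipped without being scored at all
query_terms = "order #1766".lower().split()
candidates = set().union(*(inverted_index.get(t, set()) for t in query_terms))
print(candidates)  # {0, 1}: docs 0 and 1 contain "order", doc 0 also "#1766"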
Cons of BM25
❌ Bad for synonyms: "cat" doesn't match "feline" (demonstrated below)
❌ No semantic understanding: paraphrases like "my package hasn't arrived" and "delivery is late" share no terms
❌ Requires exact words: Typos break matching
❌ Ambiguity: "bank" matches both financial and river banks equally
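The synonym failure is easy to reproduce with rank-bm25 (tiny made-up corpus for illustration):

from rank_bm25 import BM25Okapi

docs = ["cute cat pictures", "dog training tips"]
bm25 = BM25Okapi([d.split() for d in docs])

# No lexical overlap at all, so every score is 0.0,
# even though "feline photos" means roughly "cat pictures"
print(bm25.get_scores("feline photos".split()))  # [0. 0.]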
Comparison: Dense vs Sparse
| Query | Dense | Sparse (BM25) |
|---|---|---|
| "What's my order status?" | ✅ Great | ❌ Poor (no exact match to "status") |
| "Order #1766" | ❌ Bad (similar to #1767) | ✅ Perfect |
| "Tell me about my order" | ✅ Great | ⚠ OK (finds "order" keyword) |
| "feline animal pictures" | ✅ Great (understands "cat") | ❌ Bad (no "cat" keyword) |
Sparse Retrieval in Vector Databases
Modern vector databases support sparse vectors:
from qdrant_client import QdrantClient
from qdrant_client.models import (
    NamedSparseVector, PointStruct, SparseVector, SparseVectorParams,
)

client = QdrantClient(":memory:")

# Create a collection with a named sparse vector field
client.create_collection(
    collection_name="sparse_orders",
    vectors_config={},  # no dense vectors needed for this example
    sparse_vectors_config={"text": SparseVectorParams()},
)

# Add documents with sparse vectors (from BM25 or TF-IDF);
# create_sparse_vector() is a placeholder, defined below
for i, doc in enumerate(documents):
    sparse_vector = create_sparse_vector(doc)  # returns a SparseVector
    client.upsert(
        collection_name="sparse_orders",
        points=[PointStruct(
            id=i,
            vector={"text": sparse_vector},
            payload={"text": doc},
        )],
    )

# Search with a sparse query vector
query = "Order #1766"
results = client.search(
    collection_name="sparse_orders",
    query_vector=NamedSparseVector(name="text", vector=create_sparse_vector(query)),
    limit=10,
)
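One simple way to fill in the create_sparse_vector() placeholder is to reuse a fitted TfidfVectorizer and keep only the non-zero entries (a sketch; production systems often use BM25-weighted or learned sparse vectors such as SPLADE instead):

from qdrant_client.models import SparseVector
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(documents)  # fit once, on the whole corpus

def create_sparse_vector(text: str) -> SparseVector:
    # Transform to a 1 x vocab_size sparse row, then keep only the
    # non-zero (index, weight) pairs that Qdrant's SparseVector expects
    row = vectorizer.transform([text]).tocoo()
    return SparseVector(indices=row.col.tolist(), values=row.data.tolist())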
Summary
| Aspect | BM25 Sparse |
|---|---|
| What | Term frequency matching |
| Scoring | TF × IDF × length normalization |
| Speed | ⚡ Fast (inverted index) |
| Strength | Perfect for exact matches & keywords |
| Weakness | No semantic understanding |
| Best for | Structured queries, IDs, exact keywords |
The Key Insight for RAG
BM25 solves the keyword half of the Order #1766 problem.
But you still need to handle semantic queries like "What's my order status?"
That's where hybrid search comes in.
Next Steps
→ Hybrid Search — Combine dense + sparse to get the best of both!
→ The Exact Match Problem — Full solution strategy