Metadata Filtering: Enforcing Hard Constraints
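Metadata filtering adds another layer: filter by structured properties before, after, or during similarity search.
This is especially important for the Order #1766 problem, where a query about one order can semantically match documents for a near-identical ID such as #1767.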
The Concept
Store structured information alongside embeddings:
Document: "Your Order #1766 shipped via FedEx"
├─ Text (embedded): "Your Order #1766 shipped via FedEx"
└─ Metadata (structured):
   ├─ order_id: "1766"
   ├─ status: "shipped"
   ├─ carrier: "fedex"
   └─ created_date: "2024-01-15"
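As a minimal sketch of this layout, the record below keeps the embedded text and the structured fields side by side. The `Document` class and the three-element vector are illustrative assumptions, not any particular database's schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """One retrievable unit: embedded text plus structured metadata."""
    text: str                    # the part that is embedded and searched semantically
    embedding: List[float]       # vector from an embedding model (toy values here)
    metadata: Dict[str, Any] = field(default_factory=dict)  # exact, filterable fields


doc = Document(
    text="Your Order #1766 shipped via FedEx",
    embedding=[0.12, 0.48, 0.31],  # placeholder vector, not a real embedding
    metadata={
        "order_id": "1766",
        "status": "shipped",
        "carrier": "fedex",
        "created_date": "2024-01-15",
    },
)

# Metadata is compared exactly, independent of how similar the embeddings are
is_match = doc.metadata["order_id"] == "1766"
```

The point of the split is that `order_id` is never subject to fuzzy matching: it either equals the requested value or it does not.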
Strategies
Strategy 1: Pre-Filtering (Filter → Search)
Filter documents by metadata before similarity search:
# Only consider Order #1766 documents
filtered_docs = [doc for doc in all_docs if doc.metadata['order_id'] == '1766']
# Then search within this subset
results = semantic_search(query, filtered_docs)
Pros: Shrinks the search space and guarantees every result satisfies the filter
Cons: Filter criteria must be known (or extracted from the query) before searching
Strategy 2: Post-Filtering (Search → Filter)
Perform similarity search first, then filter results:
# Search across all documents
all_results = semantic_search(query, all_docs, top_k=100)
# Then filter results
filtered_results = [r for r in all_results if r.metadata['order_id'] == '1766']
# Keep the top 10 of the filtered results
top_results = filtered_results[:10]
Pros: Flexible, can apply multiple filters
Cons: Might not return enough results if many are filtered out
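The shortfall described in the cons can be reproduced with a toy corpus. The sketch below is one possible mitigation, not a standard API: it widens the candidate window until enough results survive the filter. The `cosine` and `post_filter_search` helpers and the two-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def post_filter_search(query_vec, docs, predicate, k=2, fetch=2):
    """Search first, filter second; double the fetch window until k matches survive."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    while True:
        hits = [d for d in ranked[:fetch] if predicate(d)]
        if len(hits) >= k or fetch >= len(ranked):
            return hits[:k]
        fetch *= 2  # too few survivors: look deeper into the ranking


docs = [
    {"id": "a", "vector": [1.0, 0.0], "order_id": "1767"},
    {"id": "b", "vector": [0.9, 0.1], "order_id": "1767"},
    {"id": "c", "vector": [0.8, 0.2], "order_id": "1766"},
    {"id": "d", "vector": [0.1, 0.9], "order_id": "1766"},
]
query_vec = [1.0, 0.0]


def is_1766(d):
    return d["order_id"] == "1766"


# A fixed window of 2 only ever sees the two #1767 documents
ranked_all = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
naive = [d for d in ranked_all[:2] if is_1766(d)]

# Widening the window recovers the #1766 documents
widened = post_filter_search(query_vec, docs, is_1766, k=2, fetch=2)
```

Here `naive` comes back empty while `widened` contains documents `c` and `d`. The trade-off is extra work per query, which is why over-fetching only helps when the filter is not too selective.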
Strategy 3: Integrated Filtering (Search with Constraints)
Modern vector databases let you specify filters within queries:
# Qdrant example
results = client.search(
    collection_name="orders",
    query_vector=query_embedding,
    limit=10,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="order_id",
                match=models.MatchValue(value="1766")
            )
        ]
    )
)
Pros: Most efficient, enforced at index level
Cons: Requires vector database support
Metadata Filter Types
Exact Match
# Find documents with exact order ID
filter = {"order_id": {"$eq": "1766"}}
# Elasticsearch
{"term": {"order_id": "1766"}}
# Qdrant
models.FieldCondition(key="order_id", match=models.MatchValue(value="1766"))
Range
# Find orders placed in 2024
filter = {"created_date": {"$gte": "2024-01-01", "$lt": "2025-01-01"}}
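# Elasticsearch (upper bound included so only 2024 matches)
{"range": {"created_date": {"gte": "2024-01-01", "lt": "2025-01-01"}}}
# Qdrant (numeric range over a Unix-timestamp field)
models.FieldCondition(key="created_timestamp", range=models.Range(gte=1704067200, lt=1735689600))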
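The Unix timestamps 1704067200 and 1735689600 in the Qdrant range are the UTC midnights that bound 2024. A small sketch of how such bounds could be derived follows; the `to_unix` and `in_2024` helpers are hypothetical names, and midnight UTC is an assumed convention.

```python
from datetime import datetime, timezone


def to_unix(date_str):
    """Convert an ISO date (YYYY-MM-DD) to a Unix timestamp at midnight UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


start_2024 = to_unix("2024-01-01")  # inclusive lower bound
start_2025 = to_unix("2025-01-01")  # exclusive upper bound


def in_2024(ts):
    """Half-open interval check: start_2024 <= ts < start_2025."""
    return start_2024 <= ts < start_2025
```

Using a half-open interval (`gte` with `lt`) avoids double-counting a record that falls exactly on the boundary between two years.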
In List
# Find orders with status in ['shipped', 'delivered']
filter = {"status": {"$in": ["shipped", "delivered"]}}
# Elasticsearch
{"terms": {"status": ["shipped", "delivered"]}}
# Qdrant
models.FieldCondition(key="status", match=models.MatchAny(any=["shipped", "delivered"]))
Boolean Combinations
# (order_id = "1766") AND (status = "shipped")
filter = {
    "$and": [
        {"order_id": {"$eq": "1766"}},
        {"status": {"$eq": "shipped"}}
    ]
}
# (order_id = "1766") OR (order_id = "1767")
filter = {
    "$or": [
        {"order_id": {"$eq": "1766"}},
        {"order_id": {"$eq": "1767"}}
    ]
}
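To make the semantics of these filter dictionaries concrete, here is a rough in-memory evaluator for the operators used in this section (`$eq`, `$in`, `$gte`, `$lt`, `$and`, `$or`). The `matches` function is a hypothetical helper written for illustration; real vector databases evaluate filters inside their own engines and may differ in details.

```python
def matches(metadata, flt):
    """Return True if `metadata` satisfies the filter dict `flt`."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            value = metadata.get(key)
            for op, target in cond.items():
                if op == "$eq":
                    ok = value == target
                elif op == "$in":
                    ok = value in target
                elif op == "$gte":
                    ok = value is not None and value >= target
                elif op == "$lt":
                    ok = value is not None and value < target
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
    return True


order = {"order_id": "1766", "status": "shipped", "created_date": "2024-01-15"}

both = matches(order, {"$and": [
    {"order_id": {"$eq": "1766"}},
    {"status": {"$eq": "shipped"}},
]})
either = matches(order, {"$or": [
    {"order_id": {"$eq": "1767"}},
    {"order_id": {"$eq": "1768"}},
]})
# ISO dates compare correctly as strings because they sort lexicographically
in_range = matches(order, {"created_date": {"$gte": "2024-01-01", "$lt": "2025-01-01"}})
```

For this order, the `$and` filter and the date range both match, while the `$or` filter does not because neither listed ID equals "1766".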
The Key Solution for Order #1766
# User asks: "What about order 1766?"
# Step 1: Extract order ID from query
order_id = extract_order_id(query) # Returns "1766"
# Step 2: Create metadata filter
metadata_filter = {
    "order_id": {"$eq": order_id}
}
# Step 3: Search with filter
results = vector_db.search(
    query_embedding,
    filter=metadata_filter,
    top_k=10
)
# Result: ONLY Order #1766 documents returned, not #1767!
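The `extract_order_id` call in Step 1 is not defined in this section. One simple way it might be implemented is a regular expression, as sketched below. The pattern assumes IDs are 3 to 8 digits preceded by the word "order" and an optional "#"; adjust it to the real ID format.

```python
import re

# Assumed format: the word "order", an optional "#", then 3-8 digits
ORDER_ID_PATTERN = re.compile(r"\border\s*#?\s*(\d{3,8})\b", re.IGNORECASE)


def extract_order_id(query):
    """Return the order ID found in the query as a string, or None if absent."""
    match = ORDER_ID_PATTERN.search(query)
    return match.group(1) if match else None


def build_filter(query):
    """Build a metadata filter only when the query names a specific order."""
    order_id = extract_order_id(query)
    return {"order_id": {"$eq": order_id}} if order_id else None


with_id = build_filter("What about order 1766?")
without_id = build_filter("What about my order?")
```

When no ID is present the function returns `None`, so the caller can fall back to an unfiltered search instead of filtering on a guessed value.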
Example: Chroma with Metadata Filtering
import chromadb
client = chromadb.Client()
collection = client.create_collection(name="orders")
# Add documents with metadata
collection.add(
    ids=["1", "2", "3"],
    documents=[
        "Order #1766 has been confirmed",
        "Order #1767 is pending",
        "Order #1766 tracking info available"
    ],
    metadatas=[
        {"order_id": "1766", "status": "confirmed"},
        {"order_id": "1767", "status": "pending"},
        {"order_id": "1766", "status": "confirmed"}
    ]
)
# Query with metadata filter
results = collection.query(
    query_texts=["What about my order?"],
    n_results=5,
    where={"order_id": {"$eq": "1766"}}  # Filter!
)
# Returns ONLY Order #1766 documents
Example: Qdrant with Metadata Filtering
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams
client = QdrantClient(":memory:")
# Create collection and add points with metadata
client.create_collection(
    collection_name="orders",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)
client.upsert(
    collection_name="orders",
    points=[
        PointStruct(
            id=1,
            vector=[0.1]*384,
            payload={"order_id": "1766", "status": "confirmed"}
        ),
        PointStruct(
            id=2,
            vector=[0.11]*384,
            payload={"order_id": "1767", "status": "pending"}
        ),
    ]
)
# Search with metadata filter
results = client.search(
    collection_name="orders",
    query_vector=[0.12]*384,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="order_id",
                match=MatchValue(value="1766")
            )
        ]
    ),
    limit=10
)
Combining Filtering Strategies
Best practice: Layer constraint extraction, metadata filtering, and hybrid search, then combine the results:
class SmartRetriever:
    def retrieve(self, query, user_context):
        # Layer 1: Extract constraints from query
        order_id = extract_order_id(query)
        # Layer 2: Metadata filter (hard constraint)
        metadata_filter = None
        if order_id:
            metadata_filter = {"order_id": {"$eq": order_id}}
        # Layer 3: Hybrid search (dense + sparse)
        dense_results = self.dense_search(query, filter=metadata_filter)
        sparse_results = self.bm25_search(query, filter=metadata_filter)
        # Layer 4: Combine
        hybrid_results = self.combine(dense_results, sparse_results)
        return hybrid_results
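The `combine` step is left unspecified above. One common choice is reciprocal rank fusion (RRF), sketched here as a standalone function. This is a swapped-in technique for illustration, not necessarily what the retriever uses; it assumes both searches return ranked lists of document IDs, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def combine(dense_results, sparse_results, k=60, top_n=10):
    """Fuse two ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in (dense_results, sparse_results):
        for rank, doc_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


dense = ["doc3", "doc1", "doc2"]   # hypothetical dense-search ranking
sparse = ["doc1", "doc4", "doc3"]  # hypothetical BM25 ranking
fused = combine(dense, sparse)
```

RRF only needs ranks, so it sidesteps the problem that dense similarity scores and BM25 scores live on different scales. Documents that appear in both lists, such as `doc1` and `doc3` here, rise to the top.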
Performance Considerations
Pre-filtering
1,000,000 documents
└─ Filter: order_id = "1766"
   ├─ Remaining: ~10 docs
   └─ Search in 10: very fast
Good when filters significantly reduce search space.
Post-filtering
1,000,000 documents
├─ Search in all: ~100ms
├─ Filter results: instant
└─ Might lose results if top-k all filtered out
Good when filters are loose.
Integrated filtering
1,000,000 documents
├─ Index structure optimized for filters
└─ Search within filtered index: fastest
Best when available.
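To give a feel for what an index structure optimized for filters might look like, the toy inverted index below maps each metadata value to the set of document IDs that carry it. `KeywordIndex` is a hypothetical class for illustration only; production vector databases use their own, more elaborate payload indexes.

```python
from collections import defaultdict


class KeywordIndex:
    """Toy inverted index: metadata value -> set of document IDs."""

    def __init__(self, field_name):
        self.field_name = field_name
        self.postings = defaultdict(set)

    def add(self, doc_id, metadata):
        """Register a document under the value of the indexed field, if present."""
        if self.field_name in metadata:
            self.postings[metadata[self.field_name]].add(doc_id)

    def lookup(self, value):
        """Return candidate IDs without scanning every document."""
        return self.postings.get(value, set())


index = KeywordIndex("order_id")
index.add(1, {"order_id": "1766", "status": "confirmed"})
index.add(2, {"order_id": "1767", "status": "pending"})
index.add(3, {"order_id": "1766", "status": "confirmed"})

candidates = index.lookup("1766")
```

A lookup like this yields the candidate set directly, so the similarity computation only has to touch those documents instead of the full collection.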
Metadata Best Practices
What to Store
metadata = {
    # For filtering (store efficiently)
    "order_id": "1766",               # String
    "status": "shipped",              # Enum
    "customer_id": "123",             # String
    "created_timestamp": 1704067200,  # Unix timestamp (searchable range)
    # For context/display (don't need to filter)
    "customer_name": "John Doe",      # Can be in text
    "total_amount": "$99.99",         # Can be in text
}
Indexing Strategy
# Index fields you'll filter on
# Don't index fields you won't filter on
# Good: Small, exact values
"order_id": "1766"
"status": "shipped"
# Bad: Large text values
"full_order_description": "Order #1766 ..." # Use in documents instead
Summary
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Pre-filter | Filter values can be extracted from the query | Small search space, fast | Requires upfront extraction |
| Post-filter | Flexible filters | Simple | Might miss results |
| Integrated | Vector DB supports it | Most efficient | Requires specific DB |
Key Insight for Order #1766
Exact ID matching should ALWAYS use metadata filters, never semantic similarity.
# ❌ WRONG (semantic matching might fail)
results = semantic_search(query) # Might return #1767
# ✅ RIGHT (use metadata filter)
if "1766" in query:
    results = semantic_search(
        query,
        filter={"order_id": "1766"}
    )
Next Steps
→ Re-ranking — Improve result quality further
→ The Exact Match Problem — Putting it all together