Chunking Strategies: Preserving Token Identity
How you split documents into chunks dramatically affects retrieval quality, especially for exact matches.
The Chunking Problem
What you embed and index matters more than you might think.
Bad Chunking: Loss of Context
Documents:
"Customer John Doe placed Order #1766 on 2024-01-15
with items {laptop, mouse, keyboard} totaling $3,500.
Shipping address: 123 Main St. Order #1766 will ship
within 2 business days..."
Naive Chunking (by size, 200 chars):
Chunk 1: "Customer John Doe placed Order #1766 on 2024-01-15
with items {laptop, mouse, keyboard} totaling..."
Chunk 2: "...totaling $3,500. Shipping address: 123 Main St.
Order #1766 will ship within 2 business days..."
Problem: The order ID lands in the middle or at the end of each chunk,
so it loses prominence when compared side by side with chunks from other orders
Good Chunking: Preserving Identity
Structured Chunking (for Order #1766):
Chunk 1: "ORDER_ID: #1766"
Chunk 2: "Order #1766 Customer: John Doe"
Chunk 3: "Order #1766 Date: 2024-01-15"
Chunk 4: "Order #1766 Items: {laptop, mouse, keyboard}"
Chunk 5: "Order #1766 Total: $3,500"
Chunk 6: "Order #1766 Shipping: 123 Main St"
Chunk 7: "Order #1766 Delivery: within 2 business days"
Benefit: Order ID appears at the BEGINNING of every chunk,
making it prominent in embeddings and BM25 scoring
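To see why this matters for sparse retrieval, you can score both layouts with an off-the-shelf BM25 implementation. A minimal sketch, assuming the rank_bm25 package and toy ID-first chunks for two orders:

```python
from rank_bm25 import BM25Okapi

# ID-first chunks for two different orders (toy data)
chunks = [
    "ORDER_ID: #1766",
    "Order #1766 Customer: John Doe",
    "ORDER_ID: #1767",
    "Order #1767 Customer: Jane Roe",
]

# BM25 works on tokenized text; a whitespace split is the simplest choice
bm25 = BM25Okapi([c.lower().split() for c in chunks])

scores = bm25.get_scores("order #1766".split())
for chunk, score in sorted(zip(chunks, scores), key=lambda p: -p[1]):
    print(f"{score:6.3f}  {chunk}")
# Only the #1766 chunks get credit for the ID token; the #1767
# chunks cannot match "#1766" at all.
```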
Chunking Strategies
Strategy 1: Naive Fixed-Size
Document: "A long text..."
Split every 256 tokens (overlap 64):
Chunk 1: tokens[0:256]
Chunk 2: tokens[192:448] (192 = 256 - 64 overlap)
Chunk 3: tokens[384:640]
...
Pros: Simple
Cons: Might split sentences, loses structure, bad for exact matches
```python
def naive_chunking(text, chunk_size=256, overlap=64):
    # Note: splits on whitespace, so chunk_size counts words, not model tokens
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(' '.join(words[start:end]))
        if end == len(words):
            break  # reached the end; without this, the overlap step loops forever
        start = end - overlap
    return chunks
```
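A quick sanity check of the output (the counts follow from the stride of chunk_size - overlap = 192 words):

```python
chunks = naive_chunking("word " * 1000, chunk_size=256, overlap=64)
print(len(chunks))             # 5 chunks covering 1,000 words
print(len(chunks[0].split()))  # 256
```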
Strategy 2: Semantic (Sentence-Based)
Document: "Customer placed order. Status: confirmed.
Shipping tomorrow..."
Split by sentences:
Chunk 1: "Customer placed order."
Chunk 2: "Status: confirmed."
Chunk 3: "Shipping tomorrow..."
Pros: Respects sentence boundaries, natural
Cons: Loses context between sentences
```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time tokenizer data download ('punkt_tab' on newer NLTK)

def semantic_chunking(text):
    return sent_tokenize(text)
```
Strategy 3: Recursive (Sections → Paragraphs → Sentences)
Document structure:
├─ Section 1
│  └─ Paragraph
│     ├─ Sentence 1
│     └─ Sentence 2
Split at largest boundaries first, then smaller:
Chunks = [
[Sentence 1, Sentence 2], # Try keeping sentences together
[Sentence 1], # If too big, split
[Sentence 2]
]
Pros: Respects document structure
Cons: More complex
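No implementation is shown above, so here is a minimal sketch of the idea: split at the largest separator first and recurse with smaller separators only on pieces that are still too long. The separator list and size limit are illustrative assumptions, and the sketch drops the separators themselves and never merges small neighbors back together, which production splitters (e.g., LangChain's RecursiveCharacterTextSplitter) do:

```python
def recursive_chunking(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    """Split at the largest boundary first; recurse on oversized pieces."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunking(piece, max_len, rest))
    return [c for c in chunks if c.strip()]
```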
Strategy 4: Structured (Best for IDs and Exact Matches in RAG)
For documents with structured data (orders, products, accounts):
```python
def structured_chunking(document):
    """
    Extract structured fields and create separate chunks.
    Perfect for orders, products, profiles, etc.
    """
    chunks = []

    # Field-level chunks (order ID most prominent)
    if 'order_id' in document:
        chunks.append(f"ORDER_ID: {document['order_id']}")
    if 'customer_name' in document:
        chunks.append(f"Order #{document['order_id']} Customer: {document['customer_name']}")
    if 'created_date' in document:
        chunks.append(f"Order #{document['order_id']} Date: {document['created_date']}")
    if 'items' in document:
        items_str = ', '.join(document['items'])
        chunks.append(f"Order #{document['order_id']} Items: {items_str}")
    if 'total' in document:
        chunks.append(f"Order #{document['order_id']} Total: {document['total']}")

    # Full summary (for context)
    summary = f"Order #{document['order_id']}: {document.get('status', 'pending')} - {len(document.get('items', []))} items"
    chunks.append(summary)

    return chunks

# Example
order = {
    'order_id': '1766',
    'customer_name': 'John Doe',
    'created_date': '2024-01-15',
    'items': ['laptop', 'mouse'],
    'total': '$3,500',
    'status': 'shipped'
}
chunks = structured_chunking(order)
# Chunks:
# ["ORDER_ID: 1766",
#  "Order #1766 Customer: John Doe",
#  "Order #1766 Date: 2024-01-15",
#  ...,
#  "Order #1766: shipped - 2 items"]
```
Pros:
- ✅ IDs always prominent
- ✅ Perfect for structured data
- ✅ Works well with BM25 (exact match)
- ✅ Works well with metadata filtering (see the sketch below)

Cons:
- ❌ Requires knowing document structure
- ❌ Less suitable for unstructured text
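Because every chunk maps back to a known field, attaching metadata for filtering is straightforward. A sketch under the assumption that your vector store accepts per-chunk metadata (the store.search call is hypothetical, not a specific library's API):

```python
def chunks_with_metadata(document):
    """Pair each structured chunk with metadata for exact-match filtering."""
    return [
        {'text': chunk, 'metadata': {'order_id': document['order_id']}}
        for chunk in structured_chunking(document)
    ]

# At query time a metadata filter narrows the search to one order
# before any similarity scoring happens, e.g.:
# results = store.search(query_emb, filter={'order_id': '1766'})
```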
Strategy 5: Hybrid (Structured + Semantic)
Combine structured and unstructured:
```python
def hybrid_chunking(document):
    """
    For documents with both structured and unstructured parts.
    """
    chunks = []

    # 1. Structured chunks (ID-focused)
    if 'order_id' in document:
        chunks.append(f"ORDER_ID: {document['order_id']}")
        chunks.append(f"Order #{document['order_id']} Status: {document.get('status', 'unknown')}")

    # 2. Semantic chunks (description/notes)
    if 'description' in document:
        for sentence in sent_tokenize(document['description']):
            # Prepend the ID so it stays prominent in every chunk
            chunks.append(f"Order #{document['order_id']}: {sentence}")

    # 3. Include context for key dates/amounts
    if 'created_date' in document:
        chunks.append(f"Order #{document['order_id']} Placed: {document['created_date']}")

    return chunks
```
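For example, an order with a free-text description (values are made up) yields ID-first chunks for both the structured fields and each sentence:

```python
chunks = hybrid_chunking({
    'order_id': '1766',
    'status': 'shipped',
    'created_date': '2024-01-15',
    'description': 'Customer requested gift wrapping. Deliver to the back door.'
})
# ['ORDER_ID: 1766',
#  'Order #1766 Status: shipped',
#  'Order #1766: Customer requested gift wrapping.',
#  'Order #1766: Deliver to the back door.',
#  'Order #1766 Placed: 2024-01-15']
```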
Chunking Strategy: Task-Based Decision
Is your data...
Structured? (orders, products, accounts)
├─ Use structured chunking
├─ Put ID/key at beginning of EACH chunk
└─ Create separate chunks per field
Unstructured? (articles, PDFs, news)
├─ Use semantic (sentence-based)
├─ But prepend important IDs/keywords to chunks
└─ Use recursive if very long
Mixed? (structured + narrative)
└─ Use hybrid approach
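This decision can live in code as a small dispatcher. A sketch, assuming the chunkers defined above and a crude heuristic (a dict with an ID field counts as structured):

```python
def chunk_document(document):
    """Route a document to a chunking strategy based on its shape."""
    if isinstance(document, dict) and 'order_id' in document:
        if 'description' in document:
            return hybrid_chunking(document)   # structured fields + narrative
        return structured_chunking(document)   # purely structured record
    return semantic_chunking(document)         # plain unstructured text
```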
Example: Orders vs Articles
Orders (Structured)
```python
order_doc = {
    'order_id': '1766',
    'customer': 'john@example.com',
    'date': '2024-01-15',
    'items': [
        {'sku': 'LAPTOP-001', 'name': 'Dell Laptop', 'qty': 1, 'price': 1200},
        {'sku': 'MOUSE-002', 'name': 'Wireless Mouse', 'qty': 2, 'price': 50},
    ],
    'total': 1300,
    'status': 'shipped'
}

# Structured chunking - best approach
chunks = [
    "ORDER_ID: 1766",
    "Order #1766 Customer: john@example.com",
    "Order #1766 Date: 2024-01-15",
    "Order #1766 Status: shipped",
    "Order #1766 Item: LAPTOP-001 Dell Laptop (Qty: 1, Price: $1200)",
    "Order #1766 Item: MOUSE-002 Wireless Mouse (Qty: 2, Price: $50 each)",
    "Order #1766 Total: $1300"
]
```
Articles (Unstructured)
```python
article = {
    'title': 'How to Use Embeddings',
    'url': 'https://example.com/article-123',
    'body': "Embeddings are... They work by... Applications include..."
}

# Semantic chunking (but prepend context)
chunks = [
    "Article 'How to Use Embeddings': Embeddings are vector representations of text...",
    "Article 'How to Use Embeddings': They work by training neural networks on large corpora...",
    "Article 'How to Use Embeddings': Applications include search, recommendation systems, and more..."
]
```
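The prepend-context pattern generalizes into a small helper that reuses sent_tokenize from earlier (the exact prefix format is an assumption; adjust to taste):

```python
def article_chunking(article):
    """Sentence-split the body, prepending the title to every chunk."""
    prefix = f"Article '{article['title']}':"
    return [f"{prefix} {sentence}" for sentence in sent_tokenize(article['body'])]

chunks = article_chunking(article)
```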
Impact on Retrieval Quality
Without Good Chunking
Query: "Order #1766"
Dense Search Results:
- Order #1766 ... in middle of 500-token chunk ≈ 0.87 similarity
- Order #1767 ... in middle of 500-token chunk ≈ 0.86 similarity
- ❌ Can't distinguish by embedding alone
BM25 Results (if not chunked well):
- Order #1766 ... buried in text, low TF ≈ 3.2 score
- Order #1767 ... nearby, similar TF ≈ 3.1 score
- ❌ Similar scores
With Good Structured Chunking
Query: "Order #1766"
Dense Search Results:
- "ORDER_ID: 1766" ≈ 0.95 similarity (very specific)
- "Order #1766 Status: shipped" ≈ 0.93 similarity
- ❌ Still might mix with #1767
BM25 Results:
- "ORDER_ID: 1766" ≈ 8.5 score (exact match at START)
- "Order #1766 Customer: ..." ≈ 8.3 score
- ✅ #1767 chunks don't match "1766" at all!
Hybrid Combined (illustrative: raw cosine and BM25 scores live on different scales, so real systems normalize before adding; see the sketch below):
- "ORDER_ID: 1766" → 0.95 + 8.5 = 9.45 ✅
- "Order #1766 ..." → 0.93 + 8.3 = 9.23 ✅
- #1767 chunks → ~0.85 + 0 = 0.85 (lost!)
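A minimal sketch of one common way to combine the two signals: min-max normalize each score list to [0, 1], then take a weighted sum (the default alpha = 0.5 is an assumption to tune on your data). Reciprocal rank fusion is a popular alternative that sidesteps score scales entirely by combining ranks instead:

```python
def combine_scores(dense_scores, bm25_scores, alpha=0.5):
    """Weighted sum of min-max normalized dense and BM25 scores."""
    def normalize(scores):
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [0.0] * len(scores)  # all scores equal: no signal to rescale
        return [(s - lo) / (hi - lo) for s in scores]
    dense_n = normalize(dense_scores)
    bm25_n = normalize(bm25_scores)
    return [alpha * d + (1 - alpha) * b for d, b in zip(dense_n, bm25_n)]
```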
Best Practices
- Put important identifiers FIRST in chunks
- Create separate chunks for structured fields
- Repeat context (IDs, titles) in each chunk
- Keep chunks reasonably sized (long chunks dilute an identifier's weight)
- Test with your actual queries
Code Example: Comparing Chunking Strategies
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc = """
Order #1766 placed by john@example.com on 2024-01-15.
Items: Dell Laptop ($1200), Wireless Mouse x2 ($50 each).
Total: $1300. Status: shipped. Tracking: FedEx.
Arrives by Jan 20, 2024.
"""

# Strategy A: Naive
naive_chunks = naive_chunking(doc, chunk_size=100)

# Strategy B: Structured (field names must match what structured_chunking checks)
structured_chunks = structured_chunking({
    'order_id': '1766',
    'customer_name': 'john@example.com',
    'created_date': '2024-01-15',
    'items': ['Dell Laptop', 'Wireless Mouse'],
    'total': '$1300',
    'status': 'shipped'
})

# Compare embeddings
encoder = SentenceTransformer('all-MiniLM-L6-v2')
query = "Order #1766"
query_emb = encoder.encode(query)

print("Naive chunking:")
for i, chunk in enumerate(naive_chunks):
    sim = cosine_similarity(query_emb, encoder.encode(chunk))
    print(f"  Chunk {i}: {sim:.3f} - '{chunk[:50]}...'")

print("\nStructured chunking:")
for i, chunk in enumerate(structured_chunks):
    sim = cosine_similarity(query_emb, encoder.encode(chunk))
    print(f"  Chunk {i}: {sim:.3f} - '{chunk}'")

# Structured chunking should show higher similarity for the ID-focused chunks
```
Summary
| Strategy | Best For | Considers IDs? | Supports Exact Match? |
|---|---|---|---|
| Naive | Quick prototypes | ❌ No | ❌ Poor |
| Semantic | Unstructured text | ❌ No | ⚠ Medium |
| Recursive | Large documents | ❌ No | ⚠ Medium |
| Structured | Orders, products, profiles | ✅ Yes | ✅ Great |
| Hybrid | Mixed documents | ✅ Yes | ✅ Great |
Recommendation
For RAG systems with exact identifiers (Order #1766, SKU-001, etc.):
- Use structured chunking
- Put IDs at the BEGINNING of every chunk
- Create separate chunks per field

This maximally helps:
- Dense retrieval (ID prominent in embedding)
- Sparse retrieval (BM25 scores ID matches highest)
- Metadata filtering (IDs easy to extract)
Next Steps
→ RAG Pipeline — Putting everything together