Probability & Statistics Foundations
For RAG systems to work well, you need to understand term importance and how to score documents. This section builds the statistical foundations.
Probability Basics
What Is Probability?
A probability is a number between 0 and 1 representing how likely something is:
- P(event) = 0: The event is impossible
- P(event) = 0.5: The event is as likely to happen as not
- P(event) = 1: The event is certain
Conditional Probability
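Given that event \(A\) happened, what's the probability that event \(B\) happens? This is written \(P(B \mid A)\) and, for \(P(A) > 0\), is defined as:
\[
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
\]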
Example: In a document collection, what's the probability that a document is relevant given that it contains the word "order"?
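As a rough illustration, a conditional probability like this one can be estimated by counting. The sketch below uses a tiny hand-made collection; the texts, the relevance labels, and the helper name `conditional_probability` are all invented for illustration.

```python
# Toy collection: each entry is (text, is_relevant). Texts and labels are
# made up for illustration; they do not come from a real dataset.
docs = [
    ("order 1766 shipped", True),
    ("order 1767 pending", True),
    ("please order lunch", False),
    ("refund issued today", True),
    ("weather is sunny", False),
]


def conditional_probability(docs, word):
    """Estimate P(relevant | document contains `word`) by counting."""
    containing = [rel for text, rel in docs if word in text.split()]
    if not containing:
        raise ValueError(f"no document contains {word!r}")
    return sum(containing) / len(containing)


# Unconditional probability of relevance, for comparison.
p_relevant = sum(rel for _, rel in docs) / len(docs)
p_relevant_given_order = conditional_probability(docs, "order")

print(f"P(relevant) = {p_relevant:.2f}")
print(f"P(relevant | contains 'order') = {p_relevant_given_order:.2f}")
```

In this toy data, knowing that a document contains "order" shifts the estimate only slightly, which is typical of a common word.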
Information Theory: Measuring Importance
The more surprising a term is (appears in fewer documents), the more informative it is.
Inverse Document Frequency (IDF)
The inverse document frequency of a term measures its importance:
\[
\text{IDF}(t) = \log\left(\frac{N}{df_t}\right)
\]
where:
- \(N\) = total number of documents
- \(df_t\) = number of documents containing term \(t\)
Here \(\log\) denotes the natural logarithm.
Why Logarithm?
The logarithm compresses large ratios into manageable numbers. More importantly, IDF grows as terms become rarer, which is what we want:
- If a term appears in 1 document out of 1000: \(\text{IDF} = \log(1000/1) = \log(1000) \approx 6.9\)
- If a term appears in 500 documents out of 1000: \(\text{IDF} = \log(1000/500) = \log(2) \approx 0.69\)
The rare term gets a higher score!
Example
Suppose we have 1000 documents:
| Term | Appears In | IDF |
|---|---|---|
| "order" (common) | 500 docs | \(\log(1000/500) \approx 0.69\) |
| "transaction" (uncommon) | 50 docs | \(\log(1000/50) \approx 3.00\) |
| "order_#1766" (very rare) | 1 doc | \(\log(1000/1) = 6.91\) |
Notice: Exact identifiers like "order_#1766" get very high IDF scores because they're unique!
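The IDF values for these three terms can be computed directly. This is a minimal sketch that assumes the natural logarithm (the base that gives \(\log(1000) \approx 6.91\)); the function name `idf` is chosen here for illustration.

```python
import math


def idf(total_docs, doc_freq):
    """Inverse document frequency, log(N / df), with the natural log."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log(total_docs / doc_freq)


N = 1000
for term, df in [("order", 500), ("transaction", 50), ("order_#1766", 1)]:
    print(f"{term}: IDF = {idf(N, df):.2f}")
```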
Term Frequency: Measuring Presence
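Term Frequency (TF) counts how often a term \(t\) appears in a document \(d\):
\[
\text{TF}(t, d) = \text{number of times } t \text{ occurs in } d
\]
However, longer documents naturally have higher term frequencies. We often normalize by document length:
\[
\text{TF}_{\text{norm}}(t, d) = \frac{\text{TF}(t, d)}{\text{total number of terms in } d}
\]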
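A minimal sketch of raw and length-normalized term frequency follows. The lowercase whitespace tokenizer and the function name `term_frequencies` are assumptions made for brevity; real systems usually tokenize more carefully.

```python
from collections import Counter


def term_frequencies(text):
    """Return (raw counts, length-normalized frequencies) for a text.

    Tokenization is a plain lowercase whitespace split, an assumption
    made to keep the example short.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    normalized = {term: count / len(tokens) for term, count in counts.items()}
    return counts, normalized


counts, normalized = term_frequencies("order placed order shipped today")
print(counts["order"])               # raw count of "order"
print(f"{normalized['order']:.2f}")  # count divided by document length
```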
TF-IDF: Combining Frequency and Importance
TF-IDF is the product:
\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\]
This gives high scores to terms that are:
1. Common in the document (high TF)
2. Rare across all documents (high IDF)
Example
Document: "Customer with Order #1766 made a purchase of $100. Order #1766 is confirmed."
| Term | TF | IDF | TF-IDF |
|---|---|---|---|
| "order" | 2/13 ≈ 0.15 | 0.69 | 0.10 |
| "#1766" | 2/13 ≈ 0.15 | 6.91 | 1.04 |
| "purchase" | 1/13 ≈ 0.08 | 4.50 | 0.36 |
The exact identifier "#1766" dominates the TF-IDF score because it's specific and rare!
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Customer with Order #1766 made a purchase",
    "Order #1767 is pending verification",
]

# The default tokenizer lowercases text and drops punctuation and
# one-character tokens, so "#1766" is indexed as "1766".
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Get term scores for the first document
feature_names = vectorizer.get_feature_names_out()
scores = tfidf_matrix[0].toarray().flatten()
for name, score in zip(feature_names, scores):
    if score > 0:
        print(f"{name}: {score:.3f}")
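The scores printed by the library will not match a hand calculation with \(\log(N / df_t)\). As documented for scikit-learn, the default settings (`smooth_idf=True`, `norm="l2"`) use a smoothed IDF, \(\ln\frac{1 + N}{1 + df_t} + 1\), and then scale each document vector to unit length. The sketch below compares the two IDF formulas in plain Python for a two-document collection; the function names are illustrative and not part of any library.

```python
import math


def plain_idf(n_docs, doc_freq):
    """IDF as defined in this section: log(N / df)."""
    return math.log(n_docs / doc_freq)


def smoothed_idf(n_docs, doc_freq):
    """Smoothed variant: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 2
# "order" occurs in both documents, "1766" in only one.
for term, df in [("order", 2), ("1766", 1)]:
    print(
        f"{term}: plain={plain_idf(n_docs, df):.3f}, "
        f"smoothed={smoothed_idf(n_docs, df):.3f}"
    )
```

With the plain formula, a term that occurs in every document gets an IDF of zero. The smoothed variant keeps such terms at a small positive weight while still ranking the rarer term higher.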
Why This Matters for RAG
In sparse retrieval (the keyword/BM25 approach), TF-IDF and similar scoring methods are how we find documents containing exact matches:
- Search for "Order #1766"?
- → Find documents with very high TF-IDF for "1766" (because it's unique)
- → Will correctly return Order #1766 documents, NOT Order #1767
This is why BM25 search (an improved TF-IDF variant) works well for exact matches! It's mathematically designed to prioritize rare, specific terms—exactly what order numbers are.
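To make the BM25 idea concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not a production implementation: the values `k1=1.5` and `b=0.75` are commonly used defaults chosen here as assumptions, the IDF uses the variant with 1 added inside the logarithm so it stays non-negative, and the tokenization is a simple whitespace split on text whose punctuation has already been removed.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    k1 controls term-frequency saturation and b controls length
    normalization; both defaults are assumptions for this sketch.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents each term occurs in.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))

    scores = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "customer with order 1766 made a purchase".split(),
    "order 1767 is pending verification".split(),
]
scores = bm25_scores("order 1766".split(), docs)
for tokens, score in zip(docs, scores):
    print(f"{score:.3f}  {' '.join(tokens)}")
```

The first document scores higher because it matches the rare token "1766", while the shared token "order" contributes little to either document.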
Distributions and Probability Densities
When working with embeddings and vectors, we often assume they follow a distribution—a mathematical description of how likely different values are.
Gaussian (Normal) Distribution
The most common distribution:
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]
where:
- \(\mu\) = mean (center)
- \(\sigma\) = standard deviation (spread)
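Why it matters: The values in individual embedding dimensions, and the similarity scores computed from them, are often roughly bell-shaped, although this varies by model and dataset. Treating them as approximately normal helps when reasoning about the statistical properties of similarity scores, such as how unusual a given score is.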
import numpy as np
import matplotlib.pyplot as plt
# Generate random samples from normal distribution
data = np.random.normal(loc=0, scale=1, size=10000)
plt.hist(data, bins=50)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Normal Distribution (μ=0, σ=1)")
plt.show()
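One statistical property worth seeing in numbers is how cosine similarity behaves for unrelated vectors. The sketch below uses random Gaussian vectors as a stand-in for embeddings, which is an assumption: real embeddings are not independent noise, so this is only a baseline. For random directions in \(d\) dimensions, cosine similarity is centered at 0 with a standard deviation of \(1/\sqrt{d}\). The dimension 256 and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
dim = 256
n_pairs = 5000

# Random Gaussian vectors stand in for embeddings; real embeddings are
# not independent noise, so treat this only as a baseline.
a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))

# Row-wise cosine similarity between each pair (a[i], b[i]).
cosine = np.sum(a * b, axis=1) / (
    np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
)

print(f"mean = {cosine.mean():.4f}")
print(f"std  = {cosine.std():.4f}")
print(f"1/sqrt(dim) = {1 / np.sqrt(dim):.4f}")
```

Against this baseline, a similarity well above a few multiples of \(1/\sqrt{d}\) is unlikely to arise from unrelated random directions.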
Summary Table
| Concept | Formula | Purpose in RAG |
|---|---|---|
| TF | count of term in doc | Measures how often a term appears |
| IDF | \(\log(N / df_t)\) | Measures how rare/important a term is |
| TF-IDF | TF × IDF | Keyword similarity scoring |
| BM25 | Extended TF-IDF | Production-grade sparse search |
Key Insight: Why Hybrid Search Works
- Semantic search (embeddings + cosine similarity) → captures meaning
- Sparse search (BM25/TF-IDF) → captures exact terms
For "Order #1766":
- Semantic search might return Order #1767 too (because they're semantically similar)
- Sparse search returns ONLY documents with "1766" in them (exact match)
Hybrid = both together = best of both worlds!
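One simple way to combine the two signals is a weighted sum of rescaled scores. The sketch below is only one of several possible fusion strategies; the function names, the weight `alpha`, the document IDs, and all scores are invented for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(semantic, sparse, alpha=0.5):
    """Blend two score mappings; alpha is the weight on the semantic side."""
    semantic, sparse = min_max(semantic), min_max(sparse)
    docs = set(semantic) | set(sparse)
    combined = {
        doc: alpha * semantic.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
        for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Invented scores for illustration only.
semantic_scores = {"order_1766": 0.91, "order_1767": 0.90, "faq_returns": 0.40}
sparse_scores = {"order_1766": 7.2, "order_1767": 0.3, "faq_returns": 0.0}

ranking = hybrid_rank(semantic_scores, sparse_scores)
for doc, score in ranking:
    print(f"{score:.3f}  {doc}")
```

In this made-up example the semantic scores barely separate the two orders, and the sparse score breaks the tie in favor of the exact match.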
Practice Problems
Problem 1
If a collection has 10,000 documents and a term appears in 100 of them, what is the IDF?
Solution: \(\text{IDF} = \log(10000 / 100) = \log(100) \approx 4.61\)
Problem 2
A document has 200 words, and the term "customer" appears 5 times. What is the normalized TF?
Solution: \(\text{TF}_{\text{norm}} = 5 / 200 = 0.025\)
Problem 3
If IDF("customer") = 2.0 and \(\text{TF}_{\text{norm}}\) = 0.025, what is TF-IDF?
Solution: \(\text{TF-IDF} = 0.025 \times 2.0 = 0.05\)
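The three solutions can be checked with a few lines of Python, assuming the natural logarithm for IDF. The variable names are illustrative.

```python
import math

idf_problem_1 = math.log(10_000 / 100)  # Problem 1: IDF
tf_norm_problem_2 = 5 / 200             # Problem 2: normalized TF
tfidf_problem_3 = 0.025 * 2.0           # Problem 3: TF-IDF

print(f"Problem 1: {idf_problem_1:.2f}")
print(f"Problem 2: {tf_norm_problem_2:.3f}")
print(f"Problem 3: {tfidf_problem_3:.2f}")
```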
Next Steps
Now you have the mathematical foundations! Move to Understanding Embeddings to see how modern models convert text into numbers.