Vector Spaces and High-Dimensional Geometry
When you move from 2D or 3D to hundreds of dimensions (as with embeddings), geometry behaves very differently. This section explains why, and what it means for RAG systems.
From 2D to Many Dimensions
2D Space (Easy to Visualize)
Distance between points is easy to compute and visualize.
384-Dimensional Space (Embeddings)
We can't visualize this, but we can reason about it mathematically.
In an embedding space with \(d\) dimensions, every point is described by \(d\) numbers.
For the sentence-transformers model all-MiniLM-L6-v2 (used in the examples below): \(d = 384\).
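As a quick sanity check (a minimal sketch using the all-MiniLM-L6-v2 model that also appears later in this section; the example sentence is arbitrary), an embedding really is just an array of 384 numbers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Retrieval-augmented generation combines search with a language model.")

print(embedding.shape)  # (384,) -- one coordinate per dimension
print(embedding[:5])    # the first 5 of the 384 coordinates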
The Curse of Dimensionality
Surprising Fact: Random Vectors Are Almost Perpendicular
In high dimensions, random vectors point in almost completely different directions!
Let's measure this: generate two random 384-dimensional vectors and compute their similarity.
import numpy as np
# Generate random vectors
dim = 384
v1 = np.random.randn(dim)
v2 = np.random.randn(dim)
# Normalize
v1 = v1 / np.linalg.norm(v1)
v2 = v2 / np.linalg.norm(v2)
# Compute similarity
sim = np.dot(v1, v2)
print(f"Similarity: {sim:.4f}")
# Output: something like 0.0234 (very close to 0, i.e. nearly perpendicular)
Why? In high dimensions, there's "room" in all directions, so random directions are nearly orthogonal.
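To quantify how tight this is, here is a small sketch; the theoretical value follows from the fact that the dot product of two random unit vectors in \(d\) dimensions has variance \(1/d\).
import numpy as np

dim = 384
n_pairs = 10_000
rng = np.random.default_rng(0)

# Sample many pairs of random unit vectors and record their similarities
a = rng.standard_normal((n_pairs, dim))
b = rng.standard_normal((n_pairs, dim))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)
sims = np.einsum('ij,ij->i', a, b)  # row-wise dot products

print(f"mean similarity:       {sims.mean():.4f}")      # ~0
print(f"std of similarity:     {sims.std():.4f}")       # ~0.05
print(f"theoretical 1/sqrt(d): {1 / np.sqrt(dim):.4f}")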
Consequence for RAG
This has important implications:
- Random documents are very different (good—we can distinguish them)
- But "similar" documents might only be moderately close! (0.7-0.8 similarity is actually quite close in high dimensions)
- Noise and minor variations are magnified
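Here is a rough illustration of that gap (a sketch using the same all-MiniLM-L6-v2 model; the sentences are made up and exact scores will vary by model and input):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "How do I reset my password?",                      # query-like sentence
    "What are the steps to change my login password?",  # paraphrase
    "The weather in Paris is mild in spring.",          # unrelated
]
emb = model.encode(sentences, normalize_embeddings=True)

print(f"similar pair:   {np.dot(emb[0], emb[1]):.3f}")  # typically well above 0.5
print(f"unrelated pair: {np.dot(emb[0], emb[2]):.3f}")  # typically near 0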
The Volume Paradox
Another strange property: in high dimensions, most of the volume is on the surface.
Consider a hypersphere (a \(d\)-dimensional ball) of radius \(r\). The fraction of its volume that lies within the smaller radius \((1-\epsilon)r\) is \((1-\epsilon)^d\), which collapses toward zero as \(d\) grows.
As \(d\) increases, the volume therefore becomes increasingly concentrated in a thin shell at the surface (the boundary), not at the center.
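A quick calculation makes this concrete: the fraction of the ball inside 95% of the radius is \(0.95^d\).
# Fraction of a d-dimensional ball's volume inside radius 0.95 * r is 0.95**d
for d in [2, 10, 50, 384]:
    print(f"d={d:4d}: fraction inside 95% of the radius = {0.95 ** d:.2e}")
For \(d = 384\) that fraction is on the order of \(10^{-9}\): essentially all of the volume sits in the outer 5% shell.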
Implication
In a RAG system with embeddings in a database:
- Points near the surface are sparsely distributed
- Real clusters of similar documents are relatively rare
- Most of the volume is "empty space"
This is why approximate nearest neighbor (ANN) algorithms work so well—they exploit this structure to skip most of the empty space.
Intrinsic Dimensionality
Not all 384 dimensions are equally important!
Most real data (embeddings from documents) actually lie on a lower-dimensional manifold within the full 384-dimensional space.
The intrinsic dimensionality is much lower than 384.
Example
Imagine 1 million documents, each with a 384-dimensional embedding. But all documents are variations on just 50 key topics. The data effectively lives in ~50 dimensions, even though represented in 384.
Implication for RAG
- You don't need extremely high-dimensional spaces
- 384 dimensions is often more than enough
- But each extra dimension adds computational cost
- This is why dimensionality reduction techniques (PCA, autoencoders) can work (see the sketch below)
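A sketch of what that looks like in practice, assuming an `embeddings` array of shape (N, 384) like the one used in the visualization examples later on: PCA reports how many directions are needed to capture most of the variance.
from sklearn.decomposition import PCA

# embeddings: assumed (N, 384) array of document embeddings
pca = PCA(n_components=0.95, svd_solver="full")  # keep components explaining 95% of variance
reduced = pca.fit_transform(embeddings)

print(f"original dimensions:         {embeddings.shape[1]}")
print(f"components for 95% variance: {pca.n_components_}")
print(f"reduced shape:               {reduced.shape}")
If the reported number of components is far below 384, the intrinsic dimensionality of your corpus is correspondingly low.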
Distance Concentration
As dimensions increase, most pairwise distances converge to the same value!
For random data in \(d\) dimensions, the typical distance between two points grows like
\[ \mathbb{E}\big[\|x - y\|\big] \approx C \sqrt{d} \]
where \(C\) is a constant, while the standard deviation of the pairwise distances stays roughly the same. Relative to the typical distance, the spread therefore shrinks like \(1/\sqrt{d}\).
Visualization
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

dims = [2, 10, 50, 100, 384]
distances = []
for d in dims:
    # Generate 100 random points in d dimensions
    points = np.random.randn(100, d)
    # Compute pairwise distances, scaled by their mean so the distributions are comparable
    dists = pdist(points)
    distances.append(dists / dists.mean())

# Plot the distance distributions
for d, dists in zip(dims, distances):
    plt.hist(dists, bins=30, alpha=0.5, density=True, label=f"d={d}")
plt.xlabel("Distance / mean distance")
plt.ylabel("Density")
plt.legend()
plt.title("Pairwise Distances in Different Dimensions")
plt.show()
Result: as \(d\) increases, the histogram becomes narrower and more peaked relative to the mean distance; pairwise distances become nearly indistinguishable from one another.
Why This Matters
- In 2D, "close" vs "far" are clearly different
- In 384D, most points are at roughly the same distance from each other (quantified in the sketch below)
- This makes finding "nearest" neighbors harder (requires more precision)
- Reinforces why ANN algorithms are necessary
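The same effect can be measured numerically rather than visually (a small sketch; the std/mean ratio is the quantity that shrinks):
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 50, 100, 384]:
    dists = pdist(rng.standard_normal((200, d)))
    # Relative spread: width of the distance distribution compared to its mean
    print(f"d={d:4d}: mean={dists.mean():6.2f}  std={dists.std():.2f}  std/mean={dists.std() / dists.mean():.3f}")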
Cosine Similarity in High Dimensions
When vectors are normalized (as embeddings typically are), we use cosine similarity:
\[ \text{sim}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \]
which, for unit vectors, is just the dot product \(u \cdot v\).
In high dimensions with random data, similarities concentrate around 0 with a standard deviation of roughly \(1/\sqrt{d}\) (about 0.05 for \(d = 384\)): actual similarities might range from about -0.2 to 0.2, with most near 0.
For actual document embeddings, clusters of related documents form clouds with mutual similarities like 0.6-0.9.
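For unit vectors, cosine similarity and Euclidean distance are two views of the same quantity, since \(\|u - v\|^2 = 2 - 2\,(u \cdot v)\). A small sketch verifying this:
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(384)
v = rng.standard_normal(384)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cos_sim = np.dot(u, v)
euclidean = np.linalg.norm(u - v)

print(f"cosine similarity:   {cos_sim:.4f}")
print(f"euclidean distance:  {euclidean:.4f}")
print(f"sqrt(2 - 2*cos_sim): {np.sqrt(2 - 2 * cos_sim):.4f}")  # matches the distance
Because the relation is monotonic, ranking by cosine similarity and ranking by Euclidean distance give the same order for normalized embeddings.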
Visualizing High-Dimensional Data
You can't plot 384 dimensions, but you can project to 2D or 3D:
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Preserves local neighbor structure:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embeddings shape: (1000, 384); labels holds one topic/cluster id per document for coloring
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings)
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, cmap='viridis')
plt.title("Document Embeddings (t-SNE projection)")
plt.show()
UMAP (Uniform Manifold Approximation and Projection)
Faster than t-SNE, and also preserves neighborhood structure:
from umap import UMAP
import matplotlib.pyplot as plt

reducer = UMAP(n_components=2)  # avoid shadowing the umap module name
embeddings_2d = reducer.fit_transform(embeddings)
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels)
plt.show()
These projections show clusters of similar documents and help debug embedding quality, but remember: projections lose information!
Norms in High Dimensions
In high dimensions, randomly-generated vectors (e.g. with independent standard-normal components) all end up with nearly the same magnitude (norm), close to \(\sqrt{d}\).
For \(d = 384\): \(\|v\| \approx \sqrt{384} \approx 19.6\)
Implication: when we normalize vectors (divide by their norm), only the direction carries information, so comparisons between vectors reflect what they encode rather than their raw magnitude.
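A quick check of the \(\|v\| \approx \sqrt{d}\) claim (a sketch with random Gaussian vectors):
import numpy as np

rng = np.random.default_rng(0)
dim = 384
norms = np.linalg.norm(rng.standard_normal((1000, dim)), axis=1)

print(f"mean norm:   {norms.mean():.2f}")  # ~19.6
print(f"std of norm: {norms.std():.2f}")   # small relative to the mean
print(f"sqrt(d):     {np.sqrt(dim):.2f}")  # 19.60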
Most embedding models output normalized vectors:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Text")
# Most models normalize automatically
magnitude = np.linalg.norm(embedding)
print(f"Magnitude: {magnitude:.4f}")  # Should be ~1.0
Practical Implications for RAG
| Property | Impact on RAG |
|---|---|
| Random vectors are nearly perpendicular | Most documents are distinguishable |
| Distance concentration | Need precise distance computation |
| Intrinsic dimensionality < 384 | Could use smaller models if needed |
| Normalized vectors have equal norm | Fair comparison between vectors |
Summary
High-dimensional spaces are weird:
- Most volume is on the surface
- Random points are far apart
- Most pairwise distances are similar in magnitude
- "Close" and "far" are relative
But these properties actually make embedding-based search work well once you understand them.
Next Steps
Now let's apply this knowledge:
→ Distance Metrics - Detailed math of different similarity measures
→ Exact vs Approximate Search - How to search among millions of vectors efficiently