Lab 2: Text Embeddings & Semantic Similarity
Level: Foundations | Duration: 1.5 hours
Objective
Understand how modern transformer models convert text to meaningful vectors and explore the embedding space.
What You'll Learn
- Load pretrained sentence transformer models
- Generate embeddings for diverse texts
- Explore embedding properties (dimensions, scale, distribution)
- Compare similarity across different topics
- Visualize high-dimensional embeddings with t-SNE
- Understand limitations of embeddings
Key Concepts
- Sentence-Transformers: Specialized transformers for sentence embeddings
- Semantic Similarity: How embeddings capture meaning
- Dimensionality Reduction: Visualizing 384D vectors in 2D
- Token vs Sentence Embeddings: Differences and use cases
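To make the token-versus-sentence distinction concrete, the sketch below mean-pools a few made-up token vectors into a single sentence vector. Mean pooling is one common strategy for turning per-token outputs into a sentence embedding; the 3-dimensional vectors, the mask values, and the `mean_pool` helper are illustrative assumptions, not output from a real model.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Hypothetical token embeddings for "cats purr" plus one padding slot.
token_vectors = [
    [1.0, 0.0, 2.0],  # "cats"
    [3.0, 2.0, 0.0],  # "purr"
    [9.0, 9.0, 9.0],  # padding, must not affect the result
]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

Token embeddings keep one vector per token, which suits tagging or span-level tasks; the pooled vector is a single fixed-size summary, which is what similarity search compares.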
In [21]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine  # used in the similarity-search cell

# Load the pretrained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample corpus for exploration: three sentences per category
corpus = {
    "Science": [
        "The photosynthesis process converts light energy into chemical energy",
        "Quantum mechanics describes the behavior of matter at atomic scales",
        "DNA contains the genetic instructions for all living organisms"
    ],
    "Sports": [
        "Football teams compete to score touchdowns in the end zone",
        "Basketball players dribble and shoot to score points",
        "Tennis matches involve hitting a ball across a net"
    ],
    "Food": [
        "Pizza is made with dough, sauce, cheese, and various toppings",
        "Sushi is a Japanese dish made with rice and raw fish",
        "Pasta dishes are popular in Italian cuisine"
    ]
}

# Generate embeddings
embeddings = {}  # category -> array of shape (num_sentences, embedding_dim)
texts = []       # every sentence, in corpus order
labels = []      # category of each sentence, aligned with texts

# corpus maps each category name to its list of sentences;
# encode each list in one batch and record the sentences and their labels.
for category, docs in corpus.items():
    embeddings[category] = model.encode(docs, show_progress_bar=False)
    for doc in docs:
        texts.append(doc)
        labels.append(category)

print(f"Generated {len(texts)} embeddings")
print(f"Each embedding has {embeddings[list(embeddings.keys())[0]].shape[1]} dimensions\n")
print("-" * 60)
print(f"labels has size {len(labels)} and Contains :\n ")
print(labels)
print(f"texts has size {len(texts)} and Contains :\n ")
print(texts)
print("-" * 60)
print("Sample Embedding for a Science Document:")
print(embeddings["Science"][0][:10])  # first 10 dimensions of the first Science embedding
print("-" * 60)
Generated 9 embeddings
Each embedding has 384 dimensions

------------------------------------------------------------
labels has size 9 and Contains :

['Science', 'Science', 'Science', 'Sports', 'Sports', 'Sports', 'Food', 'Food', 'Food']
texts has size 9 and Contains :

['The photosynthesis process converts light energy into chemical energy', 'Quantum mechanics describes the behavior of matter at atomic scales', 'DNA contains the genetic instructions for all living organisms', 'Football teams compete to score touchdowns in the end zone', 'Basketball players dribble and shoot to score points', 'Tennis matches involve hitting a ball across a net', 'Pizza is made with dough, sauce, cheese, and various toppings', 'Sushi is a Japanese dish made with rice and raw fish', 'Pasta dishes are popular in Italian cuisine']
------------------------------------------------------------
Sample Embedding for a Science Document:
[-0.01501783 0.07587931 -0.07658472 0.06114178 0.00673075 -0.02079478 0.02115133 0.02275151 0.04248194 0.08286117]
------------------------------------------------------------
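One way to compare similarity across topics is to average the pairwise cosine similarity within a category and between two categories. The sketch below does this on toy 2-D vectors so it runs without the model; `mean_similarity` and `toy_embeddings` are hypothetical names introduced here, and the toy values stand in for the 384-dimensional rows of `embeddings[category]`.

```python
import math
from itertools import combinations, product


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(group_a, group_b=None):
    """Mean cosine similarity within one group, or between two groups."""
    if group_b is None:
        pairs = list(combinations(group_a, 2))  # every unordered pair in the group
    else:
        pairs = list(product(group_a, group_b))  # every cross-group pair
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy stand-ins for embeddings[category]; the values are made up.
toy_embeddings = {
    "Science": [[1.0, 0.0], [2.0, 0.0]],
    "Sports": [[0.0, 1.0], [0.0, 3.0]],
}

within = mean_similarity(toy_embeddings["Science"])
across = mean_similarity(toy_embeddings["Science"], toy_embeddings["Sports"])
print(f"within Science: {within:.2f}, Science vs Sports: {across:.2f}")
# within Science: 1.00, Science vs Sports: 0.00
```

Passing `embeddings["Science"]` and `embeddings["Sports"]` from the cell above should work the same way, since the helpers only iterate over rows. If the embeddings separate topics well, the within-category mean would be expected to exceed the cross-category mean.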
In [24]:
# Exercise: Query-based Similarity Search
# We embed a query sentence and rank ALL corpus sentences by similarity to it.
# scipy's cosine() returns cosine DISTANCE (0 = same direction, 1 = orthogonal, 2 = opposite).
# We convert to cosine SIMILARITY: similarity = 1 - distance (range: -1 to 1)
query = "I enjoy learning about space and atoms"
query_embedding = model.encode(query)
print("Query embedding (first 10 dimensions):", query_embedding[:10])
print(f"Query: \"{query}\"\n")
print(f"{'Sentence':<75} {'Category':<10} {'Similarity':>10}")
print("-" * 100)

# Collect (sentence, category, similarity) for every corpus sentence
results = []
for category, docs in corpus.items():
    for i, doc in enumerate(docs):
        doc_embedding = embeddings[category][i]  # row i matches docs[i]
        similarity = 1 - cosine(query_embedding, doc_embedding)
        results.append((doc, category, similarity))

# Sort by similarity descending so the most relevant sentence appears first
results.sort(key=lambda x: x[2], reverse=True)
for sentence, category, sim in results:
    print(f"{sentence:<75} {category:<10} {sim:>10.4f}")

print("\nTop match:", results[0][0])
print("Why? The query mentions 'space and atoms' — closest to Science sentences.")
Query embedding (first 10 dimensions): [ 0.01144386 -0.09826361 -0.00242209 0.08385342 0.01051387 -0.00424104 0.06950426 0.03673395 0.0961988 0.06115947]
Query: "I enjoy learning about space and atoms"

Sentence                                                                    Category   Similarity
----------------------------------------------------------------------------------------------------
Quantum mechanics describes the behavior of matter at atomic scales         Science        0.4030
DNA contains the genetic instructions for all living organisms              Science        0.1428
The photosynthesis process converts light energy into chemical energy       Science        0.1199
Sushi is a Japanese dish made with rice and raw fish                        Food           0.0889
Pizza is made with dough, sauce, cheese, and various toppings               Food           0.0809
Pasta dishes are popular in Italian cuisine                                 Food           0.0808
Tennis matches involve hitting a ball across a net                          Sports         0.0101
Basketball players dribble and shoot to score points                        Sports        -0.0151
Football teams compete to score touchdowns in the end zone                  Sports        -0.0396

Top match: Quantum mechanics describes the behavior of matter at atomic scales
Why? The query mentions 'space and atoms' — closest to Science sentences.
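The distance-to-similarity conversion used in this exercise can be checked by hand. The sketch below reimplements cosine distance in plain Python following the definition scipy documents for `scipy.spatial.distance.cosine` (1 minus the normalized dot product); the helper names and the 2-D test vectors are assumptions chosen so the arithmetic is exact.

```python
import math


def cosine_similarity(u, v):
    """Normalized dot product, in the range -1 to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - similarity, in the range 0 to 2."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_distance([3.0, 4.0], [6.0, 8.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # reversed direction
print(same_direction, orthogonal, opposite)  # 0.0 1.0 2.0

# A similarity of -0.0396 (the Football row in the output) therefore
# corresponds to a cosine distance of 1.0396: slightly past orthogonal.
```

Note that the length of a vector does not matter here: `[3.0, 4.0]` and `[6.0, 8.0]` differ in scale but point the same way, so their distance is 0.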