Lab 2: Text Embeddings & Semantic Similarity

Level: Foundations | Duration: 1.5 hours

Objective

Understand how modern transformer models convert text to meaningful vectors and explore the embedding space.

What You'll Learn

Load pretrained sentence transformer models
Generate embeddings for diverse texts
Explore embedding properties (dimensions, scale, distribution)
Compare similarity across different topics
Visualize high-dimensional embeddings with t-SNE
Understand limitations of embeddings

Key Concepts

Sentence-Transformers: Specialized transformers for sentence embeddings
Semantic Similarity: How embeddings capture meaning
Dimensionality Reduction: Visualizing 384D vectors in 2D
Token vs Sentence Embeddings: Differences and use cases
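
Semantic similarity between two embeddings is usually measured with cosine similarity: the dot product of the two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python, not the library's implementation. The function names and the toy 3D vectors are made up for illustration; real sentence embeddings from this lab have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (range: -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance as 1 - similarity (range: 0 to 2)."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical toy vectors, chosen so the results are easy to check by hand
v_base = [1.0, 2.0, 0.0]
v_scaled = [2.0, 4.0, 0.0]       # same direction, twice the length
v_opposite = [-1.0, -2.0, 0.0]   # exactly the opposite direction
v_orthogonal = [0.0, 0.0, 3.0]   # shares no component with v_base

for name, vec in [("scaled", v_scaled), ("opposite", v_opposite), ("orthogonal", v_orthogonal)]:
    print(f"{name:<12} similarity={cosine_similarity(v_base, vec):+.4f}")
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another vector. That is why cosine similarity is a common default for comparing embeddings whose magnitudes are not meaningful.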

In [21]:

from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample corpus for exploration
corpus = {
    "Science": [ # 3 sentences each
        "The photosynthesis process converts light energy into chemical energy",
        "Quantum mechanics describes the behavior of matter at atomic scales",
        "DNA contains the genetic instructions for all living organisms"
    ],
    "Sports": [
        "Football teams compete to score touchdowns in the end zone",
        "Basketball players dribble and shoot to score points",
        "Tennis matches involve hitting a ball across a net"
    ],
    "Food": [
        "Pizza is made with dough, sauce, cheese, and various toppings",
        "Sushi is a Japanese dish made with rice and raw fish",
        "Pasta dishes are popular in Italian cuisine"
    ]
}

# Generate embeddings
embeddings = {}
texts = []
labels = []

# Create embeddings for each sentence under each category
for category, docs in corpus.items():  # corpus maps each category to a list of sentences
    embeddings[category] = model.encode(docs, show_progress_bar=False)
    for doc in docs:
        texts.append(doc)
        labels.append(category)


print(f"Generated {len(texts)} embeddings")
print(f"Each embedding has {next(iter(embeddings.values())).shape[1]} dimensions\n")
print("-" * 60)
print(f"labels has size {len(labels)} and Contains :\n ")
print(labels)
print(f"texts has size {len(texts)} and Contains :\n ")
print(texts)
print("-" * 60)
print("Sample Embedding for a Science Document:")
print(embeddings["Science"][0][:10])  # Print first 10 dimensions of the first Science embedding
print("-" * 60)

Generated 9 embeddings
Each embedding has 384 dimensions

------------------------------------------------------------
labels has size 9 and Contains :
 
['Science', 'Science', 'Science', 'Sports', 'Sports', 'Sports', 'Food', 'Food', 'Food']
texts has size 9 and Contains :
 
['The photosynthesis process converts light energy into chemical energy', 'Quantum mechanics describes the behavior of matter at atomic scales', 'DNA contains the genetic instructions for all living organisms', 'Football teams compete to score touchdowns in the end zone', 'Basketball players dribble and shoot to score points', 'Tennis matches involve hitting a ball across a net', 'Pizza is made with dough, sauce, cheese, and various toppings', 'Sushi is a Japanese dish made with rice and raw fish', 'Pasta dishes are popular in Italian cuisine']
------------------------------------------------------------
Sample Embedding for a Science Document:
[-0.01501783  0.07587931 -0.07658472  0.06114178  0.00673075 -0.02079478
  0.02115133  0.02275151  0.04248194  0.08286117]
------------------------------------------------------------

In [24]:

# Exercise: Query-based Similarity Search
# We embed a query sentence and rank ALL corpus sentences by similarity to it.
# scipy's cosine() returns cosine DISTANCE (0 = same direction, 1 = orthogonal, 2 = opposite).
# We convert to cosine SIMILARITY: similarity = 1 - distance  (range: -1 to 1)

query = "I enjoy learning about space and atoms"
query_embedding = model.encode(query)
print("Query embedding (first 10 dimensions):", query_embedding[:10])
print(f"Query: \"{query}\"\n")
print(f"{'Sentence':<75} {'Category':<10} {'Similarity':>10}")
print("-" * 100)

# Collect (sentence, category, similarity) for every corpus sentence
results = []
for category, docs in corpus.items():
    for i, doc in enumerate(docs):
        doc_embedding = embeddings[category][i]
        similarity = 1 - cosine(query_embedding, doc_embedding)
        results.append((doc, category, similarity))

# Sort by similarity descending so the most relevant sentence appears first
results.sort(key=lambda x: x[2], reverse=True)

for sentence, category, sim in results:
    print(f"{sentence:<75} {category:<10} {sim:>10.4f}")

print("\nTop match:", results[0][0])
print("Why? The query mentions 'space and atoms' — closest to Science sentences.")


Query embedding (first 10 dimensions): [ 0.01144386 -0.09826361 -0.00242209  0.08385342  0.01051387 -0.00424104
  0.06950426  0.03673395  0.0961988   0.06115947]
Query: "I enjoy learning about space and atoms"

Sentence                                                                    Category   Similarity
----------------------------------------------------------------------------------------------------
Quantum mechanics describes the behavior of matter at atomic scales         Science        0.4030
DNA contains the genetic instructions for all living organisms              Science        0.1428
The photosynthesis process converts light energy into chemical energy       Science        0.1199
Sushi is a Japanese dish made with rice and raw fish                        Food           0.0889
Pizza is made with dough, sauce, cheese, and various toppings               Food           0.0809
Pasta dishes are popular in Italian cuisine                                 Food           0.0808
Tennis matches involve hitting a ball across a net                          Sports         0.0101
Basketball players dribble and shoot to score points                        Sports        -0.0151
Football teams compete to score touchdowns in the end zone                  Sports        -0.0396

Top match: Quantum mechanics describes the behavior of matter at atomic scales
Why? The query mentions 'space and atoms' — closest to Science sentences.
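
The ranking can also be summarized per category, to check whether the query is closer to one topic as a whole and not just to a single sentence. The sketch below copies the similarity scores from the output table into a plain list, so it runs without the model. The variable names are illustrative, and averaging per category is one possible summary, not part of the original exercise.

```python
from collections import defaultdict
from statistics import mean

# (category, similarity) pairs copied from the printed results table
scores = [
    ("Science", 0.4030), ("Science", 0.1428), ("Science", 0.1199),
    ("Food", 0.0889), ("Food", 0.0809), ("Food", 0.0808),
    ("Sports", 0.0101), ("Sports", -0.0151), ("Sports", -0.0396),
]

# Group the similarity values by category
by_category = defaultdict(list)
for category, sim in scores:
    by_category[category].append(sim)

# Average similarity of the query to each category, highest first
category_means = {category: mean(sims) for category, sims in by_category.items()}
ranked = sorted(category_means.items(), key=lambda item: item[1], reverse=True)

for category, avg in ranked:
    print(f"{category:<10} {avg:>8.4f}")
```

Science has the highest average, but most of that comes from the single quantum mechanics sentence. The other two Science sentences score only slightly above the Food sentences, which is a reminder that absolute similarity values from this model can be low even for on-topic text.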