04 · Word Embeddings

Level: Intermediate
Pre-reading: 03.02 · Transfer Learning


What is an Embedding?

An embedding is a dense numerical vector representing the semantic meaning of text or data.

```mermaid
graph LR
    A["Word: 'king'"] --> B["Embedding<br/>[-0.2, 0.5, -0.1, ...]<br/>300 dimensions"]
```

Key insight: Similar meaning → similar vectors.


Why Embeddings Matter

Embeddings capture semantic relationships:

  • "king" - "man" + "woman" ≈ "queen" (famous example!)
  • Words with similar meanings have similar vectors
  • Enables semantic search, similarity, clustering

Without embeddings:

  • Can't easily compare words
  • Can't find semantic similarity
  • No way to measure "distance" in meaning


Vector Similarity

How do we know two embeddings are similar?

\[\text{cosine similarity} = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \quad \text{(ranges from -1 to 1)}\]
  • 1 = identical direction (very similar)
  • 0 = orthogonal (unrelated)
  • -1 = opposite direction (opposite meaning)
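
A minimal sketch of this computation with NumPy; the vectors below are made-up toy values rather than real embeddings:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: u·v / (|u| |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings" (real ones have 100-300+ dimensions).
king  = np.array([0.8, 0.3, -0.1, 0.5])
queen = np.array([0.7, 0.4, -0.2, 0.6])
apple = np.array([-0.1, 0.9, 0.6, 0.2])

print(cosine_similarity(king, queen))  # ~0.98: very similar direction
print(cosine_similarity(king, apple))  # ~0.21: mostly unrelated
```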

Word2Vec

One of the first popular embedding methods. Two variants:

Skip-gram

Predict context words from target word:

Sentence: "The quick brown fox"
Input: "brown"
Targets: "quick", "fox"
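
A small sketch of how such training pairs can be generated; the window size of 1 is an illustrative choice:

```python
def skipgram_pairs(tokens, window=1):
    """Yield (target, context) pairs: each word predicts its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("the quick brown fox".split()))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```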

CBOW (Continuous Bag of Words)

Predict target word from context:

Input: "quick", "fox"
Target: "brown"

Both variants learn embeddings through unsupervised (self-supervised) training on a raw text corpus; no labels are needed beyond the text itself.
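
A minimal training sketch using the gensim library (Word2Vec class, gensim 4.x API); the three-sentence corpus is purely illustrative:

```python
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens; a real corpus has millions of them.
corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "brown", "dog"],
    ["a", "quick", "red", "fox"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["fox"].shape)         # (50,) — a dense vector
print(model.wv.most_similar("fox"))  # nearest neighbours by cosine similarity
```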


GloVe (Global Vectors)

Combines global matrix factorization with local context windows:

\[J = \sum_{i,j} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2\]

where X_ij counts how often word j appears in the context of word i (a global co-occurrence matrix), the tilde marks the separate context vector and bias for word j, and f is a weighting function that damps the influence of very frequent pairs.

Result: High-quality embeddings capturing both global co-occurrence statistics and local context.
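
A toy sketch of a single term of this objective, using the weighting function from the GloVe paper (x_max = 100 and alpha = 0.75 are the paper's defaults); the vectors and co-occurrence count below are made up:

```python
import numpy as np

def weight(x, x_max=100, alpha=0.75):
    """Damps rare co-occurrences and caps the weight of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """Weighted squared error between the model score and log(count)."""
    return weight(x_ij) * (np.dot(w_i, w_j_tilde) + b_i + b_j_tilde - np.log(x_ij)) ** 2

w_i, w_j_tilde = np.array([0.1, 0.4]), np.array([0.3, 0.2])
print(glove_term(w_i, w_j_tilde, b_i=0.0, b_j_tilde=0.0, x_ij=25))
```

The full objective sums this term over every non-zero cell of the co-occurrence matrix and minimises it by adjusting the vectors and biases.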


FastText

Word2Vec + subword information:

  • Represents each word as sum of subword vectors
  • Handles morphology and misspellings better
  • Useful for inflected languages
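
A sketch of the subword idea: split a word into character n-grams with boundary markers, as FastText does; fixing the n-gram length to 3 is a simplification (FastText typically uses n = 3 to 6):

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word, with '<' and '>' marking its boundaries."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

The word's vector is then the sum of the vectors of its n-grams (plus one for the whole word), which is why rare, misspelled, or inflected forms still get reasonable embeddings.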

Modern Embeddings: Contextual

Older embeddings (Word2Vec, GloVe): Static — the same embedding for a word regardless of context.

Modern embeddings (BERT, GPT): Contextual — embedding changes based on surrounding words.

Example:

  • "bank" in "river bank" vs "savings bank" → different embeddings
  • Captures nuance that static embeddings miss
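
A sketch of extracting contextual embeddings with the Hugging Face transformers library; the choice of bert-base-uncased and the token lookup are illustrative (requires transformers and torch):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("I sat on the river bank.")
money = bank_vector("I opened an account at the savings bank.")

# A static embedding would give identical vectors for both; here they differ.
print(torch.cosine_similarity(river, money, dim=0))
```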


Why are embeddings dense rather than one-hot encoded?

One-hot: huge and sparse (a 1,000-word vocabulary needs 1,000-dimensional vectors), and every pair of vectors is orthogonal, so no similarity can be read off. Dense: compact (typically 100-300 dimensions) and learned so that semantically related words end up close together.
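
A quick illustration with toy numbers:

```python
import numpy as np

vocab = ["king", "queen", "apple"]

# One-hot: one dimension per vocabulary word; every pair is orthogonal.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["king"] @ one_hot["queen"])  # 0.0 — no notion of similarity

# Dense: few dimensions, learned so related words end up close together.
dense = {"king": np.array([0.8, 0.3]), "queen": np.array([0.7, 0.4])}
print(dense["king"] @ dense["queen"])      # 0.68 — similarity is measurable
```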

How do I use embeddings for downstream tasks?

Use them as input features to a classification or regression model, or compute similarity between embeddings for search, recommendation, and clustering.
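
A sketch of the similarity use case: rank documents against a query by cosine similarity. The random vectors stand in for vectors produced by a real embedding model:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def search(query_vec, doc_vecs, top_k=3):
    """Return document indices ranked by similarity to the query embedding."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:top_k]

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10, 300))   # pretend: one embedding per document
query_vec = rng.normal(size=300)        # pretend: the embedded query
print(search(query_vec, doc_vecs))
```

For classification, the same vectors can simply be passed as feature columns to any standard model (e.g. logistic regression).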