04.01 · Word Embedding Models — Deep Dive

Level: Intermediate
Pre-reading: 04 · Word Embeddings


Word2Vec in Detail

Word2Vec learns embeddings by predicting a word's context from the word itself (skip-gram) or predicting the word from its context (CBOW).

Skip-gram Training

Training example:
Sentence: "The quick brown fox jumps"
Window size: 2

Center: "quick" → Context: "The", "brown"
Center: "brown" → Context: "quick", "fox"
Center: "fox"   → Context: "brown", "jumps"

The model learns to maximize the similarity (dot product) between each center word's vector and the vectors of its context words; in practice this is usually trained with negative sampling, which also pushes the center vector away from randomly sampled non-context words.
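
Generating the (center, context) training pairs is straightforward. A minimal sketch, using the toy sentence and window size from the example above:

def skipgram_pairs(tokens, window=2):
    # Every (center, context) pair within the window becomes one training example
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("The quick brown fox jumps".split(), window=2))
# [('The', 'quick'), ('The', 'brown'), ('quick', 'The'), ('quick', 'brown'), ...]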


GloVe vs Word2Vec

             Word2Vec                           GloVe
Method       Prediction task (skip-gram/CBOW)   Matrix factorization + prediction
Computation  Fast, uses only local windows      Uses global co-occurrence matrix
Quality      Good                               Often better on analogy tasks
Speed        Fast                               Slower (matrix factorization)
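
To make the "global co-occurrence matrix" concrete, here is a hedged sketch of the counts GloVe factorizes (toy whitespace tokenization; real GloVe also weights counts by distance and fits them with weighted least squares):

from collections import Counter

def cooccurrence_counts(sentences, window=2):
    # Count how often each (word, context word) pair appears within the window,
    # accumulated over the whole corpus rather than one local window at a time
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts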

FastText: Subword Embeddings

Instead of embedding words, embed character n-grams:

Word: "running"
Character 3-grams: "run", "unn", "nni", "nin", "ing"

Embedding of a word = sum (or average) of its character n-gram embeddings; FastText also adds boundary markers (e.g. "<ru", "ng>") and a vector for the whole word when it is in the vocabulary.
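
A minimal sketch of the subword idea (the ngram_vectors lookup table here is hypothetical; a real FastText model learns these vectors during training):

import numpy as np

def char_ngrams(word, n=3):
    # All character n-grams of the bare word (real FastText also uses
    # boundary markers and n-gram lengths 3-6)
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def subword_embedding(word, ngram_vectors):
    # Unseen words still get a vector, as long as their n-grams were seen in training
    return np.mean([ngram_vectors[g] for g in char_ngrams(word)], axis=0)

print(char_ngrams("running"))   # ['run', 'unn', 'nni', 'nin', 'ing']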

Benefits:
- Handles OOV (out-of-vocabulary) words
- Better for morphologically rich languages
- Robust to misspellings


Contextual Embeddings

Modern approach: the embedding of a word depends on its surrounding context, so the same word gets different vectors in different sentences (e.g. "bank" in "river bank" vs "bank account").

BERT

Bidirectional Encoder Representations from Transformers

Uses a Transformer encoder trained with masked language modeling:
- Mask random words
- Predict them from context
- Learn contextual embeddings
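
A toy sketch of how masked-language-model training examples are built (whitespace tokenization and the 15% mask rate are illustrative; real BERT masks subword tokens and applies a few extra corruption rules):

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)   # hide the word...
            targets.append(tok)         # ...and ask the model to predict it from context
        else:
            masked.append(tok)
            targets.append(None)        # unmasked positions are not scored
    return masked, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))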

ELMo

Stacks bidirectional LSTMs; the contextual embedding for each token combines the outputs of all layers (typically a learned, task-specific weighted sum).


Using Embeddings in Practice

For Semantic Search

import numpy as np

query_embedding = model.embed("best pizza in NYC")   # model.embed assumed to return a 1-D vector
doc_embeddings = [model.embed(doc) for doc in documents]
similarities = [np.dot(query_embedding, d) / (np.linalg.norm(query_embedding) * np.linalg.norm(d))
                for d in doc_embeddings]              # cosine similarity per document
top_k = np.argsort(similarities)[-k:][::-1]           # indices of the k most similar documents, best first

For Classification

import numpy as np
from sklearn.linear_model import LogisticRegression

# Represent each document as the average of its word embeddings
document_embeddings = [np.mean([model.embed(w) for w in doc.split()], axis=0) for doc in documents]

classifier = LogisticRegression()
classifier.fit(document_embeddings, labels)

For Clustering

from sklearn.cluster import KMeans

embeddings = [model.embed(doc) for doc in documents]
clusters = KMeans(n_clusters=5).fit_predict(embeddings)   # one cluster label per document

Should I use pre-trained or train my own embeddings?

Use pre-trained for most cases. Only train your own if you have very specialized vocabulary or domain.
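
For example, pre-trained static embeddings can be loaded in a couple of lines. A sketch assuming gensim and its downloader are installed (the model name is one of several available options):

import gensim.downloader as api

# Downloads on first use; returns a KeyedVectors object of 100-dimensional GloVe vectors
vectors = api.load("glove-wiki-gigaword-100")
print(vectors.most_similar("pizza", topn=5))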

How many dimensions should embeddings have?

Typical for static word embeddings: 100–300; contextual models such as BERT use 768 or more. Larger = more expressive but slower. Smaller = faster but less capable.

How do I compare embeddings from different models?

Carefully! Each model defines its own vector space, so embeddings from different models aren't directly comparable. Compare relative structure instead, e.g. whether the two models agree on a word's nearest neighbors, rather than comparing raw similarity values across models.
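
A hedged sketch of that idea, assuming two gensim-style models that expose most_similar (the model arguments here are placeholders):

def neighbor_overlap(word, model_a, model_b, k=10):
    # Fraction of the word's top-k neighbors that the two models agree on (0.0 to 1.0)
    neighbors_a = {w for w, _ in model_a.most_similar(word, topn=k)}
    neighbors_b = {w for w, _ in model_b.most_similar(word, topn=k)}
    return len(neighbors_a & neighbors_b) / k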