04.01 · Word Embedding Models — Deep Dive
Level: Intermediate
Pre-reading: 04 · Word Embeddings
Word2Vec in Detail
Word2Vec learns embeddings by training a shallow neural network on a prediction task: given a center word, predict its surrounding context words (skip-gram), or given the context, predict the center word (CBOW). The trained weights become the word vectors.
Skip-gram Training
Training example:
Sentence: "The quick brown fox jumps"
Window size: 2
Center: "quick" → Context: "The", "brown"
Center: "brown" → Context: "quick", "fox"
Center: "fox" → Context: "brown", "jumps"
The model learns to maximize the probability of each observed (center, context) pair, which increases the similarity between a word's vector and the vectors of the words it appears near (in practice via negative sampling or hierarchical softmax rather than a full softmax).
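To make the windowing concrete, here is a minimal sketch in plain Python (the function name and window size are illustrative choices, not part of any library) that generates the (center, context) pairs skip-gram is trained on:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: pair each word with every word
    at most `window` positions to its left or right."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The quick brown fox jumps".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
```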
GloVe vs Word2Vec
| | Word2Vec | GloVe |
|---|---|---|
| Method | Prediction task (skip-gram/CBOW) | Matrix factorization + prediction |
| Computation | Fast, uses only local windows | Uses global co-occurrence matrix |
| Quality | Good | Often better on analogy tasks |
| Speed | Fast | Slower (must first build the global co-occurrence matrix) |
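For reference, GloVe's training objective is a weighted least-squares fit of word-vector dot products to log co-occurrence counts. In the notation of the original paper, where X_ij is how often words i and j co-occur and f is a weighting function that caps the influence of very frequent pairs:

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(\mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2
$$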
FastText: Subword Embeddings
Instead of embedding words, embed character n-grams:
Word: "running"
Character 3-grams: "run", "unn", "nni", "nin", "ing"
Embedding = average of all n-gram embeddings
Benefits:
- Handles OOV (out-of-vocabulary) words
- Better for morphologically rich languages
- Robust to misspellings
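A minimal sketch of the idea (not the actual FastText implementation): extract boundary-marked character n-grams, then build a word vector by averaging its n-gram vectors, so an out-of-vocabulary word or misspelling still gets a usable embedding from the n-grams it shares with known words. The n-gram table here is a hypothetical stand-in for trained parameters.

```python
import numpy as np

DIM = 50
rng = np.random.default_rng(0)
# Hypothetical n-gram embedding table; random vectors stand in for trained parameters.
ngram_vectors = {}

def char_ngrams(word, n=3):
    """Boundary-marked character n-grams, e.g. 'run' -> ['<ru', 'run', 'un>']."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word):
    """Average the vectors of a word's character n-grams.
    Works even for words never seen as whole tokens during training."""
    vecs = [ngram_vectors.setdefault(g, rng.normal(size=DIM)) for g in char_ngrams(word)]
    return np.mean(vecs, axis=0)

# A misspelling shares most of its n-grams with the correct word,
# so the two vectors end up close together.
v1, v2 = word_vector("running"), word_vector("runing")
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```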
Contextual Embeddings
Modern approach: embeddings depend on context.
BERT
Bidirectional Encoder Representations from Transformers
Uses a Transformer encoder with masked language modeling:
- Mask random words in the input
- Predict them from the surrounding context
- The resulting representations are contextual embeddings
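A quick way to see masked-word prediction in action is the fill-mask pipeline from the Hugging Face transformers library (the library and the bert-base-uncased checkpoint are illustrative choices, not requirements of BERT itself):

```python
from transformers import pipeline  # assumes the `transformers` package is installed

# BERT was pre-trained to fill in masked tokens using bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The quick brown [MASK] jumps over the lazy dog."):
    print(prediction["token_str"], round(prediction["score"], 3))
```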
ELMo
ELMo (Embeddings from Language Models) stacks bidirectional LSTMs trained as a language model; it concatenates the forward and backward states at each layer and combines the layers with a learned weighted sum to produce a contextual embedding.
Using Embeddings in Practice
For Semantic Search
```python
query_embedding = model.embed("best pizza in NYC")
doc_embeddings = [model.embed(doc) for doc in documents]
similarities = [cosine_similarity(query_embedding, doc) for doc in doc_embeddings]
top_k = argsort(similarities)[-k:][::-1]  # indices of the k most similar documents, best first
```
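A runnable version of the same idea, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (both are illustrative choices, not the only option):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Joe's Pizza has the best slices in New York City.",
    "The subway map of NYC was redesigned in 1972.",
    "Neapolitan pizza uses a wood-fired oven.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("best pizza in NYC", normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
similarities = doc_embeddings @ query_embedding
for idx in np.argsort(similarities)[::-1][:2]:
    print(f"{similarities[idx]:.3f}  {documents[idx]}")
```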
For Classification
```python
# Represent each document as the average of its word-embedding vectors
document_embeddings = [np.mean([model.embed(w) for w in doc.split()], axis=0) for doc in documents]

classifier = LogisticRegression()
classifier.fit(document_embeddings, labels)
```
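A runnable sketch of the averaging approach, assuming gensim's downloadable GloVe vectors (glove-wiki-gigaword-50) and a toy two-class dataset; the model choice and the data are purely illustrative:

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

wv = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

def doc_embedding(text):
    """Average the embeddings of the in-vocabulary words in the document."""
    vectors = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

documents = ["great pizza and friendly staff", "terrible service and cold food",
             "delicious pasta would visit again", "awful experience never coming back"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

classifier = LogisticRegression()
classifier.fit([doc_embedding(d) for d in documents], labels)
print(classifier.predict([doc_embedding("the food was great")]))
```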
For Clustering
```python
from sklearn.cluster import KMeans

embeddings = [model.embed(doc) for doc in documents]
clusters = KMeans(n_clusters=5).fit_predict(embeddings)  # one cluster label per document
```
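A runnable version, again assuming sentence-transformers for the document embeddings (any embedding model would work here):

```python
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = ["pizza dough recipe", "best pizza toppings", "NYC subway delays",
             "MTA fare increase", "wood-fired oven tips", "weekend train schedule"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)

groups = defaultdict(list)
for doc, label in zip(documents, labels):
    groups[label].append(doc)
for label, docs in groups.items():
    print(label, docs)
```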
Should I use pre-trained or train my own embeddings?
Use pre-trained embeddings for most cases. Train your own only if your domain has highly specialized vocabulary (e.g., biomedical or legal text) that general-purpose embeddings cover poorly.
How many dimensions should embeddings have?
Typical for static word embeddings: 100–300 (contextual models are larger, e.g., 768 for BERT-base). Larger vectors are more expressive but slower and use more memory; smaller vectors are faster but capture less.
How do I compare embeddings from different models?
Carefully! Embeddings from different models live in different vector spaces, so their coordinates and raw similarity scores aren't directly comparable. Compare models through relative structure instead, for example whether they agree on a word's nearest neighbors, rather than through absolute similarity values.
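One simple sketch of such a relative comparison: take a word, get its top-k nearest neighbors under each model, and measure the overlap of the two neighbor sets. The model variables here are placeholders for any two gensim KeyedVectors you have loaded.

```python
def neighbor_overlap(word, wv_a, wv_b, k=10):
    """Jaccard overlap of the top-k nearest-neighbor sets of `word`
    under two gensim KeyedVectors models (wv_a and wv_b are placeholders)."""
    neighbors_a = {w for w, _ in wv_a.most_similar(word, topn=k)}
    neighbors_b = {w for w, _ in wv_b.most_similar(word, topn=k)}
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

# Example (assuming wv_a and wv_b were loaded, e.g. via gensim.downloader):
# print(neighbor_overlap("pizza", wv_a, wv_b))
```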