Experiment 2: Top-p (Nucleus Sampling) Sweep¶

The Question¶

What does top-p actually do? How does it differ from temperature?

Temperature changes all probabilities uniformly (stretches the distribution)
Top-p does something different: it filters out low-probability tokens entirely
But the cutoff is dynamic based on certainty—how?

What You'll Do¶

Keep temperature fixed and vary top_p, watching how many tokens "survive" the nucleus filter.

The Setup¶

Same fake logits as Experiment 1:

logits = np.array([2.2, 1.8, 1.4, 0.9, 0.2, 0.1, -0.3, -0.6, -0.8, -1.0], dtype=float)

Your Experiment: Isolate Top-p¶

Fix temperature=1.0 and run with these top_p values:

Run	Temperature	Top-p	What to observe
1	1.0	0.2	Very tight nucleus; maybe only 1-2 tokens
2	1.0	0.5	~50% cumulative probability; ~3-4 tokens
3	1.0	0.8	~80% of mass; ~6-7 tokens
4	1.0	0.95	Nearly all tokens, just removing tail
5	1.0	1.0	No filtering; no tokens removed

How to Run¶

Replace the settings:

settings = [
    dict(name='Top-p=0.2 (very tight)',      temperature=1.0, top_k=0, top_p=0.2),
    dict(name='Top-p=0.5 (medium)',          temperature=1.0, top_k=0, top_p=0.5),
    dict(name='Top-p=0.8 (loose)',           temperature=1.0, top_k=0, top_p=0.8),
    dict(name='Top-p=0.95 (nearly all)',     temperature=1.0, top_k=0, top_p=0.95),
    dict(name='Top-p=1.0 (no filter)',       temperature=1.0, top_k=0, top_p=1.0),
]

Run:

python3 experiments/local/sampling_simulator.py

What to Watch For¶

As you decrease top_p from 1.0 → 0.2, observe:

How many tokens have non-zero probability?
At p=0.2, maybe only 1-2 tokens survive
At p=1.0, all 10 tokens have a chance

Why? Top-p filters by cumulative probability. Lower p = stricter filter = fewer tokens.

Entropy trend: Does it decrease as p decreases?
Yes! Fewer candidates = lower entropy = more predictable
Do the probabilities of surviving tokens change?
Yes. When a token survives the filter, it gets renormalized (rescaled up so the nucleus totals 100%)

Example: - Before filter: tokens have probabilities [0.30, 0.25, 0.20, 0.15, ...] - After top-p=0.75: only first 3 survive; they renormalize to [0.40, 0.33, 0.27]

Example output:

=== Top-p=0.2 (very tight) ===
Entropy: 0.721
approve       0.667
reject        0.333
review        0.000
escalate      0.000
...

=== Top-p=0.8 (loose) ===
Entropy: 2.234
approve       0.240
reject        0.198
review        0.167
escalate      0.145
delay         0.102
audit         0.078
optimize      0.050
notify        0.015
...

The Insight¶

Top-p selects a dynamic nucleus of tokens.

Unlike top-k (which always keeps exactly k tokens regardless of confidence), top-p adjusts based on the model's certainty:

When the model is confident (one token much higher prob): nucleus is small
When the model is uncertain (many tokens similar prob): nucleus is larger

This is often called "adaptive" or "self-adjusting" filtering.

Math Behind It¶

Compute softmax probabilities from logits
Sort tokens by probability (descending)
Accumulate probabilities: token 1 = 0.30, token 1+2 = 0.55, token 1+2+3 = 0.75, ...
Keep tokens until cumulative sum ≥ p
Renormalize: divide all survivors by their sum (so they sum to 1)
Discard the rest (set their probability to 0)

Real-World Implication¶

Safety-critical tasks (medical diagnosis, financial advice): Use top-p=0.5 or lower (strict nucleus)
Creative tasks (ideation, content generation): Use top-p=0.9+ (loose nucleus)
Balanced tasks: top-p=0.75 to 0.85 is a common default

Key Difference from Temperature¶

Aspect	Temperature	Top-p
Mechanism	Scales all logits uniformly	Filters + renormalizes
Effect	Makes all tokens more/less likely proportionally	Creates a candidate set
Dynamic?	No—fixed for a run	Yes—adapts to confidence
Order	Preserves ranking (most likely stays most likely)	Preserves ranking within nucleus

Next Step¶

You now understand: - Temperature = stretches distribution - Top-p = selects a nucleus

Ready to see how they interact? Move to:

👉 Experiment 3: Top-k Sweep →

Optional: Dive Deeper¶

Try varying both temperature and top-p to see the interaction:

settings = [
    dict(name='T=0.5, p=0.5',  temperature=0.5, top_k=0, top_p=0.5),
    dict(name='T=1.0, p=0.5',  temperature=1.0, top_k=0, top_p=0.5),
    dict(name='T=1.5, p=0.5',  temperature=1.5, top_k=0, top_p=0.5),
]

Notice: Temperature affects how confident the distribution is about its nucleus; top-p controls which tokens get a shot.