Local Sampling Simulator Guide¶
What is This?¶
A free, offline simulator that teaches you exactly what LLM sampling parameters do by:
- Starting with fake logits (pretend next-token odds)
- Applying parameter transformations (temperature, filters, penalties)
- Sampling thousands of times to estimate the distribution
- Showing results: entropy, top tokens, survival rate
Key idea: Parameter math is identical in all models. Learn it here, apply it everywhere.
What It Is (and Isn't)¶
โ What It Covers¶
- Temperature scaling (logits รท T)
- Softmax โ probabilities
- Top-k truncation (keep top k ranked tokens)
- Top-p (nucleus) truncation (keep smallest set with cumulative prob โฅ p)
- Min-p filtering (keep prob โฅ min_p ร max_prob)
- Typical-p filtering (keep tokens with "typical" information content)
- Repetition penalties (discourage tokens seen recently)
โ What It Doesn't Do¶
- Run an actual transformer model
- Produce realistic text (uses fake logits)
- Account for vocabulary size (uses 10 tokens for simplicity)
Think of it as: "If a model believed these were the next-token odds, how would parameter X change behavior?"
The Decoding Pipeline (Visual)¶
This is the standard token selection process. Any LLM uses something like this:
Raw logits (e.g., [2.2, 1.8, 1.4, ...])
โ
[Optional: Repetition penalty applied to recent tokens]
โ
[Temperature scaling: logits / T]
โ
[Filtering: top-k, then top-p, then min-p, then typical-p]
โ
[Softmax: convert to probabilities]
โ
[Sampling: pick a token randomly according to probabilities]
โ
[Next token selected]
Most systems apply filters in roughly this order. The order matters!
Key Concepts (Intuition + Math)¶
Temperature¶
Rescales logits before softmax:
- Lower T (< 1) โ sharpens distribution โ deterministic
- Higher T (> 1) โ flattens distribution โ random
Top-k¶
Keeps only the k highest-ranked tokens (by logit), discards rest.
- Fixed-size candidate set
- Always removes lowest-probability tail
- Good for preventing garbage tokens
Top-p (Nucleus Sampling)¶
Keeps the smallest set of tokens whose cumulative probability โฅ p.
- Dynamic-size candidate set (adapts to model confidence)
- Balances exploration and coherence
- Common default: p โ 0.9 or 0.95
Min-p¶
Keeps tokens with probability โฅ min_p ร max_prob_at_this_step.
- Adapts like top-p but using a relative threshold
- Useful in open-source backends
Typical-p¶
Selects tokens with typical information content (close to distribution entropy).
- Different from max-probability focus
- Less common but sometimes better for certain tasks
Repetition Penalty¶
Scales down logits of tokens seen recently (within a window).
- Discourages (not forbids) repetition
- Window size matters (look back 8, 32, 64, 128 tokens?)
- Penalty strength: 1.0=none, 1.1=mild, 1.5=heavy
Installation & Setup¶
Prerequisites¶
- Python 3.8+
numpy(required)matplotlib(optional, for plotting)
Install¶
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements-experiments.txt
Run the Default Experiments¶
python3 experiments/local/sampling_simulator.py
You'll see output like:
=== Very deterministic (T=0.2) ===
Entropy: 0.412
approve 0.967
reject 0.023
review 0.005
escalate 0.002
delay 0.000
...
The Simulator Code¶
Below is the complete, ready-to-run simulator at experiments/local/sampling_simulator.py:
Core Math Functions¶
import numpy as np
def softmax(logits: np.ndarray) -> np.ndarray:
'''Numerically stable softmax.'''
x = logits - np.max(logits)
e = np.exp(x)
return e / np.sum(e)
def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
'''Scale logits by temperature (logits / T).'''
if temperature <= 0:
raise ValueError('temperature must be > 0')
return logits / temperature
def filter_top_k(logits: np.ndarray, k: int) -> np.ndarray:
'''Keep only the top-k logits; set others to -inf.'''
if k is None or k <= 0 or k >= logits.size:
return logits
idx = np.argpartition(logits, -k)[-k:]
mask = np.full_like(logits, fill_value=False, dtype=bool)
mask[idx] = True
out = logits.copy()
out[~mask] = -np.inf
return out
def filter_top_p(logits: np.ndarray, p: float) -> np.ndarray:
'''Nucleus filtering: keep smallest set with cumulative prob >= p.'''
if p is None or p >= 1.0:
return logits
if p <= 0:
raise ValueError('top_p must be in (0,1]')
probs = softmax(logits)
order = np.argsort(probs)[::-1]
cum = np.cumsum(probs[order])
keep = order[cum <= p]
if keep.size == 0:
keep = order[:1]
mask = np.zeros_like(probs, dtype=bool)
mask[keep] = True
out = logits.copy()
out[~mask] = -np.inf
return out
def filter_min_p(logits: np.ndarray, min_p: float) -> np.ndarray:
'''Min-p filtering: keep tokens with prob >= min_p * max_prob.'''
if min_p is None or min_p <= 0:
return logits
if min_p > 1:
raise ValueError('min_p must be in (0,1]')
probs = softmax(logits)
thr = min_p * np.max(probs)
mask = probs >= thr
out = logits.copy()
out[~mask] = -np.inf
return out
def filter_typical_p(logits: np.ndarray, typical_p: float) -> np.ndarray:
'''Typical-p filtering.
Keeps tokens with information content close to distribution entropy.
'''
if typical_p is None or typical_p >= 1.0:
return logits
if typical_p <= 0:
raise ValueError('typical_p must be in (0,1]')
probs = softmax(logits)
surprise = -np.log(np.clip(probs, 1e-30, 1.0))
H = float(np.sum(probs * surprise))
order = np.argsort(np.abs(surprise - H))
cum = np.cumsum(probs[order])
keep = order[cum <= typical_p]
if keep.size == 0:
keep = order[:1]
mask = np.zeros_like(probs, dtype=bool)
mask[keep] = True
out = logits.copy()
out[~mask] = -np.inf
return out
def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int], penalty: float) -> np.ndarray:
'''Simple repetition penalty: downscale logits for already-seen tokens.'''
if penalty is None or penalty == 1.0:
return logits
out = logits.copy()
for tid in set(generated_ids):
if out[tid] > 0:
out[tid] /= penalty
else:
out[tid] *= penalty
return out
Sampling & Metrics¶
def sample_next(logits: np.ndarray, rng: np.random.Generator) -> int:
probs = softmax(logits)
return int(rng.choice(len(probs), p=probs))
def entropy(probs: np.ndarray) -> float:
p = np.clip(probs, 1e-30, 1.0)
return float(-np.sum(p * np.log(p)))
def run_experiment(
token_names,
base_logits,
n_samples=5000,
seed=123,
temperature=1.0,
top_k=0,
top_p=1.0,
min_p=None,
typical_p=1.0,
repetition_penalty=1.0,
sequence_len=0,
):
'''Run parameter sweep experiment.'''
rng = np.random.default_rng(seed)
counts = np.zeros(len(base_logits), dtype=int)
generated = []
for _ in range(n_samples):
logits = base_logits.copy()
# Optional: penalize tokens that appear in the recent window
if sequence_len > 0:
logits = apply_repetition_penalty(logits, generated[-sequence_len:], repetition_penalty)
# Temperature scaling
logits = apply_temperature(logits, temperature)
# Filters (order matters)
logits = filter_top_k(logits, top_k)
logits = filter_top_p(logits, top_p)
logits = filter_min_p(logits, min_p)
logits = filter_typical_p(logits, typical_p)
tid = sample_next(logits, rng)
counts[tid] += 1
generated.append(tid)
probs_emp = counts / counts.sum()
return {
'counts': counts,
'probs_emp': probs_emp,
'entropy': entropy(probs_emp),
'top_tokens': sorted(
[(token_names[i], probs_emp[i]) for i in range(len(token_names))],
key=lambda x: -x[1]
)[:10],
}
Main Demo¶
if __name__ == '__main__':
tokens = ['approve', 'reject', 'review', 'escalate', 'delay', 'audit', 'optimize', 'notify', 'assign', 'close']
# Example base logits (hand-made): first few tokens more likely
logits = np.array([2.2, 1.8, 1.4, 0.9, 0.2, 0.1, -0.3, -0.6, -0.8, -1.0], dtype=float)
settings = [
dict(name='Very deterministic (T=0.2)', temperature=0.2, top_k=0, top_p=1.0),
dict(name='Balanced (T=0.8)', temperature=0.8, top_k=0, top_p=0.95),
dict(name='More creative (T=1.3)', temperature=1.3, top_k=50, top_p=1.0),
dict(name='Tighter nucleus (T=1.0, p=0.6)', temperature=1.0, top_k=0, top_p=0.6),
dict(name='Min-p filter (p=0.1)', temperature=1.0, top_k=0, top_p=1.0, min_p=0.1),
dict(name='Typical-p (p=0.2)', temperature=1.0, top_k=0, top_p=1.0, typical_p=0.2),
]
for s in settings:
name = s.pop('name')
out = run_experiment(tokens, logits, n_samples=10000, seed=42, **s)
print(f"\n=== {name} ===")
print('Entropy:', round(out['entropy'], 3))
for t, p in out['top_tokens']:
print(f"{t:10s} {p:.3f}")
Interpreting Results¶
Entropy¶
What is it? Average "surprise" in the distribution. Higher = more random, lower = more deterministic.
- Entropy โ 0 โ Nearly deterministic (T=0.1, or top-k=1)
- Entropy โ 1.0 โ Fairly peaked (safety-critical tasks)
- Entropy โ 2.0+ โ Flat (creative tasks, code generation)
How to read it: - Low entropy = model "confident" in a few top options - High entropy = model "uncertain," many options equally plausible
Top Token Probability¶
What is it? How likely is the most probable token?
- Top token prob ~90%+ โ Highly deterministic (classification)
- Top token prob ~30-50% โ Balanced (summarization)
- Top token prob ~15-25% โ Creative (writing, brainstorming)
Surviving Token Count¶
What is it? How many tokens have non-zero probability after all filters?
- With top-k=1: Only 1 token
- With top-k=5: Only 5 tokens
- With top-p=0.5: ~2-4 tokens (varies by confidence)
- With no filters + high temp: Most/all tokens
Recommended Experiment Sequence¶
Experiment 1: Temperature Sweep¶
Fix top_p=1.0, top_k=0, vary temperature:
settings = [
dict(name='T=0.1', temperature=0.1, top_k=0, top_p=1.0),
dict(name='T=0.2', temperature=0.2, top_k=0, top_p=1.0),
dict(name='T=0.5', temperature=0.5, top_k=0, top_p=1.0),
dict(name='T=1.0', temperature=1.0, top_k=0, top_p=1.0),
dict(name='T=1.5', temperature=1.5, top_k=0, top_p=1.0),
dict(name='T=2.0', temperature=2.0, top_k=0, top_p=1.0),
]
Observe: Entropy increases linearly. Top token probability decreases.
Experiment 2: Top-p Sweep¶
Fix temperature=1.0, top_k=0, vary top_p:
settings = [
dict(name='p=0.2', temperature=1.0, top_k=0, top_p=0.2),
dict(name='p=0.5', temperature=1.0, top_k=0, top_p=0.5),
dict(name='p=0.8', temperature=1.0, top_k=0, top_p=0.8),
dict(name='p=0.95', temperature=1.0, top_k=0, top_p=0.95),
dict(name='p=1.0', temperature=1.0, top_k=0, top_p=1.0),
]
Observe: Lower p = fewer tokens survive = lower entropy.
Experiment 3: Top-k Sweep¶
Fix temperature=1.0, top_p=1.0, vary top_k:
settings = [
dict(name='k=1', temperature=1.0, top_k=1, top_p=1.0),
dict(name='k=5', temperature=1.0, top_k=5, top_p=1.0),
dict(name='k=20', temperature=1.0, top_k=20, top_p=1.0),
dict(name='k=0 (no limit)', temperature=1.0, top_k=0, top_p=1.0),
]
Observe: k=1 โ entropy=0 (only 1 token). Higher k โ higher entropy.
Real-World Parameter Presets¶
| Task | Temperature | Top-p | Top-k | Notes |
|---|---|---|---|---|
| Classification | 0.0-0.2 | 1.0 | 0 | Deterministic |
| Extraction | 0.1-0.3 | 1.0 | 0 | Safe, consistent |
| Summarization | 0.6-0.8 | 0.95 | 0 | Balance fidelity + variety |
| RAG/Q&A | 0.2-0.4 | 1.0 | 0 | Grounded, low hallucination |
| Code generation | 0.8 | 1.0 | 30 | Diverse but bounded |
| Brainstorming | 0.9-1.2 | 0.85 | 0 | Creative |
| Story writing | 1.0-1.3 | 0.9 | 0 | High creativity |
Optional: Visualization¶
Plot entropy vs. temperature:
import matplotlib.pyplot as plt
temps = np.linspace(0.1, 2.0, 20)
entropies = []
for t in temps:
out = run_experiment(tokens, logits, temperature=t, n_samples=5000)
entropies.append(out['entropy'])
plt.figure(figsize=(10, 6))
plt.plot(temps, entropies, marker='o', linewidth=2)
plt.xlabel('Temperature', fontsize=12)
plt.ylabel('Entropy (bits)', fontsize=12)
plt.title('How Temperature Affects Distribution Spread', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()
Next Steps¶
- Follow the guided learning path: Start with Experiment 1: Temperature
- Read parameter documentation: Parameter Reference
- Move to cloud: Azure OpenAI Setup (when ready)
FAQ¶
Q: Can I modify the simulator to use my own logits?
A: Yes! Replace logits = np.array([...]) with your own values.
Q: Will this teach me how a real model works?
A: Noโthis teaches you parameter math, which is universal. But real models have vocabulary of 50k+ tokens and produce coherent text. The parameter principles are identical.
Q: Which parameters matter most?
A: In order: (1) Temperature, (2) Top-p, (3) Penalties. Top-k is less common in modern systems.
Q: Can I use this with other APIs (OpenAI, Anthropic)?
A: The parameter math is the same everywhere. Learn it here, apply it there.
๐ Ready to start? Go to Experiment 1: Temperature โ