Skip to content

Local Sampling Simulator Guide

What is This?

A free, offline simulator that teaches you exactly what LLM sampling parameters do by:

  1. Starting with fake logits (pretend next-token odds)
  2. Applying parameter transformations (temperature, filters, penalties)
  3. Sampling thousands of times to estimate the distribution
  4. Showing results: entropy, top tokens, survival rate

Key idea: Parameter math is identical in all models. Learn it here, apply it everywhere.


What It Is (and Isn't)

โœ… What It Covers

  • Temperature scaling (logits รท T)
  • Softmax โ†’ probabilities
  • Top-k truncation (keep top k ranked tokens)
  • Top-p (nucleus) truncation (keep smallest set with cumulative prob โ‰ฅ p)
  • Min-p filtering (keep prob โ‰ฅ min_p ร— max_prob)
  • Typical-p filtering (keep tokens with "typical" information content)
  • Repetition penalties (discourage tokens seen recently)

โŒ What It Doesn't Do

  • Run an actual transformer model
  • Produce realistic text (uses fake logits)
  • Account for vocabulary size (uses 10 tokens for simplicity)

Think of it as: "If a model believed these were the next-token odds, how would parameter X change behavior?"


The Decoding Pipeline (Visual)

This is the standard token selection process. Any LLM uses something like this:

Raw logits (e.g., [2.2, 1.8, 1.4, ...])
        โ†“
[Optional: Repetition penalty applied to recent tokens]
        โ†“
[Temperature scaling: logits / T]
        โ†“
[Filtering: top-k, then top-p, then min-p, then typical-p]
        โ†“
[Softmax: convert to probabilities]
        โ†“
[Sampling: pick a token randomly according to probabilities]
        โ†“
[Next token selected]

Most systems apply filters in roughly this order. The order matters!


Key Concepts (Intuition + Math)

Temperature

Rescales logits before softmax:

\[P(token_i) = \frac{e^{logit_i / T}}{\sum_j e^{logit_j / T}}\]
  • Lower T (< 1) โ†’ sharpens distribution โ†’ deterministic
  • Higher T (> 1) โ†’ flattens distribution โ†’ random

Top-k

Keeps only the k highest-ranked tokens (by logit), discards rest.

  • Fixed-size candidate set
  • Always removes lowest-probability tail
  • Good for preventing garbage tokens

Top-p (Nucleus Sampling)

Keeps the smallest set of tokens whose cumulative probability โ‰ฅ p.

  • Dynamic-size candidate set (adapts to model confidence)
  • Balances exploration and coherence
  • Common default: p โ‰ˆ 0.9 or 0.95

Min-p

Keeps tokens with probability โ‰ฅ min_p ร— max_prob_at_this_step.

  • Adapts like top-p but using a relative threshold
  • Useful in open-source backends

Typical-p

Selects tokens with typical information content (close to distribution entropy).

  • Different from max-probability focus
  • Less common but sometimes better for certain tasks

Repetition Penalty

Scales down logits of tokens seen recently (within a window).

  • Discourages (not forbids) repetition
  • Window size matters (look back 8, 32, 64, 128 tokens?)
  • Penalty strength: 1.0=none, 1.1=mild, 1.5=heavy

Installation & Setup

Prerequisites

  • Python 3.8+
  • numpy (required)
  • matplotlib (optional, for plotting)

Install

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements-experiments.txt

Run the Default Experiments

python3 experiments/local/sampling_simulator.py

You'll see output like:

=== Very deterministic (T=0.2) ===
Entropy: 0.412
approve       0.967
reject        0.023
review        0.005
escalate      0.002
delay         0.000
...

The Simulator Code

Below is the complete, ready-to-run simulator at experiments/local/sampling_simulator.py:

Core Math Functions

import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    '''Numerically stable softmax.'''
    x = logits - np.max(logits)
    e = np.exp(x)
    return e / np.sum(e)


def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    '''Scale logits by temperature (logits / T).'''
    if temperature <= 0:
        raise ValueError('temperature must be > 0')
    return logits / temperature


def filter_top_k(logits: np.ndarray, k: int) -> np.ndarray:
    '''Keep only the top-k logits; set others to -inf.'''
    if k is None or k <= 0 or k >= logits.size:
        return logits
    idx = np.argpartition(logits, -k)[-k:]
    mask = np.full_like(logits, fill_value=False, dtype=bool)
    mask[idx] = True
    out = logits.copy()
    out[~mask] = -np.inf
    return out


def filter_top_p(logits: np.ndarray, p: float) -> np.ndarray:
    '''Nucleus filtering: keep smallest set with cumulative prob >= p.'''
    if p is None or p >= 1.0:
        return logits
    if p <= 0:
        raise ValueError('top_p must be in (0,1]')

    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])

    keep = order[cum <= p]
    if keep.size == 0:
        keep = order[:1]

    mask = np.zeros_like(probs, dtype=bool)
    mask[keep] = True
    out = logits.copy()
    out[~mask] = -np.inf
    return out


def filter_min_p(logits: np.ndarray, min_p: float) -> np.ndarray:
    '''Min-p filtering: keep tokens with prob >= min_p * max_prob.'''
    if min_p is None or min_p <= 0:
        return logits
    if min_p > 1:
        raise ValueError('min_p must be in (0,1]')

    probs = softmax(logits)
    thr = min_p * np.max(probs)
    mask = probs >= thr

    out = logits.copy()
    out[~mask] = -np.inf
    return out


def filter_typical_p(logits: np.ndarray, typical_p: float) -> np.ndarray:
    '''Typical-p filtering.

    Keeps tokens with information content close to distribution entropy.
    '''
    if typical_p is None or typical_p >= 1.0:
        return logits
    if typical_p <= 0:
        raise ValueError('typical_p must be in (0,1]')

    probs = softmax(logits)
    surprise = -np.log(np.clip(probs, 1e-30, 1.0))
    H = float(np.sum(probs * surprise))

    order = np.argsort(np.abs(surprise - H))
    cum = np.cumsum(probs[order])

    keep = order[cum <= typical_p]
    if keep.size == 0:
        keep = order[:1]

    mask = np.zeros_like(probs, dtype=bool)
    mask[keep] = True

    out = logits.copy()
    out[~mask] = -np.inf
    return out


def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int], penalty: float) -> np.ndarray:
    '''Simple repetition penalty: downscale logits for already-seen tokens.'''
    if penalty is None or penalty == 1.0:
        return logits
    out = logits.copy()
    for tid in set(generated_ids):
        if out[tid] > 0:
            out[tid] /= penalty
        else:
            out[tid] *= penalty
    return out

Sampling & Metrics

def sample_next(logits: np.ndarray, rng: np.random.Generator) -> int:
    probs = softmax(logits)
    return int(rng.choice(len(probs), p=probs))


def entropy(probs: np.ndarray) -> float:
    p = np.clip(probs, 1e-30, 1.0)
    return float(-np.sum(p * np.log(p)))


def run_experiment(
    token_names,
    base_logits,
    n_samples=5000,
    seed=123,
    temperature=1.0,
    top_k=0,
    top_p=1.0,
    min_p=None,
    typical_p=1.0,
    repetition_penalty=1.0,
    sequence_len=0,
):
    '''Run parameter sweep experiment.'''
    rng = np.random.default_rng(seed)

    counts = np.zeros(len(base_logits), dtype=int)
    generated = []

    for _ in range(n_samples):
        logits = base_logits.copy()

        # Optional: penalize tokens that appear in the recent window
        if sequence_len > 0:
            logits = apply_repetition_penalty(logits, generated[-sequence_len:], repetition_penalty)

        # Temperature scaling
        logits = apply_temperature(logits, temperature)

        # Filters (order matters)
        logits = filter_top_k(logits, top_k)
        logits = filter_top_p(logits, top_p)
        logits = filter_min_p(logits, min_p)
        logits = filter_typical_p(logits, typical_p)

        tid = sample_next(logits, rng)
        counts[tid] += 1
        generated.append(tid)

    probs_emp = counts / counts.sum()

    return {
        'counts': counts,
        'probs_emp': probs_emp,
        'entropy': entropy(probs_emp),
        'top_tokens': sorted(
            [(token_names[i], probs_emp[i]) for i in range(len(token_names))],
            key=lambda x: -x[1]
        )[:10],
    }

Main Demo

if __name__ == '__main__':
    tokens = ['approve', 'reject', 'review', 'escalate', 'delay', 'audit', 'optimize', 'notify', 'assign', 'close']

    # Example base logits (hand-made): first few tokens more likely
    logits = np.array([2.2, 1.8, 1.4, 0.9, 0.2, 0.1, -0.3, -0.6, -0.8, -1.0], dtype=float)

    settings = [
        dict(name='Very deterministic (T=0.2)', temperature=0.2, top_k=0, top_p=1.0),
        dict(name='Balanced (T=0.8)', temperature=0.8, top_k=0, top_p=0.95),
        dict(name='More creative (T=1.3)', temperature=1.3, top_k=50, top_p=1.0),
        dict(name='Tighter nucleus (T=1.0, p=0.6)', temperature=1.0, top_k=0, top_p=0.6),
        dict(name='Min-p filter (p=0.1)', temperature=1.0, top_k=0, top_p=1.0, min_p=0.1),
        dict(name='Typical-p (p=0.2)', temperature=1.0, top_k=0, top_p=1.0, typical_p=0.2),
    ]

    for s in settings:
        name = s.pop('name')
        out = run_experiment(tokens, logits, n_samples=10000, seed=42, **s)
        print(f"\n=== {name} ===")
        print('Entropy:', round(out['entropy'], 3))
        for t, p in out['top_tokens']:
            print(f"{t:10s}  {p:.3f}")

Interpreting Results

Entropy

What is it? Average "surprise" in the distribution. Higher = more random, lower = more deterministic.

  • Entropy โ‰ˆ 0 โ†’ Nearly deterministic (T=0.1, or top-k=1)
  • Entropy โ‰ˆ 1.0 โ†’ Fairly peaked (safety-critical tasks)
  • Entropy โ‰ˆ 2.0+ โ†’ Flat (creative tasks, code generation)

How to read it: - Low entropy = model "confident" in a few top options - High entropy = model "uncertain," many options equally plausible

Top Token Probability

What is it? How likely is the most probable token?

  • Top token prob ~90%+ โ†’ Highly deterministic (classification)
  • Top token prob ~30-50% โ†’ Balanced (summarization)
  • Top token prob ~15-25% โ†’ Creative (writing, brainstorming)

Surviving Token Count

What is it? How many tokens have non-zero probability after all filters?

  • With top-k=1: Only 1 token
  • With top-k=5: Only 5 tokens
  • With top-p=0.5: ~2-4 tokens (varies by confidence)
  • With no filters + high temp: Most/all tokens

Experiment 1: Temperature Sweep

Fix top_p=1.0, top_k=0, vary temperature:

settings = [
    dict(name='T=0.1', temperature=0.1, top_k=0, top_p=1.0),
    dict(name='T=0.2', temperature=0.2, top_k=0, top_p=1.0),
    dict(name='T=0.5', temperature=0.5, top_k=0, top_p=1.0),
    dict(name='T=1.0', temperature=1.0, top_k=0, top_p=1.0),
    dict(name='T=1.5', temperature=1.5, top_k=0, top_p=1.0),
    dict(name='T=2.0', temperature=2.0, top_k=0, top_p=1.0),
]

Observe: Entropy increases linearly. Top token probability decreases.

Experiment 2: Top-p Sweep

Fix temperature=1.0, top_k=0, vary top_p:

settings = [
    dict(name='p=0.2', temperature=1.0, top_k=0, top_p=0.2),
    dict(name='p=0.5', temperature=1.0, top_k=0, top_p=0.5),
    dict(name='p=0.8', temperature=1.0, top_k=0, top_p=0.8),
    dict(name='p=0.95', temperature=1.0, top_k=0, top_p=0.95),
    dict(name='p=1.0', temperature=1.0, top_k=0, top_p=1.0),
]

Observe: Lower p = fewer tokens survive = lower entropy.

Experiment 3: Top-k Sweep

Fix temperature=1.0, top_p=1.0, vary top_k:

settings = [
    dict(name='k=1', temperature=1.0, top_k=1, top_p=1.0),
    dict(name='k=5', temperature=1.0, top_k=5, top_p=1.0),
    dict(name='k=20', temperature=1.0, top_k=20, top_p=1.0),
    dict(name='k=0 (no limit)', temperature=1.0, top_k=0, top_p=1.0),
]

Observe: k=1 โ†’ entropy=0 (only 1 token). Higher k โ†’ higher entropy.


Real-World Parameter Presets

Task Temperature Top-p Top-k Notes
Classification 0.0-0.2 1.0 0 Deterministic
Extraction 0.1-0.3 1.0 0 Safe, consistent
Summarization 0.6-0.8 0.95 0 Balance fidelity + variety
RAG/Q&A 0.2-0.4 1.0 0 Grounded, low hallucination
Code generation 0.8 1.0 30 Diverse but bounded
Brainstorming 0.9-1.2 0.85 0 Creative
Story writing 1.0-1.3 0.9 0 High creativity

Optional: Visualization

Plot entropy vs. temperature:

import matplotlib.pyplot as plt

temps = np.linspace(0.1, 2.0, 20)
entropies = []

for t in temps:
    out = run_experiment(tokens, logits, temperature=t, n_samples=5000)
    entropies.append(out['entropy'])

plt.figure(figsize=(10, 6))
plt.plot(temps, entropies, marker='o', linewidth=2)
plt.xlabel('Temperature', fontsize=12)
plt.ylabel('Entropy (bits)', fontsize=12)
plt.title('How Temperature Affects Distribution Spread', fontsize=14)
plt.grid(True, alpha=0.3)
plt.show()

Next Steps


FAQ

Q: Can I modify the simulator to use my own logits?
A: Yes! Replace logits = np.array([...]) with your own values.

Q: Will this teach me how a real model works?
A: Noโ€”this teaches you parameter math, which is universal. But real models have vocabulary of 50k+ tokens and produce coherent text. The parameter principles are identical.

Q: Which parameters matter most?
A: In order: (1) Temperature, (2) Top-p, (3) Penalties. Top-k is less common in modern systems.

Q: Can I use this with other APIs (OpenAI, Anthropic)?
A: The parameter math is the same everywhere. Learn it here, apply it there.


๐Ÿ‘‰ Ready to start? Go to Experiment 1: Temperature โ†’