Skip to content

Experiment 1: Temperature Sweep

The Mental Model

Imagine you're a language model and someone asks you:

"Should I approve this request?"

You have 10 possible answers: approve, reject, review, escalate, delay, audit, optimize, notify, assign, close.

Internally, the model doesn't just pick one word. It scores every word and turns those scores into probabilities. Temperature is the knob that controls how confident the model acts when sampling from those probabilities.

Think of it like this:

Low Temperature High Temperature
Model is very sure of itself Model spreads bets evenly
"It's almost definitely approve" "Honestly any of these could work"
A politician giving a press statement A brainstorming session
Strict, deterministic Loose, exploratory

Step 1: Understand What "Logits" Are

Logits are raw scores the model assigns to each possible next token. They're just numbers โ€” they can be positive or negative.

They are computed by the model itself as the final score for each possible next token, using the hidden representation built up through the transformer layers

How they are produced

Inside an LLM, the prompt is first turned into embeddings, then processed through many layers of attention and feed-forward networks. At the end, the model has a vector that represents the current context, and that vector is multiplied by a learned output matrix to produce one score per vocabulary token.

What the scores mean

Those numbers are called logits because they are raw, unnormalized values. A larger logit means the model finds that token more compatible with the current context, but the values only become probabilities after a softmax step.

In our experiment, the logits for the 10 tokens look like this:

Token       Logit
approve     2.2   โ† model's favorite
reject      1.8
review      1.4
escalate    0.9
delay       0.2
audit       0.1
optimize   -0.3
notify     -0.6
assign     -0.8
close      -1.0   โ† model's least favorite

Understanding Logits

  • Higher logit = model thinks this token fits better
  • The logit is NOT a probability yet (2.2 doesn't mean 2.2%)
  • The absolute values don't matter โ€” only the differences between them matter

Key Insight: Ranking vs. Strength

The logit list is like a ranked list of preferences, but the ranking alone doesn't tell you how strong those preferences are.


Step 2: Understand the Softmax Function

To turn logits into probabilities, we use softmax. Here's what it does in plain English:

  1. Take every logit and compute e^logit (e = 2.718, Euler's number)
  2. Add all those values up
  3. Each token's probability = its e^logit divided by the total

The formula:

P(token_i) = e^(logit_i) / sum of all e^(logit_j)
\[ P(i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \]

Why e^x?

Because it has a useful property: it's always positive, and it makes bigger numbers much bigger than smaller ones. A logit of 4.0 doesn't just beat 2.0 โ€” it completely dominates it. This is what creates the "peaked" distribution.

What Softmax Guarantees

  • All probabilities are between 0 and 1
  • All probabilities add up to exactly 1.0
  • The ranking of tokens is preserved (highest logit = highest probability)

Our example at baseline (T=1.0):

Token       Logit    โ†’    Probability
approve     2.2      โ†’    ~0.36  (36%)
reject      1.8      โ†’    ~0.24  (24%)
review      1.4      โ†’    ~0.16  (16%)
escalate    0.9      โ†’    ~0.10  (10%)
delay       0.2      โ†’    ~0.05  (5%)
audit       0.1      โ†’    ~0.04  (4%)
optimize   -0.3      โ†’    ~0.03  (3%)
notify     -0.6      โ†’    ~0.02  (2%)
assign     -0.8      โ†’    ~0.01  (1%)
close      -1.0      โ†’    ~0.01  (<1%)

So at baseline, if you sample 100 times, you'd get approve about 36 times, reject about 24 times, etc.


Step 3: Understand What Temperature Does to Logits

Temperature changes the logits BEFORE softmax is applied.

The formula:

scaled_logit_i = logit_i / T
where T = temperature

That's it. Just divide every logit by T. Then softmax runs on these scaled logits.

Let's trace what happens step by step for two temperatures:

Case A: T = 0.5 (low temperature, more confident)

Divide all logits by 0.5:

Token       Original    Divided by 0.5
approve     2.2     โ†’   4.4
reject      1.8     โ†’   3.6
review      1.4     โ†’   2.8
escalate    0.9     โ†’   1.8
...
close      -1.0     โ†’  -2.0

When T < 1.0

The differences between logits get bigger (approve vs close: was 3.2, now 6.4). When softmax runs on these bigger differences, the winner dominates much more.

Case B: T = 2.0 (high temperature, less confident)

Divide all logits by 2.0:

Token       Original    Divided by 2.0
approve     2.2     โ†’   1.1
reject      1.8     โ†’   0.9
review      1.4     โ†’   0.7
escalate    0.9     โ†’   0.45
...
close      -1.0     โ†’  -0.5

When T > 1.0

The differences get smaller (approve vs close: was 3.2, now 1.6). When softmax runs on these compressed differences, the probabilities are much more equal.

Critical Insight: Order Preservation

Temperature does NOT change the token ranking.
Temperature is inversely proportional to the "confidence gap" between tokens, but the order of which token is most likely doesn't change. approve is still #1 at T=0.1 and at T=5.0.
Temperature only changes how much more likely the top token is than the others.


Entropy: The Single Number That Captures Distribution Shape

Entropy is useful because it tells you how spread out the model's next-token probability distribution is. Temperature changes that spread, and entropy gives you a single number that summarizes it

A compact way to say it is: temperature controls the distribution, while entropy measures the result of that distribution.

How entropy is calculated

Once you have next-token probabilities p1, p2, ..., pn, where each p is between 0 and 1 and they sum to 1,

Shannon entropy is calculated as:

\[ H = -\sum_{i=1}^{n} p_i \log(p_i) \]

The exact log base depends on the units you want; - base 2 gives bits, - natural log gives nats.

The choice is just a unit convention, like using Celsius vs Fahrenheit.

in ML, Entropy is measured in bits because the common machine-learning definition uses base-2 logarithms, and that makes the unit "how many binary yes/no questions of uncertainty" you have on average.

Practical intuition

The Three-Part System

Think of logits as the raw preference scores, temperature as the knob that sharpens or flattens those preferences, and entropy as the summary of how uncertain the final distribution is.


Step 4: Read the Graphs โ€” Panel by Panel

The graph exp1_temperature.png has several panels. Here's how to read each one:


Panel 1โ€“5: The Individual Bar Charts (top two rows, left side)

Each bar chart shows the probability distribution for one temperature value.

What you're looking at:

  • X-axis: each of the 20 tokens (labeled Token 0, Token 1, etc.)
  • Y-axis: probability (0 = impossible, 1.0 = certain)
  • Bar height = how likely the model is to pick that token

How to Interpret Each Bar Chart

T = 0.1  โ†’  One bar is near 1.0, everything else is basically flat near zero
             = Model is almost certain. Sample 1000 times โ†’ same answer 990+ times.

T = 0.5  โ†’  Top bar is tall, next few are visible, rest still near zero
             = Model is fairly confident but alternatives exist.

T = 1.0  โ†’  Top bar maybe 0.35, next bars clearly visible
             = Baseline behavior. Mix of confidence and diversity.

T = 2.0  โ†’  All bars become more similar heights
             = Model is unsure. Many tokens have meaningful probability.

T = 5.0  โ†’  All bars are nearly identical heights
             = Model is almost completely random. Like rolling a 20-sided die.

The white-bordered bar: That's the token with the highest probability. The annotation shows P(top)=X% โ€” that percentage shrinks as temperature rises.

The "Entropy = X bits" in the title: More on this in Step 5, but higher = more spread out = more random.

exp1_temperature.png


Panel 6: "Top-token Prob vs Temperature" (right side, middle row)

This is a line chart where:

  • X-axis: temperature (from 0.05 to 5.0)
  • Y-axis: probability of the single most likely token
  • White dots: the 5 specific temperatures you tested

Reading the Curve

The curve drops steeply from left to right:

  • At T=0.1: top token has ~97% probability (model almost always picks it)
  • At T=1.0: top token has ~36% probability (baseline)
  • At T=5.0: top token has ~15% probability (barely more likely than random)

The dashed vertical line at T=1.0 is the baseline reference.

What the Curve Shape Tells You

The drop is not linear โ€” it's steeper at low temperatures. Going from T=0.1 to T=0.3 makes a huge difference. Going from T=3.0 to T=5.0 makes very little difference. This is the "diminishing returns" zone.

Finding Your Sweet Spot

If you want to find the sweet spot between "too deterministic" and "too random", look for where the curve starts to flatten. That's usually around T=0.7โ€“1.2.


Panel 7: "Entropy vs Temperature" (right side, bottom row)

Entropy is a number that measures "how spread out" a probability distribution is:

  • Entropy = 0: One option has 100% probability. Zero randomness.
  • Entropy = maximum: All options equally likely. Maximum randomness.

The unit is bits. For a 20-token vocabulary, maximum entropy is about 4.32 bits (logโ‚‚ of 20).

Reading the Entropy Chart

  • X-axis: temperature
  • Y-axis: Shannon entropy in bits
  • The line rises smoothly as temperature increases
Low entropy (< 1 bit)   = Very peaked, deterministic behavior
Medium entropy (1โ€“3 bit) = Balanced, diverse but not random
High entropy (3โ€“4 bit)  = Very flat, nearly random

Why Entropy Matters More Than Just Top Probability

Imagine two distributions:

  • Distribution A: [0.9, 0.05, 0.05] โ€” entropy is low, model mostly picks token 0
  • Distribution B: [0.4, 0.4, 0.2] โ€” entropy is higher, model is genuinely split

Both have a clear "winner" but they behave very differently when sampled repeatedly. Entropy captures this full picture in a single number.


Panel 8: The Heatmap (bottom left, spanning two columns)

This is the most information-dense panel. It shows the full distribution across all 20 tokens across 50 temperature values simultaneously.

What you're looking at:

  • X-axis: temperature (left = 0.1, right = 5.0)
  • Y-axis: token index (bottom = token 0, top = token 19)
  • Color: probability (brighter/yellower = higher probability, darker/purple = lower)

How to Read the Heatmap

At the far left (low T):

  • Token 0 has a very bright bar โ€” it's capturing almost all the probability
  • Tokens 1โ€“19 are mostly dark purple (near zero)

As you move right (higher T):

  • Token 0's bar dims gradually
  • Other tokens' bars brighten
  • By T=5.0, all tokens are roughly the same shade of medium brightness

What This Tells You at a Glance

  • A distribution with one bright token + many dark tokens = low entropy, high confidence
  • A distribution where all tokens are medium brightness = high entropy, random

Pro tip: The heatmap lets you see which specific tokens "activate" as temperature rises. Some tokens go from nearly impossible to meaningful contributors. Those are the tokens that represent creative or unusual alternatives.


Step 5: The Mental Walkthrough โ€” Run It In Your Head

Here's a simulation of what happens when you change temperature, step by step:

Scenario: Classify a customer service request

You feed an LLM this prompt:

"Customer says their order is late. Action: ___"

Model internal logits (simplified to 5 tokens):

escalate   : 2.5
notify     : 2.0
delay      : 0.8
close      : -0.5
optimize   : -1.2
At T = 0.2 (very deterministic)

Scaled logits: [12.5, 10.0, 4.0, -2.5, -6.0] โ†’ Softmax โ†’ probabilities: [~0.92, ~0.07, ~0.01, ~0.00, ~0.00]

Behavior: Model says escalate 92% of the time. Reliable, consistent, predictable. Good for a production system.

At T = 1.0 (baseline)

Scaled logits: [2.5, 2.0, 0.8, -0.5, -1.2] โ†’ Softmax โ†’ probabilities: [~0.52, ~0.32, ~0.12, ~0.03, ~0.01]

Behavior: Model says escalate 52% of the time, notify 32% of the time. Still mostly right but with variation.

At T = 2.0 (creative)

Scaled logits: [1.25, 1.0, 0.4, -0.25, -0.6] โ†’ Softmax โ†’ probabilities: [~0.35, ~0.27, ~0.15, ~0.12, ~0.11]

Behavior: Still prefers escalate but notify, delay, and even close get real probability. You'd see different answers on different runs. Bad for classification, potentially useful for generating diverse scenarios.


Step 6: Why This Matters โ€” Practical Decision Guide

Use this decision tree when you're setting temperature in a real project:

Is there ONE correct answer? (classification, factual QA, code)
    YES โ†’ Use T = 0.1 to 0.3
    NO  โ†’ Is coherence and quality still important?
              YES โ†’ Use T = 0.5 to 0.8  (most production tasks)
              NO (exploring/brainstorming) โ†’ Use T = 0.9 to 1.5

The Five Temperature "Zones"

Zone 1: T = 0.0 to 0.2 โ€” Greedy/Deterministic
  • Same answer every time (or nearly so)
  • Use for: classification labels, extracting structured data, code with strict syntax, factual answers
  • Risk: Can be repetitive, may miss correct answers that require slight variation
Zone 2: T = 0.2 to 0.5 โ€” Focused
  • Strong preference for the top tokens, slight variation
  • Use for: summarization, question answering, customer support, most professional tasks
  • Risk: May be slightly "stiff" or formulaic
Zone 3: T = 0.5 to 0.8 โ€” Balanced (most common production setting)
  • Good balance between consistency and natural-sounding variation
  • Use for: chatbots, writing assistance, code generation, general assistants
  • Risk: Some outputs will differ significantly โ€” requires quality checking
Zone 4: T = 0.8 to 1.2 โ€” Creative
  • Model takes risks, tries unusual combinations
  • Use for: brainstorming, creative writing, generating varied examples
  • Risk: Some outputs will be weird or off-topic
Zone 5: T > 1.5 โ€” Exploratory
  • Model is essentially rolling dice
  • Use for: research/testing, understanding the model's vocabulary, stress testing
  • Risk: Often produces incoherent or nonsensical text in production

Step 7: Common Misconceptions โ€” Cleared Up

Misconception 1: 'Higher temperature makes the model smarter/more creative'

Not exactly. Higher temperature makes the model less predictable, which can look like creativity. But the model's knowledge and capabilities don't change. A high temperature can produce brilliant unusual ideas OR complete nonsense โ€” you can't control which. Temperature changes the distribution of outputs, not the quality of the model's underlying knowledge.

Misconception 2: 'Temperature = 0 gives the best answer'

Temperature = 0 (greedy decoding) gives the most probable answer, not necessarily the best answer. Sometimes the second or third most likely token leads to a better overall sentence or idea. Greedy decoding can also get stuck in repetitive loops because it always picks the same token given the same context.

Misconception 3: 'The temperature I set applies to the whole generation'

True for standard APIs, but some advanced systems apply different temperatures to different parts of the output (e.g., stricter at the start of a list, looser for creative expansions). Keep this in mind when you see fine-grained control systems.

Misconception 4: 'Low temperature means the model is thinking harder'

Temperature is applied after the model has already done all its thinking (the forward pass). It only affects the sampling step. The model's reasoning quality doesn't change โ€” you're just changing how you pick from the resulting probability distribution.


Step 8: How to Read the Sample Output Text

In the original experiment, you'd see output like:

=== T=0.1 (extreme determinist) ===
Entropy: 0.412
approve       0.967
reject        0.023
review        0.005
escalate      0.002
delay         0.001
audit         0.001
optimize      0.000
notify        0.000
assign        0.000
close         0.000

Reading Guide for Sample Output

Field What it means
Entropy: 0.412 Very low โ€” nearly all probability mass on one token
approve 0.967 96.7% probability. If you sample 1000 times, ~967 are "approve"
reject 0.023 2.3% probability. Rare but possible
review 0.005 0.5% probability. Almost never
Everything else Effectively zero โ€” these tokens are off the table
=== T=2.0 (very creative) ===
Entropy: 2.891
optimize      0.152
approve       0.138
notify        0.134
reject        0.121
review        0.106
escalate      0.088
delay         0.082
audit         0.075
assign        0.068
close         0.036

Reading High-Temperature Output

Field What it means
Entropy: 2.891 High โ€” many tokens have meaningful probability
optimize 0.152 Highest at 15.2% โ€” but that's a very slim lead
approve 0.138 13.8% โ€” barely behind optimize
close 0.036 3.6% โ€” even the "worst" token has real probability

An Interesting Phenomenon

Notice that at T=2.0, optimize is now the top token โ€” but approve had a higher logit! This happens because high temperature flattens the distribution enough that statistical noise and renormalization can temporarily shift ranks. This is one reason very high temperatures can feel "wrong" โ€” the model's actual preferred answer is no longer the most sampled one.


Step 9: Worked Example โ€” Calculating by Hand

Let's manually calculate probabilities for 3 tokens at T=0.5 and T=2.0.

Original logits: - approve: 2.2 - reject: 1.8 - review: 1.4

Step 1: Scale by temperature

At T=0.5:           At T=2.0:
approve โ†’ 2.2/0.5 = 4.4    approve โ†’ 2.2/2.0 = 1.1
reject  โ†’ 1.8/0.5 = 3.6    reject  โ†’ 1.8/2.0 = 0.9
review  โ†’ 1.4/0.5 = 2.8    review  โ†’ 1.4/2.0 = 0.7

Step 2: Apply e^x (exponentiate)

At T=0.5:                   At T=2.0:
e^4.4 โ‰ˆ 81.5               e^1.1 โ‰ˆ 3.00
e^3.6 โ‰ˆ 36.6               e^0.9 โ‰ˆ 2.46
e^2.8 โ‰ˆ 16.4               e^0.7 โ‰ˆ 2.01
Sum   = 134.5              Sum   = 7.47

Step 3: Divide by sum (normalize)

At T=0.5:                   At T=2.0:
approve  = 81.5/134.5 = 60.6%    approve = 3.00/7.47 = 40.2%
reject   = 36.6/134.5 = 27.2%    reject  = 2.46/7.47 = 32.9%
review   = 16.4/134.5 = 12.2%    review  = 2.01/7.47 = 26.9%

What Changed

  • At T=0.5: approve has 60.6% โ€” a strong lead
  • At T=2.0: approve has 40.2% โ€” still leads but far from dominant
  • At T=2.0: review went from 12.2% to 26.9% โ€” a huge jump!

This is the core mechanism. Low T amplifies differences. High T compresses them.


Step 10: What to Do Next

Now that you understand temperature deeply, you're ready to understand why it interacts with the other parameters:

  1. Top-p (Experiment 2): Even with a good temperature, you might want to exclude very low-probability tokens entirely. Top-p does this dynamically.

  2. Top-k (Experiment 3): A simpler, fixed-size version of top-p filtering.

  3. Repetition Penalty (Experiment 5): Adjusts logits after temperature, but before sampling. Understanding temperature makes repetition penalty obvious.


Quick Reference: Graph Interpretation Cheat Sheet

When you open exp1_temperature.png, scan in this order:

Cheat Sheet for Reading Graphs

1. Look at the bar chart panels
   โ†’ Are the bars peaked (one tall bar) or flat (all similar)?
   โ†’ Peaked = low T, flat = high T

2. Look at entropy values in the panel titles
   โ†’ < 1.0 bits = very deterministic
   โ†’ 1.0โ€“2.5 bits = balanced
   โ†’ > 2.5 bits = highly random

3. Look at the "Top-token Prob" line chart
   โ†’ Where does the curve flatten? That's where further T increases stop mattering much
   โ†’ The white dots are your specific test temperatures

4. Look at the entropy curve
   โ†’ Should mirror the top-token curve (as top prob drops, entropy rises)
   โ†’ Check that they're consistent

5. Look at the heatmap
   โ†’ Which tokens "light up" as T increases?
   โ†’ How many tokens get above ~10% probability at high T?

Summary

Concept One-sentence explanation
Logit Raw score the model assigns to a token โ€” just a number
Softmax Converts logits to probabilities that sum to 1
Temperature Divides all logits before softmax โ€” compresses or expands differences
Low T Amplifies differences โ†’ confident, deterministic
High T Compresses differences โ†’ uncertain, diverse
Entropy One number measuring how spread out a distribution is
The heatmap Shows full distribution across all tokens at all temperatures simultaneously

Final Takeaway

Temperature is the most important parameter to understand first because every other parameter in LLM sampling operates on top of the temperature-scaled distribution. Master this, and the rest becomes intuitive.


Part of the 6-Experiment LLM Parameter Learning Path
Next: Experiment 2 โ€” Top-p (Nucleus Sampling)

Optional: Dive Deeper

Plot entropy vs. temperature (including very high values like T=5.0 if you want to see near-uniform behavior):

import matplotlib.pyplot as plt

temps = [0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 5.0]
entropies = []

for t in temps:
    out = run_experiment(tokens, logits, temperature=t, n_samples=10000)
    entropies.append(out['entropy'])

plt.plot(temps, entropies, marker='o')
plt.xlabel('Temperature')
plt.ylabel('Entropy')
plt.title('How Temperature Affects Distribution Spread')
plt.grid(True)
plt.show()

Takeaway

Temperature is the master randomness dial: it smoothly moves behavior from "pick best token" to "sample broadly".

๐Ÿ‘‰ Next: Experiment 2: Top-p (Nucleus Sampling)