Experiment 5: Repetition Penalties¶
What this file is: A beginner-friendly, ground-up explanation of Experiment 5 (Repetition Penalties).
Who it's for: Anyone who completed Experiments 1โ4 and wants to understand repetition penalties โ including the math, the failure modes, and why they're nothing like top-k or top-p.
What you'll get: The ability to diagnose repetition problems in real LLM output, choose the right penalty strength and lookback window, and understand the difference between multiplicative, presence, and frequency penalties.
Before You Start: A New Kind of Problem¶
The first four experiments all dealt with a single-step sampling question: given a probability distribution, how do we pick one token? Temperature, top-k, and top-p are all about shaping that distribution at a single moment.
Repetition penalty is different. It's the first parameter in this series that cares about history โ what tokens the model has already generated. It answers a different question:
Not "how do I pick a token?" but "how do I avoid picking tokens I've already picked too many times?"
This makes it fundamentally a sequence-level control, not a token-level one. You need to understand this shift in framing before the mechanics will make sense.
The Core Intuition: Pressure, Not Prohibition¶
Before touching any math, understand the key word in the experiment description:
"Repetition penalties discourage, not forbid."
This is the most important distinction. Top-k and top-p can create hard zeros โ tokens that literally cannot be sampled. Repetition penalty never creates a hard zero (unless the penalty is infinite, which no system uses). It only moves the probability of seen tokens downward โ and the other unseen tokens absorb the redistributed probability.
Core distinction
Repetition penalty is probability pressure, not eligibility filtering.
Think of it like a tax: - No tax (penalty=1.0): every token pays the same rate, seen tokens have no disadvantage - Light tax (penalty=1.1): seen tokens pay a 10% surcharge, slight disadvantage - Heavy tax (penalty=2.0): seen tokens pay a 100% surcharge, significant disadvantage - Prohibition (penalty=โ): seen tokens are banned entirely โ but no real system does this
The model can still generate "approve" after it has already generated "approve." It just becomes less likely to do so. How much less likely depends on the penalty value and the original probability of the token.
Step 1: Understand the Multiplicative Penalty Formula¶
The formula from the experiment:
if token in recent_tokens:
if logit > 0:
logit /= penalty
else:
logit *= penalty
This needs careful unpacking. Let's go through each part.
Part 1: "if token in recent_tokens"¶
The penalty only applies to tokens that have appeared in a lookback window โ typically the last N tokens of the generated sequence. Tokens outside the window are treated normally.
If lookback = 8, only the last 8 generated tokens are tracked. If "approve" appeared at position 3 but you're now generating position 12, "approve" is no longer penalized (it left the lookback window).
This is a design choice: tokens you used very recently are penalized; tokens you used long ago are not. The window size controls "how long is the model's anti-repetition memory."
Part 2: "if logit > 0: logit /= penalty"¶
For positive logits โ tokens the model already likes โ dividing by the penalty makes the logit smaller, which softmax then translates to a lower probability.
Example: logit = 2.2, penalty = 1.3
Penalized logit = 2.2 / 1.3 = 1.692
The logit dropped from 2.2 to 1.692. Not to zero โ just lower.
Part 3: "else: logit *= penalty"¶
For negative logits โ tokens the model already disfavors โ multiplying by the penalty makes the logit more negative, pushing probability even lower.
Example: logit = -0.6, penalty = 1.3
Penalized logit = -0.6 * 1.3 = -0.78
The logit dropped from -0.6 to -0.78. More negative = even less likely after softmax.
Why This Asymmetry?¶
Without the positive/negative split, you'd apply the same operation to all logits. But dividing a negative logit by a penalty > 1 would make it less negative (closer to zero) โ which would increase its probability. That's the opposite of what you want.
Example of what would go wrong without the split:
logit = -0.6, penalty = 1.3
Wrong operation: -0.6 / 1.3 = -0.46 โ less negative, HIGHER probability!
Correct operation: -0.6 * 1.3 = -0.78 โ more negative, lower probability
The asymmetry ensures that the penalty always discourages seen tokens, regardless of the sign of their logit.
Implementation detail
The positive/negative branch is essential; removing it can invert behavior for negative logits.
Step 2: Trace the Penalty for One Token at Three Penalty Strengths¶
Let's track what happens to "approve" (logit=2.2) when it has already appeared in the lookback window.
Setup¶
Original distribution at T=0.8 (as in the experiment settings):
Scaled logits (รท0.8): approve: 2.75, reject: 2.25, review: 1.75, ...
Probabilities via softmax:
approve: ~0.44 (44%)
reject: ~0.26 (26%)
review: ~0.15 (15%)
escalate: ~0.07 (7%)
delay: ~0.03 (3%)
...
Penalty = 1.0 (No penalty)¶
No change. Approve stays at logit 2.75 (post-temperature). Probability ~44%.
Penalty = 1.3 (Moderate)¶
Approve's logit after temperature = 2.75 (positive, so divide):
2.75 / 1.3 = 2.115
New distribution after penalty applied to approve: - approve: logit 2.115 โ lower probability - all other tokens: unchanged logits
After softmax with approve's logit reduced:
approve probability drops from ~44% to ~30%
All other tokens' probabilities increase proportionally
The "missing" probability from approve (about 14 percentage points) gets redistributed across all other tokens. The tokens near approve in rank (reject, review) benefit most in absolute terms, but every surviving token benefits somewhat.
Penalty = 2.0 (Extreme)¶
Approve's logit after temperature = 2.75:
2.75 / 2.0 = 1.375
Approve's logit has nearly halved. After softmax:
approve probability drops from ~44% to roughly ~15-18%
reject, review, and others all increase significantly
At penalty=2.0, approve is still the most likely token, but it's no longer dominant. The model now genuinely considers reject, review, escalate as real alternatives.
Interpretation
Heavy penalty flattens local preferences without forbidding recurrence.
Step 3: Understand the Lookback Window¶
The lookback window is the set of recent tokens that are subject to the penalty. This is a parameter that often gets ignored, but it's as important as the penalty strength.
Practical knob
Tune lookback and penalty together; strength without window context is incomplete.
Visualizing the Window¶
Suppose lookback = 8 and the model has generated this sequence:
Position: 1 2 3 4 5 6 7 8 9
Token: approve reject approve review approve review approve reject [next]
At position 9, the lookback window covers positions 1โ8. Tokens in the window:
In window: approve (appears at pos 1, 3, 5, 7 โ 4 times), reject (pos 2, 8 โ 2 times), review (pos 4, 6 โ 2 times)
Not in window: escalate, delay, audit, optimize, notify, assign, close
The penalty applies to approve, reject, and review. The other 7 tokens are unaffected โ they have their normal probabilities.
What Happens When Tokens Leave the Window¶
As generation continues, older tokens drop out of the window. At position 10: - Window covers positions 2โ9 - Position 1 (approve) is now outside the window - Approve is still penalized because it also appears at positions 3, 5, 7, and 9
But at position 17 (if approve hasn't appeared since position 9): - Window covers positions 9โ16 - Approve's last appearance was at position 9, still barely in the window
At position 18: - Window covers positions 10โ17 - Approve's last appearance (position 9) is now outside the window - Approve is no longer penalized โ it gets its full original probability back
The Lookback Window Controls "Forgetting"¶
A short window (e.g., 8): the model forgets recent tokens quickly. Good for preventing immediate adjacent repetitions ("the the the") but won't help with phrases that repeat every 10 tokens.
A long window (e.g., 128): the model remembers a much longer history. Better for preventing paragraph-level repetitions but at the risk of suppressing legitimate reuse of important words (technical terms, names that genuinely need to appear repeatedly).
Step 4: The Five Penalty Strengths โ What Each Produces¶
Penalty = 1.0 (No penalty)¶
Distribution is entirely shaped by temperature (T=0.8 in this experiment). No anti-repetition pressure.
Behavior you'll see: The model's natural tendencies dominate. If "approve" has logit 2.2, it will appear frequently. In a short sequence of 8 tokens, you'll likely see "approve" 2โ3 times.
Entropy โ 1.2โ1.5 (just T=0.8 effect).
When "no penalty" is actually fine: - Summarization: technical terms legitimately recur ("the protein binds... the binding process... the protein's binding affinity") - List generation: list structure words ("first", "second", "third") appear repeatedly by design - Code generation: syntax keywords ("return", "for", "if") are supposed to recur
Penalty = 1.1 (Light, 10% surcharge)¶
Mild discouragement. If "approve" appeared once and had 44% probability, it now might have ~40%.
Behavior: Nearly identical to no penalty for most generations. You'd need to run hundreds of samples to statistically detect the difference. Helps slightly with the most extreme cases where the dominant token has 80%+ probability.
Entropy: Barely higher than no penalty.
When to use: As a conservative default "just in case" measure. Won't noticeably change output but provides a small safety margin against the most pathological loops.
Penalty = 1.3 (Moderate)¶
Meaningful discouragement. "Approve" with logit 2.2 (positive) gets divided: 2.2/1.3 = 1.69. At T=0.8: 2.75 โ 2.115.
Behavior: Noticeably more variety. You'll see approve less, but it still wins often. The second and third tokens (reject, review) get picked more than they would without penalty.
Entropy: Meaningfully higher. In the experiment output, expect entropy to jump noticeably compared to no penalty.
When to use: Standard setting for chatbots and dialogue systems. Reduces repetitive phrasing without degrading coherence.
Penalty = 1.5 (Heavy)¶
Significant discouragement. "Approve" at 2.75 (post T) โ 2.75/1.5 = 1.833.
Behavior: Approve still wins sometimes, but the race between approve, reject, and review is now genuinely competitive. You'll see much more variety in short sequences. The "winner" of each generation step changes more frequently.
Entropy: High. The experiment's example output shows approve at ~15.6% โ it went from being dominant (44%) to being just one of many roughly-equal options.
When to use: Story generation, creative writing where you want linguistic variety. Also useful if you've observed looping in production and need to break it.
Penalty = 2.0 (Extreme)¶
Very strong discouragement. "Approve" at 2.75 โ 2.75/2.0 = 1.375.
Behavior: Approve's probability has roughly halved. The model now actively seeks alternatives. You'll see a much wider range of tokens in the output. But here's the risk: even necessary repetitions get penalized. If the correct output truly requires saying "approve" multiple times (e.g., "The system should approve all, and if approved, report as approved"), the model will try to rephrase awkwardly.
Entropy: Very high, often higher than the "unconstrained" entropy you'd see from temperature alone. The penalty can temporarily make the distribution more uniform than even a high temperature would produce.
When to use: Rarely in production. Useful for stress testing, or when you've verified that your specific task has zero tolerance for any repetition (e.g., generating a list of strictly unique items, where each item must be different from all previous ones).
High-penalty risk
Strong penalties can suppress necessary function words and damage fluency.
Step 5: Frequency vs Presence Penalty โ The Two API Flavors¶
Many APIs (including OpenAI's) expose two different penalty types. They sound similar but have meaningfully different behavior.
Presence Penalty¶
Rule: If a token has appeared AT ALL in the lookback window,
apply a fixed logit deduction.
Effect: binary โ either penalized (appeared) or not (hasn't appeared)
Formula (additive version):
if token in seen_tokens:
logit -= presence_penalty_value
What it produces: Equal pressure on all seen tokens, regardless of how many times they've appeared. Token appearing once = same penalty as token appearing ten times. The penalty is about existence in the window, not frequency.
When it matters: When you want to discourage any repetition of words you've already used, but you don't care whether a word appeared once or many times. Good for diversity of vocabulary.
Frequency Penalty¶
Rule: The more times a token has appeared, the larger its penalty.
Effect: proportional to count โ tokens that repeat many times get penalized more
Formula (additive version):
if token in seen_tokens:
logit -= frequency_penalty_value * count_of_appearances
What it produces: Escalating pressure. A token appearing once gets a small penalty. The same token appearing five times gets 5x the penalty. The model is progressively discouraged from continuing a repetition pattern once it starts.
When it matters: When you want to allow occasional reuse of common words but strongly prevent runaway repetition loops. Good for breaking "the the the the" patterns because the penalty compounds with each additional repetition.
Side-by-Side Comparison¶
Presence vs frequency
Presence answers "seen or unseen"; frequency answers "how often seen".
Imagine the lookback window contains: [approve, reject, approve, approve, review, approve]
"Approve" has appeared 4 times. "Reject" has appeared 1 time. "Review" has appeared 1 time.
With presence penalty = 0.5 (additive):
approve: logit -= 0.5 (appeared, flat deduction)
reject: logit -= 0.5 (appeared, same deduction despite fewer occurrences)
review: logit -= 0.5 (appeared, same deduction)
Others: no change
With frequency penalty = 0.5 (additive):
approve: logit -= 0.5 * 4 = 2.0 (4 appearances, heavy penalty)
reject: logit -= 0.5 * 1 = 0.5 (1 appearance, light penalty)
review: logit -= 0.5 * 1 = 0.5 (1 appearance, light penalty)
Others: no change
Which to Use When¶
| Situation | Use Presence Penalty | Use Frequency Penalty |
|---|---|---|
| Generate diverse vocabulary | โ | |
| Break runaway loops | โ | |
| Even treatment of all seen tokens | โ | |
| Escalating pressure on repeat offenders | โ | |
| Output is a list (each item must be unique) | โ | |
| Dialogue (some words legitimately repeat) | โ |
Step 6: Read the Graphs โ Panel by Panel¶
The graph exp5_repetition.png has 9 panels across 3 rows. Here is how to read each one.
Graph reading strategy
Start with row 2 trends (penalty-response curves), then validate with row 1 before/after bars.
Row 1, Panels 1โ3: "Before and After Penalty" bar charts¶
Three bar charts showing the distribution before and after applying the penalty to "approve" and another token, for three penalty strengths (1.0, 1.3, 2.0).
What you're looking at: - X-axis: token index (0โ14) - Y-axis: probability - Dark/grey bars: original distribution (before penalty) - Colored bars: distribution after penalty applied - Red-shaded vertical bands at the penalized token positions - Annotation "penalized" with an arrow pointing to the penalized token's post-penalty bar
How to read them:
Panel 1 (penalty=1.0): The two sets of bars are identical. No change. This is your visual baseline.
Panel 2 (penalty=1.3): The penalized tokens' bars (under the red shading) are shorter in the colored vs grey version. All other tokens' bars are slightly taller in the colored version. The redistribution is visible if you look carefully โ the gain is spread across many tokens, so each gains a little.
Panel 3 (penalty=2.0): The penalized tokens' bars have dropped substantially. The surrounding tokens have noticeably taller bars in the colored version. The redistribution is clearly visible.
What to look for:
- How much did the penalized token's bar shrink? That's the direct effect.
- Did the token maintain its rank? (It should still be the tallest bar overall at penalty=1.3 and probably at 2.0 too โ just shorter.)
- Which tokens benefited most from the redistribution? They're the ones that grew the most.
Row 2, Left + Middle: "Penalized Token Probability vs Penalty Strength"¶
A line chart showing how the probability of each penalized token changes as penalty increases from 1.0 to 3.0.
What you're looking at: - X-axis: penalty value (1.0 to 3.0) - Y-axis: probability of the penalized token after applying that penalty - Pink line: approve (high-ranked token, positive logit) - Yellow line: audit or another token (lower-ranked token) - Horizontal dashed lines: the original probabilities (before any penalty) - Vertical dotted line at penalty=1.0: the baseline (no penalty)
How to read it:
Both lines start at the left (penalty=1.0) at their original probability values. As you move right (higher penalty), both lines drop.
The pink line (approve, high probability) drops faster in absolute terms โ it has more probability to lose because it starts higher. But it takes a large penalty to make it fall below the second-ranked token.
The yellow line (lower-probability token) drops more slowly in absolute terms but may drop faster as a fraction of its original value โ because its logit is smaller, dividing by the penalty has a proportionally larger effect on its already-compressed logit.
The "crossover point": If you extend the lines far enough, you might see them cross โ meaning the originally-higher token gets penalized below the originally-lower token. This crossover indicates the penalty has become strong enough to invert the natural ranking. This is usually the "too strong" zone.
The flattening zone: Both lines eventually flatten and stop dropping much even as penalty increases further. This is because at very high penalties, the penalized tokens' logits become so negative that they're already near-zero probability โ further penalty has diminishing returns.
Row 2, Right: "Logit Reduction for Penalized Token"¶
A line chart showing the logit change (raw logit shift, not probability shift) for the penalized token.
What you're looking at: - X-axis: penalty value - Y-axis: change in logit value (penalty=1.0 is the reference; values are negative because the logit is being reduced) - Purple line with shaded area below - Horizontal dashed line at zero: the no-penalty baseline
How to read it:
The line starts at zero (penalty=1.0, no change) and drops steadily as penalty increases. The drop is not linear โ it's faster at low penalty values (going from 1.0 to 1.5 drops the logit more than going from 2.5 to 3.0).
Why the logit change matters more than the probability change:
Logit changes are what the model "feels" internally. A logit drop of 1.0 always means the same thing in terms of relative preference, regardless of where the logit started. A probability drop from 40% to 30% means something different if the token started at 41% vs if it started at 80%.
The logit reduction chart shows the raw mechanism. The probability chart (row 2, left) shows the consequence. Both together give you the complete picture.
Row 3, Panels 1โ2: "Presence Penalty" and "Frequency Penalty" comparison charts¶
Two grouped bar charts, one for each penalty type, showing how three tokens are affected at three penalty strengths (0.5, 1.0, 1.5).
What you're looking at: - X-axis: three tokens, labeled with how many times each appeared in the window (e.g., "Token 0 (seen 1x)", "Token 1 (seen 3x)", "Token 2 (seen 5x)") - Y-axis: probability after applying the penalty - Three bars per token: one for each penalty strength (lighter shade = weaker penalty, darker = stronger)
How to read the Presence Penalty panel:
All three tokens get the same penalty deduction regardless of how many times they appeared. The bars for "seen 1x", "seen 3x", and "seen 5x" tokens all drop by the same amount for a given penalty strength.
At penalty=1.5 (darkest bars): all three tokens' bars are at roughly the same height relative to their starting values. Equal treatment.
How to read the Frequency Penalty panel:
The token appearing 5x gets the largest penalty (longest downward shift). The token appearing 1x gets the smallest penalty. The bars show staggered heights โ "seen 5x" is shortest, "seen 1x" is tallest, at any given penalty strength.
At penalty=1.5: the token seen 5x has a dramatically lower bar than the token seen 1x. The penalty has compounded.
The visual comparison:
The key visual difference between the two panels: in the Presence panel, bars within each group are nearly the same height (equal treatment). In the Frequency panel, bars within each group form a clear staircase pattern (escalating treatment). That staircase pattern is the "frequency" in frequency penalty.
Fast visual check
Staircase pattern means frequency penalty is behaving correctly.
Row 3, Right: "Repetition Penalty Guide" table¶
A reference table showing penalty values and their effects.
How to read it:
Each row is a penalty value range. The effect column describes what you'd observe in practice. The last row (>1.8) is highlighted in red to signal that this is the danger zone.
Use this as a quick lookup when calibrating penalty in the field.
Step 7: How to Read the Sample Output Text¶
=== No penalty ===
Entropy: 1.234
approve 0.287
reject 0.201
review 0.143
escalate 0.099
delay 0.067
...
Reading guide:
| What you see | What it means |
|---|---|
Entropy: 1.234 |
Moderate โ a few tokens dominate but others are viable |
approve 0.287 |
28.7% โ highest probability token at T=0.8 |
| Regular decline down the list | Natural distribution shape from temperature |
| No hard zeros | No top-k or top-p filtering in this setting |
=== Heavy penalty (1.5x) ===
Entropy: 2.012
approve 0.156
reject 0.155
review 0.143
escalate 0.134
delay 0.121
audit 0.109
...
Reading guide:
| What you see | What it means |
|---|---|
Entropy: 2.012 |
Much higher โ distribution is now much flatter |
approve 0.156 |
Down from 28.7% โ penalty significantly reduced it |
reject 0.155 |
Nearly tied with approve โ tokens are close to equal |
review 0.143 |
Still declining but slowly โ near-uniform distribution |
| Very small drops between tokens | Penalty flattened the distribution |
Notice the approve-reject gap: at no penalty, approve had 8.6 percentage points more than reject (28.7% vs 20.1%). At heavy penalty, the gap is 0.1 percentage points (15.6% vs 15.5%). The penalty erased nearly all of approve's advantage โ not because approve is bad, but because it appeared in the lookback window.
Step 8: The "Too Strong" Failure Mode โ How Coherence Breaks Down¶
Here is the scenario that demonstrates why penalty > 1.8 is marked as dangerous:
Imagine generating a customer service response. The model needs to say "the customer" multiple times:
"Thank you for contacting us. The customer's order has been flagged. The customer will receive a refund. Please tell the customer..."
At penalty=1.3: "the" and "customer" get mild discouragement after first use. The model might slightly vary phrasing โ "the order has been flagged. You will receive a refund. Please advise the customer..." โ which is natural and actually better writing.
At penalty=1.8: "the" has appeared many times and faces a heavy penalty. The model starts actively avoiding it. Output degrades: "Your order has been flagged. A refund will arrive. Please advise regarding..." โ grammatically awkward, stilted.
At penalty=2.5: Even common function words like "a", "is", "the", "to" get heavily penalized after their first use. The model can no longer form coherent sentences because the most common words in any language are always appearing in the lookback window.
The lesson: Repetition penalty doesn't know which repetitions are intentional and which are looping failures. It applies the same pressure to "approve approve approve" (bad loop) and "the the the" (also bad) as it does to "The president of the company said the policy..." (perfectly natural reuse of "the"). High penalties create unnatural avoidance of necessary repetitions.
Semantic blindness
Penalties act on token occurrence patterns, not linguistic intent.
Step 9: The Lookback Window in Practice¶
Three tasks, three different lookback window choices:
Code Generation (lookback = 32)¶
for i in range(n): # 'for' at position 1
result = process(i) # 'i' at position 4
if result: # 'if' at position 7
for j in range(m): # 'for' again at position 10
If for has a long lookback window, it gets penalized when it appears at position 10. But for is syntactically necessary here โ it's not a loop caused by the model getting confused. Short-to-medium lookback (32 tokens) strikes a balance: prevents the model from writing for for for in a broken loop while still allowing for to appear legitimately a few lines later.
Story Generation (lookback = 128)¶
In a narrative, you want characters' names, key objects, and thematic words to recur across paragraphs. But you don't want the model to repeat entire phrases. A long lookback window (128 tokens) catches paragraph-level repetition while allowing tokens from several paragraphs ago to be used again.
Chatbot (lookback = 64)¶
Conversational responses tend to be 50โ150 tokens long. A 64-token lookback keeps the current response diverse internally while still allowing common words to recur naturally across turns.
Step 10: The "Adjacent Repeats" Diagnostic¶
The experiment suggests this diagnostic code:
def count_adjacent_repeats(generated_ids):
repeats = 0
for i in range(1, len(generated_ids)):
if generated_ids[i] == generated_ids[i - 1]:
repeats += 1
return repeats
This counts how often consecutive tokens are identical โ the "the the the" type of loop.
How to use it to calibrate penalty:
- Run your model with penalty=1.0 and count adjacent repeats over 1000 samples
- Try penalty=1.1 โ did adjacent repeats drop significantly?
- Try penalty=1.3 โ how much further did repeats drop?
- Plot penalty vs adjacent-repeat count
- Find the "elbow" โ where adding more penalty stops reducing repeats. That's your sweet spot.
What "elbow" means here:
Going from 1.0 to 1.1 might cut adjacent repeats in half. Going from 1.1 to 1.3 might cut them in half again. Going from 1.5 to 1.8 might reduce them by only 5%. The 1.5โ1.8 range is diminishing returns โ you're not getting fewer loops but you are getting more incoherence. The elbow at ~1.5 is where you stop.
Calibration rule
Stop increasing penalty once repeat reduction flattens but fluency cost rises.
Step 11: Common Misconceptions โ Cleared Up¶
Common failure source
Loop fixes often overcorrect because teams tune penalty before diagnosing repetition type.
Misconception 1: "Repetition penalty stops the model from repeating tokens"
No. It discourages repetition. The model can still output the same token multiple times. At penalty=2.0, if "approve" genuinely has the highest logit by a wide margin, it can still win the sampling lottery multiple times in a row โ just less often than without the penalty.
Misconception 2: "Higher penalty = better output quality"
Only up to a point. Too high a penalty forces the model to avoid words it needs to use naturally. The relationship between penalty and quality is an inverted U โ quality improves as you add penalty (reducing loops) then falls as penalty becomes excessive (forcing unnatural avoidance). The peak of the U depends on your task.
Misconception 3: "Repetition penalty and temperature do the same thing"
They're completely different mechanisms. Temperature changes all tokens' probabilities globally (reshaping the whole distribution). Repetition penalty changes only the specific tokens that have appeared in the lookback window, leaving all other tokens unchanged. They also operate at different levels: temperature is a single-step operation; repetition penalty is a history-dependent operation.
Misconception 4: "Presence and frequency penalty are just different names for the same thing"
No. Presence penalty is binary โ any seen token gets a flat deduction. Frequency penalty is proportional โ tokens seen more times get larger deductions. For a token seen 5 times, frequency penalty applies 5x the pressure of presence penalty. Their behavior diverges significantly once any token appears more than once.
Misconception 5: "I should always use repetition penalty"
Only if your task has a repetition problem. For tasks like code generation, mathematics, or factual extraction, many tokens legitimately need to appear multiple times. Adding repetition penalty to these tasks can actively hurt output quality by discouraging necessary repetitions. Always diagnose first: if you're not seeing loops, don't add penalty.
Step 12: Practical Decision Guide¶
Step 1: Do you observe looping or repetitive output?
NO โ Don't add repetition penalty; it won't help and may hurt
YES โ Continue to Step 2
Step 2: What type of repetition are you seeing?
Adjacent ("the the the"):
โ Start with frequency penalty, strength 1.1โ1.3
โ Short lookback (16โ32 tokens) is enough
Phrase-level (same sentence repeating every 5 tokens):
โ Moderate penalty (1.3โ1.5), lookback 32โ64
Paragraph-level (same idea repeated in different words):
โ Light presence penalty (1.1โ1.2), long lookback (128+)
โ Also consider checking temperature โ too-low temperature causes this type
Step 3: Start conservative, measure, increase only if needed
โ Penalty 1.1 first
โ Measure adjacent repeats and output quality
โ If loops persist, increase to 1.3
โ If quality degrades at any point, back down one step
Step 4: Never exceed 1.5 for production without manual review
โ Test outputs at 1.5+ carefully before deploying
โ Watch for: awkward phrasing, missing function words, stilted tone
Task-Level Settings Reference¶
| Task | Penalty | Lookback | Type | Notes |
|---|---|---|---|---|
| Classification | 1.0 | N/A | N/A | No repetition issue; penalty not needed |
| Factual QA | 1.0โ1.1 | 32 | Presence | Key terms legitimately repeat |
| Summarization | 1.0โ1.1 | 64 | Frequency | Important nouns must recur; gentle penalty |
| Code generation | 1.1 | 32 | Frequency | Syntax tokens repeat by design; light touch |
| Chatbot | 1.2 | 64 | Frequency | Conversational variety, allow common words |
| Creative writing | 1.15โ1.2 | 128 | Presence | Encourage vocabulary diversity across paragraphs |
| Story generation (long) | 1.2โ1.3 | 128โ256 | Frequency | Longer memory, moderate loop prevention |
| Brainstorming | 1.3โ1.5 | 64 | Presence | Want distinct ideas, strong diversity pressure |
Quick Reference: Graph Interpretation Cheat Sheet¶
When you open exp5_repetition.png, scan in this order:
1. Look at the before/after bar charts (row 1)
โ Find the red-shaded bands โ those are the penalized tokens
โ Compare colored vs grey bars: how much did penalized tokens shrink?
โ Did they maintain their rank (still tallest), or did other tokens surpass them?
โ Which tokens grew the most from redistribution?
2. Look at the "Penalized Token Probability" line chart (row 2, left+middle)
โ Read the starting probability for each line (at penalty=1.0)
โ How steeply does each line drop?
โ Is there a crossover point where one line falls below another?
โ Where does each line flatten (diminishing returns zone)?
3. Look at the "Logit Reduction" chart (row 2, right)
โ The logit chart shows the internal mechanism; probability chart shows the result
โ Fast early drop + slow later flattening = diminishing returns pattern
โ The shape of this curve tells you how "efficient" each increment of penalty is
4. Look at the Presence vs Frequency comparison (row 3, panels 1โ2)
โ In the Presence panel: are the bars within each token group roughly equal?
โ In the Frequency panel: does the staircase pattern appear (more appearances = shorter bar)?
โ The degree of staircase steepness shows how aggressively frequency penalty escalates
5. Look at the guide table (row 3, right)
โ Check which penalty range your current setting falls into
โ Is the red warning row visible? (>1.8 zone)
Summary¶
| Concept | One-sentence explanation |
|---|---|
| Repetition penalty | Modifies logits of recently-seen tokens to make them less likely (not impossible) to be sampled again |
| Penalty = 1.0 | No change โ the baseline, as if the parameter doesn't exist |
| Positive logit rule | Positive logits get divided by the penalty, making them smaller |
| Negative logit rule | Negative logits get multiplied by the penalty, making them more negative |
| Why asymmetry | Without it, dividing a negative logit would make it less negative, accidentally increasing its probability |
| Lookback window | Only tokens within the last N positions are penalized; older tokens return to normal |
| Presence penalty | Binary โ any seen token gets the same flat deduction |
| Frequency penalty | Proportional โ tokens seen more times get larger deductions |
| Too strong | Penalties above 1.5โ1.8 start penalizing necessary words, causing incoherence |
| Discourages not forbids | Even heavily penalized tokens can still be sampled; it's pressure, not prohibition |
Final takeaway: Repetition penalty is a diagnostic tool, not a default setting. Run without it first. If you observe loops, add it conservatively starting at 1.1 and measure both loop frequency and output coherence. The goal is the minimum penalty that breaks the loop โ not the maximum penalty your system will accept.
Bottom line
Use the smallest effective penalty and verify both repeat rate and readability.
Part of the 6-Experiment LLM Parameter Learning Path
Previous: Experiment 4 โ Combined Filters
Next: Experiment 6 โ Real Use Cases