Model Produces Chaotic / Repetitive Output When `top_k` Is Increased: How to Fix This

I’ve been fine-tuning a domain-specific LLM, and I’ve encountered an issue that’s quite confusing. I’d like to ask if anyone has run into something similar.

:backhand_index_pointing_right: Issue Description

During inference, I set top_k = 20 because I want the model to have some creativity and flexibility when generating domain-specific answers. However, for certain technical questions, the second half of the response suddenly becomes repetitive or chaotic.

Typical failure patterns include:

  • Repetitive symbols like “???..”
  • Meaningless sequences like “| | | | | …”
  • Sudden loss of structure, interruptions, or self-contradiction
  • Obviously unstable decoding

The first part of the output is usually normal, and the problem tends to appear in the later portion of long responses.


:red_question_mark: I would like to ask a few questions:

1. Is top_k = 20 reasonable? Is it too high or too low?

I know that top_k controls the candidate token pool for sampling, but different models recommend different ranges. I’m not sure whether 20 is considered normal for a domain-specific model.


2. Why does a fine-tuned domain model become unstable at higher top_k values?

Especially issues like:

  • Repeated symbols (???? or ||||)
  • Garbled / corrupted characters
  • Long responses collapsing midway

3. Is there a way to fix this without lowering top_k?

I want the model to retain a certain level of creativity, so I prefer not to make top_k too small.

Is there any method to reduce chaotic generation while still keeping diversity?


:pushpin: Summary

My main questions are:

  • Is top_k = 20 a reasonable setting?
  • What causes long-form answers to become chaotic in fine-tuned models under higher top_k?
  • How to stabilize generation quality without reducing top_k?

Any experiences, insights, or suggestions would be greatly appreciated.
Thank you!


2 Likes

My model used to produce those cursed |||| lines like it was trying to build a fence. I fixed it by mixing in a tiny bit of base data during fine-tuning so the distribution feels less brittle when sampling gets wider.

2 Likes

I don’t think top_k is the cause in this case. It might be the trigger, though…


I’ll go through your three questions one by one, but with broader context so you can see why this happens and what concrete knobs you can turn.


0. Short answers first

  1. Is top_k = 20 reasonable?
    Yes. In absolute terms top_k = 20 is very normal and actually smaller than common “creative” defaults like top_k = 50 in Hugging Face examples and many model cards. (arXiv)
    The problem is less the number 20 itself and more how it interacts with your model, temperature, top_p, and long outputs.

  2. Why does the second half of long answers become chaotic under higher top_k?
    Because for long generations the model’s probability distribution over next tokens becomes flatter and less reliable. With a relatively large k, the sampler starts pulling from the “unreliable tail” of the distribution, which Holtzman et al. show is exactly where degenerate text (repetition, junk tokens) lives. (arXiv)
    Basu et al. (Mirostat) formalize this as the “confusion trap”: for large k/p, perplexity grows with length and coherence collapses. (Hugging Face)

  3. How to keep diversity without just lowering top_k?
    Treat top_k as a cap and move the real “creativity control” to:

    • Temperature and top_p / typical decoding / Mirostat
    • Repetition controls: no_repeat_ngram_size, repetition_penalty, custom logit penalties for junk tokens (ACL Anthology)
    • Better EOS + stopping logic and, if needed, some light data / fine-tune cleanup

    You can keep top_k = 20 (or even higher) and still stabilize generations if you shape the distribution and force the model out of repetition loops.

Now the long version.


1. Background: what top_k really does, and whether “20” is normal

1.1 Quick mental model of top_k

At each generation step:

  1. Sort all tokens by probability.
  2. Keep only the top k tokens; set all others to probability 0.
  3. Renormalize and sample from those k.
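
In code, that filtering step looks roughly like the sketch below (plain PyTorch, independent of any generation framework; the function and tensor names are just for illustration):

    import torch

    def top_k_sample(logits: torch.Tensor, k: int = 20) -> int:
        """Sample one token id from the top-k-filtered distribution."""
        values, indices = torch.topk(logits, k)           # keep the k highest logits
        probs = torch.softmax(values, dim=-1)             # renormalize over those k
        choice = torch.multinomial(probs, num_samples=1)  # sample within the top-k set
        return indices[choice].item()                     # map back to a vocabulary id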

So:

  • Small k

    • Only very likely tokens are considered.
    • Output tends to be safe, conservative, and sometimes repetitive (few alternatives).
  • Larger k

    • You allow more candidates, including lower-probability tokens.
    • More stylistic variety and creativity, but you start touching the noisy tail of the distribution if you go too far. Holtzman et al. call this tail “unreliable” and show it is where repetition and incoherence live. (arXiv)

top_k is not inherently “high” or “low”; it depends on:

  • how peaked the distribution is (your model + your context), and
  • what you do with temperature and top_p.

1.2 Context: what do people actually use?

If you scan mainstream tutorials and model cards:

  • Hugging Face “How to generate text” shows examples like:

    sample_outputs = model.generate(
        **model_inputs,
        max_new_tokens=40,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=3,
    )
    

    (arXiv)

  • The official generation strategies docs and LLM tutorial show top_k=50 in their standard sampling configuration. (Hugging Face)

  • Many open models (e.g. BLOOM, Gemma, Japanese GPT-2 variants, Phi-4, Plamo) recommend or default to top_k=50, top_p≈0.9–0.95, temperature≈0.7–1.0. (GitHub)

So, relative to that ecosystem:

  • top_k=20 is perfectly reasonable and actually conservative.

  • A model that falls apart at top_k=20 is usually telling you something about:

    • its tail calibration,
    • fine-tune side effects, or
    • your other decoding settings,

    not that “20 is too big” in isolation.

2. Why long, fine-tuned answers blow up in the second half

Let’s connect your symptoms to what’s known in the literature.

2.1 The “unreliable tail” and neural text degeneration

Holtzman et al., The Curious Case of Neural Text Degeneration, explicitly study why LMs produce degenerate text under common decoding schemes: (arXiv)

  • bland, repetitive text under greedy/beam search, and
  • incoherent, repetitive, or junk tokens under unconstrained sampling.

Key points:

  • The LM’s probability mass at a step is concentrated in a small “nucleus” of reasonable tokens; the rest is a long tail of low-probability tokens.

  • Sampling too far into that tail (e.g. using too large k or very high temperature) is exactly what causes:

    • repetitions,
    • nonsense tokens, and
    • degraded coherence as sequences get longer.

They propose nucleus (top-p) sampling as a fix: instead of top-k, choose the smallest set of tokens whose probabilities sum to a given p (for example 0.9), then sample from that set. This caps how much of the tail you ever touch.
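
For intuition, the nucleus filtering step itself can be sketched like this (PyTorch, not tied to any particular library; a simplified version of the usual implementation):

    import torch

    def top_p_filter(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
        """Mask out everything outside the smallest set of tokens whose
        cumulative probability reaches p (the 'nucleus')."""
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > p
        remove[..., 1:] = remove[..., :-1].clone()  # shift right: keep the token that crosses p
        remove[..., 0] = False                      # always keep the single most likely token
        sorted_logits[remove] = float("-inf")
        return logits.scatter(dim=-1, index=sorted_idx, src=sorted_logits)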

In your case:

  • Early in the answer, the distribution is sharp and the top 20 tokens are all reasonable.
  • Later, as the context gets longer and the model becomes less certain, the distribution over tokens flattens. Now the top 20 contain more garbage options and odd punctuation, so top_k=20 effectively digs into the “unreliable tail”.

The result looks exactly like their degeneration examples: long answers whose second half becomes nonsense or repetitive.

2.2 Boredom trap vs confusion trap (Mirostat perspective)

Basu et al., in the Mirostat paper, analyze how perplexity evolves over time under different sampling regimes: (Hugging Face)

  • For small k/p:

    • Perplexity drops as you generate more tokens.
    • Text gets extremely predictable and repetitive → boredom trap.
  • For large k/p:

    • Perplexity increases as you generate more tokens.
    • Text becomes chaotic and incoherent over time → confusion trap.
  • For moderate settings there’s a sweet spot where perplexity stays in a human-like range.

Your observation:

The first half is usually normal, but the second half of long answers becomes repetitive or chaotic.

Maps almost perfectly to the confusion trap:

  1. Long sequence → distribution flattens.
  2. top_k + temperature/top_p combine to let perplexity drift upward.
  3. The model starts sampling tokens from increasingly noisy regions → gibberish symbols, structure collapse.

Mirostat was designed specifically to avoid this by adjusting an effective k step-by-step to keep perplexity near a target. (Hugging Face)
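
Very roughly, a Mirostat-v2-style update loop looks like the sketch below (a simplified illustration of the idea, not any library’s exact implementation; mu is usually initialized to 2 * tau and carried across steps):

    import math
    import torch

    def mirostat_v2_step(logits: torch.Tensor, mu: float, tau: float = 5.0, eta: float = 0.1):
        """One sampling step: cap token 'surprise' at mu, sample,
        then nudge mu so the running surprise tracks the target tau."""
        probs = torch.softmax(logits, dim=-1)
        surprise = -torch.log2(probs)                  # per-token surprise in bits
        allowed = surprise <= mu                       # drop the high-surprise tail
        if not allowed.any():
            allowed[probs.argmax()] = True             # safety: always keep the best token
        filtered = torch.where(allowed, probs, torch.zeros_like(probs))
        token = torch.multinomial(filtered / filtered.sum(), num_samples=1).item()
        observed = -math.log2(probs[token].item())     # surprise of the token actually picked
        mu = mu - eta * (observed - tau)               # feedback toward the target
        return token, mu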

2.3 The Repeat Curse: why “???” and |||| loops appear

The Repeat Curse paper (Yao et al., 2025) looks at repetition via mechanistic interpretability: (Sascha Metzger)

  • They identify specific “repetition features” in the network that, when activated, push the model to keep outputting similar tokens.
  • These features get triggered more easily under prompts or contexts that already contain repetition.

Translated into your symptoms:

  • Once the model starts emitting ? or | repeatedly, you’ve probably activated these internal repetition features.
  • From that point, the model is dynamically biased toward continuing that pattern on subsequent steps.
  • Without explicit decoding constraints, it will happily continue ???????????? until max_new_tokens is reached.

So the repeated punctuation is not just random noise; it is often a stable attractor in the network’s dynamics, especially under high-entropy sampling.

2.4 Fine-tuning amplifies tail and repetition issues

Fine-tuning a base LM on a domain dataset does great things near the modes of the domain distribution, but it often makes the tail worse:

  • The model is now sharper and more confident on domain-typical continuations, but
  • More fragile when you push it into under-represented contexts or very long generations.

Recent work like “Repetition in Repetition Out” (Rep-Dropout) explicitly shows that more repetition in the fine-tuning data → more degeneration and repetition at generation time. (Hugging Face)

Other fine-tuning surveys and guides also highlight that small or skewed fine-tuning sets can cause overfitting and catastrophic forgetting, which manifest as brittle behavior outside the narrow target regime. (ACL Anthology)

Put together:

  • If your domain data includes many templates, repeated symbols, tables, or weird formatting, fine-tuning may have:

    • increased the probability of those “symbolic” subsequences,
    • and made the model more likely to fall into those patterns once the distribution gets noisy.

This explains why your domain-fine-tuned model degrades under larger top_k while the base model might still be stable under similar settings.

2.5 Long outputs + imperfect stopping

A practical but important factor:

  • If your generation is ending by hitting max_new_tokens instead of EOS, the model is being forced to keep sampling even when it “wants” to stop.
  • The longer you force it to continue beyond its natural stopping point, the more you push it into regions of high uncertainty and, therefore, into the degeneration regimes above.

The HF generation docs emphasize setting eos_token_id and possibly explicit stop criteria in GenerationConfig. (OpenReview)

If your fine-tuning data doesn’t consistently teach an EOS pattern (or your inference config ignores it), then:

  • The model has no strong “stop here” attractor,
  • So high-entropy sampling will eventually wander into ???? or |||| or similar junk patterns.

3. How to stabilize generation without simply lowering top_k

You want to keep some diversity and creativity. The good news is: you can, but you should shift which knobs you rely on.

3.1 Treat top_k as a cap, not the main creativity knob

Instead of:

“I want more creativity → increase top_k.”

Think:

“I’ll keep top_k as a reasonable cap (20–50), and use temperature + top_p / typical / Mirostat to shape creativity.”

Concrete ideas:

  1. Use top-p (nucleus) sampling with moderate p

    • For example: top_p ≈ 0.9–0.95, temperature ≈ 0.5–0.8, top_k = 20–50.
    • This is exactly what Holtzman et al. propose (nucleus sampling), and what many HF examples use. (arXiv)
  2. Try typical decoding

    • Typical decoding selects tokens whose information content is near the conditional entropy, avoiding both over-predictable and extremely unlikely tokens. (Facebook)
    • In practice, typical_p ≈ 0.9–0.95, moderate temperature, and top_k as a fallback cap often yields more stable long outputs than pure top-k.
  3. If your stack supports it, try Mirostat

    • Mirostat keeps perplexity approximately constant over the sequence, avoiding both boredom and confusion traps. (Hugging Face)
    • Implementations exist in llama.cpp, text-generation-webui, and other local-LLM tools, with parameters like mirostat_mode, mirostat_tau, mirostat_eta. (fast.ai Course Forums)

    Rough starting point (if supported):

    • mirostat_mode = 2, mirostat_tau ≈ 5–7, mirostat_eta ≈ 0.1

You can keep top_k = 20 in all of these; the key is that you’re no longer relying on k alone to implement “creativity”.

3.2 Add decoding-time repetition controls

Since you are literally seeing ???? and ||||, direct repetition control is appropriate.

Hugging Face’s generate supports: (ACL Anthology)

  1. no_repeat_ngram_size

    • Ensures that no n-gram of the given size appears more than once in the generated sequence.

    • Example:

      outputs = model.generate(
          **inputs,
          do_sample=True,
          top_k=20,
          top_p=0.9,
          temperature=0.7,
          no_repeat_ngram_size=4,
      )
      
    • This is very effective for preventing long copy-paste loops.

  2. repetition_penalty

    • Rescales the logits of previously generated tokens to discourage reusing them.
    • Values slightly above 1 (e.g. 1.05–1.15) often help without harming domain terms. (ACM Digital Library)
  3. Custom logit processors for junk tokens

    • You can implement a LogitsProcessor that detects characters like '?', '|', '_' being overused and down-weights them after some threshold.
    • HF’s generation_utils provides the base hooks and examples (e.g. n-gram processors) you can adapt. (ACL Anthology)
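
    • A sketch of such a processor (the penalty rule is hypothetical, just to illustrate the hook; pass it via the logits_processor argument of generate inside a LogitsProcessorList):

      from transformers import LogitsProcessor

      class JunkRunPenalty(LogitsProcessor):
          """Down-weight a set of 'junk' token ids once they start
          repeating at the end of the generated sequence."""

          def __init__(self, junk_token_ids, max_run: int = 3, penalty: float = 5.0):
              self.junk = set(junk_token_ids)
              self.max_run = max_run
              self.penalty = penalty

          def __call__(self, input_ids, scores):
              for batch_idx, seq in enumerate(input_ids.tolist()):
                  run = 0
                  for tok in reversed(seq):       # count trailing junk tokens
                      if tok in self.junk:
                          run += 1
                      else:
                          break
                  if run >= self.max_run:
                      for tok in self.junk:       # flat logit penalty on all junk ids
                          scores[batch_idx, tok] -= self.penalty
              return scores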

These do not force you to lower top_k. They reshape the effective distribution to avoid repetition attractors.

3.3 Strengthen EOS and stopping logic

To avoid “junk at the tail because we hit max length”:

  • Ensure your fine-tuning data consistently includes a clear EOS token or pattern.

  • In inference:

    • Set eos_token_id properly in GenerationConfig. (OpenReview)

    • Prefer to stop on EOS or known stop sequences (e.g. </s>, </assistant>) rather than only on max_new_tokens.

    • For extremely long answers, consider:

      • generating in chunks (e.g. paragraph by paragraph), or
      • generating a plan first, then generating each section separately.

This reduces the time the model spends in the “very long, very uncertain” regime where top_k interacts badly with flat distributions.
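
If EOS alone isn’t reliable for your fine-tune, a custom stopping criterion can act as a backstop. A sketch, assuming a standard transformers setup (the stop string is just an example):

    import torch
    from transformers import StoppingCriteria, StoppingCriteriaList

    class StopOnSubstring(StoppingCriteria):
        """Stop generation once a given substring appears in the decoded continuation."""

        def __init__(self, tokenizer, stop_text: str, prompt_len: int):
            self.tokenizer = tokenizer
            self.stop_text = stop_text
            self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

        def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
            text = self.tokenizer.decode(input_ids[0, self.prompt_len:])
            return self.stop_text in text

    # usage: eos_token_id still set, the criterion is only a fallback
    # outputs = model.generate(
    #     **inputs,
    #     eos_token_id=tokenizer.eos_token_id,
    #     stopping_criteria=StoppingCriteriaList(
    #         [StopOnSubstring(tokenizer, "</s>", inputs["input_ids"].shape[1])]
    #     ),
    # )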

3.4 Prompt / pipeline structure: get creativity early, then lock in

A powerful pattern that keeps diversity without unstable tails:

  1. Step 1: high-entropy planning

    • Ask the model to generate a short plan or outline of the answer with slightly higher temperature/top_p.
    • Keep max_new_tokens small here (e.g. 64–128).
    • Example knobs: temperature=0.8, top_p=0.95, top_k=20.
  2. Step 2: low-entropy execution

    • Feed plan + original question back in and request a detailed answer.
    • Use more conservative decoding: temperature≈0.4–0.6, moderate top_p, same top_k=20, repetition controls.

This way, “creativity” lives in the ideas (the plan), not in pushing the token sampler into the unstable tail for thousands of tokens.
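
A minimal sketch of that two-step pattern (the prompt strings are placeholders, and question, model, and tokenizer are assumed to exist):

    # Step 1: short, higher-entropy plan
    plan_inputs = tokenizer(
        "Question: " + question + "\nWrite a short outline of the answer:",
        return_tensors="pt",
    ).to(model.device)
    plan_ids = model.generate(**plan_inputs, do_sample=True, temperature=0.8,
                              top_p=0.95, top_k=20, max_new_tokens=128)
    plan = tokenizer.decode(plan_ids[0, plan_inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

    # Step 2: long, lower-entropy answer conditioned on the plan
    answer_inputs = tokenizer(
        "Question: " + question + "\nOutline:\n" + plan + "\nDetailed answer:",
        return_tensors="pt",
    ).to(model.device)
    answer_ids = model.generate(**answer_inputs, do_sample=True, temperature=0.5,
                                top_p=0.9, top_k=20, max_new_tokens=512,
                                no_repeat_ngram_size=4, repetition_penalty=1.05)
    answer = tokenizer.decode(answer_ids[0, answer_inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)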

3.5 Data / fine-tune side improvements (without touching top_k)

If you can modify training, there are several ways to reduce chaos:

  1. Clean repetitive and symbolic junk from the fine-tune corpus (a simple filter is sketched after this list)

    • Remove or down-weight samples with:

      • long runs of punctuation or separators (||||||||, ------, etc.),
      • broken markup, tables, or ASCII art that isn’t essential.
    • Rep-Dropout shows that reducing repetition in training data reduces degeneration at inference, even with the same decoding. (Hugging Face)

  2. Mix in some base-style data or regularization

    • Regularized fine-tuning work shows that mixing in a bit of base data or using regularization can keep repetition rates closer to the base model’s behaviour. (ACL Anthology)
  3. DITTO-style repetition-aware training

    • The Apple paper “Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation” (DITTO) constructs synthetic repetitive examples and explicitly trains the model to penalize them. (arXiv)
    • That’s heavier weight than decoding tweaks, but if repetition/junk is mission-critical, a follow-up finetune in that direction can help without touching decoding.

These approaches keep your top_k budget intact but reduce the probability that ???/|||| patterns are attractive options in the first place.
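
For the cleaning step in point 1 above, even a crude filter helps. A sketch (the thresholds and the corpus variable are placeholders):

    import re

    SEPARATOR_RUN = re.compile(r"([|\-_=.?*#~])\1{7,}")  # 8+ repeats of the same separator character

    def looks_like_junk(sample: str) -> bool:
        """Flag training samples dominated by separator runs or duplicated lines."""
        if SEPARATOR_RUN.search(sample):
            return True
        lines = [ln.strip() for ln in sample.splitlines() if ln.strip()]
        # flag samples where more than half of the non-empty lines are exact duplicates
        return len(lines) > 4 and len(set(lines)) < len(lines) / 2

    clean_corpus = [s for s in corpus if not looks_like_junk(s)]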


4. Concrete “starting presets” you can try

Assuming an HF generate-style API, here are a few realistic presets that keep top_k=20 but adjust other knobs.

4.1 Conservative but still varied (for long technical answers)

generation_config = dict(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.5,
    max_new_tokens=512,            # or your typical answer length
    no_repeat_ngram_size=4,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,
)

  • top_k=20 kept as-is.
  • top_p=0.9 and temperature=0.5 keep you away from the extreme tail while still allowing variation. (arXiv)
  • no_repeat_ngram_size and repetition_penalty directly fight loops.

4.2 Typical decoding variant

generation_config = dict(
    do_sample=True,
    top_k=20,
    typical_p=0.95,
    temperature=0.6,
    max_new_tokens=512,
    no_repeat_ngram_size=4,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,
)

  • Replace top_p with typical decoding, which tends to avoid both dullness and extreme outliers. (Facebook)

4.3 If you have Mirostat (llama.cpp, web UIs, etc.)

Configure:

  • mirostat_mode = 2
  • mirostat_tau = 5.0–7.0
  • mirostat_eta = 0.1
  • plus top_k = 20 as a hard cap.

Mirostat will dynamically adjust effective sampling to keep perplexity stable and avoid the confusion/boredom traps for long outputs. (Hugging Face)
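
If you are using llama-cpp-python, those settings map onto its sampling arguments roughly like this (a sketch; parameter names can differ between backends and versions, so check your own docs):

    from llama_cpp import Llama

    llm = Llama(model_path="model.gguf")  # path is a placeholder
    out = llm(
        prompt,                # your prompt string
        max_tokens=512,
        top_k=20,              # hard cap, as above
        mirostat_mode=2,       # Mirostat v2
        mirostat_tau=5.0,      # target surprise level
        mirostat_eta=0.1,      # learning rate of the feedback loop
    )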


5. Useful links for deeper reading

Theory & algorithms:

  • Holtzman et al., “The Curious Case of Neural Text Degeneration” (nucleus / top-p sampling)
  • Basu et al., “Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity”
  • “Locally Typical Sampling” (typical decoding)

Practical decoding & HF docs:

  • Hugging Face blog, “How to generate text”
  • Transformers text generation strategies docs and the GenerationConfig reference (top_k, top_p, no_repeat_ngram_size, repetition_penalty, eos_token_id)
  • https://huggingface.co/docs/transformers/en/main_classes/text_generation

Fine-tuning & degeneration:

  • Yao et al. (2025), “The Repeat Curse”
  • “Repetition in Repetition Out” (Rep-Dropout)
  • “Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation” (DITTO)


Final bullet summary

  • top_k = 20 is not inherently too high; many recipes use top_k = 50. The instability you see is due to the interaction of top-k with long sequences, temperature/top_p, and a fine-tuned distribution that is brittle in the tail. (zotero.org)

  • Long answers push the model into regions where the next-token distribution is flat; with moderate k, you start sampling from the unreliable tail, causing degeneration (Holtzman), entering the confusion trap (Mirostat), and triggering repetition features (Repeat Curse) → ???? / |||| loops. (arXiv)

  • Fine-tuning on domain data can amplify these problems, especially if the data has high repetition or weird symbols, and if EOS behavior isn’t clean; this is consistent with Rep-Dropout and fine-tuning-pitfall literature. (Hugging Face)

  • You can keep top_k the same and stabilize output by:

    • using top_p / typical / Mirostat with moderate temperature,
    • adding no_repeat_ngram_size, repetition_penalty, and custom penalties for punctuation spam,
    • enforcing good EOS/stop behavior, and
    • optionally cleaning or regularizing your fine-tune data. (OpenReview)
1 Like

You mentioned that “top_k = 50 is the default in many model cards and HF examples.”
Could you share a few specific models or official documentation links?
So far, I’ve only found that the GPT-2 series and the official Transformers examples use top_k = 50 by default, and I’d like to confirm whether there are other models that explicitly recommend 50 as well.

I also want to ask:
Is there any literature that provides guidelines or recommended practices for choosing the value of top_k?

1 Like

Thanks for sharing this! I’m curious about the details — when you said you “mixed in a tiny bit of base data during fine-tuning,” what exactly do you mean by base data?
Was it the original pre-training corpus, the model’s instruct data, or something else?
And how did you integrate it into the fine-tuning process (e.g., what ratio, how many samples, etc.)?

1 Like

Since Transformers and other backends often have a default value of top_k = 50, if you don’t specify the argument while sampling is enabled (do_sample = True), top_k = 50 will likely be used.

https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig.top_k

  • top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering. This value is set in a model’s generation_config.json file. If it isn’t set, the default value is 50.

Edit:
Llama.cpp’s default seems to be top_k = 40.

When setting top_k, the behavior isn’t determined solely by top_k itself. The model’s inherent behavior and interactions with other parameters also affect its output, and the ideal value varies depending on the purpose. So, trial and error is the only way to go…
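
One quick way to see what your model will actually use is to print its generation config (standard Transformers API; the checkpoint name is just an example):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # example checkpoint
    print(model.generation_config)        # shows do_sample, top_k, top_p, etc.
    print(model.generation_config.top_k)  # 50 unless the model’s generation_config.json overrides it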


1. What does top_k actually do?

Very simply:

  • At each step, the model picks the next token from the top k most likely tokens.
  • Smaller k → fewer choices → more stable, less creative.
  • Larger k → more choices → more creative, but also more risk of junk.

Think of top_k as:

“How many candidate tokens am I willing to even consider at each step?”


2. Reasonable ranges (rule of thumb)

You almost never need anything outside this spectrum:

  • k = 1

    • Greedy (no randomness).
    • Maximum stability, zero creativity.
  • k ≈ 5–20

    • Focused, relatively safe, still a bit of variation.
    • Good for technical/domain QA, code, SQL, config, formal stuff.
  • k ≈ 20–40

    • Standard “assistant/chatbot” range.
    • Balanced between coherence and variety.
  • k ≈ 40–80

    • More “creative writing / brainstorming” territory.
    • Higher chance of off-topic or weird outputs.

3. How to choose top_k in practice (simple recipe)

  1. Pick by task type:

    • Technical / factual / precise: start with top_k = 10–20.
    • General chat / explanation: start with top_k = 20–40.
    • Creative writing: start with top_k = 40–60.
  2. Set temperature and top_p first:

    • Use temperature to control “how random” it feels (0.3–0.7 for serious work).
    • Use top_p ≈ 0.9–0.95 (nucleus sampling) or typical decoding if you have it.
    • Treat top_k as a cap, not your main creativity knob.
  3. Add repetition control:

    • no_repeat_ngram_size = 3–5.
    • Mild repetition_penalty ≈ 1.05–1.15.
      These matter more for stopping ???? / |||| loops than tiny changes to k.
  4. Test on real prompts:

    • If outputs are too boring but correct → slightly raise temperature (not necessarily k).
    • If outputs get chaotic or repetitive → lower temperature or top_p before touching k.
  5. Only tweak top_k if you must:

    • If still too deterministic → bump top_k a bit (e.g., 10 → 20, 20 → 30).

    • If still too unstable → you can try 20 → 10, but usually you should fix:

      • temperature / top_p,
      • repetition penalties,
      • EOS / max length,
      • training issues (fine-tune data).

Ultra-short version

  • Reasonable top_k is usually 5–50.
  • Pick smaller k for precise/technical tasks, larger k for creative tasks.
  • Your k=20 is fine; focus more on temperature, top_p, repetition control, and stopping, not on changing k itself.