The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Abstract
Arbitrary-order generation in diffusion large language models limits reasoning capability by causing premature solution-space collapse; forgoing it in favor of standard policy optimization (GRPO) proves more effective.
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, is often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
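For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage it is built on: several completions are sampled for the same prompt, each is scored, and every score is normalized against the group's mean and standard deviation. This is a minimal, generic illustration (the example rewards and the function name `group_relative_advantages` are ours, not the paper's), not the authors' implementation.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO.
# Only the advantage normalization is shown; sampling completions from a
# dLLM and scoring them with a verifier are assumed to happen elsewhere.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for a group of 4 completions sampled for one prompt,
# e.g., 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Completions with positive advantage are reinforced and those with negative advantage are suppressed; the paper's point is that this standard recipe can be applied to a dLLM directly once arbitrary-order generation is forgone.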
Community
Links
Paper: https://arxiv.org/abs/2601.15165
Project page: https://nzl-thu.github.io/the-flexibility-trap
Code: https://github.com/LeapLabTHU/JustGRPO
Model: https://huggingface.co/nzl-thu/LLaDA-Instruct-JustGRPO
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective (2025)
- dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning (2025)
- Learning Unmasking Policies for Diffusion Language Models (2025)
- d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models (2025)
- CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models (2026)
- From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models (2025)
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models (2025)
Nice find!
It appears that the figures for GSM8K and MATH500 in Figure 3 might have been swapped. Could you please verify this?
Thanks for pointing this out. This is indeed a labeling mismatch in our figure. We will correct it promptly in the upcoming v2 revision on arXiv. We really appreciate your attention to detail!