The Flexibility Trap

Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni1, Shenzhi Wang1, Yang Yue1, Tianyu Yu2, Weilin Zhao2, Yeguo Hua2, Tianyi Chen2, Jun Song3, Cheng Yu3, Bo Zheng3, Gao Huang1✉
1 LeapLab, Tsinghua University; 2 Tsinghua University; 3 Alibaba Group
Main comparison

JustGRPO achieves state-of-the-art reasoning performance by deliberately forgoing arbitrary order during training, while retaining efficient parallel decoding at inference.

Abstract

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for complex tasks like mathematics and coding.

We reveal a counter-intuitive reality: arbitrary order generation, in its current form, limits rather than expands reasoning boundaries. We find that dLLMs tend to exploit this flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable algorithmic complexity is devoted to preserving this flexibility.

We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying just standard Group Relative Policy Optimization (GRPO). Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs.

The Flexibility Trap

Central to the appeal of dLLMs is their theoretical flexibility: the capability for arbitrary-order generation. In principle, the set of admissible generation orders strictly contains the fixed autoregressive trajectory. This flexibility naturally suggests a potential for superior reasoning: in complex tasks like mathematics and coding, such freedom could unlock non-sequential problem-solving paths inaccessible to standard left-to-right models.

We challenge this intuition. Our findings reveal that arbitrary-order generation, in its current form, narrows rather than expands the model's reasoning potential.

Finding 1: Arbitrary Order Limits Reasoning Potential

To rigorously assess reasoning potential, we employ Pass@k, a standard proxy for effective solution-space coverage and for the upper bound of achievable reasoning. We compare two decoding modes: Arbitrary Order (standard diffusion decoding with low-confidence remasking) and AR Order (strictly left-to-right decoding).
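
For concreteness, below is a minimal sketch of the standard unbiased Pass@k estimator (following Chen et al., 2021); the function and the example numbers are purely illustrative and not drawn from our evaluation code.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased Pass@k estimate from n samples, c of which are correct."""
        if n - c < k:
            return 1.0  # every size-k subset contains at least one correct sample
        # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Illustrative usage: 1024 samples for one problem, 37 of them correct
    print(pass_at_k(n=1024, c=37, k=16))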

Pass@k comparison

Reasoning potential measured by Pass@k. While arbitrary order is competitive at k=1, it exhibits notably flatter scaling curves compared to AR Order. As k increases, AR mode demonstrates a stronger capacity to uncover correct solutions.

Finding 2: Arbitrary Order Explores a Subset of AR's Solution Space

One might hypothesize that arbitrary order explores a different solution space, albeit less efficiently. We test this by analyzing solution coverage at k=1024. The result is stark: the reasoning traces generated by arbitrary order are almost entirely contained within those generated by AR.

Coverage analysis

Solution space coverage measured by Pass@1024. On HumanEval, AR solves 21.3% of problems that arbitrary order misses, whereas the reverse is only 0.6%. The flexible decoding process rarely unlocks genuinely new solutions.
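
The coverage statistics can be computed from per-problem solve sets, as in the sketch below; the dictionaries are hypothetical placeholders, not actual evaluation outputs.

    # Hypothetical per-problem outcomes at k = 1024: problem id -> solved (bool)
    solved_ar = {"HumanEval/0": True, "HumanEval/1": True, "HumanEval/2": False}
    solved_arbitrary = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": False}

    problems = list(solved_ar.keys())
    ar_only = [p for p in problems if solved_ar[p] and not solved_arbitrary[p]]
    arb_only = [p for p in problems if solved_arbitrary[p] and not solved_ar[p]]

    print(f"solved only by AR order:        {100 * len(ar_only) / len(problems):.1f}%")
    print(f"solved only by arbitrary order: {100 * len(arb_only) / len(problems):.1f}%")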

Mechanism: The Entropy Degradation

Why does the theoretically superior solution space of dLLMs collapse in practice? We attribute this phenomenon to how these two modes handle uncertainty.

Confronting vs. Bypassing Uncertainty

Confronting vs Bypassing

AR order constrains the model to strictly resolve the left-most unresolved token at each step, forcing the model to confront uncertainty as it arises. By sampling exactly at the fork, the model commits to a specific rationale, thereby preserving the exploration space.

Arbitrary order adaptively selects tokens to update based on model confidence, preferentially generating "easy" tokens with high certainty while bypassing "hard" ones. By the time the model returns to fill in the bypassed forks, the established bidirectional context has already severely constrained the potential branches.
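
The sketch below makes this contrast concrete as a token-selection rule inside a single denoising step; it is a simplified caricature of low-confidence remasking rather than the exact sampler, and the tensor interface is an assumption of this illustration.

    import torch

    def select_position(logits: torch.Tensor, is_masked: torch.Tensor, mode: str) -> int:
        """Pick which masked position to commit at this denoising step.

        logits:    (seq_len, vocab_size) scores from the masked denoiser
        is_masked: (seq_len,) boolean mask of still-unresolved positions
        """
        masked_idx = torch.nonzero(is_masked, as_tuple=False).squeeze(-1)
        if mode == "ar":
            # AR order: always confront the leftmost unresolved token,
            # i.e. sample exactly at the fork where uncertainty arises.
            return int(masked_idx.min())
        elif mode == "arbitrary":
            # Arbitrary order (confidence-based): fill the easiest token first,
            # deferring high-uncertainty positions to later steps.
            conf = logits.softmax(dim=-1).max(dim=-1).values  # (seq_len,)
            return int(masked_idx[conf[masked_idx].argmax()])
        raise ValueError(f"unknown mode: {mode}")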

The Entropy Degradation Phenomenon

Inspecting the frequently bypassed tokens reveals a clear pattern: the diffusion sampler disproportionately defers logical connectives and transition markers such as "Therefore", "Thus", and "Since". Prior work has shown that such tokens often function as "reasoning sparks" or "logical forks" that determine subsequent reasoning directions. Keeping these tokens in a high-entropy state is critical for effective exploration.

Entropy degradation

Left: Frequently bypassed tokens are typically logical connectors. Right: While global average entropy remains comparable (dashed lines), the entropy at logical forks drops significantly in arbitrary order (blue bars). We term this entropy degradation.

Key Insight: The flexibility of arbitrary order serves as a mechanism for inference-time exploitation rather than reasoning exploration. By bypassing high-uncertainty tokens, the model trades the exploration of diverse reasoning paths for greedy optimization of local consistency.
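
A sketch of how entropy at logical forks can be measured is given below; the connective list and the token-matching rule are illustrative assumptions rather than the exact measurement protocol.

    import torch

    FORK_WORDS = {"Therefore", "Thus", "Since", "Because", "However"}  # illustrative list

    def token_entropy(logits: torch.Tensor) -> torch.Tensor:
        """Per-position predictive entropy from (seq_len, vocab_size) logits."""
        logp = logits.log_softmax(dim=-1)
        return -(logp.exp() * logp).sum(dim=-1)

    def fork_vs_global_entropy(logits: torch.Tensor, token_strings: list[str]):
        """Average entropy at connective ('fork') positions vs. over all positions."""
        ent = token_entropy(logits)
        fork = torch.tensor([t.strip() in FORK_WORDS for t in token_strings])
        return ent[fork].mean().item(), ent.mean().item()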

JustGRPO: Return to Simplicity

Our findings suggest that arbitrary order actually limits the reasoning potential accessible to RL. Despite this, current RL methods for dLLMs remain heavily burdened by the need to preserve this flexibility—grappling with combinatorial trajectories, intractable marginal likelihoods, and sampler-learner mismatches.

We propose a return to simplicity: since pure autoregressive order yields better reasoning potential, we explicitly forgo arbitrary-order generation during RL training. This transforms the dLLM from a chaotic sequence denoiser into a well-defined autoregressive policy.

Formulation

To obtain the probability of the next token $o_t$ given the history $o_{<t}$, we construct an input state in which the past is observed and the future is masked:

$$\tilde{x}_t = [\,o_1, \ldots, o_{t-1}, \texttt{[MASK]}, \ldots, \texttt{[MASK]}\,]$$

The autoregressive policy is then defined as

$$\pi^{\mathrm{AR}}_{\theta}(o_t \mid o_{<t}, q) = \mathrm{Softmax}\big(f_\theta(\tilde{x}_t)\big)_t\,,$$

i.e., the denoiser's predictive distribution at position $t$, evaluated at the token $o_t$.

This formulation enables the direct application of standard GRPO to diffusion language models—no trajectory approximations or marginal likelihood estimation needed.
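
The sketch below illustrates this formulation: the per-token autoregressive log-probabilities are read off the masked denoiser by observing the prefix and masking the future. The model interface (a forward pass returning logits), the mask-token handling, and the per-step loop are assumptions of this illustration, not the reference implementation.

    import torch

    def ar_logprobs(model, prompt_ids, response_ids, mask_id):
        """log pi_AR(o_t | o_<t, q) for every response token, via masked inputs."""
        q_len, o_len = prompt_ids.size(0), response_ids.size(0)
        logps = []
        for t in range(o_len):
            # x~_t = [q, o_1, ..., o_{t-1}, MASK, ..., MASK]
            masks = torch.full((o_len - t,), mask_id, dtype=torch.long)
            x_t = torch.cat([prompt_ids, response_ids[:t], masks]).unsqueeze(0)
            logits = model(x_t).logits[0]               # (q_len + o_len, vocab)
            logp_t = logits[q_len + t].log_softmax(-1)  # distribution at position t
            logps.append(logp_t[response_ids[t]])
        # (the o_len forward passes can be batched in practice)
        return torch.stack(logps)                       # (o_len,)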

Key Remarks

  • The AR constraint is applied only during training to correctly assign credit. It refines the model's joint distribution without altering the underlying architecture.
  • At inference time, the model retains its parallel decoding capabilities, allowing the use of parallel samplers to accelerate generation.
  • JustGRPO achieves the reasoning depth of autoregressive models while preserving the inference speed of dLLMs.
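
With these per-token log-probabilities, the update is just standard GRPO. Below is a hedged sketch of group-relative advantages combined with the PPO-style clipped objective; hyperparameters, batching, and the reward function are placeholders.

    import torch

    def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
        """Standard GRPO: group-relative advantages + clipped importance ratios.

        logp_new, logp_old: lists of per-token log-prob tensors, one per response
        rewards:            (G,) scalar rewards for the G responses to one prompt
        """
        # Group-relative advantage: normalize rewards within the group of G rollouts.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)        # (G,)
        losses = []
        for i in range(len(logp_new)):
            ratio = (logp_new[i] - logp_old[i].detach()).exp()           # per-token ratio
            clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
            losses.append(-torch.min(ratio * adv[i], clipped * adv[i]).mean())
        # (a KL penalty to a reference policy is often added; omitted for brevity)
        return torch.stack(losses).mean()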

Experiments

System-Level Comparison

We evaluate JustGRPO on LLaDA-Instruct across four standard benchmarks: GSM8K, MATH-500, HumanEval, and MBPP. Simplifying the training objective to a standard autoregressive formulation yields consistent improvements over methods specifically designed for dLLMs.

System-level comparison table

LLaDA-1.5 and LLADOU are shown in gray as they use different training data or modified architectures.

JustGRPO Preserves Parallel Decoding

A natural concern: does training with AR constraints compromise the model's inherent parallel decoding capabilities? We evaluate inference performance under varying degrees of parallelism using the training-free EB-Sampler.

Parallel decoding

JustGRPO preserves parallel decoding. Surprisingly, accuracy gains become more pronounced as parallelism increases. The model learns a more robust reasoning manifold that is resilient to approximation errors inherent in parallel sampling.
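
For intuition about what "varying degrees of parallelism" means, the sketch below shows a generic confidence-thresholded parallel unmasking step; it is not the EB-Sampler itself, and the model interface is an assumption of this illustration.

    import torch

    @torch.no_grad()
    def parallel_decode_step(model, x, mask_id, threshold=0.9):
        """Commit all masked positions whose top-1 confidence exceeds `threshold`.

        Lower thresholds commit more tokens per step (higher parallelism);
        a threshold near 1.0 degenerates to committing one token at a time.
        """
        logits = model(x.unsqueeze(0)).logits[0]        # (seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1) # (seq_len,), (seq_len,)
        masked = x == mask_id
        commit = masked & (conf >= threshold)
        if masked.any() and not commit.any():
            # Always commit at least the single most confident masked token.
            best = torch.where(masked, conf, torch.zeros_like(conf)).argmax()
            commit[best] = True
        return torch.where(commit, pred, x)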

Conclusion

The intuitive appeal of diffusion language models (dLLMs) lies in their order arbitrariness, often perceived as a superior mechanism for navigating complex reasoning paths. Our study reveals a counter-intuitive reality: this unrestricted flexibility in fact narrows the reasoning potential. By allowing the model to bypass high-entropy tokens, effectively skipping the most demanding logical branches, arbitrary-order generation acts as an exploitation mechanism that prioritizes greedy optimization of individual trajectories at the expense of broader solution coverage.

Eliciting the reasoning capability of dLLMs can therefore be simpler than existing approaches suggest. By operating dLLMs in a standard autoregressive manner, we enable the direct application of Group Relative Policy Optimization (GRPO) without any complex adaptations tailored to order arbitrariness. This intentional constraint paradoxically yields a significant upgrade in reasoning performance while fully preserving the parallel decoding capabilities of dLLMs. By returning to the basic, natural left-to-right order of language modeling, we hope to encourage a re-examination of its real value in the training of next-generation diffusion models.

BibTeX

@article{ni2026flexibility,
  title     = {The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author    = {Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal   = {arXiv preprint arXiv:2601.15165},
  year      = {2026},
}