HEX visualization sample

Abstract

Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over diverse block-sized generation paths, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the order in which masking is performed plays a critical role in determining performance during inference.
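
To make the procedure concrete, below is a minimal Python sketch of HEX-style inference. The decoding and answer-extraction callables (decode_fn, extract_answer) and the default block sizes are illustrative assumptions standing in for model-specific components; the actual implementation may differ.

from collections import Counter
from typing import Callable, Sequence, Tuple

def majority_vote(answers: Sequence[str]) -> Tuple[str, bool]:
    # Return the most frequent answer and whether the top count is tied.
    counts = Counter(answers).most_common()
    winner, top = counts[0]
    tie = len(counts) > 1 and counts[1][1] == top
    return winner, tie

def hex_generate(decode_fn: Callable[[int], str],
                 extract_answer: Callable[[str], str],
                 block_sizes: Sequence[int] = (4, 8, 16, 32, 64)) -> str:
    # Each fixed block size is one semi-autoregressive schedule (one "hidden expert"):
    # the response is unmasked block by block, left to right, with that block width.
    candidates = [extract_answer(decode_fn(b)) for b in block_sizes]
    # Aggregate the experts by majority vote over their final answers.
    winner, _ = majority_vote(candidates)
    return winner

In practice, decode_fn would wrap the dLLM's semi-autoregressive sampler with the prompt and generation length fixed, so that only the block schedule varies across ensemble members.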

Key Contributions

  • Limitation Analysis: We identify why common confidence-based schedules (e.g., top-K margin) can catastrophically fail in dLLMs on reasoning tasks (e.g., [AfterEoT] collapse), revealing sensitivity to decoding order; a sketch of such a margin-based rule follows this list.
  • HEX—Test-time Scaling via Hidden Experts: We uncover and exploit a latent mixture of semi-AR experts by marginalizing over block schedules and aggregating via majority vote—introducing a new compute-accuracy knob at inference time.
  • State-of-the-Art, Training-Free: HEX achieves up to 3.56× gains and surpasses GRPO on GSM8K (88.10%), MATH (40.00%), ARC-C (87.80%), and improves TruthfulQA (57.46%)—all without retraining.
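
For reference, a confidence-based schedule of the kind mentioned above typically unmasks, at each step, the positions with the largest gap between the top-1 and top-2 token probabilities. The PyTorch snippet below is our own illustration of one way such a top-K margin rule can be written, not the exact decoder used in the paper; the tensor shapes noted in the comments are assumptions.

import torch

def topk_margin_positions(logits: torch.Tensor, masked: torch.Tensor, k: int) -> torch.Tensor:
    # logits: (seq_len, vocab_size) model predictions at every position.
    # masked: (seq_len,) boolean, True where the token is still hidden.
    probs = logits.softmax(dim=-1)
    top2 = probs.topk(2, dim=-1).values                   # top-1 and top-2 probabilities per position
    margin = top2[:, 0] - top2[:, 1]                      # confidence margin
    margin = margin.masked_fill(~masked, float("-inf"))   # never pick already-revealed slots
    k = min(k, int(masked.sum().item()))
    return margin.topk(k).indices                         # the k most "confident" masked positions

Committing to a single rule like this is exactly the fixed-schedule behavior whose collapse (e.g., [AfterEoT]) motivates HEX.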

Findings

Limitation Analysis: We identify why common confidence-based schedules (e.g., top-K margin) can catastrophically fail in dLLMs on reasoning tasks (e.g., [AfterEoT] collapse), revealing sensitivity to decoding order.

Uncovered property: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. Committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. HEX addresses this by ensembling over heterogeneous block schedules.
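
As a concrete picture of what a block schedule is, the toy helper below enumerates which positions are revealed at each semi-autoregressive step for a given block size; varying the block size changes the generation order and hence which latent expert is queried. This is purely illustrative.

def block_schedule(gen_len: int, block_size: int) -> list:
    # Positions revealed at each semi-autoregressive step, left to right.
    return [list(range(start, min(start + block_size, gen_len)))
            for start in range(0, gen_len, block_size)]

print(block_schedule(8, 2))  # [[0, 1], [2, 3], [4, 5], [6, 7]]  -> closer to autoregressive
print(block_schedule(8, 8))  # [[0, 1, 2, 3, 4, 5, 6, 7]]        -> fully parallel over the block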

Results

Main results
State-of-the-Art, Training-Free: HEX achieves up to 3.56× gains and surpasses GRPO on GSM8K (88.10%), MATH (40.00%), ARC-C (87.80%), and improves TruthfulQA (57.46%)—all without retraining.
Test-time scaling
Test-Time Scaling: HEX’s accuracy improves monotonically as the number of voting samples increases, while the tie rate, an indicator of ambiguity, steadily declines. HEX effectively exposes a tunable accuracy-compute knob: practitioners can trade inference cost for accuracy in a predictable way, without retraining.
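
The knob is easy to reason about with the same majority-vote machinery: drawing more candidate schedules and re-voting shows how often the vote remains ambiguous. The snippet below is a toy simulation with made-up answer frequencies, not the benchmark numbers reported above.

from collections import Counter
import random

def vote(answers):
    # Majority vote plus a flag for whether the top count is tied (ambiguous).
    counts = Counter(answers).most_common()
    winner, top = counts[0]
    tie = len(counts) > 1 and counts[1][1] == top
    return winner, tie

def estimate_tie_rate(answer_pool, num_samples, trials=2000, seed=0):
    # Empirical tie rate when `num_samples` candidates are drawn from the pool.
    rng = random.Random(seed)
    ties = sum(vote(rng.choices(answer_pool, k=num_samples))[1] for _ in range(trials))
    return ties / trials

# Toy pool: 60% of schedules reach the correct answer, the rest disagree with each other.
pool = ["42"] * 6 + ["41", "17", "7", "13"]
for k in (3, 5, 9, 17):
    print(f"samples={k:2d}  tie_rate={estimate_tie_rate(pool, k):.3f}")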

BibTeX

@article{lee2025hex,
  title   = {Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts},
  author  = {Jihoon Lee and Hoyeon Moon and Kevin Zhai and Arun Kumar Chithanar and
             Anit Kumar Sahu and Soummya Kar and Chul Lee and Souradip Chakraborty and
             Amrit Singh Bedi},
  journal = {arXiv preprint},
  year    = {2025},
  note    = {Project page: https://junos-ai-org.github.io/Test-Time-Scaling/}
}