
Abstract


Diffusion-based large language models (dLLMs) are trained to flexibly model complex dependencies in the data distribution; however, how best to exploit this flexibility at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes of any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the order in which tokens are unmasked plays a critical role in inference-time performance.
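To make the notion of a fixed block schedule concrete, below is a minimal, illustrative Python/NumPy sketch of semi-autoregressive block-wise decoding for a masked-diffusion LM: masked positions are revealed left-to-right in blocks and, within each block, in order of model confidence. The `predict_logits` callable, `MASK_ID`, and the random "model" in the demo are placeholders of ours, not the paper's released implementation.

```python
import numpy as np

MASK_ID = -1  # placeholder mask-token id (illustrative only)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def decode_blockwise(predict_logits, prompt, gen_len, block_size, steps_per_block):
    """Semi-autoregressive decoding sketch: fill `gen_len` masked positions
    left-to-right in blocks of `block_size`, revealing the most confident
    masked tokens within the current block at each refinement step."""
    x = np.concatenate([np.asarray(prompt, dtype=np.int64),
                        np.full(gen_len, MASK_ID, dtype=np.int64)])
    for start in range(len(prompt), len(x), block_size):
        end = min(start + block_size, len(x))
        for step in range(steps_per_block):
            masked = np.where(x[start:end] == MASK_ID)[0] + start
            if masked.size == 0:
                break
            logits = predict_logits(x)               # shape: (seq_len, vocab_size)
            probs = softmax(logits[masked])
            conf = probs.max(axis=1)                  # confidence per masked position
            tok = probs.argmax(axis=1)                # greedy token per masked position
            k = int(np.ceil(masked.size / (steps_per_block - step)))
            order = np.argsort(-conf)[:k]             # most confident positions first
            x[masked[order]] = tok[order]
    return x


if __name__ == "__main__":
    # Tiny demo with a random stand-in for the dLLM denoiser that scores all positions.
    rng = np.random.default_rng(0)
    vocab_size = 100
    dummy_model = lambda seq: rng.standard_normal((len(seq), vocab_size))
    print(decode_blockwise(dummy_model, prompt=[1, 2, 3], gen_len=16,
                           block_size=8, steps_per_block=4))
```

The block size fixes the decoding order: small blocks behave nearly autoregressively, while large blocks allow far-ahead tokens to be revealed early, which is exactly the schedule choice the abstract argues should not be fixed to a single value.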
Key Contributions
- Limitation Analysis: We identify why common confidence-based schedules (e.g., top-K margin) can catastrophically fail in dLLMs on reasoning tasks (e.g., [AfterEoT] collapse), revealing sensitivity to decoding order.
- HEX, Test-Time Scaling via Hidden Experts: We uncover and exploit a latent mixture of semi-AR experts by marginalizing over block schedules and aggregating their outputs via majority vote, introducing a new compute-accuracy knob at inference time (a minimal sketch follows this list).
- State-of-the-Art, Training-Free: HEX achieves up to 3.56× gains and surpasses GRPO on GSM8K (88.10%), MATH (40.00%), ARC-C (87.80%), and improves TruthfulQA (57.46%)—all without retraining.
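Below is a hedged sketch of the HEX-style ensemble referenced above: decode the same prompt under several heterogeneous block sizes and take a majority vote over the extracted final answers. Here `decode_blockwise` is the block-wise decoder sketched after the abstract, and `extract_answer` is a hypothetical hook (e.g., parsing the numeric answer from a GSM8K-style completion); neither is the authors' actual code.

```python
from collections import Counter


def hex_vote(decode_blockwise, predict_logits, prompt, gen_len,
             block_sizes=(4, 8, 16, 32), steps_per_block=4,
             extract_answer=lambda gen_tokens: tuple(gen_tokens)):
    """Ensemble over heterogeneous block schedules: run block-wise decoding once
    per block size, extract each run's final answer, and return the majority
    answer together with its vote share."""
    answers = []
    for block_size in block_sizes:
        tokens = decode_blockwise(predict_logits, prompt, gen_len,
                                  block_size, steps_per_block)
        answers.append(extract_answer(tokens[len(prompt):]))
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

The length of `block_sizes` is the compute-accuracy knob: more sampled schedules cost more forward passes but make the vote more robust to any single schedule's failure mode.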
Findings



Results


BibTeX
@article{lee2025hex,
  title   = {Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts},
  author  = {Jihoon Lee and Hoyeon Moon and Kevin Zhai and Arun Kumar Chithanar and
             Anit Kumar Sahu and Soummya Kar and Chul Lee and Souradip Chakraborty and
             Amrit Singh Bedi},
  journal = {arXiv preprint},
  year    = {2025},
  note    = {Project page: https://junos-ai-org.github.io/Test-Time-Scaling/}
}