
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Qinghao Hu*, Shang Yang*, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
MIT, NVIDIA, ETH Zurich, MIT-IBM Watson AI Lab, UMass Amherst
(* indicates equal contribution)

Abstract

The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically with Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding (SD). Applying SD in RL is challenging because of dynamic workloads, the continuously evolving target model, and the overhead of training a draft model. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7× end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment.

Highlights

  • Up to 2.1× faster end-to-end reasoning RL training
  • Lossless on-policy RL: training quality preserved theoretically and empirically
  • A free, high-quality draft model for efficient deployment

Overview

Reinforcement Learning (RL) for reasoning LLMs, while proven powerful, suffers from severe efficiency bottlenecks: long-tail sequence lengths in the rollout stage cause long GPU idle times and slow progress. TLT addresses this by automatically training a lightweight drafter alongside the RL run and dynamically applying speculative decoding, delivering significant speedups without altering the underlying RL algorithm.

Long-tail rollouts bottleneck RL training. TLT introduces adaptive speculative decoding to utilize idle GPU "bubbles" and accelerate RL training.

Adaptive Speculative Decoding

Reasoning RL training is dominated by long-tail rollouts, where a few extremely long responses consume over 80% of the step time and collapse GPU utilization—most GPUs sit idle while waiting for those long rollouts to finish, leading to highly inefficient training.
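To see why a few stragglers dominate, consider a back-of-the-envelope sketch of a synchronous rollout step (illustrative numbers only, not TLT's measurements): every GPU that drew a short response must wait for the longest response to finish, so the idle fraction is driven almost entirely by the tail.

```python
import random

# Toy model of a synchronous rollout step (illustrative only, not TLT's profiler):
# each sequence occupies a GPU slot until the *longest* response in the batch
# finishes, so short responses translate directly into idle time.
random.seed(0)
batch_size = 256
# Heavy-tailed response lengths: most responses are short, a few are very long.
lengths = [min(int(random.paretovariate(1.2) * 500), 32768) for _ in range(batch_size)]

step_time = max(lengths)                                 # step ends with the longest rollout
busy = sum(lengths)                                      # total useful decoding work
idle_fraction = 1 - busy / (batch_size * step_time)      # share of GPU-time spent waiting

print(f"longest rollout: {step_time} tokens")
print(f"median rollout:  {sorted(lengths)[batch_size // 2]} tokens")
print(f"idle fraction:   {idle_fraction:.1%}")
```

In this toy setting, most of the aggregate GPU-time is spent waiting on the tail, which is exactly the "bubble" that TLT repurposes for drafter training.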

Speculative decoding (SD), an efficient and lossless way to accelerate LLM generation, is a natural fit for long-tail rollouts. However, in RL the target model evolves continuously, which quickly makes a fixed drafter ineffective. TLT addresses this by introducing Adaptive Speculative Decoding: it leverages idle “rollout bubbles” to update the drafter on the fly, ensuring the draft model stays aligned with the changing target throughout training.
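As a concrete (and heavily simplified) illustration of the draft-then-verify loop that SD relies on, the sketch below assumes greedy decoding, where accepting the longest agreeing prefix plus one corrected target token reproduces the target's output exactly. The toy `Model` callables and function names are ours, not TLT's API, and the drafter-refresh logic that TLT runs on idle GPUs is elided.

```python
from typing import Callable, List

Token = int
# A "model" here is just a greedy next-token function: context tokens -> next token.
# These stand-ins are illustrative; in TLT the target is the RL policy being trained
# and the drafter is the small model kept up to date on idle GPUs.
Model = Callable[[List[Token]], Token]

def speculative_decode_greedy(target: Model, draft: Model,
                              prompt: List[Token], k: int, max_new: int) -> List[Token]:
    """Greedy speculative decoding: the drafter proposes k tokens, the target verifies
    them, and we keep the longest agreeing prefix plus one target token. Under greedy
    decoding this reproduces the target's output token-for-token (lossless)."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new:
        # 1. Drafter proposes k tokens autoregressively (cheap forward passes).
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies the proposal; in a real engine this is one batched pass.
        accepted: List[Token] = []
        for t in proposal:
            expected = target(tokens + accepted)
            if expected != t:
                accepted.append(expected)   # first disagreement: take the target's token
                break
            accepted.append(t)
        else:
            accepted.append(target(tokens + accepted))  # bonus token when all k match
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[:len(prompt) + max_new]

# Toy usage: a "target" and a slightly stale "draft" that agree most of the time.
target = lambda ctx: (sum(ctx) * 31 + 7) % 100
draft  = lambda ctx: target(ctx) if len(ctx) % 5 else sum(ctx) % 100
out = speculative_decode_greedy(target, draft, prompt=[1, 2, 3], k=4, max_new=16)
assert all(out[i] == target(out[:i]) for i in range(3, 19))  # matches greedy target exactly
```

The payoff is that the expensive target model is invoked once per verified block instead of once per token, while the output stays unchanged; keeping the drafter aligned with the evolving target, the job of TLT's Adaptive Drafter, is what keeps the acceptance rate, and hence the speedup, high.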

Efficient RL Framework

TLT unifies the Adaptive Drafter, Spot Trainer, and Adaptive Rollout Engine into a single acceleration framework. In TLT, drafter updates are performed opportunistically as spot tasks to minimize interference with the RL workload. During decoding, the rollout engine dynamically adjusts SD strategies in response to real-time workload characteristics, maximizing token throughput while staying within memory limits.

Overview of the TLT framework.
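To make the idea of adaptively selecting SD strategies concrete, here is a rough sketch of per-batch strategy selection from a small pool of pre-captured configurations; the `SDConfig` fields, pool keys, and batch-size thresholds are hypothetical illustrations, not TLT's actual interface or tuning.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SDConfig:
    """One speculative-decoding setting with a pre-captured CUDAGraph.
    Fields are illustrative; TLT's real configurations are richer."""
    use_sd: bool           # fall back to plain decoding for large batches
    num_draft_tokens: int  # how many tokens the drafter proposes per step

# Hypothetical pool: only a handful of configurations are ever captured,
# keeping CUDAGraph memory bounded.
CONFIG_POOL = {
    "large_batch":  SDConfig(use_sd=False, num_draft_tokens=0),
    "medium_batch": SDConfig(use_sd=True,  num_draft_tokens=2),
    "small_batch":  SDConfig(use_sd=True,  num_draft_tokens=4),
    "long_tail":    SDConfig(use_sd=True,  num_draft_tokens=8),
}

def select_config(active_batch_size: int) -> SDConfig:
    """Pick a pre-captured configuration based on how many sequences are still
    decoding: as the batch drains into the long tail, speculate more aggressively."""
    if active_batch_size > 128:
        return CONFIG_POOL["large_batch"]
    if active_batch_size > 32:
        return CONFIG_POOL["medium_batch"]
    if active_batch_size > 8:
        return CONFIG_POOL["small_batch"]
    return CONFIG_POOL["long_tail"]

# Example: the active batch shrinks over a rollout step as short responses finish.
for n in (256, 96, 16, 3):
    cfg = select_config(n)
    print(f"{n:>3} active sequences -> use_sd={cfg.use_sd}, draft_tokens={cfg.num_draft_tokens}")
```

Bounding the pool to a handful of configurations keeps the pre-captured CUDAGraph memory footprint small, while still letting the engine speculate more aggressively as the batch drains into the long tail.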

Results

Across diverse LLM sizes and system settings, TLT achieves substantial, lossless performance gains without violating on-policy constraints. By improving rollout efficiency with adaptive speculative decoding, TLT reduces long-tail latency and delivers 1.7–2.1× end-to-end RL speedup while maintaining reward parity with baseline methods. In addition, thanks to the adaptive speculative decoding design, TLT yields a high-quality draft model as a free byproduct for efficient deployment.

End-to-end Training Speed Evaluation. The y-axis indicates the relative training throughput of each system running the GRPO RL algorithm. TLT achieves 1.7–2.1× speedup over the state-of-the-art RL training system VeRL.
End-to-end Training Curves. The average reward curves with TLT for both Qwen2.5 7B and 32B models match well with VeRL's.

Citation

@inproceedings{TLT,
     title={Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter},
     author={Qinghao Hu and Shang Yang and Junxian Guo and Xiaozhe Yao and Yujun Lin and Yuxian Gu and Han Cai and Chuang Gan and Ana Klimovic and Song Han},
     booktitle={Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems},
     year={2026},
     series={ASPLOS '26}
}

Acknowledgment

We thank the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation for supporting this work.

Team Members