Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
We study the problem of fine-tuning a pretrained language model with a reward model. To prevent issues such as spurious reward overoptimization, which can lead to poor performance and low diversity, the fine-tuning objective includes a KL divergence constraint that keeps the updated model close to the original. This objective can be interpreted probabilistically as Bayesian posterior inference, with the optimal policy expressible as a reweighted version of the reference model. Under this interpretation, on-policy RL amounts to amortized variational inference that minimizes the reverse KL to the posterior density. However, reverse KL optimization is susceptible to mode collapse and requires on-policy samples. This reliance on on-policy samples limits scalability, since it precludes replay buffers that can be populated in parallel at scale.
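To make this concrete, here is a standard form of the KL-regularized objective and the resulting posterior. The notation ($\pi_\theta$ for the tuned policy, $\pi_{\mathrm{ref}}$ for the reference model, reward $r$, temperature $\beta$) is illustrative and may differ from the paper's exact symbols.

```latex
% KL-regularized fine-tuning objective (standard form; notation is illustrative)
\max_{\theta} \; \mathbb{E}_{x \sim \pi_\theta}\!\left[ r(x) \right]
  \;-\; \beta \, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)

% Its optimum is the reference model reweighted by exponentiated reward,
% i.e., a Bayesian posterior with "likelihood" exp(r(x)/beta):
\pi^{*}(x) \;\propto\; \pi_{\mathrm{ref}}(x)\, \exp\!\left( r(x)/\beta \right)
```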
An alternative off-policy approach that enhances scalability and exploration is Generative Flow Networks (GFlowNets). GFlowNets cast the inference problem as sequential decision-making, where a policy constructs outputs token by token and learns to sample them with probability proportional to a designed reward function. Because GFlowNet objectives such as VarGrad (sketched below) are off-policy, they support flexible exploration and efficient training with replay buffers, making them well-suited for large-scale, distributed fine-tuning of language models.
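As a reference point, below is one common batch form of the trajectory-balance objective with a VarGrad-style estimate of the log-partition function. This is a sketch based on the GFlowNet literature; the paper's exact parameterization may differ.

```latex
% Trajectory Balance with a VarGrad-style batch estimate of log Z (illustrative).
% For the KL-constrained posterior, the target reward is
%   R(x) = \pi_{\mathrm{ref}}(x)\, \exp\!\left( r(x)/\beta \right).
% Given a batch of K completions x_1, \dots, x_K (possibly off-policy):
\widehat{\log Z} \;=\; \frac{1}{K} \sum_{i=1}^{K}
  \Big[ \log R(x_i) - \log \pi_\theta(x_i) \Big],
\qquad
\mathcal{L}(\theta) \;=\; \frac{1}{K} \sum_{i=1}^{K}
  \Big( \widehat{\log Z} + \log \pi_\theta(x_i) - \log R(x_i) \Big)^{2}.
```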
In this work we introduce Trajectory Balance with Asynchrony (TBA), an asynchronous, distributed reinforcement learning (RL) system designed to accelerate and scale post-training of large language models (LLMs). The key idea is decoupling data generation (exploration) from policy updates (training), which enables more efficient resource utilization and dramatically reduces wall-clock training time.
TBA divides the system into two main types of nodes:
- Searcher nodes, which continuously generate (potentially off-policy) completions using recent copies of the policy and push them, along with their rewards, to a central replay buffer.
- A trainer node, which samples from this buffer based on reward or recency and updates the policy with the Trajectory Balance (TB) objective (a minimal sketch of this loop follows the list).
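The following is a minimal, hypothetical Python sketch of the searcher/trainer split around a shared replay buffer. Names such as `generate_completion`, `compute_reward`, and `tb_update` are placeholders rather than the paper's API, and a real deployment would use a distributed store across nodes instead of in-process threads.

```python
# Illustrative sketch of TBA-style decoupled search and training.
# Searcher threads stand in for distributed searcher nodes; the trainer thread
# stands in for the training node. All model and reward calls are stubbed out.
import random
import threading
import time

replay_buffer = []                 # central buffer of (completion, reward, timestamp) tuples
buffer_lock = threading.Lock()
stop_event = threading.Event()


def generate_completion(policy_snapshot):
    # Placeholder for sampling a completion from a (possibly stale) policy copy.
    return f"completion-{random.randint(0, 10_000)}"


def compute_reward(completion):
    # Placeholder for a reward model / verifier call.
    return random.random()


def searcher(policy_snapshot):
    # Searcher node: keep generating off-policy data and push it to the buffer.
    while not stop_event.is_set():
        x = generate_completion(policy_snapshot)
        r = compute_reward(x)
        with buffer_lock:
            replay_buffer.append((x, r, time.time()))


def tb_update(batch):
    # Placeholder for a Trajectory Balance (VarGrad) gradient step on the batch.
    pass


def trainer(num_steps=100, batch_size=8):
    # Trainer node: sample from the buffer by reward or recency, then update.
    steps = 0
    while steps < num_steps:
        with buffer_lock:
            ready = len(replay_buffer) >= batch_size
            if ready:
                # Half the batch by reward rank, half by recency; one simple
                # policy for illustration, not necessarily the paper's scheme.
                by_reward = sorted(replay_buffer, key=lambda t: t[1], reverse=True)
                batch = by_reward[: batch_size // 2] + replay_buffer[-(batch_size // 2):]
        if not ready:
            time.sleep(0.01)
            continue
        tb_update(batch)
        steps += 1
    stop_event.set()


if __name__ == "__main__":
    workers = [threading.Thread(target=searcher, args=(None,)) for _ in range(4)]
    for w in workers:
        w.start()
    trainer()
    for w in workers:
        w.join()
```

Because the searchers never wait on the trainer, exploration throughput scales with the number of searcher nodes while the trainer consumes the buffer at its own pace.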
Key Benefits and Design Considerations

This design yields the three advantages highlighted above: decoupled training and search that cuts wall-clock training time by 4x or more, improved diversity through large-scale off-policy sampling, and scalable search for sparse-reward settings.
On mathematical reasoning tasks like GSM8K, TBA shows significant improvements over baseline methods. Its ability to explore diverse solution strategies while learning from high-reward examples in the replay buffer helps the model develop stronger reasoning. The decoupled exploration and learning process allows TBA to discover effective solution patterns more efficiently than traditional on-policy methods.
In preference fine-tuning experiments, TBA not only redefines the win-rate versus KL Pareto frontier, achieving higher win rates (e.g., around 82% with well-tuned hyperparameters) at lower or comparable KL values, but also delivers significant training speedups (approximately 5x faster than synchronous methods). Further ablations reveal that maintaining a moderately high on-policy proportion (controlled by the parameter m) is critical for maximizing win rate without sacrificing training stability. Overall, the experiments demonstrate that TBA scales preference fine-tuning effectively by combining rapid, diverse exploration with stable, efficient policy updates.
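For intuition about the on-policy proportion, here is a hypothetical sketch of how a training batch might mix freshly generated (near on-policy) samples with reward-prioritized older samples. The parameter name `m` follows the text above, but its exact definition and the paper's sampling scheme may differ.

```python
# Hypothetical batch construction mixing recency and reward priority (not the paper's exact method).
import random


def build_batch(replay_buffer, batch_size, m):
    """Draw a fraction `m` of the batch from the most recent buffer entries
    (close to on-policy) and fill the rest with reward-prioritized samples
    from the full buffer. Buffer entries are (completion, reward) pairs."""
    n_recent = int(m * batch_size)
    recent = replay_buffer[-n_recent:] if n_recent > 0 else []

    # Reward-prioritized sampling for the remaining slots, with weights
    # proportional to reward (clipped away from zero for simplicity).
    n_offpolicy = batch_size - len(recent)
    weights = [max(r, 1e-6) for _, r in replay_buffer]
    offpolicy = random.choices(replay_buffer, weights=weights, k=n_offpolicy)
    return recent + offpolicy


if __name__ == "__main__":
    buf = [(f"x{i}", random.random()) for i in range(1000)]
    batch = build_batch(buf, batch_size=16, m=0.75)
    print(len(batch), "samples; first few:", batch[:3])
```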
In red-teaming experiments, TBA achieves up to a 7x speedup in wall-clock time relative to a synchronous setup. It maintains competitive, and sometimes superior, performance on the diversity-toxicity Pareto frontier, meaning it effectively balances broad exploration of adversarial prompts with the generation of high-toxicity (i.e., high-reward) cases.
@article{bartoldson2025tba,
  title   = {Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable {LLM} Post-Training},
  author  = {Brian R. Bartoldson and Siddarth Venkatraman and James Diffenderfer and Moksh Jain and Tal Ben-Nun and Seanie Lee and Minsu Kim and Johan Obando-Ceron and Yoshua Bengio and Bhavya Kailkhura},
  year    = {2025},
  journal = {arXiv preprint arXiv:2503.18929},
  url     = {https://arxiv.org/abs/2503.18929}
}