Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

1Lawrence Livermore National Laboratory 2Mila - Quebec AI Institute
3Université de Montréal 4KAIST 5CIFAR Fellow
GSM8K Results
TBA performs rapid, scalable exploration of model responses, improving RL efficiency on the GSM8K mathematical reasoning task.

Abstract

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.


Fine-tuning large language models

Setup

We study the problem of fine-tuning a pretrained language model with a reward model. To prevent issues such as spurious reward over-optimization, which can lead to poor performance and low diversity, the fine-tuning objective includes a KL-divergence constraint that keeps the updated model close to the original. This objective can be interpreted probabilistically as Bayesian posterior inference, with the optimal policy expressible as a reweighted version of the reference model. Under this probabilistic interpretation, on-policy RL amounts to amortized variational inference that minimizes the reverse KL divergence to the posterior density. However, reverse KL optimization is susceptible to mode collapse and requires on-policy samples. This reliance on on-policy samples limits scalability, since it precludes the use of replay buffers that can be populated in parallel at scale.
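Concretely, the KL-constrained objective and the reweighted (posterior) policy it induces take the standard form below, written with the same reward r, KL weight β, and reference model π_ref that appear in the loss later in this section:

$$ \begin{align*} \max_{\theta}\;\; \mathbb{E}_{\mathbf{x}\sim\mathcal{D},\, \mathbf{y}\sim\pi_{\theta}(\cdot\mid\mathbf{x})}\big[r(\mathbf{y};\mathbf{x})\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid\mathbf{x})\,\|\,\pi_{\text{ref}}(\cdot\mid\mathbf{x})\big), \qquad \pi^{*}(\mathbf{y}\mid\mathbf{x}) \;\propto\; \pi_{\text{ref}}(\mathbf{y}\mid\mathbf{x})\,\exp\!\big(r(\mathbf{y};\mathbf{x})/\beta\big). \end{align*}$$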

An alternative off-policy approach that enhances scalability and exploration is to use Generative Flow Networks (GFlowNets). GFlowNets cast the inference problem as sequential decision-making, where a policy constructs outputs token by token and learns to sample them proportionally to a reward function. Since GFlowNet objectives such as VarGrad (defined below) are off-policy, they support flexible exploration and efficient training with replay buffers, making them well suited for large-scale, distributed fine-tuning of language models.

$$ \begin{align*} \mathcal{L}_{\text{TB}}^{\text{VarGrad}}(\mathbf{B};\theta) = \frac{1}{BK}\sum_{i=1}^{B}\sum_{j=1}^{K} \Bigg( \log \hat{Z}(\mathbf{x}^{(i)}) + \sum_{t=1}^{T}\Big[\log \pi_{\theta}(y_t^{(i,j)} \mid y_{1:t-1}^{(i,j)},\mathbf{x}^{(i)}) \\ - \log \pi_{\text{ref}}(y_t^{(i,j)} \mid y_{1:t-1}^{(i,j)},\mathbf{x}^{(i)})\Big] - \frac{1}{\beta} r(\mathbf{y}^{(i,j)};\mathbf{x}^{(i)}) \Bigg)^2. \end{align*}$$
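A minimal PyTorch sketch of this loss is below, assuming per-token log-probabilities have already been gathered for the sampled responses; the tensor names and batch layout (policy_logprobs, ref_logprobs, rewards) are illustrative, not the paper's implementation.

```python
import torch

def vargrad_tb_loss(policy_logprobs, ref_logprobs, rewards, beta):
    """VarGrad-style trajectory balance loss (sketch).

    policy_logprobs: (B, K, T) per-token log pi_theta for K responses per prompt
    ref_logprobs:    (B, K, T) per-token log pi_ref for the same responses
    rewards:         (B, K)    scalar reward r(y; x) for each response
    beta:            KL weight (temperature applied to the reward)
    """
    # Sequence-level log-likelihood ratio: sum_t [log pi_theta - log pi_ref].
    log_ratio = (policy_logprobs - ref_logprobs).sum(dim=-1)    # (B, K)

    # Per-response residual without log Z: log ratio - r / beta.
    residual = log_ratio - rewards / beta                       # (B, K)

    # VarGrad estimate of log Z(x): the negative mean residual over the K
    # responses drawn for the same prompt; detaching treats it as a constant
    # (keeping it attached instead gives the batch variance of the residuals).
    log_z_hat = (-residual).mean(dim=1, keepdim=True).detach()  # (B, 1)

    # Squared trajectory-balance residual, averaged over the whole batch.
    return ((log_z_hat + residual) ** 2).mean()
```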

Trajectory Balance with Asynchrony

Image

In this work we introduce Trajectory Balance with Asynchrony (TBA), an asynchronous, distributed reinforcement learning (RL) system designed to accelerate and scale post-training of large language models (LLMs). The key idea is decoupling data generation (exploration) from policy updates (training), which enables more efficient resource utilization and dramatically reduces wall-clock training time.

TBA divides the system into two main types of nodes:

  • Searcher Nodes maintain a local copy of the policy and generate responses to queries from the dataset, storing them in a local buffer that is periodically synced to a global buffer.
  • Trainer Nodes asynchronously sample batches from the global buffer, alternating between on-policy and off-policy data, to update the policy (a minimal sketch of both loops follows this list).
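
The following single-machine sketch illustrates this decoupling with Python threads and an in-memory buffer. The names GlobalBuffer, generate_responses, tb_update, and the on-policy fraction m are hypothetical stand-ins for the distributed components described above, not the actual TBA implementation (which places searchers and the trainer on separate nodes).

```python
import random
import threading
from collections import deque

class GlobalBuffer:
    """Thread-safe replay buffer shared by searcher and trainer loops."""
    def __init__(self, maxlen=100_000):
        self.data = deque(maxlen=maxlen)
        self.lock = threading.Lock()

    def extend(self, trajectories):
        with self.lock:
            self.data.extend(trajectories)

    def sample(self, batch_size, m=0.5):
        """Mix near-on-policy data (most recently synced) with older
        off-policy data; m controls the on-policy fraction of the batch."""
        with self.lock:
            items = list(self.data)
        n_recent = int(m * batch_size)
        recent = items[-n_recent:] if n_recent else []
        rest = random.sample(items, min(batch_size - len(recent), len(items)))
        return recent + rest

def searcher_loop(buffer, policy_snapshot, dataset, stop, sync_every=64):
    """Searcher node: generate trajectories continuously, never blocking training."""
    local = []
    while not stop.is_set():
        query = random.choice(dataset)
        local.extend(generate_responses(policy_snapshot, query))  # hypothetical sampler
        if len(local) >= sync_every:   # periodically sync the local buffer to the global one
            buffer.extend(local)
            local = []

def trainer_loop(buffer, policy, num_steps=1_000, batch_size=256):
    """Trainer node: asynchronously sample mixed batches and apply TB updates."""
    for _ in range(num_steps):
        batch = buffer.sample(batch_size)
        tb_update(policy, batch)       # hypothetical TB/VarGrad gradient step

# Launch several searchers and one trainer concurrently, e.g.:
# stop = threading.Event()
# for _ in range(4):
#     threading.Thread(target=searcher_loop,
#                      args=(buffer, policy_snapshot, dataset, stop)).start()
# trainer_loop(buffer, policy); stop.set()
```

The essential property this sketch captures is that searchers never wait on gradient steps and the trainer never waits on generation; the global buffer is their only point of contact.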

Key Benefits and Design Considerations

  • Asynchronous and Distributed Design: By decoupling exploration from training, TBA allows the searcher nodes to continuously generate trajectories without waiting for the training process, leading to massive parallelism and high resource utilization.
  • Off-Policy Learning with Trajectory Balance Objective: Integrating the TB objective facilitates learning from diverse, off-policy data. This approach mitigates the common pitfalls of on-policy RL (such as mode collapse) while ensuring that updates derive from a broader and more varied set of experiences.
  • Scalability: The architecture is inherently scalable. As more searcher nodes are added, the volume and diversity of the trajectories in the global buffer increase, which in turn enhances exploration; this is especially beneficial in settings with sparse rewards (a sketch of reward-prioritized buffer sampling follows this list).
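
For sparse-reward settings, the trainer can also prioritize buffer samples by reward rather than recency, per the abstract's note that data is sampled "based on reward or recency". Below is a hedged sketch of such a sampler; the softmax weighting, the temperature parameter, and the dictionary keys are illustrative assumptions rather than the paper's exact scheme.

```python
import math
import random

def prioritized_sample(buffer_items, batch_size, prioritize_by="reward", temperature=1.0):
    """Draw a training batch from the global buffer (sketch).

    buffer_items: list of dicts with at least 'reward' and 'step' keys
    prioritize_by: 'reward' favors high-reward trajectories (useful when
                   rewards are sparse); 'step' (recency) favors near-on-policy data.
    """
    key = "reward" if prioritize_by == "reward" else "step"
    scores = [item[key] / temperature for item in buffer_items]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    return random.choices(buffer_items, weights=weights, k=batch_size)
```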


Experiments

Mathematical Reasoning

On mathematical reasoning tasks like GSM8K, TBA shows significant improvements over baseline methods. Its ability to explore diverse solution strategies while learning from successful examples helps the model develop stronger reasoning capabilities, and the decoupled exploration and learning process allows TBA to discover effective solution patterns more efficiently than traditional on-policy methods.


Preference fine-tuning

In preference fine-tuning experiments, TBA not only redefines the win-rate versus KL Pareto frontier, achieving higher win rates (around 82% with well-tuned hyperparameters) at lower or comparable KL values, but also delivers significant training speedups (approximately 5x faster than synchronous methods). Further ablations reveal that maintaining a moderately high on-policy proportion (controlled by the parameter m) is critical for maximizing win rate without sacrificing model stability. Overall, these experiments demonstrate that TBA effectively scales preference fine-tuning by combining rapid, diverse exploration with stable, efficient policy updates.

Image
TBA scales search and improves RL efficiency on the TL;DR summarization task.
Image
TBA defines a new KL vs. win-rate Pareto frontier for the TL;DR summarization task.

Red-teaming

In red-teaming experiments, TBA achieves up to a 7x speedup in wall-clock time relative to a synchronous setup. It maintains competitive—or even superior—performance on the diversity-toxicity Pareto frontier, meaning it effectively balances broad exploration of adversarial prompts with the generation of high-toxicity (i.e., high-reward) cases.

Image
TBA reaches the red-teaming diversity-toxicity Pareto frontier and improves as search is scaled.
Image
TBA speeds up the wall-clock time required to reach the Pareto frontier for the red-teaming task.

BibTeX


      @article{bartoldson2025tba,
        title={Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable {LLM} Post-Training},
        author={Brian R. Bartoldson and Siddarth Venkatraman and James Diffenderfer and Moksh Jain and Tal Ben-Nun and Seanie Lee and Minsu Kim and Johan Obando-Ceron and Yoshua Bengio and Bhavya Kailkhura},
        year={2025},
        journal={arXiv preprint arXiv:2503.18929},
        url={https://arxiv.org/abs/2503.18929}
      }