Jackpot: Efficient Reinforcement Learning for LLMs
A new study introduces Jackpot, a framework designed to make reinforcement learning (RL) for large language models (LLMs) more efficient. Training LLMs with RL is notoriously expensive, largely because of the computational cost of the rollout phase, in which the current model generates the responses that are then scored and used to update the policy.
Decoupling and Sampling
Jackpot addresses this challenge by decoupling rollout generation from policy optimization, so that a cheaper model can generate the rollouts; that is where the efficiency gains come from. However, the decoupling introduces a significant distribution mismatch between the rollout model and the policy being trained, which can destabilize learning.
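To see why the mismatch matters, consider the importance-sampling view used by the baselines the paper compares against: each off-policy sample is reweighted by the ratio of policy to rollout probabilities, and those ratios become heavy-tailed as the two models drift apart. The Python sketch below is purely illustrative (it is not code from the paper) and uses made-up log-probabilities to show the effect.

```python
# Illustrative only (not code from the paper): off-policy corrections reweight
# each rollout sample by the ratio pi_theta(x) / q(x). When the rollout model q
# drifts away from the policy pi_theta, these ratios become heavy-tailed and
# the resulting gradient estimates become noisy.
import numpy as np

rng = np.random.default_rng(0)

def importance_weights(logp_policy, logp_rollout):
    """Per-sample importance ratios pi_theta(x) / q(x), from log-probabilities."""
    return np.exp(logp_policy - logp_rollout)

# Made-up log-probabilities for 1,000 rollout samples under two scenarios.
logq = rng.normal(loc=-5.0, scale=1.0, size=1000)        # rollout model
logp_small = logq + rng.normal(0.0, 0.1, size=1000)      # small mismatch
logp_large = logq + rng.normal(0.0, 2.0, size=1000)      # large mismatch

for name, logp in [("small mismatch", logp_small), ("large mismatch", logp_large)]:
    w = importance_weights(logp, logq)
    print(f"{name}: mean weight {w.mean():.2f}, max weight {w.max():.2f}, std {w.std():.2f}")
```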
To mitigate this issue, Jackpot uses Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. The framework integrates an OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation based on top-$k$ probability estimation and batch-level bias correction.
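The paper's exact OBRS procedure is not reproduced here; the sketch below assumes a common formulation of budgeted rejection sampling, in which a rollout sample $x \sim q$ is accepted with probability $\min(1, c\,\pi(x)/q(x))$ and the constant $c$ is tuned so that the expected acceptance rate matches the budget. The function name and the bisection search are illustrative choices, and Jackpot's top-$k$ probability estimation and batch-level bias correction are omitted.

```python
# Minimal sketch (an assumption, not the paper's implementation) of budgeted
# rejection sampling over a batch of rollout samples. Each sample carries
# log-probabilities under the current policy pi and the rollout model q; a draw
# is accepted with probability min(1, c * pi(x) / q(x)), where c is found by
# bisection so that the expected acceptance rate matches the budget.
import numpy as np

def budgeted_rejection_mask(logp_policy, logp_rollout, budget, rng, iters=50):
    """Return a boolean accept mask with expected acceptance rate ~= budget."""
    ratios = np.exp(logp_policy - logp_rollout)          # pi(x) / q(x)

    def acceptance_rate(c):
        return np.minimum(1.0, c * ratios).mean()

    # Bisection on c: acceptance rate is monotonically increasing in c.
    lo, hi = 0.0, 1.0
    while acceptance_rate(hi) < budget and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptance_rate(mid) < budget:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)

    accept_prob = np.minimum(1.0, c * ratios)
    return rng.random(len(ratios)) < accept_prob

# Example: keep roughly half of a batch of rollout samples.
rng = np.random.default_rng(0)
logq = rng.normal(-5.0, 1.0, size=512)
logp = logq + rng.normal(0.0, 0.5, size=512)
mask = budgeted_rejection_mask(logp, logq, budget=0.5, rng=rng)
print(f"accepted {mask.sum()} / {mask.size} samples")
```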
Experimental Results
The theoretical analysis demonstrates that OBRS consistently moves the rollout distribution closer to the target distribution within a controllable acceptance budget. Empirical results show that Jackpot significantly improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64.
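As a rough intuition for the "moves closer" claim (a toy illustration, not the paper's analysis): on a discrete support, accepting draws from $q$ with probability $\min(1, c\,p(x)/q(x))$ yields a filtered distribution proportional to $\min(q(x), c\,p(x))$, and in the small example below its KL divergence to the target $p$ is smaller than that of the unfiltered $q$ for every budget tried.

```python
# Toy numeric check (illustrative assumption, not the paper's proof): filtering
# samples from q with a budgeted rejection rule yields a distribution closer to
# the target p. On a discrete support, the accepted-sample distribution is
# proportional to min(q(x), c * p(x)); larger c means a looser budget.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.40, 0.30, 0.20, 0.10])   # target (policy) distribution
q = np.array([0.10, 0.20, 0.30, 0.40])   # rollout distribution

for c in [0.5, 1.0, 2.0]:
    q_filtered = np.minimum(q, c * p)
    q_filtered /= q_filtered.sum()
    print(f"c={c}: KL(p||q)={kl(p, q):.3f} -> KL(p||q_filtered)={kl(p, q_filtered):.3f}")
```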