Jackpot: Efficient Reinforcement Learning for LLMs
A new study introduces Jackpot, a framework designed to make reinforcement learning (RL) for large language models (LLMs) more efficient. Training LLMs with RL is notoriously expensive, largely because of the computational cost of the rollout phase, in which the current model generates the responses that are then scored and used to update the policy.
Decoupling and Sampling
Jackpot addresses this challenge by decoupling rollout generation from policy optimization, so that a cheaper model can generate the rollouts; that is where the efficiency gains come from. However, the decoupling introduces a significant distribution mismatch between the rollout model and the policy being trained, which can destabilize learning.
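To see why the mismatch matters, consider the importance-sampling view used by the baselines the paper compares against: each off-policy sample is reweighted by the ratio of policy to rollout probabilities, and those ratios become heavy-tailed as the two models drift apart. The Python sketch below is purely illustrative (it is not code from the paper) and uses made-up log-probabilities to show the effect.

```python
# Illustrative only (not code from the paper): off-policy corrections reweight
# each rollout sample by the ratio pi_theta(x) / q(x). When the rollout model q
# drifts away from the policy pi_theta, these ratios become heavy-tailed and
# the resulting gradient estimates become noisy.
import numpy as np

rng = np.random.default_rng(0)

def importance_weights(logp_policy, logp_rollout):
    """Per-sample importance ratios pi_theta(x) / q(x), from log-probabilities."""
    return np.exp(logp_policy - logp_rollout)

# Made-up log-probabilities for 1,000 rollout samples under two scenarios.
logq = rng.normal(loc=-5.0, scale=1.0, size=1000)        # rollout model
logp_small = logq + rng.normal(0.0, 0.1, size=1000)      # small mismatch
logp_large = logq + rng.normal(0.0, 2.0, size=1000)      # large mismatch

for name, logp in [("small mismatch", logp_small), ("large mismatch", logp_large)]:
    w = importance_weights(logp, logq)
    print(f"{name}: mean weight {w.mean():.2f}, max weight {w.max():.2f}, std {w.std():.2f}")
```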
To mitigate this issue, Jackpot uses Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. The framework integrates an OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation based on top-$k$ probability estimation and batch-level bias correction.
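The paper's exact OBRS procedure is not reproduced here; the sketch below assumes a common formulation of budgeted rejection sampling, in which a rollout sample $x \sim q$ is accepted with probability $\min(1, c\,\pi(x)/q(x))$ and the constant $c$ is tuned so that the expected acceptance rate matches the budget. The function name and the bisection search are illustrative choices, and Jackpot's top-$k$ probability estimation and batch-level bias correction are omitted.

```python
# Minimal sketch (an assumption, not the paper's implementation) of budgeted
# rejection sampling over a batch of rollout samples. Each sample carries
# log-probabilities under the current policy pi and the rollout model q; a draw
# is accepted with probability min(1, c * pi(x) / q(x)), where c is found by
# bisection so that the expected acceptance rate matches the budget.
import numpy as np

def budgeted_rejection_mask(logp_policy, logp_rollout, budget, rng, iters=50):
    """Return a boolean accept mask with expected acceptance rate ~= budget."""
    ratios = np.exp(logp_policy - logp_rollout)          # pi(x) / q(x)

    def acceptance_rate(c):
        return np.minimum(1.0, c * ratios).mean()

    # Bisection on c: acceptance rate is monotonically increasing in c.
    lo, hi = 0.0, 1.0
    while acceptance_rate(hi) < budget and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptance_rate(mid) < budget:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)

    accept_prob = np.minimum(1.0, c * ratios)
    return rng.random(len(ratios)) < accept_prob

# Example: keep roughly half of a batch of rollout samples.
rng = np.random.default_rng(0)
logq = rng.normal(-5.0, 1.0, size=512)
logp = logq + rng.normal(0.0, 0.5, size=512)
mask = budgeted_rejection_mask(logp, logq, budget=0.5, rng=rng)
print(f"accepted {mask.sum()} / {mask.size} samples")
```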
Experimental Results
The theoretical analysis demonstrates that OBRS consistently moves the rollout distribution closer to the target distribution within a controllable acceptance budget. Empirical results show that Jackpot significantly improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64.
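As a rough intuition for the "moves closer" claim (a toy illustration, not the paper's analysis): on a discrete support, accepting draws from $q$ with probability $\min(1, c\,p(x)/q(x))$ yields a filtered distribution proportional to $\min(q(x), c\,p(x))$, and in the small example below its KL divergence to the target $p$ is smaller than that of the unfiltered $q$ for every budget tried.

```python
# Toy numeric check (illustrative assumption, not the paper's proof): filtering
# samples from q with a budgeted rejection rule yields a distribution closer to
# the target p. On a discrete support, the accepted-sample distribution is
# proportional to min(q(x), c * p(x)); larger c means a looser budget.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.40, 0.30, 0.20, 0.10])   # target (policy) distribution
q = np.array([0.10, 0.20, 0.30, 0.40])   # rollout distribution

for c in [0.5, 1.0, 2.0]:
    q_filtered = np.minimum(q, c * p)
    q_filtered /= q_filtered.sum()
    print(f"c={c}: KL(p||q)={kl(p, q):.3f} -> KL(p||q_filtered)={kl(p, q_filtered):.3f}")
```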