PROPEL: Optimizing Task Generation for LLM Training with Reinforcement Learning

Overcoming the Bottleneck in AI Agent Training

Training agents through Reinforcement Learning (RL) is increasingly constrained by a scarce resource: "frontier tasks." These are valid, solvable tasks that are just difficult enough to teach the current model something. As Large Language Models (LLMs) and agentic models improve, predefined or fixed task distributions quickly become saturated. At the same time, naive synthetic task generation often produces tasks that are trivial, impossible, or ill-posed, and therefore useless for training.

This scarcity of quality tasks is a bottleneck that slows progress in training complex AI systems. The ideal solution would be to train a task generator with RL itself, optimizing for the validity and learnability of the tasks it produces. However, evaluating each generated task is computationally expensive, which makes this approach hard to apply directly.

PROPEL: A Framework for Optimizing Evaluation Costs

Direct optimization of a task generator requires repeated solver rollouts (the model attempting to solve the task) for each candidate task generated. For software-engineering (SWE) tasks, a single solver rollout can take tens of minutes. Training a generator with the solver "in-the-loop" therefore becomes intractable in both time and compute. Tying up GPUs for that long on every evaluation is prohibitive, especially for teams running on-premise infrastructure with fixed budgets and hardware capacity.
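To make the scale of the problem concrete, the following back-of-the-envelope sketch compares solver compute for in-the-loop evaluation with a one-time labeling pass followed by cheap forward passes. The function names and every number in the usage example (step count, batch size, rollouts per task, corpus size, forward-pass time) are hypothetical placeholders chosen for illustration, not figures from the PROPEL work.

```python
def in_loop_hours(steps, tasks_per_step, rollouts_per_task, rollout_minutes):
    """Solver compute (hours) when every generated task is scored by real rollouts."""
    total_minutes = steps * tasks_per_step * rollouts_per_task * rollout_minutes
    return total_minutes / 60.0


def amortized_hours(corpus_tasks, rollouts_per_task, rollout_minutes,
                    steps, tasks_per_step, forward_seconds):
    """Compute (hours) when rollouts are paid once to label a corpus,
    and each generated task is then scored by a single forward pass."""
    labeling_minutes = corpus_tasks * rollouts_per_task * rollout_minutes
    scoring_seconds = steps * tasks_per_step * forward_seconds
    return labeling_minutes / 60.0 + scoring_seconds / 3600.0


# Hypothetical workload: all values below are illustrative assumptions.
workload = dict(steps=1000, tasks_per_step=64)
direct = in_loop_hours(rollouts_per_task=8, rollout_minutes=20, **workload)
amortized = amortized_hours(corpus_tasks=5000, rollouts_per_task=8,
                            rollout_minutes=20, forward_seconds=1.0, **workload)
print(f"in-the-loop: {direct:.0f} h, amortized: {amortized:.0f} h")
```

The point of the comparison is structural rather than numerical: in-the-loop cost grows with the number of optimization steps, while the amortized labeling cost is paid once and only the forward-pass term grows with training length.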

To address this problem, PROPEL was introduced as a solver-amortized framework for training task generators toward a targeted solve rate. It relies on a lightweight "activation probe" that is trained once on a labeled corpus of generated tasks and their solver outcomes. The probe then predicts the target solver's pass rate from a frozen generator reference model and serves as a proxy for the solve rate during generator optimization. Each evaluation of the generator is thereby reduced to a single forward pass, with no solver rollouts needed inside the training loop.
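The idea of a probe standing in for the solver can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not PROPEL's actual implementation: the probe is assumed to be a simple logistic regression, the "activations" are synthetic random vectors rather than hidden states from a real frozen model, and the reward shape (negative distance to a target pass rate of 0.5) is a hypothetical choice.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_pass_rate(w, b, act):
    """Probe forward pass: one activation vector -> predicted pass rate in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, act)) + b)


def train_probe(acts, pass_rates, lr=0.5, steps=500):
    """Fit the probe once by gradient descent on cross-entropy with soft labels."""
    n, d = len(acts), len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(acts, pass_rates):
            err = predict_pass_rate(w, b, x) - y  # d(loss)/d(logit)
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def proxy_reward(predicted_rate, target=0.5):
    """Assumed reward shape: maximal (zero) when the predicted rate hits the target."""
    return -abs(predicted_rate - target)


# Synthetic stand-in for the labeled corpus (activations + measured pass rates).
rng = random.Random(0)
hidden_w = [1.0, -0.5, 0.8, 0.3]  # hypothetical ground-truth direction


def sample_act():
    return [rng.gauss(0.0, 1.0) for _ in hidden_w]


train_acts = [sample_act() for _ in range(200)]
train_rates = [predict_pass_rate(hidden_w, 0.0, a) for a in train_acts]
w, b = train_probe(train_acts, train_rates)

# Held-out check: how closely does the probe track the "true" pass rate?
test_acts = [sample_act() for _ in range(50)]
mae = sum(abs(predict_pass_rate(w, b, a) - predict_pass_rate(hidden_w, 0.0, a))
          for a in test_acts) / len(test_acts)
rewards = [proxy_reward(predict_pass_rate(w, b, a)) for a in test_acts]
```

During generator optimization, only `predict_pass_rate` and `proxy_reward` would be called per candidate task, which is what makes the evaluation cost a single forward pass instead of a solver rollout.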

Impact and Implications for LLM Workloads

PROPEL was evaluated across multiple model scales and domains, including math, code, and software engineering, and in each case it shifted task generation toward the desired solve rate. For coding tasks, the percentage of generated tasks at the "learnable frontier" increased from 10.1% to 20.0% for a Qwen2.5-3B-Instruct solver and from 5.3% to 12.6% for a Qwen2.5-7B-Instruct solver.

In software engineering, PROPEL doubled the share of generations at the targeted solve rate, from 9.8% to 19.6%, for the Qwen3.5-27B model. These improvements held even on repositories not seen during the training of the probe and generator, which suggests the method generalizes beyond its training data. This computational efficiency matters most for organizations running LLMs on self-hosted or hybrid infrastructure, where making full use of hardware and reducing the Total Cost of Ownership (TCO) are priorities.
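A metric of this kind, the share of generated tasks that land near a targeted solve rate, can be computed along the following lines. This is a sketch under assumptions: the band limits (0.3 to 0.7) and the helper names are hypothetical, and the source does not state how the frontier was defined in the reported experiments.

```python
def empirical_pass_rate(outcomes):
    """Fraction of solver attempts that succeeded for one task."""
    return sum(outcomes) / len(outcomes)


def frontier_share(pass_rates, low=0.3, high=0.7):
    """Share of tasks whose pass rate falls inside an assumed frontier band."""
    if not pass_rates:
        return 0.0
    in_band = sum(1 for rate in pass_rates if low <= rate <= high)
    return in_band / len(pass_rates)


# Each inner list holds pass/fail results of repeated solver attempts on one task.
attempts_per_task = [
    [True, True, True, True],      # too easy
    [False, False, False, False],  # too hard or ill-posed
    [True, False, True, False],    # inside the assumed band
    [False, True, False, False],   # just below the assumed band
]
rates = [empirical_pass_rate(a) for a in attempts_per_task]
share = frontier_share(rates)
```

Tasks that are always solved or never solved fall outside the band, so the share drops when a generator drifts toward trivial or impossible tasks.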

Future Prospects for Efficient Training

PROPEL is a step toward more efficient Reinforcement Learning-based training for Large Language Models. By cutting the computational cost of task evaluation, the framework makes it more practical to develop capable and robust AI agents. For CTOs, infrastructure architects, and DevOps leads, approaches like PROPEL can help get more out of available hardware, whether in on-premise GPU clusters or hybrid configurations.

Efficiently generating high-quality tasks not only accelerates training but also allows broader and more complex problem spaces to be explored. This is particularly relevant where data sovereignty and direct control over infrastructure are non-negotiable requirements. Continued research in this direction could help make the training of larger and more sophisticated models economically and technically sustainable.