# Meta: Easier Reinforcement Learning with TorchForge and Weaver
Meta has introduced TorchForge, a PyTorch-native library designed to simplify reinforcement learning (RL) for large language models (LLMs). The library was developed to address the infrastructure challenges that slow down research and limit how quickly teams can iterate.
## TorchForge and Weaver: A Synergy for Large-Scale RL
In collaboration with Stanford and CoreWeave, the Meta team tested TorchForge on a 512-GPU cluster, using Weaver as a verification system. This allowed GRPO (Group Relative Policy Optimization) to be run at a scale and speed previously unattainable. The integration led to a simpler setup, more stable training, and a more efficient workflow from ideation to implementation.
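For context, GRPO replaces a learned value baseline with a group-relative one: several completions are sampled per prompt, and each reward is normalized against the mean and standard deviation of its own group. A minimal sketch of that advantage computation (illustrative only, not TorchForge code) might look like this:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each completion's reward against its own group's statistics.

    rewards has shape (num_prompts, group_size): one scalar reward per sampled
    completion. Illustrative sketch of GRPO's group baseline, not TorchForge's code.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 1.0, 1.0, 1.0]])
print(group_relative_advantages(rewards))
```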
TorchForge offers PyTorch-native RL primitives that scale from a single node to a multi-node cluster without added infrastructure complexity. Weaver, in turn, provides production-grade reward signals without human annotations or expensive API calls, while Monarch, Meta's distributed orchestration framework, coordinates the whole system with automatic fault tolerance.
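To make the division of labor concrete, here is a self-contained toy sketch of such a decoupled loop: generation and scoring feed a queue (the data plane) while a trainer consumes it and syncs weights back to the generators (the control plane). All class and method names are hypothetical stand-ins, not the actual TorchForge, vLLM, or Monarch APIs:

```python
import asyncio
import random

# Toy stand-ins so the sketch runs end to end. In a real deployment these
# would be distributed services (e.g. vLLM generators, a Weaver-style
# verifier ensemble, a TorchTitan trainer); names here are illustrative.
class Generator:
    async def generate(self, prompt: str) -> str:
        return prompt + " ... <completion>"

class Verifier:
    async def score(self, prompt: str, completion: str) -> float:
        return random.random()  # placeholder reward signal

class Trainer:
    def __init__(self) -> None:
        self.version = 0
    async def step(self, batch: list) -> int:
        self.version += 1  # pretend we ran one GRPO update
        return self.version

async def main() -> None:
    gen, verifier, trainer = Generator(), Verifier(), Trainer()
    queue: asyncio.Queue = asyncio.Queue()
    prompts = ["2 + 2 = ?", "Capital of France?"]

    async def produce() -> None:
        # Data plane: sample completions and score them, independent of training.
        for prompt in prompts:
            completion = await gen.generate(prompt)
            reward = await verifier.score(prompt, completion)
            await queue.put((prompt, completion, reward))

    async def consume(num_steps: int) -> None:
        # Control plane: pull experience, update the policy, sync weights back.
        for _ in range(num_steps):
            batch = [await queue.get()]
            version = await trainer.step(batch)
            print(f"policy weights synced to generators, version {version}")

    await asyncio.gather(produce(), consume(num_steps=len(prompts)))

asyncio.run(main())
```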
## Key Features of TorchForge
* Pseudocode-style RL APIs.
* Flexible synchronicity.
* Monarch service abstractions.
* Decoupled control and data planes.
* TorchStore in-memory weight sync.
* Proven end-to-end components (vLLM, TorchTitan).
* Heterogeneous, ephemeral scaling.
* Integration of custom rewards and verifiers (e.g., Weaver).
* Robust and reproducible pipelines.
* Extensibility via first-class environments and tools.
## Weaver: A Verifier for Reasoning
Weaver is a verification system designed to bridge the gap between generation and verification in large language models. It aggregates multiple smaller verifiers to create a more effective verification engine. This automated system eliminates the need for continuous human annotations and reduces reliance on expensive frontier model APIs.
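As a rough illustration of the aggregation idea, consider combining several cheap, imperfect checks into a single reward signal with a weighted vote. This is a toy sketch only; Weaver's actual method learns how much to trust each verifier without labeled data, and the verifiers and weights below are hypothetical:

```python
from typing import Callable, Sequence

# (question, answer) -> score in [0, 1]
WeakVerifier = Callable[[str, str], float]

def aggregate_score(question: str, answer: str,
                    verifiers: Sequence[WeakVerifier],
                    weights: Sequence[float]) -> float:
    """Weighted vote over weak verifier scores (toy aggregation, not Weaver's algorithm)."""
    scores = [verify(question, answer) for verify in verifiers]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical weak verifiers: a format check and a length heuristic.
def ends_with_number(question: str, answer: str) -> float:
    return 1.0 if answer.strip().split()[-1].isdigit() else 0.0

def is_not_too_short(question: str, answer: str) -> float:
    return 1.0 if len(answer.split()) >= 3 else 0.0

reward = aggregate_score("What is 2 + 2?", "The answer is 4",
                         verifiers=[ends_with_number, is_not_too_short],
                         weights=[0.7, 0.3])
print(reward)  # 1.0 for this toy example
```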
## Experimental Results
The tests compared three reward approaches on Qwen3-8B-Base and Qwen3-32B-Base models:
* Single reward model (RM) without annotations.
* Weaver without annotations.
* Annotated training samples.
The combination of TorchForge and Weaver produced significantly better results than a single reward model on the Math, GPQA, and MMLU Pro benchmarks. On GPQA with Qwen3-8B, the pipeline closed 63% of the gap between the single-RM baseline and training on annotated samples.
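To make the "gap closed" metric concrete, here is a tiny worked example with made-up numbers (they are not the reported scores):

```python
# Hypothetical accuracies purely to illustrate the "gap closed" metric.
single_rm = 50.0      # accuracy with a single reward model, no annotations
annotated = 60.0      # accuracy when training on annotated samples
pipeline = single_rm + 0.63 * (annotated - single_rm)
print(pipeline)       # 56.3 -> 63% of the way from the baseline to the annotated result
```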