The ability of a large language model (LLM) to use external tools is crucial, but robust generalization across different tasks and toolsets remains a challenge.

DIVE: A New Approach

DIVE (Diversity in Agentic Task Synthesis) is a method that aims to improve the generalization of LLMs in tool use. It inverts the usual order of task synthesis: instead of writing a task first and then finding tools to solve it, DIVE first executes a variety of real-world tools and then derives tasks from the resulting execution traces. Because every task is grounded in a trace that actually ran, tasks are executable and verifiable by construction.
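To make the execute-first, derive-second idea concrete, here is a minimal sketch of such an inverted pipeline. The tool names, the trace format, and the `run_tool`/`derive_task` helpers are illustrative assumptions, not the paper's actual API:

```python
import random

def run_tool(name, args):
    """Stand-in for executing a real tool; returns a mock observation."""
    return {"tool": name, "args": args, "result": f"output of {name}"}

def collect_evidence(tool_pool, max_steps=3):
    """Evidence collection: execute a random chain of tools and record the trace."""
    trace = []
    for _ in range(random.randint(1, max_steps)):
        tool = random.choice(tool_pool)
        trace.append(run_tool(tool, args={}))
    return trace

def derive_task(trace):
    """Task derivation: produce a task whose answer is grounded in the trace,
    so it is executable and verifiable by construction."""
    steps = " -> ".join(step["tool"] for step in trace)
    return {
        "task": f"Complete a goal that requires the tool chain: {steps}",
        "verifiable_answer": trace[-1]["result"],
    }

pool = ["search_flights", "get_weather", "book_hotel"]
task = derive_task(collect_evidence(pool))
```

The key property the sketch preserves is the direction of data flow: the trace exists before the task does, so the derived task can never reference a tool call that fails to execute.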

Diversity and Performance

DIVE scales structural diversity along two controllable axes: tool-pool coverage and per-task toolset variety. An Evidence Collection–Task Derivation loop induces rich multi-step tool-use patterns across 373 tools in five domains. Training the Qwen3-8B model on DIVE data (48k SFT examples + 3.2k RL examples) yielded an average improvement of +22 points across nine out-of-distribution (OOD) benchmarks, outperforming the strongest 8B baseline by +68 points. Analysis showed that scaling diversity consistently beats scaling quantity for OOD generalization, even when using 4x less data.
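The two diversity axes above can be pictured as knobs on a toolset sampler. The sketch below is an assumption about how such knobs might be exposed, not DIVE's actual sampling algorithm: `coverage` controls what fraction of the full tool pool is eligible (tool-pool coverage), and `variety` controls how many distinct toolsets are drawn (per-task toolset variety).

```python
import random

def sample_toolsets(tool_pool, coverage, variety, set_size=3, seed=0):
    """Draw `variety` distinct toolsets of `set_size` tools each, restricted
    to a `coverage` fraction of the full pool. Hypothetical illustration of
    the two diversity axes."""
    rng = random.Random(seed)
    # Tool-pool coverage: restrict sampling to a subset of the pool.
    n_eligible = max(set_size, int(coverage * len(tool_pool)))
    eligible = rng.sample(tool_pool, n_eligible)
    # Per-task toolset variety: keep drawing until we have enough
    # distinct tool combinations.
    toolsets = set()
    while len(toolsets) < variety:
        toolsets.add(tuple(sorted(rng.sample(eligible, set_size))))
    return [list(ts) for ts in toolsets]

pool = [f"tool_{i}" for i in range(373)]  # 373 tools, as in the paper
toolsets = sample_toolsets(pool, coverage=0.5, variety=10)
```

Under this framing, quantity scaling would draw more tasks from the same toolsets, while diversity scaling raises `coverage` and `variety` so the same data budget spans more distinct tool combinations.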