Small LLM and Tool-Calling Benchmark
A recent benchmark evaluated 21 small language models on their ability to use external tools. The focus was on whether each model could determine when a tool call was appropriate, not merely whether it could produce one.
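To make the evaluation concrete, a call/no-call test can be reduced to prompts paired with an expected decision. The cases and the scoring function below are illustrative guesses at the setup, not the benchmark's actual data or its Agent Score formula:

```python
# Hypothetical test cases for "should the model call a tool?" --
# prompts, tool names, and field names are invented for illustration.
CASES = [
    {"prompt": "What's 37 * 43?",               "tools": ["calculator"],  "should_call": True},
    {"prompt": "What's the capital of France?", "tools": ["calculator"],  "should_call": False},
    {"prompt": "What's the weather in Oslo?",   "tools": ["get_weather"], "should_call": True},
]

def score(decisions):
    """Fraction of cases where the model's call/no-call decision matched.

    `decisions` is a list of booleans, one per case: True means the
    model emitted a tool call, False means it answered directly.
    """
    correct = sum(
        1 for case, called in zip(CASES, decisions)
        if called == case["should_call"]
    )
    return correct / len(CASES)
```

Under this framing, a model that calls a tool on every prompt is penalized on the second case, which is the conservative-behavior effect the article describes.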
Key Findings
Four models tied for #1 with an Agent Score of 0.880:
- lfm2.5:1.2b
- qwen3:0.6b
- qwen3:4b
- phi4-mini:3.8b
The biggest surprise was the excellent performance of lfm2.5:1.2b, a 1.2-billion-parameter state-space hybrid model, which also recorded the lowest latency among the top models (approximately 1.5 seconds).
Interestingly, the ranking within the Qwen3 family is non-monotonic: the 0.6B version matched the 4B and outperformed the 1.7B. The 1.7B version appears to sit in a "capability valley": aggressive enough to call tools, but not capable enough to discern when not to.
The Importance of Parsing
The analysis highlighted that how tool calls are parsed matters as much as the test itself. Five models required custom parsers due to non-standard output formats:
- lfm2.5: bracket notation
- jan-v3: raw JSON
- gemma3: function syntax inside tags
- deepseek-r1: bare function calls
- smollm3: occasional omission of tags
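The article describes these format variants only loosely. The sketch below shows what a normalizing parser for such outputs might look like, assuming the variants resemble the examples in the comments; it is not the benchmark's actual code:

```python
import json
import re

def parse_tool_call(output: str):
    """Try to normalize a model's tool-call output into (name, args).

    Returns None when no tool call is detected, i.e. the model chose
    to answer directly.
    """
    text = output.strip()
    # Wrapper tags, e.g. <tool_call>get_weather(city="Paris")</tool_call>
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if m:
        text = m.group(1).strip()
    # Bracket notation, e.g. [get_weather(city="Paris")]
    if text.startswith("[") and text.endswith("]") and not text.startswith("[{"):
        text = text[1:-1].strip()
    # Raw JSON, e.g. {"name": "get_weather", "arguments": {"city": "Paris"}}
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "name" in obj:
            return obj["name"], obj.get("arguments", {})
    except json.JSONDecodeError:
        pass
    # Bare function-call syntax, e.g. get_weather(city="Paris")
    m = re.match(r"(\w+)\((.*)\)\s*$", text, re.DOTALL)
    if m:
        name, argstr = m.group(1), m.group(2)
        args = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', argstr))
        return name, args
    return None  # plain text: the model declined to call a tool
```

A parser like this also illustrates why fixing it can move scores in either direction: outputs previously misread as "no call" become calls, and vice versa.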
Fixing the parser does not always improve a model's score. For example, lfm2.5 improved significantly (from 0.640 to 0.880) after its parser was fixed, while gemma3 dropped (from 0.600 to 0.550). This demonstrates that benchmarks that ignore output format can either overestimate or underestimate a model's capabilities.
Final Thoughts
The results suggest that local tool-calling agents can operate effectively on commodity hardware. The number of parameters is not a reliable indicator of performance, and conservative behavior (avoiding acting on uncertain prompts) can lead to better results.