Small LLM and Tool-Calling Benchmark
A recent benchmark evaluated 21 small language models on their ability to use external tools. The focus was on whether each model could determine when a tool call was appropriate, not merely whether it could produce one.
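To make the evaluation concrete, a call/no-call test can be reduced to prompts paired with an expected decision. The cases and the scoring function below are illustrative guesses at the setup, not the benchmark's actual data or its Agent Score formula:

```python
# Hypothetical test cases for "should the model call a tool?" --
# prompts, tool names, and field names are invented for illustration.
CASES = [
    {"prompt": "What's 37 * 43?",               "tools": ["calculator"],  "should_call": True},
    {"prompt": "What's the capital of France?", "tools": ["calculator"],  "should_call": False},
    {"prompt": "What's the weather in Oslo?",   "tools": ["get_weather"], "should_call": True},
]

def score(decisions):
    """Fraction of cases where the model's call/no-call decision matched.

    `decisions` is a list of booleans, one per case: True means the
    model emitted a tool call, False means it answered directly.
    """
    correct = sum(
        1 for case, called in zip(CASES, decisions)
        if called == case["should_call"]
    )
    return correct / len(CASES)
```

Under this framing, a model that calls a tool on every prompt is penalized on the second case, which is the conservative-behavior effect the article describes.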
Key Findings
Four models tied for #1 with an Agent Score of 0.880:
- lfm2.5:1.2b
- qwen3:0.6b
- qwen3:4b
- phi4-mini:3.8b
The biggest surprise was the excellent performance of lfm2.5:1.2b, a 1.2-billion-parameter state-space hybrid model, which also recorded the lowest latency among the top models (approximately 1.5 seconds).
Interestingly, the ranking within the Qwen3 family is non-monotonic: the 0.6B version matched the 4B and outperformed the 1.7B. The 1.7B version appears to sit in a "capability valley": aggressive enough to call tools, but not capable enough to discern when not to.
The Importance of Parsing
The analysis highlighted that how tool calls are parsed matters as much as the test itself. Five models required custom parsers due to non-standard output formats:
- lfm2.5: bracket notation
- jan-v3: raw JSON
- gemma3: function syntax inside tags
- deepseek-r1: bare function calls
- smollm3: occasional omission of tags
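The article describes these format variants only loosely. The sketch below shows what a normalizing parser for such outputs might look like, assuming the variants resemble the examples in the comments; it is not the benchmark's actual code:

```python
import json
import re

def parse_tool_call(output: str):
    """Try to normalize a model's tool-call output into (name, args).

    Returns None when no tool call is detected, i.e. the model chose
    to answer directly.
    """
    text = output.strip()
    # Wrapper tags, e.g. <tool_call>get_weather(city="Paris")</tool_call>
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if m:
        text = m.group(1).strip()
    # Bracket notation, e.g. [get_weather(city="Paris")]
    if text.startswith("[") and text.endswith("]") and not text.startswith("[{"):
        text = text[1:-1].strip()
    # Raw JSON, e.g. {"name": "get_weather", "arguments": {"city": "Paris"}}
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "name" in obj:
            return obj["name"], obj.get("arguments", {})
    except json.JSONDecodeError:
        pass
    # Bare function-call syntax, e.g. get_weather(city="Paris")
    m = re.match(r"(\w+)\((.*)\)\s*$", text, re.DOTALL)
    if m:
        name, argstr = m.group(1), m.group(2)
        args = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', argstr))
        return name, args
    return None  # plain text: the model declined to call a tool
```

A parser like this also illustrates why fixing it can move scores in either direction: outputs previously misread as "no call" become calls, and vice versa.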
Fixing the parser does not always improve a model's score. For example, lfm2.5 improved significantly (from 0.640 to 0.880) after its parser was fixed, while gemma3 dropped (from 0.600 to 0.550). This demonstrates that benchmarks that ignore output format can either overestimate or underestimate a model's capabilities.
Final Thoughts
The results suggest that local tool-calling agents can operate effectively on commodity hardware. The number of parameters is not a reliable indicator of performance, and conservative behavior (avoiding acting on uncertain prompts) can lead to better results.