A comparative analysis evaluated the performance of 17 large language models (LLMs) running locally, focusing on their ability to use external tools via API calls (tool calling). The tests were conducted on a production MCP server, using 19 different tools and evaluating both "single-shot" and "agentic loop" scenarios.

Test Setup

The models were run on a machine equipped with an NVIDIA RTX 4080 GPU (16GB VRAM) and 64GB of RAM, via LM Studio. Models not specifically trained for tool calling were also included to assess whether basic reasoning abilities could compensate for the lack of fine-tuning.

The tasks were divided into three difficulty levels:

  • Level 0 (Explicit): The tool name and parameters are stated precisely in the prompt.
  • Level 1 (Natural Language): The request is phrased in natural language; the model must identify the correct tool and map the description to its parameters.
  • Level 2 (Reasoning): Only the high-level goal is given; the model must plan the sequence of calls and chain IDs between them.
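As an illustration, the three levels might look like the following for a single task. The prompts and the tool name are invented for this sketch, not taken from the benchmark; the tool definition uses the OpenAI-style function-calling schema that local servers such as LM Studio expose:

```python
# Hypothetical prompts for one task at each difficulty level
# (illustrative only — not from the actual test suite).
level_0 = 'Call create_ticket with title="Login fails" and priority="high".'
level_1 = "Open a high-priority ticket saying that login fails."
level_2 = "Investigate today's login failures and file whatever tickets are needed."

# A tool definition in the OpenAI-style function-calling format,
# which the model must match the request against at Levels 1 and 2.
create_ticket_tool = {
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Create a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["title"],
        },
    },
}
```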

Key Findings

  • The "agentic loop" approach significantly improved performance, especially on Level 2 tasks, where many models failed in the "single-shot" setting.
  • A 7B-parameter model, ibm/granite-4-h-tiny, achieved the highest overall score, outperforming models of up to 32B parameters.
  • Models not specifically trained for tool calling, such as ernie-4.5-21b and gemma-3-12b, showed remarkable improvements in the "agentic loop" approach.
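The difference between the two evaluation modes can be sketched in a few lines. In a single-shot test the model gets one inference pass; in an agentic loop, each tool result is fed back into the conversation and the model is called again until it stops requesting tools, which lets it plan a Level 2 task one step at a time. The sketch below stubs out the model and the tools (all names here are hypothetical, not from the benchmark):

```python
import json

def run_tool(name, args):
    # Hypothetical tool implementations for this sketch.
    if name == "find_user":
        return {"user_id": 42}
    if name == "get_orders":
        return {"orders": ["A1", "B2"]}
    return {}

def agentic_loop(model, goal, max_turns=8):
    """Call the model repeatedly, feeding each tool result back in."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        reply = model(messages)           # one inference step
        if "tool_call" not in reply:      # model produced a final answer
            return reply["content"]
        name, args = reply["tool_call"]
        result = run_tool(name, args)     # execute the requested tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None

def stub_model(messages):
    # Stands in for the LLM: it must see the first tool's result
    # (user_id) before it can issue the second call — the kind of
    # ID chaining that Level 2 tasks require.
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    if tool_turns == 0:
        return {"tool_call": ("find_user", {"email": "a@example.com"})}
    if tool_turns == 1:
        user_id = json.loads(messages[-1]["content"])["user_id"]
        return {"tool_call": ("get_orders", {"user_id": user_id})}
    return {"content": "User 42 has orders A1 and B2."}

print(agentic_loop(stub_model, "List the orders of a@example.com"))
```

A single-shot evaluation corresponds to `max_turns=1` with no feedback: the model would have to emit both calls, including the not-yet-known `user_id`, in one pass, which is exactly where many models failed.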

Implications

These results suggest that the inference methodology (agentic loop) can matter more for tool-calling capability than model size itself. For those evaluating on-premise deployments, the trade-offs to weigh are model size, hardware requirements, and inference architecture complexity. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs.