A comparative analysis evaluated the performance of 17 large language models (LLMs) running locally, focusing on their ability to use external tools via API calls (tool calling). The tests were conducted on a production MCP server, using 19 different tools and evaluating both "single-shot" and "agentic loop" scenarios.

Test Setup

The models were run on a machine equipped with an NVIDIA RTX 4080 GPU (16GB VRAM) and 64GB of RAM, via LM Studio. Models not specifically trained for tool calling were also included to assess whether basic reasoning abilities could compensate for the lack of fine-tuning.

The tasks were divided into three difficulty levels:

  • Level 0 (Explicit): The tool name and parameters are stated precisely in the prompt.
  • Level 1 (Natural Language): The request is phrased in natural language; the model must identify the correct tool and map the description to its parameters.
  • Level 2 (Reasoning): Only the high-level goal is given; the model must plan the sequence of calls and chain IDs between them.
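As an illustration, the three levels might look like the following for a single task. The prompts and the tool name are invented for this sketch, not taken from the benchmark; the tool definition uses the OpenAI-style function-calling schema that local servers such as LM Studio expose:

```python
# Hypothetical prompts for one task at each difficulty level
# (illustrative only — not from the actual test suite).
level_0 = 'Call create_ticket with title="Login fails" and priority="high".'
level_1 = "Open a high-priority ticket saying that login fails."
level_2 = "Investigate today's login failures and file whatever tickets are needed."

# A tool definition in the OpenAI-style function-calling format,
# which the model must match the request against at Levels 1 and 2.
create_ticket_tool = {
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Create a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["title"],
        },
    },
}
```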

Key Findings

  • The "agentic loop" approach significantly improved performance, especially on Level 2 tasks, where many models failed in the "single-shot" setting.
  • A 7B-parameter model, ibm/granite-4-h-tiny, achieved the highest overall score, outperforming models of up to 32B parameters.
  • Models not specifically trained for tool calling, such as ernie-4.5-21b and gemma-3-12b, showed remarkable improvements in the "agentic loop" approach.
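The difference between the two evaluation modes can be sketched in a few lines. In a single-shot test the model gets one inference pass; in an agentic loop, each tool result is fed back into the conversation and the model is called again until it stops requesting tools, which lets it plan a Level 2 task one step at a time. The sketch below stubs out the model and the tools (all names here are hypothetical, not from the benchmark):

```python
import json

def run_tool(name, args):
    # Hypothetical tool implementations for this sketch.
    if name == "find_user":
        return {"user_id": 42}
    if name == "get_orders":
        return {"orders": ["A1", "B2"]}
    return {}

def agentic_loop(model, goal, max_turns=8):
    """Call the model repeatedly, feeding each tool result back in."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        reply = model(messages)           # one inference step
        if "tool_call" not in reply:      # model produced a final answer
            return reply["content"]
        name, args = reply["tool_call"]
        result = run_tool(name, args)     # execute the requested tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None

def stub_model(messages):
    # Stands in for the LLM: it must see the first tool's result
    # (user_id) before it can issue the second call — the kind of
    # ID chaining that Level 2 tasks require.
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    if tool_turns == 0:
        return {"tool_call": ("find_user", {"email": "a@example.com"})}
    if tool_turns == 1:
        user_id = json.loads(messages[-1]["content"])["user_id"]
        return {"tool_call": ("get_orders", {"user_id": user_id})}
    return {"content": "User 42 has orders A1 and B2."}

print(agentic_loop(stub_model, "List the orders of a@example.com"))
```

A single-shot evaluation corresponds to `max_turns=1` with no feedback: the model would have to emit both calls, including the not-yet-known `user_id`, in one pass, which is exactly where many models failed.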

Implications

These results suggest that the inference methodology (agentic loop) can matter more for tool-calling capability than model size itself. For those evaluating on-premise deployments, the trade-offs to weigh are model size, hardware requirements, and inference architecture complexity. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs.