Needle: A Compact LLM for On-Device Intelligence

The generative artificial intelligence landscape is dominated by Large Language Models (LLMs) with billions of parameters, often requiring cloud infrastructure or high-end hardware for deployment. However, there is a growing need for efficient, compact AI solutions capable of operating directly on resource-constrained consumer devices. In this context, the Needle project emerges as an innovative response, releasing an open-source LLM with only 26 million parameters, specifically optimized for "tool calling" (also known as "function calling").

The initiative stems from frustration with the limited effort put into developing "agentic" models that can run on budget smartphones and other low-cost devices. Needle aims to bridge this gap, demonstrating that for specific tasks like tool calling, massive models are often overkill. The goal is to make on-device AI a practical reality, extending intelligent capabilities to a wide range of personal devices.

Revolutionary Architecture and Performance on Edge Devices

The core of Needle's innovation lies in its architecture, named Simple Attention Networks (SANs). Unlike traditional models that integrate feed-forward networks (FFNs) for fact memorization and reasoning, Needle relies exclusively on attention and gating mechanisms, completely eliminating MLPs. This architectural choice is based on the observation that tool calling is fundamentally a retrieval and assembly process (matching a query to a tool name, extracting argument values, and emitting JSON) rather than a complex reasoning task that would require extensive internal memorization capacity.
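The "attention and gating, no MLP" idea can be illustrated with a toy block. This is a minimal NumPy sketch, not Needle's actual SAN implementation; the gate placement, shapes, and weight names are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def san_block(x, wq, wk, wv, wg, wo):
    """Toy attention + gating block with no feed-forward network.

    x: (seq_len, d_model) token representations.
    Instead of the usual MLP sub-layer, an elementwise sigmoid gate
    scales the attention output -- mirroring the 'attention and
    gating only' idea described above.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # causal mask: each token attends only to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    attn = softmax(scores) @ v
    # gating in place of an FFN: a learned per-dimension sigmoid gate
    gate = 1.0 / (1.0 + np.exp(-(x @ wg)))
    return x + (gate * attn) @ wo  # residual connection

# tiny smoke test with random weights
rng = np.random.default_rng(0)
d, n = 8, 5
x = rng.normal(size=(n, d))
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
y = san_block(x, *params)
print(y.shape)  # (5, 8)
```

Because the block contains only matrix multiplies, a mask, and an elementwise gate, it has fewer parameters per layer than an attention-plus-MLP transformer block of the same width, which is consistent with the project's focus on small, fast on-device inference.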

This streamlined configuration allows Needle to achieve remarkable performance on consumer devices: 6,000 tokens per second in the prefill phase and 1,200 tokens per second in the decode phase. Training involved pre-training on 200 billion tokens, using 16 TPU v6e chips for 27 hours, followed by post-training on 2 billion tokens of synthesized function-calling data (generated via Gemini across 15 tool categories), completed in just 45 minutes. This efficiency in training and inference makes it an ideal candidate for integration into smartphones, smartwatches, and smart glasses.
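Taking the reported figures at face value, a quick back-of-envelope check gives the implied pre-training throughput (real throughput depends on sequence length, parallelism strategy, and hardware utilization, none of which are reported here):

```python
# Back-of-envelope check on the reported pre-training run.
tokens = 200e9          # pre-training tokens (reported)
chips = 16              # TPU v6e chips (reported)
hours = 27              # wall-clock time (reported)

seconds = hours * 3600
aggregate = tokens / seconds    # tokens/s across all chips
per_chip = aggregate / chips    # tokens/s per chip

print(f"{aggregate:,.0f} tokens/s aggregate, {per_chip:,.0f} tokens/s per chip")
```

The result, roughly two million tokens per second in aggregate, is plausible for a 26M-parameter model on modern accelerators, which supports the article's framing that very small models are cheap to train end to end.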

Implications for On-Premise Deployment and Data Sovereignty

While Needle is designed for consumer devices, its implications extend significantly to the context of on-premise deployment and AI on local infrastructures. The ability to run compact and efficient models on limited hardware opens new perspectives for companies that need to maintain full control over their data and inference operations. Needle's "no FFN" approach, which has proven effective in other contexts such as Retrieval-Augmented Generation (RAG) and external tool use, suggests that models do not need to memorize facts in FFN weights if those facts are provided directly in the input.
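The "facts in the input, not in the weights" pattern can be sketched concretely: the available tool schemas are placed in the prompt, and the model only has to match and assemble. The prompt format, tool schema, and helper names below are hypothetical illustrations, not Needle's actual interface.

```python
import json

# Hypothetical tool schema supplied in the prompt context rather than
# memorized in model weights; the exact format Needle expects may differ.
tools = [{
    "name": "set_alarm",
    "description": "Set an alarm on the device",
    "parameters": {"time": "HH:MM 24-hour string", "label": "string"},
}]

def build_prompt(query, tools):
    """Everything the model needs is provided directly in the input."""
    return (
        "Available tools:\n" + json.dumps(tools, indent=2)
        + f"\n\nUser: {query}\nTool call (JSON):"
    )

def parse_tool_call(model_output, tools):
    """Validate the model's emitted JSON against the in-context schema."""
    call = json.loads(model_output)
    names = {t["name"] for t in tools}
    if call.get("name") not in names:
        raise ValueError(f"unknown tool: {call.get('name')}")
    return call

prompt = build_prompt("Wake me at 7am for the gym", tools)
# A model response like the string below can be checked locally and
# dispatched to the device API, with no cloud round-trip:
call = parse_tool_call(
    '{"name": "set_alarm", "arguments": {"time": "07:00", "label": "gym"}}',
    tools,
)
print(call["name"])  # set_alarm
```

Because the schemas travel with the request, the same small model can serve different tool sets without retraining, which is exactly why a memorization-heavy FFN is unnecessary for this workload.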

This paradigm is particularly relevant for scenarios requiring high standards of data sovereignty, regulatory compliance, or air-gapped environments, where sending sensitive data to external cloud services is not an option. Running smaller, more efficient LLMs on bare metal or edge servers can reduce long-term Total Cost of Ownership (TCO) by minimizing energy and cooling costs, while also offering more predictable latency and throughput. For those evaluating on-premise deployment, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between costs, performance, and control.

Future Prospects and Development Context

Needle is not an isolated initiative but is part of a broader effort to make on-device AI a practical reality. The development team is also behind Cactus, an open-source inference engine specifically designed for mobile and wearables. This synergy between an optimized model and a dedicated inference engine promises to accelerate the adoption of AI across a wide range of personal devices.

Despite its small size, Needle has demonstrated the ability to outperform larger models such as FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function-calling tasks. It is worth noting, however, that these competing models have greater scope and capacity, and remain stronger in broader conversational contexts. Needle thus positions itself as a specialized solution, highly performant within its specific domain. The model is available under an MIT license, with weights and code accessible on Hugging Face and GitHub, encouraging the community to test it and fine-tune it for their own needs on a Mac or PC.