Sub-Agents on Local Hardware: Optimizing LLMs with Limited VRAM

The Challenge of Sub-Agents in On-Premise Environments

The adoption of Large Language Models (LLMs) has opened new frontiers for automation and intelligent assistance, especially through the use of sub-agents capable of splitting and managing complex tasks. While cloud environments offer almost unlimited resources for these operations, deploying LLMs in on-premise or self-hosted contexts presents significant challenges, particularly regarding VRAM availability and cache management. Many sub-agent implementations are not designed to operate in environments with limited hardware resources, making it difficult for developers to replicate the capabilities of cloud systems on their local servers.

This discrepancy drives IT specialists and decision-makers to seek innovative solutions that balance performance, costs, and data control. The ability to run LLMs and their advanced functionalities locally is crucial for sectors requiring high standards of security, data sovereignty, and regulatory compliance, where transferring sensitive information to external cloud services is not always a viable option.

Optimization with Limited VRAM: A Custom Approach

To work within the limits of a low-VRAM configuration, a user developed a customized solution. The primary challenge was operating with only 10 GB of VRAM and a single slot for the KV (key-value) cache, which was already quantized. Standard sub-agent implementations typically cannot handle such constraints, because they assume enough resources to load and manage multiple models or contexts at the same time.
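To see why the KV cache matters so much on a 10 GB card, a rough size estimate helps. The sketch below is an illustration, not a measurement: the layer count, KV head count, and head dimension are hypothetical placeholders and not the real architecture of any Qwen model, and the 200,000-token figure simply mirrors the context sizes reported later in the article. The only fixed facts are the storage costs: 2 bytes per value for f16, and 34 bytes per block of 32 values for llama.cpp's q8_0 type.

```python
# Bytes per cached value for two llama.cpp cache types.
# f16 stores 2 bytes per value; q8_0 stores blocks of 32 int8 values
# plus one 2-byte scale, i.e. 34 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_tokens, n_kv_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size for one slot.

    Every cached layer stores one key and one value vector of
    n_kv_heads * head_dim values per token.
    """
    values_per_token = 2 * n_kv_layers * n_kv_heads * head_dim  # 2 = K and V
    return n_tokens * values_per_token * BYTES_PER_ELEMENT[cache_type]


# HYPOTHETICAL dimensions, chosen only to make the arithmetic concrete.
example_dims = dict(n_kv_layers=10, n_kv_heads=2, head_dim=256)

for cache_type in ("f16", "q8_0"):
    size = kv_cache_bytes(200_000, cache_type=cache_type, **example_dims)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

Whatever the real dimensions are, the ratio between the two cache types is fixed: q8_0 needs about 53% of the memory f16 does, which is why a quantized cache is a natural choice when a single long context has to fit next to the model weights.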

The answer was a fork of an existing sub-agent repository, adapted specifically for integration with the pi coding agent. This approach made it possible to use a model such as qwen3.6-35b-a3b behind a llama.cpp server, showing that advanced LLM functionality can be enabled even on less powerful hardware. The customization also shows how much local LLM deployment depends on flexible, open-source frameworks that users can adapt to their own constraints.
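A server matching the constraints described above (one slot, quantized KV cache) could be launched roughly as follows. This is a sketch under assumptions, not the user's actual command: the model path and context size are placeholders, q8_0 is assumed to be the cache type in use, and any options needed to enable Multi-Token Prediction are deliberately left out because the article does not name them. The flags shown are standard llama-server options.

```python
import shlex


def llama_server_args(model_path, ctx_size, port=8080):
    """Build an argument list for llama-server with one slot and a q8_0 KV cache."""
    return [
        "llama-server",
        "--model", model_path,        # path to a GGUF file (placeholder below)
        "--ctx-size", str(ctx_size),  # total context window in tokens
        "--parallel", "1",            # a single slot, so one KV cache in VRAM
        "--cache-type-k", "q8_0",     # assumed cache type for keys
        "--cache-type-v", "q8_0",     # assumed cache type for values
        "--port", str(port),
    ]


# Placeholder path and context size, for illustration only.
cmd = llama_server_args("models/model.gguf", ctx_size=180_000)
print(shlex.join(cmd))
# To actually start the server: subprocess.Popen(cmd)
```

Depending on the llama.cpp build, a quantized V cache may also require flash attention to be enabled, so it is worth checking the server's help output for the installed version.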

Performance and Operational Context for On-Premise Deployment

Despite the hardware limitations, the solution delivered notable performance. Using the Multi-Token Prediction (MTP) feature in the main llama.cpp branch and an APEX variant of the Qwen model (Qwen3.6-35B-A3B-APEX-MTP-GGUF), it was possible to manage a context of 175k to 200k tokens with 8-bit (q8) KV cache quantization. Reported throughput was 200 to 300 tokens per second for prompt processing (pp) and 25 to 40 tokens per second (tps) for generation, depending on the draft hit rate.
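These figures are easier to judge when converted into wall-clock time. The short calculation below assumes that the reported prompt-processing numbers are in tokens per second and that a context is processed from scratch, with no cached prefix; the 1,000-token reply length is an arbitrary example.

```python
def seconds_for(n_tokens, tokens_per_second):
    """Time needed to process n_tokens at a constant rate."""
    return n_tokens / tokens_per_second


PREFILL_TPS = (200, 300)    # reported prompt-processing range
GENERATION_TPS = (25, 40)   # reported generation range

for context in (175_000, 200_000):
    slowest = seconds_for(context, PREFILL_TPS[0])
    fastest = seconds_for(context, PREFILL_TPS[1])
    print(f"prefill of {context} tokens: "
          f"{fastest / 60:.1f} to {slowest / 60:.1f} minutes")

reply_tokens = 1_000  # arbitrary example length
print(f"{reply_tokens}-token reply: "
      f"{seconds_for(reply_tokens, GENERATION_TPS[1]):.0f} to "
      f"{seconds_for(reply_tokens, GENERATION_TPS[0]):.0f} seconds")
```

Under these assumptions, filling a full context from scratch takes roughly 10 to 17 minutes, which suggests why rebuilding the main context after each sub-agent call would be costly on a single-slot server.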

These figures are particularly relevant for organizations evaluating on-premise LLM deployment. They show that, with the right optimizations and a careful choice of model and framework, usable performance is achievable without high-end GPUs. For those evaluating on-premise deployment, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between initial CapEx, long-term TCO, data sovereignty, and performance requirements.

Future Prospects and Implications for Data Sovereignty

Further developments are planned for the project, including the ability to spawn sub-agents with no previous context and to save the main context to disk via the server's slots endpoint and the --slot-save-path parameter. Although the resulting .bin files can be quite large, this functionality would make sub-agents more flexible and efficient in resource-constrained environments.
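As a rough illustration of how that planned workflow might look, the sketch below builds requests for the llama.cpp server's slot actions (save, restore, erase). It is not code from the project. It assumes a server started with --slot-save-path, listening on 127.0.0.1:8080, with slot 0 holding the main context; the file name main-context.bin is a hypothetical example. The request is only constructed here, and sending it is left to the separate send helper.

```python
import json
import urllib.request


def slot_request(base_url, slot_id, action, filename=None):
    """Build (but do not send) a POST request for a llama.cpp slot action."""
    if action not in ("save", "restore", "erase"):
        raise ValueError(f"unsupported slot action: {action}")
    if action in ("save", "restore") and filename is None:
        raise ValueError(f"action '{action}' needs a filename")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = {} if filename is None else {"filename": filename}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send a prepared request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical host, slot id, and file name.
save_req = slot_request("http://127.0.0.1:8080", 0, "save", "main-context.bin")
print(save_req.get_method(), save_req.full_url)
```

One possible sequence would be to save the main context, erase the slot so a sub-agent starts with no previous context, and restore the saved file once the sub-agent has finished. The file is written relative to the directory given by --slot-save-path.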

This example underscores a growing trend: the search for solutions that let companies keep full control over their AI workloads. Self-hosted LLM deployment, even on modest hardware, offers advantages in privacy, security, and TCO, and reduces reliance on external cloud providers. For CTOs, DevOps leads, and infrastructure architects, understanding how to optimize LLM inference on local hardware is fundamental to building resilient and compliant AI infrastructure.