The Challenge of LLM Inference on Local Hardware

The adoption of Large Language Models (LLMs) in self-hosted and on-premise environments is a growing priority for many organizations, driven by the need for data sovereignty, process control, and an optimized Total Cost of Ownership (TCO). Deploying these models on local infrastructure, however, especially with limited hardware resources, presents significant challenges: choosing the right model and configuration becomes crucial to balancing performance, context capacity, and output quality.

A prime example of this challenge comes from the evaluation of two Qwen3.6 model variants for coding and agentic workloads on a single RTX 5080 GPU with 16GB of VRAM. The scenario highlights the complex decisions that CTOs, DevOps leads, and infrastructure architects face when designing local AI solutions, where every gigabyte of VRAM and every token per second matters.

Technical Deployment Details and Current Performance

The test environment in question is a local setup with an RTX 5080 GPU (16GB of VRAM) and 96GB of system RAM, running on Windows. Inference runs on the llama.cpp framework using the MTP branch, which supports offloading the experts of Mixture of Experts (MoE) models to the CPU. The model currently deployed is Qwen3.6-35B-A3B-MTP in Q8_0 GGUF format.
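To make the memory budget concrete, the sketch below estimates how a Q8_0 MoE checkpoint splits between VRAM and system RAM under CPU expert offload. The 90% expert share and the ~8.5 bits-per-weight figure for Q8_0 are assumptions for illustration, not the published Qwen3.6 architecture or exact GGUF file sizes.

```python
# Rough sketch of where a Q8_0 MoE checkpoint ends up under CPU expert offload.
# The 35B total / ~3B active split comes from the model name; the share of
# weights living in routed experts is an assumption, not the published
# Qwen3.6 architecture.
Q8_0_BITS_PER_WEIGHT = 8.5          # Q8_0 stores roughly 8.5 bits per weight incl. scales

def gib(params: float, bits: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """Approximate size of `params` weights in GiB at the given bit width."""
    return params * bits / 8 / 2**30

total_params    = 35e9              # Qwen3.6-35B-A3B: total parameters
expert_share    = 0.90              # assumption: ~90% of weights sit in routed experts
expert_params   = total_params * expert_share
resident_params = total_params - expert_params   # attention, embeddings, shared layers

print(f"full checkpoint  : {gib(total_params):5.1f} GiB")
print(f"experts -> RAM   : {gib(expert_params):5.1f} GiB (CPU expert offload)")
print(f"resident -> VRAM : {gib(resident_params):5.1f} GiB (+ KV cache and activations)")
```

Under these assumptions, only a few gigabytes of weights need to stay resident on the GPU, which is how a checkpoint of roughly 35 GiB can run at all on a 16GB card backed by 96GB of system RAM.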

Observed performance with this configuration, at an active context of approximately 118K tokens within a total context setting of 196K, shows a prefill speed of about 1178 tokens per second and a decode speed of about 32 tokens per second. For follow-up turns, with an active context between 118K and 143K tokens, decode speed stays between 32 and 37 tokens per second. The user is now testing the same A3B configuration with the context extended to 232K tokens, and wants to understand whether an alternative model, the Qwen3.6-27B dense MTP, could offer advantages.
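Converted into wall-clock time, those throughput figures look roughly like this (the response length is an illustrative assumption):

```python
# Convert the reported throughput into rough wall-clock time for one agent turn.
prefill_tps = 1178      # tokens/s prefill, reported at ~118K active context
decode_tps  = 32        # tokens/s decode, reported at the same depth

prompt_tokens   = 118_000   # cold prefill of the full working context
response_tokens = 1_000     # illustrative length of one coding-agent reply

prefill_s = prompt_tokens / prefill_tps
decode_s  = response_tokens / decode_tps

print(f"prefill ~{prefill_s/60:.1f} min, decode ~{decode_s:.0f} s "
      f"-> ~{(prefill_s + decode_s)/60:.1f} min for a cold turn")
```

Assuming the prompt prefix is cached between turns, follow-ups only pay prefill for the newly added tokens, so the 32 to 37 tokens per second decode rate is what dominates the interactive feel of an agent session.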

Trade-offs Between Dense and Mixture of Experts (MoE) Models

The decision between a dense model and an MoE model is at the core of this evaluation. Dense models, like the Qwen3.6-27B, activate all of their parameters for every token processed, potentially offering greater output consistency. MoE models, such as the Qwen3.6-35B-A3B, activate only a subset of "experts" (and thus parameters) per token, which can make inference far more efficient, especially when the expert tensors are offloaded to system RAM so that VRAM holds only the always-active layers and the KV cache.
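The reason active parameters matter so much for decode speed is that single-token generation is largely memory-bandwidth bound: every generated token has to stream the active weights at least once. The sketch below uses ballpark bandwidth figures, not measurements from this setup, to show the gap.

```python
# Decode is largely memory-bandwidth bound: every generated token must stream the
# active weights at least once. Bandwidth figures are ballpark, not measured values.
Q8_0_BYTES_PER_WEIGHT = 8.5 / 8

def rough_ms_per_token(active_params: float, bandwidth_gb_s: float) -> float:
    """Crude decode-latency estimate from weight streaming alone."""
    return active_params * Q8_0_BYTES_PER_WEIGHT / (bandwidth_gb_s * 1e9) * 1e3

# MoE: ~3B active parameters, experts streamed mostly from system RAM (~80 GB/s DDR5)
print(f"35B-A3B MoE, experts on CPU : ~{rough_ms_per_token(3e9, 80):3.0f} ms/token")

# Dense: all 27B parameters are active every token; if they fit in VRAM (~900 GB/s)
# the math looks fine, but at Q8_0 they do not fit, so part must stream from RAM.
print(f"27B dense, fully in VRAM    : ~{rough_ms_per_token(27e9, 900):3.0f} ms/token")
print(f"27B dense, streamed from RAM: ~{rough_ms_per_token(27e9, 80):3.0f} ms/token")
```

The observed 32 tokens per second is a little better than this crude MoE estimate because the always-active layers stay in VRAM; the point of the comparison is the dense case, where at Q8_0 the weights do not fit in 16GB of VRAM and every parameter that spills to system memory is paid for on every single token.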

Key questions revolve around whether the 27B dense model can outperform the 35B MoE on 16GB of VRAM, whether it offers a smoother experience at deep contexts, and whether its consistency is preferable to the MoE's active-parameter efficiency for prolonged use in coding-agent scenarios. An additional constraint is disk space: the 27B dense model requires about 30GB, while only 4GB of free disk space is available, which makes the choice even more critical.
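The disk constraint can be sanity-checked with approximate GGUF footprints; the bits-per-weight values below are rough whole-checkpoint averages, and the files actually published for Qwen3.6-27B may differ.

```python
# Approximate GGUF footprints for a 27B dense model at common quantization levels.
# Bits-per-weight values are rough whole-checkpoint averages, not exact file sizes.
APPROX_BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

params        = 27e9
vram_gib      = 16
free_disk_gib = 4

for quant, bpw in APPROX_BPW.items():
    size_gib = params * bpw / 8 / 2**30
    print(f"{quant:<7} ~{size_gib:5.1f} GiB | fits 16 GiB VRAM: {size_gib < vram_gib!s:<5} "
          f"| fits 4 GiB free disk: {size_gib < free_disk_gib}")
```

Even an aggressive quant leaves the dense 27B far above the 4GB of free disk space (and the weights alone would crowd out the KV cache in VRAM), so without freeing storage the question is largely settled before performance even enters the picture.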

Optimizing Local Infrastructure for LLMs

This scenario underscores the importance of careful infrastructure planning for on-premise LLM deployments. GPU VRAM capacity is often the primary bottleneck for large-model inference, but system RAM and storage also play a fundamental role. Choosing between model architectures (dense vs. MoE) and applying optimization techniques such as quantization (Q8_0 GGUF) and CPU offload are essential strategies for making the most of the available resources.
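Alongside the weights, the KV cache is the other VRAM consumer that capacity planning has to account for, and it grows linearly with context length. The layer and head counts in the sketch below are illustrative placeholders, not the actual Qwen3.6 configuration.

```python
# The KV cache grows linearly with context length and competes with the weights
# for VRAM. Layer/head counts are illustrative placeholders, not the actual
# Qwen3.6 architecture; llama.cpp can also quantize the KV cache to shrink it.
def kv_cache_gib(ctx_tokens: int, n_layers: int = 48, n_kv_heads: int = 4,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return ctx_tokens * per_token_bytes / 2**30

for ctx in (32_000, 118_000, 196_000, 232_000):
    fp16 = kv_cache_gib(ctx)                        # FP16 K/V
    q8   = kv_cache_gib(ctx, bytes_per_elem=1)      # ~8-bit quantized K/V
    print(f"{ctx:>7,} tokens -> ~{fp16:4.1f} GiB (FP16 KV) / ~{q8:4.1f} GiB (Q8 KV)")
```

At the context depths discussed here, the KV cache can rival the weight budget itself, which is why the target context length belongs in the capacity plan right next to model size and quantization level.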

For companies evaluating self-hosted alternatives to cloud solutions, understanding these trade-offs is vital. The ability to run complex workloads such as coding agents with extended contexts, while maintaining acceptable performance and respecting hardware constraints, is a decisive factor for deployment success. AI-RADAR offers analytical frameworks on /llm-onpremise to support decision-makers in evaluating these complex balances between costs, performance, and control.