DFlash Speculative Decoding on a VRAM-Limited GPU: A Case Study with Qwen3.5-35B

Running Large Language Models (LLMs) on local infrastructure is a significant challenge, especially on hardware with limited VRAM. Many organizations considering an on-premise deployment must balance performance against the acquisition cost of new, high-capacity GPUs. In this context, getting the most out of existing resources becomes crucial to maintaining data sovereignty and control over AI workloads.

A recent experiment demonstrated how these limitations can be overcome, achieving a remarkable increase in token-generation throughput. The test involved a 35-billion-parameter LLM, Qwen3.5-35B-A3B, running on an NVIDIA RTX 2080 SUPER GPU with only 8 GB of VRAM, a configuration typically considered insufficient for models of this size. The result was achieved by implementing DFlash speculative decoding within the llama.cpp framework, combined with offloading the model's Mixture-of-Experts (MoE) experts to the CPU.

Technical Details and Optimization Strategies

The cornerstone of this setup is DFlash speculative decoding, a technique that accelerates token generation by using a smaller, faster "draft" model to propose upcoming token sequences. The main model then validates these proposals in parallel: whenever the draft's tokens match what the main model would have produced, several tokens are accepted per expensive forward pass, raising throughput without changing the output. In this specific case, the main Qwen3.5-35B-A3B model was quantized to Q5_K_M, with a size of approximately 24.44 GiB, far beyond the 8 GB of available VRAM.
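To make the mechanism concrete, here is a minimal, self-contained Python sketch of the greedy verify-and-accept loop behind speculative decoding. It is an illustration under simplifying assumptions, not llama.cpp's implementation and not the DFlash draft architecture itself: the two "models" are toy functions, acceptance is exact greedy agreement, and the main model's parallel scoring is simulated position by position.

```python
# Toy sketch of greedy speculative decoding (illustration only; not the
# llama.cpp implementation and not the DFlash algorithm itself).
# Both "models" are plain functions mapping a context (list of ints) to
# the next token.

def target_model(ctx):
    # Stand-in for the large main model.
    return sum(ctx) % 101

def draft_model(ctx):
    # Stand-in for the small draft model: agrees with the target most of the time.
    return sum(ctx) % 101 if len(ctx) % 7 else (sum(ctx) + 1) % 101

def speculative_step(ctx, draft_max=6):
    # 1. The draft model proposes up to draft_max tokens autoregressively (cheap).
    proposal = []
    for _ in range(draft_max):
        proposal.append(draft_model(ctx + proposal))

    # 2. The main model scores the proposed positions "in parallel"
    #    (one forward pass in practice; simulated here token by token).
    accepted = []
    for tok in proposal:
        expected = target_model(ctx + accepted)
        if tok == expected:
            accepted.append(tok)       # draft agreed: token accepted for free
        else:
            accepted.append(expected)  # first disagreement: keep the main model's token, stop
            break
    else:
        # Every proposal was accepted: the main model still yields one bonus token.
        accepted.append(target_model(ctx + accepted))
    return accepted  # several tokens emitted for a single main-model pass

if __name__ == "__main__":
    context = [1, 2, 3]
    for _ in range(4):
        new_tokens = speculative_step(context)
        print(f"emitted {len(new_tokens)} tokens: {new_tokens}")
        context += new_tokens
```

The property that matters is visible in the output: each step emits several tokens for a single main-model pass whenever the draft agrees, which is exactly where the measured throughput gain comes from.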

To work around this limitation, a combination of techniques was used. Because the main model is an MoE, a portion of its expert weights could be offloaded to the CPU, freeing up valuable VRAM. The DFlash "draft" model, a Qwen3.5-35B-A3B-DFlash quantized to Q4_K_M, was far smaller, roughly 474 million parameters and only 267.8 MiB, so it could be loaded entirely onto the GPU. Optimization also required tuning llama.cpp parameters such as -ncmoe, which controls how many layers' MoE expert weights stay on the CPU, and --draft-max, which caps how many tokens the draft model proposes per step, to find the "sweet spots" that maximize efficiency; a hypothetical launch command illustrating these flags is sketched below.
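For reference, such a launch might look like the following. This is a hypothetical invocation, not the exact command from the experiment: the GGUF file names are placeholders, and the numeric values for -ngl, -ncmoe, and the context size are illustrative and would need tuning against the 8 GB VRAM budget.

```bash
# Hypothetical llama.cpp launch; file names and numeric values are placeholders,
# not the exact command used in the experiment.
#   -m          : main Qwen3.5-35B-A3B model, Q5_K_M (~24.44 GiB)
#   -md         : DFlash draft model, Q4_K_M (~267.8 MiB), small enough to sit fully in VRAM
#   -ngl        : offload all transformer layers to the GPU
#   -ncmoe      : keep the MoE expert weights of the first N layers on the CPU
#   --draft-max : cap on how many tokens the draft model proposes per step
./llama-server \
  -m  Qwen3.5-35B-A3B-Q5_K_M.gguf \
  -md Qwen3.5-35B-A3B-DFlash-Q4_K_M.gguf \
  -ngl 99 \
  -ncmoe 40 \
  --draft-max 6 \
  -c 8192
```

The underlying trade-off is that raising -ncmoe frees VRAM but shifts more expert computation to the CPU, while a larger --draft-max increases draft-model work and the chance that a long proposal is cut short by the first rejected token.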

Performance and Implications for On-Premise Deployment

The experimental results were significant. From a baseline of approximately 26.8 tokens/s without DFlash, speculative decoding raised generation speed to about 35.6-35.8 tokens/s, a gain of roughly 33-34%, which is substantial for inference workloads. The draft model's predictions were accepted 99.302% of the time with --draft-max set to 6, indicating that its proposals almost always matched the main model's output.

These results have direct implications for companies evaluating on-premise LLM deployments. They demonstrate that competitive performance can be achieved even with older hardware or limited VRAM, extending the lifecycle of existing infrastructure and reducing the Total Cost of Ownership (TCO). The ability to run complex models locally strengthens data sovereignty, compliance, and security, critical aspects for regulated industries or air-gapped environments. For those evaluating analytical frameworks for on-premise deployment, AI-RADAR offers in-depth resources and analysis at /llm-onpremise to better understand these trade-offs.

Future Outlook and Considerations

The success of this experiment underscores the value of innovation in open-source frameworks like llama.cpp and optimization techniques such as speculative decoding. The ability to run large LLMs on consumer or mid-range hardware opens new opportunities for developing and deploying AI applications at the edge or on limited budgets.

It is important to note that optimization requires careful calibration of parameters and a deep understanding of the interactions between the model, hardware, and framework. While the results are promising, research and development in this field are constantly evolving, with new quantization techniques and model architectures continually emerging. This case study offers a concrete example of how software engineering can unlock the potential of existing hardware, making AI more accessible and controllable.