On-Premise Optimization: DeepSeek-V4-Flash Pushes Workstation GPU Limits

In the rapidly evolving landscape of Large Language Models (LLMs), efficient inference on local hardware remains a critical challenge for companies that prioritize data sovereignty and control over their technology stacks. A recent study highlights what the DeepSeek-V4-Flash model can do: with targeted optimizations, it delivers strong performance on workstation graphics cards such as the NVIDIA RTX PRO 6000 Max-Q. The result underscores how much careful hardware and software configuration matters for maximizing throughput and minimizing latency in on-premise deployments.

The experiment showed that DeepSeek-V4-Flash, combined with MTP (Multi-Token Prediction) self-speculation and advanced quantization techniques, reaches up to 85.52 tokens/second with a 524k-token context in two-stream mode. In single-stream scenarios with a 128k-token context, performance rises to roughly 111 tokens/second. These figures represent gains of 62% and 110% respectively over the same configurations without MTP, underscoring the value of this architectural optimization.
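
To make the MTP setup concrete, the minimal sketch below shows how self-speculative decoding can be enabled through vLLM's offline API, assuming a recent build that exposes it via speculative_config. The model ID, the "deepseek_mtp" method string, and the context length are assumptions for illustration; the study itself used a patched vLLM build.

```python
# Minimal sketch: MTP self-speculation via vLLM's offline API (assumptions noted).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical model ID, not a published checkpoint
    tensor_parallel_size=2,                 # two RTX PRO 6000 Max-Q cards
    max_model_len=131_072,                  # 128k-token single-stream scenario from the write-up
    speculative_config={
        "method": "deepseek_mtp",           # draft tokens come from the model's own MTP head
        "num_speculative_tokens": 1,        # V4-Flash integrates a single MTP head
    },
)

outputs = llm.generate(
    ["Summarize the benefits of on-premise LLM serving."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```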

Technical Details and Deployment Challenges

The core of this performance lies in a combination of techniques. Quantization follows a hybrid approach: W4A16 INT4 (GPTQ) for the 768 "routed-expert tensors" and FP8_BLOCK for the five attention projections. This split reduces the memory footprint and improves computational efficiency while preserving model fidelity. Deployment relies on a patched build of the vLLM framework, which is required to handle DeepSeek-V4-Flash's specific quantization formats correctly.
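
The split can be pictured as a per-tensor scheme assignment. The plain-Python sketch below only illustrates that mapping; the module-name patterns are assumptions, since the exact tensor names in the patched checkpoint were not published.

```python
# Illustrative mapping of the hybrid quantization plan: INT4 (GPTQ, W4A16) for
# routed-expert weights, block-wise FP8 for the attention projections, and
# higher precision elsewhere. Module-name patterns are assumptions.
ATTN_PROJECTIONS = ("q_a_proj", "q_b_proj", "kv_a_proj", "kv_b_proj", "o_proj")

def select_scheme(module_name: str) -> str:
    """Return the quantization scheme for one weight tensor."""
    if ".mlp.experts." in module_name:                  # the 768 routed-expert tensors
        return "W4A16_GPTQ"                             # 4-bit weights, 16-bit activations
    if any(p in module_name for p in ATTN_PROJECTIONS):
        return "FP8_BLOCK"                              # block-wise FP8 attention projections
    return "BF16"                                       # leave remaining layers unquantized

# Routed experts land on INT4, attention projections on FP8:
assert select_scheme("model.layers.3.mlp.experts.17.down_proj") == "W4A16_GPTQ"
assert select_scheme("model.layers.3.self_attn.o_proj") == "FP8_BLOCK"
```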

A critical detail concerns the NVIDIA RTX PRO 6000 Max-Q cards themselves. Because their topology is PCIe-only, unlike server variants with NVLink, vLLM's CustomAllreduce had to be disabled (--disable-custom-all-reduce); otherwise, the use of CUDA P2P over a non-NVLink interconnect leads to deadlocks. Further tuning at the NCCL (NVIDIA Collective Communications Library) level cut the TTFT (Time To First Token) from 155 ms to roughly 91 ms without hurting decode throughput. The full model, with 671 billion total parameters and 32 billion active, fits comfortably on two GPUs with 96 GB of VRAM each.
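
A sketch of how these workarounds could look in the offline API follows, assuming the patched vLLM build. The disable_custom_all_reduce argument mirrors the --disable-custom-all-reduce flag; the exact NCCL variables tuned in the study were not published, so the one shown here is only an illustrative, commonly used knob for PCIe-only hosts, not the study's setting.

```python
# Sketch of the serving-side workarounds for a PCIe-only dual-GPU host (assumptions noted).
import os
from vllm import LLM

# Illustrative NCCL knob for PCIe-only topologies; an assumption, not the study's actual tuning.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical model ID
    tensor_parallel_size=2,                 # two 96 GB cards
    disable_custom_all_reduce=True,         # avoid CUDA P2P deadlocks without NVLink
)
```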

Implications for On-Premise AI Workloads

These results matter for CTOs, DevOps leads, and infrastructure architects weighing LLM deployment in on-premise or air-gapped environments. Being able to run a complex model like DeepSeek-V4-Flash on relatively accessible workstation hardware, with competitive performance, offers a viable alternative to cloud-based solutions. In-house operation gives granular control over data security, regulatory compliance, and long-term Total Cost of Ownership (TCO), factors that regulated industries and companies with strict data-sovereignty needs often prioritize.

However, the experience also shows that optimizing such deployments is not trivial: it demands deep knowledge of the hardware, the quantization techniques, and the serving framework. The need to patch vLLM and apply hardware-specific tuning underlines that on-premise deployments, while offering advantages in control and privacy, carry greater operational complexity. For those evaluating self-hosted deployments, AI-RADAR offers analytical frameworks on /llm-onpremise for assessing the trade-offs between performance, cost, and infrastructure requirements, providing neutral guidance without direct recommendations.

Future Prospects and Current Limitations

Despite the progress, the configuration has some limitations. Tensor Parallelism (TP) is currently limited to 2 GPUs: TP=1 runs out of memory (OOM) on a single RTX PRO 6000, while TP≥4 hits a known bug in vLLM's MoE scale-sharding. Furthermore, the number of speculative tokens (num_speculative_tokens) is capped at 1, because DeepSeek-V4-Flash integrates a single MTP head. These restrictions mark out areas for future work at both the model and framework levels.
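
For reference, the reported working envelope can be restated as a plain config dict. This is only a summary of the limitations above; the context lengths come from the reported 524k two-stream and 128k single-stream runs.

```python
# Reported working envelope, restated as a config sketch (values from the write-up).
KNOWN_GOOD = {
    "tensor_parallel_size": 2,    # TP=1 -> OOM on one 96 GB card; TP>=4 hits the vLLM MoE scale-sharding bug
    "num_speculative_tokens": 1,  # only a single MTP head is available for self-speculation
    "max_model_len": 524_288,     # ~524k tokens (two-stream); 131_072 for the single-stream runs
}
```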

The developer community is encouraged to contribute data on MTP draft-acceptance rates across different prompt distributions, providing valuable feedback for further tuning. The work demonstrates that, with the right engineering, LLMs can realize their full potential on local infrastructure, giving businesses the flexibility and control their AI strategies require.