On-Premise Optimization for LLMs: The DeepSeek V4 Pro Case

The implementation of Large Language Models (LLMs) in on-premise environments continues to represent both a challenge and an opportunity for companies seeking to maintain data control and optimize operational costs. A recent experiment tested the DeepSeek V4 Pro model in a self-hosted setup, demonstrating how software optimization and hardware selection can significantly impact performance.

The test utilized the ktransformers framework, which integrates sglang and kt-kernel, to run the model. This software stack was configured to maximize efficiency on the target hardware, a crucial consideration for local deployments where every watt and every clock cycle matters. The on-premise approach offers advantages in terms of data sovereignty and latency but requires careful infrastructure planning.

Technical Details and Benchmark Results

The hardware used for the experiment included an AMD Epyc 9374F CPU and an NVIDIA RTX PRO 6000 Max-Q GPU. The GPU exposed 97887 MiB of VRAM, of which approximately 90815 MiB were in use during the test, enough memory to run the DeepSeek V4 Pro model without quantization or conversion of the original weights, a factor that simplifies deployment.
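As a quick sanity check on these memory figures, the arithmetic below converts them to GiB and computes the remaining headroom (a minimal sketch; only the two VRAM values come from the test above, everything else is illustrative):

```python
# Quick arithmetic on the reported VRAM figures (values from the test above).
MIB_PER_GIB = 1024

total_vram_mib = 97887  # available VRAM reported on the RTX PRO 6000 Max-Q
used_vram_mib = 90815   # approximate VRAM in use during the run

headroom_mib = total_vram_mib - used_vram_mib
utilization = used_vram_mib / total_vram_mib

print(f"Total VRAM: {total_vram_mib / MIB_PER_GIB:.1f} GiB")
print(f"Used VRAM:  {used_vram_mib / MIB_PER_GIB:.1f} GiB")
print(f"Headroom:   {headroom_mib / MIB_PER_GIB:.1f} GiB ({1 - utilization:.1%} free)")
```

That leaves roughly 7 GiB of free VRAM, which in practice bounds how much additional KV cache or batch size the GPU can absorb before spilling.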

Performance was measured with the llama-benchy tool, evaluating throughput (tokens per second) at context depths from 0 to 32768 tokens. In the pp512 test, throughput ranged from 39.76 t/s at depth 0 to 46.18 t/s at depth 32768; in the tg32 test, it stayed around 7-8 t/s. These figures show that the model remains responsive even with extended contexts, a fundamental requirement for applications that must handle long input sequences.
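To put these throughput numbers in perspective, the rough estimate below derives an end-to-end latency for a hypothetical request, assuming throughput stays constant over the whole request (a simplification, since prompt-processing speed varies with context depth):

```python
# Rough latency estimate from the reported throughput figures.
# Assumes constant throughput over the whole request, which is a simplification.

def estimate_latency(prompt_tokens: int, output_tokens: int,
                     pp_tps: float, tg_tps: float) -> float:
    """Approximate end-to-end time in seconds for one request."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps

# Hypothetical example: a 4096-token prompt with a 512-token answer, using the
# measured pp512 throughput at depth 0 (39.76 t/s) and ~7.5 t/s for generation.
seconds = estimate_latency(4096, 512, pp_tps=39.76, tg_tps=7.5)
print(f"Estimated end-to-end latency: {seconds:.0f} s (~{seconds / 60:.1f} min)")
```

The request sizes here are arbitrary; the point is simply that both phases contribute meaningfully at these rates, so capacity planning has to budget for prompt processing as well as generation.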

Implications for On-Premise Deployments

Analyzing power consumption gives a more complete picture of the Total Cost of Ownership (TCO) of an on-premise infrastructure. During the tests, the GPU drew approximately 100 W during prompt processing (PP) and about 150 W during text generation (TG). Combined CPU and motherboard consumption was around 400 W, with approximately 907.5 GB of the 1152 GB of installed RAM in use. These figures are essential for evaluating long-term operational costs, including energy and cooling, which are often underestimated when compared with cloud solutions.
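Those power figures can be turned into a rough energy cost estimate. The sketch below assumes an always-on, generation-heavy workload and an illustrative electricity price; the price per kWh and the duty cycle are assumptions, not measurements from the test:

```python
# Back-of-the-envelope energy cost from the measured power draw.
# Electricity price and duty cycle are illustrative assumptions only.

gpu_watts_tg = 150    # measured GPU draw during text generation
cpu_mb_watts = 400    # measured combined CPU + motherboard draw
price_per_kwh = 0.25  # assumed electricity price (adjust to local rates)
hours_per_day = 24    # assumed always-on operation

total_watts = gpu_watts_tg + cpu_mb_watts
kwh_per_day = total_watts / 1000 * hours_per_day
cost_per_day = kwh_per_day * price_per_kwh

print(f"Total draw under load: {total_watts} W")
print(f"Energy per day:        {kwh_per_day:.1f} kWh")
print(f"Cost per day:          {cost_per_day:.2f} (at {price_per_kwh}/kWh)")
print(f"Cost per month:        {cost_per_day * 30:.2f}")
```

Under these assumptions the machine lands around 13 kWh per day before cooling overhead, which is exactly the kind of figure that belongs in any on-premise versus cloud comparison.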

The ability to run complex models like DeepSeek V4 Pro on dedicated hardware, without model modifications, underscores the maturity of on-premise LLM software stacks. This approach is particularly relevant for sectors with stringent compliance and data sovereignty requirements, where cloud solutions might not be suitable. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between performance, costs, and control.

Future Prospects for Local AI Infrastructure

The results of this experiment reinforce the idea that local infrastructure can offer competitive performance for LLM workloads, provided there is an investment in appropriate hardware and optimization of the software stack. The choice between on-premise and cloud deployment is never trivial and depends on a careful evaluation of each organization's specific needs, including budget constraints, security policies, and latency requirements.

The continuous evolution of frameworks like ktransformers and the emergence of increasingly efficient hardware for local inference open new possibilities for companies wishing to leverage the power of LLMs while maintaining full control over their infrastructure. These developments are crucial for democratizing access to advanced artificial intelligence and enabling innovations in contexts where data privacy and security are paramount.