DeepSeek V4 Pro Locally: The Feasibility of On-Premise Deployment

The ability to run Large Language Models (LLMs) directly on local infrastructure remains a critical point of interest for companies that prioritize data sovereignty and control over their entire artificial intelligence pipeline. A recent example showed the DeepSeek V4 Pro model, in its Q4_K_M quantization, running successfully on a high-end workstation, offering a snapshot of current on-premise LLM inference capabilities.

This type of self-hosted deployment offers CTOs and infrastructure architects a concrete alternative to cloud-based solutions, allowing granular control over the execution environment and sensitive data. The specific hardware configuration and the recorded performance metrics provide valuable reference points for anyone evaluating an investment in dedicated computational resources.

Technical Details of the Implementation

The deployment involved a workstation built around an AMD EPYC Genoa 9374F processor, equipped with 12 x 96 GB RAM modules for a total of 1152 GB of system memory. Inference ran on a single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU, providing 97247 MiB (approximately 95 GiB) of VRAM with a compute capability of 12.0.
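As a sanity check before committing to a deployment of this kind, the reported hardware figures can be verified programmatically. The snippet below is a minimal sketch, assuming PyTorch and psutil are installed; it is not part of the original setup, merely a way to reproduce the same readings on your own machine.

```python
import psutil
import torch

# Total system RAM (the workstation above reports 12 x 96 GB = 1152 GB).
ram_gb = psutil.virtual_memory().total / 1e9
print(f"System RAM: {ram_gb:.0f} GB")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # VRAM in MiB, matching the 97247 MiB figure reported for the RTX PRO 6000.
    vram_mib = props.total_memory / (1024 ** 2)
    print(f"GPU: {props.name}")
    print(f"VRAM: {vram_mib:.0f} MiB")
    # Compute capability (reported as 12.0 for Blackwell in this deployment).
    print(f"Compute capability: {props.major}.{props.minor}")
```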

For model conversion and execution, the setup used a CUDA-enabled repository based on antirez's work, modified by LegacyRemaster to support Q4_K_M conversion. The DeepSeek V4 Pro model, with a reported file size of 859 GB, achieved 12.2 tokens per second in prompt processing and 8.6 tokens per second in response generation. These figures offer a tangible reference for the throughput achievable locally with this configuration.
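To translate those throughput numbers into wall-clock expectations, a simple calculation suffices. The helper below is a back-of-the-envelope sketch using the figures reported above; the prompt and response lengths in the example are hypothetical.

```python
def estimated_latency_s(prompt_tokens: int, output_tokens: int,
                        pp_tps: float = 12.2, gen_tps: float = 8.6) -> float:
    """Rough wall-clock estimate: prompt processing time plus generation time.

    pp_tps and gen_tps default to the throughput measured in this deployment.
    """
    return prompt_tokens / pp_tps + output_tokens / gen_tps

# Hypothetical workload: a 2,000-token prompt with a 500-token answer.
total = estimated_latency_s(2000, 500)
print(f"Estimated end-to-end latency: {total / 60:.1f} minutes")
```

At these rates, long prompts dominate the total latency, which is worth factoring into workload planning for batch or interactive use.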

Implications for On-Premise LLM Deployments

Running complex LLMs like DeepSeek V4 Pro on local hardware underscores the growing maturity of optimization tools and techniques, such as quantization, which make inference feasible even outside large cloud data centers. For organizations with stringent compliance or security requirements, or for air-gapped environments, on-premise deployment becomes not just an option but often a strategic necessity.
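To see why quantization is the enabling factor here, consider the memory arithmetic. The sketch below works backward from the reported 859 GB Q4_K_M file size using an approximate 4.85 bits per weight for that format; both the bits-per-weight constant and the implied parameter count are assumptions for illustration, not confirmed specifications of DeepSeek V4 Pro.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Working backward from the reported 859 GB Q4_K_M file and an assumed
# ~4.85 bits per weight for that format. Illustrative, not confirmed specs.
n_params = 859e9 * 8 / 4.85   # ~1.42e12 implied parameters

print(f"Implied parameters: {n_params / 1e12:.2f}T")
print(f"FP16 footprint:     {model_size_gb(n_params, 16.0):,.0f} GB")
print(f"Q4_K_M footprint:   {model_size_gb(n_params, 4.85):,.0f} GB")
```

Under these assumptions, the unquantized FP16 weights would not fit even in 1152 GB of system memory, while the Q4_K_M version does; that gap is precisely what makes this workstation deployment possible.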

Choosing a self-hosted infrastructure involves a careful evaluation of the Total Cost of Ownership (TCO), which includes not only the initial hardware cost (CapEx) but also operational expenses (OpEx) for power, cooling, and maintenance. However, this expenditure can be balanced by long-term benefits in terms of data control, reduced latency, and the elimination of recurring costs associated with cloud services.
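A first-order way to frame that trade-off is a break-even calculation between up-front hardware spend and recurring cloud fees. The sketch below uses entirely hypothetical cost figures; substitute your own quotes for hardware, energy, and cloud instances.

```python
def breakeven_months(capex: float, onprem_opex_monthly: float,
                     cloud_cost_monthly: float) -> float:
    """Months until cumulative cloud spend exceeds on-prem CapEx plus OpEx.

    If monthly cloud costs do not exceed on-prem OpEx, on-prem never
    breaks even and the function returns infinity.
    """
    monthly_savings = cloud_cost_monthly - onprem_opex_monthly
    if monthly_savings <= 0:
        return float("inf")
    return capex / monthly_savings

# Hypothetical figures for illustration only.
months = breakeven_months(
    capex=60_000,               # workstation purchase (CapEx)
    onprem_opex_monthly=800,    # power, cooling, maintenance
    cloud_cost_monthly=4_000,   # comparable dedicated GPU instance
)
print(f"Break-even after {months:.0f} months")
```

For stable, predictable workloads this kind of calculation often favors on-premise hardware over a multi-year horizon; for bursty or experimental workloads, the cloud's elasticity usually wins.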

Outlook and Trade-offs in the AI Landscape

This use case demonstrates that LLM inference capabilities are no longer the exclusive domain of cloud providers. Companies can now build and manage their own AI infrastructures, tailoring them to specific operational and security needs. The availability of GPUs with large amounts of VRAM, such as the RTX PRO 6000 Max-Q, is crucial for hosting large models and managing extended context windows.
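The link between VRAM and context length comes down to the KV cache, whose size grows linearly with the number of tokens held in context. The estimate below applies the standard KV-cache formula for a conventional transformer with hypothetical architecture parameters; DeepSeek's actual attention scheme compresses the cache considerably, so treat this as an upper-bound sketch rather than a figure for this specific model.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size for a conventional transformer: keys plus values,
    one entry per layer, per KV head, per token (FP16 by default)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_elem) / 1e9

# Hypothetical architecture parameters for illustration only.
for ctx in (8_192, 32_768, 131_072):
    gb = kv_cache_gb(n_layers=60, n_kv_heads=8, head_dim=128,
                     context_tokens=ctx)
    print(f"{ctx:>7} tokens -> {gb:5.1f} GB of KV cache (FP16)")
```

Even with aggressive cache compression, extended context windows claim a meaningful share of VRAM on top of the model weights, which is why high-VRAM cards matter for this class of deployment.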

The decision between an on-premise deployment and a cloud-based solution remains a matter of trade-offs. While the cloud offers immediate scalability and an OpEx model, local solutions provide greater control, potential for long-term TCO reduction for stable workloads, and full data sovereignty. AI-RADAR continues to explore these scenarios, providing analytical frameworks to help decision-makers evaluate the most suitable options for their AI strategies.