NVIDIA RTX PRO 6000 Blackwell: MiniMax-M2.7 NVFP4 Benchmarks on Dual-GPU Setup

LLM Performance On-Premise: An Analysis with NVIDIA RTX PRO 6000 Blackwell

Evaluating the performance of Large Language Models (LLMs) in on-premise environments is a strategic priority for many companies, especially those needing to maintain full control over data and operational costs. In this context, a recent benchmark published by the community has scrutinized the capabilities of a specific hardware configuration, providing valuable data for those planning AI infrastructure.

The analysis focused on the MiniMax-M2.7 model, in its NVFP4 quantized version, running on a system equipped with two NVIDIA RTX PRO 6000 Blackwell graphics cards. These tests offer a concrete view of the trade-offs between throughput, latency, and context management, all of which matter for informed deployment decisions. For technical decision-makers, understanding these dynamics is essential for optimizing Total Cost of Ownership (TCO) and ensuring data sovereignty.

Configuration Details and Software Stack

The system used for the benchmark was built for intensive LLM inference workloads. At its core is an ASRock Rack B650D4U-2L2T motherboard, paired with an AMD EPYC 4564P CPU and 128 GB of DDR5 ECC RAM, a choice that emphasizes the stability and reliability typical of server environments.

The main components are the two NVIDIA RTX PRO 6000 Blackwell GPUs, each with 96 GB of VRAM and a 600 W power rating. The cards are connected through a C-Payne PM50100 PLX Gen5 switch with PIX topology, meaning GPU-to-GPU traffic crosses at most a single PCIe bridge, a detail that highlights the importance of high-speed interconnects in multi-GPU setups. On the software side, the benchmark used SGLang, run in a Docker container (voipmonitor/sglang:cu130, b12x 0.8.3), with modelopt_fp4 for quantization, a bf16 KV cache, and tensor parallelism across the two GPUs (TP=2).

Performance Analysis: Decode and Prefill

The benchmark results cover two areas: decode throughput and prefill performance. For decode with an empty initial context (ctx=0), the system recorded 127.7 tokens per second for a single request (C=1). At 128 concurrent requests (C=128), aggregate throughput reached a peak of 2800.2 tokens per second, roughly 22 times the single-request figure, while per-request throughput settled at 21.9 tokens per second, about 17% of the C=1 rate. The platform therefore scales well across simultaneous requests, at the cost of lower speed for each individual request.
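The relationship between aggregate and per-request throughput can be checked with a few lines of Python. This is a minimal sketch that uses only the two decode figures quoted above and assumes the load is shared evenly across requests; the function and variable names are illustrative and do not come from the benchmark itself.

```python
# Decode figures reported in the benchmark (ctx=0):
# concurrency -> aggregate tokens per second.
decode_results = {1: 127.7, 128: 2800.2}


def per_request_throughput(aggregate_tps: float, concurrency: int) -> float:
    """Average tokens/s per request, assuming an even split of the aggregate."""
    return aggregate_tps / concurrency


def scaling_summary(results: dict, baseline_c: int = 1) -> dict:
    """Compare each concurrency level against the single-request baseline."""
    baseline_agg = results[baseline_c]
    baseline_per_req = per_request_throughput(baseline_agg, baseline_c)
    summary = {}
    for concurrency, aggregate in results.items():
        per_req = per_request_throughput(aggregate, concurrency)
        summary[concurrency] = {
            "per_request_tps": per_req,
            # How much more total work the system does than at the baseline.
            "aggregate_speedup": aggregate / baseline_agg,
            # Fraction of baseline speed that each request still gets.
            "per_request_retained": per_req / baseline_per_req,
        }
    return summary


if __name__ == "__main__":
    for concurrency, row in scaling_summary(decode_results).items():
        print(
            f"C={concurrency}: {row['per_request_tps']:.1f} tok/s per request, "
            f"{row['aggregate_speedup']:.1f}x aggregate, "
            f"{row['per_request_retained']:.1%} of single-request speed"
        )
```

Dividing the C=128 aggregate by 128 reproduces the reported per-request figure to within rounding, which suggests the published per-request number is a simple average rather than a separately measured value.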

Prefill performance, measured with a single request (C=1), is reported as time to first token (TTFT) and prefill throughput at various context lengths. For an 8K-token context, the TTFT was 0.50 seconds with a throughput of 17,286 tokens per second. At 128K tokens, the TTFT rose to 13.25 seconds and throughput dropped to 9,908 tokens per second. The benchmark did not use speculative decoding, because a NEXTN drafter for the M2.7 model is not yet available, which suggests room for future improvement. Furthermore, the KV pool is limited to approximately 83K tokens with the bf16 KV and TP=2 configuration, which leads to context "skipping" when long contexts are combined with high concurrency and makes 128K-token contexts manageable only with a single request.
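The reported TTFT and prefill throughput are closely linked: for a single request, TTFT is roughly the context length divided by the prefill rate. The sketch below checks this against the two data points above. It assumes that "8K" and "128K" mean 8,192 and 131,072 tokens, which the benchmark does not state explicitly, and the names are illustrative.

```python
# (context tokens, reported TTFT in seconds, reported prefill tokens/s)
# Assumption: "8K" = 8 * 1024 and "128K" = 128 * 1024 tokens.
prefill_results = [
    (8 * 1024, 0.50, 17_286),
    (128 * 1024, 13.25, 9_908),
]


def estimated_ttft(context_tokens: int, prefill_tps: float) -> float:
    """Approximate time to first token as context length / prefill rate."""
    return context_tokens / prefill_tps


if __name__ == "__main__":
    for context_tokens, reported_ttft, prefill_tps in prefill_results:
        estimate = estimated_ttft(context_tokens, prefill_tps)
        print(
            f"{context_tokens:>7} tokens: estimated {estimate:.2f} s, "
            f"reported {reported_ttft:.2f} s"
        )
```

Both estimates land within a few hundredths of a second of the reported values, so either column can be used to approximate the other when sizing for a different context length.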

Implications for On-Premise Deployments

These benchmarks provide concrete and measurable data, essential for IT professionals who must make strategic decisions regarding LLM deployments. The ability to achieve high throughput with high-end hardware, such as the RTX PRO 6000 Blackwell, in a self-hosted environment strengthens the case for on-premise solutions for specific workloads. This is particularly true for organizations that prioritize data sovereignty and regulatory compliance, or that operate in air-gapped environments.

TCO analysis, which includes initial hardware investment and operational costs (power, cooling), becomes more precise thanks to benchmarks like this. While performance is promising, the highlighted limitations, such as those related to the KV pool and the absence of speculative decoding, underscore the importance of careful planning and software optimization. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between different architectures and configurations, helping to balance performance, costs, and control requirements.
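As a rough illustration of how these figures feed into a TCO estimate, the sketch below converts decode throughput into GPU energy per million generated tokens. It rests on several assumptions that are not part of the benchmark: both GPUs draw their full 600 W rating throughout, CPU, cooling, and idle power are ignored, and the electricity price is a hypothetical placeholder. Measured power draw would give a more reliable result.

```python
GPU_COUNT = 2
GPU_RATED_WATTS = 600            # per-GPU rating quoted in the article
PEAK_AGGREGATE_TPS = 2800.2      # decode, C=128, ctx=0
SINGLE_REQUEST_TPS = 127.7       # decode, C=1, ctx=0
HYPOTHETICAL_PRICE_PER_KWH = 0.25  # placeholder value, not from the benchmark

JOULES_PER_KWH = 3_600_000


def gpu_energy_kwh_per_million_tokens(
    gpu_count: int, watts_per_gpu: float, tokens_per_second: float
) -> float:
    """GPU-only energy for one million tokens, assuming full rated power."""
    total_watts = gpu_count * watts_per_gpu
    joules_per_token = total_watts / tokens_per_second
    return joules_per_token * 1_000_000 / JOULES_PER_KWH


def electricity_cost_per_million_tokens(
    kwh_per_million: float, price_per_kwh: float
) -> float:
    """Electricity cost for one million tokens at the given price."""
    return kwh_per_million * price_per_kwh


if __name__ == "__main__":
    for label, tps in [("C=128", PEAK_AGGREGATE_TPS), ("C=1", SINGLE_REQUEST_TPS)]:
        kwh = gpu_energy_kwh_per_million_tokens(GPU_COUNT, GPU_RATED_WATTS, tps)
        cost = electricity_cost_per_million_tokens(kwh, HYPOTHETICAL_PRICE_PER_KWH)
        print(f"{label}: {kwh:.3f} kWh and {cost:.3f} currency units per 1M tokens")
```

Under these assumptions, the energy per token at C=1 is about 22 times higher than at C=128, which mirrors the aggregate throughput gap and shows why utilization matters as much as raw hardware speed in an on-premise cost model.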