A team conducted benchmarks on a workstation equipped with two NVIDIA RTX PRO 6000 Max-Q cards (192GB VRAM total) and 1.15TB of DDR5 RAM, aiming to evaluate performance in multi-user scenarios and with large contexts.

Hardware Configuration

  • 2x NVIDIA RTX PRO 6000 Max-Q (192GB VRAM total)
  • AMD EPYC 9645 96-core/192-thread
  • 12x DDR5 ECC RDIMM 96GB 5600 MT/s (1152GB total)

Models Tested

  • MiniMax-M2.1 (Native FP8)
  • MiniMax-M2.1-BF16-INT4-AWQ (INT4 quantized)

The analysis compared two approaches: FP8 precision with CPU offloading and INT4 quantized weights running entirely on VRAM. The goal was to determine if a single workstation could serve a team of 4-50 people or manage multiple agents simultaneously.
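A rough weight-footprint calculation shows why the two configurations differ. The sketch below is illustrative only: the parameter count and the effective bits-per-weight for AWQ (weights plus scales/zeros) are assumptions, not published figures for MiniMax-M2.1.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model weight footprint in GB for a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical parameter count for illustration, not the model's confirmed size.
n = 230e9

fp8_gb = weight_memory_gb(n, 8)      # ~230 GB: exceeds 192 GB VRAM, hence CPU offload
int4_gb = weight_memory_gb(n, 4.25)  # ~122 GB: fits entirely on the two GPUs
```

Under these assumptions the FP8 weights alone overflow the 192GB of VRAM, while the INT4 weights leave headroom for KV-cache, which matches the two serving strategies the team compared.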

Results

Results indicate that the INT4 quantized model, running entirely on GPU, delivers higher throughput than the FP8 model, which splits work between CPU and GPU. However, INT4 is limited to approximately 3 concurrent requests because the quantized weights leave little VRAM headroom for KV-cache. FP8, by contrast, scales better with large contexts, although it remains slower in end-to-end processing.
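The KV-cache ceiling can be estimated with a standard back-of-envelope formula (K and V tensors per layer, per token). All the concrete numbers below (layer count, KV heads, head dimension, free VRAM after weights, context length) are hypothetical placeholders, not MiniMax-M2.1's actual architecture:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache per token: K and V, each n_kv_heads * head_dim wide, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_concurrent(free_vram_gb: float, context_len: int,
                   per_token_bytes: int) -> int:
    """How many full-context requests fit in the VRAM left over after weights."""
    return int(free_vram_gb * 1e9 // (context_len * per_token_bytes))

# Hypothetical config: 62 layers, 8 KV heads, head_dim 128, BF16 cache.
per_tok = kv_cache_bytes_per_token(62, 8, 128)   # ~0.25 MB per token
# ~70 GB free after INT4 weights, 128k-token context per request.
slots = max_concurrent(70, 131072, per_tok)      # only a couple of full-context slots
```

With numbers in this ballpark, each 128k-context request consumes tens of gigabytes of KV-cache, which is consistent with the low-single-digit concurrency limit the team observed for the INT4 configuration.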