MiniMax M3 on-premise: 19 tokens/s on 8 MI50s, but agents remain out of reach

A Reddit user put the MiniMax M3 LLM through its paces on a rack of AMD MI50 GPUs, 2018 datacenter cards each with 16 GB of VRAM. The results show that with aggressive 4-bit quantization and the right tuning, you can extract 19.2 tokens per second of generation on 8 GPUs, but the overall experience is far from what’s needed for agentic code development. AI-RADAR reads between the lines of the benchmark and draws lessons for anyone evaluating on-premise LLM deployments on previous-generation datacenter hardware.

The test: 8 vs. 16 MI50s running MiniMax M3

The setup uses a vLLM fork optimized for the gfx906 architecture of the MI50s, running on ROCm 7.2.1 inside a Docker container. Two main configurations were tested:

8 MI50 GPUs with the MiniMax-M3-4bit-W4A16-v0 quantized model and EAGLE3 speculative decoding with 3 draft tokens. With a 1,000-token prompt, it reaches 19.2 tok/s in generation and 1,005 tok/s in prompt processing (PP). Without speculative (multi-token) prediction, generation drops to 11.9 tok/s. The maximum observed context is 36,597 tokens.
16 MI50 GPUs using AWQ-INT4 quantization (MiniMax-M3-AWQ-INT4) and 5 draft tokens. Here generation hits 18.2 tok/s on a 16,000-token prompt, but PP plummets to 135 tok/s, revealing a severe prompt-processing (prefill) bottleneck. The maximum context window expands to about 143,000 tokens.
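
To put the two configurations side by side, the reported rates can be turned into wall-clock prefill times. The short Python sketch below does only that arithmetic on the figures quoted above; the function and variable names are ours, not the tester's, and the two runs use different prompt lengths and quantizations, so the comparison is indicative rather than like-for-like.

```python
def prefill_seconds(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Time to ingest a prompt at a given prompt-processing (PP) rate."""
    return prompt_tokens / pp_tok_per_s


# Figures as reported in the benchmark (prompt length, PP rate in tok/s).
configs = {
    "8x MI50": {"prompt_tokens": 1_000, "pp": 1005.0},
    "16x MI50": {"prompt_tokens": 16_000, "pp": 135.0},
}

prefill = {
    name: prefill_seconds(c["prompt_tokens"], c["pp"])
    for name, c in configs.items()
}

# Generation speedup from speculative decoding on the 8-GPU run.
spec_speedup = 19.2 / 11.9

for name, secs in prefill.items():
    print(f"{name}: about {secs:.1f} s before the first token")
print(f"Speculative decoding speedup (8 GPUs): about {spec_speedup:.2f}x")
```

On these numbers the 8-GPU run ingests its 1,000-token prompt in about one second, while the 16-GPU run needs roughly two minutes for 16,000 tokens.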

The synthetic benchmark with 2 requests (10,000 input tokens, 1,000 output) yields the most telling figure: the average time to first token (TTFT) is 73.6 seconds, with an overall throughput of just 7.15 tok/s. The speculative decoding acceptance rate is 50%, which suggests the draft model is reasonably well aligned with the target model, but speculation only accelerates generation and cannot offset the prompt-processing latency.
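
Two quick checks help interpret these figures. The sketch below first derives the prefill rate implied by the TTFT, assuming the wait is dominated by ingesting a single 10,000-token prompt. It then applies the simplified textbook model of speculative decoding, in which each draft token is accepted independently with the same probability, to estimate how many tokens one verification step yields. Both assumptions are ours, and the acceptance rate reported by the benchmark may be computed differently.

```python
def effective_prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Prefill rate implied by a TTFT, assuming prefill dominates the wait."""
    return prompt_tokens / ttft_s


def expected_tokens_per_step(acceptance: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    Simplified model: each draft token is accepted independently with
    probability `acceptance`; the target model always contributes one token.
    """
    if acceptance >= 1.0:
        return float(draft_tokens + 1)
    return (1.0 - acceptance ** (draft_tokens + 1)) / (1.0 - acceptance)


rate = effective_prefill_rate(10_000, 73.6)
tokens_k3 = expected_tokens_per_step(0.5, 3)
tokens_k5 = expected_tokens_per_step(0.5, 5)

print(f"Implied prefill rate: about {rate:.0f} tok/s")
print(f"Expected tokens per step with 3 draft tokens: {tokens_k3:.3f}")
print(f"Expected tokens per step with 5 draft tokens: {tokens_k5:.3f}")
```

The implied rate of roughly 136 tok/s lines up with the 135 tok/s PP figure from the 16-GPU run. Under this model, going from 3 to 5 draft tokens at a 50% acceptance rate adds very little, and the estimate ignores the cost of running the draft model itself.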

Latency, context and speculative decoding: the bottlenecks

A TTFT of over a minute on a 10,000-token prompt rules out any interactive application. The reason lies in the hardware: the MI50s, despite HBM2 memory and 1 TB/s of bandwidth, are hampered by a PCIe 3.0 bus and a compute architecture (gfx906) that no longer receives aggressive optimizations in modern inference engines. The user notes that a lower-latency PCIe switch, a lower-overhead implementation of multi-token prediction on ROCm, and faster FP16 dequantization could all improve performance. Equally striking is the extremely long reasoning output: many responses produce lengthy chains of thought before even attempting a code solution, stretching perceived wait times further.
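
The effect of long reasoning on perceived latency can be approximated by adding the time spent generating the chain of thought to the TTFT. The sketch below is a rough estimate only: it combines the 73.6-second TTFT from the synthetic benchmark with the 18.2 tok/s generation rate of the 16-GPU run, which may not come from the same configuration, and the reasoning lengths are hypothetical values chosen for illustration.

```python
def seconds_until_answer(
    ttft_s: float, reasoning_tokens: int, decode_tok_per_s: float
) -> float:
    """Wait until the model starts its actual answer: TTFT plus reasoning."""
    return ttft_s + reasoning_tokens / decode_tok_per_s


TTFT_S = 73.6             # reported average TTFT (10,000-token prompt)
DECODE_TOK_PER_S = 18.2   # reported generation rate on 16 GPUs

# Hypothetical chain-of-thought lengths, not measured in the benchmark.
for reasoning_tokens in (500, 2_000, 5_000):
    wait = seconds_until_answer(TTFT_S, reasoning_tokens, DECODE_TOK_PER_S)
    print(f"{reasoning_tokens:>5} reasoning tokens: about {wait:.0f} s")
```

Under these assumptions, a 2,000-token chain of thought pushes the wait for the first line of code to around three minutes per turn, and an agent typically chains many such turns.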

Old hardware for on-premise LLMs: why it might make sense

For an organization that already owns these GPUs or can pick them up cheaply on the second-hand market, an on-premise deployment preserves data sovereignty and avoids recurring cloud inference costs. A single MI50 has a 300 W TDP, so a 16-card node can draw about 4.8 kW for the GPUs alone: a non-trivial operating expense, but often lower than continuous cloud API usage for heavy workloads. The trade-off is clear: older datacenter hardware can handle 4-bit quantization and models in the tens-of-billions parameter range, but with latency and throughput poorly suited to complex interactive tasks like agentic coding. For batch processing, summarization, or document analysis where response time is not critical, the observed performance could be adequate.
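
The power figure can be turned into an energy cost per generated token. The sketch below is a back-of-the-envelope estimate, assuming the GPUs run at full TDP for the whole time and serve a single stream at the reported generation rates; the electricity price is a placeholder to replace with a local tariff, and host power, cooling, and prompt processing are not included.

```python
GPU_TDP_W = 300  # MI50 TDP as cited in the article


def node_power_kw(num_gpus: int, tdp_w: float = GPU_TDP_W) -> float:
    """GPU-only power draw of a node, assuming every card runs at TDP."""
    return num_gpus * tdp_w / 1000.0


def kwh_per_million_tokens(power_kw: float, tok_per_s: float) -> float:
    """Energy needed to generate one million tokens at a steady rate."""
    hours = 1_000_000 / tok_per_s / 3600.0
    return power_kw * hours


PRICE_PER_KWH = 0.20  # hypothetical tariff, not from the benchmark

for gpus, tok_per_s in ((8, 19.2), (16, 18.2)):
    power = node_power_kw(gpus)
    energy = kwh_per_million_tokens(power, tok_per_s)
    cost = energy * PRICE_PER_KWH
    print(
        f"{gpus} GPUs: {power:.1f} kW, about {energy:.0f} kWh "
        f"(cost {cost:.2f}) per million generated tokens"
    )
```

At single-stream rates the 16-GPU node needs roughly twice the energy per token of the 8-GPU node. Serving several requests in parallel would lower the per-token figure, which is one reason batch workloads suit this hardware better.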

Beyond the test: room for optimization

The MiniMax M3 on MI50s case shows there is headroom left: more mature ROCm software for gfx906, an optimized PCIe topology, and refined dequantization could squeeze out a few more tokens per second. Yet the gap compared with solutions based on newer GPUs (e.g., NVIDIA A100 or AMD MI200) remains wide, especially in the ability to process long prompts at acceptable speed. For those planning on-premise inference infrastructure, this test is a reminder that datacenter cards that are several years old can still serve a purpose, but only after a careful assessment of workloads and trade-offs, particularly when context window size and responsiveness are decisive factors.