High-Efficiency LLM Inference with AMD MI50

A hardware configuration built around eight AMD MI50 GPUs with 32GB of VRAM each (256GB total) makes local large language model (LLM) inference practical at an excellent performance-to-cost ratio.

Tests performed with the vllm-gfx906 library show impressive results:

  • MiniMax-M2.1 (AWQ 4bit): 26.8 tok/s output, 3000 tok/s input (with a 30,000 token context) and a maximum context length of 196,608 tokens.
  • GLM 4.7 (AWQ 4bit): 15.6 tok/s output, 3000 tok/s input (with a 30,000 token context) and a maximum context length of 95,000 tokens.

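These throughput figures translate directly into per-request latency. The following sketch estimates total request time from the MiniMax-M2.1 numbers above; the 2,000-token response length is an assumption chosen for illustration.

```python
# Back-of-the-envelope latency estimate from the benchmark figures above.
# Prefill (3000 tok/s) and decode (26.8 tok/s) rates are the reported
# MiniMax-M2.1 numbers; the 2,000-token output length is an assumption.

def request_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough request time: prompt processing plus token-by-token generation."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# A 30,000-token prompt followed by 2,000 generated tokens:
seconds = request_latency(30_000, 2_000, prefill_tps=3000, decode_tps=26.8)
print(f"{seconds:.1f} s")  # ~10 s of prefill plus ~75 s of decode
```

The takeaway is that with long contexts, prefill is quick relative to decode: generation speed dominates interactive use.
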
The estimated cost for the eight GPUs is $880 (at prices expected in early 2025), with a power draw of 280W at idle and 1200W under inference load.
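
The power draw and decode rate together imply a running cost per token. This sketch computes the electricity cost per million generated output tokens; the $0.15/kWh electricity price is an assumption, not a figure from the source.

```python
# Electricity cost per million generated tokens, from the 1200 W inference
# draw and the 26.8 tok/s decode rate reported above.
# The $0.15/kWh electricity price is an assumed value for illustration.

def energy_cost_per_mtok(watts, tokens_per_s, usd_per_kwh):
    hours_per_mtok = 1_000_000 / tokens_per_s / 3600  # hours to emit 1M tokens
    return watts / 1000 * hours_per_mtok * usd_per_kwh  # kW * h * $/kWh

cost = energy_cost_per_mtok(1200, 26.8, usd_per_kwh=0.15)
print(f"${cost:.2f} per million output tokens")  # prints "$1.87 ..."
```

At these rates, electricity costs a couple of dollars per million output tokens, which is small next to the $880 hardware outlay until usage is very heavy.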

The project's goal is to provide a cost-effective solution for local inference, leveraging the computing power of AMD GPUs and the efficiency of the vllm-gfx906 library. Full setup details are available on GitHub.

The Landscape of LLM Inference

Large language model inference is a rapidly evolving field with growing demand for efficient, accessible solutions. GPUs remain the most common accelerators for this workload, and software optimization, as demonstrated by vllm-gfx906, plays a crucial role in extracting maximum performance from older hardware.