## High-Efficiency LLM Inference with AMD MI50

A new hardware configuration built around eight AMD MI50 GPUs with 32GB of VRAM each offers a remarkably cheap path to local large language model (LLM) inference, with an excellent performance-to-cost ratio. Tests performed with the vllm-gfx906 library show impressive results:

* **MiniMax-M2.1** (AWQ 4-bit): 26.8 tok/s output, 3,000 tok/s input (at a 30,000-token context), with a maximum context length of 196,608 tokens.
* **GLM 4.7** (AWQ 4-bit): 15.6 tok/s output, 3,000 tok/s input (at a 30,000-token context), with a context length of 95,000 tokens.

The estimated cost of the GPUs is $880 (at expected early-2025 prices), while power draw is 280W idle and 1,200W during inference. The project's goal is to provide a cost-effective solution for local inference, leveraging the compute power of AMD GPUs and the efficiency of the vllm-gfx906 library; a minimal launch sketch appears at the end of this post. Full setup details are available on GitHub.

## The Landscape of LLM Inference

Large language model inference is a rapidly evolving field, with growing demand for efficient and accessible solutions. GPUs are among the most popular options for accelerating inference, and software optimization, as demonstrated by vllm-gfx906, plays a crucial role in maximizing performance.
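For readers who want to try a similar setup, the sketch below shows how such a model might be served across all eight cards with vLLM's offline Python API, which the vllm-gfx906 fork tracks. This is a minimal sketch under stated assumptions: the model id is a hypothetical placeholder, the sampling settings and memory-utilization value are illustrative, and the exact constructor arguments supported by the gfx906 fork may differ from upstream vLLM.

```python
from vllm import LLM, SamplingParams

# Shard an AWQ-quantized model across all eight MI50s via tensor parallelism.
# The model id below is a hypothetical placeholder, not from the original post.
llm = LLM(
    model="some-org/MiniMax-M2.1-AWQ",  # hypothetical Hugging Face repo id
    quantization="awq",                 # 4-bit AWQ weights, as in the benchmarks
    tensor_parallel_size=8,             # one shard per MI50
    max_model_len=196608,               # the maximum context reported above
    gpu_memory_utilization=0.95,        # leave a little headroom on each 32GB card
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism is the natural fit for this rig: splitting each layer's weights (and the KV cache) eight ways keeps per-card memory under the 32GB limit, which is what allows context lengths in the 95,000 to 196,608 token range on hardware of this class.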