A user has reported a significant performance gain for the Qwen3-Coder-Next-Q4_K_S model, reaching approximately 26 tokens/second on an RTX 5090 GPU using llama.cpp.

Configuration Details

The reported hardware and software configuration:

  • GPU: RTX 5090
  • CPU: 9950X3D
  • RAM: 96GB
  • Driver: 591.86 / CUDA 13.1
  • llama.cpp: b7951
  • Model: Unsloth GGUF Qwen3-Coder-Next-Q4_K_S.gguf

Implemented Optimizations

The performance increase was achieved through the following parameters:

-c 32768 -ngl 999 --flash-attn auto -ctk q8_0 -ctv q8_0 -ot ".ffn_.*_exps.=CPU" -np 1
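Putting those flags into a full launch command might look like the sketch below. The binary name (llama-server) and the model path are illustrative assumptions, not taken from the original report; only the flags themselves come from it.

```shell
# Sketch of a full launch command using the reported flags.
# Binary name and model path are illustrative assumptions.
./llama-server \
  -m Qwen3-Coder-Next-Q4_K_S.gguf \
  -c 32768 \
  -ngl 999 \
  --flash-attn auto \
  -ctk q8_0 -ctv q8_0 \
  -ot ".ffn_.*_exps.=CPU" \
  -np 1
# -c 32768           : 32k-token context window
# -ngl 999           : offload all layers to the GPU (clamped to the model's depth)
# --flash-attn auto  : enable flash attention when the backend supports it
# -ctk/-ctv q8_0     : quantize KV-cache keys and values to 8-bit
# -ot "...=CPU"      : keep tensors matching the pattern in system RAM
# -np 1              : serve a single parallel sequence
```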

The main gains come from offloading the MoE expert tensors (those matching the .ffn_.*_exps pattern) to the CPU, which reduces VRAM pressure, and from quantizing the KV cache to q8_0 (-ctk/-ctv), which is particularly useful at a 32k context. The user cautions that offloading MoE tensors may not pay off for smaller models, so each configuration should be benchmarked individually.
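The -ot argument is a regular-expression override matched against tensor names. A quick sketch shows which tensors the pattern routes to the CPU; the tensor names below are modeled on typical MoE GGUF naming conventions and are illustrative, not dumped from this exact model:

```python
import re

# The tensor-name pattern from the -ot override above.
pattern = re.compile(r".ffn_.*_exps.")

# Illustrative tensor names in the style of MoE GGUF layouts
# (assumptions, not taken from the actual model file).
tensors = [
    "blk.0.ffn_gate_exps.weight",  # MoE expert tensor -> matched, kept on CPU
    "blk.0.ffn_down_exps.weight",  # MoE expert tensor -> matched, kept on CPU
    "blk.0.attn_q.weight",         # attention tensor  -> not matched, stays on GPU
]

for name in tensors:
    device = "CPU" if pattern.search(name) else "GPU"
    print(f"{name}: {device}")
```

Only the bulky expert weights match, so the attention layers and shared weights stay on the GPU, which is why the trade-off favors large MoE models.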