A user has reported a significant performance improvement running the Qwen3-Coder-Next model at Q4_K_S quantization with llama.cpp, reaching approximately 26 tokens/second on an RTX 5090 GPU.
Configuration Details
The hardware and software configuration was:
- GPU: RTX 5090
- CPU: 9950X3D
- RAM: 96GB
- Driver: 591.86 / CUDA 13.1
- llama.cpp: b7951
- Model: Unsloth GGUF Qwen3-Coder-Next-Q4_K_S.gguf
Implemented Optimizations
The performance increase was achieved through the following parameters:
-c 32768 -ngl 999 --flash-attn auto -ctk q8_0 -ctv q8_0 -ot ".ffn_.*_exps.=CPU" -np 1
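For reference, a complete invocation built around these flags might look like the sketch below. The llama-server binary, model path, host, and port are assumptions for illustration; the report only lists the flags themselves.

  llama-server \
    -m ./Qwen3-Coder-Next-Q4_K_S.gguf \
    -c 32768 -ngl 999 --flash-attn auto \
    -ctk q8_0 -ctv q8_0 \
    -ot ".ffn_.*_exps.=CPU" \
    -np 1 --host 127.0.0.1 --port 8080

Note that -ngl 999 simply requests more layers than the model has, so every layer not caught by the -ot override ends up on the GPU.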
The main gains come from two settings. The -ot ".ffn_.*_exps.=CPU" override keeps the MoE expert tensors (those matching the .ffn_.*_exps pattern) in system RAM, reducing VRAM pressure so the rest of the model can stay fully on the GPU (-ngl 999). Quantizing the KV cache to q8_0 (-ctk/-ctv) roughly halves its memory footprint relative to f16, which is particularly useful at a 32k context. The user warns that offloading the MoE expert tensors may not be advantageous for smaller models, so each configuration needs its own benchmarking.
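To make the KV-cache saving concrete, here is a rough back-of-the-envelope calculation; the layer count, KV-head count, and head dimension are illustrative assumptions, not published figures for this model:

  KV values  = 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx
             = 2 × 48 × 8 × 128 × 32768 ≈ 3.2 × 10^9 values
  f16 cache  ≈ 3.2 × 10^9 × 2 bytes    ≈ 6.0 GiB
  q8_0 cache ≈ 3.2 × 10^9 × 1.06 bytes ≈ 3.2 GiB

(q8_0 stores 32 values per block plus a 2-byte scale, i.e. about 1.06 bytes per value.) Under these assumptions, q8_0 frees roughly 3 GiB of VRAM at 32k context, capacity that can instead hold model layers.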