A user has reported a significant performance improvement running the Qwen3-Coder-Next model at Q4_K_S quantization with llama.cpp, reaching approximately 26 tokens/second on an RTX 5090 GPU.
Configuration Details
The hardware and software configuration was:
- GPU: RTX 5090
- CPU: 9950X3D
- RAM: 96GB
- Driver: 591.86 / CUDA 13.1
- llama.cpp: b7951
- Model: Unsloth GGUF Qwen3-Coder-Next-Q4_K_S.gguf
Implemented Optimizations
The performance increase was achieved through the following parameters:
-c 32768 -ngl 999 --flash-attn auto -ctk q8_0 -ctv q8_0 -ot ".ffn_.*_exps.=CPU" -np 1
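For reference, a complete invocation built around these flags might look like the sketch below. The llama-server binary, model path, host, and port are assumptions for illustration; the report only lists the flags themselves.

  llama-server \
    -m ./Qwen3-Coder-Next-Q4_K_S.gguf \
    -c 32768 -ngl 999 --flash-attn auto \
    -ctk q8_0 -ctv q8_0 \
    -ot ".ffn_.*_exps.=CPU" \
    -np 1 --host 127.0.0.1 --port 8080

Note that -ngl 999 simply requests more layers than the model has, so every layer not caught by the -ot override ends up on the GPU.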
The main gains come from two settings. The -ot ".ffn_.*_exps.=CPU" override keeps the MoE expert tensors (those matching the .ffn_.*_exps pattern) in system RAM, reducing VRAM pressure so the rest of the model can stay fully on the GPU (-ngl 999). Quantizing the KV cache to q8_0 (-ctk/-ctv) roughly halves its memory footprint relative to f16, which is particularly useful at a 32k context. The user warns that offloading the MoE expert tensors may not be advantageous for smaller models, so each configuration needs its own benchmarking.
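To make the KV-cache saving concrete, here is a rough back-of-the-envelope calculation; the layer count, KV-head count, and head dimension are illustrative assumptions, not published figures for this model:

  KV values  = 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx
             = 2 × 48 × 8 × 128 × 32768 ≈ 3.2 × 10^9 values
  f16 cache  ≈ 3.2 × 10^9 × 2 bytes    ≈ 6.0 GiB
  q8_0 cache ≈ 3.2 × 10^9 × 1.06 bytes ≈ 3.2 GiB

(q8_0 stores 32 values per block plus a 2-byte scale, i.e. about 1.06 bytes per value.) Under these assumptions, q8_0 frees roughly 3 GiB of VRAM at 32k context, capacity that can instead hold model layers.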