Qwen 3.5 35B MoE: Performance on RTX 5060 Ti

A user reported impressive performance results for the Qwen 3.5 35B MoE language model running on an NVIDIA GeForce RTX 5060 Ti graphics card with 16GB of VRAM. The test used a prompt of roughly 100,000 tokens.

Configuration Details

  • Model: Qwen 3.5 35B MoE
  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
  • CPU: AMD Ryzen 7 9700X
  • Backend: CUDA and Vulkan
  • Context Length: 131,072 tokens configured; ~100,000-token test prompt

Results

Both backends generated at roughly 40 tokens per second (tps): CUDA achieved 44.32 tps, while Vulkan reached 41.35 tps. Prompt processing (prefill) of a 99,961-token prompt ran at 1154.31 tps.
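The reported prefill throughput and prompt length imply how long the user waited before generation started. A minimal sketch of that arithmetic, using the article's figures (the helper name is ours):

```python
# Figures from the article: 99,961-token prompt, 1154.31 tps prefill.
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Wall-clock time to process the prompt at the measured rate."""
    return prompt_tokens / prefill_tps

elapsed = prefill_seconds(99_961, 1154.31)
print(f"{elapsed:.1f} s")  # roughly 86.6 seconds of prefill before the first generated token
```

In other words, at these rates a full ~100k-token prompt costs under a minute and a half of prefill on this card.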

llama.cpp command used

llama-server.exe -m "/Qwen3.5-35B-A3B-MXFP4_MOE.gguf" --port 6789 --ctx-size 131072 -n 32768 --flash-attn on -ngl 40 --n-cpu-moe 24 -b 2048 -ub 2048 -t 8 --kv-offload --cont-batching --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0
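The same invocation can be written as an argument list, for example to launch the server from a Python script. This is a sketch: the values are copied from the command above, and the flag comments are our reading of them, not documentation from the report.

```python
import subprocess  # only needed if you actually launch the server

# The article's llama-server invocation as an argument list.
# Comments are our interpretation of each flag; values are copied verbatim.
args = [
    "llama-server", "-m", "/Qwen3.5-35B-A3B-MXFP4_MOE.gguf",
    "--port", "6789",
    "--ctx-size", "131072",       # context window the server allocates
    "-n", "32768",                # generation cap per request
    "--flash-attn", "on",
    "-ngl", "40",                 # transformer layers offloaded to the GPU
    "--n-cpu-moe", "24",          # MoE expert weights for 24 layers kept on the CPU
    "-b", "2048", "-ub", "2048",  # logical / physical batch size
    "-t", "8",                    # CPU threads, matching the 8-core 9700X
    "--kv-offload",               # copied verbatim from the reported command
    "--cont-batching",            # continuous batching for concurrent requests
    "--temp", "1.0", "--top-p", "0.95", "--top-k", "20",
    "--min-p", "0.0", "--presence-penalty", "1.5", "--repeat-penalty", "1.0",
]
# subprocess.run(args)  # uncomment to launch
```

The `-ngl 40` / `--n-cpu-moe 24` split is what makes the 16GB card workable: dense layers go to VRAM while most of the large but sparsely activated expert weights stay in system RAM.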

These results suggest that large language model inference is becoming increasingly accessible on consumer hardware. For teams evaluating on-premise deployments, the trade-offs still need careful weighing; AI-RADAR offers analytical frameworks for that evaluation at /llm-onpremise.