Qwen 3.5 35B MoE: Performance on RTX 5060 Ti
A user reported impressive performance results for the Qwen 3.5 35B MoE language model running on an NVIDIA GeForce RTX 5060 Ti graphics card with 16 GB of VRAM, tested at a context length of 100,000 tokens.
Configuration Details
- Model: Qwen 3.5 35B MoE
- GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
- CPU: AMD Ryzen 7 9700X
- Backend: CUDA and Vulkan
- Context Length: 100,000 tokens
Results
The tests showed generation speeds of roughly 40 tokens per second (tps) on both backends: CUDA reached 44.32 tps and Vulkan 41.35 tps. Prompt processing (prefill) of a 99,961-token prompt ran at 1,154.31 tps.
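To put the reported speeds into perspective, the snippet below converts them into wall-clock times (a back-of-the-envelope check derived only from the figures above; the 1,000-token reply length is an arbitrary example, not part of the report):

```python
# Rough timings implied by the reported throughput figures.

PREFILL_TOKENS = 99_961   # prompt length from the report
PREFILL_TPS = 1154.31     # reported prompt-processing (prefill) speed
GEN_TPS_CUDA = 44.32      # reported generation speed on the CUDA backend

# Time to ingest the full ~100k-token prompt:
prefill_seconds = PREFILL_TOKENS / PREFILL_TPS
print(f"prefill: ~{prefill_seconds:.0f} s")        # ~87 s

# Time to generate a hypothetical 1,000-token reply at the CUDA rate:
reply_seconds = 1_000 / GEN_TPS_CUDA
print(f"1k-token reply: ~{reply_seconds:.0f} s")   # ~23 s
```

So even at this context length, a full prompt ingestion stays under a minute and a half, which is what makes the setup practical on a 16 GB consumer card.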
llama.cpp command used
llama-server.exe -m "/Qwen3.5-35B-A3B-MXFP4_MOE.gguf" --port 6789 --ctx-size 131072 -n 32768 --flash-attn on -ngl 40 --n-cpu-moe 24 -b 2048 -ub 2048 -t 8 --kv-offload --cont-batching --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0
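The key memory-saving trick in this command is the combination of `-ngl 40` (offload 40 layers to the GPU) with `--n-cpu-moe 24`, which keeps the large MoE expert weights of the first 24 layers in system RAM while their attention weights still run on the GPU. The sketch below illustrates the resulting split; the total layer count of 48 is a hypothetical figure for illustration only, not a confirmed number for this model:

```python
# Illustrative split produced by "-ngl 40 --n-cpu-moe 24".
# TOTAL_LAYERS is a hypothetical example value, not a confirmed
# figure for Qwen 3.5 35B MoE.

N_GPU_LAYERS = 40   # -ngl 40: layers offloaded to the GPU
N_CPU_MOE = 24      # --n-cpu-moe 24: expert weights of the first
                    # 24 layers are kept in system RAM
TOTAL_LAYERS = 48   # hypothetical layer count, illustration only

summary = {
    # Layers whose attention runs on the GPU but whose MoE expert
    # tensors stay in RAM (the cheap-VRAM middle ground):
    "attention on GPU, experts in RAM": min(N_GPU_LAYERS, N_CPU_MOE),
    # Layers fully resident in VRAM:
    "fully on GPU": max(0, N_GPU_LAYERS - N_CPU_MOE),
    # Layers never offloaded at all:
    "fully on CPU": max(0, TOTAL_LAYERS - N_GPU_LAYERS),
}
print(summary)
# {'attention on GPU, experts in RAM': 24, 'fully on GPU': 16,
#  'fully on CPU': 8}
```

Because only a small subset of experts is active per token (the "A3B" in the model name), keeping expert weights in RAM costs far less throughput than it would for a dense model of the same size, which is why generation still exceeds 40 tps here.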
These results suggest that large language model inference is becoming increasingly accessible on consumer hardware. For those evaluating on-premise deployments, there are trade-offs to weigh, and AI-RADAR offers analytical frameworks for that evaluation at /llm-onpremise.