Qwen3.5 122B: Optimization on Consumer Hardware
A user tested the Qwen3.5 122B A10B model, quantized with Unsloth, on a system consisting of an RTX 4090, an RTX 3090, an Intel i7-13700K, and 128 GB of DDR5-5600 RAM. The goal was stable performance after difficulties encountered with previous quantizations.
Configuration and Performance
The user found that manual tensor placement via explicit parameters (--n-cpu-moe 33 -ts 4,1 -c 32000) offers superior performance compared to the automatic --fit flag. Prompt processing speed rose from 30.8 tokens/s to 143.4 tokens/s (roughly 4.7x), while generation improved from 9.1 tokens/s to 18.6 tokens/s (roughly 2x). The manual configuration also reportedly degrades less as the context grows.
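Assuming the model was launched through llama.cpp's llama-server (the binary choice, model path, and quant filename below are illustrative placeholders; the flags are the ones reported above), the manual placement might look like:

```shell
# Sketch of a llama-server launch with the reported flags.
# Model path/filename is hypothetical; adjust to the actual Unsloth GGUF.
llama-server \
  --model ./Qwen3.5-122B-A10B.gguf \
  --n-cpu-moe 33 \
  -ts 4,1 \
  -c 32000
# --n-cpu-moe 33 : keep the MoE expert tensors of 33 layers in CPU RAM
# -ts 4,1        : split the remaining GPU layers ~4:1 across the two GPUs
#                  (here, the RTX 4090 and the RTX 3090)
# -c 32000       : 32k-token context window
```

Keeping only the large expert tensors on the CPU while dense layers stay on the GPUs is the usual rationale for --n-cpu-moe on MoE models: the experts dominate memory but only a fraction are active per token.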
BF16 Cache and Repeat Penalty
Using a BF16 KV cache (--cache-type-k bf16 --cache-type-v bf16) improved the model's reasoning quality, avoiding the logical loops encountered with the default FP16 configuration. Furthermore, repetition-penalty settings (--presence-penalty 1.1 --repeat-penalty 1.05 --repeat-last-n 512) proved necessary to prevent repeated patterns in the generated text, a behavior not observed in other models the user tested.
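In practice these cache and sampling settings would be appended to the same launch command; a sketch (binary and model path are again placeholders, flags are those from the post):

```shell
# Cache and repetition-penalty flags reported in the post,
# added to a hypothetical llama-server launch.
llama-server \
  --model ./Qwen3.5-122B-A10B.gguf \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  --presence-penalty 1.1 \
  --repeat-penalty 1.05 \
  --repeat-last-n 512
# --cache-type-k/-v bf16 : store the KV cache in BF16 instead of the default FP16
# --repeat-last-n 512    : penalties look back over the last 512 generated tokens
```

BF16 keeps FP16's memory footprint while trading mantissa precision for FP32's exponent range, which is one plausible reason it behaved differently from FP16 here.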
Final Impressions
Despite the gains from these optimizations, the user still considers Qwen3.5 122B A10B too slow for effective agentic use, preferring alternatives such as Minimax M2.5 IQ4_NL for both reasoning capability and speed. The user speculates that llama.cpp may not yet be fully optimized for this specific model.