A new GGUF quantization of the Qwen3.5-35B-A3B model has been released, aimed at maximizing performance on graphics cards with 24 GB of VRAM.
Quantization Details
What distinguishes this GGUF build is its exclusive use of the q8_0, q4_0, and q4_1 quantization types, which are generally faster on the Vulkan and ROCm backends. The quantized model comes to 19.776 GiB, or 4.901 bits per weight (BPW).
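As a quick sanity check, the stated size and BPW figures are mutually consistent: dividing the file size in bits by the bits-per-weight value recovers roughly a 35B parameter count. A minimal Python sketch of the arithmetic:

```python
# Sanity check of the stated figures: 19.776 GiB at 4.901 bits per weight
# implies roughly a 35B-parameter model. Pure arithmetic; the only
# assumption is GiB = 2**30 bytes.
size_gib = 19.776
bpw = 4.901

total_bits = size_gib * 2**30 * 8          # file size in bits
params = total_bits / bpw                  # implied parameter count
print(f"~{params / 1e9:.1f}B parameters")  # -> ~34.7B parameters
```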
Performance and Testing
Initial results show good perplexity for the model's size, suggesting a potential speed advantage over other quantizations, particularly on the Vulkan backend. The author invites the community to run benchmarks with tools such as llama-sweep-bench on a range of hardware, including Strix Halo and the 7900 XTX, as sketched below. Tests on Mac are also welcome, to evaluate how the model performs with the MLX framework.
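To make that invitation concrete, here is a hedged sketch of a benchmark driver. The binary name follows ik_llama.cpp's llama-sweep-bench, and the flags (-m, -c, -ngl) are assumed to follow llama.cpp's common argument conventions; verify them against your build's --help before relying on this:

```python
# Hedged sketch: run llama-sweep-bench across a few GPU offload settings.
# Assumes llama.cpp-style flags (-m model, -c context, -ngl GPU layers);
# the model filename and offload counts below are placeholders.
import subprocess

MODEL = "qwen3.5-35b-a3b-q4_0.gguf"  # hypothetical filename

for ngl in (32, 48, 99):  # assumed offload counts worth comparing on 24 GB
    subprocess.run(
        ["./llama-sweep-bench", "-m", MODEL, "-c", "8192", "-ngl", str(ngl)],
        check=True,
    )
```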
Those interested can find the model on Hugging Face; it is compatible with llama.cpp, ik_llama.cpp, and other downstream projects.
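For reference, a minimal download sketch using the huggingface_hub Python client; the repository ID and filename here are placeholders, since the source does not give the exact repo path:

```python
# Minimal sketch: fetch one GGUF file from Hugging Face with huggingface_hub.
# The repo_id and filename below are hypothetical, not the actual repository.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="someuser/Qwen3.5-35B-A3B-GGUF",  # hypothetical repo
    filename="qwen3.5-35b-a3b-q4_0.gguf",     # hypothetical file name
)
print(path)  # local cache path, ready to pass to llama.cpp's -m flag
```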