A new GGUF quantization of the Qwen3.5-35B-A3B model has been released, aimed at maximizing performance on graphics cards with 24 GB of VRAM.
Quantization Details
What distinguishes this GGUF build is its exclusive use of the q8_0, q4_0, and q4_1 quantization types, which are generally faster on the Vulkan and ROCm backends. The quantized model comes to 19.776 GiB, or 4.901 bits per weight (BPW).
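As a quick sanity check, the stated size and BPW figures are mutually consistent: dividing the file size in bits by the bits-per-weight value recovers roughly a 35B parameter count. A minimal Python sketch of the arithmetic:

```python
# Sanity check of the stated figures: 19.776 GiB at 4.901 bits per weight
# implies roughly a 35B-parameter model. Pure arithmetic; the only
# assumption is GiB = 2**30 bytes.
size_gib = 19.776
bpw = 4.901

total_bits = size_gib * 2**30 * 8          # file size in bits
params = total_bits / bpw                  # implied parameter count
print(f"~{params / 1e9:.1f}B parameters")  # -> ~34.7B parameters
```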
Performance and Testing
Initial results show good perplexity for the model's size, suggesting a potential speed advantage over other quantizations, particularly on the Vulkan backend. The author invites the community to run benchmarks with tools such as llama-sweep-bench on a range of hardware, including Strix Halo and the 7900 XTX, as sketched below. Tests on Mac are also welcome, to evaluate how the model performs with the MLX framework.
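To make that invitation concrete, here is a hedged sketch of a benchmark driver. The binary name follows ik_llama.cpp's llama-sweep-bench, and the flags (-m, -c, -ngl) are assumed to follow llama.cpp's common argument conventions; verify them against your build's --help before relying on this:

```python
# Hedged sketch: run llama-sweep-bench across a few GPU offload settings.
# Assumes llama.cpp-style flags (-m model, -c context, -ngl GPU layers);
# the model filename and offload counts below are placeholders.
import subprocess

MODEL = "qwen3.5-35b-a3b-q4_0.gguf"  # hypothetical filename

for ngl in (32, 48, 99):  # assumed offload counts worth comparing on 24 GB
    subprocess.run(
        ["./llama-sweep-bench", "-m", MODEL, "-c", "8192", "-ngl", str(ngl)],
        check=True,
    )
```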
Those interested can find the model on Hugging Face; it is compatible with llama.cpp, ik_llama.cpp, and other downstream projects.
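For reference, a minimal download sketch using the huggingface_hub Python client; the repository ID and filename here are placeholders, since the source does not give the exact repo path:

```python
# Minimal sketch: fetch one GGUF file from Hugging Face with huggingface_hub.
# The repo_id and filename below are hypothetical, not the actual repository.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="someuser/Qwen3.5-35B-A3B-GGUF",  # hypothetical repo
    filename="qwen3.5-35b-a3b-q4_0.gguf",     # hypothetical file name
)
print(path)  # local cache path, ready to pass to llama.cpp's -m flag
```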