Unsloth has announced a new update for its Qwen3.5 models in GGUF (GPT-Generated Unified Format), aimed at further improving the ratio between model size and Kullback-Leibler divergence (KLD), a measure of how far the quantized model's probability distribution deviates from that of the original model.

Key Updates

  • Calibration Dataset: All GGUFs now use a new imatrix calibration dataset, which should lead to small improvements in performance in chat, coding, long context handling, and tool-calling scenarios.
  • KLD Reduction: The quantization method for Qwen3.5 Mixture of Experts (MoE) models has been further refined to directly reduce the maximum KLD. In particular, the UD-Q4_K_XL variant is 8% larger but reduces the maximum KLD by 51% compared to the version before March 5th.
  • Model Updates: The Qwen3.5-35B-A3B, 27B, and 122B-A10B models have been updated and made available for re-download. The 397B-A17B model will be updated shortly.
  • Inference: BF16 (BFloat16) layers have been replaced with F16 (Float16) to speed up inference on devices that lack native BF16 support.
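To make the KLD metric concrete: it compares, token by token, the next-token probability distribution of the quantized model against that of the original model. The following minimal sketch (not Unsloth's actual measurement pipeline; the distributions are illustrative) shows how the divergence for a single prediction would be computed:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    p: next-token probabilities from the original (full-precision) model.
    q: next-token probabilities from the quantized model.
    eps guards against log(0) when q assigns zero probability.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over a tiny 4-token vocabulary.
p_original = [0.70, 0.20, 0.08, 0.02]
q_quantized = [0.65, 0.24, 0.09, 0.02]

print(kl_divergence(p_original, q_quantized))  # small positive value
print(kl_divergence(p_original, p_original))   # 0.0 (identical distributions)
```

In practice this is averaged (or, as in the figures below, maximized) over many tokens of a calibration corpus; a lower value means the quantized model's behavior stays closer to the original.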

Quantization and Performance

The following table summarizes the changes in size and maximum KLD for the different quantization configurations:

Quant        Old size (GB)   New size (GB)   Old max KLD   New max KLD
UD-Q2_K_XL   12.0            11.3 (-6%)      8.237         8.155 (-1%)
UD-Q3_K_XL   16.1            15.5 (-4%)      5.505         5.146 (-6.5%)
UD-Q4_K_XL   19.2            20.7 (+8%)      5.894         2.877 (-51%)
UD-Q5_K_XL   23.2            24.6 (+6%)      5.536         3.210 (-42%)
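The percentage changes in the table follow directly from the raw values. A short sketch that recomputes them (the data is taken from the table; the code itself is illustrative):

```python
# Recompute the percentage deltas reported in the table above.
# Sizes are in GB; max-KLD values are dimensionless.
rows = {
    "UD-Q2_K_XL": {"gb": (12.0, 11.3), "kld": (8.237, 8.155)},
    "UD-Q3_K_XL": {"gb": (16.1, 15.5), "kld": (5.505, 5.146)},
    "UD-Q4_K_XL": {"gb": (19.2, 20.7), "kld": (5.894, 2.877)},
    "UD-Q5_K_XL": {"gb": (23.2, 24.6), "kld": (5.536, 3.210)},
}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

for quant, vals in rows.items():
    d_gb = pct_change(*vals["gb"])
    d_kld = pct_change(*vals["kld"])
    print(f"{quant}: size {d_gb:+.1f}%, max KLD {d_kld:+.1f}%")
```

This makes the UD-Q4_K_XL trade-off explicit: roughly 8% more disk space in exchange for a 51% drop in worst-case divergence from the original model.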

These updates aim to make the Qwen3.5 models more efficient and performant, especially for local use. For teams evaluating on-premise deployments, the size-versus-accuracy trade-offs above are worth weighing carefully; AI-RADAR offers analytical frameworks on /llm-onpremise to support these evaluations.