Qwen 3.5 and KV Cache: A Crucial Detail
If you are running the Qwen 3.5 35B A3B model locally, for example with an engine like llama.cpp, it is essential to configure the KV cache correctly. Contrary to what one might expect, Qwen 3.5 should have its KV cache set to bf16 (bfloat16) rather than the more common fp16 (float16).
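With llama.cpp, the KV cache type can be overridden at launch via the `--cache-type-k` / `--cache-type-v` flags. A minimal sketch, assuming a recent build that supports `bf16` as a cache type (the model filename and context size below are placeholders, not the article's exact setup):

```shell
# Launch llama-server with a BF16 KV cache instead of the FP16 default.
# -ctk / --cache-type-k sets the key cache type, -ctv / --cache-type-v the value cache type.
./llama-server \
  -m qwen3.5-35b-a3b-q4_k_m.gguf \
  -c 32768 \
  --cache-type-k bf16 \
  --cache-type-v bf16
```

If your build rejects `bf16` as a cache type, it predates BF16 KV cache support and will need updating.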
Perplexity Verification
To validate this claim, perplexity (PPL) tests were performed on the wikitext-2-raw dataset. The results show that the FP16 and FP32 KV caches yield identical perplexity, while the BF16 cache yields a slightly lower (better) score.
- Run 1: FP16 KV cache (llama.cpp default): PPL = 6.5511 +/- 0.04172
- Run 2: FP32 KV cache: PPL = 6.5511 +/- 0.04172
- Run 3: BF16 KV cache: PPL = 6.5497 +/- 0.04170
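Runs of this kind can be reproduced with llama.cpp's perplexity tool. A sketch, assuming the wikitext-2-raw test file has already been downloaded and that the model path is a placeholder:

```shell
# Compare KV cache types on wikitext-2-raw (paths are placeholders).
# Run 1: FP16 KV cache (the default, so no cache-type flags needed)
./llama-perplexity -m model.gguf -f wiki.test.raw
# Run 2: FP32 KV cache
./llama-perplexity -m model.gguf -f wiki.test.raw --cache-type-k f32 --cache-type-v f32
# Run 3: BF16 KV cache
./llama-perplexity -m model.gguf -f wiki.test.raw --cache-type-k bf16 --cache-type-v bf16
```

Each run prints a final PPL estimate with its standard error, which is the format of the numbers listed above.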
Official Implementations vs. llama.cpp
It is worth noting that the implementations favored by the Qwen team, such as vLLM, use BF16 for the KV cache by default. llama.cpp's choice of FP16 as the default appears to be the exception.
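For comparison, vLLM's default KV cache dtype is `auto`, which inherits the model's own dtype, so a checkpoint published in bfloat16 gets a BF16 KV cache without any extra configuration. A sketch (the model identifier is a placeholder, not necessarily the exact Hugging Face repo name):

```shell
# vLLM: "auto" (the default) matches the KV cache dtype to the model's dtype,
# which is BF16 for checkpoints shipped in bfloat16.
vllm serve Qwen/Qwen3.5-35B-A3B --kv-cache-dtype auto
```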
For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise for a thorough evaluation.