Qwen 3.5 and KV Cache: A Crucial Detail
If you are running the Qwen 3.5 35B A3B model locally, for example with an engine like llama.cpp, it is essential to configure the KV cache correctly. Contrary to what one might expect, Qwen 3.5 should have its KV cache set to bf16 (bfloat16) rather than the more common fp16 (float16).
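With llama.cpp, the KV cache type can be overridden at launch via the `--cache-type-k` / `--cache-type-v` flags. A minimal sketch, assuming a recent build that supports `bf16` as a cache type (the model filename and context size below are placeholders, not the article's exact setup):

```shell
# Launch llama-server with a BF16 KV cache instead of the FP16 default.
# -ctk / --cache-type-k sets the key cache type, -ctv / --cache-type-v the value cache type.
./llama-server \
  -m qwen3.5-35b-a3b-q4_k_m.gguf \
  -c 32768 \
  --cache-type-k bf16 \
  --cache-type-v bf16
```

If your build rejects `bf16` as a cache type, it predates BF16 KV cache support and will need updating.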
Perplexity Verification
To validate this claim, perplexity (PPL) tests were performed on the wikitext-2-raw dataset. The results show that the FP16 and FP32 KV caches yield identical perplexity, while the BF16 cache yields a slightly lower (better) score.
- Run 1: FP16 KV cache (llama.cpp default): PPL = 6.5511 +/- 0.04172
- Run 2: FP32 KV cache: PPL = 6.5511 +/- 0.04172
- Run 3: BF16 KV cache: PPL = 6.5497 +/- 0.04170
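Runs of this kind can be reproduced with llama.cpp's perplexity tool. A sketch, assuming the wikitext-2-raw test file has already been downloaded and that the model path is a placeholder:

```shell
# Compare KV cache types on wikitext-2-raw (paths are placeholders).
# Run 1: FP16 KV cache (the default, so no cache-type flags needed)
./llama-perplexity -m model.gguf -f wiki.test.raw
# Run 2: FP32 KV cache
./llama-perplexity -m model.gguf -f wiki.test.raw --cache-type-k f32 --cache-type-v f32
# Run 3: BF16 KV cache
./llama-perplexity -m model.gguf -f wiki.test.raw --cache-type-k bf16 --cache-type-v bf16
```

Each run prints a final PPL estimate with its standard error, which is the format of the numbers listed above.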
Official Implementations vs. llama.cpp
It is worth noting that the implementations favored by the Qwen team, such as vLLM, use BF16 for the KV cache by default. llama.cpp's choice of FP16 as the default appears to be the exception.
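For comparison, vLLM's default KV cache dtype is `auto`, which inherits the model's own dtype, so a checkpoint published in bfloat16 gets a BF16 KV cache without any extra configuration. A sketch (the model identifier is a placeholder, not necessarily the exact Hugging Face repo name):

```shell
# vLLM: "auto" (the default) matches the KV cache dtype to the model's dtype,
# which is BF16 for checkpoints shipped in bfloat16.
vllm serve Qwen/Qwen3.5-35B-A3B --kv-cache-dtype auto
```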
For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise for a thorough evaluation.