Optimizing inference for large language models (LLMs) is a crucial area of research and development.

Performance Increase with ik_llama.cpp

A user reported a significant performance increase using the ik_llama.cpp fork of llama.cpp for inference with the Qwen 3.5 27B model. On a Lenovo ThinkStation P520 workstation with an 18-core Xeon W-2295 processor, 128GB of DDR4 ECC RAM, and an NVIDIA RTX PRO 4000 Blackwell GPU (24GB GDDR7), the results were as follows:

  • Prompt evaluation: from ~43 tok/sec to 1,122 tok/sec (26x faster)
  • Generation: from ~7.5 tok/sec to 26 tok/sec (3.5x faster)
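The reported multipliers can be sanity-checked with a quick calculation using the figures above:

```python
# Verify the reported speedups from the benchmark figures above.
prompt_before, prompt_after = 43.0, 1122.0    # tok/sec, prompt evaluation
gen_before, gen_after = 7.5, 26.0             # tok/sec, generation

prompt_speedup = prompt_after / prompt_before
gen_speedup = gen_after / gen_before

print(f"prompt eval: {prompt_speedup:.1f}x faster")  # prompt eval: 26.1x faster
print(f"generation:  {gen_speedup:.1f}x faster")     # generation:  3.5x faster
```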

The improvement is attributed to fused GDN (Gated Delta Network) kernels in ik_llama.cpp, which keep the entire computation on the CUDA GPU and reduce the number of graph splits from 34 to 2, minimizing CPU involvement during inference.

Full Prompt Re-Processing Bug

The recurrent architecture of Qwen 3.5 still forces full prompt re-processing on every turn when the prompt changes: unlike a standard attention KV cache, the recurrent state cannot simply be reused from a cached prefix. At 1,122 tok/sec, however, this issue becomes far more tolerable.
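The practical cost of re-processing is easy to estimate from the measured rates. A rough sketch, where the 8,000-token prompt length is a hypothetical example rather than a figure from the report:

```python
# Estimate wall-clock time to re-process a full prompt at the measured
# prompt-evaluation rates. The 8,000-token prompt is a hypothetical
# example chosen for illustration.
prompt_tokens = 8_000

mainline_rate = 43.0     # tok/sec, mainline llama.cpp
fork_rate = 1_122.0      # tok/sec, ik_llama.cpp

print(f"mainline llama.cpp: {prompt_tokens / mainline_rate:.0f} s")  # ~186 s
print(f"ik_llama.cpp:       {prompt_tokens / fork_rate:.1f} s")      # ~7.1 s
```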

Where to Download

Pre-built Windows CUDA 12.8 binaries with AVX512 VNNI are available from the Thireus fork: https://github.com/Thireus/ik_llama.cpp/releases.

It is a drop-in replacement for an existing llama-server setup, with the same command-line arguments and the same OpenAI-compatible API on port 1234.
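Because the API surface is unchanged, existing client code keeps working. A minimal sketch of a chat request, assuming the server runs on localhost:1234; the model name "qwen3.5-27b" is a placeholder, so use whatever name your server reports at /v1/models:

```python
import json
import urllib.request

# Standard OpenAI-style chat completion request. The /v1/chat/completions
# path is part of the OpenAI-compatible API; the model name is a placeholder.
url = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```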

For systems with AVX512 VNNI, download: ik_llama-main-b4370-4d7223c-bin-win-cuda-12.8-x64-avx512_vnni.zip

Users running Qwen 3.5 on mainline llama.cpp may see much slower performance: the fused GDN kernels in ik_llama.cpp have not yet landed in the mainline version.