Optimizing inference for large language models (LLMs) is a crucial area of research and development.

Performance Increase with ik_llama.cpp

A user reported a significant performance increase using the ik_llama.cpp fork of llama.cpp for inference with the Qwen 3.5 27B model. On a Lenovo ThinkStation P520 workstation with an 18-core Xeon W-2295 processor, 128GB of DDR4 ECC RAM, and an NVIDIA RTX PRO 4000 Blackwell GPU (24GB GDDR7), the results were as follows:

  • Prompt evaluation: from ~43 tok/sec to 1,122 tok/sec (26x faster)
  • Generation: from ~7.5 tok/sec to 26 tok/sec (3.5x faster)
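The reported multipliers can be sanity-checked with a quick calculation using the figures above:

```python
# Verify the reported speedups from the benchmark figures above.
prompt_before, prompt_after = 43.0, 1122.0    # tok/sec, prompt evaluation
gen_before, gen_after = 7.5, 26.0             # tok/sec, generation

prompt_speedup = prompt_after / prompt_before
gen_speedup = gen_after / gen_before

print(f"prompt eval: {prompt_speedup:.1f}x faster")  # prompt eval: 26.1x faster
print(f"generation:  {gen_speedup:.1f}x faster")     # generation:  3.5x faster
```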

The improvement is attributed to fused GDN (Gated Delta Network) kernels in ik_llama.cpp, which keep the entire computation on the CUDA GPU and reduce the number of graph splits from 34 to 2, minimizing CPU involvement during inference.

Full Prompt Re-Processing Bug

The recurrent architecture of Qwen 3.5 still forces full prompt re-processing on every turn when the prompt changes: unlike a standard attention KV cache, the recurrent state cannot simply be reused from a cached prefix. At 1,122 tok/sec, however, this issue becomes far more tolerable.
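The practical cost of re-processing is easy to estimate from the measured rates. A rough sketch, where the 8,000-token prompt length is a hypothetical example rather than a figure from the report:

```python
# Estimate wall-clock time to re-process a full prompt at the measured
# prompt-evaluation rates. The 8,000-token prompt is a hypothetical
# example chosen for illustration.
prompt_tokens = 8_000

mainline_rate = 43.0     # tok/sec, mainline llama.cpp
fork_rate = 1_122.0      # tok/sec, ik_llama.cpp

print(f"mainline llama.cpp: {prompt_tokens / mainline_rate:.0f} s")  # ~186 s
print(f"ik_llama.cpp:       {prompt_tokens / fork_rate:.1f} s")      # ~7.1 s
```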

Where to Download

Pre-built Windows CUDA 12.8 binaries with AVX512 VNNI are available from the Thireus fork: https://github.com/Thireus/ik_llama.cpp/releases.

It is a drop-in replacement for an existing llama-server setup, with the same command-line arguments and the same OpenAI-compatible API on port 1234.
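Because the API surface is unchanged, existing client code keeps working. A minimal sketch of a chat request, assuming the server runs on localhost:1234; the model name "qwen3.5-27b" is a placeholder, so use whatever name your server reports at /v1/models:

```python
import json
import urllib.request

# Standard OpenAI-style chat completion request. The /v1/chat/completions
# path is part of the OpenAI-compatible API; the model name is a placeholder.
url = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```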

For systems with AVX512 VNNI, download: ik_llama-main-b4370-4d7223c-bin-win-cuda-12.8-x64-avx512_vnni.zip

Users running Qwen 3.5 on mainline llama.cpp may see much slower performance: the fused GDN kernels in ik_llama.cpp have not yet landed in the mainline version.