Qwen 3.5 27B: High Performance with B200 GPUs

The dense (non-MoE) Qwen 3.5 27B language model reached a throughput of 1,103,941 tokens per second. The result was achieved on a cluster of 12 nodes with a total of 96 B200 GPUs (8 per node), running the vLLM framework.
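As a quick sanity check, the aggregate figure can be broken down per node and per GPU. This is a derived breakdown, not separately reported numbers:

```python
# Back-of-the-envelope breakdown of the reported aggregate throughput.
# Only TOTAL_TOKENS_PER_SEC, NODES, and GPUS come from the text; the
# per-node and per-GPU figures are derived here.
TOTAL_TOKENS_PER_SEC = 1_103_941
NODES = 12
GPUS = 96

per_node = TOTAL_TOKENS_PER_SEC / NODES  # roughly 92,000 tokens/s per node
per_gpu = TOTAL_TOKENS_PER_SEC / GPUS    # roughly 11,500 tokens/s per GPU

print(f"{per_node:,.0f} tokens/s per node, {per_gpu:,.0f} tokens/s per GPU")
```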

Key Optimizations

A significant increase in performance, from 9,500 to 95,000 tokens per second per node, resulted from four main configuration changes:

  • Data parallelism (DP=8) instead of tensor parallelism (TP=8).
  • Reducing the context window from 131K to 4K tokens.
  • Storing the KV cache in FP8 format.
  • Enabling MTP-1 speculative decoding, which had the greatest impact: without MTP, GPU utilization stayed close to 0%.
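The four changes above map roughly onto vLLM engine arguments. The sketch below is illustrative only: the model identifier, the exact speculative-decoding settings, and the argument spellings are assumptions based on recent vLLM releases and may differ from the authors' actual launch configuration or the version they used.

```python
from vllm import LLM

# Illustrative configuration only -- argument names follow recent vLLM
# releases and are assumptions, not the authors' published setup.
llm = LLM(
    model="Qwen/Qwen3.5-27B",    # hypothetical model identifier
    tensor_parallel_size=1,      # DP over TP: no tensor sharding ...
    data_parallel_size=8,        # ... replicate the model across 8 GPUs per node
    max_model_len=4096,          # context window reduced from 131K to 4K
    kv_cache_dtype="fp8",        # KV cache stored in FP8
    speculative_config={         # MTP speculative decoding, 1 draft token
        "method": "mtp",
        "num_speculative_tokens": 1,
    },
)
```

Since this configuration requires B200-class hardware and model weights to run, treat it as a starting point to adapt rather than a drop-in command.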

Scalability and Infrastructure

Scaling efficiency remained high: 97.1% on 8 nodes and 96.5% on 12 nodes. Load balancing used simple round-robin via a Kubernetes ClusterIP service. The Inference Gateway with KV-cache-aware routing introduced a 35% overhead and was therefore not used.
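Scaling efficiency here can be read as measured aggregate throughput divided by the ideal linear projection from a single node. A minimal sketch, assuming that definition and using the 95,000 tokens/s per-node figure from the previous section as the baseline:

```python
def scaling_efficiency(total_tps: float, nodes: int, single_node_tps: float) -> float:
    """Measured aggregate throughput vs. ideal linear scaling from one node."""
    return total_tps / (nodes * single_node_tps)

# Reported aggregate on 12 nodes vs. a 95,000 tokens/s single-node baseline.
eff = scaling_efficiency(1_103_941, 12, 95_000)
print(f"{eff:.1%}")  # roughly 96.8%, in line with the reported 96.5%
```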

No custom kernels were used; the setup ran stock vLLM v0.18.0. GDN kernel optimizations are being upstreamed.