Qwen 3.5 27B: High Performance with B200 GPUs

The dense (non-MoE) Qwen 3.5 27B language model reached a throughput of 1,103,941 tokens per second. The result was achieved on a cluster of 12 nodes with a total of 96 B200 GPUs (8 per node), running the vLLM framework.
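As a quick sanity check, the aggregate figure can be broken down per node and per GPU. This is a derived breakdown, not separately reported numbers:

```python
# Back-of-the-envelope breakdown of the reported aggregate throughput.
# Only TOTAL_TOKENS_PER_SEC, NODES, and GPUS come from the text; the
# per-node and per-GPU figures are derived here.
TOTAL_TOKENS_PER_SEC = 1_103_941
NODES = 12
GPUS = 96

per_node = TOTAL_TOKENS_PER_SEC / NODES  # roughly 92,000 tokens/s per node
per_gpu = TOTAL_TOKENS_PER_SEC / GPUS    # roughly 11,500 tokens/s per GPU

print(f"{per_node:,.0f} tokens/s per node, {per_gpu:,.0f} tokens/s per GPU")
```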

Key Optimizations

A significant increase in performance, from 9,500 to 95,000 tokens per second per node, resulted from four main configuration changes:

  • Data parallelism (DP=8) instead of tensor parallelism (TP=8).
  • Reducing the context window from 131K to 4K tokens.
  • Storing the KV cache in FP8 format.
  • Enabling MTP-1 speculative decoding, which had the greatest impact: without MTP, GPU utilization stayed close to 0%.
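The four changes above map roughly onto vLLM engine arguments. The sketch below is illustrative only: the model identifier, the exact speculative-decoding settings, and the argument spellings are assumptions based on recent vLLM releases and may differ from the authors' actual launch configuration or the version they used.

```python
from vllm import LLM

# Illustrative configuration only -- argument names follow recent vLLM
# releases and are assumptions, not the authors' published setup.
llm = LLM(
    model="Qwen/Qwen3.5-27B",    # hypothetical model identifier
    tensor_parallel_size=1,      # DP over TP: no tensor sharding ...
    data_parallel_size=8,        # ... replicate the model across 8 GPUs per node
    max_model_len=4096,          # context window reduced from 131K to 4K
    kv_cache_dtype="fp8",        # KV cache stored in FP8
    speculative_config={         # MTP speculative decoding, 1 draft token
        "method": "mtp",
        "num_speculative_tokens": 1,
    },
)
```

Since this configuration requires B200-class hardware and model weights to run, treat it as a starting point to adapt rather than a drop-in command.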

Scalability and Infrastructure

Scaling efficiency remained high: 97.1% on 8 nodes and 96.5% on 12 nodes. Load balancing used simple round-robin via a Kubernetes ClusterIP service. The Inference Gateway with KV-cache-aware routing introduced a 35% overhead and was therefore not used.
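Scaling efficiency here can be read as measured aggregate throughput divided by the ideal linear projection from a single node. A minimal sketch, assuming that definition and using the 95,000 tokens/s per-node figure from the previous section as the baseline:

```python
def scaling_efficiency(total_tps: float, nodes: int, single_node_tps: float) -> float:
    """Measured aggregate throughput vs. ideal linear scaling from one node."""
    return total_tps / (nodes * single_node_tps)

# Reported aggregate on 12 nodes vs. a 95,000 tokens/s single-node baseline.
eff = scaling_efficiency(1_103_941, 12, 95_000)
print(f"{eff:.1%}")  # roughly 96.8%, in line with the reported 96.5%
```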

No custom kernels were used; the setup ran stock vLLM v0.18.0. GDN kernel optimizations are being upstreamed.