Qwen 3.5 27B: High Performance with B200 GPUs
The Qwen 3.5 27B language model (the dense variant, not MoE) has reached an aggregate throughput of 1,103,941 tokens per second. The result was achieved on a 12-node cluster with a total of 96 B200 GPUs, running the vLLM framework.
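To put the headline number in perspective, the per-node and per-GPU splits are simple divisions of the reported aggregate figure (they are derived here for illustration, not separately measured):

```python
# Back-of-the-envelope breakdown of the reported aggregate throughput.
TOTAL_TOKENS_PER_SEC = 1_103_941  # reported aggregate throughput
NODES = 12
GPUS = 96

per_node = TOTAL_TOKENS_PER_SEC / NODES  # ~92,000 tokens/s per node
per_gpu = TOTAL_TOKENS_PER_SEC / GPUS    # ~11,500 tokens/s per GPU

print(f"per node: {per_node:,.0f} tok/s, per GPU: {per_gpu:,.0f} tok/s")
```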
Key Optimizations
A significant increase in performance, from roughly 9,500 to 95,000 tokens per second per node, resulted from four main configuration changes:
- Data Parallelism (DP=8) over Tensor Parallelism (TP=8).
- Reduction of the context window from 131K to 4K.
- Implementation of the KV cache in FP8 format.
- Use of the MTP-1 speculative decoding technique, which had the greatest impact. Without MTP, GPU utilization was close to 0%.
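The four changes above map onto vLLM server flags roughly as follows. This is a sketch only: the model path is a placeholder, and exact flag names and the speculative-decoding config format vary across vLLM versions, so check the documentation for the version you run.

```shell
# DP=8 (instead of TP=8), 4K context, FP8 KV cache, MTP-1 speculative decoding
vllm serve <model-path> \
  --data-parallel-size 8 \
  --max-model-len 4096 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```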
Scalability and Infrastructure
Scaling efficiency remained high: 97.1% on 8 nodes and 96.5% on 12 nodes. Load balancing used a simple round-robin approach via a ClusterIP service. An Inference Gateway with KV-cache-aware routing was also evaluated, but it introduced a 35% overhead and was therefore not used.
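Scaling efficiency is conventionally computed as measured aggregate throughput divided by the ideal linear-scaling throughput (node count times single-node throughput). A minimal sketch, where the 95,000 tok/s single-node baseline is an assumption for illustration, so the result only approximates the article's 96.5% figure:

```python
def scaling_efficiency(total_tps: float, nodes: int, single_node_tps: float) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return total_tps / (nodes * single_node_tps)

# Hypothetical single-node baseline of 95,000 tok/s:
eff_12 = scaling_efficiency(1_103_941, nodes=12, single_node_tps=95_000)
print(f"12-node efficiency: {eff_12:.1%}")
```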
No custom kernels were used; the run relied on stock vLLM v0.18.0. GDN kernel optimizations are expected to land upstream.