Qwen3.5 is now available in a version optimized for inference on NVIDIA hardware, thanks to NVFP4 quantization via Model Optimizer.
## Model Details

- Checkpoint size: ~224 GB
- Active parameters: 17 billion
- License: Apache 2.0
The model is based on a Mixture of Experts (MoE) architecture with 512 experts, 10 of which are active per token. It supports a native context of 262K tokens, extensible to over 1 million, and is multimodal (text, images, video) with support for 201 languages.
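The MoE arithmetic above is easy to sanity-check: only a small fraction of experts (and thus parameters) participate in each token. A quick back-of-the-envelope sketch, using only the numbers stated in this article (the exact shared/expert parameter split is not published, so the compute ratio is an approximation):

```python
# Back-of-the-envelope MoE arithmetic for Qwen3.5, using figures from the text.
TOTAL_PARAMS = 397e9    # "397B" in the checkpoint name
ACTIVE_PARAMS = 17e9    # 17B active parameters per token
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 10

expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS   # experts consulted per token
compute_ratio = ACTIVE_PARAMS / TOTAL_PARAMS     # rough FLOPs vs. a dense 397B model

print(f"Experts used per token: {expert_fraction:.1%}")  # ~2.0%
print(f"Compute vs. dense:      {compute_ratio:.1%}")    # ~4.3%
```

This is why a 397B-parameter checkpoint can decode at roughly the cost of a ~17B dense model, even though all 224 GB of weights must still fit in VRAM.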
## Installation

SGLang must be installed from a dedicated branch:

```shell
git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0
```
## Launch

Example launch on B200/B300 (TP=4):

```shell
python3 -m sglang.launch_server \
  --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 4 \
  --context-length 262144 \
  --reasoning-parser qwen3
```

For RTX PRO 6000s, set `--tp 8` to avoid out-of-memory (OOM) errors.
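Once the server is up, it exposes an OpenAI-compatible API (port 30000 is SGLang's default). A minimal client sketch, assuming the default host/port and the model path used above:

```python
# Minimal client sketch for the SGLang server launched above.
# Assumes the default port (30000) and the model path from the launch command.
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "vincentzed-hf/Qwen3.5-397B-A17B-NVFP4") -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def send(payload: dict, host: str = "http://localhost:30000") -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Explain NVFP4 quantization in one sentence.")
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment once the server is running
```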
## Speculative Decoding (Experimental)

Qwen3.5 integrates a Multi-Token Prediction (MTP) mechanism. It can pay off at low concurrency, when the GPUs are otherwise underutilized:

```shell
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```
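The idea behind the flags above is a draft-and-verify loop: cheap draft heads propose several tokens, the full model verifies them, and agreeing prefixes are accepted in one step. A toy illustration with stand-in functions (this is not SGLang's implementation; `draft` and `target_next` are hypothetical deterministic models):

```python
# Toy draft-and-verify loop illustrating speculative decoding.
# `draft` and `target_next` are hypothetical stand-ins for the MTP draft
# heads and the full model; tokens are just integers here.

def draft(prefix: list[int], k: int = 3) -> list[int]:
    """Cheap drafter: propose k candidate next tokens."""
    return [prefix[-1] + i + 1 for i in range(k)]

def target_next(prefix: list[int]) -> int:
    """Expensive target model: the one 'true' next token."""
    return prefix[-1] + 1

def speculative_step(prefix: list[int], k: int = 3) -> list[int]:
    """Verify the drafted tokens; keep the agreeing prefix, correct on mismatch."""
    proposed = draft(prefix, k)
    accepted: list[int] = []
    for tok in proposed:
        if tok == target_next(prefix + accepted):
            accepted.append(tok)  # draft agreed with the target: accept
        else:
            # On mismatch, take the target's token and stop this round.
            accepted.append(target_next(prefix + accepted))
            break
    return accepted

print(speculative_step([0], k=3))  # drafter always agrees here -> [1, 2, 3]
```

With a good drafter, each verification pass of the big model yields several tokens instead of one, which is why `--speculative-num-draft-tokens` trades extra draft work for fewer target-model passes.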
## Hardware Requirements

| Config | GPUs | VRAM/GPU | Throughput |
|---|---|---|---|
| B300 TP=4 | 4x B300 | 288 GB | ~120 tok/s |
| B200 TP=4 | 4x B200 | 192 GB | n/a |
| RTX PRO 6000 TP=8 | 8x RTX PRO 6000 | 96 GB | n/a |
The default context is 262K tokens. If you hit OOM errors, lower `--context-length`, but try to keep it at 128K or above to preserve reasoning quality.
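Most of the context-length memory cost is the KV cache, which grows linearly with context. A rough sizing sketch; the layer and head dimensions below are placeholders, not published Qwen3.5 config values, so substitute the real numbers from the model's `config.json`:

```python
# Rough KV-cache sizing to reason about context length vs. OOM.
# The model dimensions below are illustrative placeholders, NOT Qwen3.5's
# actual config; bytes_per_elem=1 assumes an FP8 KV cache.

def kv_cache_gib(context_len: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 1) -> float:
    """Per-sequence KV-cache size in GiB (factor 2 covers keys + values)."""
    total = 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return total / 2**30

# Hypothetical 60-layer model with 8 KV heads of dim 128, one sequence:
for ctx in (131072, 262144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx, 60, 8, 128):.1f} GiB")
```

Halving the context halves this cost per sequence, which is why trimming `--context-length` is the first lever to pull against OOM before dropping to a smaller batch or more GPUs.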