Qwen3.5 is now available in a version optimized for inference on NVIDIA hardware, thanks to NVFP4 quantization via Model Optimizer.

Model Details

  • Checkpoint: ~224GB
  • Active parameters: 17 billion
  • License: Apache 2.0

The model is based on a Mixture of Experts (MoE) architecture with 512 experts, 10 of which are active per token. It supports a native context of 262K tokens, extensible to over 1 million, and is multimodal (text, images, video) with support for 201 languages.

Installation

For installation, SGLang is required from a specific branch:

git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0

Launch

Launch example on B200/B300 (TP=4):

python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 4 \
--context-length 262144 \
--reasoning-parser qwen3

For RTX PRO 6000s, set --tp 8 to avoid out-of-memory (OOM) errors.

Speculative Decoding (Experimental)

Qwen3.5 integrates a Multi-Token Prediction mechanism. It can be useful with few concurrent users:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Hardware Requirements

Config GPUs VRAM/GPU Throughput
B300 TP=4 4x B300 288 GB ~120 tok/s
B200 TP=4 4x B200 192 GB
RTX PRO 6000 TP=8 8x RTX PRO 6000 96 GB

The default context is 262K tokens. If you encounter OOM errors, reduce the value, trying to keep it at least at 128K to preserve reasoning quality.