Qwen3.5 is now available in a version optimized for inference on NVIDIA hardware, thanks to NVFP4 quantization via Model Optimizer.

Model Details

  • Checkpoint size: ~224 GB
  • Active parameters: 17 billion
  • License: Apache 2.0

The model uses a Mixture of Experts (MoE) architecture with 512 experts, 10 of which are active per token. It supports a native context of 262K tokens (extensible to over 1 million), is multimodal (text, images, video), and covers 201 languages.
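
As a quick sanity check on these figures, the arithmetic below (a back-of-the-envelope sketch, not an official spec) shows how the active-compute fraction and the NVFP4 checkpoint size follow from the numbers above:

# Back-of-the-envelope arithmetic from the figures above.
total_params  = 397e9   # from the model name: Qwen3.5-397B-A17B
active_params = 17e9
experts_total, experts_active = 512, 10

print(f"active experts: {experts_active / experts_total:.1%}")  # ~2.0%
print(f"active params:  {active_params / total_params:.1%}")    # ~4.3%

# NVFP4 stores weights at roughly 4 bits each:
# 397e9 params * 0.5 bytes ~= 198.5 GB, which lands near the ~224 GB
# checkpoint once scales and non-quantized layers are added.
print(f"4-bit weights: {total_params * 4 / 8 / 1e9:.1f} GB")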

Installation

Installation requires SGLang from a dedicated branch:

git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0
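
A quick way to confirm that the editable install and the pinned transformers version are what the runtime will actually see (assuming a standard Python environment where both packages expose a version string):

python3 -c "import sglang, transformers; print(sglang.__version__, transformers.__version__)"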

Launch

Launch example on B200/B300 (TP=4):

python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 4 \
--context-length 262144 \
--reasoning-parser qwen3

For RTX PRO 6000s, set --tp 8 to avoid out-of-memory (OOM) errors.
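
Once the server is up it exposes an OpenAI-compatible API (on port 30000 by default in SGLang). A minimal smoke test with the standard openai Python client, assuming it is installed:

from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

resp = client.chat.completions.create(
    model="vincentzed-hf/Qwen3.5-397B-A17B-NVFP4",
    messages=[{"role": "user", "content": "Briefly explain NVFP4 quantization."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)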

Speculative Decoding (Experimental)

Qwen3.5 ships with a Multi-Token Prediction (MTP) head, which SGLang can use for speculative decoding. The speedup is most noticeable with few concurrent users:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
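
Whether those draft settings actually pay off depends on the workload, so it is worth benchmarking with speculative decoding on and off. A sketch using SGLang's bundled serving benchmark (flags taken from upstream SGLang; this branch may differ):

# Low-concurrency run, where speculative decoding should help most.
python3 -m sglang.bench_serving --backend sglang --num-prompts 8 --request-rate 1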

Hardware Requirements

Config                GPUs               VRAM/GPU   Throughput
B300 TP=4             4x B300            288 GB     ~120 tok/s
B200 TP=4             4x B200            192 GB     n/a
RTX PRO 6000 TP=8     8x RTX PRO 6000    96 GB      n/a

The default context is 262K tokens. If you run into OOM errors, reduce --context-length, but try to keep it at 128K or above to preserve reasoning quality.
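
For example, to cap the context at 128K tokens (131072), keeping the rest of the B200/B300 launch command unchanged:

python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 4 \
--context-length 131072 \
--reasoning-parser qwen3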