Qwen3.5 is now available in a version optimized for inference on NVIDIA hardware, thanks to NVFP4 quantization via Model Optimizer.
## Model Details

- Checkpoint size: ~224 GB
- Active parameters: 17 billion
- License: Apache 2.0
The model is based on a Mixture of Experts (MoE) architecture with 512 experts, 10 of which are active per token. It supports a native context of 262K tokens, extensible to over 1 million, and is multimodal (text, images, video) with support for 201 languages.
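The MoE arithmetic above is easy to sanity-check: only a small fraction of experts (and thus parameters) participate in each token. A quick back-of-the-envelope sketch, using only the numbers stated in this article (the exact shared/expert parameter split is not published, so the compute ratio is an approximation):

```python
# Back-of-the-envelope MoE arithmetic for Qwen3.5, using figures from the text.
TOTAL_PARAMS = 397e9    # "397B" in the checkpoint name
ACTIVE_PARAMS = 17e9    # 17B active parameters per token
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 10

expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS   # experts consulted per token
compute_ratio = ACTIVE_PARAMS / TOTAL_PARAMS     # rough FLOPs vs. a dense 397B model

print(f"Experts used per token: {expert_fraction:.1%}")  # ~2.0%
print(f"Compute vs. dense:      {compute_ratio:.1%}")    # ~4.3%
```

This is why a 397B-parameter checkpoint can decode at roughly the cost of a ~17B dense model, even though all 224 GB of weights must still fit in VRAM.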
## Installation

SGLang must be installed from a dedicated branch:

```shell
git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0
```
## Launch

Example launch on B200/B300 (TP=4):

```shell
python3 -m sglang.launch_server \
  --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 4 \
  --context-length 262144 \
  --reasoning-parser qwen3
```

For RTX PRO 6000s, set `--tp 8` to avoid out-of-memory (OOM) errors.
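Once the server is up, it exposes an OpenAI-compatible API (port 30000 is SGLang's default). A minimal client sketch, assuming the default host/port and the model path used above:

```python
# Minimal client sketch for the SGLang server launched above.
# Assumes the default port (30000) and the model path from the launch command.
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "vincentzed-hf/Qwen3.5-397B-A17B-NVFP4") -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def send(payload: dict, host: str = "http://localhost:30000") -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Explain NVFP4 quantization in one sentence.")
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment once the server is running
```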
## Speculative Decoding (Experimental)

Qwen3.5 integrates a Multi-Token Prediction (MTP) mechanism. It can pay off at low concurrency, when the GPUs are otherwise underutilized:

```shell
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```
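The idea behind the flags above is a draft-and-verify loop: cheap draft heads propose several tokens, the full model verifies them, and agreeing prefixes are accepted in one step. A toy illustration with stand-in functions (this is not SGLang's implementation; `draft` and `target_next` are hypothetical deterministic models):

```python
# Toy draft-and-verify loop illustrating speculative decoding.
# `draft` and `target_next` are hypothetical stand-ins for the MTP draft
# heads and the full model; tokens are just integers here.

def draft(prefix: list[int], k: int = 3) -> list[int]:
    """Cheap drafter: propose k candidate next tokens."""
    return [prefix[-1] + i + 1 for i in range(k)]

def target_next(prefix: list[int]) -> int:
    """Expensive target model: the one 'true' next token."""
    return prefix[-1] + 1

def speculative_step(prefix: list[int], k: int = 3) -> list[int]:
    """Verify the drafted tokens; keep the agreeing prefix, correct on mismatch."""
    proposed = draft(prefix, k)
    accepted: list[int] = []
    for tok in proposed:
        if tok == target_next(prefix + accepted):
            accepted.append(tok)  # draft agreed with the target: accept
        else:
            # On mismatch, take the target's token and stop this round.
            accepted.append(target_next(prefix + accepted))
            break
    return accepted

print(speculative_step([0], k=3))  # drafter always agrees here -> [1, 2, 3]
```

With a good drafter, each verification pass of the big model yields several tokens instead of one, which is why `--speculative-num-draft-tokens` trades extra draft work for fewer target-model passes.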
## Hardware Requirements

| Config | GPUs | VRAM/GPU | Throughput |
|---|---|---|---|
| B300 TP=4 | 4x B300 | 288 GB | ~120 tok/s |
| B200 TP=4 | 4x B200 | 192 GB | n/a |
| RTX PRO 6000 TP=8 | 8x RTX PRO 6000 | 96 GB | n/a |
The default context is 262K tokens. If you hit OOM errors, lower `--context-length`, but try to keep it at 128K or above to preserve reasoning quality.
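Most of the context-length memory cost is the KV cache, which grows linearly with context. A rough sizing sketch; the layer and head dimensions below are placeholders, not published Qwen3.5 config values, so substitute the real numbers from the model's `config.json`:

```python
# Rough KV-cache sizing to reason about context length vs. OOM.
# The model dimensions below are illustrative placeholders, NOT Qwen3.5's
# actual config; bytes_per_elem=1 assumes an FP8 KV cache.

def kv_cache_gib(context_len: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 1) -> float:
    """Per-sequence KV-cache size in GiB (factor 2 covers keys + values)."""
    total = 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return total / 2**30

# Hypothetical 60-layer model with 8 KV heads of dim 128, one sequence:
for ctx in (131072, 262144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx, 60, 8, 128):.1f} GiB")
```

Halving the context halves this cost per sequence, which is why trimming `--context-length` is the first lever to pull against OOM before dropping to a smaller batch or more GPUs.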