AI-RADAR.IT · AI-RADAR.NET · AI-RADAR.TECH

News & analysis on local LLMs, stack & on-prem hardware.

📁 LLM AI generated

Qwen3.5 NVFP4: Quantized Inference on NVIDIA Blackwell

Published on 2026-02-17 16:39 ℹ️ LocalLLaMA 📰 Read the original source article →

🏷️ Hardware

Qwen3.5 NVFP4: Inference Quantizzata su NVIDIA Blackwell

Qwen3.5 is now available in a version optimized for inference on NVIDIA hardware, thanks to NVFP4 quantization via Model Optimizer.

Model Details

Checkpoint: ~224GB
Active parameters: 17 billion
License: Apache 2.0

The model is based on a Mixture of Experts (MoE) architecture with 512 experts, 10 of which are active per token. It supports a native context of 262K tokens, extensible to over 1 million, and is multimodal (text, images, video) with support for 201 languages.

Installation

For installation, SGLang is required from a specific branch:

git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0

Launch

Launch example on B200/B300 (TP=4):

python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 4 \
--context-length 262144 \
--reasoning-parser qwen3

For RTX PRO 6000s, set --tp 8 to avoid out-of-memory (OOM) errors.

Speculative Decoding (Experimental)

Qwen3.5 integrates a Multi-Token Prediction mechanism. It can be useful with few concurrent users:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Hardware Requirements

Config	GPUs	VRAM/GPU	Throughput
B300 TP=4	4x B300	288 GB	~120 tok/s
B200 TP=4	4x B200	192 GB	—
RTX PRO 6000 TP=8	8x RTX PRO 6000	96 GB	—

The default context is 262K tokens. If you encounter OOM errors, reduce the value, trying to keep it at least at 128K to preserve reasoning quality.

AI-Radar Takeaway

Qwen3.5 NVFP4 is now available, quantized with NVIDIA's Model Optimizer. The checkpoint weighs approximately 224GB with 17 billion active parameters. It is released under the Apache 2.0 license. It requires SGLang and provides launch examples on B200/B300 and RTX PRO 6000.

🤖 Ask AI about this

Want to dive deeper? Read the full article from the source:

📖 READ THE ORIGINAL ARTICLE

💻 Need GPU Cloud Infrastructure?

For running LLM inference, training models, or testing hardware configurations, check out this platform:

Railway Cloud Infrastructure

Modern cloud platform with instant deployments. Deploy from GitHub in seconds with automatic HTTPS, databases, and monitoring. Perfect for web apps, APIs, and LLM inference services.

✓ GitHub integration ✓ Auto HTTPS ✓ Simple pricing

🔗 This is an affiliate link - we may earn a commission at no extra cost to you.

AI-RADAR NEWSLETTER

Stay ahead — get AI signals in your inbox

Daily or weekly digest of the most important AI news. 160+ readers, no spam.

💬 Comments (0)

🔒 Log in or register to comment on articles.

No comments yet. Be the first to comment!

🔍 Continue Exploring

Explore LLM On-Premise

Complete guide to running AI models locally: hardware, stack, and privacy.

NVIDIA and Qwen: Efficient Inference with NVFP4 Quantization

NVIDIA and Qwen: Efficient Inference with NVFP4 Quantization

NVIDIA has released the Qwen3.6-35B-A3B-NVFP4 model, a quantized version of Alibaba's Qwen3.6-35B-A3B. Leveraging NVFP4 Post Training Quantization, the model re

Qwen3-Coder-Next: NVFP4 Quantization Released (45GB)

Qwen3-Coder-Next: NVFP4 Quantization Released (45GB)

A quantized version of Qwen3-Coder-Next in NVFP4 format is now available, weighing 45GB. The model was calibrated using the ultrachat_200k dataset, with a 1.63%

GPT-OSS 120B: Uncensored Open-Source Model for Local Inference

GPT-OSS 120B: Uncensored Open-Source Model for Local Inference

An uncensored version of GPT-OSS 120B is available, an open-source language model with 117 billion total parameters and a context window of 128K. The model is i

Qwen 0.5B: Local fine-tuning for task automation

Qwen 0.5B: Local fine-tuning for task automation

A developer has fine-tuned the Qwen2-0.5B model to automate tasks via natural language, generating execution plans (CLI commands and hotkeys). Inference occurs

NVIDIA Gemma 4-26B-A4B-NVFP4: Optimization and On-Premise Performance

NVIDIA Gemma 4-26B-A4B-NVFP4: Optimization and On-Premise Performance

NVIDIA has released a 4-bit quantized version of the Gemma 2B model, named Gemma 4-26B-A4B-NVFP4, optimized for inference on local hardware. With a size of 18.8

More in LLM

Step 3.7 Flash with Claude-style prompts beats Hermes on code: a wake-up call for local LLM deployments

Mistral AI: The open source challenge to OpenAI's dominance

Google's TabFM: zero-shot tabular predictions without training

Longcat 2: INT8 and FP8 quantization now available for on-prem deployment

Why AI Needs a Glossary (and What It Has to Do with On-Premise Deployment)

Smartschool and AI for admission tests: why teaching is harder than answering

→ View all in LLM →

AI-Radar LLM On-Premise

Complete guide to running AI models locally: hardware, stack, privacy, and reference architectures.

👥 Join 160+ AI explorers

A free community of developers, engineers and AI enthusiasts following local AI daily.

Register free → Already a member? Log in