AdaLLM is a project focused on enabling high-performance NVFP4 inference on Ada Lovelace GPUs (sm_89 architecture), such as the RTX 4090. The main goal is to provide a pure NVFP4 fast path, with FP8 KV cache and a custom FP8 decode kernel, without falling back to FP16.
## Key Features
- NVFP4-first runtime for Ada GPUs, tested on RTX 4090, with end-to-end FP8 KV cache.
- Custom Triton FP8 decode kernel; prefill uses FlashAttention (varlen).
- No FP16 fallback for decoding. If the FP8 kernel fails, the system throws an error instead of silently switching to FP16.
- Tensor parallelism (NCCL) + CUDA graphs for decoding (also supports eager mode).
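The no-fallback policy above can be sketched as a strict dispatch wrapper. This is illustrative only; `decode_step` and the exception handling are hypothetical stand-ins, not AdaLLM's actual API:

```python
# Illustrative sketch of a strict "no FP16 fallback" decode policy.
# `fp8_decode_kernel` is a hypothetical stand-in for the Triton FP8 kernel.
def decode_step(fp8_decode_kernel, *args):
    try:
        return fp8_decode_kernel(*args)
    except Exception as e:
        # By design: surface the failure loudly instead of silently
        # re-running the step with an FP16 kernel.
        raise RuntimeError(f"FP8 decode kernel failed: {e}") from e
```

The point of the design is that a shape or dtype the kernel cannot handle shows up as an immediate error rather than as a quiet performance and memory regression.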
## Benchmarks (RTX 4090)
The following tables show AdaLLM benchmarks on an RTX 4090 with Qwen3-8B-NVFP4 and Gemma3-27B-it-NVFP4 models.
### Qwen3-8B-NVFP4
| batch | total tokens | seconds | tok/s | peak GB |
|---|---|---|---|---|
| 1 | 128 | 3.3867 | 37.79 | 7.55 |
| 2 | 256 | 3.5471 | 72.17 | 7.55 |
| 4 | 512 | 3.4392 | 148.87 | 7.55 |
| 8 | 1024 | 3.4459 | 297.16 | 7.56 |
| 16 | 2048 | 4.3636 | 469.34 | 7.56 |
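As a sanity check, the `tok/s` column is just `total tokens / seconds`; reproducing the batch-16 row:

```python
# Recompute tok/s from the batch-16 row of the Qwen3-8B-NVFP4 table.
total_tokens = 2048
seconds = 4.3636
tok_per_s = total_tokens / seconds
print(round(tok_per_s, 2))  # 469.34, matching the table
```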
### Gemma3-27B-it-NVFP4
| batch | total tokens | seconds | tok/s | peak GB |
|---|---|---|---|---|
| 1 | 128 | 9.3982 | 13.62 | 19.83 |
| 2 | 256 | 9.5545 | 26.79 | 19.83 |
| 4 | 512 | 9.5344 | 53.70 | 19.84 |
For Qwen3-8B-NVFP4, peak VRAM was roughly 2.4x lower than a Qwen3-8B FP16 baseline, at the cost of roughly 20-25% throughput.
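A back-of-the-envelope sketch of where the saving comes from, assuming NVFP4 stores 4-bit values with one 8-bit (FP8) scale per 16-element block (an assumption about the format, and weights only; the FP8 KV cache, activations, and CUDA-graph buffers make up the rest of the measured peak):

```python
# Rough weight-memory estimate at different precisions (weights only).
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

P = 8e9                                # 8B parameters
fp16_gb = weight_gb(P, 16)             # 16.0 GB at FP16
# Assumed NVFP4 layout: 4-bit values + one FP8 scale per 16-element block
nvfp4_gb = weight_gb(P, 4 + 8 / 16)    # 4.5 GB
print(fp16_gb, nvfp4_gb)
```

The weights-only ratio here is larger than the measured 2.4x because the measured peak also includes KV cache and runtime buffers on both sides.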
## Quickstart

```bash
pip install git+https://github.com/BenChaliah/NVFP4-on-4090-vLLM.git
adallm serve nvidia/Qwen3-8B-NVFP4
```
Setting `NVFP4_FP8=1` is optional and enables the FP8 GEMM path. With `NVFP4_FP8=0`, the FP8 KV cache and the FP8 decode kernel are still used; the difference is compute precision, not VRAM.
## Supported Models
- nvidia/Qwen3-8B-NVFP4
- BenChaliah/Gemma3-27B-it-NVFP4
- Qwen3 MoE variants are supported but still slow (see the README for MoE notes).
## Limitations
- MoE routing and offload paths are not yet fully optimized.
- Only NVFP4 weights, no FP16 fallback for decoding by design.
- Targets Ada Lovelace (sm_89); validation on other Ada cards is still needed.