AdaLLM: NVFP4-first inference on RTX 4090 with FP8 cache

AdaLLM is an NVFP4-first inference runtime for Ada Lovelace GPUs (compute capability sm_89) such as the RTX 4090. Its goal is a pure NVFP4 fast path, with an FP8 KV cache and a custom FP8 decode kernel, that never falls back to FP16.

Key Features

NVFP4-first runtime for Ada GPUs, tested on RTX 4090, with end-to-end FP8 KV cache.
Custom Triton FP8 decode kernel; prefill uses FlashAttention (varlen).
No FP16 fallback for decoding. If the FP8 kernel fails, the system raises an error instead of silently switching to FP16.
Tensor parallelism (NCCL) + CUDA graphs for decoding (also supports eager mode).
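
The "no FP16 fallback" behaviour is a fail-fast pattern. The sketch below is illustrative only and is not taken from the AdaLLM code base: FP8DecodeError and decode_step are hypothetical names, and fp8_kernel stands in for whatever callable actually launches the Triton kernel.

```python
class FP8DecodeError(RuntimeError):
    """Raised when the FP8 decode path fails (hypothetical name)."""


def decode_step(fp8_kernel, *args, **kwargs):
    """Run one decode step through the FP8 kernel, failing loudly on error.

    `fp8_kernel` is any callable standing in for the real kernel launch.
    There is deliberately no FP16 branch: a failure surfaces to the caller.
    """
    try:
        return fp8_kernel(*args, **kwargs)
    except Exception as exc:
        # Re-raise with context instead of retrying in another precision.
        raise FP8DecodeError(
            "FP8 decode kernel failed; refusing to fall back to FP16"
        ) from exc
```

Chaining the original exception with "from exc" keeps the underlying kernel error visible in the traceback, so a failure cannot be mistaken for a slow but working FP16 path.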

Benchmarks (RTX 4090)

The following tables show AdaLLM benchmarks on an RTX 4090 with Qwen3-8B-NVFP4 and Gemma3-27B-it-NVFP4 models.

Qwen3-8B-NVFP4

batch	total tokens	seconds	tok/s	peak GB
1	128	3.3867	37.79	7.55
2	256	3.5471	72.17	7.55
4	512	3.4392	148.87	7.55
8	1024	3.4459	297.16	7.56
16	2048	4.3636	469.34	7.56

Gemma3-27B-it-NVFP4

batch	total tokens	seconds	tok/s	peak GB
1	128	9.3982	13.62	19.83
2	256	9.5545	26.79	19.83
4	512	9.5344	53.70	19.84
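
The tok/s column is total tokens divided by seconds, which makes it easy to check how well throughput scales with batch size. The sketch below recomputes it from the table rows; scaling_efficiency is a helper defined here for illustration, with 1.0 meaning perfectly linear scaling relative to batch 1.

```python
# (batch, total tokens, seconds) copied from the tables above.
QWEN3_8B = [
    (1, 128, 3.3867),
    (2, 256, 3.5471),
    (4, 512, 3.4392),
    (8, 1024, 3.4459),
    (16, 2048, 4.3636),
]
GEMMA3_27B = [
    (1, 128, 9.3982),
    (2, 256, 9.5545),
    (4, 512, 9.5344),
]


def throughput(total_tokens, seconds):
    """Tokens per second, as reported in the tok/s column."""
    return total_tokens / seconds


def scaling_efficiency(rows):
    """Map batch size to measured tok/s divided by (batch * batch-1 tok/s)."""
    base = throughput(rows[0][1], rows[0][2])
    return {b: throughput(t, s) / (b * base) for b, t, s in rows}


for name, rows in (("Qwen3-8B-NVFP4", QWEN3_8B), ("Gemma3-27B-it-NVFP4", GEMMA3_27B)):
    for batch, eff in scaling_efficiency(rows).items():
        print(f"{name} batch={batch}: efficiency {eff:.2f}")
```

On these numbers Qwen3-8B-NVFP4 scales almost linearly up to batch 8 and drops to roughly 0.78 of linear at batch 16, where the run time rises from about 3.4 s to 4.4 s.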

Compared with Qwen3-8B FP16 baselines, Qwen3-8B-NVFP4 used roughly 2.4x less peak VRAM, at the cost of about 20-25% lower throughput.
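
As a back-of-the-envelope check, the 2.4x figure can be combined with the measured NVFP4 peak to estimate the FP16 baseline's peak VRAM. This is an implied value, not a measurement from the source, and it assumes the ratio refers to the batch-1 peak of 7.55 GB.

```python
RTX_4090_VRAM_GB = 24.0  # memory capacity of the RTX 4090
NVFP4_PEAK_GB = 7.55     # Qwen3-8B-NVFP4 peak at batch 1, from the table
VRAM_RATIO = 2.4         # "approximately 2.4x lower", from the text


def implied_fp16_peak_gb(nvfp4_peak_gb, ratio):
    """Estimate the FP16 baseline peak implied by the reported ratio."""
    return nvfp4_peak_gb * ratio


fp16_peak = implied_fp16_peak_gb(NVFP4_PEAK_GB, VRAM_RATIO)
print(f"implied FP16 peak: ~{fp16_peak:.1f} GB")
print(f"headroom with NVFP4: ~{RTX_4090_VRAM_GB - NVFP4_PEAK_GB:.1f} GB")
print(f"headroom with FP16:  ~{RTX_4090_VRAM_GB - fp16_peak:.1f} GB")
```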

Quickstart

pip install git+https://github.com/BenChaliah/NVFP4-on-4090-vLLM.git
adallm serve nvidia/Qwen3-8B-NVFP4

Setting export NVFP4_FP8=1 is optional and enables the FP8 GEMM path. Compared with NVFP4_FP8=0, the difference is in compute precision, not VRAM: the FP8 KV cache and the FP8 decode kernel are used in both cases.
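
For scripted launches, the same command and environment variable can be set from Python. This is a minimal sketch: build_launch and launch are hypothetical helper names, and only the adallm serve command and the NVFP4_FP8 variable come from the quickstart above.

```python
import os
import subprocess


def build_launch(model, fp8_gemm=False):
    """Return (command, environment) for `adallm serve <model>`.

    NVFP4_FP8 is set to "1" to enable the FP8 GEMM path, "0" otherwise.
    """
    env = dict(os.environ)
    env["NVFP4_FP8"] = "1" if fp8_gemm else "0"
    return ["adallm", "serve", model], env


def launch(model, fp8_gemm=False):
    """Start the server in the foreground; raises if the process exits non-zero."""
    cmd, env = build_launch(model, fp8_gemm)
    return subprocess.run(cmd, env=env, check=True)
```

Calling launch("nvidia/Qwen3-8B-NVFP4", fp8_gemm=True) is then equivalent to exporting NVFP4_FP8=1 and running the quickstart command by hand.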

Supported Models

nvidia/Qwen3-8B-NVFP4
BenChaliah/Gemma3-27B-it-NVFP4
Qwen3 MoE variants are supported, but still slow (see README for MoE notes).

Limitations

MoE routing and offload paths are not yet fully optimized.
Only NVFP4 weights are supported; by design, decoding has no FP16 fallback.
Targets Ada Lovelace (sm_89) and has been tested on the RTX 4090; other Ada cards still need validation.
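
Because the runtime targets sm_89 specifically, it can help to confirm the GPU's compute capability before installing. The sketch below assumes PyTorch with CUDA support is available; is_ada and check_current_gpu are hypothetical helpers and not part of AdaLLM.

```python
ADA_CAPABILITY = (8, 9)  # compute capability sm_89 (Ada Lovelace)


def is_ada(capability):
    """True if a (major, minor) compute capability pair is sm_89."""
    return tuple(capability) == ADA_CAPABILITY


def check_current_gpu(device=0):
    """Return the GPU name if it reports sm_89, otherwise raise RuntimeError."""
    import torch  # imported lazily so is_ada() works without PyTorch installed

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device is visible to PyTorch")
    cap = torch.cuda.get_device_capability(device)
    name = torch.cuda.get_device_name(device)
    if not is_ada(cap):
        raise RuntimeError(f"{name} reports sm_{cap[0]}{cap[1]}, expected sm_89")
    return name
```

An RTX 4090 reports (8, 9); Ampere and Hopper cards report other capability pairs and would be rejected by this check.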

For those evaluating on-premise deployments, there are trade-offs to consider. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.
