AdaLLM is a project focused on enabling high-performance NVFP4 inference on Ada Lovelace GPUs (sm_89 architecture), such as the RTX 4090. The main goal is to provide a pure NVFP4 fast path, with FP8 KV cache and a custom FP8 decode kernel, without falling back to FP16.
## Key Features
- NVFP4-first runtime for Ada GPUs, tested on RTX 4090, with end-to-end FP8 KV cache.
- Custom Triton FP8 decode kernel; prefill uses FlashAttention (varlen).
- No FP16 fallback for decoding. If the FP8 kernel fails, the system throws an error instead of silently switching to FP16.
- Tensor parallelism (NCCL) + CUDA graphs for decoding (also supports eager mode).
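The no-fallback policy above can be sketched as a strict dispatch wrapper. This is illustrative only; `decode_step` and the exception handling are hypothetical stand-ins, not AdaLLM's actual API:

```python
# Illustrative sketch of a strict "no FP16 fallback" decode policy.
# `fp8_decode_kernel` is a hypothetical stand-in for the Triton FP8 kernel.
def decode_step(fp8_decode_kernel, *args):
    try:
        return fp8_decode_kernel(*args)
    except Exception as e:
        # By design: surface the failure loudly instead of silently
        # re-running the step with an FP16 kernel.
        raise RuntimeError(f"FP8 decode kernel failed: {e}") from e
```

The point of the design is that a shape or dtype the kernel cannot handle shows up as an immediate error rather than as a quiet performance and memory regression.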
## Benchmarks (RTX 4090)
The following tables show AdaLLM benchmarks on an RTX 4090 with Qwen3-8B-NVFP4 and Gemma3-27B-it-NVFP4 models.
### Qwen3-8B-NVFP4
| batch | total tokens | seconds | tok/s | peak GB |
|---|---|---|---|---|
| 1 | 128 | 3.3867 | 37.79 | 7.55 |
| 2 | 256 | 3.5471 | 72.17 | 7.55 |
| 4 | 512 | 3.4392 | 148.87 | 7.55 |
| 8 | 1024 | 3.4459 | 297.16 | 7.56 |
| 16 | 2048 | 4.3636 | 469.34 | 7.56 |
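As a sanity check, the `tok/s` column is just `total tokens / seconds`; reproducing the batch-16 row:

```python
# Recompute tok/s from the batch-16 row of the Qwen3-8B-NVFP4 table.
total_tokens = 2048
seconds = 4.3636
tok_per_s = total_tokens / seconds
print(round(tok_per_s, 2))  # 469.34, matching the table
```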
### Gemma3-27B-it-NVFP4
| batch | total tokens | seconds | tok/s | peak GB |
|---|---|---|---|---|
| 1 | 128 | 9.3982 | 13.62 | 19.83 |
| 2 | 256 | 9.5545 | 26.79 | 19.83 |
| 4 | 512 | 9.5344 | 53.70 | 19.84 |
For Qwen3-8B-NVFP4, peak VRAM was roughly 2.4x lower than a Qwen3-8B FP16 baseline, at the cost of roughly 20-25% throughput.
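A back-of-the-envelope sketch of where the saving comes from, assuming NVFP4 stores 4-bit values with one 8-bit (FP8) scale per 16-element block (an assumption about the format, and weights only; the FP8 KV cache, activations, and CUDA-graph buffers make up the rest of the measured peak):

```python
# Rough weight-memory estimate at different precisions (weights only).
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

P = 8e9                                # 8B parameters
fp16_gb = weight_gb(P, 16)             # 16.0 GB at FP16
# Assumed NVFP4 layout: 4-bit values + one FP8 scale per 16-element block
nvfp4_gb = weight_gb(P, 4 + 8 / 16)    # 4.5 GB
print(fp16_gb, nvfp4_gb)
```

The weights-only ratio here is larger than the measured 2.4x because the measured peak also includes KV cache and runtime buffers on both sides.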
## Quickstart

```bash
pip install git+https://github.com/BenChaliah/NVFP4-on-4090-vLLM.git
adallm serve nvidia/Qwen3-8B-NVFP4
```
Setting `NVFP4_FP8=1` is optional and enables the FP8 GEMM path. With `NVFP4_FP8=0`, the FP8 KV cache and the FP8 decode kernel are still used; the difference is compute precision, not VRAM.
## Supported Models
- nvidia/Qwen3-8B-NVFP4
- BenChaliah/Gemma3-27B-it-NVFP4
- Qwen3 MoE variants are supported but still slow (see the README for MoE notes).
## Limitations
- MoE routing and offload paths are not yet fully optimized.
- Only NVFP4 weights, no FP16 fallback for decoding by design.
- Targets Ada Lovelace (sm_89); validation on other Ada cards is still needed.