AdaLLM is a project focused on enabling high-performance NVFP4 inference on Ada Lovelace GPUs (sm_89 architecture), such as the RTX 4090. The main goal is to provide a pure NVFP4 fast path, with an FP8 KV cache and a custom FP8 decode kernel, and no fallback to FP16.

Key Features

  • NVFP4-first runtime for Ada GPUs, tested on RTX 4090, with end-to-end FP8 KV cache.
  • Custom Triton FP8 decode kernel; prefill uses FlashAttention (varlen).
  • No FP16 fallback for decoding. If the FP8 kernel fails, the runtime raises an error instead of silently switching to FP16.
  • Tensor parallelism (NCCL) + CUDA graphs for decoding (also supports eager mode).
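
The strict no-fallback policy above can be sketched as follows. This is an illustrative sketch only; the names (Fp8DecodeError, decode_attention, run_fp8_decode) are hypothetical, not AdaLLM's actual API.

```python
class Fp8DecodeError(RuntimeError):
    """Raised when the FP8 decode kernel cannot run."""


def run_fp8_decode(query, kv_cache):
    # Placeholder for the Triton FP8 decode kernel launch.
    return ("fp8_decode", query, kv_cache)


def decode_attention(query, kv_cache, fp8_kernel_available: bool):
    if not fp8_kernel_available:
        # Strict policy: surface the failure instead of silently
        # downcasting the FP8 KV cache back to FP16.
        raise Fp8DecodeError(
            "FP8 decode kernel unavailable; refusing FP16 fallback"
        )
    return run_fp8_decode(query, kv_cache)
```

The design choice is deliberate: a silent FP16 fallback would hide both the performance regression and the doubled KV-cache memory, so failing loudly is preferable.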

Benchmarks (RTX 4090)

The following tables show AdaLLM benchmarks on an RTX 4090 with Qwen3-8B-NVFP4 and Gemma3-27B-it-NVFP4 models.

Qwen3-8B-NVFP4

batch   total tokens   seconds    tok/s    peak VRAM (GB)
1       128            3.3867     37.79    7.55
2       256            3.5471     72.17    7.55
4       512            3.4392    148.87    7.55
8       1024           3.4459    297.16    7.56
16      2048           4.3636    469.34    7.56

Gemma3-27B-it-NVFP4

batch   total tokens   seconds   tok/s   peak VRAM (GB)
1       128            9.3982    13.62   19.83
2       256            9.5545    26.79   19.83
4       512            9.5344    53.70   19.84
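
As a quick consistency check on both tables, the tok/s column is simply total tokens divided by wall-clock seconds (the figures below are copied from the tables above):

```python
# Each row: (batch, total_tokens, seconds, reported_tok_per_s).
rows = [
    (1, 128, 3.3867, 37.79),   # Qwen3-8B-NVFP4
    (2, 256, 3.5471, 72.17),
    (4, 512, 3.4392, 148.87),
    (8, 1024, 3.4459, 297.16),
    (16, 2048, 4.3636, 469.34),
    (1, 128, 9.3982, 13.62),   # Gemma3-27B-it-NVFP4
    (2, 256, 9.5545, 26.79),
    (4, 512, 9.5344, 53.70),
]

for batch, tokens, secs, tok_per_s in rows:
    # Reported throughput matches tokens / seconds to rounding precision.
    assert abs(tokens / secs - tok_per_s) < 0.05
```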

For Qwen3-8B-NVFP4, peak VRAM was roughly 2.4x lower than Qwen3-8B FP16 baselines, at the cost of roughly 20-25% throughput.
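
A back-of-the-envelope estimate makes the VRAM gap plausible. NVFP4 stores 4-bit weights plus one FP8 scale per 16-element block, versus 16 bits per weight for FP16; the parameter count and byte accounting below are approximations, not measured numbers:

```python
params = 8e9  # Qwen3-8B parameter count (approximate)

# FP16: 2 bytes per parameter.
fp16_weights_gb = params * 2 / 1e9            # = 16.0 GB

# NVFP4: 0.5 bytes per parameter + 1 FP8 scale byte per 16-param block.
nvfp4_weights_gb = params * (0.5 + 1 / 16) / 1e9  # = 4.5 GB
```

Weights alone shrink by about 3.6x; the observed end-to-end factor of ~2.4x is smaller because the peak also includes the KV cache, activations, and CUDA-graph workspaces, which do not shrink by the same ratio.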

Quickstart

pip install git+https://github.com/BenChaliah/NVFP4-on-4090-vLLM.git
adallm serve nvidia/Qwen3-8B-NVFP4

Setting export NVFP4_FP8=1 is optional; it enables the FP8 GEMM path. With NVFP4_FP8=0, the FP8 KV cache and the FP8 decode kernel are still used, so the flag changes compute precision, not VRAM usage.
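
An environment toggle like this is typically read once at startup; a minimal sketch (the variable name matches the README, but the helper function is hypothetical):

```python
import os


def fp8_gemm_enabled() -> bool:
    # "1" switches GEMM compute to FP8. The KV cache and decode kernel
    # are FP8 either way, so this does not change VRAM usage.
    return os.environ.get("NVFP4_FP8", "0") == "1"
```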

Supported Models

  • nvidia/Qwen3-8B-NVFP4
  • BenChaliah/Gemma3-27B-it-NVFP4
  • Qwen3 MoE variants are supported, but still slow (see README for MoE notes).

Limitations

  • MoE routing and offload paths are not yet fully optimized.
  • Only NVFP4 weights, no FP16 fallback for decoding by design.
  • Targets Ada Lovelace (sm_89); validation on Ada cards other than the RTX 4090 is still needed.
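
The sm_89 gate implied above amounts to checking the CUDA compute capability (as returned, for example, by torch.cuda.get_device_capability()). A GPU-free sketch of that check; the function name is illustrative:

```python
# Ada Lovelace compute capability (sm_89): RTX 4090 and other Ada cards.
ADA_SM = (8, 9)


def is_supported(capability: tuple) -> bool:
    # Ampere (8, 6) and Hopper (9, 0) are rejected: the NVFP4 fast path
    # targets sm_89 only.
    return tuple(capability) == ADA_SM
```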
