AdaLLM: NVFP4-first inference on RTX 4090 with FP8 KV cache
AdaLLM is a new runtime focused on accelerated NVFP4 inference on Ada Lovelace GPUs, such as the RTX 4090. It implements an FP8 KV cache and a custom FP8 decode kernel, avoiding FP16 fallback. It supports models like Qwen3 and Gemma3, offering perfor...
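To make the FP8 KV cache idea concrete, here is a minimal pure-Python sketch of how key/value entries could be stored in FP8 (E4M3) and read back. It is not AdaLLM's code: the summary does not say how scales are chosen or how the decode kernel consumes the cache, so the per-tensor scale and the names `round_to_e4m3`, `quantize_kv`, and `dequantize_kv` are assumptions made for illustration only.

```python
import math

E4M3_MAX = 448.0    # largest finite magnitude in FP8 E4M3
E4M3_MIN_EXP = -6   # exponent of the smallest normal E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round a finite float to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)            # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, E4M3_MIN_EXP)    # subnormals share the minimum exponent
    step = 2.0 ** (e - 3)            # 3 mantissa bits: 8 steps per power of two
    return sign * round(a / step) * step  # round() is half-to-even on the step count


def quantize_kv(values):
    """Assumed per-tensor scheme: map the largest magnitude onto E4M3_MAX."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize_kv(quantized, scale):
    """Recover approximate values from the FP8 grid and the stored scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    kv = [0.5, -2.0, 4.48]
    q, s = quantize_kv(kv)
    print(q, s)
    print(dequantize_kv(q, s))
```

With three mantissa bits, values in the normal range come back with a relative error of at most 1/16. A runtime that keeps the cache in FP8 and also decodes from it directly halves KV memory relative to FP16 without converting the cache back to a wider type first.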