A benchmark on four NVIDIA RTX 3090 with the Nemotron-3-Super-120B-A12B in GGUF format delivered results that reshape what on-premise inference can achieve at extreme context lengths. The hybrid Mamba2 + periodic attention + Mixture of Experts architecture, with 12 billion active parameters, recalled single pieces of information buried in a window of over 504,000 tokens without a single error—all while running entirely on consumer GPU VRAM, about 20 GB per card.

The test: four RTX 3090 and a 71 GB model

The version used was the i1-Q4_K_S quantization from mradermacher, compressing the original BF16 checkpoint into a roughly 71 GB GGUF. Inference was run with the llama.cpp backend, keeping all model layers on GPU and using a q8_0 KV cache. The minimum setup: four 24 GB RTX 3090s – no NVLink, no specialized servers.

Decode numbers tell the story: 72 tokens per second on short contexts, 67 t/s at 30K, down to 23 t/s at 504K tokens. Prefill clocks at over 2,000 t/s on 30K tokens and 885 t/s on the full context. The decisive metric, however, is needle-in-a-haystack: a single piece of information (the “code”) planted at 10%, 50%, and 90% depth was correctly retrieved in every test.

The key: Mamba layers and a tiny KV cache

A full-attention model accumulates a key-value cache that grows linearly with context, hitting both VRAM footprint and decode speed. Nemotron-3-Super instead uses Mamba layers that maintain a fixed-size recurrent state. Only the few periodic attention layers carry a KV cache, and with just 2 KV heads the impact is minimal. The result: decode at 500K tokens (23 t/s) is roughly what a comparable full-attention MoE (MiniMax-M2.7-REAP, ~74 GB, 10B active) achieved at just 30K tokens on the same hardware (24.5 t/s).

What it means for on-premise deployments

Teams evaluating local stacks for data sovereignty or TCO control know that very long contexts are often a prohibitive luxury. Full-attention models demand generous VRAM and latency degrades as the conversation grows. The Mamba+MoE architecture demonstrated here breaks that trade-off: the cost of context becomes almost flat. Half a million tokens become manageable on four consumer cards, without enterprise servers or cloud. This opens concrete use cases for long-document analysis, complex contract review, and legal or compliance applications that must stay strictly on-premise.

Watch the recency bias

A detail from the test is the classic recency bias: permanent instructions buried deep in the context can be overridden by a contradiction placed near the end. The operational lesson is clear: in real-world use, rigid rules belong in the system prompt or toward the context’s tail, not scattered across a long spine. It remains a caution point for those building structured workflows on extremely long contexts.

Outlook

The mix of hybrid Mamba architectures and aggressive quantization is redrawing the boundaries of local inference. No data centers required: four consumer GPUs and a well-optimized GGUF deliver absolute precision on context lengths that were, until yesterday, the exclusive domain of cloud infrastructure. For anyone watching on-premise LLM deployment, the path shown by Nemotron-3-Super sends a strong signal: architectural efficiency trumps raw teraflops when the goal is control, low cost, and ultra-long contexts.