A 35B MoE LLM on a Single RTX 3090: Speed and Quality Within Consumer Reach

Running a 35-billion-parameter Mixture-of-Experts LLM on a consumer GPU is no longer a lab experiment. On a single RTX 3090, a user tested Qwen3.6-35B-A3B APEX and reported over 130 tokens per second with a 128,000-token context window, all without resorting to the cloud. The report, posted on Reddit, compares two llama.cpp forks and examines a promising KV cache codec called turbo8.

The setup and the forks compared

The hardware is a single RTX 3090 with 24 GB of VRAM. Fitting the model on that card requires aggressive quantization, and the report uses two APEX variants: I-Compact at about 17 GB and I-Quality at about 21.3 GB. Inference runs on two llama.cpp forks: ik_llama, known for its CUDA kernel optimisations, and spiritbuun, which introduces the turbo8 codec for key-value (KV) cache compression.
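As a rough illustration of why the variant choice matters, the following Python sketch subtracts the quoted weight sizes from the card's 24 GB. It treats the reported figures as approximate and ignores driver and runtime overhead, which is an assumption rather than something the report measures.

```python
# Sizes in GB as quoted in the report; treated as approximate.
VRAM_GB = 24.0
WEIGHTS_GB = {"I-Compact": 17.0, "I-Quality": 21.3}


def headroom_gb(weights_gb: float, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading the weights, before KV cache and compute buffers."""
    return vram_gb - weights_gb


for name, size in WEIGHTS_GB.items():
    print(f"{name}: {headroom_gb(size):.1f} GB left for KV cache and buffers")
```

Whatever remains after the weights has to hold the KV cache and the compute buffers, which is why cache compression carries so much weight in this setup.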

The numbers are clear. With I-Compact, ik_llama reaches ~146 t/s in decoding on both narrative text and code, while the spiritbuun fork sits at around 141–142 t/s with the same model, roughly 3% behind. With I-Quality, both engines converge at ~137 t/s. The gap is small, and the notable point is that spiritbuun matches ik_llama on the heavier model.
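To put these decode rates in more tangible terms, this small Python sketch converts tokens per second into per-token latency and the time needed for a 1,000-token reply. The 1,000-token reply length is an arbitrary illustration, and 141 t/s is taken as the low end of the range reported for spiritbuun.

```python
# Decode rates in tokens/second, as quoted in the report.
RATES = {
    ("ik_llama", "I-Compact"): 146.0,
    ("spiritbuun", "I-Compact"): 141.0,  # low end of the reported range
    ("ik_llama", "I-Quality"): 137.0,
    ("spiritbuun", "I-Quality"): 137.0,
}


def ms_per_token(tokens_per_second: float) -> float:
    """Average time spent generating one token, in milliseconds."""
    return 1000.0 / tokens_per_second


def seconds_for(n_tokens: int, tokens_per_second: float) -> float:
    """Time to decode n_tokens at a steady rate, ignoring prefill."""
    return n_tokens / tokens_per_second


for (fork, variant), rate in RATES.items():
    print(f"{fork:10s} {variant:10s} {ms_per_token(rate):5.2f} ms/token, "
          f"{seconds_for(1000, rate):5.2f} s per 1,000 tokens")
```

On these figures the spread between the fastest and slowest configuration is well under half a millisecond per token.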

turbo8: higher speed with less degradation

The turbo8 codec replaces the traditional q8_0 for the cache keys and pairs with turbo4 for the values. Benchmarks published by the developer on X show that the speed advantage widens as context length increases, from +1.9% at 2,048 tokens to +15% at 32,768 tokens, with a consistently lower KL divergence. In plain terms: turbo8 is faster and loses less information.
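For readers unfamiliar with the metric, here is a minimal Python sketch of KL divergence between two next-token distributions. It assumes the comparison is between the model's output with an uncompressed cache (the reference) and with a compressed cache; the report does not describe the developer's exact measurement procedure, and the logits below are toy values rather than benchmark data.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats: how much Q diverges from the reference P."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a 4-token vocabulary (illustrative only).
reference = softmax([2.0, 1.0, 0.5, -1.0])   # e.g. uncompressed cache
compressed = softmax([1.9, 1.1, 0.5, -1.0])  # e.g. compressed cache

print(f"KL(reference || compressed) = {kl_divergence(reference, compressed):.6f} nats")
```

A value of zero means the two distributions are identical, so a lower KL divergence indicates that the compressed cache distorts the model's predictions less.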

In practice, the asymmetric turbo8/turbo4 configuration proved decisive in pushing the context window to 128k without running out of memory. To get the most out of the spiritbuun fork, however, you need to apply a patch (PR #72) that fixes a ~38% regression in prefill speed; otherwise, the advantage vanishes when processing long prompts.
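The memory effect of an asymmetric key/value configuration can be sketched with a simple estimate in Python. The report does not give the footprint of turbo8 or turbo4, so this sketch uses the stock llama.cpp block formats q8_0 (8.5 bits per element) and q4_0 (4.5 bits per element) as stand-ins. The layer count, KV head count and head dimension are hypothetical placeholders, not the model's actual values, which would need to be read from its configuration.

```python
# Effective bits per cached element, including per-block scale overhead.
BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bits_k, bits_v):
    """Approximate KV cache size in GiB for a given context length."""
    elems_per_token = n_layers * n_kv_heads * head_dim  # same count for K and for V
    total_bits = n_ctx * elems_per_token * (bits_k + bits_v)
    return total_bits / 8 / 2**30


# HYPOTHETICAL architecture values, for illustration only.
ARCH = dict(n_layers=10, n_kv_heads=2, head_dim=256)
N_CTX = 128 * 1024

for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0")]:
    size = kv_cache_gib(N_CTX, bits_k=BITS[k_type], bits_v=BITS[v_type], **ARCH)
    print(f"K={k_type:5s} V={v_type:5s} -> {size:.3f} GiB at {N_CTX} tokens")
```

The estimate scales linearly with context length, so quantizing values more aggressively than keys frees memory that can go toward a longer window.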

Quality and trade-offs: I-Compact or I-Quality?

Data from the APEX repository (referring to the Qwen3.5-35B-A3B model) paint a clear picture. I-Quality and UD-Q4_K_XL have nearly identical perplexity (6.552 vs. 6.554), but APEX I-Quality is about 7% faster in generation (62.3 t/s vs. 58.1 t/s) and achieves a slightly higher HellaSwag score (83.5% vs. 83.0%).

I-Compact, at only 17 GB, is the more memory-efficient option: perplexity is about 4.7% higher (6.857 vs. 6.552), but it matches I-Quality on HellaSwag (83.5%), and the roughly 4.3 GB it frees allows context to be pushed up to 256k without exhausting VRAM. Anyone looking for a balance between quality and context length will find I-Compact a sound choice.
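The trade-offs quoted from the APEX repository can be expressed as relative changes with a few lines of Python. The figures are those cited above; the helper name is simply an illustrative choice.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


# Generation speed: APEX I-Quality vs. UD-Q4_K_XL (tokens/second).
speed_gain = pct_change(62.3, 58.1)

# Perplexity: I-Compact vs. I-Quality (lower is better).
ppl_penalty = pct_change(6.857, 6.552)

# Model size: I-Compact vs. I-Quality (GB).
size_saving = -pct_change(17.0, 21.3)

print(f"I-Quality generation speed vs. UD-Q4_K_XL: {speed_gain:+.1f}%")
print(f"I-Compact perplexity vs. I-Quality:        {ppl_penalty:+.1f}%")
print(f"I-Compact size saving vs. I-Quality:       {size_saving:.1f}%")
```

Read this way, I-Compact trades a perplexity increase of under 5% for a model about a fifth smaller.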

Why it matters for on-premise deployment

This experiment shows that consumer hardware, when paired with sophisticated quantization and next-generation KV codecs, can sustain demanding workloads such as long contexts and agentic flows without multi-GPU servers or cloud APIs. For organisations that care about data sovereignty and Total Cost of Ownership, the message is strong: it is now feasible to self-host large MoE LLMs on a single card, with granular control over every component of the pipeline.

Of course, these are experimental forks, requiring manual patches and tested by a single user. But the trajectory points to a community-driven ecosystem that lowers the hardware bar, democratising inference for models that until recently seemed reserved for the data centre. AI-RADAR tracks these developments—from quantization choices to serving runtimes—because every piece counts on the path toward truly sustainable on-premise stacks.