Running a 35-billion-parameter model on a single GPU remains a core challenge for on-premise inference. The llama.cpp community knows this well: aggressive quantization and speculative decoding are the levers to cut latency without sacrificing data control. The latest contribution comes from an experimental addition to the Ornith-1.0-35B GGUF format, where a user grafted a native MTP (Multi-Token Prediction) draft head onto an IQ4_XS body, achieving a 35% acceleration in single-stream decoding.
How the MTP graft works and early results
At the core is a self-speculative decode: the MTP head, kept at Q6 precision, generates token sequences early, which the IQ4_XS body verifies in parallel, boosting real throughput. On an RTX PRO 6000 Blackwell 96 GB GPU, tp=1, the jump is clear: from 172.6 to 233.8 tok/s. More importantly, the next-token distribution stays byte-identical to the target model on 32 test cases – KL divergence of 0.0. Compared to Q4_K_M, the graft shows a BF16 KL divergence of 0.073 (vs. 0.086), meaning the modification introduces no extra degradation. The only caveat: on long deterministic generations it is not perfectly bit-exact (6/8 exact, 93.4% token match), a typical trade-off when pushing for speed.
Comprehensive numbers: throughput, latency, and fidelity
The update brings a full set of benchmarks across six quantizations, with throughput figures, p95 TTFT latency, and a fidelity ladder based on mean KL of the top-64 next tokens versus BF16. Q4_K_M, for instance, delivers 243 tok/s at concurrency 1 and scales to about 656 tok/s with 16 simultaneous requests, with p95 TTFT of 76 ms for a single stream. The IQ4_XS-MTP graft takes roughly 19.6 GB, less than Q4_K_M (21.2 GB), while achieving a top-1 accuracy of 90.6%, identical to Q4_K_M and better than plain IQ4_XS (84.4%). On the long-context side, prefill scales linearly: 94 ms for 512 tokens and approximately 6.3 seconds for 32k tokens, with the graft outperforming Q4_K_M at every tested length.
What it means for on-premise deployments
This experiment is more than a technical feat. It demonstrates how native speculative decoding – without external draft models – can lower generation latency on self-hosted machines, keeping inference fully under local control. In scenarios where data sovereignty or cost predictability outweigh cloud flexibility, a 35% boost on a single GPU can translate into better TCO or the ability to serve more users with the same hardware. The fidelity ladder becomes a practical decision tool: more aggressive quantizations save memory but erode accuracy; techniques like this graft shift the acceptable threshold. AI-RADAR tracks these developments closely, offering analytical frameworks on /llm-onpremise to assess trade-offs without losing sight of the real needs of those bringing LLMs in-house.
💬 Comments (0)
🔒 Log in or register to comment on articles.
No comments yet. Be the first to comment!