Gemma 4 Uncensored with MTP: Up to 53% Speed Boost, Balanced and QAT

Uncensored Models Go Fast with MTP

HauhauCS has published two new Gemma 4 “uncensored” models on Hugging Face, both equipped with Multi-Token Prediction (MTP) draft heads that speed up generation by up to 53% through speculative decoding. The releases include a Mixture-of-Experts 26B-A4B variant and a dense 31B, each quantized to 4-bit using Quantization-Aware Training (QAT). According to the creator, Q4_K_M is the sweet spot: higher-precision quants would only add weight without any real improvement because the model was trained specifically for ~4-bit operation. Designed for local hardware with 20–24 GB of VRAM, the models support a 262K-token context window and vision input via an mmproj encoder. They have been tested only with llama.cpp, so they drop most readily into llama.cpp-based self-hosted pipelines.

Multi-Token Prediction: Pure Speed, Identical Output

MTP works by having a compact draft head propose several tokens ahead, which the full model then verifies. Because every token is validated, the final output matches what the full model would have produced on its own, so there is no quality trade-off. On the dense 31B, the measured speedup reaches 53%, or roughly 1.5x throughput, which is most noticeable in creative writing and other long-form tasks; the 26B-A4B sees a 35% improvement. Setup involves the llama.cpp flag --spec-type draft-mtp together with the provided .gguf draft files. The author reports no looping, stable sampling, and consistent long-context coherence across multiple runs, which suggests the MTP approach is reliable for production-like local use.
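The draft-and-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not llama.cpp's implementation: target_next and draft_next are hypothetical stand-ins for the full model and the draft head, and only greedy decoding is covered (sampling needs a probabilistic acceptance rule that is not shown here). The point is that the emitted token is always the one the full model would have chosen, so the result matches plain greedy decoding while using fewer verification passes.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical full model: a deterministic greedy next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical draft head: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(next_token: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one full-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Draft up to k tokens, then keep the longest prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1. The cheap draft proposes a short continuation.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real engine scores all drafted
        #    positions in one batched forward pass; we count it as one pass.
        verify_passes += 1
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return out, verify_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
spec, passes = speculative_decode(target_next, draft_next, prompt, 20)
print(spec == baseline, passes)
```

The speedup in practice depends on how often the draft is accepted and how cheap it is relative to the full model, which is consistent with the two models reporting different gains.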

QAT and Architecture: Why 4-bit Wins

Both models are quantization-aware trained, meaning the network was conditioned during training to tolerate low-bit precision. According to the creator, this makes Q4_K_M the best trade-off between quality and size; Q5 or Q6 quants would use more VRAM for no perceptible gain. The 26B-A4B uses a MoE design with 128 experts, 8 of them active per token, so only ~4B parameters are used for any given token. It fits in 16.8 GB of VRAM, which leaves headroom on a 20–24 GB card. The dense 31B needs 18.7 GB but offers stronger reasoning. Both are “Balanced”: they handle extreme requests with a light reasoning preamble and then deliver a full answer without altering the base personality, which suits creative writing, role-play, and emotionally intelligent text generation.
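To make "128 experts, 8 active per token" concrete, here is a minimal sketch of top-k expert routing. The expert count and k come from the article; everything else is an assumption for illustration, including the route helper, the random router scores, and the choice to softmax over the selected experts (a common scheme, not confirmed as Gemma 4's actual router).

```python
import math
import random
from typing import List, Tuple

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits: List[float], k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and normalise their mixing weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


# Hypothetical router scores for a single token (seeded for repeatability).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts that run for this token

print(len(chosen), active_fraction)
```

Only 8 of the 128 experts run for each token, but the share of active parameters (~4B of 26B) is larger than 8/128 because attention layers, embeddings, and any other non-expert weights are used for every token. All experts still have to be resident in memory, which is why the VRAM footprint tracks the full 26B rather than the ~4B active set.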

Self-Hosted Control and Data Sovereignty

Because the models run entirely on local hardware through llama.cpp, they fit on-premise deployment strategies well. Organizations that must keep data within their own walls, whether for GDPR, healthcare compliance, or trade-secret protection, gain access to high-performance LLMs without any data leaving the building. With MTP acceleration and QAT efficiency, a consumer-grade GPU can approach the throughput that previously required cloud-hosted setups, which can significantly lower Total Cost of Ownership (TCO). The creator notes that Qwen3.6 remains better for agentic coding tasks, which clarifies the use case: Gemma 4 excels where narrative control, privacy, and uncensored output are paramount, and where a one-time hardware investment is preferred over recurring API bills.

The Broader Local LLM Shift

With nearly 20 million downloads and a collaboration with Unsloth for the MTP drafts, these releases highlight a growing shift. Open-source tooling and quantization-aware training are pushing on-premise inference from “possible” to “practical.” 262K-context windows, vision support, and lossless speed-ups turn personal workstations into capable AI servers, reshaping the boundary between cloud dependency and in-house processing. For anyone evaluating self-hosted LLMs, the message is clear: the technology is maturing fast, and the control-versus-convenience equation continues to tilt in favor of local deployment.