Gemma 4: The Discovery of Hidden Multi Token Prediction and Its Implications for Local Inference

The Discovery of Multi Token Prediction in Gemma 4

Google's Gemma 4 large language model (LLM) is at the center of a discovery that has drawn attention across the tech community: the model was originally built with Multi Token Prediction (MTP), a technique for accelerating inference. The finding came from a user who, while running Gemma 4 through the LiteRT API on a Google Pixel 9 Android device, encountered errors related to "mtp weights being an incompatible tensor shape."

Further digging revealed additional MTP "prediction heads" within the LiteRT files, designed to support speculative decoding and, consequently, faster output. The discovery has sparked a lively debate because MTP is a sought-after feature for optimizing LLM performance, particularly where latency and throughput are critical.

Technical Details and Google's Decision

Multi Token Prediction, often associated with speculative decoding, lets an LLM propose several upcoming tokens in one step rather than generating them strictly one at a time. In speculative decoding, these inexpensive proposals are then checked by the full model, which keeps the ones it agrees with and corrects the first one it rejects, so fewer full-model steps are needed for the same output. This can substantially reduce the time required to generate a response. The presence of MTP structures in Gemma 4's LiteRT files suggests that the functionality was part of the model's original design, intended to maximize generation speed.
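To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in Python. It is not Gemma 4's or LiteRT's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the full model and for cheap MTP-style prediction heads, and the number of "target passes" is only counted, on the assumption that a real system would verify all drafted positions in one batched forward pass.

```python
from typing import List, Tuple

def target_next(context: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic next-token rule."""
    return (context[-1] * 3 + 1) % 17

def draft_next(context: List[int]) -> int:
    """Toy stand-in for cheap draft heads: usually agrees with the target,
    but is deliberately wrong whenever the last token is a multiple of 5."""
    guess = target_next(context)
    return (guess + 1) % 17 if context[-1] % 5 == 0 else guess

def plain_generate(prompt: List[int], num_tokens: int) -> List[int]:
    """Baseline: one full-model step per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_generate(
    prompt: List[int], num_tokens: int, draft_len: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding: draft several tokens, then verify them."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1) Draft: propose draft_len tokens with the cheap predictor.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(out + draft))

        # 2) Verify: counted as ONE target pass (assumed to be batched).
        target_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # replace with the target's token
                break                      # discard the rest of the draft
        else:
            # Every draft token was accepted; the same pass also yields
            # the target's own next token as a bonus.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_passes

if __name__ == "__main__":
    tokens, passes = speculative_generate([1], 20)
    print("identical to baseline:", tokens == plain_generate([1], 20))
    print("target passes used:", passes, "for", len(tokens), "tokens")
```

Because every drafted token is checked against the target's own choice, the output matches plain decoding exactly. Any speedup comes only from needing fewer full-model passes, and it shrinks as the draft predictor becomes less accurate.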

Official confirmation came from a Google employee, who stated that Gemma 4 does have MTP but that the functionality was "removed on purpose" with the goal of "ensuring compatibility and broad usability." This rationale is understandable from a large-scale distribution perspective, but it has left a bitter taste for some in the community, who would have preferred a release with all of the model's capabilities enabled, especially given the interest in performance on edge devices and in self-hosted environments.

Implications for On-Premise Deployment and Performance

Disabling MTP in Gemma 4 raises important considerations for CTOs, DevOps leads, and infrastructure architects evaluating LLM deployment in on-premise, hybrid, or edge environments. Inference speed is a key factor in Total Cost of Ownership (TCO) and operational efficiency. An MTP-enabled model could offer higher throughput and lower latency, so the same workload could be served with less powerful hardware or fewer instances.
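The link between decoding speed and instance count can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name `instances_needed`, the workload of 1000 tokens per second, the 40 tokens per second per instance, and the 1.5x speedup are all made-up placeholders, not measurements of Gemma 4 or of MTP.

```python
import math

def instances_needed(
    required_tokens_per_s: float,
    tokens_per_s_per_instance: float,
    speedup: float = 1.0,
) -> int:
    """Smallest number of instances whose combined throughput covers the load.

    speedup is a hypothetical multiplier on per-instance throughput,
    for example from a faster decoding technique.
    """
    if min(required_tokens_per_s, tokens_per_s_per_instance, speedup) <= 0:
        raise ValueError("all inputs must be positive")
    effective = tokens_per_s_per_instance * speedup
    return math.ceil(required_tokens_per_s / effective)

if __name__ == "__main__":
    # Placeholder numbers chosen purely for illustration.
    baseline = instances_needed(1000, 40)
    faster = instances_needed(1000, 40, speedup=1.5)
    print("baseline instances:", baseline)
    print("instances with hypothetical 1.5x speedup:", faster)
```

The same arithmetic applies in reverse: with a fixed fleet, a per-instance speedup raises the workload that can be served before new hardware is needed.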

For those operating in contexts where data sovereignty, compliance, or air-gapped environments are priorities, optimizing performance on local hardware is crucial. An LLM that generates responses faster on a Google Pixel 9, for example, can translate into lower power consumption and a better user experience. Google's choice to prioritize universal compatibility over maximum native performance introduces a trade-off that IT specialists must weigh carefully when evaluating deployment options for their AI/LLM workloads.

Future Prospects and the Role of the Community

The discovery of hidden MTP in Gemma 4 has reignited debate within the community, with some speculating about the possibility of reverse engineering to extract the tensors and mathematical logic from the compute graph in LiteRT and re-enable the functionality. This approach, though complex, reflects the constant pursuit of optimization and customization that characterizes the LLM sector, especially for local deployments.

Getting the most performance out of specific hardware, such as GPUs with limited VRAM or edge devices, remains a priority for many. The Gemma 4 case highlights how model design and release decisions can directly affect deployment strategies and TCO for businesses. The open source community continues to play a crucial role in exploring and unlocking the full potential of these models, even when advanced features are initially limited by the original providers.