Google Gemma 4: New Open-Weight LLMs for Local Deployment

Google has announced Gemma 4, the fourth generation of its open-weight large language models (LLMs). The release gives developers and enterprises more flexibility and control when implementing artificial intelligence solutions on-premise. Unlike the Gemini models, which are usable only under Google's terms, Gemma 4 takes a more open approach: it is released under the Apache 2.0 license, replacing the previous custom license that had raised some concerns within the community.

Optimization for local usage is at the core of Gemma 4. This strategic choice addresses the growing need for deployments that ensure data sovereignty, regulatory compliance, and direct control over infrastructure. The models have been designed to run on local machines, offering various options in terms of size and hardware requirements to adapt to diverse use cases.

Technical Details and Hardware Requirements for On-Premise Inference

Gemma 4 is available in four sizes. The larger variants include a 26-billion-parameter Mixture of Experts (MoE) model and a 31-billion-parameter dense model. These models have been specifically engineered to run in unquantized bfloat16 format on a single 80GB NVIDIA H100 GPU. While an H100 represents a significant investment, estimated at around $20,000, the ability to run models of this scale on local hardware underscores Google's commitment to self-hosted solutions.
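
The single-GPU claim can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not an official sizing guide: it assumes 2 bytes per bfloat16 parameter and counts the weights only, ignoring the KV cache, activations, and framework overhead. The function name weight_memory_gb is illustrative.

```python
# Rough weight-memory estimate for unquantized bfloat16 models.
# Assumption: 2 bytes per parameter; KV cache and activations are NOT included.
BYTES_PER_BF16_PARAM = 2
H100_VRAM_GB = 80  # GPU capacity cited in the article


def weight_memory_gb(num_params: float, bytes_per_param: float = BYTES_PER_BF16_PARAM) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as reported for the two larger variants.
MODELS = {"26B MoE": 26e9, "31B Dense": 31e9}

for name, params in MODELS.items():
    used = weight_memory_gb(params)
    headroom = H100_VRAM_GB - used
    print(f"{name}: ~{used:.0f} GB of weights, ~{headroom:.0f} GB of headroom")
```

Under these assumptions both models fit within 80GB with room to spare. Note that a MoE model normally has to keep all of its parameters resident in memory, so its footprint is set by the full 26 billion parameters rather than by the subset active for a given token.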

For organizations with tighter budgets or aiming for broader adoption, Google has indicated that these larger models can also run on consumer GPUs, provided quantization is applied to lower the numerical precision of the weights and, with it, their VRAM requirements. This flexibility allows companies to balance performance and cost, choosing the hardware best suited to their specific deployment needs.
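
To illustrate why quantization matters for consumer hardware, the following sketch estimates the weight footprint at different bit widths and checks it against a VRAM budget. The 24GB budget and the 1.2x overhead factor are hypothetical values chosen for illustration, not figures from Google, and real quantization formats store extra metadata (such as scaling factors) that makes actual files somewhat larger.

```python
# Illustrative estimate of quantized weight size versus a VRAM budget.
# Assumptions (not from the article): a 24 GB consumer GPU and a flat 1.2x
# overhead factor standing in for KV cache, activations, and runtime buffers.
CONSUMER_VRAM_GB = 24   # hypothetical consumer card
OVERHEAD_FACTOR = 1.2   # hypothetical safety margin


def quantized_weight_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in decimal gigabytes at the given bit width."""
    return num_params * bits_per_param / 8 / 1e9


def fits_in_vram(num_params: float, bits_per_param: int,
                 vram_gb: float = CONSUMER_VRAM_GB,
                 overhead: float = OVERHEAD_FACTOR) -> bool:
    """True if the estimated footprint, including overhead, fits the budget."""
    return quantized_weight_gb(num_params, bits_per_param) * overhead <= vram_gb


for bits in (16, 8, 4):
    size = quantized_weight_gb(31e9, bits)
    verdict = "fits" if fits_in_vram(31e9, bits) else "does not fit"
    print(f"31B dense at {bits}-bit: ~{size:.1f} GB, {verdict} in {CONSUMER_VRAM_GB} GB")
```

With these assumed numbers, only the 4-bit variant of the 31B model fits the budget, which is the kind of trade-off between precision and hardware cost the paragraph above describes.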

Performance Optimization and Latency Impact

A key aspect of Gemma 4's development has been the focus on reducing latency, a critical factor for fully leveraging the benefits of local processing. The 26B Mixture of Experts model, for example, activates only 3.8 billion of its 26 billion parameters during inference. This sparse architecture enables significantly higher throughput, measured in tokens per second, than similarly sized models that activate all of their parameters.
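
The effect of sparse activation can be approximated with a short calculation. This is a sketch under a stated assumption: it uses the common rule of thumb of roughly 2 floating-point operations per active parameter per generated token, which is not a figure from Google. Real-world throughput also depends on memory bandwidth, expert routing overhead, and batch size, so the ratio below is an upper-bound intuition rather than a benchmark.

```python
# Rough comparison of per-token compute: sparse MoE versus a dense model
# of the same total size. Assumption: ~2 FLOPs per active parameter per token.
FLOPS_PER_ACTIVE_PARAM = 2  # rule-of-thumb approximation, not a measured value

TOTAL_PARAMS = 26e9    # total parameters of the MoE model (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters active during inference (from the article)


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters that participate in inference."""
    return active / total


def approx_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to generate one token."""
    return FLOPS_PER_ACTIVE_PARAM * active_params


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
compute_ratio = approx_flops_per_token(TOTAL_PARAMS) / approx_flops_per_token(ACTIVE_PARAMS)

print(f"Active share of parameters: {fraction:.1%}")
print(f"Dense-to-sparse compute ratio per token: {compute_ratio:.1f}x")
```

Under this approximation, the MoE model uses about 15% of its parameters and needs several times less compute per token than a hypothetical dense model with the same 26 billion parameters.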

The 31B dense model, on the other hand, prioritizes output quality over pure speed and is intended as a base for developers to fine-tune for specific applications. This differentiation lets users choose between a model optimized for speed and one optimized for output quality, depending on the workload and application objectives.

The Apache 2.0 License and Implications for Enterprise Deployment

The shift to the Apache 2.0 license for Gemma 4 is a strategic move that responds to the demands of the developer community and offers greater freedom and transparency. This widely used open-source license reduces legal friction and makes it easier to integrate the models into commercial and proprietary projects, without the restrictions that custom licenses impose.

For enterprises evaluating on-premise LLM deployment, this licensing choice, combined with optimization for local hardware, enhances Gemma 4's appeal. It offers a clearer path towards building AI applications that keep data within the corporate perimeter, ensuring control, security, and potentially a lower Total Cost of Ownership (TCO) compared to cloud-based solutions, especially for consistent and predictable workloads. AI-RADAR continues to provide analytical frameworks on /llm-onpremise to help organizations evaluate the trade-offs between different deployment strategies.