DiffusionGemma 26B A4B IT: An Open-Weights Multimodal LLM for High-Speed Inference

Google DeepMind Unveils DiffusionGemma 26B A4B IT: A Multimodal LLM for the Enterprise

Google DeepMind has released DiffusionGemma 26B A4B IT, an open-weights multimodal Large Language Model (LLM) that accepts text, image, and video inputs and generates text output. It is aimed at developers, researchers, and enterprises that need high-speed text generation across complex and varied contexts.

Its open weights, combined with a license that permits both commercial and non-commercial use, make it appealing for organizations that want flexible, controllable components in their AI pipelines. Multimodal input handling opens up applications ranging from advanced document understanding to video content analysis.

Architecture and Performance: Optimization for NVIDIA Hopper H100

DiffusionGemma 26B A4B IT is built upon the Gemma 4 26B A4B Mixture-of-Experts (MoE) architecture, with 25.2 billion total parameters, of which 3.8 billion are active. Because only about 15% of the weights are used for any given token, per-token compute is much lower than in a dense model of the same total size, although all 25.2 billion parameters must still be stored. This is the usual MoE trade-off: modeling capacity close to that of a large model, with more manageable compute requirements than similarly sized dense models.
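As a rough illustration of the parameter counts above, the sketch below computes the active-parameter fraction and an approximate weight-only memory footprint at a few precisions. The bytes-per-parameter values are assumptions (2 bytes for BF16, 1 byte for FP8, about 0.5 bytes for a 4-bit format such as NVFP4, ignoring scaling-factor overhead), and the function names are hypothetical. Activations, caches, and runtime buffers are not counted, so the results are a lower bound rather than a sizing guide.

```python
TOTAL_PARAMS = 25.2e9   # total parameters, as stated in the article
ACTIVE_PARAMS = 3.8e9   # active parameters, as stated in the article

# Assumed storage cost per parameter; quantization scale overhead is ignored.
BYTES_PER_PARAM = {
    "bf16": 2.0,
    "fp8": 1.0,
    "4-bit (e.g. NVFP4)": 0.5,
}


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in computing one token."""
    return active / total


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight-only footprint in gigabytes (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active fraction: {frac:.1%}")
    for name, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
        print(f"{name}: ~{gb:.1f} GB of weights")
```

Under these assumptions the weights alone occupy roughly 50.4 GB in BF16, 25.2 GB in FP8, and 12.6 GB at 4 bits, which is why the choice of precision matters so much for fitting the model on a single accelerator.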

The model adopts an encoder-decoder design with bidirectional attention and generates tokens in parallel, in blocks of 256 tokens, rather than one token at a time. This block-wise decoding is what enables the reported generation speed of more than 1,100 tokens per second at low batch sizes on NVIDIA Hopper H100 hardware in FP8. The model is also optimized for the NVFP4 format via Model Optimizer, which underlines the focus on inference efficiency, a key concern for on-premise deployments where VRAM consumption and throughput are limiting factors. In addition, the model supports a 256K-token context window, a configurable thinking (reasoning) mode, native function calling, and multilingual inference across 35+ languages.
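To make the block-wise arithmetic concrete, the following sketch estimates how many 256-token blocks a response needs and how long generation would take at the quoted rate. It assumes, purely for illustration, a steady 1,100 tokens per second and ignores prompt processing; the function names are hypothetical and nothing here reflects the model's actual scheduler.

```python
import math

BLOCK_SIZE = 256           # tokens per parallel block, as stated in the article
TOKENS_PER_SECOND = 1100   # quoted lower bound on H100 (FP8), low batch sizes


def blocks_needed(output_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks required to cover the requested output length."""
    if output_tokens < 0 or block_size <= 0:
        raise ValueError("output_tokens must be >= 0 and block_size > 0")
    return math.ceil(output_tokens / block_size)


def estimated_seconds(output_tokens: int,
                      tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Generation time under the assumption of constant throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    n = 1000
    print(f"{n} tokens -> {blocks_needed(n)} blocks, "
          f"~{estimated_seconds(n):.2f} s")
```

For a 1,000-token answer this gives 4 blocks and roughly 0.9 seconds of generation time. At the same rate, one full block of 256 tokens corresponds to about 0.23 seconds.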

Implications for On-Premise Deployment and Data Sovereignty

For CTOs, DevOps leads, and infrastructure architects, the features of DiffusionGemma 26B A4B IT have direct implications for deployment strategy. The optimization for NVIDIA Hopper H100 and the NVFP4 quantization indicate that the model was tuned for specific hardware, a decisive factor for those evaluating self-hosted or bare metal solutions. The reported figure of more than 1,100 tokens per second on H100 (FP8) at low batch sizes gives a concrete reference point for hardware resource planning, although real workloads with larger batches or long contexts may behave differently.
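The quoted throughput can be turned into a first-order capacity estimate. The sketch below is a back-of-the-envelope calculation, not a benchmark: it assumes each GPU sustains a fixed aggregate rate and applies a hypothetical utilization headroom of 70%. The function name, the headroom value, and the demand figure in the usage line are all illustrative assumptions, and the 1,100 tokens per second figure was reported at low batch sizes, so it should be replaced with measured numbers for any real plan.

```python
import math


def gpus_required(peak_tokens_per_second: float,
                  per_gpu_tokens_per_second: float = 1100.0,
                  target_utilization: float = 0.7) -> int:
    """First-order GPU count for a given peak output-token demand.

    target_utilization is an assumed headroom factor: with 0.7, each GPU
    is planned to run at no more than 70% of its nominal throughput.
    """
    if peak_tokens_per_second < 0:
        raise ValueError("peak_tokens_per_second must be >= 0")
    if per_gpu_tokens_per_second <= 0:
        raise ValueError("per_gpu_tokens_per_second must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable = per_gpu_tokens_per_second * target_utilization
    # Always provision at least one GPU, even for negligible demand.
    return max(1, math.ceil(peak_tokens_per_second / usable))


if __name__ == "__main__":
    demand = 5000  # hypothetical peak demand in output tokens per second
    print(f"{demand} tok/s -> {gpus_required(demand)} GPUs")
```

With the illustrative demand of 5,000 tokens per second, the estimate comes out at 7 GPUs. The calculation covers generation throughput only and leaves out memory limits, prompt processing, and redundancy.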

Open weights and the option of on-premise deployment support data sovereignty and regulatory compliance, both of which matter in regulated industries. Companies can keep complete control over their data and inference infrastructure, reducing the risks associated with transferring sensitive data to external cloud services. For those evaluating the trade-offs between on-premise and cloud deployment, AI-RADAR offers analytical frameworks at /llm-onpremise to support informed decisions based on TCO, performance, and security requirements.

Use Cases and Future Prospects for the Enterprise

DiffusionGemma 26B A4B IT is designed for a wide range of enterprise use cases. These include conversational AI and chatbots, text summarization, code generation with step-by-step reasoning, and advanced image and document understanding, including OCR, chart comprehension, and PDF or UI parsing. Its video content analysis capabilities and support for agentic workflows with native function calling make it a versatile tool for business process automation and optimization.

Multilingual support, covering over 35 languages, further expands its potential in global contexts. The availability of a fast, flexible open-weights model optimized for leading hardware reflects the growing maturity of the LLM ecosystem for enterprise applications. It gives companies more options for building customized and robust AI solutions while keeping control over the underlying infrastructure.