Gemma 4 12B QAT: 120 tok/s on a 12GB VRAM GPU with llama.cpp

On-Premise LLM Inference: The Gemma 4 12B QAT Case

Large Language Model (LLM) development is paying increasing attention to optimizing inference on local hardware. Google recently released Quantization-Aware Training (QAT) variants of its Gemma 4 models, including the 12-billion-parameter version. QAT simulates quantization during training, so the model loses less accuracy when its weights are later stored at low precision. This is particularly relevant for companies and professionals evaluating on-premise deployments, where data sovereignty and infrastructure control are priorities.

A recent benchmark ran the Gemma 4 12B QAT model on a consumer GPU, with results that underscore the potential of this approach for local AI workloads. The test reached an average speed of approximately 120 tokens per second on a graphics card with 12GB of VRAM, a memory capacity that is increasingly common outside specialized data centers.

Technical Details and Benchmark Methodology

The test system paired an NVIDIA RTX 4070 Super GPU (12GB of VRAM) with an AMD Ryzen 7 9700X CPU and 32GB of DDR5-6000 RAM. The model ran on llama.cpp, an inference framework known for its efficiency across a range of hardware architectures. Specifically, the build was patched with the Multi-Token Prediction (MTP) pull request for Gemma 4, which speeds up token generation through speculative decoding.

The setup loaded two GGUF files: the main gemma-4-12B-it-qat-GGUF model from Unsloth and Google's assistant (draft) model, also converted to GGUF format. The assistant model is essential for Multi-Token Prediction: it proposes several tokens ahead, and the main model validates them together in a single pass instead of generating them one at a time, which significantly raises throughput. The context size was set to 131072 tokens, showing that a long context window can be configured even on hardware with limited VRAM, provided that both the model and the assistant reside entirely in GPU memory.
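The draft-and-verify idea behind speculative decoding can be sketched in a few lines of Python. This is a minimal, greedy toy version under stated assumptions, not the MTP implementation from the llama.cpp pull request: the names speculative_step, toy_target and toy_draft are hypothetical, the "models" are plain functions that return the next token, and the verification loop only emulates what a real engine does in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(target_next: NextTokenFn, draft_next: NextTokenFn,
                     context: List[Token], num_draft: int) -> List[Token]:
    """Run one draft-and-verify step and return the tokens accepted in it."""
    # 1. The cheap draft model proposes num_draft tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(num_draft):
        proposed.append(draft_next(context + proposed))

    # 2. The target model checks each proposed position. A real engine scores
    #    all positions in one batched pass; this loop only emulates the result.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except right after a token equal to 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    # Three draft tokens are accepted, the fourth is corrected by the target.
    print(speculative_step(toy_target, toy_draft, [0], num_draft=4))
```

In this sketch the output always matches what the target model would have produced on its own; the gain comes from accepting several tokens per target pass whenever the draft agrees with it.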

Performance and Implications for On-Premise Deployments

The benchmark showed an aggregate throughput of approximately 120 tokens per second, with peaks of 135.7 tokens per second on stepwise math problem-solving and 133.5 tokens per second on summarization. For teams evaluating on-premise LLM deployments, these figures indicate that demanding workloads can run with low latency and good throughput on hardware that is far from high-end.
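To put these rates in perspective, the short Python sketch below converts tokens per second into the time needed to generate a response. The three rates come from the benchmark figures above; the 500-token response length is a hypothetical value chosen for illustration, and the calculation assumes a steady decode rate and ignores prompt processing time.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to decode num_tokens at a steady rate, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Rates reported in the benchmark (tokens per second).
RATES = {"aggregate": 120.0, "stepwise math": 135.7, "summarization": 133.5}

# Hypothetical response length, not a figure from the benchmark.
RESPONSE_TOKENS = 500

for label, rate in RATES.items():
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, rate):.2f} s")
```

At the aggregate rate, a response of that length takes a little over four seconds to generate.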

Fitting the entire model and its assistant into the GPU's VRAM is critical for performance because it reduces data transfers between the CPU and GPU. The benchmark notes that an operating system such as CachyOS, configured with the dGPU as a secondary GPU (one that does not drive the desktop), leaves the most VRAM available for inference, whereas on Windows, or with the dGPU as the main GPU, hundreds of MB can be lost to OS and driver overhead. On-premise deployments therefore involve trade-offs between hardware cost, desired performance, and VRAM requirements, which directly affect Total Cost of Ownership (TCO) and scalability. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs in detail.
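A rough VRAM budget along these lines can be sketched in Python. Everything numeric here is a placeholder: the layer count, head sizes, file sizes and reserved overhead are hypothetical and are not Gemma 4's actual architecture or the benchmark's measurements. The KV cache formula assumes full attention on every layer with an unquantized 16-bit cache; models that use sliding-window layers or a quantized cache need less.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: int = 2) -> int:
    """KV cache size assuming full attention on every layer.

    The leading factor 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_element)


def vram_headroom_bytes(total_vram: int, weights: int, draft_weights: int,
                        kv_cache: int, reserved: int) -> int:
    """VRAM left after loading everything; negative means it does not fit."""
    return total_vram - (weights + draft_weights + kv_cache + reserved)


# Hypothetical placeholder values, for illustration only.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                    context_tokens=8192)
headroom = vram_headroom_bytes(
    total_vram=12 * GIB,
    weights=7 * GIB,         # main model file size (placeholder)
    draft_weights=GIB // 2,  # assistant/draft model (placeholder)
    kv_cache=kv,
    reserved=GIB // 2,       # OS, driver and display overhead (placeholder)
)
print(f"KV cache: {kv / GIB:.2f} GiB, headroom: {headroom / GIB:.2f} GiB")
```

Because the cache grows linearly with the context length, the reserved overhead matters most at large context sizes, where a few hundred MB can decide whether everything still fits on the GPU.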

Future Prospects and Control over AI Infrastructure

Optimizing LLMs such as Gemma 4 12B QAT for inference on hardware with limited VRAM is an important step toward making AI more widely accessible. Running high-performing models on consumer GPUs or entry-level servers opens new opportunities for companies that must keep complete control over their data and models, for both compliance and security reasons. A self-hosted approach also makes air-gapped environments possible, which are essential in sectors with stringent regulatory requirements.

Frameworks like llama.cpp and Quantization-Aware Training techniques will continue to push the limits of what existing hardware can achieve. For CTOs, DevOps leads, and infrastructure architects, understanding these optimizations is crucial for making informed decisions about LLM deployments that balance performance, cost, and control. The trend is clear: AI is becoming increasingly accessible and controllable locally, offering concrete alternatives to cloud services for specific workloads.