Introduction to On-Premise LLM Deployments
The adoption of Large Language Models (LLMs) in enterprise and research contexts often requires careful evaluation of deployment options. While cloud solutions offer scalability and flexibility, on-premise or self-hosted deployments are gaining traction for reasons of data sovereignty, control over Total Cost of Ownership (TCO), and infrastructure customization. In this scenario, the ability to run LLM inference on local hardware, even consumer or prosumer grade, becomes a crucial factor.
A recent benchmark, conducted on a local hardware configuration, tested the performance of Google's Gemma 4 models, exploring how different quantized variants perform on a multi-GPU setup. This analysis provides valuable data for system architects and decision-makers considering LLM implementation in controlled and proprietary environments.
Hardware and Software Configuration Details
The benchmark ran on a desktop system with Kubuntu 26.04, an AMD Ryzen 5 3600 6-core CPU, and 48 GiB of DDR4 RAM at 3600 MHz. Inference is handled by three NVIDIA GTX 1070 GPUs with 8 GiB of VRAM each, for a total of 24 GiB. The operator capped the power of each GPU (120, 121, and 122 watts respectively) using nvidia-smi, a choice that cost an estimated 5% of inference performance but reduced the system's overall power consumption, a relevant aspect for TCO.
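As a rough illustration of the power-capping step, the sketch below builds the nvidia-smi calls for the three limits quoted above. The `-i` (GPU index) and `-pl` (power limit in watts) options are standard nvidia-smi flags, but the mapping of each limit to GPU indices 0, 1, and 2 is an assumption, since the benchmark does not say which card received which cap. The commands are only printed, because applying them normally requires root privileges.

```python
import shlex

# Per-GPU power caps in watts as quoted in the benchmark.
# Assumption: the limits map to GPU indices 0, 1, 2 in that order.
POWER_LIMITS_W = {0: 120, 1: 121, 2: 122}


def power_limit_commands(limits):
    """Return one nvidia-smi argument list per GPU, ordered by GPU index."""
    return [
        ["nvidia-smi", "-i", str(index), "-pl", str(watts)]
        for index, watts in sorted(limits.items())
    ]


if __name__ == "__main__":
    for cmd in power_limit_commands(POWER_LIMITS_W):
        # Printed rather than executed: setting a limit usually needs root.
        print(shlex.join(cmd))
```

Keeping the limits in a small mapping makes it easy to try different caps and compare the resulting throughput against the roughly 5% loss reported here.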
The three graphics cards sit on unevenly sized PCIe links (x16, x4, and x1), with one GPU attached through a PCIe x1 extender, a solution often adopted in mining rigs or to make use of every available slot. Although this configuration slowed down model load times, inference speed remained consistent across runs. The models, in GGUF format, were run with llama.cpp (build 726704a16) with Vulkan support; the framework is known for efficient LLM inference on local CPUs and GPUs.
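llama.cpp ships a benchmarking tool, llama-bench, whose default tests are labelled pp512 and tg128, matching the figures reported in this article. The exact command line used for the benchmark is not given, so the following is only a sketch of how a comparable run could be assembled. The `-m`, `-p`, `-n`, and `-ngl` options exist in llama-bench; the model path and the choice of 99 GPU layers (a common way to request full offload) are assumptions.

```python
import shlex


def llama_bench_command(model_path, prompt_tokens=512, gen_tokens=128, gpu_layers=99):
    """Build a llama-bench argument list for one prompt test and one generation test."""
    return [
        "llama-bench",
        "-m", model_path,          # GGUF model file
        "-p", str(prompt_tokens),  # prompt-processing test size (pp512)
        "-n", str(gen_tokens),     # text-generation test size (tg128)
        "-ngl", str(gpu_layers),   # layers to offload to the GPUs (assumed value)
    ]


if __name__ == "__main__":
    # Hypothetical file name, used for illustration only.
    cmd = llama_bench_command("gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf")
    print(shlex.join(cmd))
```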
Gemma 4 Model Performance
The benchmark evaluated five Gemma 4 variants, with model files ranging from 12.69 GiB to 17.52 GiB. Results are reported in tokens per second (t/s) for two tests: pp512, which measures the processing of a 512-token prompt, and tg128, which measures the generation of 128 tokens. Together they give a picture of how each model handles prompt ingestion and text generation on the tested configuration. Below is a summary of the results:
| Model | Size (GiB) | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| gemma-4-31B-it-UD-Q4_K_XL | 17.52 | 56.21 | 7.12 |
| gemma-4-12b-it-UD-Q8_K_XL | 12.69 | 128.85 | 13.47 |
| gemma-4-26B-A4B-it-UD-Q4_K_XL | 15.83 | 114.05 | 41.28 |
| gemma-4-26B-A4B-it-qat-UD-Q4_K_XL | 13.26 | 123.50 | 53.08 |
| gemma-4-E4B-it-BF16 | 14.00 | 302.16 | 11.54 |
The gemma-4-E4B-it-BF16 model showed the highest prompt-processing throughput (302.16 t/s for pp512), but its token generation was much slower at 11.54 t/s. Among the quantized models, gemma-4-26B-A4B-it-qat-UD-Q4_K_XL stood out for an excellent balance of speed (123.50 t/s for pp512 and 53.08 t/s for tg128) and accuracy, proving particularly effective for coding tasks. These data underscore the importance of quantization for running LLMs on hardware with limited VRAM: it allows larger models to be loaded while still achieving competitive performance.
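To see how the two throughput figures combine, the time for a single request can be approximated as prompt tokens divided by the pp rate plus generated tokens divided by the tg rate. The sketch below applies this to the table values. It assumes the measured rates hold for the chosen token counts and ignores model load time and any scheduling overhead, so the results are only indicative.

```python
# (prompt-processing t/s, generation t/s) copied from the results table.
RESULTS = {
    "gemma-4-31B-it-UD-Q4_K_XL": (56.21, 7.12),
    "gemma-4-12b-it-UD-Q8_K_XL": (128.85, 13.47),
    "gemma-4-26B-A4B-it-UD-Q4_K_XL": (114.05, 41.28),
    "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL": (123.50, 53.08),
    "gemma-4-E4B-it-BF16": (302.16, 11.54),
}


def request_seconds(pp_tps, tg_tps, prompt_tokens=512, gen_tokens=128):
    """Approximate wall time: prompt ingestion plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


if __name__ == "__main__":
    # List the models from fastest to slowest for a 512-in / 128-out request.
    ranked = sorted(RESULTS.items(), key=lambda item: request_seconds(*item[1]))
    for name, (pp, tg) in ranked:
        print(f"{name}: {request_seconds(pp, tg):.1f} s")
```

Under these assumptions generation dominates the total for most of the models, which is why the BF16 variant, despite the fastest prompt processing, does not come out ahead for this request shape.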
Implications for On-Premise Deployments
The results of this benchmark offer several insights for organizations considering on-premise LLM deployments. Firstly, they demonstrate the feasibility of running significant models like Gemma 4 on prosumer-grade hardware, provided it is adequately configured. The ability to manage power consumption through GPU power limits highlights a pragmatic approach to TCO management, balancing performance and long-term operational costs.
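The link between power limits and operating cost can be made concrete with a rough upper bound on GPU energy per generated token: the sum of the power caps divided by the generation rate. This is a sketch under a strong assumption, namely that all three GPUs draw their full cap at the same time; real draw is likely lower, and CPU and other system power are not counted.

```python
# Power caps in watts as set by the operator with nvidia-smi.
GPU_POWER_LIMITS_W = (120, 121, 122)


def max_joules_per_token(tokens_per_second, limits_w=GPU_POWER_LIMITS_W):
    """Upper bound on GPU energy per token, assuming every GPU runs at its cap."""
    return sum(limits_w) / tokens_per_second


if __name__ == "__main__":
    # tg128 rates from the benchmark table for the fastest and slowest generators.
    for name, tg in [
        ("gemma-4-26B-A4B-it-qat-UD-Q4_K_XL", 53.08),
        ("gemma-4-31B-it-UD-Q4_K_XL", 7.12),
    ]:
        print(f"{name}: at most {max_joules_per_token(tg):.1f} J/token")
```

Even as a bound, the comparison shows that generation speed affects energy per token far more than a few watts of difference in the power cap.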
Furthermore, the efficiency of frameworks like llama.cpp and the use of quantized models (such as the Q4_K_XL and Q8_K_XL variants tested here) are key factors in making the most of available VRAM and achieving acceptable throughput. For those evaluating on-premise deployments, these trade-offs are fundamental: the choice between larger, less quantized models (requiring more VRAM and power) and smaller, more quantized models (offering greater efficiency on limited hardware) directly impacts scalability and infrastructure requirements. The ability to keep data and models within the company's perimeter also strengthens data sovereignty and regulatory compliance, which are increasingly critical in today's technological landscape.
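The VRAM side of this trade-off can be checked with simple arithmetic: subtract each model's file size from the 24 GiB total to see what remains for the KV cache and compute buffers. The sketch below does this for the five tested variants. It treats the three 8 GiB cards as one pool, which is a simplification, since in practice the layers must also be split so that each card's share fits in its own 8 GiB.

```python
TOTAL_VRAM_GIB = 3 * 8  # three GPUs with 8 GiB each

# Model file sizes in GiB, copied from the results table.
MODEL_SIZES_GIB = {
    "gemma-4-31B-it-UD-Q4_K_XL": 17.52,
    "gemma-4-12b-it-UD-Q8_K_XL": 12.69,
    "gemma-4-26B-A4B-it-UD-Q4_K_XL": 15.83,
    "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL": 13.26,
    "gemma-4-E4B-it-BF16": 14.00,
}


def vram_headroom_gib(model_gib, total_gib=TOTAL_VRAM_GIB):
    """VRAM left for KV cache and buffers after loading the weights (pooled view)."""
    return total_gib - model_gib


if __name__ == "__main__":
    for name, size in sorted(MODEL_SIZES_GIB.items(), key=lambda item: item[1]):
        print(f"{name}: {vram_headroom_gib(size):.2f} GiB free")
```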