Evaluating GPUs for On-Premise AI: A Practical Comparison

The choice of graphics hardware is a foundational decision for companies planning to run artificial intelligence workloads, particularly Large Language Models (LLMs) and diffusion models. For teams prioritizing on-premise deployments, the decision rests not only on raw computing power but also on energy efficiency, thermal management, and Total Cost of Ownership (TCO). A recent practical comparison pitted several NVIDIA cards against one another, including the RTX 5090 and two RTX 6000 PRO variants (MaxQ and WS/SE), offering useful insights for those who need to balance performance and power consumption.

The analysis focused on one specific diffusion workload, an area that demands significant computational resources for image and video generation. This type of benchmark is particularly relevant for teams building creative or simulation applications, where processing speed and the ability to handle large batch sizes are crucial. The results show how the examined GPUs differ in both speed and efficiency, and why the two should be evaluated together.

Technical Details and Testing Methodology of the Comparison

The test involved an NVIDIA RTX 5090 run at power limits of 400W, 475W, and 600W, and two versions of the RTX 6000 PRO: a MaxQ with a 325W limit and a WS/SE with a 600W limit. Note that the RTX 5090 and the RTX 6000 PRO MaxQ were undervolted and overclocked to optimize their performance, while the RTX 6000 PRO WS/SE was tested at factory settings, because it was rented on a cloud platform and could not be modified.
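The comparison does not say which tool was used to set the power limits. As one possible approach, the sketch below builds the nvidia-smi call that sets a per-GPU power limit (`-i` selects the GPU, `-pl` sets the limit in watts). The function names and the dry-run default are illustrative assumptions, and the sketch does not cover undervolting or overclocking.

```python
import subprocess

# Power limits (watts) named in the comparison for the RTX 5090.
RTX_5090_LIMITS_W = (400, 475, 600)

def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi argument list that sets a GPU power limit."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def apply_power_limit(gpu_index, watts, dry_run=True):
    """Return the command; execute it only when dry_run is False.

    Hypothetical helper: running the command usually needs administrator
    privileges, and the driver rejects values outside the card's allowed range.
    """
    cmd = power_limit_command(gpu_index, watts)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    for watts in RTX_5090_LIMITS_W:
        print(" ".join(apply_power_limit(0, watts)))
```

With the dry-run default the script only prints the commands, so it can be reviewed on a machine without a GPU before anything is changed.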

The software stack included Torch 2.12.0, Sageattention 2.1, and Forge neo, with extensions for RTX upscaling and additional samplers. The workload was a diffusion job with fixed parameters: 896x1088 resolution, 1.5x upscale, a batch size of 4, and 25 base sampling steps followed by 10 hires-pass steps. Results were measured as task completion time. The RTX 5090 at 600W (optimized) finished in 36 seconds, followed by the RTX 6000 PRO WS/SE at 600W (stock) in 39 seconds. The RTX 6000 PRO MaxQ at 325W (optimized) and the RTX 5090 at 400W (optimized) both took 48 seconds.
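To make the reported times easier to compare, the sketch below converts them into throughput (images per minute, using the batch size of 4) and into a slowdown factor relative to the fastest run. The function names are illustrative. Only the four configurations with a reported time are included.

```python
# Completion times in seconds, as reported in the comparison.
BATCH_SIZE = 4  # images per task, from the workload description

completion_s = {
    "RTX 5090 @ 600W (optimized)": 36,
    "RTX 6000 PRO WS/SE @ 600W (stock)": 39,
    "RTX 6000 PRO MaxQ @ 325W (optimized)": 48,
    "RTX 5090 @ 400W (optimized)": 48,
}

def images_per_minute(seconds, batch_size=BATCH_SIZE):
    """Throughput for one batch that finishes in `seconds`."""
    return batch_size * 60.0 / seconds

def slowdown_vs_fastest(times):
    """Map each configuration to its time divided by the fastest time."""
    fastest = min(times.values())
    return {name: t / fastest for name, t in times.items()}

if __name__ == "__main__":
    relative = slowdown_vs_fastest(completion_s)
    for name, seconds in completion_s.items():
        print(f"{name}: {images_per_minute(seconds):.2f} img/min, "
              f"{relative[name]:.2f}x the fastest time")
```

On these figures the two 48-second configurations take about 1.33 times as long as the 600W RTX 5090, and the stock WS/SE about 1.08 times as long.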

Performance Analysis and Efficiency Implications

The data reveals a clear trade-off between raw speed and energy efficiency. The RTX 5090 at 600W with tuning is the fastest configuration, but at a high energy cost. The comparison between the RTX 6000 PRO MaxQ at 325W and the RTX 5090 at 400W is especially interesting: both complete the task in the same time (48 seconds), but the MaxQ does so under a power limit 75W lower, roughly 81% of the 5090's 400W limit. This matters for on-premise deployments, where every watt saved translates into lower operating costs and a smaller thermal footprint.
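The comparison reports power limits and completion times, not measured power draw. As a rough sketch, the code below multiplies each power limit by its completion time to get an upper bound on energy per task. This rests on the assumption that the card draws its full limit for the whole run, which overstates real consumption. The function name is illustrative.

```python
# (power limit in watts, completion time in seconds), both from the comparison.
configs = {
    "RTX 5090 @ 600W (optimized)": (600, 36),
    "RTX 6000 PRO WS/SE @ 600W (stock)": (600, 39),
    "RTX 6000 PRO MaxQ @ 325W (optimized)": (325, 48),
    "RTX 5090 @ 400W (optimized)": (400, 48),
}

def energy_upper_bound_wh(power_limit_w, seconds):
    """Upper bound on energy per task in watt-hours.

    Assumes the GPU sits at its power limit for the entire run; actual
    draw was not reported and is likely lower.
    """
    return power_limit_w * seconds / 3600.0

if __name__ == "__main__":
    for name, (limit_w, seconds) in configs.items():
        print(f"{name}: <= {energy_upper_bound_wh(limit_w, seconds):.2f} Wh per task")
    # Ratio of the two power limits that produced the same 48-second result.
    print(f"MaxQ limit / 5090 limit: {325 / 400:.4f}")
```

Under that assumption the bounds are about 6.0 Wh and 6.5 Wh for the two 600W runs, 5.33 Wh for the RTX 5090 at 400W, and 4.33 Wh for the MaxQ at 325W.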

The RTX 6000 PRO WS/SE, despite running at factory settings, was competitive at 600W, which suggests it might gain further from undervolting and overclocking, although this was not tested. These results show that hardware tuning can substantially affect both efficiency and performance, leaving room for optimization beyond factory specifications. For infrastructure managers, understanding these dynamics is essential for optimizing TCO and keeping AI workloads sustainable over the long term.
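One way to quantify what a higher power limit buys is to compare the two RTX 5090 results with reported times (400W and 600W). The sketch below computes the relative increase in power limit against the relative increase in throughput. It assumes the only relevant difference between the two runs is the power limit. The function name is illustrative.

```python
def scaling(low, high):
    """Compare two runs given as (power_limit_w, seconds).

    Returns (relative power increase, relative throughput increase)
    when moving from `low` to `high`.
    """
    p_low, t_low = low
    p_high, t_high = high
    power_increase = p_high / p_low - 1.0
    # Throughput is inversely proportional to completion time.
    throughput_increase = t_low / t_high - 1.0
    return power_increase, throughput_increase

if __name__ == "__main__":
    # RTX 5090 results from the comparison: 48 s at 400W, 36 s at 600W.
    power, throughput = scaling((400, 48), (600, 36))
    print(f"Power limit: +{power:.0%}, throughput: +{throughput:.0%}")
```

On these two data points, a 50% higher power limit yields roughly 33% more throughput, which is consistent with diminishing returns at the top of the power range.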

Prospects for On-Premise Deployments and Data Sovereignty

This comparison illustrates that the choice of GPU for on-premise AI workloads is not a one-size-fits-all decision. While the RTX 5090 offers maximum speed at its highest power limit, the RTX 6000 PRO MaxQ matches the 400W RTX 5090 at a lower power limit, which is particularly relevant for environments with power or cooling constraints. For companies handling sensitive data or operating in air-gapped contexts, the ability to optimize self-hosted hardware becomes a critical factor not only for performance but also for data sovereignty and compliance.
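To illustrate how a power constraint can change the ranking, the sketch below asks how many cards of each configuration fit under a fixed GPU power budget and what aggregate throughput results. The 2400W budget is a hypothetical figure chosen for illustration. The calculation also assumes independent GPUs that scale linearly, and it ignores host power, cooling overhead, and hardware cost.

```python
BATCH_SIZE = 4          # images per task, from the workload description
GPU_BUDGET_W = 2400     # hypothetical power budget for GPUs only

# (power limit in watts, completion time in seconds), from the comparison.
configs = {
    "RTX 5090 @ 600W (optimized)": (600, 36),
    "RTX 6000 PRO WS/SE @ 600W (stock)": (600, 39),
    "RTX 6000 PRO MaxQ @ 325W (optimized)": (325, 48),
    "RTX 5090 @ 400W (optimized)": (400, 48),
}

def gpus_within_budget(budget_w, power_limit_w):
    """Whole number of cards whose combined power limits fit the budget."""
    return budget_w // power_limit_w

def aggregate_images_per_minute(n_gpus, seconds, batch_size=BATCH_SIZE):
    """Combined throughput, assuming each card works independently."""
    return n_gpus * batch_size * 60.0 / seconds

if __name__ == "__main__":
    for name, (limit_w, seconds) in configs.items():
        n = gpus_within_budget(GPU_BUDGET_W, limit_w)
        rate = aggregate_images_per_minute(n, seconds)
        print(f"{name}: {n} cards, {rate:.1f} img/min")
```

Under these assumptions, seven MaxQ cards fit in the budget against four 600W cards, so the slower card per unit can come out ahead in aggregate. A real sizing exercise would need measured draw and per-card pricing.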

Evaluating these trade-offs is fundamental for CTOs, DevOps leads, and infrastructure architects. AI-RADAR offers analytical frameworks on /llm-onpremise to support these decisions, providing tools to assess the constraints and opportunities of local deployments versus cloud solutions. The ability to achieve high performance while reducing energy consumption can make a significant difference in the overall TCO of an AI infrastructure, making efficiency as important a parameter as raw speed.