Jetson Orin NX: On-Premise LLM Inference and Benchmarking for Hermes Agent

Introduction

Large Language Model (LLM) deployment is increasingly moving beyond traditional cloud data centers, and one example is the reuse of existing hardware for AI workloads at the edge. Recently, a user repurposed an NVIDIA Jetson Orin NX, originally intended for a robotics project, for on-premise LLM inference. The project reflects a growing trend of using local compute for AI applications, helped by more efficient models and by techniques such as Mixture of Experts (MoE) and quantization.

The primary goal of the project was to turn a compact device into a silent, high-performance LLM server capable of handling large context windows. Running inference locally keeps data under the operator's control and reduces latency, both of which matter for enterprise and industrial applications that require data sovereignty or must operate in air-gapped environments.

Technical Details and Challenges

The Jetson Orin NX is a powerful platform for edge computing, but intensive workloads such as LLM inference expose its limits. On the version used, power consumption rose from 25W to 40W, which directly increases the heat that must be dissipated and, with it, system noise. To keep the system as quiet as possible, the user made significant hardware modifications, including adapting the stock heatsink and building a new custom chassis.

The performance targets set in advance were ambitious for a device of this size: more than 10 tokens/s for text generation (TG) and more than 300 tokens/s for prompt processing (PP), with a context window of at least 65K tokens, specifically for the Hermes Agent application. To evaluate these capabilities, the user tested numerous models, including variants of Gemma 4 and Qwen 3.6, with different quantization configurations, looking for the right balance between performance and memory requirements.
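The targets above can be written down as a small check. What follows is a minimal Python sketch, not the user's actual benchmarking code: the names `Targets`, `Measurement`, `meets_targets`, and `prefill_seconds` are hypothetical, "65K" is read as 65,000 tokens, and the sample measurement is invented purely to exercise the functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Targets as stated in the article (names are illustrative)."""
    min_tg_tps: float = 10.0   # text generation, tokens/s (must be exceeded)
    min_pp_tps: float = 300.0  # prompt processing, tokens/s (must be exceeded)
    min_context: int = 65_000  # context window in tokens; assumes 65K = 65,000


@dataclass(frozen=True)
class Measurement:
    """One benchmark result for a model/quantization combination."""
    model: str
    tg_tps: float
    pp_tps: float
    context: int


def meets_targets(m: Measurement, t: Targets = Targets()) -> bool:
    """Return True only if all three targets are satisfied."""
    return (
        m.tg_tps > t.min_tg_tps
        and m.pp_tps > t.min_pp_tps
        and m.context >= t.min_context
    )


def prefill_seconds(prompt_tokens: int, pp_tps: float) -> float:
    """Time to process a prompt of the given length at a given PP rate."""
    return prompt_tokens / pp_tps


# Hypothetical measurement, NOT a result reported in the article.
sample = Measurement(model="example-model", tg_tps=10.5, pp_tps=320.0, context=66_000)
print(meets_targets(sample))                   # True
print(round(prefill_seconds(65_000, 300.0)))   # 217
```

The second figure shows why the PP target matters for an agent workload: even at exactly 300 tokens/s, ingesting a full 65,000-token prompt from scratch takes about 217 seconds.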

Results and Implications for Edge AI

The tests produced promising results, particularly with the Gemma 4 26B A4B UD Q2_K_XL model. This configuration supported a 66K token context window, above the 65K target. For text generation, the system recorded 14.65 tokens/s at a context of approximately 8K tokens, and 10.21 tokens/s when the context grew to about 60K tokens. These figures show that the Jetson Orin NX can handle complex LLM workloads, including multiple tool calls with long prompts, a fundamental requirement for advanced AI agents.
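To put the two throughput figures in perspective, the short Python sketch below computes the relative slowdown between the 8K and 60K context measurements and the time needed to generate a reply at each rate. The function names and the 500-token reply length are assumptions chosen for illustration; only the two tokens/s values come from the reported results.

```python
def relative_drop(baseline_tps: float, degraded_tps: float) -> float:
    """Fractional loss in throughput relative to the baseline."""
    return (baseline_tps - degraded_tps) / baseline_tps


def generation_seconds(n_tokens: int, tps: float) -> float:
    """Time to generate n_tokens at a constant rate of tps tokens/s."""
    return n_tokens / tps


TG_AT_8K = 14.65   # tokens/s at ~8K context (reported)
TG_AT_60K = 10.21  # tokens/s at ~60K context (reported)
REPLY_TOKENS = 500  # hypothetical reply length, not from the article

print(f"{relative_drop(TG_AT_8K, TG_AT_60K):.1%}")             # 30.3%
print(f"{generation_seconds(REPLY_TOKENS, TG_AT_8K):.1f} s")   # 34.1 s
print(f"{generation_seconds(REPLY_TOKENS, TG_AT_60K):.1f} s")  # 49.0 s
```

Under these assumptions, generation slows by roughly 30% as the context fills, and the 60K figure leaves only a small margin above the 10 tokens/s target.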

This type of on-premise or edge deployment is particularly relevant for companies that must process sensitive data locally to comply with privacy regulations and data sovereignty requirements. Running capable LLMs on compact, low-power hardware opens new opportunities in industrial, healthcare, and security applications, where dependence on the cloud can be a constraint. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the trade-offs between costs, performance, and control.

Future Prospects and Considerations

The project shows how much hardware and software optimization matters when extending LLM capabilities to edge computing scenarios. Achieving robust LLM inference on a device like the Jetson Orin NX, with targeted modifications, paves the way for distributed and highly customized AI solutions. Beyond greater data control and security, this approach can also lead to a more favorable Total Cost of Ownership (TCO) in the long term by reducing the operational costs of continuous cloud resource usage.

The experience demonstrates that, with the right engineering, the perceived limitations of edge hardware can be overcome, turning it into a strategic asset for distributed artificial intelligence. Ongoing work on more efficient models and advanced quantization techniques will keep expanding the possibilities for LLM deployment on resource-constrained platforms, making generative AI accessible in increasingly varied and specific contexts.