The Return of Jetson Orin NX for LLM Inference

The evolution of Large Language Models (LLMs), with the emergence of more efficient architectures like Mixture of Experts (MoE) and smaller models, is opening new opportunities for on-premise and edge deployments. A recent project demonstrated how an NVIDIA Jetson Orin NX, originally intended for a robotics application, can be successfully repurposed for LLM inference, exceeding performance and capability expectations.

The primary goal was to build a host for the Hermes Agent that was as quiet as possible while sustaining more than 10 tokens/s for text generation (TG) and more than 300 tokens/s for prompt processing (PP), with a context window of at least 65K tokens. These targets matter for applications that need fast responses and must handle long, complex inputs, as advanced conversational agents typically do.
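To make these targets concrete, the following minimal sketch estimates how long a single request would take at exactly the target rates. It assumes a simple model in which prompt processing and generation run one after the other with no prompt caching; the function name and the example request sizes are illustrative, not taken from the project.

```python
# Targets quoted in the article.
PP_TARGET_TOK_S = 300.0  # prompt processing, tokens/s
TG_TARGET_TOK_S = 10.0   # text generation, tokens/s


def response_time_s(prompt_tokens: int, output_tokens: int,
                    pp_tok_s: float = PP_TARGET_TOK_S,
                    tg_tok_s: float = TG_TARGET_TOK_S) -> float:
    """Estimate end-to-end seconds as prefill time plus decode time.

    Simplification (assumption): no prompt caching and no overlap
    between the two phases.
    """
    if pp_tok_s <= 0 or tg_tok_s <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


if __name__ == "__main__":
    # Hypothetical request sizes, chosen only for illustration.
    for prompt, output in [(2_000, 200), (8_000, 300), (60_000, 500)]:
        t = response_time_s(prompt, output)
        print(f"{prompt:>6} prompt + {output:>3} output tokens: ~{t:.0f} s")
```

Under these assumptions, a 60,000-token prompt alone takes about 200 seconds to process at 300 tokens/s, which is why the prompt processing target matters as much as the generation target for long-context agent workloads.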

Technical Details and Hardware Optimizations

The Jetson Orin NX, a powerful yet compact edge computing platform, saw its power consumption rise from 25W to 40W. The extra heat made thermal management a significant challenge and required custom hardware work: to meet both the silence and heat dissipation requirements, the stock heatsink was modified and a new case was designed. This illustrates how edge deployments often involve bespoke engineering of the operating environment.
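A rough way to see why the jump from 25W to 40W strains a stock cooler is the lumped thermal resistance relation, where the cooling path must satisfy R = (T_max - T_ambient) / P. The sketch below applies it to both power levels. The temperature limits are hypothetical placeholders, not Jetson specifications, and the model ignores airflow, hotspots, and transients.

```python
def required_thermal_resistance(power_w: float, t_max_c: float,
                                t_ambient_c: float) -> float:
    """Maximum allowed thermal resistance (degC per watt) of the cooling path.

    Lumped steady-state model: T_max = T_ambient + power * R.
    """
    if power_w <= 0:
        raise ValueError("power must be positive")
    if t_max_c <= t_ambient_c:
        raise ValueError("t_max_c must exceed t_ambient_c")
    return (t_max_c - t_ambient_c) / power_w


# Hypothetical limits for illustration only (NOT from the article or a datasheet).
T_MAX_C = 85.0
T_AMBIENT_C = 25.0

if __name__ == "__main__":
    for power_w in (25.0, 40.0):  # power levels quoted in the article
        r = required_thermal_resistance(power_w, T_MAX_C, T_AMBIENT_C)
        print(f"{power_w:.0f} W -> cooling path must stay below {r:.2f} degC/W")
```

With these placeholder temperatures, the allowed thermal resistance drops from 2.4 to 1.5 degC/W, so the cooler has to shed 60% more heat for the same temperature rise. Doing that without a louder fan is consistent with the need for a modified heatsink and a new case.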

Benchmarks covered several models, including Gemma 4 and Qwen 3.6, in various quantization configurations. The most promising results came from the Gemma 4 26B A4B model in its UD Q2_K_XL quantized variant. This configuration allowed a 66K-token context window, with generation throughput of 14.65 tokens/s at contexts of around 8K tokens and 10.21 tokens/s at extended contexts of up to 60K tokens. This performance proved adequate for handling multiple tool calls with long prompts, a fundamental requirement for the Hermes Agent.
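The reported figures can be checked against the project targets with a few lines of arithmetic. The sketch below is only an illustration: it assumes the quoted throughput numbers are text generation rates, reads "8K", "60K", "65K" and "66K" as round thousands of tokens, and uses hypothetical names throughout.

```python
from dataclasses import dataclass

TG_TARGET_TOK_S = 10.0       # text-generation target from the article
MIN_CONTEXT_TOKENS = 65_000  # "65K" read as 65,000 tokens (assumption)


@dataclass(frozen=True)
class TgMeasurement:
    context_tokens: int  # approximate context length during the measurement
    tok_per_s: float     # reported generation throughput


# Figures reported in the article for Gemma 4 26B (A4B, UD Q2_K_XL).
REPORTED = [
    TgMeasurement(context_tokens=8_000, tok_per_s=14.65),
    TgMeasurement(context_tokens=60_000, tok_per_s=10.21),
]
REPORTED_CONTEXT_WINDOW = 66_000  # "66K" read as 66,000 tokens (assumption)


def meets_tg_target(m: TgMeasurement, target: float = TG_TARGET_TOK_S) -> bool:
    """True if the measured throughput is strictly above the target."""
    return m.tok_per_s > target


def slowdown_pct(short_ctx: TgMeasurement, long_ctx: TgMeasurement) -> float:
    """Relative throughput loss, in percent, from short to long context."""
    return (short_ctx.tok_per_s - long_ctx.tok_per_s) / short_ctx.tok_per_s * 100.0


if __name__ == "__main__":
    for m in REPORTED:
        status = "OK" if meets_tg_target(m) else "below target"
        print(f"~{m.context_tokens} tokens: {m.tok_per_s} tok/s ({status})")
    print(f"context window OK: {REPORTED_CONTEXT_WINDOW >= MIN_CONTEXT_TOKENS}")
    print(f"slowdown at long context: {slowdown_pct(*REPORTED):.1f}%")
```

Both measurements clear the 10 tokens/s target, but the long-context figure does so by only about 2%. Generation slows by roughly 30% between the two context sizes.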

Implications for On-Premise and Edge Deployments

This project underscores the increasing feasibility and attractiveness of on-premise and edge LLM deployments. Utilizing hardware like the Jetson Orin NX offers significant advantages in terms of data sovereignty, direct control over infrastructure, and the ability to operate in air-gapped environments or with limited connectivity. For companies that must comply with strict privacy regulations or handle sensitive data, a self-hosted architecture becomes a strategic choice.

While edge deployments may require an initial investment in hardware customization and configuration, the long-term benefits in terms of Total Cost of Ownership (TCO) and operational autonomy can be substantial. Performing LLM inference directly on the device reduces reliance on cloud services, which cuts recurring usage fees and removes the network round trip to a remote provider. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs in detail, helping decision-makers choose the most suitable approach for their needs.
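As a rough illustration of the TCO argument, the sketch below computes a break-even point between a one-time hardware outlay and a recurring cloud bill. Only the 40W power figure comes from the article. The hardware cost, electricity price, and cloud spend are hypothetical inputs, and the model ignores maintenance, engineering time, and depreciation.

```python
from typing import Optional

HOURS_PER_MONTH = 24 * 30  # 30-day month, a simplification


def monthly_energy_cost(power_w: float, price_per_kwh: float) -> float:
    """Electricity cost of running continuously at power_w for one month."""
    return power_w / 1000.0 * HOURS_PER_MONTH * price_per_kwh


def breakeven_months(hardware_cost: float, cloud_monthly: float,
                     local_monthly: float) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does."""
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return None
    return hardware_cost / saving


if __name__ == "__main__":
    # All monetary values below are hypothetical placeholders.
    local = monthly_energy_cost(power_w=40.0, price_per_kwh=0.30)
    months = breakeven_months(hardware_cost=900.0, cloud_monthly=60.0,
                              local_monthly=local)
    print(f"local running cost: {local:.2f} per month")
    print("break-even:", "never" if months is None else f"{months:.1f} months")
```

The result is very sensitive to the assumed cloud spend. If the monthly cloud bill is lower than the local electricity cost, the function returns None, meaning the hardware never pays for itself on cost alone.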

Future Prospects for AI on Compact Hardware

LLM optimization and advances in quantization techniques continue to push the boundaries of what is achievable on compact hardware. The success of this Jetson Orin NX deployment shows that a large LLM server is not always necessary to achieve usable performance. The ability to run complex models with large context windows on low-power devices paves the way for new applications in sectors such as robotics, industrial automation, and intelligent embedded systems.

This approach requires specific technical expertise for integration and optimization, but it offers a concrete path for organizations seeking to balance performance, control, and cost. The trend towards more efficient LLMs and more powerful edge hardware suggests that increasingly advanced AI solutions will be deployed directly where data is generated and used, which can improve both efficiency and operational security.