The Importance of Deep Context in On-Premise LLMs
In the rapidly evolving landscape of Large Language Models (LLMs), the ability to manage extended contexts has become a crucial differentiating factor, especially for enterprises opting for on-premise deployments. The need to analyze large volumes of data, such as existing codebases for change requests or debugging, requires LLMs capable of efficiently processing long prompts. This scenario, where “prompt processing” (PP) constitutes the majority of the processing time (estimated between 95% and 99%), poses significant challenges in terms of hardware requirements and performance.
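To see how prompt processing can account for such a large share of the time, the arithmetic can be sketched as follows. The function name and all input values are illustrative assumptions chosen for this example, not measurements from the benchmark.

```python
def prompt_processing_share(prompt_tokens, pp_tps, output_tokens, tg_tps):
    """Return the fraction of total request time spent on prompt processing."""
    pp_seconds = prompt_tokens / pp_tps    # time to ingest the prompt
    tg_seconds = output_tokens / tg_tps    # time to generate the answer
    return pp_seconds / (pp_seconds + tg_seconds)


# Illustrative inputs only (assumed, not benchmark data):
# a 100,000-token prompt at 100 TPS, then a 500-token answer at 10 TPS.
share = prompt_processing_share(
    prompt_tokens=100_000, pp_tps=100.0, output_tokens=500, tg_tps=10.0
)
print(f"Prompt processing share: {share:.1%}")  # 1000 s of 1050 s total
```

With these assumed inputs, prompt ingestion takes 1,000 seconds and generation 50 seconds, so prompt processing lands at roughly 95% of the total.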
An informal analysis, conducted on a local setup based on Strix Halo with 128GB of shared memory, Ubuntu 26.04, and a Vulkan backend, compared several models in the 120B class. The objective was to evaluate their performance in deep context scenarios, with a particular focus on Nemotron Super 120B, a model that has shown interesting potential in context management.
Methodology and Models Under Comparison
The benchmark involved Nemotron Super 120B, GPT-OSS 120B, and Qwen 3.5 122B A10B, alongside Qwen 3.6 35B A3B as a reference for smaller, faster models. The methodology was based on llama-bench, with a “usability” threshold set at 100 tokens per second (TPS) for prompt processing. Testing of a model stopped once it fell below this threshold. A fundamental aspect that emerged is the variation in the maximum context depth the models support: GPT-OSS handles up to approximately 128,000 tokens, Qwen 3.5 and 3.6 reach about 256,000 tokens, and Nemotron Super extends up to 400,000 tokens.
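The stop rule described above can be sketched as a small loop. This is a minimal illustration, not the author's actual script: `measure_pp_tps` is a hypothetical callable standing in for one llama-bench run at a given depth, and the fake curve below is invented purely to exercise the loop.

```python
from typing import Callable, Iterable


def sweep_depths(
    measure_pp_tps: Callable[[int], float],
    depths: Iterable[int],
    threshold_tps: float = 100.0,
) -> list[tuple[int, float]]:
    """Measure prompt processing at increasing depths and stop after the
    first result that falls below the usability threshold."""
    results = []
    for depth in depths:
        tps = measure_pp_tps(depth)  # hypothetical: one benchmark run
        results.append((depth, tps))
        if tps < threshold_tps:
            break  # the model is no longer "usable"; skip deeper contexts
    return results


def fake_measure(depth: int) -> float:
    """Invented decay curve for demonstration only (not benchmark data)."""
    return 400.0 / (1 + depth / 32_000)


results = sweep_depths(fake_measure, [0, 32_000, 64_000, 128_000, 256_000])
for depth, tps in results:
    print(f"depth={depth:>7} pp={tps:6.1f} TPS")
```

In this invented example the sweep records the 128,000-token run, which falls below 100 TPS, and never attempts 256,000 tokens.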
This difference in context capacity is particularly relevant for workloads requiring the analysis of extensive documents or complex codebases. The ability to hold a larger context directly in the model's context window reduces the need for external chunking or summarization techniques, simplifying pipelines and potentially improving the accuracy of responses.
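A rough way to see the effect on chunking is to count how many passes a large input would need under each context limit. The sketch below uses the approximate limits quoted in this article; the 350,000-token input size and the 8,000-token reserve for instructions and output are assumptions made for illustration.

```python
import math


def chunks_needed(document_tokens, max_context, reserved_tokens=8_000):
    """Return how many chunks are needed to fit a document into a model's
    context, keeping some tokens in reserve for instructions and output."""
    usable = max_context - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than max_context")
    return math.ceil(document_tokens / usable)


# Approximate context limits as quoted in the article.
context_limits = {
    "GPT-OSS 120B": 128_000,
    "Qwen 3.5 122B A10B": 256_000,
    "Nemotron Super 120B": 400_000,
}

document_tokens = 350_000  # assumed size of a codebase dump, illustrative
chunk_counts = {
    name: chunks_needed(document_tokens, limit)
    for name, limit in context_limits.items()
}
for name, count in chunk_counts.items():
    print(f"{name}: {count} chunk(s)")
```

Under these assumptions, only the 400,000-token limit takes the input in a single pass; the others need two or three chunks plus whatever logic merges the partial answers.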
Deep Context Performance: Nemotron Super Stands Out
The benchmark results confirmed the initial impression: Nemotron Super handles deep context exceptionally well compared to its direct competitors. Specifically, the “speed king” GPT-OSS 120B rapidly loses prompt processing efficiency as context grows, to the extent that Nemotron Super surpasses it at a context depth of 32,000 tokens. The gap is even more pronounced with Qwen 3.5 122B A10B, which is surpassed almost immediately, at 16,000 tokens of depth. Surprisingly, at its maximum context of approximately 256,000 tokens, even the smaller Qwen 3.6 35B A3B delivers prompt processing speed comparable to that of Nemotron Super.
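Finding the depth at which one model overtakes another is a simple scan over two measured curves. The sketch below shows the idea; the function name is an assumption, and the two series are placeholder numbers shaped to mimic a fast-starting model and a slowly degrading one. They are not the benchmark's measurements.

```python
def first_crossover(depths, tps_a, tps_b):
    """Return the first depth at which model A is strictly faster than
    model B, or None if A never overtakes B in the measured range."""
    for depth, a, b in zip(depths, tps_a, tps_b):
        if a > b:
            return depth
    return None


# Placeholder curves (invented for illustration, not benchmark data).
depths = [0, 8_000, 16_000, 32_000, 64_000]
steady_model = [300.0, 290.0, 280.0, 270.0, 250.0]  # degrades slowly
fast_start = [600.0, 450.0, 340.0, 260.0, 180.0]    # degrades quickly

crossover = first_crossover(depths, steady_model, fast_start)
print(f"Steady model overtakes at depth: {crossover}")
```

The same scan applied to real llama-bench output would identify crossover points like the ones reported above.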
Regarding token generation (TG) speed, considered less critical for this specific use case, Nemotron Super reaches “usable” values (above 10 TPS) but not “fun” ones (above 20 TPS). Its performance degrades slowly to “barely usable” at approximately 400,000 tokens of depth, which is still a remarkable result given the context extension. The most direct competitor, Qwen 3.5 122B A10B, shows similar generation speed at 128,000 tokens of context. It is important to note that Multi-Token Prediction (MTP) was not enabled during these tests, which could further influence generation performance.
Implications for On-Premise Deployments and Final Considerations
These results offer valuable insights for CTOs, DevOps leads, and infrastructure architects evaluating self-hosted LLM solutions. For workloads primarily requiring efficient prompt processing over very large contexts, Nemotron Super emerges as a reasonable choice, especially if the goal is to maintain data sovereignty and full control over the infrastructure. Its ability to handle 400,000 tokens of context reduces pipeline complexity and makes the model more useful for intensive analytical tasks.
However, if high token generation speed is the priority for contexts below 128,000 tokens, Nemotron might not be the optimal solution. In such cases, or when such a large model is not necessary, smaller variants of Qwen 3.6, like the 35B model, represent a valid alternative. Choosing an LLM for an on-premise deployment means balancing performance requirements (PP vs TG), context depth, model size, and available hardware resources. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs, supporting informed decisions that consider the Total Cost of Ownership (TCO) and specific operational needs.