The Uncertain Future of 100-120B Large Language Models

The Silence Around 100-120B Large Language Models

The landscape of Large Language Models (LLMs) is constantly evolving, with new architectures and models emerging at a rapid pace. However, a recent market analysis reveals an unexpected trend: a notable absence of new releases in the range of roughly 100 to 120 billion parameters. This category, which includes models such as GPT-OSS-120B, GLM-4.5-Air, Nemotron-3-Super, Qwen3.5-122B, and Mistral-Small-4-119B, appears to have stagnated.

All of these models are at least three months old, and GPT-OSS-120B, the progenitor of the group, is now ten months old. This "silence" contrasts with the activity observed in other size ranges, raising important questions for CTOs, DevOps leads, and infrastructure architects planning deployment strategies for AI workloads.

A Polarized Market: Small Models or Giants

The current wave of LLM releases is polarizing towards two well-defined extremes. On one hand, we are witnessing the emergence of more compact models, in the 25-35 billion parameter range, such as Gemma4 and Qwen3.6. These models are often optimized for efficiency, targeting use cases that require fewer computational resources and can be run on less demanding hardware, sometimes even on edge devices or servers with mid-range GPUs.

On the other hand, the market is seeing an acceleration in the development of ultra-large models, with over 200 billion parameters. Recent examples include Step 3.5/3.7 Flash, DeepSeek-V4-Flash, MiniMax-M3, and Nemotron-3-Ultra. These giants promise advanced capabilities and superior performance, but they require extremely powerful computing infrastructure, often clusters of latest-generation GPUs with high VRAM and high-speed interconnects. The intermediate 100-120B range, which in some cases adopts Mixture of Experts (MoE) architectures to reduce the cost of inference, seems to have been left in limbo.
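To make the MoE point concrete, here is a minimal sketch of why such architectures help at inference time: weight memory scales with the total parameter count, while per-token compute scales only with the parameters that are active for each token. The function name `moe_profile` and the 10B active-parameter figure are hypothetical and chosen purely for illustration. The estimate of about 2 FLOPs per active parameter per token is a common rule of thumb, not a measurement of any model named above.

```python
def moe_profile(total_params_billion: float,
                active_params_billion: float,
                bits_per_param: int = 16) -> tuple[float, float]:
    """Rough inference profile of a model (illustrative estimate only).

    Returns (weight memory in GB, forward-pass GFLOPs per token).
    """
    # All experts must be resident in memory, so memory follows TOTAL params.
    weight_gb = total_params_billion * bits_per_param / 8
    # Only the routed experts run per token, so compute follows ACTIVE params.
    # Rule of thumb: ~2 FLOPs per active parameter per generated token.
    gflops_per_token = 2 * active_params_billion
    return weight_gb, gflops_per_token


# A dense 120B model: every parameter is active for every token.
dense = moe_profile(total_params_billion=120, active_params_billion=120)

# A hypothetical 120B MoE model with 10B active parameters per token.
moe = moe_profile(total_params_billion=120, active_params_billion=10)

print(f"dense: {dense[0]:.0f} GB weights, {dense[1]:.0f} GFLOPs/token")
print(f"MoE:   {moe[0]:.0f} GB weights, {moe[1]:.0f} GFLOPs/token")
```

Under these assumptions both models need the same memory for weights, but the MoE variant does a fraction of the work per token. In other words, MoE lowers the compute cost of inference, not the VRAM footprint.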

Implications for On-Premise Deployments and Data Sovereignty

The lack of new LLMs in the 100-120B range has significant repercussions for organizations evaluating on-premise deployments or self-hosted solutions. This category of models, while resource-intensive, could represent an interesting compromise: capabilities closer to those of the larger models, with infrastructure that is easier to manage than what the 200B+ giants demand. Running a 100-120B model typically requires high-VRAM enterprise GPUs, such as the NVIDIA A100 80GB or H100, often in multi-GPU configurations.
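The multi-GPU requirement follows from simple arithmetic on weight size. The sketch below is a rough sizing estimate, not a benchmark. It assumes 80 GB of VRAM per GPU (as on the A100 80GB), and the 1.2 overhead factor for KV cache and activations is an illustrative assumption that varies widely with context length and batch size. The helper names are hypothetical.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


def gpus_needed(params_billion: float,
                bits_per_param: int,
                gpu_vram_gb: float = 80.0,
                overhead: float = 1.2) -> int:
    """Minimum GPU count to hold weights plus an assumed runtime overhead."""
    total_gb = weight_memory_gb(params_billion, bits_per_param) * overhead
    return math.ceil(total_gb / gpu_vram_gb)


# A 120B-parameter model at three common weight precisions.
for bits in (16, 8, 4):
    mem = weight_memory_gb(120, bits)
    n = gpus_needed(120, bits)
    print(f"{bits:>2}-bit weights: {mem:.0f} GB, at least {n} x 80 GB GPU(s)")
```

With these assumptions, a 120B model needs about 240 GB for 16-bit weights (four 80 GB GPUs once overhead is included), while 4-bit quantization brings the weights down to about 60 GB, which is at the edge of what a single 80 GB card can hold.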

Opting for smaller models (25-35B) reduces hardware requirements and total cost of ownership (TCO), but may limit capabilities. Conversely, 200B+ models require massive infrastructure investments, potentially pushing companies that lack an adequate capital expenditure (CapEx) budget towards cloud solutions. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to understand the trade-offs between performance, costs, and infrastructure requirements, while ensuring data sovereignty and compliance in air-gapped environments.

Future Prospects and the Search for the Right Balance

The question is whether the 100-120B LLM family, particularly the models based on MoE architectures, is destined to "die out" as the 70-80B range did in the past, or whether this is a strategic pause ahead of new optimizations or releases planned for H2 2026. Developers may be concentrating on smaller models to maximize accessibility and efficiency, or on much larger models to push the boundaries of capabilities, leaving the intermediate range as a lower priority.

For businesses, monitoring these trends is crucial. The choice of model size directly influences hardware selection, budget planning, and the overall AI adoption strategy. Finding the right balance between model capabilities, hardware requirements, and TCO remains a central challenge, especially for those prioritizing the control and security offered by self-hosted deployments.