APEX Expands Support for MoE Large Language Models with New Quantizations
APEX, a quantization strategy designed specifically for Large Language Models (LLMs) built on the Mixture-of-Experts (MoE) architecture, has announced a significant expansion of its model collection. Following its initial introduction with Qwen 3.5 35B-A3B, the catalog has grown to include more than 30 new MoE models covering the major LLM families. The expansion aims to make these models more efficient in both memory footprint and inference speed, critical factors for on-premise deployments.
A key innovation is the introduction of the I-Nano tier, an ultra-high compression level that promises to further reduce the models' footprint. These developments are particularly relevant for infrastructure architects and DevOps leads seeking solutions to run powerful LLMs on local hardware while maintaining high standards of performance and model fidelity.
Technical Details and Advantages of APEX Quantization
APEX employs a mixed-precision quantization strategy that is aware of the MoE structure. User feedback indicates that APEX's I-Balanced and I-Compact tiers maintain remarkable long-context coherence beyond 32,000 tokens on 30-50B-class MoEs. This result stands out against uniform Q4_K quantizations, which tend to show visible degradation in similar scenarios. The working hypothesis is that APEX keeps shared experts and edge layers, where rare or long-range tokens are routed and processed, at high precision, thereby preserving long-context behavior.
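To make the hypothesis concrete, here is a minimal sketch of what an MoE-aware mixed-precision policy could look like. It is illustrative only: the tensor names, layer count, and edge-band width are assumptions, not APEX's published mapping.

```python
# Minimal sketch of an MoE-aware mixed-precision policy (illustrative:
# tensor names, layer count, and band widths are assumptions, not
# APEX's published mapping).

N_LAYERS = 48   # assumed depth of a 30-50B-class MoE
EDGE_BAND = 4   # first/last layers treated as "edge" and kept precise

def precision_for(tensor_name: str, layer: int) -> str:
    """Pick a precision band for one tensor.

    Shared experts see every token, and edge layers handle rare or
    long-range tokens, so both stay at high precision; routed experts
    in the middle of the stack absorb the aggressive compression.
    """
    if "shared_expert" in tensor_name:
        return "high"                     # e.g. Q5_K in the tiers below
    is_edge = layer < EDGE_BAND or layer >= N_LAYERS - EDGE_BAND
    if "routed_expert" in tensor_name and not is_edge:
        return "low"                      # e.g. Q4_K and under, tier-dependent
    return "high"                         # attention, norms, router, edge experts

print(precision_for("blk.24.routed_expert.ffn_up", 24))  # -> low
print(precision_for("blk.24.shared_expert.ffn_up", 24))  # -> high
print(precision_for("blk.1.routed_expert.ffn_up", 1))    # -> high (edge layer)
```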
On coding performance, users of Qwen3.6 35B-A3B report that the I-Compact and I-Mini tiers come surprisingly close to F16 quality on real code tasks, despite their reduced size. The new I-Nano (IQ2_XXS) tier pushes compression further, bringing mid-layer routed experts down to 2.06 bits per weight (bpw), with near-edge experts at IQ2_S and edge experts at Q3_K, while shared experts remain at Q5_K. This translates into significant VRAM savings: Qwen 3.5 35B-A3B, for example, drops from 13 GB (I-Mini) to 11 GB (I-Nano). Compression this aggressive is only viable because of the sparse per-token expert activation typical of MoE architectures, and it requires an importance matrix (imatrix) during quantization.
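To see why the numbers land where they do, a back-of-the-envelope estimate helps. The sketch below assumes a hypothetical split of the 35B parameters across tensor groups (the true split is model-specific) and approximate per-type bit widths (IQ2_S ≈ 2.5 bpw, Q3_K ≈ 3.4 bpw, Q5_K ≈ 5.5 bpw); only the 2.06 bpw figure comes from the tier spec above.

```python
# Back-of-the-envelope VRAM estimate for an I-Nano-style allocation.
# The parameter split across groups is a hypothetical assumption for a
# 35B-A3B-class MoE; the bpw figures for IQ2_S/Q3_K/Q5_K are approximate.

GiB = 1024 ** 3

groups = {
    # group               (params,  bits per weight)
    "mid routed experts": (27.0e9, 2.06),  # IQ2_XXS, from the tier spec
    "near-edge experts":  (3.0e9,  2.5),   # IQ2_S, approx.
    "edge experts":       (2.0e9,  3.4),   # Q3_K, approx.
    "shared experts":     (1.5e9,  5.5),   # Q5_K, approx.
    "attention + other":  (1.5e9,  5.5),   # kept at high precision
}

total_bytes = 0.0
for name, (params, bpw) in groups.items():
    size = params * bpw / 8
    total_bytes += size
    print(f"{name:20s} {size / GiB:5.2f} GiB")

print(f"{'total weights':20s} {total_bytes / GiB:5.2f} GiB")
# ~10 GiB of weights: in the same ballpark as the 11 GB quoted for I-Nano.
```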
Implications for On-Premise Deployment and Data Sovereignty
The expansion of the APEX collection and the introduction of the I-Nano tier have profound implications for organizations considering on-premise LLM deployment. The ability to run 30-70B-class MoE models on a single consumer GPU, thanks to quantizations like I-Mini and I-Compact, drastically reduces hardware requirements and, consequently, the Total Cost of Ownership (TCO). This approach offers a concrete alternative to cloud services, allowing companies to maintain full control over their data and infrastructure, a fundamental aspect for data sovereignty and compliance in air-gapped environments.
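As a rough sizing sanity check for the single-GPU claim, weights plus KV cache can be estimated as follows. This is a sketch: the layer count, grouped-query attention configuration, and cache precision are hypothetical placeholders, not values from any specific model card.

```python
# Rough single-GPU fit check: quantized weights + FP16 KV cache.
# All architecture numbers below are hypothetical placeholders for a
# 35B-A3B-class MoE; substitute real model-card values before relying on it.

GiB = 1024 ** 3

weights_gib = 11.0        # I-Nano footprint quoted above
n_layers    = 48          # assumed
n_kv_heads  = 4           # assumed grouped-query attention
head_dim    = 128         # assumed
kv_bytes    = 2           # FP16 cache
context_len = 32_000      # the long-context regime discussed above

# 2x for keys and values, per layer, per token.
kv_cache_gib = (2 * n_layers * n_kv_heads * head_dim
                * kv_bytes * context_len) / GiB

total = weights_gib + kv_cache_gib
print(f"KV cache: {kv_cache_gib:.1f} GiB, total: {total:.1f} GiB")
# ~3 GiB of cache on top of 11 GiB of weights: comfortably inside a 24 GB
# consumer GPU, with headroom for activations and the runtime.
```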
The availability of multimodal models, such as Nemotron-3-Nano 30B-A3B (vision + audio + text), quantized for local execution, opens new possibilities for edge applications and scenarios where latency and privacy are paramount. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks at /llm-onpremise to weigh the trade-offs between cost, performance, and control, and to see how solutions like APEX can balance these needs. The quantization of frontier-sized models such as MiniMax-M2.5 and M2.7 (228B total / 24B active), while demanding significant hardware for the quantization step itself (Blackwell-class GPUs, for example), demonstrates that the APEX strategy scales even to the largest models.
Future Prospects and the Role of the Community
The evolution of the APEX strategy is strongly driven by community feedback. User reports highlighting robust long-context and coding performance have been instrumental in justifying the further development of lower-bit tiers. This collaborative approach is typical of the open-source ecosystem and accelerates innovation, making advanced techniques more accessible.
Continuous research into methods for optimizing LLM execution on limited hardware is crucial for democratizing access to advanced artificial intelligence. Solutions like APEX not only lower the entry barrier in terms of hardware costs but also promote a more flexible and secure deployment model, aligned with the control and customization needs of modern enterprise infrastructures. The commitment to supporting a wide range of model families, including multimodal ones and community merges, positions APEX as a key player in LLM optimization for the era of on-premise AI.