4-Expert MoE: What an Automated Pipeline on an RTX 4090 Teaches After 28 Days

When automation meets neural network design, the devil hides in the implementation details. The team behind the NNGPT project put this to the test with a systematic 28-day campaign on a single NVIDIA RTX 4090, aimed at exploring heterogeneous 4-expert Mixture-of-Experts (MoE) architectures. The result? Over 4,400 candidate models generated, but also an enumeration error that left 95.2% of the theoretically possible combinations unexplored.

A deterministic generator replaces manual design

The pipeline starts from a hand-crafted heterogeneous MoE model and replaces it with an automatic code assembler. The system draws architecture families from the LEMUR database and combines them into four-expert MoE ensembles. Each ensemble is governed by a convolutional gating network with temperature scaling and is trained with mixup augmentation and cosine annealing of the learning rate. The goal: to sift through the space of possible mixtures without human intervention.
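
To make the gating and scheduling terms concrete, here is a minimal sketch in plain Python. It is not the NNGPT code: the function names, the list-based tensors, and the default values are assumptions chosen for illustration. It shows a temperature-scaled softmax turning four gate logits into mixing weights, the weighted sum of expert outputs, and a standard cosine-annealed learning rate.

```python
import math


def gate_weights(logits, temperature=1.0):
    """Temperature-scaled softmax over per-expert gate logits (illustrative)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_outputs, weights):
    """Weighted sum of per-expert output vectors (one vector per expert)."""
    n_classes = len(expert_outputs[0])
    return [
        sum(w * out[c] for w, out in zip(weights, expert_outputs))
        for c in range(n_classes)
    ]


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Hypothetical gate logits for four experts.
logits = [2.0, 1.0, 0.5, -1.0]
sharp = gate_weights(logits, temperature=0.5)  # concentrates on the top expert
flat = gate_weights(logits, temperature=5.0)   # spreads weight across experts
```

A lower temperature pushes the gate toward a single expert, while a higher one approaches a uniform average of all four.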

The choice of hardware matters: running the whole campaign on a consumer GeForce RTX 4090 shows that architecture research of this kind can now be conducted without resorting to server clusters. For those evaluating on-premise deployment, the campaign suggests that such tools can run on accessible machines, narrowing the gap between experimentation and production.

The alphabetical order trap

The most interesting finding is not the average performance of the ensembles, although ShuffleNet and MobileNetV3 excel with accuracy up to 0.632, but the discovery of a methodological flaw. The generator enumerates families with itertools.combinations, which is deterministic and emits combinations in the order of its input, here an alphabetical list of families. Since the first family in that list is AirNet, and the campaign covered only the first 4.8% of the 23,751 possible combinations, every explored combination includes it. In practice, the entire campaign is anchored to AirNet.
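
The effect is easy to reproduce. The sketch below rests on two inferences that are not stated in the source: 23,751 equals C(29, 4), so it assumes 29 families, and 4.8% of that total is taken to be roughly 1,140 combinations. Every family name except AirNet is a placeholder.

```python
from itertools import combinations, islice
from math import comb

# Assumption: 29 families, inferred from C(29, 4) = 23,751.
# Only "AirNet" comes from the article; the other names are placeholders.
families = ["AirNet"] + [f"Family{i:02d}" for i in range(2, 30)]
assert families == sorted(families)  # alphabetical, so AirNet comes first

total = comb(len(families), 4)           # all 4-family combinations
with_first = comb(len(families) - 1, 3)  # combinations that contain AirNet

# itertools.combinations yields every tuple starting with the first input
# element before moving on, so a short prefix never leaves that block.
explored = round(0.048 * total)  # approximate size of the explored prefix
prefix = list(islice(combinations(families, 4), explored))
all_anchored = all("AirNet" in combo for combo in prefix)

print(total, with_first, explored, all_anchored)
```

Under these assumptions the enumeration would have to run past 3,276 combinations before producing a single ensemble without AirNet, which is why stopping early leaves the whole sample anchored to it.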

This distortion only comes to light thanks to the project’s transparency, which documents batch by batch what was generated and evaluated. It is a warning for anyone developing automated architecture search frameworks: a seemingly innocuous choice of enumeration algorithm can silently invalidate conclusions. The proposed fix, stratified random sampling, is implemented in the corrected version of the generator.
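
The article does not describe how the corrected generator defines its strata, so the following is only one possible reading, not the project's implementation: treat each family as a stratum and draw the same number of random combinations anchored on every family. The function name, parameters, and placeholder family names are all hypothetical.

```python
import random
from collections import Counter
from math import comb


def stratified_sample(families, k, per_family, seed=0):
    """Draw `per_family` distinct k-combinations anchored on each family.

    Assumption: one stratum per family. A fixed seed keeps runs reproducible.
    """
    n = len(families)
    # Conservative guard so a fresh combination always exists for each anchor.
    if per_family * n > comb(n - 1, k - 1):
        raise ValueError("per_family is too large for this family list")
    rng = random.Random(seed)
    chosen = set()
    for anchor in families:
        others = [f for f in families if f != anchor]
        drawn = 0
        while drawn < per_family:
            combo = tuple(sorted([anchor] + rng.sample(others, k - 1)))
            if combo not in chosen:  # skip duplicates across strata
                chosen.add(combo)
                drawn += 1
    return sorted(chosen)


# Placeholder names; a real run would use the actual family list.
families = [f"Family{i:02d}" for i in range(1, 30)]
sample = stratified_sample(families, k=4, per_family=10, seed=42)
coverage = Counter(name for combo in sample for name in combo)
```

With this scheme every family appears in at least `per_family` sampled ensembles, so no single family can dominate the campaign the way AirNet did.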

What remains valid (and what does not)

Within the AirNet-centric universe, the numbers speak clearly: ensembles that include ShuffleNet or MobileNetV3 regularly achieve the highest accuracy, while FractalNet and MNASNet prove to be low-yield families, candidates for exclusion in future campaigns. These results, however biased, offer useful pointers for those wanting to assemble efficient MoEs on similar hardware.

On the methodological front, the pipeline stands as a replicable, open-source tool. The full release on GitHub (NNGPT) includes not only the code but also the analysis artifacts and the amended generator. This approach embraces the open science philosophy and, for on-premise environments, makes the model generation process fully auditable.

Why it matters for on-premise teams

AI-RADAR closely follows initiatives that democratize neural design automation on hardware within reach of small and medium teams. A single GPU, a database like LEMUR, and a deterministic pipeline are enough to explore complex architectural spaces, provided one avoids biases like the one documented here. For technical leaders, the lesson is twofold: always verify the statistical assumptions of the tools adopted, and remember that even the most advanced automation does not remove the need for careful human oversight. On the cost side, running an RTX 4090 for 28 days offers a useful benchmark for estimating TCO for similar campaigns, especially when compared with equivalent cloud solutions.
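
As a starting point for that comparison, the sketch below turns the 28-day run into GPU-hours and two rough cost figures. Only the duration comes from the article. The average power draw, the electricity price, and the cloud hourly rate are placeholder assumptions to be replaced with measured or quoted values, and a full TCO would also include hardware amortization, cooling, and staff time.

```python
GPU_HOURS = 28 * 24  # 28 days of continuous use = 672 hours


def energy_kwh(hours, avg_power_watts):
    """Energy used by the GPU alone, in kWh."""
    return hours * avg_power_watts / 1000.0


def energy_cost(hours, avg_power_watts, price_per_kwh):
    """Electricity cost for the GPU alone."""
    return energy_kwh(hours, avg_power_watts) * price_per_kwh


def cloud_cost(hours, hourly_rate):
    """Rental cost of a comparable cloud GPU for the same duration."""
    return hours * hourly_rate


# Placeholder assumptions, not figures from the article:
AVG_POWER_W = 450.0   # assumes the card runs near its rated board power
PRICE_PER_KWH = 0.30  # hypothetical electricity price
CLOUD_RATE = 0.50     # hypothetical hourly price for a comparable instance

on_prem_energy = energy_cost(GPU_HOURS, AVG_POWER_W, PRICE_PER_KWH)
cloud_total = cloud_cost(GPU_HOURS, CLOUD_RATE)
print(GPU_HOURS, round(on_prem_energy, 2), round(cloud_total, 2))
```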