Architectural Analysis of Qwen 3.5
A recent analysis focused on the architecture of Qwen 3.5 models, comparing the parameter distribution of the 27B dense model against the 122B and 35B Mixture of Experts (MoE) models. All three share a similar layer layout: blocks of three Gated DeltaNet layers interleaved with one Gated Attention layer, with each layer followed by its own Feed-Forward Network (FFN).
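For illustration, here is a minimal Python sketch of that repeating block pattern, assuming the ordering described above; the mixer names and the build_layer_stack helper are placeholders, not the actual Qwen implementation.

```python
def build_layer_stack(num_blocks: int) -> list[tuple[str, str]]:
    """Return (mixer, ffn) pairs for a stack of repeating blocks.

    Each block interleaves three Gated DeltaNet layers with one Gated
    Attention layer, and every mixer is followed by its own FFN.
    """
    pattern = ["gated_deltanet", "gated_deltanet", "gated_deltanet", "gated_attention"]
    stack = []
    for _ in range(num_blocks):
        for mixer in pattern:
            stack.append((mixer, "ffn"))
    return stack


if __name__ == "__main__":
    for i, (mixer, ffn) in enumerate(build_layer_stack(num_blocks=2)):
        print(f"layer {i:2d}: {mixer:16s} + {ffn}")
```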
The main difference lies in the parameter distribution. The MoE models spend most of their parameters on the experts of the FFN. In contrast, the 27B dense model, whose dense FFN requires far fewer parameters, can allocate more of its budget to other parts of the network.
Parameter Distribution
Quantifying the parameters allocated to the FFN layers gives the following picture (a quick arithmetic check follows the list):
- 122B MoE model: 77.3 B FFN parameters (2.7 B active per token) -> 63% of total parameters (2.2% active)
- 35B MoE model: 21.5 B FFN parameters (0.8 B active per token) -> 61% of total parameters (2.3% active)
- 27B dense model: 9.1 B FFN parameters -> 34% of total parameters
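As a sanity check, the snippet below recomputes these percentages from the figures quoted above (all values in billions of parameters; the dense model's FFN is assumed fully active on every token).

```python
# Figures as quoted in the article, in billions of parameters.
models = {
    "122B MoE":  {"total": 122.0, "ffn": 77.3, "ffn_active": 2.7},
    "35B MoE":   {"total": 35.0,  "ffn": 21.5, "ffn_active": 0.8},
    "27B dense": {"total": 27.0,  "ffn": 9.1,  "ffn_active": 9.1},  # dense: all FFN params active
}

for name, p in models.items():
    ffn_share = 100 * p["ffn"] / p["total"]          # FFN share of total parameters
    active_share = 100 * p["ffn_active"] / p["total"]  # active FFN share of total parameters
    print(f"{name:10s}  FFN share: {ffn_share:4.1f}%   active FFN share: {active_share:4.1f}%")
```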
The dense model uses a smaller percentage of parameters in the FFN layers, compensating with:
- Greater depth: 64 layers versus 48 and 40 for the MoE models, improving reasoning ability.
- More key-value heads in the gated attention layers: 4 versus 2 for the MoE models, to capture more nuance.
- More heads in the Gated DeltaNet layers than the 35B model.
Furthermore, being dense, the model activates all of its parameters on every token, whereas the MoE models activate only a small fraction, so more compute is applied to each token.
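The sketch below summarizes these compensating design choices using only the figures stated above; the ModelShape class is illustrative, and the mapping of the 48- and 40-layer depths to the two MoE models is an assumption (larger model taken as the deeper one). DeltaNet head counts are not quantified in the article and are omitted.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    name: str
    layers: int         # network depth (layer count)
    attn_kv_heads: int  # key-value heads in the gated attention layers


shapes = [
    ModelShape("27B dense", layers=64, attn_kv_heads=4),
    ModelShape("122B MoE",  layers=48, attn_kv_heads=2),  # depth assignment assumed
    ModelShape("35B MoE",   layers=40, attn_kv_heads=2),  # depth assignment assumed
]

for s in shapes:
    print(f"{s.name:10s}  depth={s.layers:2d}  attention KV heads={s.attn_kv_heads}")
```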
Conclusions
The 27B dense model can be considered a deeper and wider network than the 35B MoE model, and in some respects also than the 122B model. These differences allow it to deliver performance comparable to the 122B MoE model with a parameter footprint roughly 4.5x smaller.