Per-Layer Embeddings: The Key to Efficient Inference in Small Gemma 4 Models

Introduction: Per-Layer Embeddings and Gemma 4's Efficiency

Google recently released the Gemma 4 model family, which includes two smaller variants, gemma-4-E2B and gemma-4-E4B. Their names carry an 'E' designation rather than the 'A' (for active parameters) commonly used for Mixture-of-Experts (MoE) models. These variants are neither traditional dense models nor MoE models; they follow a different architectural approach.

The main innovation lies in Per-Layer Embeddings (PLE), a technique that promises new performance tradeoffs for inference, particularly relevant for resource-constrained scenarios. This development aims to optimize the execution of Large Language Models in contexts where memory and speed are critical factors, offering an alternative to existing architectures.

Technical Detail: Beyond Traditional MoE Models

To understand the scope of this novelty, it's useful to recall how MoE models work. An example is gemma-4-26B-A4B, which has 25.2 billion total parameters but activates only 3.8 billion of them for each token. This makes inference much faster than in a dense model of the same total size, with per-token compute closer to that of a dense model with a similar active parameter count. All 25.2 billion parameters must still be loaded into VRAM or fast RAM, however. Without that much fast memory, performance degrades severely, because the active experts can change from one token to the next, so any expert may be needed at any step.
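The gap between what an MoE model must keep resident and what it actually uses per token can be made concrete with some quick arithmetic. The sketch below uses the parameter counts quoted above; the 2 bytes per weight (16-bit storage) is an assumption for illustration, and the figures cover weights only.

```python
# Figures from the text: gemma-4-26B-A4B has 25.2B total and 3.8B active parameters.
TOTAL_PARAMS = 25.2e9
ACTIVE_PARAMS = 3.8e9

# Assumption for illustration: 2 bytes per weight (16-bit storage).
BYTES_PER_PARAM = 2


def weights_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Size of the weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


resident_gb = weights_gb(TOTAL_PARAMS)   # must sit in fast memory
touched_gb = weights_gb(ACTIVE_PARAMS)   # actually read for one token
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"resident weights:       {resident_gb:.1f} GB")
print(f"weights used per token: {touched_gb:.1f} GB")
print(f"active fraction:        {active_fraction:.1%}")
```

Under these assumptions, only a small fraction of the resident weights is read for any given token, yet none of them can be evicted, because the router may select a different set of experts at the next step.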

Gemma 4-E models, like gemma-4-E2B, adopt a different strategy. This model has 5.1 billion total parameters, of which 2.8 billion are embedding parameters. Google counts the remaining 2.3 billion as 'effective' parameters, hence the 'E2B' designation.

Embeddings are high-dimensional vectors associated with each token in the vocabulary, capturing its semantic essence. Traditionally, a single embedding matrix is applied once, at the input of the model. Gemma 4-E models, however, introduce Per-Layer Embeddings (PLE): additional, smaller embedding matrices, one for each layer of the model. These matrices acquire specialized knowledge during training, so that each token can be re-contextualized for the semantic specialization of each layer, which is intended to improve output quality.
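The idea of one extra lookup per layer can be sketched in a few lines. This is a toy illustration, not Gemma 4's actual implementation: the sizes are hypothetical, the weights are random, and both the up-projection and the additive injection into the hidden state are assumptions made to keep the example small.

```python
import random

# Hypothetical toy sizes, far smaller than any real model.
VOCAB_SIZE = 8
HIDDEN_DIM = 4
PLE_DIM = 2
NUM_LAYERS = 3

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# One conventional input table, plus one smaller table per layer.
input_table = random_matrix(VOCAB_SIZE, HIDDEN_DIM)
per_layer_tables = [random_matrix(VOCAB_SIZE, PLE_DIM) for _ in range(NUM_LAYERS)]
# Assumed: each layer projects its small PLE vector up to the hidden width.
up_projections = [random_matrix(PLE_DIM, HIDDEN_DIM) for _ in range(NUM_LAYERS)]


def project(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [
        sum(v * row[col] for v, row in zip(vector, matrix))
        for col in range(len(matrix[0]))
    ]


def forward(token_id):
    hidden = list(input_table[token_id])  # classic single input lookup
    for layer in range(NUM_LAYERS):
        # A real layer would run its attention and feed-forward blocks here.
        ple_vector = per_layer_tables[layer][token_id]  # extra per-layer lookup
        injected = project(ple_vector, up_projections[layer])
        hidden = [h + x for h, x in zip(hidden, injected)]  # assumed: additive
    return hidden


print(forward(3))
```

The point of the sketch is the access pattern: every layer reads one row, selected only by the token id, from its own table.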

Context and Implications: Intelligent Embedding Management

Embedding parameters are left out of the 'effective' parameter count because of how they are used at inference time. Contrary to the simplified picture given in many introductions to Large Language Models, the embedding step does not require a matrix multiplication. The vectors are static, position-independent, and precomputed for the entire vocabulary, so the 'embedding matrix' actually functions as a lookup table. Obtaining a token's embedding only means reading the corresponding row from a fixed-size array, with no need for CUDA cores or optimized kernels for matrix operations.
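The equivalence between the textbook formulation (a one-hot vector multiplied by the embedding matrix) and a plain row lookup can be checked directly. The tiny table below is made up for illustration.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [
        sum(v * row[col] for v, row in zip(vector, matrix))
        for col in range(len(matrix[0]))
    ]


# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
token_id = 2

via_matmul = vector_times_matrix(one_hot(token_id, len(table)), table)
via_lookup = table[token_id]

print(via_matmul == via_lookup)
```

The multiplication touches every row of the table only to discard all but one of them, which is why practical implementations index the row instead.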

This means that embedding matrices do not necessarily need to reside in VRAM, or even in CPU RAM. They can be stored on slower media, such as disk, because each token only requires reading its own rows. The idea is to leverage flash memory on mobile devices, with in-flash processing as a future prospect for further acceleration.
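A minimal sketch of a disk-backed lookup, using only Python's standard `mmap` and `struct` modules: the table is written to a file once, then individual rows are read by offset without loading the whole file into memory. The file layout, sizes, and helper names are hypothetical and do not reflect how any particular runtime stores Gemma 4 weights.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical sizes for illustration.
VOCAB_SIZE = 1000
EMBED_DIM = 16
ROW_FORMAT = f"<{EMBED_DIM}f"             # little-endian 32-bit floats
ROW_BYTES = struct.calcsize(ROW_FORMAT)   # fixed row size gives O(1) offsets


def write_table(path):
    """Write a dummy table where row i is filled with the value i."""
    with open(path, "wb") as f:
        for token_id in range(VOCAB_SIZE):
            f.write(struct.pack(ROW_FORMAT, *([float(token_id)] * EMBED_DIM)))


def lookup(mapped, token_id):
    """Read a single row; only the pages holding it are touched."""
    return struct.unpack_from(ROW_FORMAT, mapped, token_id * ROW_BYTES)


path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
write_table(path)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        row = lookup(mapped, 42)

print(len(row), row[0])
```

With a memory-mapped file, the operating system pages in only the regions that are actually read, so the resident footprint tracks the rows in use rather than the size of the table.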

For organizations evaluating LLM deployment on-premise or on edge devices, this architecture offers an interesting tradeoff. The Per-Layer Embedding tables are large, but each inference step touches only a small slice of them. This allows for much more flexible memory management, reducing VRAM pressure and potentially lowering the total cost of ownership (TCO) of the inference infrastructure.
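How small that slice is can be estimated roughly. The sketch below takes the 2.8 billion embedding parameters quoted for gemma-4-E2B and assumes that all of them sit in tables with one row per vocabulary entry. The vocabulary size and the 2 bytes per parameter are hypothetical values chosen for illustration, not published figures.

```python
EMBEDDING_PARAMS = 2.8e9   # from the text (gemma-4-E2B)
VOCAB_SIZE = 262_144       # hypothetical vocabulary size
BYTES_PER_PARAM = 2        # assumption: 16-bit storage


def per_token_fetch(embedding_params, vocab_size, bytes_per_param):
    """Parameters and bytes read for one token, assuming every embedding
    table has exactly one row per vocabulary entry."""
    params_per_token = embedding_params / vocab_size
    return params_per_token, params_per_token * bytes_per_param


params_per_token, bytes_per_token = per_token_fetch(
    EMBEDDING_PARAMS, VOCAB_SIZE, BYTES_PER_PARAM
)
table_bytes = EMBEDDING_PARAMS * BYTES_PER_PARAM

print(f"embedding tables: {table_bytes / 1e9:.1f} GB")
print(f"read per token:   {bytes_per_token / 1e3:.1f} kB")
print(f"fraction touched: 1 / {VOCAB_SIZE:,}")
```

Under these assumptions, a token needs a few tens of kilobytes out of several gigabytes of tables, which is the kind of access pattern that slower storage can plausibly serve.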

Future Prospects and Architectural Trade-offs

The ability to offload a large portion of the embedding parameters to slower but larger storage is a significant advantage for inference efficiency. It makes it possible to run models with a high total parameter count in environments with limited memory, such as edge devices or on-premise servers whose GPUs have little VRAM.
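For gemma-4-E2B, the effect on the fast-memory budget can be estimated from the parameter counts given earlier (5.1 billion total, 2.8 billion embedding). The storage widths below are hypothetical options, and the estimate covers weights only, leaving out activations and the KV cache.

```python
# Parameter counts from the text, in billions.
TOTAL_PARAMS_B = 5.1
EMBEDDING_PARAMS_B = 2.8

# Hypothetical storage widths, in bytes per parameter.
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def memory_plan(total_b, embedding_b, bytes_per_param):
    """Weight sizes in GB (billions of parameters times bytes per parameter)."""
    effective_b = total_b - embedding_b
    return {
        "fast_memory_gb": effective_b * bytes_per_param,
        "offloaded_gb": embedding_b * bytes_per_param,
        "all_in_fast_memory_gb": total_b * bytes_per_param,
    }


plans = {
    label: memory_plan(TOTAL_PARAMS_B, EMBEDDING_PARAMS_B, width)
    for label, width in BYTES_PER_PARAM.items()
}

for label, plan in plans.items():
    print(
        f"{label}: {plan['fast_memory_gb']:.2f} GB in fast memory, "
        f"{plan['offloaded_gb']:.2f} GB offloaded "
        f"(vs {plan['all_in_fast_memory_gb']:.2f} GB if nothing is offloaded)"
    )
```

Whatever the storage width, offloading the embedding tables removes more than half of the weights from fast memory in this estimate.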

This is not a universal solution, but an optimization targeted at specific constraints. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess the tradeoffs between hardware requirements, performance, and operational costs. Per-Layer Embeddings fit into this context as an architectural alternative that shifts the balance between model size, inference speed, and memory requirements, offering new avenues for resource optimization.

This innovation underscores the continuous research in the field of Large Language Models to make generative AI more accessible and efficient, adapting it to a wide range of deployment scenarios.