Introduction

Large Language Models (LLMs) such as Llama-3 are becoming increasingly widespread in research and practical applications. However, deploying them requires significant hardware resources, leading to memory and latency bottlenecks.

Quantization is a technique that reduces the bit-width of model weights, thereby lowering the memory required for inference. However, it can compromise output accuracy, particularly at low bit-widths (e.g., 2-bit).
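As a minimal illustration (not the scheme used in the work discussed here), the sketch below applies naive uniform 2-bit quantization to a random weight matrix and shows the reconstruction error that such coarse rounding introduces; the function names and the single global scale are simplifying assumptions, since real LLM quantizers use per-group scales and calibration data.

```python
# Toy uniform low-bit weight quantization (illustrative only).
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int = 2):
    """Map weights to 2**bits integer levels; return codes plus scale/offset."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    codes = np.round((w - w_min) / scale).astype(np.uint8)  # 2-bit codes, stored in uint8 here
    return codes, scale, w_min

def dequantize(codes, scale, w_min):
    """Reconstruct approximate weights from the integer codes."""
    return codes.astype(np.float32) * scale + w_min

w = np.random.randn(4, 8).astype(np.float32)
codes, scale, w_min = quantize_uniform(w, bits=2)
w_hat = dequantize(codes, scale, w_min)
print("max reconstruction error:", np.abs(w - w_hat).max())  # noticeable at 2-bit
```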

In this context, codebook-based quantization methods have been developed to explore whether accurate results can still be achieved while drastically reducing the bit-width of the weights.
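The sketch below illustrates the general idea behind codebook-based (vector) quantization: groups of weights are replaced by indices into a small codebook of centroid vectors. The randomly chosen centroids are a placeholder assumption; real methods fit the codebook (e.g., with k-means) and this is not any specific published scheme.

```python
# Toy codebook-based (vector) quantization: weight sub-vectors become
# indices into a small codebook of centroids.
import numpy as np

def codebook_quantize(w: np.ndarray, vec_len: int = 4, num_centroids: int = 16):
    flat = w.reshape(-1, vec_len)                        # group weights into sub-vectors
    centroids = flat[np.random.choice(len(flat), num_centroids, replace=False)]
    # assign each sub-vector to its nearest centroid
    dists = ((flat[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1).astype(np.uint8)        # log2(num_centroids) bits per sub-vector
    return codes, centroids

w = np.random.randn(8, 16).astype(np.float32)
codes, centroids = codebook_quantize(w)
w_hat = centroids[codes].reshape(w.shape)                # reconstruction via the codebook
print("first row of codes:", codes.reshape(8, -1)[0])
```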

A new work presents CodeGEMM, a codebook-centric GEMM kernel that optimizes the inference performance of quantized LLMs.

How it works

CodeGEMM replaces dequantization, a slow procedure that requires frequent centroid lookups and weight reconstruction. Instead, the kernel precomputes the inner products between centroids and activations and stores them in a lightweight codebook.

In this way, during inference, the weight codes are used directly to gather partial sums, thus avoiding per-element lookups and reducing the on-chip footprint.
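The NumPy sketch below illustrates this gather-based matrix-vector product as described above. The function name, shapes, and group size are illustrative assumptions, not the actual CodeGEMM kernel or its API: inner products between every centroid and every activation group are computed once, and the stored codes then simply gather and accumulate these partial sums.

```python
# Illustrative codebook-centric matvec: gather precomputed partial sums by code.
import numpy as np

def codegemm_like_matvec(codes, centroids, x):
    """codes: (out_dim, n_groups) ints, centroids: (K, vec_len), x: (in_dim,)."""
    K, vec_len = centroids.shape
    x_groups = x.reshape(-1, vec_len)                    # (n_groups, vec_len)
    # precompute partial sums: inner product of each centroid with each activation group
    psum_table = centroids @ x_groups.T                  # (K, n_groups) "lightweight codebook"
    n_groups = x_groups.shape[0]
    # gather-and-accumulate: no weight reconstruction, no per-element lookup
    return psum_table[codes, np.arange(n_groups)].sum(axis=1)   # (out_dim,)

# check against the dequantize-then-multiply baseline
vec_len, K, out_dim, in_dim = 4, 16, 8, 32
centroids = np.random.randn(K, vec_len).astype(np.float32)
codes = np.random.randint(0, K, size=(out_dim, in_dim // vec_len))
x = np.random.randn(in_dim).astype(np.float32)

w_hat = centroids[codes].reshape(out_dim, in_dim)        # baseline: reconstruct full weights
assert np.allclose(codegemm_like_matvec(codes, centroids, x), w_hat @ x, atol=1e-4)
```

The check against the dequantize-then-multiply baseline confirms that gathering precomputed partial sums yields the same result without ever reconstructing the weight matrix.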

The kernel supports systematic exploration of latency-memory-accuracy trade-offs under a unified implementation.

Results

Tested on Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups over state-of-the-art codebook-based quantization methods at comparable accuracy. The kernel also improves compute efficiency and memory-subsystem utilization.