Introduction
Large Language Models (LLMs) such as Llama-3 are becoming increasingly widespread in research and practical applications. However, running them requires significant hardware resources, leading to memory and latency issues.
Quantization is a technique that reduces the bit-width of the weights, thereby lowering the memory required for inference. However, it can compromise output accuracy, particularly at low bit-widths (e.g., 2-bit).
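As a rough illustration of the memory savings at stake, the back-of-envelope sketch below estimates the weight footprint of a 70B-parameter model at different bit-widths; it counts weight storage only and ignores activations, codebooks, and other overheads.

```python
# Back-of-envelope weight-memory estimate (weights only; activations,
# codebooks, and runtime overheads are ignored).
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 4, 2):
    print(f"70B model at {bits}-bit weights: {weight_memory_gb(70e9, bits):.1f} GB")
# 16-bit: 140.0 GB, 4-bit: 35.0 GB, 2-bit: 17.5 GB
```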
In this context, codebook-based quantization methods have been developed to explore whether accurate results can still be achieved while drastically reducing weight bit-width.
The new work presents CodeGEMM, a GEMM kernel that takes a codebook-centric approach to optimize the inference performance of quantized LLMs.
How it works
CodeGEMM replaces dequantization, a costly and slow procedure that requires frequent centroid lookups and weight reconstruction. Instead, the kernel precomputes the inner products between centroids and activations and stores them in a lightweight codebook.
In this way, during inference, the codes are used directly to gather partial sums, avoiding per-element lookups and reducing the on-chip footprint.
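A minimal NumPy sketch of this idea follows. The names (codes, centroids, psum_table) and shapes are illustrative assumptions, not CodeGEMM's actual data layout or API; it simply contrasts a dequantize-then-multiply baseline with gathering precomputed centroid-activation partial sums by code.

```python
# Illustrative sketch of codebook-centric matrix-vector multiplication.
# All names and sizes are assumptions for the example, not CodeGEMM's API.
import numpy as np

rng = np.random.default_rng(0)

d_sub   = 4      # sub-vector length represented by each code (assumption)
n_cent  = 256    # number of centroids in the codebook (assumption)
n_group = 512    # sub-vectors per output row (hidden size = n_group * d_sub)
n_rows  = 8      # number of output rows

centroids = rng.standard_normal((n_cent, d_sub)).astype(np.float32)
codes     = rng.integers(0, n_cent, size=(n_rows, n_group))          # quantized weights
x         = rng.standard_normal(n_group * d_sub).astype(np.float32)  # activations

# Baseline: dequantize (reconstruct full weights from centroids), then multiply.
w = centroids[codes].reshape(n_rows, -1)
y_dequant = w @ x

# Codebook-centric path: precompute centroid-activation inner products once
# per activation vector, then gather partial sums by code and accumulate.
x_sub      = x.reshape(n_group, d_sub)
psum_table = x_sub @ centroids.T                  # shape (n_group, n_cent)
y_gather   = psum_table[np.arange(n_group), codes].sum(axis=1)

assert np.allclose(y_dequant, y_gather, rtol=1e-3, atol=1e-3)
```

Both paths compute the same result; the gather path only touches the code indices and the small partial-sum table instead of reconstructing the full weight matrix, which is the property the kernel exploits.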
The kernel supports systematic exploration of latency-memory-accuracy trade-offs under a unified implementation.
Results
Tested on Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups over state-of-the-art codebook-based quantization at comparable accuracy. The kernel also improves compute efficiency and memory-subsystem utilization.