Efficient Compression for Large Language Models
Large language models (LLMs) demand ever-increasing computational and memory resources, making compression crucial for their deployment and continued training. A new study introduces Hierarchical Sparse Plus Low-Rank (HSS), a compression method that aims to alleviate this pressure.
The HSS technique operates in two stages: first, it extracts the largest-magnitude weights into a sparse matrix; second, it applies a recursive low-rank factorization to the remaining dense residual. By storing the outlier weights exactly and compressing the smoother residual aggressively, the approach maximizes compressibility while maintaining model performance.
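The two stages above can be sketched as follows. This is a minimal illustration of the general sparse-plus-low-rank idea in numpy, not the authors' implementation: the exact recursion and rank schedule of HSS are not specified in the article, so the halving schedule and function names here are assumptions.

```python
import numpy as np

def sparse_plus_low_rank(W, sparsity=0.3, rank=512, levels=2):
    """Hypothetical sketch of a hierarchical sparse + low-rank decomposition.

    Stage 1: keep the top `sparsity` fraction of entries by magnitude
    in a sparse matrix S. Stage 2: approximate the dense residual with
    a sequence of truncated SVDs of (assumed) decreasing rank.
    """
    # Stage 1: sparse extraction of the largest-magnitude weights.
    k = max(1, int(sparsity * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= thresh
    S = np.where(mask, W, 0.0)

    # Stage 2: recursive low-rank factorization of the residual.
    residual = W - S
    factors = []
    r = rank
    for _ in range(levels):
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        r = min(r, len(s))
        U_r = U[:, :r] * s[:r]            # fold singular values into U
        factors.append((U_r, Vt[:r]))
        residual = residual - U_r @ Vt[:r]
        r = max(1, r // 2)                # assumed rank-halving schedule

    return S, factors

def reconstruct(S, factors):
    """Rebuild the approximate weight matrix from the compressed parts."""
    W_hat = S.copy()
    for U_r, Vt_r in factors:
        W_hat = W_hat + U_r @ Vt_r
    return W_hat
```

In this sketch the sparse component stores the outliers exactly, so the SVD only has to fit the residual, which is typically closer to low-rank than the raw weight matrix.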
Memory Savings and Competitive Performance
Tests on the LLaMA-7B model show that applying HSS to the self-attention projections alone (approximately 1.6 billion parameters) yields significant memory savings while maintaining state-of-the-art perplexity on the WikiText dataset. Specifically, with a 30% sparsity budget and an outer rank of 512, the sHSS-RCM variant achieved a perplexity of 1.64, outperforming both dense baselines and classical sparse-plus-SVD variants.
This new compression method offers a promising balance between efficiency and accuracy, paving the way for more accessible and sustainable implementations of large language models.