Efficient Compression for Large Language Models
Large language models (LLMs) demand ever-increasing computational and memory resources, making compression crucial for their deployment and continued training. A new study introduces Hierarchical Sparse Plus Low-Rank (HSS), a compression method that aims to alleviate this pressure.
The HSS technique operates in two stages: first, it extracts the largest-magnitude weights into a sparse matrix; second, it applies a recursive low-rank factorization to the remaining dense residual. By storing the outlier weights exactly and compressing the smoother residual aggressively, the approach maximizes compressibility while maintaining model performance.
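The two stages above can be sketched as follows. This is a minimal illustration of the general sparse-plus-low-rank idea in numpy, not the authors' implementation: the exact recursion and rank schedule of HSS are not specified in the article, so the halving schedule and function names here are assumptions.

```python
import numpy as np

def sparse_plus_low_rank(W, sparsity=0.3, rank=512, levels=2):
    """Hypothetical sketch of a hierarchical sparse + low-rank decomposition.

    Stage 1: keep the top `sparsity` fraction of entries by magnitude
    in a sparse matrix S. Stage 2: approximate the dense residual with
    a sequence of truncated SVDs of (assumed) decreasing rank.
    """
    # Stage 1: sparse extraction of the largest-magnitude weights.
    k = max(1, int(sparsity * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= thresh
    S = np.where(mask, W, 0.0)

    # Stage 2: recursive low-rank factorization of the residual.
    residual = W - S
    factors = []
    r = rank
    for _ in range(levels):
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        r = min(r, len(s))
        U_r = U[:, :r] * s[:r]            # fold singular values into U
        factors.append((U_r, Vt[:r]))
        residual = residual - U_r @ Vt[:r]
        r = max(1, r // 2)                # assumed rank-halving schedule

    return S, factors

def reconstruct(S, factors):
    """Rebuild the approximate weight matrix from the compressed parts."""
    W_hat = S.copy()
    for U_r, Vt_r in factors:
        W_hat = W_hat + U_r @ Vt_r
    return W_hat
```

In this sketch the sparse component stores the outliers exactly, so the SVD only has to fit the residual, which is typically closer to low-rank than the raw weight matrix.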
Memory Savings and Competitive Performance
Tests on the LLaMA-7B model show that applying HSS to the self-attention projections alone (approximately 1.6 billion parameters) yields significant memory savings while maintaining state-of-the-art perplexity on the WikiText dataset. Specifically, with a 30% sparsity budget and an outer rank of 512, the sHSS-RCM variant achieved a perplexity of 1.64, outperforming both dense baselines and classical sparse-plus-SVD variants.
This new compression method offers a promising balance between efficiency and accuracy, paving the way for more accessible and sustainable implementations of large language models.