Optimizing KV Cache for Chain-of-Thought LLMs
Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks but incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache.
A new approach called Crystal-KV manages the KV cache efficiently by leveraging the "answer-first" principle.
Crystal-KV distinguishes between two kinds of cached KV entries (a minimal sketch of the distinction follows this list):
- SlipKV: entries that mainly maintain the reasoning flow but may occasionally introduce misleading context.
- CrystalKV: entries that truly contribute to the correctness of the final answer.
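One way to picture this distinction is as a per-entry utility score that accumulates as decoding proceeds. The sketch below is purely illustrative: the CachedEntry fields, the is_crystal helper, and the threshold value are assumptions for exposition, not Crystal-KV's actual data structures or criteria.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class CachedEntry:
    """Illustrative per-token KV record (field names are assumptions)."""
    key: np.ndarray
    value: np.ndarray
    utility: float = 0.0  # accumulated attention mass this entry has received


def is_crystal(entry: CachedEntry, threshold: float = 0.05) -> bool:
    """Treat high-utility entries as CrystalKV; low-utility entries behave like
    SlipKV and become eviction candidates once their usefulness fades.
    The threshold is purely illustrative, not a value from the paper."""
    return entry.utility >= threshold
```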
How Crystal-KV Works
Crystal-KV uses an attention-based Least Recently Frequently Used (LRFU) algorithm to pinpoint when a SlipKV entry's utility has expired and evict it, retaining CrystalKV entries without disrupting the reasoning flow. Furthermore, it introduces an adaptive cache-budget allocation algorithm that estimates the importance of each layer and head and adjusts their KV cache budgets during inference, amplifying critical components to improve budget utilization.
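As a rough illustration of how such a mechanism could operate, the sketch below keeps an exponentially decayed attention score per cached position (an LRFU-style mix of frequency and recency) and drops the lowest-scoring positions once a head exceeds its budget; a small helper then splits a global budget across heads in proportion to an importance estimate. The decay factor, the per-head floor, and all class and function names are assumptions for exposition, not Crystal-KV's actual implementation.

```python
import numpy as np

DECAY = 0.9  # assumed recency-decay factor, not a value from the paper


class AttnLRFUCache:
    """Illustrative eviction bookkeeping for a single layer/head.

    Each cached position carries a score mixing frequency (how much attention
    it keeps receiving) with recency (older contributions decay). Low-scoring
    positions are treated as expired SlipKV and dropped; high-scoring positions
    are kept as CrystalKV.
    """

    def __init__(self, budget: int):
        self.budget = budget       # max entries this head may retain
        self.scores = np.zeros(0)  # one score per cached position

    def observe(self, attn_row: np.ndarray) -> None:
        """Fold in the attention the newest token paid to each cached position."""
        if attn_row.shape[0] > self.scores.shape[0]:
            # grow the score vector for newly appended KV entries
            pad = attn_row.shape[0] - self.scores.shape[0]
            self.scores = np.concatenate([self.scores, np.zeros(pad)])
        self.scores = DECAY * self.scores + attn_row  # LRFU-style update

    def keep_indices(self) -> np.ndarray:
        """Indices of entries to retain after dropping the lowest-scoring ones."""
        if self.scores.shape[0] <= self.budget:
            return np.arange(self.scores.shape[0])
        keep = np.argsort(self.scores)[-self.budget:]
        return np.sort(keep)


def allocate_budgets(head_importance: np.ndarray, total_budget: int,
                     floor: int = 16) -> np.ndarray:
    """Split a global KV budget across heads in proportion to an importance
    estimate (e.g. each head's total attention mass). The per-head floor is an
    assumed safeguard so that no head is starved entirely."""
    weights = head_importance / head_importance.sum()
    return np.maximum(floor, np.floor(weights * total_budget)).astype(int)
```

In practice, decisions like these have to run per layer and per head at every generation step, so the bookkeeping must stay cheap relative to the attention computation itself.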
Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and delivers faster response times, while maintaining, or even improving, answer accuracy on CoT reasoning tasks.