Optimizing KV Cache for Chain-of-Thought LLMs

Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks but incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache.

A new approach called Crystal-KV manages the KV cache efficiently by leveraging the "answer-first" principle.

Crystal-KV distinguishes between two kinds of KV cache entries, sketched in code after the list:

  • SlipKV: mainly maintains the reasoning flow but may occasionally introduce misleading context.
  • CrystalKV: truly contributes to the correctness of the final answer.
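The write-up does not specify how entries are labeled, so the sketch below is only a rough illustration: it assumes that the cumulative attention an entry receives from answer-stage tokens is the distinguishing signal, and the names (KVKind, KVEntry, classify, answer_threshold) and the threshold value are hypothetical rather than taken from Crystal-KV itself.

  from dataclasses import dataclass
  from enum import Enum

  class KVKind(Enum):
      SLIP = "SlipKV"        # supports the ongoing reasoning flow; utility may expire
      CRYSTAL = "CrystalKV"  # keeps contributing to the correctness of the final answer

  @dataclass
  class KVEntry:
      token_pos: int           # position of the cached token in the sequence
      attn_from_answer: float  # cumulative attention received from answer-stage tokens
      attn_from_think: float   # cumulative attention received from think-stage tokens

  def classify(entry: KVEntry, answer_threshold: float = 0.05) -> KVKind:
      """Hypothetical rule: an entry the answer stage still attends to is CrystalKV;
      otherwise it is treated as SlipKV whose usefulness may have expired."""
      if entry.attn_from_answer >= answer_threshold:
          return KVKind.CRYSTAL
      return KVKind.SLIP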

How Crystal-KV Works

Crystal-KV uses an attention-based Least Recently Frequently Used (LRFU) algorithm to identify the point at which a SlipKV entry's utility expires, evicting it while retaining CrystalKV entries so the reasoning flow is not disrupted. It also introduces an adaptive cache budget allocation algorithm that estimates the importance of each layer and head and adjusts their KV cache budgets during inference, giving critical components a larger share of the budget to improve overall utilization.
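Neither algorithm is detailed here, so the following is a minimal sketch under stated assumptions: the LRFU-style score is taken to be an exponentially decayed sum of the attention each cached position received over recent decoding steps, and layer importance is assumed to be an externally supplied estimate (e.g., derived from attention statistics). Function names such as lrfu_score, evict, and allocate_budgets are illustrative, not Crystal-KV's actual API.

  import numpy as np

  def lrfu_score(attn_history: np.ndarray, decay: float = 0.9) -> np.ndarray:
      """Attention-based LRFU-style score: an exponentially decayed sum of the
      attention each cached position received over recent decoding steps.
      attn_history has shape (steps, cached_positions); newer steps count more."""
      steps = attn_history.shape[0]
      weights = decay ** np.arange(steps - 1, -1, -1)  # oldest step decays the most
      return weights @ attn_history                    # one score per cached position

  def evict(attn_history: np.ndarray, budget: int) -> np.ndarray:
      """Keep the `budget` positions with the highest scores; positions whose
      scores have decayed away (expired SlipKV) are the ones dropped."""
      scores = lrfu_score(attn_history)
      keep = np.argsort(scores)[-budget:]
      return np.sort(keep)                             # preserve sequence order

  def allocate_budgets(layer_importance: np.ndarray, total_budget: int,
                       min_per_layer: int = 8) -> np.ndarray:
      """Adaptive budget allocation: split a global KV budget across layers in
      proportion to an estimated importance score, with a small floor per layer."""
      floors = np.full(layer_importance.shape, min_per_layer, dtype=int)
      remaining = total_budget - floors.sum()
      shares = layer_importance / layer_importance.sum()
      return floors + np.floor(shares * remaining).astype(int)

In this sketch the per-layer sums can fall a few slots short of total_budget because of flooring; a fuller implementation would hand the leftover slots to the most important layers and apply the same idea per attention head rather than only per layer.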

Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and reduces response time, while maintaining, or even improving, answer accuracy for CoT reasoning.