Optimizing KV Cache with Reinforcement Learning
Efficient management of the Key-Value (KV) cache is crucial for Large Language Model (LLM) inference, given the increasing size of these models and the resulting memory demands. A new study introduces KV Policy (KVP), a framework that uses reinforcement learning (RL) to improve token eviction from the KV cache.
KV Policy: A Future Utility-Based Approach
KVP reframes KV cache eviction as a reinforcement learning problem, training specialized RL agents to predict the future utility of tokens. Unlike traditional methods that rely on heuristics such as recency or past attention scores, KVP directly estimates how useful each cached token will be for subsequent decoding steps. The RL agents are trained on pre-computed generation traces, using only key and value vectors, without modifying the underlying LLM or requiring additional inference passes.
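To make the eviction step concrete, the sketch below shows one way a learned utility scorer could be applied to a KV cache at inference time: score every cached token from its key and value vectors, then keep only the highest-scoring tokens within a fixed memory budget. The UtilityScorer module, its MLP architecture, and the evict_lowest_utility helper are illustrative assumptions for this sketch, not the paper's actual implementation; in KVP the scorer would be the RL-trained policy operating on the same key/value inputs.

import torch
import torch.nn as nn

class UtilityScorer(nn.Module):
    """Hypothetical scorer: predicts a scalar future-utility score for each
    cached token from its key and value vectors (architecture assumed)."""
    def __init__(self, head_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * head_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # keys, values: [num_tokens, head_dim] -> scores: [num_tokens]
        return self.mlp(torch.cat([keys, values], dim=-1)).squeeze(-1)


def evict_lowest_utility(keys, values, scorer, budget: int):
    """Keep only the `budget` tokens with the highest predicted future utility."""
    with torch.no_grad():
        scores = scorer(keys, values)
    # Indices of the top-`budget` tokens, re-sorted to preserve sequence order.
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    return keys[keep], values[keep], keep


if __name__ == "__main__":
    head_dim, num_tokens, budget = 64, 512, 256
    scorer = UtilityScorer(head_dim)
    k = torch.randn(num_tokens, head_dim)
    v = torch.randn(num_tokens, head_dim)
    k_kept, v_kept, kept_idx = evict_lowest_utility(k, v, scorer, budget)
    print(k_kept.shape, v_kept.shape)  # torch.Size([256, 64]) each

Because the scorer reads only the key and value vectors already stored in the cache, this kind of eviction can sit alongside an unmodified LLM, consistent with the no-model-changes property described above.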
Performance and Generalization
Evaluations on long-context (RULER) and multi-turn dialogue (OASST2-4k) benchmarks demonstrate that KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results suggest that predicting the future utility of tokens is an effective and scalable paradigm for adaptive KV cache management.