Optimizing KV Cache with Reinforcement Learning
Efficient management of the Key-Value (KV) cache is crucial for Large Language Model (LLM) inference, given the increasing size of these models and the resulting memory demands. A new study introduces KV Policy (KVP), a framework that uses reinforcement learning (RL) to improve token eviction from the KV cache.
KV Policy: A Future Utility-Based Approach
KVP reframes KV cache eviction as a reinforcement learning problem, training specialized RL agents to predict the future utility of tokens. Unlike traditional methods that rely on heuristics such as recency or past attention scores, KVP directly estimates how useful each cached token will be for subsequent decoding steps. The RL agents are trained on pre-computed generation traces, using only key and value vectors, without modifying the underlying LLM or requiring additional inference passes.
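To make the eviction step concrete, the sketch below shows one way a learned utility scorer could be applied to a KV cache at inference time: score every cached token from its key and value vectors, then keep only the highest-scoring tokens within a fixed memory budget. The UtilityScorer module, its MLP architecture, and the evict_lowest_utility helper are illustrative assumptions for this sketch, not the paper's actual implementation; in KVP the scorer would be the RL-trained policy operating on the same key/value inputs.

import torch
import torch.nn as nn

class UtilityScorer(nn.Module):
    """Hypothetical scorer: predicts a scalar future-utility score for each
    cached token from its key and value vectors (architecture assumed)."""
    def __init__(self, head_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * head_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # keys, values: [num_tokens, head_dim] -> scores: [num_tokens]
        return self.mlp(torch.cat([keys, values], dim=-1)).squeeze(-1)


def evict_lowest_utility(keys, values, scorer, budget: int):
    """Keep only the `budget` tokens with the highest predicted future utility."""
    with torch.no_grad():
        scores = scorer(keys, values)
    # Indices of the top-`budget` tokens, re-sorted to preserve sequence order.
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    return keys[keep], values[keep], keep


if __name__ == "__main__":
    head_dim, num_tokens, budget = 64, 512, 256
    scorer = UtilityScorer(head_dim)
    k = torch.randn(num_tokens, head_dim)
    v = torch.randn(num_tokens, head_dim)
    k_kept, v_kept, kept_idx = evict_lowest_utility(k, v, scorer, budget)
    print(k_kept.shape, v_kept.shape)  # torch.Size([256, 64]) each

Because the scorer reads only the key and value vectors already stored in the cache, this kind of eviction can sit alongside an unmodified LLM, consistent with the no-model-changes property described above.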
Performance and Generalization
Evaluations on long-context (RULER) and multi-turn dialogue (OASST2-4k) benchmarks demonstrate that KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results suggest that predicting the future utility of tokens is an effective and scalable paradigm for adaptive KV cache management.