Nvidia has developed a new technique called Dynamic Memory Sparsification (DMS) that promises to significantly improve the efficiency of LLMs during inference, reducing computing costs by up to 8x without compromising accuracy.
How DMS Works
DMS optimizes how the model manages its KV cache. The technique attaches a learned "keep or evict" signal to each token in the cache; this signal determines whether the token stays in memory or is removed, based on its estimated importance to the inference process.
In addition, DMS introduces a "delayed eviction" mechanism. Tokens marked as low importance are not deleted immediately; they remain accessible for a short period, giving the model a chance to extract any useful information they still contain before their final removal.
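The two mechanisms described above can be sketched as a toy cache in Python. This is an illustrative reconstruction based only on the article's description, not Nvidia's actual implementation: the class name, the threshold, and the window length are all hypothetical, and the learned gate is stood in for by a precomputed score.

```python
# Hypothetical sketch of DMS-style KV-cache management (illustrative names,
# not Nvidia's API). Each token carries a learned "keep or evict" score;
# low-scoring tokens are not dropped at once but stay readable for a short
# "delayed eviction" window before final removal.

DELAY_WINDOW = 4      # steps a low-importance token remains accessible (assumed)
KEEP_THRESHOLD = 0.5  # gate output above this keeps the token (assumed)

class DMSCache:
    def __init__(self):
        # Each entry: the token's KV pair, its score, and the step at which
        # it becomes evictable (None = kept indefinitely).
        self.entries = []

    def append(self, kv, keep_score, step):
        # keep_score stands in for the learned "keep or evict" signal.
        evictable = keep_score < KEEP_THRESHOLD
        evict_step = step + DELAY_WINDOW if evictable else None
        self.entries.append({"kv": kv, "evict_step": evict_step})

    def prune(self, step):
        # Final removal happens only after the delay window has elapsed,
        # so the model can still attend to the token in the meantime.
        self.entries = [e for e in self.entries
                        if e["evict_step"] is None or step < e["evict_step"]]

    def visible(self):
        # KV pairs the model can currently attend to.
        return [e["kv"] for e in self.entries]
```

Under these assumptions, a low-scoring token appended at step 1 stays visible through step 4 and disappears once pruning runs at step 5 or later, while high-scoring tokens are never evicted.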
Benefits
The reduction in KV memory usage, up to a factor of 8, translates into several advantages. Models can "think" longer, operate faster, and handle a larger number of simultaneous requests.