Geodesic Attention Engine (GAE) is an open-source attention kernel (AGPL-3.0 license) designed to optimize memory usage and energy efficiency when running large language models (LLMs).
Key Features
- Memory Efficiency: GAE can process a 1-million-token context with only 1.09 GB of VRAM, versus the roughly 4.4 TB required by standard approaches; for 65,000-token contexts the reported memory reduction is 99.6% (a rough plausibility check of these figures follows this list).
- Accuracy: The kernel guarantees bit-exact results, without approximations or sparsity.
- Energy Savings: GAE is claimed to cut energy use by more than 75% for contexts of 8,000 tokens or longer.
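As a back-of-the-envelope check (my own arithmetic, not the project's published accounting): if the 4.4 TB figure refers to materializing the full N x N attention-score matrix for N = 1,000,000 tokens in FP32, the estimate lands in the same ballpark, while a streaming kernel only ever holds a small tile of that matrix on chip. The tile size below is an arbitrary example value.

```python
# Back-of-the-envelope memory estimate for a single attention head.
# Assumption: the quoted 4.4 TB corresponds to materializing the full
# N x N score matrix; the project's exact accounting may differ.
N = 1_000_000          # context length in tokens
bytes_per_elem = 4     # FP32 scores

full_matrix_bytes = N * N * bytes_per_elem
print(f"Full N x N score matrix: {full_matrix_bytes / 1e12:.1f} TB")  # ~4.0 TB

# A fused/streaming kernel instead keeps only a small tile resident at a time.
tile = 128             # hypothetical tile size, for illustration only
tile_bytes = tile * tile * bytes_per_elem
print(f"One {tile}x{tile} tile: {tile_bytes / 1024:.0f} KiB")  # ~64 KiB
```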
Implementation
GAE achieves these results with a fused kernel that cuts HBM round-trips from 12 to 2 by keeping intermediate data in registers. The source code is available on GitHub. An illustrative comparison of unfused versus fused attention is sketched below.
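GAE's own kernel is not reproduced here; as a hedged illustration of the general fusion principle it relies on, the sketch below contrasts a naive PyTorch attention (separate kernels that write the full score and softmax tensors back to memory between steps) with PyTorch's built-in fused scaled_dot_product_attention, which computes the same result without materializing the N x N matrix. The tensor shapes are arbitrary example values.

```python
# Illustration only: unfused vs. fused attention in PyTorch.
# GAE is a separate kernel; this sketch just shows why fusing the steps
# removes memory round-trips for the intermediate score/softmax tensors.
import math
import torch
import torch.nn.functional as F

B, H, N, D = 1, 8, 1024, 64  # example batch, heads, sequence length, head dim
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N, D)
v = torch.randn(B, H, N, D)

# Unfused: each line is a separate operation; the N x N score matrix and
# its softmax are written out and re-read between steps.
scores = q @ k.transpose(-2, -1) / math.sqrt(D)   # N x N scores hit memory
probs = scores.softmax(dim=-1)                    # N x N probabilities hit memory
out_unfused = probs @ v

# Fused: a single kernel computes the same result tile by tile,
# never materializing the N x N matrix off chip.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_unfused, out_fused, atol=1e-4))
```

The comparison at the end should print True: fusion changes where intermediates live, not the mathematical result, which is consistent with the bit-exactness claim above.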
For teams evaluating on-premise deployments, the trade-off is between upfront hardware cost and long-term gains in data control and total cost of ownership (TCO). AI-RADAR offers analytical frameworks at /llm-onpremise to evaluate these aspects.