Geodesic Attention Engine (GAE) is an open-source attention kernel (AGPL-3.0 license) designed to reduce memory usage and energy consumption when running large language models (LLMs).

Key Features

  • Memory Efficiency: GAE processes 1 million tokens with only 1.09 GB of VRAM, compared to the 4.4 TB required by standard approaches; for 65,000-token contexts, a 99.6% memory reduction is reported (a rough worked estimate follows this list).
  • Accuracy: The kernel guarantees bit-exact results, with no approximations or sparsity.
  • Energy Savings: The project claims energy savings of over 75% for contexts of 8,000 tokens or longer.
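
These figures follow from the quadratic scaling of standard attention. The sketch below is a back-of-the-envelope estimate under our own assumptions (fp32 scores, a single head of dimension 128, Q/K/V storage ignored), not GAE's measured configuration; it lands in the same ballpark as the numbers above but does not reproduce them exactly.

```python
# Illustrative memory-scaling estimate (assumptions: fp32, one head, head_dim=128;
# these are not GAE's actual settings).

def full_matrix_bytes(n_tokens: int, bytes_per_score: int = 4) -> int:
    """Memory to materialize the full n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_score


def streaming_bytes(n_tokens: int, head_dim: int = 128, bytes_per_elem: int = 4) -> int:
    """Memory for a streaming kernel that never stores the score matrix:
    O(n * d) for the per-head output plus O(n) running softmax statistics."""
    output = n_tokens * head_dim * bytes_per_elem
    running_stats = n_tokens * 2 * bytes_per_elem  # per-row max and sum
    return output + running_stats


for n in (65_000, 1_000_000):
    full_gb = full_matrix_bytes(n) / 1e9
    stream_gb = streaming_bytes(n) / 1e9
    print(f"{n:>9,} tokens: full matrix ~{full_gb:,.1f} GB, streaming ~{stream_gb:.2f} GB")
```

At 1 million tokens the full score matrix alone is on the order of terabytes, while the streaming quantities stay under a gigabyte, which is why the reported savings grow with context length.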

Implementation

GAE achieves these results with a fused kernel that reduces HBM round-trips from 12 to 2 by keeping intermediate data in registers. The source code is available on GitHub.
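
For intuition on how a fused kernel can avoid materializing the score matrix while staying exact, the sketch below shows the generic online-softmax formulation of attention in NumPy, the technique popularized by FlashAttention-style kernels. It is a CPU illustration of the math only, not GAE's CUDA kernel, and the tile size is an arbitrary choice.

```python
import numpy as np


def streaming_attention(q, k, v, tile=512):
    """Single-head attention computed tile by tile with an online softmax,
    so the full (n x n) score matrix is never materialized.
    q: (n, d), k: (m, d), v: (m, d) -> output (n, d)."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row

    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        scores = (q @ kt.T) * scale             # only an (n, tile) block of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale previously accumulated partials
        p = np.exp(scores - new_max[:, None])   # numerically stabilized block weights
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vt
        row_max = new_max

    return out / row_sum[:, None]


# Quick check against a naive reference implementation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(streaming_attention(q, k, v), ref)
```

Because the rescaling is exact algebra rather than an approximation, this style of kernel produces the same result as the naive computation, which is consistent with the bit-exactness claim above.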

For teams evaluating on-premise deployments, the trade-off is between upfront hardware costs and long-term benefits in data control and total cost of ownership (TCO). AI-RADAR offers analytical frameworks at /llm-onpremise for evaluating these aspects.