Subquadratic Attention for LLMs: A New Approach

A new experimental model implementing a subquadratic attention mechanism has been presented. The approach aims to reduce the computational complexity typical of transformer models, allowing much larger contexts to be handled with limited hardware resources.

The key idea is to replace the brute-force search (O(L)) in standard attention with a jump-search (O(L^0.5)) guided by learned routing. Since each of the L queries now scans only O(L^0.5) keys, the total complexity drops to O(L^(3/2)), allowing scaling to contexts of 1M–10M tokens on a single GPU.
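
To make the idea concrete, here is a minimal single-head sketch in which each query attends only to a top-√L subset of keys selected by a cheap learned router. The function name, routing scheme, and tensor shapes are assumptions for illustration only, not the model's actual implementation, and the reference version below materializes the full routing matrix for readability.

```python
import math
import torch
import torch.nn.functional as F

def routed_sparse_attention(q, k, v, route_q, route_k):
    """Toy single-head example: each query attends to only ~sqrt(L) keys
    chosen by a learned router, instead of all L keys.

    q, k, v:          (L, d) queries / keys / values
    route_q, route_k: (L, r) low-dimensional routing embeddings (r << d)
    """
    L, d = q.shape
    n_selected = max(1, int(math.sqrt(L)))  # O(L^0.5) keys per query

    # For clarity this reference version materializes the full (L, L) routing
    # score matrix; a real jump-search avoids that to stay subquadratic in memory.
    routing_scores = route_q @ route_k.T
    topk_idx = routing_scores.topk(n_selected, dim=-1).indices  # (L, sqrt(L))

    # Gather only the selected keys/values: (L, sqrt(L), d).
    k_sel, v_sel = k[topk_idx], v[topk_idx]

    # Dense attention restricted to the selected subset.
    scores = torch.einsum("ld,lsd->ls", q, k_sel) / math.sqrt(d)
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("ls,lsd->ld", weights, v_sel)
```

Each query does O(√L) attention work, so across L tokens the total cost scales as O(L^1.5) rather than the O(L^2) of dense attention.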

Performance and Features

The 30B model, tested on a single B200 GPU, showed the following performance:

  • 1M tokens of context: Prefill ~20,202 tok/s, Decode ~109 tok/s, 66 GB of memory
  • 10M tokens of context: Prefill ~5,576 tok/s, Decode ~76 tok/s, ~120 GB of memory

A crucial aspect is that the 10x increase in context length caused only a ~30% drop in decoding speed, in contrast to the roughly 10x slowdown that dense attention would incur.
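
A quick back-of-the-envelope check using the figures reported above makes the contrast concrete; the dense-attention estimate assumes per-token attention cost grows roughly linearly with context length.

```python
# Decode throughput reported at 1M and 10M tokens of context (tok/s).
decode_1m, decode_10m = 109, 76

observed_slowdown = decode_1m / decode_10m  # ≈ 1.4x (a ~30% drop)
dense_expected_slowdown = 10                # ~10x if per-token cost scaled with L

print(f"observed: {observed_slowdown:.1f}x slower at 10x the context; "
      f"dense attention would be roughly {dense_expected_slowdown}x slower")
```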

The model ships with an OpenAI-compatible server and a CLI, facilitating integration and testing. Planned improvements include 4-bit/8-bit quantization to allow execution on consumer GPUs with 24 GB of VRAM (e.g., RTX 4090 / RTX 5090) and ports to AMD ROCm and Apple Silicon.
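
Because the server follows the OpenAI API convention, any standard OpenAI client should be able to talk to it. The snippet below is a minimal sketch; the base URL, API key, and model name are assumed placeholders, not values documented for this server.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="local-model",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize this 1M-token report."}],
)
print(response.choices[0].message.content)
```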

Implications for On-Premise Deployment

The ability to handle extended contexts with relatively modest hardware requirements opens new possibilities for the on-premise deployment of LLMs. For those evaluating on-premise deployments, there are trade-offs to consider, and AI-RADAR offers analytical frameworks at /llm-onpremise to support these evaluations.