Subquadratic Attention for LLMs: A New Approach

A new experimental model implementing a subquadratic attention mechanism has been presented. The approach aims to reduce the computational complexity typical of transformer models, allowing much larger contexts to be handled with limited hardware resources.

The key idea is to replace the brute-force search (O(L)) in standard attention with a jump-search (O(L^0.5)) guided by learned routing. Since each of the L queries now scans only O(L^0.5) keys, the total complexity drops to O(L^(3/2)), allowing scaling to contexts of 1M–10M tokens on a single GPU.
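
To make the idea concrete, here is a minimal single-head sketch in which each query attends only to a top-√L subset of keys selected by a cheap learned router. The function name, routing scheme, and tensor shapes are assumptions for illustration only, not the model's actual implementation, and the reference version below materializes the full routing matrix for readability.

```python
import math
import torch
import torch.nn.functional as F

def routed_sparse_attention(q, k, v, route_q, route_k):
    """Toy single-head example: each query attends to only ~sqrt(L) keys
    chosen by a learned router, instead of all L keys.

    q, k, v:          (L, d) queries / keys / values
    route_q, route_k: (L, r) low-dimensional routing embeddings (r << d)
    """
    L, d = q.shape
    n_selected = max(1, int(math.sqrt(L)))  # O(L^0.5) keys per query

    # For clarity this reference version materializes the full (L, L) routing
    # score matrix; a real jump-search avoids that to stay subquadratic in memory.
    routing_scores = route_q @ route_k.T
    topk_idx = routing_scores.topk(n_selected, dim=-1).indices  # (L, sqrt(L))

    # Gather only the selected keys/values: (L, sqrt(L), d).
    k_sel, v_sel = k[topk_idx], v[topk_idx]

    # Dense attention restricted to the selected subset.
    scores = torch.einsum("ld,lsd->ls", q, k_sel) / math.sqrt(d)
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("ls,lsd->ld", weights, v_sel)
```

Each query does O(√L) attention work, so across L tokens the total cost scales as O(L^1.5) rather than the O(L^2) of dense attention.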

Performance and Features

The 30B model, tested on a single B200 GPU, showed the following performance:

  • 1M tokens of context: Prefill ~20,202 tok/s, Decode ~109 tok/s, 66 GB of memory
  • 10M tokens of context: Prefill ~5,576 tok/s, Decode ~76 tok/s, ~120 GB of memory

A crucial aspect is that the 10x increase in context length caused only a ~30% drop in decoding speed, in contrast to the roughly 10x slowdown that dense attention would incur.
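
A quick back-of-the-envelope check using the figures reported above makes the contrast concrete; the dense-attention estimate assumes per-token attention cost grows roughly linearly with context length.

```python
# Decode throughput reported at 1M and 10M tokens of context (tok/s).
decode_1m, decode_10m = 109, 76

observed_slowdown = decode_1m / decode_10m  # ≈ 1.4x (a ~30% drop)
dense_expected_slowdown = 10                # ~10x if per-token cost scaled with L

print(f"observed: {observed_slowdown:.1f}x slower at 10x the context; "
      f"dense attention would be roughly {dense_expected_slowdown}x slower")
```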

The model ships with an OpenAI-compatible server and a CLI, facilitating integration and testing. Planned improvements include 4-bit/8-bit quantization to allow execution on consumer GPUs with 24 GB of VRAM (e.g., RTX 4090 / RTX 5090) and ports to AMD ROCm and Apple Silicon.
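
Because the server follows the OpenAI API convention, any standard OpenAI client should be able to talk to it. The snippet below is a minimal sketch; the base URL, API key, and model name are assumed placeholders, not values documented for this server.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="local-model",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize this 1M-token report."}],
)
print(response.choices[0].message.content)
```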

Implications for On-Premise Deployment

The ability to handle extended contexts with relatively modest hardware requirements opens new possibilities for the on-premise deployment of LLMs. For those evaluating on-premise deployments, there are trade-offs to consider, and AI-RADAR offers analytical frameworks at /llm-onpremise to support these evaluations.