GSA: A Novel Approach to Attention in Language Models
Large language models demand enormous computational resources, especially when they must process very long contexts. Two main lines of work have emerged around the attention mechanism: sparse attention, which reduces cost by attending only to a selected subset of tokens, and gated attention variants, which improve training stability.
A new study introduces Gated Sparse Attention (GSA), an architecture that combines the benefits of both approaches. GSA uses a gated lightning indexer with sigmoid activations, an adaptive sparsity controller, and a dual gating system.
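To make that description more concrete, below is a minimal PyTorch sketch of how the three pieces could fit together in one attention layer: a cheap indexer whose scores are modulated by a sigmoid gate, a top-k selection step standing in for the sparsity controller, and a second sigmoid gate on the attention output. This is an interpretation of the summary above, not the paper's implementation; all names (GatedSparseAttention, idx_gate, out_gate, top_k) and the exact wiring are assumptions.

```python
# Illustrative sketch only: module names, dimensions, and wiring are assumptions
# based on the article's description, not the paper's actual code.
import torch
import torch.nn as nn


class GatedSparseAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, indexer_dim: int = 64, top_k: int = 256):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.top_k = top_k
        # Standard attention projections.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Lightweight indexer: low-dimensional scores used to pick which keys each query keeps.
        self.idx_q = nn.Linear(d_model, indexer_dim)
        self.idx_k = nn.Linear(d_model, indexer_dim)
        # Sigmoid gate on the indexer scores ("gated lightning indexer").
        self.idx_gate = nn.Linear(d_model, 1)
        # Second gate applied to the attention output ("dual gating").
        self.out_gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k_sel = min(self.top_k, T)  # never select more keys than exist

        # 1) Cheap (B, T, T) relevance scores, gated per query token.
        iq, ik = self.idx_q(x), self.idx_k(x)
        scores = iq @ ik.transpose(1, 2) / iq.shape[-1] ** 0.5
        scores = scores * torch.sigmoid(self.idx_gate(x))
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))

        # 2) Sparsity: keep only the top-k keys per query.
        topk_idx = scores.topk(k_sel, dim=-1).indices
        keep = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk_idx, True)
        mask = causal | ~keep  # re-apply causality in case top-k picked masked slots

        # 3) Standard multi-head attention restricted to the selected keys.
        def split(t):  # (B, T, D) -> (B, H, T, head_dim)
            return t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        att = att.masked_fill(mask.unsqueeze(1), float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, D)

        # 4) Output gate: element-wise sigmoid gating of the attention output,
        #    the mechanism gated-attention work associates with fewer attention sinks.
        out = out * torch.sigmoid(self.out_gate(x))
        return self.o_proj(out)
```

In this sketch the speedup comes from step 2: the expensive softmax attention only ever sees the k selected keys per query, while the indexer itself stays cheap because it works in a small projection space.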
Experimental Results
Experimental results, obtained with 1.7-billion-parameter models trained on 400 billion tokens, show that GSA matches the efficiency of sparse-only baselines (12-16x speedups at 128K context) while achieving the qualitative improvements associated with gated attention. In particular, perplexity drops from 6.03 to 5.70, RULER scores at 128K context nearly double, and attention to the first token (an indicator of attention sinks) falls from 47% to under 4%. Training stability also improves markedly, with loss spikes reduced by 98%.
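As a side note, the attention-sink metric cited above is straightforward to measure. The helper below shows one way to compute the share of attention mass placed on the first token from a post-softmax attention tensor; the (batch, heads, query, key) layout is an assumption, not the paper's code.

```python
# Hypothetical helper for the "attention to the first token" metric; layout assumed.
import torch


def first_token_attention_share(attn: torch.Tensor) -> float:
    """attn: (batch, heads, query_len, key_len) post-softmax attention weights.
    Returns the average probability mass that queries place on key position 0."""
    return attn[..., 0].mean().item()


# A uniform pattern over 128 keys puts ~1/128 (about 0.8%) of the mass on token 0;
# a strong attention sink pushes this toward the ~47% reported for baselines.
uniform = torch.full((1, 8, 128, 128), 1.0 / 128)
print(f"{first_token_attention_share(uniform):.2%}")  # ~0.78%
```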
Implications
The GSA architecture marks a notable step toward more efficient and stable language models, opening the door to applications that depend on very long contexts. By cutting computational cost while improving output quality, GSA stands out as a promising direction for future work in natural language processing.