RoPE and Variable Length Inputs: A Geometric Analysis

Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding position in language models. However, model performance tends to degrade when input length exceeds the context length seen during training.
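To ground the discussion, here is a minimal NumPy sketch of standard RoPE (not the paper's code): each pair of channels is rotated by an angle proportional to the token's position, with lower-indexed pairs rotating at higher frequencies. The function name `rope` and the interleaved channel-pairing convention are illustrative choices; implementations differ in pairing layout.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply Rotary Positional Embedding to the last dimension of x.

    x: (seq_len, dim) with dim even; positions: (seq_len,) integer positions.
    Each channel pair (2i, 2i+1) is rotated by angle
    position * base**(-2i/dim), so lower-indexed pairs rotate faster.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies: theta_i = base^(-2i/dim)
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd channels
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The key property is that the dot product between a rotated query and key depends only on their relative position, which is what makes RoPE attractive for attention, and also why out-of-range relative positions at inference time can push attention scores out of distribution.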

A recent study analyzed this phenomenon from a geometric perspective, showing how longer inputs erode the separation between key and query clusters in the latent space. This leads to anomalous behavior, in particular inhibiting the function of "sink tokens": tokens that absorb excess attention mass and thereby prevent mixing between tokens when no mixing is needed.
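The sink mechanism can be illustrated with a toy softmax computation (an illustrative sketch with made-up scores, not the paper's setup): when a query matches no content key strongly, the sink token's moderately high score captures most of the attention mass, so little mixing occurs; a genuinely matching key still wins over the sink.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(s - s.max())
    return e / e.sum()

# Hypothetical attention scores for one query over 5 keys; key 0 is the sink.
# Content keys match weakly (scores near 0); the sink scores higher.
scores_no_match = np.array([4.0, 0.1, -0.2, 0.0, 0.1])
w = softmax(scores_no_match)
# Most of the mass collapses onto the sink -> little mixing of content tokens.

# When a content key matches strongly, it dominates the sink instead:
scores_match = np.array([4.0, 8.0, -0.2, 0.0, 0.1])
w2 = softmax(scores_match)
```

If long inputs push scores out of distribution and the sink stops winning these "no match" cases, attention mass leaks onto content tokens that should have been ignored.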

RoPE-ID: A Solution for Extended Inputs

Based on this geometric analysis, the researchers propose RoPE-ID (In Distribution), a modification that allows attention layers to generalize to longer inputs by applying RoPE at high frequency to only a subset of channels.
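One possible reading of "RoPE at high frequency on a subset of channels" is sketched below: only the first few channel pairs are rotated, using a small base so that band stays high-frequency, while the remaining channels pass through unrotated. This is an illustrative interpretation with assumed names (`rope_subset`, `num_rotated_pairs`) and an assumed parameterization; the paper's exact formulation may differ.

```python
import numpy as np

def rope_subset(x, positions, num_rotated_pairs, base=100.0):
    """Rotate only the first `num_rotated_pairs` channel pairs of x.

    x: (seq_len, dim); positions: (seq_len,) integer positions.
    A small `base` keeps the rotated pairs at high frequencies; channels
    beyond index 2*num_rotated_pairs are left untouched.
    """
    seq_len, dim = x.shape
    k = num_rotated_pairs
    out = x.copy()
    # High-frequency band only: theta_i = base^(-2i / (2k))
    freqs = base ** (-np.arange(k) * 2.0 / (2 * k))   # (k,)
    angles = positions[:, None] * freqs[None, :]      # (seq_len, k)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0:2 * k:2], x[:, 1:2 * k:2]
    out[:, 0:2 * k:2] = x1 * cos - x2 * sin
    out[:, 1:2 * k:2] = x1 * sin + x2 * cos
    return out
```

The rotated band still encodes relative position (rotated query-key dot products shift with position), while the unrotated channels remain position-independent, so their contribution to attention scores cannot drift out of distribution at unseen lengths.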

The researchers demonstrate the effectiveness of RoPE-ID with 1B- and 3B-parameter Transformers on the LongBench and RULER long-context retrieval benchmarks, showing that the modification handles longer inputs without a significant drop in performance.