ReBalance: An Innovative Approach to LLM Efficiency

Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, but they often waste computation on simple problems (overthinking) or fail to explore the solution space sufficiently (underthinking). Both behaviors limit their practical applicability, especially in resource-constrained settings.

ReBalance is a training-free framework that addresses both problems. It treats the model's confidence as a continuous indicator of reasoning dynamics: overthinking is detected as high variance in the confidence signal, while underthinking shows up as excessively high, near-constant confidence.
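As a rough illustration of this idea, the confidence trace of a generation can be classified by its variance and mean. The thresholds and the decision rule below are illustrative assumptions, not the framework's exact criteria:

```python
from statistics import mean, pvariance

def classify_reasoning_mode(confidences, var_threshold=0.02, mean_threshold=0.9):
    """Label a per-token confidence trace as a reasoning mode.

    High variance  -> 'overthinking'  (unstable, oscillating confidence)
    High, flat mean -> 'underthinking' (excessive, constant confidence)
    Otherwise      -> 'balanced'
    """
    if pvariance(confidences) > var_threshold:
        return "overthinking"
    if mean(confidences) > mean_threshold:
        return "underthinking"
    return "balanced"

# A noisy trace vs. a uniformly overconfident one
print(classify_reasoning_mode([0.95, 0.4, 0.9, 0.3, 0.85]))    # overthinking
print(classify_reasoning_mode([0.97, 0.96, 0.98, 0.97, 0.96]))  # underthinking
```

In practice the per-token confidence could come from the maximum softmax probability at each decoding step; this sketch only shows how a trace, once extracted, maps to a mode.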

How ReBalance Works

The system aggregates hidden states from a small dataset into reasoning-mode prototypes, then computes a steering vector from them to guide the model's reasoning trajectory. A dynamic control function modulates the vector's direction and intensity based on real-time confidence, suppressing redundancy during overthinking and encouraging exploration during underthinking.
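The mechanism can be sketched as follows. All function names, the prototype construction, and the linear control rule are assumptions for illustration; the actual framework may define them differently:

```python
import numpy as np

def build_steering_vector(over_states, under_states):
    """Aggregate hidden states of each mode into a prototype;
    their normalized difference serves as the steering direction."""
    proto_over = np.mean(over_states, axis=0)    # overthinking prototype
    proto_under = np.mean(under_states, axis=0)  # underthinking prototype
    v = proto_over - proto_under
    return v / np.linalg.norm(v)

def control_strength(confidence, target=0.7, gain=2.0):
    """Dynamic control: sign and magnitude follow real-time confidence.
    Above the target, steering pushes away from the overthinking
    prototype; below it, toward more decisive reasoning."""
    return gain * (confidence - target)

def steer(hidden_state, v, confidence):
    """Apply the confidence-modulated steering vector to a hidden state."""
    return hidden_state - control_strength(confidence) * v
```

At the target confidence the control term vanishes and the hidden state passes through unchanged, so steering only activates when the trace drifts toward one of the two failure modes.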

Experimental Results

Extensive tests on four models (from 0.5B to 32B parameters) across nine benchmarks spanning mathematical reasoning, general question answering, and code generation show that ReBalance reduces redundancy while improving accuracy. This makes it a general, plug-and-play, training-free strategy for efficient and robust LLM deployment.

The project code is available on GitHub.

For teams evaluating on-premise deployments, such efficiency techniques involve trade-offs worth weighing. AI-RADAR offers analytical frameworks at /llm-onpremise for assessing them.