Google: Longer Reasoning Chains Don't Imply Higher Accuracy in LLMs

Google Challenges the Extended Chain of Thought Assumption

A recent Google study challenges the common belief that longer reasoning chains in large language models (LLMs) necessarily yield more accurate answers. The research, which analyzed eight model variants including GPT-OSS, DeepSeek-R1, and Qwen3, found a negative correlation between the length of the reasoning chain, measured in tokens, and answer accuracy.

Deep Thinking Ratio (DTR): A New Metric

The research team introduced the Deep Thinking Ratio (DTR), a metric that measures the fraction of generated tokens that involve deep processing, as opposed to filler tokens. DTR is calculated by tracking how the predicted token distribution changes across the model's layers. Tokens whose predictions stabilize quickly in the shallow layers are considered "filler", while those that are still being revised in the deeper layers are taken as a sign of real reasoning. The study found that DTR correlates positively with accuracy (a correlation of 0.82), higher than that observed for token chain length.
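
The layer-wise idea described above can be illustrated with a minimal sketch. It is not the study's exact definition: the use of Jensen-Shannon divergence, the `threshold` and `depth_fraction` parameters, and all function names are assumptions made for illustration. The input is assumed to be, for each token, a list of per-layer probability distributions over the vocabulary.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_layer(layer_dists, threshold=0.1):
    """Earliest layer from which the prediction stays close to the final layer's.

    Assumption: "close" means a JS divergence of at most `threshold`.
    """
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    # Walk backwards from the last layer while the prediction still matches it.
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) <= threshold:
            settled = i
        else:
            break
    return settled


def deep_thinking_ratio(token_layer_dists, threshold=0.1, depth_fraction=0.5):
    """Fraction of tokens whose prediction only settles in the deeper layers.

    Assumption: a token counts as "deep" if it settles at or beyond
    `depth_fraction` of the way through the layer stack.
    """
    if not token_layer_dists:
        return 0.0
    deep = 0
    for layer_dists in token_layer_dists:
        last_index = len(layer_dists) - 1
        if settling_layer(layer_dists, threshold) >= depth_fraction * last_index:
            deep += 1
    return deep / len(token_layer_dists)


# Toy example with 4 layers and a 2-word vocabulary (made-up numbers).
filler_token = [[0.9, 0.1]] * 4                                 # stable from the first layer
deep_token = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]   # revised until the last layer
print(deep_thinking_ratio([filler_token, deep_token]))
```

In a real setting the per-layer distributions would have to be obtained from the model's intermediate hidden states, which this sketch does not cover.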

Think@n Strategy

Based on these findings, the researchers proposed the Think@n strategy: sample multiple reasoning paths, estimate each path's DTR from its first 50 tokens, and keep only the 50% of samples with the highest DTR. Combined with a majority vote over the remaining samples, this approach matched or exceeded the accuracy of the standard approach while roughly halving the computational load. For example, GPT-OSS-120B-medium achieved 94.7% on AIME 2025 with Think@n, compared to 92.7% with the standard approach. Because low-quality reasoning paths are identified and terminated early, token consumption in the tests dropped from 355.6k to 181.9k, a reduction of about 49%.
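
The selection step can be sketched as follows. This is a hedged illustration, not the researchers' implementation: `think_at_n`, `finish`, and `keep_fraction` are hypothetical names, each candidate is assumed to arrive with a DTR estimate already computed from its prefix, and `finish` stands in for whatever completes a surviving reasoning path and extracts its answer.

```python
import math
from collections import Counter


def think_at_n(candidates, finish, keep_fraction=0.5):
    """Keep the highest-DTR candidates, complete only those, and majority-vote.

    candidates: list of (prefix_dtr, handle) pairs, where `handle` identifies
        a partially generated reasoning path (hypothetical representation).
    finish: callable mapping a handle to that path's final answer.
    keep_fraction: share of candidates to keep (0.5 mirrors the 50% in the text).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    keep = max(1, math.ceil(len(candidates) * keep_fraction))
    # Rank by estimated DTR, highest first; discarded paths are never completed.
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    answers = [finish(handle) for _, handle in ranked[:keep]]
    # On a tie, Counter returns the answer encountered first,
    # i.e. the one from the higher-DTR path.
    return Counter(answers).most_common(1)[0][0]


# Toy usage with made-up DTR estimates and answers.
final_answers = {"a": "42", "b": "7", "c": "42", "d": "7"}
completed = []


def finish(handle):
    completed.append(handle)  # record which paths were actually completed
    return final_answers[handle]


samples = [(0.9, "a"), (0.2, "b"), (0.8, "c"), (0.1, "d")]
print(think_at_n(samples, finish))
print(completed)
```

Only the surviving half of the paths is passed to `finish`, which is where the saving in generated tokens would come from in this sketch.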

Relevance for Local Inference

This research has significant implications for running LLMs locally. Terminating inefficient reasoning paths early frees up compute, so more attempts can be run within the same computational budget. Cloud-based tools that run multiple agent passes could also benefit from this kind of filtering.