More Thinking, More Bias: Reasoning Length Correlates with Position Bias in LLMs

New research challenges the common assumption that reasoning-capable Large Language Models (LLMs), whether prompted with Chain-of-Thought (CoT) or purpose-tuned for reasoning like DeepSeek-R1, are inherently more robust to shallow heuristic biases. The prevailing idea is that deeper "thinking" yields more considered decisions, less swayed by superficial cues. A recent study reveals a different reality: position bias in multiple-choice answers can actually increase with the length of the model's reasoning trajectory.

The research, published on arXiv, examined the behavior of thirteen reasoning-mode model configurations, including two 7B-8B-parameter models distilled from R1, two base models prompted with CoT, and the 671B-parameter DeepSeek-R1. Tests were conducted on established benchmarks such as MMLU, ARC-Challenge, and GPQA. The results were significant: twelve of the thirteen configurations showed a positive partial correlation between reasoning-trajectory length and Position Bias Score (PBS), with values ranging from 0.11 to 0.41 (all p < 0.05), even after controlling for overall accuracy. Notably, all twelve open-weight reasoning-mode configurations showed monotonically increasing PBS across length quartiles. A truncation intervention provided causal evidence: continuations resumed from later points in a trajectory were increasingly likely to shift toward position-preferred options, rising from 16% to 32% across absolute-position buckets for the R1-Qwen-7B model.
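The paper's exact PBS formula is not reproduced here, so the sketch below makes an assumption: it treats PBS as the excess rate at which the most-favored answer slot is chosen when correct answers are rotated uniformly across positions, and it residualizes out accuracy before correlating a bias signal with trajectory length. The function names (`position_bias_score`, `partial_corr`) are illustrative, not the authors' code.

```python
import numpy as np

def position_bias_score(choices: np.ndarray, n_options: int = 4) -> float:
    """Excess preference for the most-favored answer slot over chance.

    `choices[i]` is the option index the model picked on trial i. Assumes
    correct answers were rotated uniformly across slots, so any slot whose
    pick rate exceeds 1/n_options is being favored for its position.
    """
    rates = np.bincount(choices, minlength=n_options) / len(choices)
    return float(rates.max() - 1.0 / n_options)

def partial_corr(x, y, z) -> float:
    """Correlation of x (trajectory length) and y (per-item bias signal)
    after regressing out the control z (accuracy) via least squares."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    zc = np.column_stack([np.ones_like(z), z])
    rx = x - zc @ np.linalg.lstsq(zc, x, rcond=None)[0]
    ry = y - zc @ np.linalg.lstsq(zc, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])
```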

Implications for Deployment and Evaluation

These findings have direct implications for CTOs, DevOps leads, and infrastructure architects evaluating LLM deployment, especially in self-hosted or air-gapped environments where data sovereignty, control, and predictable model behavior are paramount. If reasoning models are not inherently order-robust in multiple-choice evaluation pipelines, their behavior in real-world production scenarios, where the order of information or options may shift in unexpected ways, deserves equal scrutiny.
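A cheap first check, before any deeper auditing, is to measure order robustness directly: ask the same question several times with the options shuffled and see how often the model's chosen option (by content, not by letter) stays put. The sketch below assumes a hypothetical `ask(prompt)` callable wrapping your deployed model that returns the text of the selected option.

```python
import random
import string

def order_robustness(ask, question: str, options: list[str], n_perms: int = 8) -> float:
    """Fraction of shuffled option orders on which the model selects the
    same option *content* as it did under the canonical order.

    `ask(prompt) -> str` is a hypothetical model wrapper returning the
    text of the chosen option; values well below 1.0 signal position bias.
    """
    def build_prompt(opts: list[str]) -> str:
        letters = string.ascii_uppercase[:len(opts)]
        body = "\n".join(f"{l}. {o}" for l, o in zip(letters, opts))
        return f"{question}\n{body}\nAnswer with the single best option."

    baseline = ask(build_prompt(options))
    agree = sum(ask(build_prompt(random.sample(options, len(options)))) == baseline
                for _ in range(n_perms))
    return agree / n_perms
```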

For those evaluating self-hosted solutions, understanding these biases is essential to ensure that models answer accurately and impartially, regardless of the reasoning complexity required. The research suggests that overall model accuracy may gate the expression of length-driven bias rather than eliminate the underlying mechanism. For instance, the 671B DeepSeek-R1 model showed a very low aggregate PBS (0.019), yet the length effect still surfaced in the longest quartile of trajectories (PBS = 0.071). Even larger, higher-performing models are therefore not immune to the phenomenon and still warrant careful evaluation and auditing.
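Reproducing that kind of quartile breakdown on your own evaluation logs is straightforward. The sketch below reuses the `position_bias_score` helper from earlier and, again, assumes trials were run with correct answers rotated uniformly across slots; a PBS that climbs from Q1 to Q4 is the signature the paper describes.

```python
import numpy as np

def pbs_by_length_quartile(lengths, choices, n_options: int = 4) -> list[float]:
    """Bucket trials into reasoning-length quartiles (Q1 = shortest) and
    report position_bias_score within each bucket. A monotonic rise
    across buckets mirrors the length-accumulated bias in the study."""
    lengths, choices = np.asarray(lengths), np.asarray(choices)
    edges = np.quantile(lengths, [0.25, 0.5, 0.75])
    buckets = np.digitize(lengths, edges)  # 0..3 map to Q1..Q4
    return [position_bias_score(choices[buckets == q], n_options)
            for q in range(4)]
```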

Bias Distinction and Diagnostic Tools

The study also highlighted an important distinction between direct-answer position bias and bias accumulated through CoT reasoning. Direct-answer position bias has a different footprint: it was strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length. CoT reasoning, in this context, appears to replace that baseline bias with length-accumulated bias. The distinction matters because it tells auditors which failure mode to look for under each response-generation approach.

To address these challenges, researchers propose a comprehensive diagnostic toolkit for auditing position bias in reasoning models. This toolkit includes metrics such as the Position Bias Score (PBS), "commitment change point," "effective switching," and "truncation probes." Adopting such tools is essential for organizations planning to deploy LLMs in critical environments where transparency and bias mitigation are non-negotiable requirements.
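Of those, the truncation probe is the most mechanical to implement. A minimal sketch follows, assuming a hypothetical `generate(prompt)` wrapper that returns the model's final answer letter (the paper's exact probe protocol may differ in detail): resume generation from progressively later cut points in a recorded trajectory and tally how often the continuation lands on the position-preferred option.

```python
from collections import defaultdict

def truncation_probe(generate, trials, cut_fracs=(0.25, 0.5, 0.75)) -> dict[float, float]:
    """For each trial (question prompt, recorded reasoning trajectory,
    position-preferred option letter), resume generation from several
    truncation points and measure how often the continuation ends on the
    preferred slot. Rates that rise with later cuts would echo the
    paper's 16% -> 32% result for R1-Qwen-7B.

    `generate(prompt) -> str` is a hypothetical model wrapper that
    returns the final answer letter.
    """
    hits = defaultdict(int)
    for prompt, trajectory, preferred in trials:
        for frac in cut_fracs:
            cut = int(len(trajectory) * frac)
            resumed = generate(prompt + trajectory[:cut])
            hits[frac] += (resumed == preferred)
    return {frac: hits[frac] / len(trials) for frac in cut_fracs}
```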

Future Outlook for LLM Reliability

The finding that longer reasoning can increase position bias underscores the inherent complexity of evaluating and optimizing LLMs. For teams managing AI infrastructure, simply adopting "reasoning" models is not sufficient to guarantee immunity from subtle but potentially harmful biases; robust auditing methodologies must be integrated into development and deployment pipelines.

This approach is particularly relevant for strategies prioritizing data sovereignty and full control over infrastructure, such as self-hosted deployments. The ability to diagnose and potentially mitigate these biases locally, without relying on proprietary cloud solutions, becomes a critical factor for total cost of ownership (TCO) and compliance. AI-RADAR, for example, offers analytical frameworks on /llm-onpremise for evaluating the trade-offs between performance, cost, and reliability in on-premise deployments. Continued research in this field will be fundamental to building more reliable and impartial LLMs for enterprise applications.