Analysis of Reasoning Failures in Large Language Models

A recent study published on arXiv (arXiv:2602.06176v1) provides an in-depth examination of reasoning failures in large language models (LLMs). Despite rapid progress, LLMs still exhibit significant reasoning shortcomings, even in seemingly simple scenarios.

The study categorizes reasoning into two main types: embodied and non-embodied, with the latter further divided into informal (intuitive) and formal (logical) reasoning. In parallel, it classifies reasoning failures into three categories:

  • Fundamental failures: intrinsic to LLM architectures, with broad impact across tasks.
  • Application-specific limitations: failures that surface only in particular domains.
  • Robustness issues: inconsistent performance under minor input variations (see the sketch after this list).
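
As a concrete illustration of the taxonomy, and of the robustness category in particular, here is a minimal Python sketch: an enum mirroring the three failure categories and a helper that measures how consistently a model answers paraphrases of the same question. Everything here (`FailureCategory`, `consistency_rate`, `model_fn`, the toy model) is an illustrative assumption, not code from the paper; `model_fn` stands in for whatever inference call your stack provides.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Iterable


class FailureCategory(Enum):
    """The three failure categories discussed in the survey (illustrative encoding)."""
    FUNDAMENTAL = "fundamental"            # intrinsic to LLM architectures, broad impact
    APPLICATION_SPECIFIC = "application"   # surfaces only in particular domains
    ROBUSTNESS = "robustness"              # inconsistent behavior under minor input variations


def consistency_rate(model_fn: Callable[[str], str], paraphrases: Iterable[str]) -> float:
    """Fraction of paraphrased prompts that produce the modal (most common) answer.

    `model_fn` is a placeholder for whatever inference call you actually use.
    A value near 1.0 suggests robust behavior; lower values point to
    robustness-type failures.
    """
    answers = [model_fn(p).strip().lower() for p in paraphrases]
    if not answers:
        return 0.0
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)


if __name__ == "__main__":
    # Toy stand-in model: answers correctly except for one phrasing.
    def toy_model(prompt: str) -> str:
        return "9.11" if "bigger" in prompt.lower() else "9.9"

    prompts = [
        "Which is bigger, 9.9 or 9.11?",
        "Is 9.9 larger than 9.11?",
        "Compare 9.9 and 9.11: which is greater?",
    ]
    print(f"consistency: {consistency_rate(toy_model, prompts):.2f}")  # prints 0.67
```

On prompts that should be interchangeable, a consistency rate well below 1.0 is a quick signal of the robustness-type failures the survey describes.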

For each type of failure, the study provides a clear definition, reviews existing work, explores root causes, and presents mitigation strategies. The goal is a structured view of LLM weaknesses that can guide future research toward stronger, more reliable reasoning capabilities. The authors have also made a collection of research resources on LLM reasoning failures available on GitHub.

For those evaluating on-premise deployments, these reasoning limitations are among the trade-offs to consider. AI-RADAR offers analytical frameworks at /llm-onpremise to help evaluate these aspects.