PathoSage: An Agentic Framework for Computational Pathology with Structured Evidence Adjudication

PathoSage: Addressing MLLM Challenges in Computational Pathology

Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows show significant promise for computational pathology. These systems could automate and enhance the analysis of complex images and data, yet their use in critical clinical contexts is still hindered by substantial challenges. In particular, reliable patch-level reasoning, which is essential for accurate diagnosis, remains a weak point.

End-to-end MLLMs, while powerful, tend to "hallucinate" morphological features, meaning they describe details that are not present in the actual data. Meanwhile, current agentic systems frequently merge outputs from various tools and retrieved knowledge into a single shared context. This leaves decisions vulnerable to conflicting evidence and context contamination, which compromises the overall reliability of the system.

A Structured Approach to Evidence Adjudication

To overcome these limitations, PathoSage has been proposed as a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level multimodal reasoning in pathology. This modular architecture is designed to ensure greater transparency and robustness in the decision-making process.
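The separation of stages can be pictured with a minimal Python sketch. It is an illustration only, not PathoSage's implementation: the names `Finding` and `run_pipeline` and the callables `retrieve`, `tools`, and `adjudicate` are hypothetical, and it assumes that each tool sees only the question and that the adjudicator receives only structured findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One structured piece of output handed to the adjudicator (hypothetical)."""
    source: str   # "knowledge" or the name of the tool that produced it
    content: str  # the retrieved passage or the tool's answer


def run_pipeline(question, retrieve, tools, adjudicate):
    """Run three stages that exchange structured findings only."""
    # Stage 1: knowledge retrieval
    knowledge = [Finding("knowledge", text) for text in retrieve(question)]
    # Stage 2: evidence collection; each tool runs on its own
    # (assumption: tools do not see each other's outputs)
    evidence = [Finding(name, tool(question)) for name, tool in tools.items()]
    # Stage 3: adjudication receives an immutable tuple of findings,
    # not the working state of the earlier stages
    return adjudicate(question, tuple(knowledge + evidence))
```

Because the third stage receives only a tuple of findings, nothing from the earlier stages' intermediate reasoning can leak into the final decision in this sketch.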

The core of PathoSage is a component called Structured Evidence Deliberation. This module independently evaluates heterogeneous evidence from different tools, analyzes conflicts among the collected findings, and generates the final judgment. Crucially, this deliberation takes place in a "fresh" context that is isolated from the previous stages, with the goal of reducing the anchoring bias that pre-existing information or a contaminated context could introduce.
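The three steps described above (independent evaluation, conflict analysis, final judgment) can be sketched in a few lines of Python. This is a deterministic stand-in under stated assumptions, not the module described in the paper: the `Evidence` record, the `deliberate` function, the confidence-times-reliability scoring rule, and the default weight of 0.5 for unknown tools are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    tool: str          # which tool produced the finding
    label: str         # the answer or diagnosis it supports
    confidence: float  # tool-reported confidence in [0, 1]


def deliberate(evidence, reliability):
    """Score each item on its own, flag conflicts, then pick a label.

    `reliability` maps tool name -> weight in [0, 1]; unknown tools get 0.5
    (an assumption of this sketch).
    """
    support = defaultdict(float)
    for item in evidence:
        # Independent evaluation: an item's score depends only on that item
        support[item.label] += item.confidence * reliability.get(item.tool, 0.5)
    if not support:
        return {"label": None, "conflict": False, "support": {}}
    # Conflict analysis: more than one label received support
    conflict = len(support) > 1
    # Final judgment: the label with the highest total support
    best = max(support, key=support.get)
    return {"label": best, "conflict": conflict, "support": dict(support)}
```

The function takes only structured evidence records as input, which mirrors the idea of a fresh context: no earlier prompt or transcript is available to anchor the decision.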

Tool Reliability and Deployment Implications

In addition to structured deliberation, PathoSage introduces a training-free Beta-Bernoulli experience system. This system is designed to model the long-term reliability of the tools used and to construct similarity-weighted priors for their future use. Continuous credit assignment allows the system to learn and adapt to the reliability of each tool over time, progressively improving the quality of decisions.
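A minimal sketch of how such a system might work is shown below. It is one plausible reading, not the paper's formulation: the class `ToolExperience`, the uniform Beta(1, 1) base prior, the use of cosine similarity over case embeddings, and the treatment of continuous credit as a fractional success in [0, 1] are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class ToolExperience:
    """Hypothetical per-tool memory of past cases and the credit each earned."""
    alpha0: float = 1.0  # uniform Beta(1, 1) base prior (an assumption)
    beta0: float = 1.0
    records: list = field(default_factory=list)  # (embedding, credit) pairs

    def update(self, embedding, credit):
        """Store a past case; credit in [0, 1] acts as a fractional success."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        self.records.append((tuple(embedding), credit))

    def prior(self, query):
        """Return Beta parameters (alpha, beta) weighted by similarity to `query`."""
        alpha, beta = self.alpha0, self.beta0
        for embedding, credit in self.records:
            weight = max(0.0, cosine(query, embedding))  # dissimilar cases count for nothing
            alpha += weight * credit
            beta += weight * (1.0 - credit)
        return alpha, beta

    def reliability(self, query):
        """Mean of the Beta prior: the expected success rate on a similar case."""
        alpha, beta = self.prior(query)
        return alpha / (alpha + beta)
```

No gradient updates are involved: the state is a list of past cases, so the estimate changes as soon as a new outcome is recorded, which is consistent with the training-free description.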

Experimental results demonstrate that PathoSage effectively mitigates VQA (Visual Question Answering) hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. For CTOs, DevOps leads, and infrastructure architects evaluating AI/LLM deployments, this emphasis on reliable and robust reasoning matters. In on-premise or air-gapped environments, where data sovereignty and compliance are priorities, a system that produces fewer errors and hallucinations can reduce the total cost of ownership (TCO) by limiting the need for human verification and increasing trust in the results.

Towards More Robust and Controllable AI Agents

PathoSage's approach highlights how explicit evidence adjudication and reliability-aware tool modeling are key ingredients for developing robust AI agents, especially in critical sectors such as medicine. A system's ability to critically analyze its sources and manage information conflicts is a significant step towards achieving more reliable and interpretable outcomes.

For organizations considering the on-premise deployment of LLMs and MLLMs, solutions like PathoSage offer a model for building more controllable and transparent AI systems. The ability to isolate and inspect individual reasoning stages, together with the capacity to evaluate tool reliability, can help meet stringent data compliance and security requirements. AI-RADAR emphasizes that choosing frameworks that prioritize robustness and verifiability is crucial for the success of AI projects in environments with strict constraints.