Microsoft has released AgentRx, an open-source framework designed to simplify the debugging of AI agents. The goal is to address the increasing complexity of these systems, which often operate over extended time horizons, are probabilistic, and involve multiple agents, making it difficult to pinpoint the root cause of an error.

How AgentRx works

AgentRx normalizes execution logs, synthesizes executable constraints based on tool schemas and domain policies, and evaluates these constraints step by step. The system generates an auditable validation log and uses a large language model (LLM) to identify the critical failure step, i.e., the first unrecoverable step in the agent's trajectory.
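The pipeline described above can be sketched as follows. This is a minimal illustration, not the actual AgentRx API: every name, type, and the toy refund constraint are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """One normalized entry from an agent's execution log (hypothetical shape)."""
    index: int
    tool: str
    args: dict

# A "constraint" here is a predicate over a single step, e.g. one derived
# from a tool schema or a domain policy such as "refund amounts must be
# non-negative".
Constraint = Callable[[Step], bool]

def first_violation(trajectory: list[Step], constraints: dict[str, Constraint]):
    """Evaluate every constraint at every step, building an auditable
    validation log, and return the earliest failing step -- a stand-in
    for the critical failure step that AgentRx asks an LLM to pinpoint."""
    log = []
    critical = None
    for step in trajectory:
        for name, check in constraints.items():
            ok = check(step)
            log.append((step.index, name, ok))
            if not ok and critical is None:
                critical = step.index
    return log, critical

# Toy usage: a two-step trajectory where the refund policy is violated.
trajectory = [
    Step(0, "lookup_order", {"order_id": "A1"}),
    Step(1, "issue_refund", {"amount": -50}),
]
constraints = {
    "refund_amount_nonnegative":
        lambda s: s.tool != "issue_refund" or s.args["amount"] >= 0,
}
log, critical = first_violation(trajectory, constraints)
print(critical)  # 1 -- the first unrecoverable step
```

In the real system, the final attribution step is performed by an LLM over the validation log rather than by a simple first-failure rule, but the step-by-step constraint evaluation is the mechanical core.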

Benchmark and taxonomy

Along with the framework, Microsoft has released the AgentRx Benchmark, a dataset of 115 manually annotated failed execution trajectories drawn from several domains, including τ-bench, Flash, and Magentic-One. A taxonomy of nine error categories has also been defined to help developers distinguish between different types of failure, such as failing to adhere to a plan or hallucinating new information.
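A benchmark annotation along these lines could look like the sketch below. Only the two failure categories named in the article are listed; the category identifiers, the record format, and the trajectory ID are all hypothetical.

```python
from enum import Enum

class FailureCategory(Enum):
    """Illustrative subset of a nine-category error taxonomy."""
    PLAN_DEVIATION = "failure to adhere to a plan"
    FABRICATION = "invention of new information"
    # ...the remaining seven categories are not named in the article

def annotate(trajectory_id: str, critical_step: int,
             category: FailureCategory) -> dict:
    """Produce one benchmark-style annotation record: which trajectory
    failed, where it became unrecoverable, and why."""
    return {
        "trajectory": trajectory_id,
        "critical_step": critical_step,
        "category": category.value,
    }

# Hypothetical usage with an invented trajectory ID.
record = annotate("tau-bench-017", 3, FailureCategory.FABRICATION)
print(record["category"])  # "invention of new information"
```

Records of this shape would let a debugging tool's output (the predicted critical step and category) be scored directly against the human annotation.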

Results

In Microsoft's tests, AgentRx significantly improved accuracy both in identifying errors (+23.6%) and in attributing the root cause (+22.9%) compared with traditional prompt-based methods. This lets developers replace trial-and-error debugging with a more systematic engineering methodology.
