LLM agents are increasingly given powerful tools, but that capability carries a real risk: when an agent’s proposed action deviates from the user’s intent (misalignment), the consequences can be harmful and hard to undo. Current guardrails built on the “LLM-as-a-judge” paradigm lack a systematic framework for reasoning about alignment, and their judgments are often inconsistent and difficult to audit. A new study proposes a different approach: ProvenanceGuard, a multi-stage pipeline that analyzes the provenance of each tool call to decide whether to allow its execution.

At the heart of the approach is a simple yet robust idea from provenance analysis: every action must be supported by traceable evidence within the agent’s context. ProvenanceGuard distinguishes three types of misalignment and checks for them in sequence: first, that the selected tool matches the requested action; then, that the parameters are consistent with the user’s intent; and finally, that the call introduces no semantic deviations.
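To make the staged idea concrete, here is a minimal Python sketch of a provenance-style gate. It is not the study’s implementation: the `ToolCall`, `Context`, `Verdict`, and `guard` names, the keyword table `TOOL_TRIGGERS`, and the verbatim substring matching used to trace argument values are all hypothetical simplifications. The semantic stage is left as a pluggable hook, on the assumption that a real check would need something richer than string matching.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    tool: str
    args: dict[str, Any]


@dataclass
class Context:
    user_request: str
    tool_outputs: list[str] = field(default_factory=list)


@dataclass
class Verdict:
    allow: bool
    stage: str    # stage that produced the decision
    reason: str
    evidence: list[str] = field(default_factory=list)  # trail for auditors


# Hypothetical: request keywords that justify selecting each tool.
TOOL_TRIGGERS: dict[str, list[str]] = {
    "send_email": ["email", "mail"],
    "delete_file": ["delete", "remove"],
}

SemanticCheck = Callable[[ToolCall, Context], Optional[str]]


def check_tool(call: ToolCall, ctx: Context) -> Optional[str]:
    """Stage 1 (toy rule): is the chosen tool supported by the user's request?"""
    request = ctx.user_request.lower()
    hits = [kw for kw in TOOL_TRIGGERS.get(call.tool, []) if kw in request]
    return f"tool '{call.tool}' justified by {hits}" if hits else None


def trace_args(call: ToolCall, ctx: Context) -> tuple[list[str], list[str]]:
    """Stage 2 (toy rule): each argument value must appear verbatim in the context."""
    sources = {"user_request": ctx.user_request}
    sources.update({f"tool_output[{i}]": out for i, out in enumerate(ctx.tool_outputs)})
    evidence: list[str] = []
    missing: list[str] = []
    for name, value in call.args.items():
        origin = next((src for src, text in sources.items() if str(value) in text), None)
        if origin is None:
            missing.append(name)
        else:
            evidence.append(f"arg '{name}'={value!r} found in {origin}")
    return evidence, missing


def no_semantic_check(call: ToolCall, ctx: Context) -> Optional[str]:
    """Stage 3 placeholder: always passes; a real system would compare effect to intent."""
    return None


def guard(call: ToolCall, ctx: Context,
          semantic_check: SemanticCheck = no_semantic_check) -> Verdict:
    """Run the stages in order; the first failing stage blocks the call."""
    tool_evidence = check_tool(call, ctx)
    if tool_evidence is None:
        return Verdict(False, "tool_selection", f"no evidence for tool '{call.tool}'")
    arg_evidence, missing = trace_args(call, ctx)
    evidence = [tool_evidence] + arg_evidence
    if missing:
        return Verdict(False, "parameters", f"untraceable args: {missing}", evidence)
    problem = semantic_check(call, ctx)
    if problem is not None:
        return Verdict(False, "semantics", problem, evidence)
    return Verdict(True, "all", "every element traced to context", evidence)


ctx = Context("Please email the report to alice@example.com")
print(guard(ToolCall("send_email", {"to": "alice@example.com"}), ctx).allow)  # True
print(guard(ToolCall("send_email", {"to": "eve@attacker.test"}), ctx).stage)  # parameters
print(guard(ToolCall("delete_file", {"path": "report"}), ctx).stage)  # tool_selection
```

Returning the evidence list with every verdict, rather than a bare yes/no, is what makes each decision inspectable after the fact; the crude matching rules are only stand-ins for whatever evidence checks a production system would use.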

The results are substantial. On the Agent-SafetyBench and WorkBench benchmarks, across 10 backbone LLMs, the error rate on misaligned traces drops from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench. Meanwhile, interventions on successful traces (false positives that block legitimate operations) fall from 30.5% to 12.8%. On cases that were already aligned, the increase in unwarranted blocks is statistically negligible. In practice, the guardrail stops misaligned actions more reliably while getting in the way of legitimate ones less often.
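To put these figures in perspective, the short Python snippet below converts each before/after pair into an absolute drop (in percentage points) and a relative reduction. It only performs arithmetic on the numbers quoted above; the labels are ours, and the relative figures are derived here rather than reported by the study.

```python
# Figures quoted in the article, as (before, after) percentages.
REPORTED = {
    "misaligned traces, Agent-SafetyBench": (42.9, 1.8),
    "misaligned traces, WorkBench": (32.1, 17.3),
    "interventions on successful traces": (30.5, 12.8),
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original rate that was eliminated."""
    return (before - after) / before


for label, (before, after) in REPORTED.items():
    points = before - after
    print(f"{label}: -{points:.1f} points, {relative_reduction(before, after):.1%} relative")
```

The relative reductions come out to about 95.8%, 46.1%, and 58.0% respectively, which shows why the Agent-SafetyBench result stands out while the WorkBench gain, though real, is more modest.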

For those managing on-premise infrastructure, such an approach is doubly relevant. In self-hosted environments where data stays under one’s own control, you cannot rely on external services to judge alignment: every block must be justified and, crucially, auditable. Provenance analysis offers exactly that: a transparent decision chain based on contextual evidence, open to inspection by operations and compliance teams. In regulated industries like finance or healthcare, auditability is not optional but a requirement, and frameworks like this close the gap between powerful automation and demanding governance.
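As an illustration of what an auditable decision chain might look like on disk, the sketch below serializes one guardrail decision, together with the contextual evidence behind it, as a JSON Lines record that operations or compliance teams could inspect later. The `AuditRecord` fields and the `append_record` helper are hypothetical; the study does not prescribe a log format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """Hypothetical audit entry for one guardrail decision."""
    tool: str
    args: dict
    decision: str   # "allow" or "block"
    stage: str      # stage that produced the decision
    reason: str
    evidence: list = field(default_factory=list)  # context snippets supporting it
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_jsonl(self) -> str:
        """One self-contained JSON object per line, with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


def append_record(path: str, record: AuditRecord) -> None:
    """Append-only write, so earlier decisions are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(record.to_jsonl() + "\n")


record = AuditRecord(
    tool="send_email",
    args={"to": "eve@attacker.test"},
    decision="block",
    stage="parameters",
    reason="argument 'to' not found in user request or tool outputs",
    evidence=["tool 'send_email' justified by ['email']"],
)
print(record.to_jsonl())
```

JSON Lines keeps each decision on its own line, so records can be appended without rewriting the file and reviewed or loaded one at a time by whoever audits the system.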

The gain is not only a matter of safety: the improvement in these metrics suggests that a provenance-based guardrail can also reduce operational burden by avoiding overzealous interventions. In a self-hosted deployment, where each unjustified block turns into a support ticket and lost productivity, cutting interventions on legitimate operations from 30.5% to 12.8% is a tangible benefit.

The research, evaluated on two standard benchmarks, offers not a turnkey product but a conceptual framework and a reference implementation. The message for on-premise solution architects is clear: securing LLM agents increasingly depends on structured, traceable mechanisms rather than simple black-box judges.