AI agents are gaining autonomy, but their adoption in enterprise environments carries significant risks. To mitigate these risks, researchers at Carnegie Mellon University and Fujitsu have developed three benchmarks to assess when AI agents are safe and effective enough to manage business operations without human oversight.

FieldWorkArena: safety in the field

The first benchmark, FieldWorkArena, evaluates AI agents employed in logistics and manufacturing environments, such as factories and warehouses. It measures the accuracy of agents in detecting safety rule violations and deviations from work procedures, as well as in generating incident reports. The tests use real-world data, including work manuals, safety regulations, and images/videos captured on-site. Faces and sensitive work areas are blurred to protect privacy.
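To make the kind of measurement concrete, here is a minimal sketch (in Python) of how detection accuracy on such a task could be scored against ground-truth annotations. The `Violation` schema, rule identifiers, and metric choice are hypothetical illustrations, not FieldWorkArena's actual data format or scoring protocol.

```python
# Hypothetical scoring sketch: compare an agent's predicted safety-rule
# violations against human-annotated ground truth for one video/image set.
from dataclasses import dataclass


@dataclass
class Violation:
    rule_id: str   # e.g. "no-helmet-in-zone-A" (illustrative identifier)
    frame: int     # video frame (or image index) where the violation occurs


def detection_scores(predicted: list[Violation], ground_truth: list[Violation]) -> dict:
    """Precision and recall over exact (rule_id, frame) matches."""
    pred = {(v.rule_id, v.frame) for v in predicted}
    truth = {(v.rule_id, v.frame) for v in ground_truth}
    true_pos = len(pred & truth)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {"precision": precision, "recall": recall}


# Example: the agent finds one annotated violation, misses another,
# and reports one spurious violation.
gt = [Violation("no-helmet-in-zone-A", 120), Violation("blocked-exit", 455)]
pred = [Violation("no-helmet-in-zone-A", 120), Violation("no-gloves", 300)]
print(detection_scores(pred, gt))  # {'precision': 0.5, 'recall': 0.5}
```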

Researchers evaluated three multimodal LLMs (Claude 3.7 Sonnet, Gemini 2.0 Flash, and GPT-4o) and found that, although they excelled in information extraction and image recognition, the models tended to hallucinate and had difficulty counting objects accurately and measuring specific distances.

ECHO and RAG: knowledge management

The other two benchmarks are ECHO (EvidenCe-prior Hallucination Observation) and an enterprise RAG (Retrieval-Augmented Generation) benchmark. ECHO evaluates the effectiveness of hallucination-mitigation strategies in vision-language models, while the RAG benchmark measures how well AI agents retrieve data from an authoritative knowledge base and use it to improve their responses. ECHO results indicate that techniques such as image cropping and reinforcement learning can reduce hallucinations.
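The retrieval-then-answer pattern the RAG benchmark targets can be sketched briefly. The snippet below is only illustrative: the knowledge base, the bag-of-words similarity (a stand-in for a real embedding model), and the prompt format are assumptions, not the benchmark's actual corpus or protocol.

```python
# Minimal RAG sketch: retrieve the most relevant passages from an authoritative
# knowledge base and prepend them to the model prompt as grounding context.
from collections import Counter
import math

KNOWLEDGE_BASE = [
    "Forklift operators must complete certification renewal every 24 months.",
    "Incident reports must be filed within 8 hours of a safety-rule violation.",
    "Zone A requires helmets and high-visibility vests at all times.",
]


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (placeholder for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k knowledge-base passages most similar to the query."""
    return sorted(KNOWLEDGE_BASE, key=lambda doc: bow_cosine(query, doc), reverse=True)[:k]


def build_prompt(query: str) -> str:
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}"


print(build_prompt("How soon must an incident report be filed?"))
```

A benchmark built around this pattern can then check whether the agent's final answer is actually supported by the retrieved passages rather than by the model's prior knowledge.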

Fujitsu plans to expand the benchmarks to cover other industries and use cases, updating them continuously to keep pace with the evolution of AI agents. For readers evaluating on-premise deployments, AI-RADAR analyzes the relevant trade-offs in detail in the /llm-onpremise section.