Evaluating and Optimizing AI Shopping Assistants

A recent study published on arXiv proposes a blueprint for the continuous evaluation and optimization of conversational shopping assistants (CSAs), with a specific focus on grocery shopping. Because these systems are built from multiple cooperating agents, they pose distinct challenges: complex multi-turn interactions are hard to evaluate, and performance is hard to optimize end to end.

The research introduces a multi-faceted evaluation rubric that assesses end-to-end shopping quality by breaking it down into structured dimensions. The authors also develop an LLM-based evaluation system, calibrated against human annotations, to automate and scale the evaluation process.

Prompt Optimization Strategies

The study explores two complementary prompt optimization strategies built on GEPA, a state-of-the-art prompt optimizer: Sub-agent GEPA, which optimizes individual agents against localized rubrics, and MAMuT (Multi-Agent Multi-Turn) GEPA, a system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring.
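The following sketch illustrates the general shape of a GEPA-style loop: candidate prompts are rewritten by a "reflection" step and retained on a Pareto frontier over per-task scores. The `mutate` and `score` functions are stand-ins, not GEPA's actual implementation (the real optimizer uses an LLM to rewrite prompts based on execution traces and rubric feedback).

```python
import random

def score(prompt: str, tasks: list[str]) -> list[float]:
    """Per-task scores in [0, 1]; stubbed so the sketch runs offline."""
    return [min(1.0, len(prompt) / (10 * (i + 1))) for i, _ in enumerate(tasks)]

def mutate(prompt: str) -> str:
    """Stand-in for LLM-driven reflective rewriting of the prompt."""
    return prompt + " Be concise and cite the catalog."

def pareto_dominates(a: list[float], b: list[float]) -> bool:
    """True if a is at least as good on every task and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def optimize(seed: str, tasks: list[str], steps: int = 5) -> str:
    """Evolve prompts, keeping only Pareto-optimal candidates."""
    frontier = [(seed, score(seed, tasks))]
    for _ in range(steps):
        parent, _ = random.choice(frontier)
        child = mutate(parent)
        child_scores = score(child, tasks)
        # Keep the child only if no existing candidate dominates it.
        if not any(pareto_dominates(s, child_scores) for _, s in frontier):
            frontier = [(p, s) for p, s in frontier
                        if not pareto_dominates(child_scores, s)]
            frontier.append((child, child_scores))
    # Return the surviving candidate with the best average score.
    return max(frontier, key=lambda ps: sum(ps[1]))[0]
```

In the paper's terms, Sub-agent GEPA would run a loop like this per agent with a localized rubric as the scoring function, while MAMuT GEPA would score whole multi-turn trajectories and optimize the agents' prompts jointly.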

Evaluation templates and design guidance have been released to support practitioners in developing production CSAs.