DeepSearchQA: Evaluating the Deep Research Capabilities of Agents

DeepSearchQA is a 900-prompt benchmark designed to evaluate agent performance on complex, multi-step information-seeking tasks across 17 different fields. The dataset stands apart from traditional benchmarks, which typically focus on retrieving a single answer or verifying individual factual claims.

DeepSearchQA aims to evaluate three fundamental capabilities: systematically collating fragmented information from disparate sources, de-duplicating and resolving entities to ensure precision, and reasoning about stopping criteria in an open search space. Each task is structured as a causal chain: discovering the information for one step depends on completing the previous one, which emphasizes long-horizon planning and context retention.
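The de-duplication and entity-resolution requirement can be illustrated with a minimal sketch. This is not the benchmark's actual scoring code; the function names and the crude normalization rule are assumptions chosen for illustration.

```python
def normalize(entity: str) -> str:
    """Crude entity resolution: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in entity.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def collate(raw_answers: list[str]) -> set[str]:
    """Collapse duplicate surface forms into one canonical answer set."""
    return {normalize(a) for a in raw_answers}

# "Marie Curie" and "marie curie." collapse to one entity;
# the reordered form "Curie, Marie" survives this naive rule.
print(collate(["Marie Curie", "marie curie.", "Curie, Marie"]))
```

A real agent would need far stronger resolution (aliases, reordered names, transliterations), which is precisely what makes the precision requirement hard.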

The evaluation of state-of-the-art agents revealed significant limitations, chiefly difficulty in balancing high recall with precision. Observed failure modes include premature stopping of the search and hedging, where agents emit a wide range of low-confidence answers to artificially inflate recall. These results suggest there is still substantial room to improve research-agent architectures.
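Why hedging inflates recall at the cost of precision can be seen with set-based scoring. This is a sketch under the assumption that answers are scored as sets against a gold checklist, not the paper's actual metric.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision and recall against a gold answer checklist."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

gold = {"a", "b", "c", "d"}

# Premature stopping: correct but incomplete -> high precision, low recall.
focused = {"a", "b"}

# Hedging: pad the list with low-confidence guesses -> recall rises,
# precision collapses.
hedged = {"a", "b", "c"} | {f"guess{i}" for i in range(17)}

print(precision_recall(focused, gold))  # (1.0, 0.5)
print(precision_recall(hedged, gold))   # (0.15, 0.75)
```

The two failure modes sit at opposite ends of this trade-off, which is why the benchmark stresses reasoning about when to stop searching.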