ResearchGym: Evaluating AI in Scientific Research
ResearchGym, a new benchmark environment designed to evaluate the performance of artificial intelligence agents on scientific research tasks, has been presented. The benchmark is built from five prominent papers published at ICML, ICLR, and ACL, reusing their datasets, evaluation environments, and baseline implementations.
The goal is to provide a controlled environment in which AI agents can formulate hypotheses, run experiments, and attempt to improve on the results obtained by the human researchers. Each environment is containerized, and the benchmark comprises a total of 39 sub-tasks.
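To make the hypothesis-experiment loop concrete, here is a minimal Python sketch of how such a harness could score an agent against an author-provided baseline and track sub-task completion. All names used here (`ResearchTask`, `run_agent_episode`, the agent's `propose`/`run_experiment` methods and its `budget` attribute) are hypothetical illustrations for this article, not ResearchGym's actual interface.

```python
# Hypothetical sketch of a closed-loop research-agent evaluation harness.
# None of these names are taken from ResearchGym; they only illustrate the idea.
from dataclasses import dataclass, field


@dataclass
class ResearchTask:
    """A containerized task derived from a published paper."""
    name: str
    baseline_score: float                       # score reported by the original authors
    subtasks: list[str] = field(default_factory=list)


def run_agent_episode(task: ResearchTask, agent) -> dict:
    """Let the agent iterate hypothesis -> experiment -> result inside the task container."""
    completed: set[str] = set()
    best_score = float("-inf")
    for _ in range(agent.budget):                  # bounded time/compute budget
        hypothesis = agent.propose(task)           # agent formulates a hypothesis
        result = agent.run_experiment(hypothesis)  # runs code inside the container
        best_score = max(best_score, result.score)
        completed |= result.subtasks_completed
    return {
        "beats_baseline": best_score > task.baseline_score,
        "relative_improvement": (best_score - task.baseline_score)
        / max(abs(task.baseline_score), 1e-9),
        "subtask_completion": len(completed) / max(len(task.subtasks), 1),
    }
```

Aggregating these per-task dictionaries across the benchmark would yield exactly the kind of statistics reported below: how often the baseline is beaten, by how much, and what fraction of sub-tasks is completed.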
Results and Current Limitations
A controlled evaluation of a GPT-5-based agent revealed a significant gap between theoretical capability and practical reliability. The agent managed to improve on the provided baselines in only 6.7% of cases (1 out of 15), with an average improvement of 11.5%, and it completed on average only 26.5% of the sub-tasks.
Several recurring problems were identified, including impatience, inefficient management of time and resources, excessive confidence in weak hypotheses, difficulty coordinating parallel experiments, and limitations due to context length. Despite these limitations, in a single case the agent managed to surpass the published solution to an ICML 2025 task, demonstrating that the most advanced agents can occasionally achieve state-of-the-art performance, albeit unreliably.
Additional evaluations of proprietary agents such as Claude Code (Opus-4.5) and Codex (GPT-5.2) showed a similar gap between capability and reliability. ResearchGym is proposed as an infrastructure for the systematic evaluation and analysis of autonomous agents in closed-loop research.