ResearchGym: Evaluating AI in Scientific Research
ResearchGym, a new benchmark environment designed to evaluate the performance of artificial intelligence agents on scientific research tasks, has been presented. The benchmark is built from five prominent papers published at ICML, ICLR, and ACL, reusing their datasets, evaluation environments, and baseline implementations.
The goal is to provide a controlled environment in which AI agents can formulate hypotheses, run experiments, and attempt to improve on the results obtained by the human researchers. Each environment is containerized, and the benchmark comprises a total of 39 sub-tasks.
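To make the hypothesis-experiment loop concrete, here is a minimal Python sketch of how such a harness could score an agent against an author-provided baseline and track sub-task completion. All names used here (`ResearchTask`, `run_agent_episode`, the agent's `propose`/`run_experiment` methods and its `budget` attribute) are hypothetical illustrations for this article, not ResearchGym's actual interface.

```python
# Hypothetical sketch of a closed-loop research-agent evaluation harness.
# None of these names are taken from ResearchGym; they only illustrate the idea.
from dataclasses import dataclass, field


@dataclass
class ResearchTask:
    """A containerized task derived from a published paper."""
    name: str
    baseline_score: float                       # score reported by the original authors
    subtasks: list[str] = field(default_factory=list)


def run_agent_episode(task: ResearchTask, agent) -> dict:
    """Let the agent iterate hypothesis -> experiment -> result inside the task container."""
    completed: set[str] = set()
    best_score = float("-inf")
    for _ in range(agent.budget):                  # bounded time/compute budget
        hypothesis = agent.propose(task)           # agent formulates a hypothesis
        result = agent.run_experiment(hypothesis)  # runs code inside the container
        best_score = max(best_score, result.score)
        completed |= result.subtasks_completed
    return {
        "beats_baseline": best_score > task.baseline_score,
        "relative_improvement": (best_score - task.baseline_score)
        / max(abs(task.baseline_score), 1e-9),
        "subtask_completion": len(completed) / max(len(task.subtasks), 1),
    }
```

Aggregating these per-task dictionaries across the benchmark would yield exactly the kind of statistics reported below: how often the baseline is beaten, by how much, and what fraction of sub-tasks is completed.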
Results and Current Limitations
A controlled evaluation of a GPT-5-based agent revealed a significant gap between theoretical capability and practical reliability. The agent managed to improve on the provided baselines in only 6.7% of cases (1 out of 15), with an average improvement of 11.5%, and it completed on average only 26.5% of the sub-tasks.
Several recurring problems were identified, including impatience, inefficient management of time and resources, excessive confidence in weak hypotheses, difficulty coordinating parallel experiments, and limitations due to context length. Despite these limitations, in a single case the agent managed to surpass the published solution to an ICML 2025 task, demonstrating that the most advanced agents can occasionally achieve state-of-the-art performance, albeit unreliably.
Additional evaluations of proprietary agents such as Claude Code (Opus-4.5) and Codex (GPT-5.2) showed a similar gap between capability and reliability. ResearchGym is proposed as an infrastructure for the systematic evaluation and analysis of autonomous agents in closed-loop research.