AIDABench: Evaluating AI in Complex Data Analysis
As AI-driven document understanding and processing tools become increasingly prevalent, the need for rigorous evaluation standards is emerging. Existing benchmarks often focus on isolated capabilities, failing to capture the end-to-end effectiveness required in real-world settings.
To address this gap, AIDABench was introduced: a comprehensive benchmark for evaluating AI systems on complex data analytics tasks. It comprises over 600 diverse tasks across three core areas:
- Question answering
- Data visualization
- File generation
These tasks are grounded in realistic scenarios involving heterogeneous data, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. The tasks are demanding: even human experts assisted by AI tools need an average of 1-2 hours to answer a single question.
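To make the task format more concrete, here is a minimal sketch of what a single benchmark item might look like. The class, field names, and example values are hypothetical illustrations for this article, not AIDABench's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical task record; field names are illustrative only,
# not taken from the AIDABench repository.
@dataclass
class AnalyticsTask:
    task_id: str
    category: str                 # "question_answering" | "data_visualization" | "file_generation"
    industry: str                 # e.g. "finance", "retail", "logistics"
    input_files: list[str] = field(default_factory=list)  # spreadsheets, databases, reports
    question: str = ""
    expected_output: str = ""     # answer text, chart spec, or reference to a generated file

# Illustrative example of a question-answering task over mixed sources
example = AnalyticsTask(
    task_id="qa-0001",
    category="question_answering",
    industry="finance",
    input_files=["q3_income_statement.xlsx", "operations.sqlite"],
    question="Which product line drove the quarter-over-quarter margin change?",
)
```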
Performance of Current Models
Eleven state-of-the-art models were evaluated on AIDABench, including proprietary models (such as Claude Sonnet 4.5 and Gemini 3 Pro Preview) and open-source models (such as Qwen3-Max-2026-01-23-Thinking). The results show that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass@1.
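For a single-attempt benchmark, pass@1 reduces to the fraction of tasks whose first generated answer is judged correct. The sketch below illustrates that computation on hypothetical per-task results; the function and its inputs are illustrative and not part of the AIDABench codebase, whose exact grading pipeline may differ.

```python
from typing import Iterable

def pass_at_1(results: Iterable[bool]) -> float:
    """Fraction of tasks solved on the first attempt.

    `results` holds one boolean per task: True if the model's first
    answer was judged correct. Names here are illustrative, not the
    benchmark's actual API.
    """
    results = list(results)
    if not results:
        return 0.0
    return sum(results) / len(results)

# Hypothetical example: 3 of 5 tasks solved on the first try -> 60.00%
print(f"{pass_at_1([True, False, True, True, False]) * 100:.2f}%")
```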
A detailed analysis of failure modes highlights key challenges for future research. AIDABench is intended as a reference for companies that need to choose tools, optimize models, and evaluate vendor deliverables. The benchmark is publicly available at https://github.com/MichaelYang-lyx/AIDABench.
For those evaluating on-premise deployments, additional trade-offs come into play; AI-RADAR's analytical frameworks at /llm-onpremise cover these aspects.