STEMVerse: Analyzing Scientific Reasoning in LLMs

Evaluating reasoning capabilities in science, technology, engineering, and mathematics (STEM) has become crucial for measuring the intelligence of machines, especially large language models (LLMs). However, current benchmarks often provide only aggregate scores, limiting the ability to diagnose the causes of errors.

To address this limitation, STEMVerse was proposed: a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. STEMVerse characterizes model performance along two axes, academic specialization and cognitive complexity, creating a detailed map of the capabilities that STEM reasoning requires.
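
To make the dual-axis labeling concrete, here is a minimal sketch, in Python, of how an instance tagged on both axes might be represented. The specific disciplines, cognition levels, and field names are illustrative assumptions, not the taxonomy used by STEMVerse itself.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical axis values; STEMVerse's actual taxonomy may differ.
class Discipline(Enum):
    MATHEMATICS = "mathematics"
    PHYSICS = "physics"
    CHEMISTRY = "chemistry"
    BIOLOGY = "biology"
    ENGINEERING = "engineering"

class Cognition(Enum):
    RECALL = "knowledge recall"
    APPLICATION = "concept application"
    MULTI_STEP = "multi-step reasoning"

@dataclass
class LabeledProblem:
    """One benchmark instance carrying a dual-axis diagnostic label."""
    problem_id: str
    source_benchmark: str   # benchmark the problem was drawn from
    question: str
    answer: str
    discipline: Discipline  # academic-specialization axis
    cognition: Cognition    # cognitive-complexity axis
```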

The framework re-aggregates over 20,000 STEM problems from established benchmarks into a unified "Discipline × Cognition" space, assigning dual-axis labels to each instance. This approach allows for the systematic evaluation of different LLM families, revealing structural error patterns in STEM reasoning. By integrating multi-disciplinary coverage and precise cognitive stratification, STEMVerse offers a clear and useful perspective for understanding the scientific reasoning characteristics of LLMs.
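
To illustrate what this re-aggregation enables downstream, the following sketch computes per-cell accuracy over the Discipline × Cognition grid, assuming the hypothetical `LabeledProblem` structure above; it illustrates the idea rather than reproducing STEMVerse's implementation.

```python
from collections import defaultdict

def cell_accuracy(results):
    """Aggregate per-instance correctness into a Discipline x Cognition grid.

    `results` is an iterable of (LabeledProblem, is_correct) pairs, e.g. the
    outcome of running one LLM over the unified problem pool.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for problem, is_correct in results:
        cell = (problem.discipline, problem.cognition)
        totals[cell] += 1
        correct[cell] += int(is_correct)
    # Per-cell accuracy surfaces structural error patterns that a single
    # aggregate score would hide.
    return {cell: correct[cell] / totals[cell] for cell in totals}
```

A low-accuracy cell, say (chemistry, multi-step reasoning), would localize a weakness that an aggregate benchmark score could not.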