LiveMedBench: A New Standard for Evaluating Medical LLMs

Rigorous evaluation of Large Language Models (LLMs) is essential, especially in high-stakes clinical settings. Current medical benchmarks suffer from significant limitations, including data contamination and an inability to keep pace with the rapid evolution of medical knowledge.

LiveMedBench addresses these challenges through:

  • Continuous updates: Weekly collection of real-world clinical cases from online medical communities.
  • Contamination resistance: Strict temporal separation between model training data and test data.
  • Criteria-based evaluation: An automated framework that decomposes each response into granular, case-specific criteria, aligning more closely with the judgment of expert physicians (see the sketch after this list).
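
The criteria-based scoring could work roughly as follows. This is a minimal sketch: the `Criterion` structure, the keyword-matching stand-in for an LLM judge, and all field names are assumptions for illustration, not LiveMedBench's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One granular, case-specific check (hypothetical structure)."""
    description: str   # e.g. "Considers pulmonary embolism"
    keyword: str       # toy proxy for a real judge's verdict
    weight: float = 1.0

def meets(response: str, criterion: Criterion) -> bool:
    """Stand-in judge: keyword containment. An automated framework like
    LiveMedBench's would presumably use an LLM judge per criterion."""
    return criterion.keyword.lower() in response.lower()

def score_response(response: str, criteria: list[Criterion]) -> float:
    """Weighted fraction of case-specific criteria the response meets."""
    total = sum(c.weight for c in criteria)
    met = sum(c.weight for c in criteria if meets(response, c))
    return met / total if total else 0.0

criteria = [
    Criterion("Considers pulmonary embolism", "embolism", weight=2.0),
    Criterion("Orders a D-dimer test", "d-dimer"),
]
print(score_response("I would order a D-dimer to assess for embolism.", criteria))
# -> 1.0
```

Granular, per-case criteria make partial credit possible: a response that covers the key differential but misses a follow-up test scores between 0 and 1 rather than being marked simply right or wrong.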

Architecture and Data

LiveMedBench comprises 2,756 real-world cases across 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. A multi-agent framework filters noise out of the raw data and validates each case's clinical integrity.
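
A pipeline of this shape can be sketched as a chain of rejection-capable stages. The stage names, `Case` fields, and rule-based heuristics below are placeholders for what are presumably LLM-driven agents in the actual framework.

```python
from typing import Callable, Optional

Case = dict
Stage = Callable[[Case], Optional[Case]]

def remove_noise(case: Case) -> Optional[Case]:
    """Toy stage: drop greetings, signatures, and other forum artifacts."""
    lines = [ln for ln in case.get("text", "").splitlines()
             if ln.strip() and not ln.lower().startswith(("thanks", "regards"))]
    return {**case, "text": "\n".join(lines)} if lines else None

def check_completeness(case: Case) -> Optional[Case]:
    """Toy stage: keep only cases that describe a clinical presentation."""
    text = case.get("text", "").lower()
    return case if ("presents" in text or "symptom" in text) else None

def run_pipeline(case: Case, stages: list[Stage]) -> Optional[Case]:
    """A case must survive every stage to enter the benchmark; any
    stage may reject it outright by returning None."""
    for stage in stages:
        nxt = stage(case)
        if nxt is None:
            return None
        case = nxt
    return case

raw = {"source": "forum", "text": "Patient presents with chest pain.\nThanks!"}
print(run_pipeline(raw, [remove_noise, check_completeness]))
# -> {'source': 'forum', 'text': 'Patient presents with chest pain.'}
```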

Performance and Analysis

An evaluation of 38 LLMs revealed that the best-performing model scores only 39.2%. 84% of models show a performance drop on cases published after their training cutoff, confirming the risks of data contamination in static benchmarks. Error analysis indicates that applying knowledge in clinical context, rather than recalling facts, is the main bottleneck.
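
To make the contamination signal concrete, one straightforward way to measure a pre/post-cutoff gap is to compare a model's mean score on either side of its training cutoff date. The `results` schema and ISO-date comparison below are hypothetical, not the benchmark's reported methodology.

```python
from statistics import mean

def cutoff_gap(results: list[dict], cutoff: str) -> float:
    """Mean score on pre-cutoff cases minus mean score on post-cutoff
    cases; a positive gap is consistent with the model having seen
    (or memorized) older cases during training."""
    pre = [r["score"] for r in results if r["case_date"] < cutoff]
    post = [r["score"] for r in results if r["case_date"] >= cutoff]
    if not pre or not post:
        return 0.0
    return mean(pre) - mean(post)

results = [
    {"case_date": "2024-01-15", "score": 0.45},
    {"case_date": "2024-03-02", "score": 0.41},
    {"case_date": "2025-02-10", "score": 0.33},
]
print(cutoff_gap(results, cutoff="2024-12-01"))  # ≈ 0.10 (pre-cutoff advantage)
```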