MedArena: Evaluating Medical LLMs in the Real World
A new study introduces MedArena, a platform designed to evaluate the performance of large language models (LLMs) in real-world clinical settings. Unlike traditional benchmarks, which often rely on static and predefined datasets, MedArena allows clinicians to directly test and compare different LLMs using their own medical queries.
Methodology and Results
The platform presents clinicians with responses from two randomly selected models and asks them to indicate which response they prefer. Based on 1,571 preferences collected up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o ranked at the top. Notably, only one-third of the questions posed by clinicians involved factual recall; the majority concerned topics such as treatment selection, clinical documentation, and patient communication.
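The article does not specify how MedArena converts pairwise preferences into a ranking. Arena-style leaderboards commonly fit a Bradley-Terry model, which assigns each model a latent strength such that the probability one model beats another is a logistic function of their strength difference. A minimal sketch under that assumption (model names and preference data below are purely illustrative):

```python
import numpy as np

def bradley_terry(models, prefs, iters=2000, lr=0.1):
    """Fit Bradley-Terry strengths from pairwise preferences.

    prefs: list of (winner, loser) model-name pairs.
    Returns a dict mapping model name -> strength (log scale).
    """
    idx = {m: i for i, m in enumerate(models)}
    theta = np.zeros(len(models))
    w = np.array([idx[a] for a, b in prefs])  # winner indices
    l = np.array([idx[b] for a, b in prefs])  # loser indices
    for _ in range(iters):
        # P(winner beats loser) under current strengths
        p = 1.0 / (1.0 + np.exp(-(theta[w] - theta[l])))
        grad = np.zeros_like(theta)
        np.add.at(grad, w, 1.0 - p)      # push winners up
        np.add.at(grad, l, -(1.0 - p))   # push losers down
        theta += lr * grad / len(prefs)
        theta -= theta.mean()  # pin the scale (strengths are only relative)
    return {m: theta[idx[m]] for m in models}

# Hypothetical preference data: (preferred model, other model)
prefs = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C"), ("C", "B")]
scores = bradley_terry(["A", "B", "C"], prefs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The centering step matters: only differences in strength are identifiable from pairwise data, so the scores are anchored to mean zero.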
Importance of Clarity and Detail
Clinicians indicated that depth, detail, and clarity of presentation weighed more heavily in their preferences than factual accuracy alone. This underscores the importance of readability and clinical nuance in medical applications of LLMs. The study also found that model rankings remained stable even after controlling for stylistic factors such as response length and formatting.
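The article does not detail how the stylistic controls were implemented. A common approach on other arena leaderboards is to add style covariates, such as the difference in response length, to the preference model, so that estimated model strengths reflect preference net of style. A hypothetical sketch with made-up data (model indices, outcomes, and length differences are all illustrative):

```python
import numpy as np

# Each preference: winner/loser model indices plus a style covariate,
# here the normalized length difference (winner_len - loser_len).
n_models = 3
wins     = np.array([0, 0, 1, 0, 2])              # winner model index
losses   = np.array([1, 1, 2, 2, 1])              # loser model index
len_diff = np.array([0.8, 0.5, -0.2, 0.3, 0.1])  # style covariate

theta = np.zeros(n_models)  # style-adjusted model strengths
beta = 0.0                  # weight on the length covariate
lr = 0.1
for _ in range(3000):
    # Logistic preference model with a style term
    logits = theta[wins] - theta[losses] + beta * len_diff
    p = 1.0 / (1.0 + np.exp(-logits))
    resid = 1.0 - p
    grad_t = np.zeros(n_models)
    np.add.at(grad_t, wins, resid)
    np.add.at(grad_t, losses, -resid)
    theta += lr * grad_t / len(wins)
    beta += lr * (resid * len_diff).mean()
    theta -= theta.mean()  # pin the scale
# theta now holds strengths net of length; beta captures how much longer
# responses were preferred regardless of which model produced them.
```

If the ranking induced by `theta` matches the uncontrolled ranking, length preference alone does not explain the leaderboard, which is the kind of stability check the study reports.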