## MedPI: Evaluating AI in Medical Interactions

MedPI is a new benchmark for evaluating the capabilities of large language models (LLMs) in patient-clinician conversations. It stands apart from traditional question-and-answer (QA) benchmarks by assessing medical dialogue more comprehensively: MedPI analyzes interactions across 105 dimensions covering the full medical process, treatment safety, outcomes, and the quality of doctor-patient communication. The evaluation framework is aligned with the competencies of the Accreditation Council for Graduate Medical Education (ACGME).

## Architecture and Components of MedPI

MedPI is structured in five main layers (illustrative sketches of the key components follow at the end of this section):

1. **Patient Packets:** Synthetic, electronic-health-record-like (EHR) data that serve as the ground truth.
2. **AI Patient:** An LLM-driven virtual patient equipped with memory and the ability to simulate emotions.
3. **Task Matrix:** A matrix of tasks that crosses reasons for consultation (anxiety, pregnancy, wellness checkup) with specific objectives (diagnosis, lifestyle advice, medication advice).
4. **Evaluation Framework:** A rubric of 105 dimensions, each scored on a scale from 1 to 4 and mapped to ACGME competencies.
5. **AI Judges:** Calibrated LLMs that act as judges, producing scores, flags, and evidence-linked rationales.

## Initial Results and Implications

Nine flagship models were evaluated: Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70B Instruct, GPT-5, GPT-OSS 120B, o3, and Grok-4, across 366 virtual patients and over 7,000 conversations. The results showed generally low performance in several areas, most notably in formulating differential diagnoses. This work can help guide the future use of LLMs for diagnosis and treatment recommendations by highlighting the areas that most need improvement.
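
To make the first two layers concrete, here is a minimal Python sketch of a patient packet and an LLM-backed virtual patient with memory and a coarse emotional state. All field names, the `PatientPacket` structure, and the `call_llm` hook are assumptions for illustration; the source does not specify MedPI's actual schema or prompting strategy.

```python
from dataclasses import dataclass, field

@dataclass
class PatientPacket:
    """Synthetic, EHR-like ground truth (fields are illustrative)."""
    demographics: dict
    history: list[str]
    medications: list[str]
    ground_truth_diagnosis: str

class AIPatient:
    """Sketch of an LLM-driven virtual patient that keeps conversation
    memory and simulates an emotional state. `call_llm` stands in for
    whatever chat-completion client is actually used."""

    def __init__(self, packet: PatientPacket, call_llm):
        self.packet = packet
        self.call_llm = call_llm
        self.memory: list[dict] = []   # full dialogue history
        self.emotion = "neutral"       # e.g. "anxious", "reassured"

    def respond(self, clinician_utterance: str) -> str:
        # Record the clinician turn, then answer in character,
        # conditioning on the packet, emotion, and prior turns.
        self.memory.append({"role": "clinician", "text": clinician_utterance})
        prompt = (
            f"You are a patient with this record: {self.packet}.\n"
            f"Current emotional state: {self.emotion}.\n"
            f"Conversation so far: {self.memory}.\n"
            "Reply in character as the patient."
        )
        reply = self.call_llm(prompt)
        self.memory.append({"role": "patient", "text": reply})
        return reply
```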
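
The task matrix (layer 3) is described as a cross of consultation reasons with objectives. A plausible reading is a simple Cartesian product, sketched below; the specific reasons and objectives are the three examples given in the text, and the real matrix is presumably larger.

```python
from itertools import product

# Example values taken from the text; MedPI's actual axes are larger.
REASONS = ["anxiety", "pregnancy", "wellness checkup"]
OBJECTIVES = ["diagnosis", "lifestyle advice", "medication advice"]

def build_task_matrix(reasons: list[str], objectives: list[str]) -> list[dict]:
    """Cross every reason for consultation with every objective,
    yielding one conversational task per (reason, objective) pair."""
    return [
        {"reason": r, "objective": o, "task_id": f"{r}:{o}"}
        for r, o in product(reasons, objectives)
    ]

tasks = build_task_matrix(REASONS, OBJECTIVES)
print(len(tasks))  # 3 x 3 = 9 tasks in this toy example
```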
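
Finally, layers 4 and 5 can be pictured as a rubric of dimensions plus structured judge outputs. The sketch below assumes hypothetical field names (`flagged`, `evidence_turns`, and the competency strings); the source only states that judges emit scores on a 1-4 scale, flags, and evidence-linked rationales mapped to ACGME competencies.

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    """One of the 105 evaluation dimensions, mapped to an ACGME
    competency (the competency names here are illustrative)."""
    name: str
    acgme_competency: str  # e.g. "Patient Care", "Medical Knowledge"

@dataclass
class JudgeResult:
    """Output of a calibrated LLM judge for a single dimension:
    a score, an optional flag, and an evidence-linked rationale."""
    dimension: Dimension
    score: int                  # 1 (worst) .. 4 (best)
    flagged: bool = False       # e.g. a treatment-safety violation
    rationale: str = ""         # free-text explanation from the judge
    evidence_turns: list[int] = field(default_factory=list)  # cited turns

    def __post_init__(self):
        if not 1 <= self.score <= 4:
            raise ValueError("scores are constrained to the 1-4 scale")
```

Validating the score range at construction time keeps downstream aggregation (per-dimension averages, per-competency rollups) free of out-of-scale values, which matters when many judge calls are collected over 7,000+ conversations.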