Artificial intelligence is transforming how clinically meaningful radiology reports are produced from medical images such as chest X-rays. Automated report generation can reduce workload and improve efficiency for healthcare professionals. Beyond these practical benefits, report generation has become a crucial benchmark for evaluating multimodal reasoning in AI applied to healthcare.

UniRG: A New Approach with Reinforcement Learning

Microsoft Research has introduced Universal Report Generation (UniRG), a reinforcement learning-based framework for medical imaging report generation. This research prototype aims to advance research on medical AI and is not validated for clinical use. UniRG uses reinforcement learning to directly optimize clinical evaluation signals, aligning model training with real-world radiology practice rather than with approximate text generation objectives. Using this framework, the team trained UniRG-CXR, a state-of-the-art model for chest X-ray report generation, on a large-scale corpus comprising over 560,000 studies, 780,000 images, and 226,000 patients from more than 80 medical institutions.
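
To make the training idea concrete, here is a minimal sketch of reward-driven fine-tuning in the REINFORCE style: sample a candidate report from the model, score it with a clinical reward, and nudge the model toward higher-reward outputs. The toy model, reward, and optimization details below are illustrative assumptions, not UniRG's actual implementation, which is more sophisticated.

```python
import torch

def clinical_reward(report_ids: torch.Tensor) -> float:
    """Stand-in for a clinically grounded reward (e.g., finding-level accuracy)."""
    return float((report_ids % 2 == 0).float().mean())  # toy scoring rule

vocab_size, hidden = 100, 32
# Toy "report generator": embeds context tokens and predicts a token per position.
policy = torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                             torch.nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

context = torch.randint(vocab_size, (1, 16))  # stand-in for image-conditioned context
for step in range(10):
    dist = torch.distributions.Categorical(logits=policy(context))
    report = dist.sample()                    # sampled "report" tokens
    reward = clinical_reward(report)
    baseline = 0.5                            # simple baseline to reduce variance
    # REINFORCE: scale the sample's log-likelihood by its baseline-adjusted reward.
    loss = -(reward - baseline) * dist.log_prob(report).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```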

Performance and Generalization

UniRG-CXR achieves superior performance across report-level metrics, disease-level diagnostic accuracy, cross-institution generalization, longitudinal report generation, and demographic subgroups. The results demonstrate that reinforcement learning, guided by clinically meaningful reward signals, can significantly improve the reliability and generality of vision-language models in medicine.

A unified framework for scaling medical image report generation

UniRG builds state-of-the-art report generation models by combining supervised fine-tuning with reinforcement learning that optimizes a composite reward integrating rule-based metrics, model-based semantic metrics, and LLM-based clinical error signals. This approach allows the UniRG-CXR model to learn from diverse data sources, overcome dataset-specific reporting patterns, and learn representations that generalize across institutions, metrics, and clinical contexts. Notably, UniRG-CXR sets a new state of the art on ReXrank, a public leaderboard for chest X-ray image interpretation, surpassing the previous best models by substantial margins.
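
As an illustration of what such a composite reward might look like, the sketch below blends the three signal families named above: a rule-based overlap score, a model-based semantic score, and an LLM-judged clinical error penalty. The helper functions, weights, and penalty shape are assumptions for exposition, not UniRG's actual reward.

```python
def rule_based_score(candidate: str, reference: str) -> float:
    """Toy rule-based signal: unigram overlap, standing in for BLEU/ROUGE-style metrics."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / max(len(ref), 1)

def semantic_score(candidate: str, reference: str) -> float:
    """Placeholder for a model-based semantic metric (e.g., embedding similarity)."""
    return rule_based_score(candidate, reference)  # placeholder implementation

def llm_error_count(candidate: str, reference: str) -> int:
    """Placeholder for an LLM judge counting clinically significant errors
    (false findings, missed findings, wrong severity, and so on)."""
    return 0  # a real judge would call an LLM here

def composite_reward(candidate: str, reference: str,
                     w_rule: float = 0.3, w_sem: float = 0.4,
                     w_err: float = 0.3) -> float:
    # The error signal enters as a penalty: more clinical errors, lower reward.
    error_term = 1.0 / (1.0 + llm_error_count(candidate, reference))
    return (w_rule * rule_based_score(candidate, reference)
            + w_sem * semantic_score(candidate, reference)
            + w_err * error_term)

print(composite_reward("No acute cardiopulmonary process.",
                       "No acute cardiopulmonary abnormality."))
```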

Universal improvements across metrics and clinical errors

Rather than excelling on one metric at the expense of others, UniRG-CXR delivers balanced improvements across many different measures of report quality. More importantly, it produces reports with substantially fewer clinically significant errors, indicating that the model is not merely learning to sound like a radiology report but is better capturing the underlying clinical facts. Explicit optimization for clinical correctness helps the model avoid common error modes in which fluent language masks incorrect or missing findings.

Strong performance in longitudinal report generation

In clinical practice, radiologists routinely compare current images with previous exams to determine whether a condition is improving, worsening, or unchanged. UniRG-CXR effectively incorporates this historical information, generating reports that reflect meaningful changes over time. This allows the model to describe new findings, progression, or resolution of disease more accurately, moving closer to how radiologists reason through patient histories rather than treating each exam in isolation.
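
One way to picture this is in how the model's input might be assembled: the current image is paired with the prior study's image and report so the model can state interval change explicitly. The schema and prompt below are hypothetical and intended only to illustrate the idea, not UniRG-CXR's actual input format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Study:
    image_path: str
    report: Optional[str] = None  # prior studies carry their finalized report

def build_longitudinal_input(current: Study, prior: Optional[Study]) -> dict:
    """Pack the current exam, and the prior exam when available, into one model input."""
    prompt = "Generate the findings for the current chest X-ray."
    if prior is not None and prior.report:
        # Give the model explicit comparison context, mirroring how a radiologist
        # reads the current exam against the previous one.
        prompt += (" Compare with the prior study and state whether each finding"
                   " is new, improved, worse, or unchanged."
                   f"\nPrior report: {prior.report}")
    images = [current.image_path] + ([prior.image_path] if prior else [])
    return {"images": images, "prompt": prompt}

prior = Study("cxr_2023-05.png", report="Mild left basilar atelectasis.")
print(build_longitudinal_input(Study("cxr_2024-02.png"), prior))
```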

Robust generalization across institutions and populations

UniRG-CXR maintains strong performance even when applied to data from institutions it has never seen before, suggesting that the model is learning general clinical patterns rather than memorizing institution-specific reporting styles. Furthermore, its performance remains stable across patient subgroups defined by age, gender, and race. This robustness is critical for real-world deployment, where models must function reliably across diverse populations and healthcare environments.
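
Robustness claims of this kind are typically checked with subgroup-stratified evaluation: computing the same quality metric separately per institution and per demographic attribute, then comparing the averages. Here is a minimal sketch with illustrative record fields and an illustrative metric.

```python
from collections import defaultdict
from statistics import mean

def stratified_scores(records, group_key, metric):
    """Average a per-report metric within each subgroup defined by group_key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(metric(rec))
    return {group: mean(values) for group, values in groups.items()}

records = [  # illustrative evaluation records
    {"institution": "A", "gender": "F", "score": 0.81},
    {"institution": "A", "gender": "M", "score": 0.79},
    {"institution": "B", "gender": "F", "score": 0.78},
]
# A large gap between subgroup averages would flag a robustness problem.
for key in ("institution", "gender"):
    print(key, stratified_scores(records, key, lambda r: r["score"]))
```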