LLM Ensembles for Detecting Quality-of-Life Studies in PubMed Abstracts

The race toward automated abstract screening

The exponential growth of scientific publications is turning systematic reviews into a resource-sapping nightmare. In biomedicine, identifying studies that clearly report health-related quality-of-life outcomes – such as EQ-5D data – requires nuanced clinical interpretation that simple keyword filters miss. A research team decided to put Large Language Models to the test on this task using only PubMed abstracts as input.

Inside the pipeline: few-shot, ensemble, and soft stacking

The study revolves around a multi-phase framework. It begins with few-shot prompting: each model receives a handful of expert-labeled examples, enough to guide it in distinguishing studies with and without EQ-5D data. The next step is aggregation: predictions from nine LLMs drawn from Google’s Gemini and Gemma families are combined in two ways, through a weighted ensemble and through a meta-classifier based on soft stacking, which uses the raw probabilities from each model rather than just the final class labels.
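To make the aggregation step concrete, here is a minimal sketch of weighted soft voting and of the feature matrix a soft-stacking meta-classifier would consume. It is an illustration under stated assumptions, not the study's implementation: the model names, probabilities, weights, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical per-model probabilities that each of three abstracts
# reports EQ-5D data. None of these numbers come from the study.
probs = {
    "model_a": [0.90, 0.20, 0.60],
    "model_b": [0.70, 0.40, 0.30],
    "model_c": [0.80, 0.10, 0.55],
}

# Hypothetical ensemble weights (e.g. tuned on a validation split).
weights = {"model_a": 0.5, "model_b": 0.2, "model_c": 0.3}


def weighted_soft_vote(probs, weights, threshold=0.5):
    """Average the raw probabilities using per-model weights, then threshold."""
    total_weight = sum(weights.values())
    n_items = len(next(iter(probs.values())))
    scores = [
        sum(weights[m] * probs[m][i] for m in probs) / total_weight
        for i in range(n_items)
    ]
    labels = [int(s >= threshold) for s in scores]
    return scores, labels


def stacking_features(probs):
    """One row per abstract, one column per model (sorted by model name).

    A soft-stacking meta-classifier is trained on rows like these,
    i.e. on probabilities rather than on hard 0/1 votes.
    """
    names = sorted(probs)
    n_items = len(probs[names[0]])
    return [[probs[m][i] for m in names] for i in range(n_items)]


scores, labels = weighted_soft_vote(probs, weights)
features = stacking_features(probs)
print(labels)       # hard decisions from the weighted ensemble
print(features[0])  # meta-classifier input for the first abstract
```

The third abstract shows why probabilities matter: one model scores it at 0.30, which a hard vote would count as a plain "no", while the weighted average still lands just above the threshold.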

Performance: strength in numbers, but with balance

The best-performing weighted ensemble – combining gemini-2.5-pro, gemma-3-12b, and gemma-3-27b – reached a weighted F1-score of 0.74 and an accuracy of 0.74, surpassing every individual model. The real story isn’t the absolute figure but the improved balance between precision and recall. Individual models often leaned too far toward one metric, while the ensemble smoothed out the asymmetries. Feature analysis confirmed that the raw probabilities from the LLMs were crucial for guiding the final decision.
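For readers who want to see how these metrics behave, the sketch below computes precision, recall, and a support-weighted F1 in plain Python. The labels are toy data invented for illustration, not the study's predictions; the "weighted" average is assumed to mean the usual per-class F1 weighted by class support.

```python
def per_class_prf(y_true, y_pred, cls):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    total = len(y_true)
    return sum(
        y_true.count(cls) / total * per_class_prf(y_true, y_pred, cls)[2]
        for cls in set(y_true)
    )


# Toy labels: 1 = abstract reports EQ-5D data, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_skewed = [1, 1, 1, 1, 1, 1, 0, 0]    # over-predicts the positive class
y_balanced = [1, 1, 1, 0, 1, 0, 0, 0]  # one error in each direction

for name, y_pred in [("skewed", y_skewed), ("balanced", y_balanced)]:
    precision, recall, _ = per_class_prf(y_true, y_pred, 1)
    print(name, round(precision, 2), round(recall, 2),
          round(weighted_f1(y_true, y_pred), 2))
```

Both toy predictors get six of eight abstracts right, yet the skewed one buys perfect recall at the expense of precision. That asymmetry is the kind of imbalance an ensemble can smooth out.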

What this means for those eyeing on-premise deployment

Stacking multiple LLMs for a screening task echoes the multi-agent and retrieval-augmented architectures gaining traction in the enterprise. But the stakes here are different: in biomedicine, data sovereignty is often a hard requirement (think of GDPR for health data). Relying on models available only via API, such as Gemini 2.5 Pro, can create compliance friction. On the other hand, the Gemma 3 models with 12 and 27 billion parameters can run locally, yet running three models simultaneously for inference multiplies VRAM demands and may introduce latency that batch screening cannot afford.
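A back-of-the-envelope calculation shows how quickly the memory bill adds up for the two locally runnable models. The sketch counts weights only, using nominal parameter counts of 12 and 27 billion and idealized bytes-per-parameter figures; it ignores the KV cache, activations, and runtime overhead, so real requirements will be higher. The precision labels are illustrative assumptions, not settings reported by the study.

```python
# Idealized storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal parameter counts taken from the model names (an approximation).
MODELS = {"gemma-3-12b": 12e9, "gemma-3-27b": 27e9}


def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def resident_total_gb(models, precision):
    """Weights-only footprint when every model stays loaded at once."""
    return sum(weight_memory_gb(n, precision) for n in models.values())


for precision in BYTES_PER_PARAM:
    total = resident_total_gb(MODELS, precision)
    print(f"{precision}: about {total:.1f} GB for weights alone")
```

Under these assumptions, keeping both Gemma models resident takes roughly 78 GB at 16-bit precision and roughly 19.5 GB at 4-bit, before any inference-time memory is counted.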

Organizations weighing an on-premise deployment face trade-offs familiar to AI-RADAR readers: an ensemble boosts reliability but demands infrastructure that can host multiple models, possibly through quantization and shared serving engines (e.g., vLLM or TGI). A sequential scheduling approach reduces memory pressure at the cost of longer processing times. The study does not dive into these details, but the signal is clear: advanced automation of systematic reviews is becoming a proving ground for composite LLM architectures, and porting them into controlled environments will be the next step for organizations that cannot entrust sensitive data to external cloud services.
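The sequential option can be sketched as a loop that loads one model, scores the whole batch of abstracts, and releases the model before loading the next. Everything here is hypothetical: the loaders return toy keyword scorers rather than real LLMs, and no vLLM or TGI API is used or implied.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str], float]


def make_stub_loader(bias: float) -> Callable[[], Scorer]:
    """Hypothetical loader that returns a toy scorer instead of a real LLM."""
    def load() -> Scorer:
        def score(abstract: str) -> float:
            hit = 1.0 if "eq-5d" in abstract.lower() else 0.0
            return min(1.0, max(0.0, 0.5 * hit + bias))
        return score
    return load


def score_sequentially(
    loaders: Dict[str, Callable[[], Scorer]],
    abstracts: List[str],
) -> Dict[str, List[float]]:
    """Score every abstract with one model at a time.

    Only one model is held in memory per iteration, trading a longer
    total run for a lower peak footprint.
    """
    results: Dict[str, List[float]] = {}
    for name, load in loaders.items():
        model = load()  # load a single model
        results[name] = [model(text) for text in abstracts]
        # Drop the reference; a real serving stack would also have to
        # free GPU memory explicitly before the next load.
        del model
    return results


abstracts = [
    "Utilities were elicited with the EQ-5D questionnaire.",
    "We report overall survival at 24 months.",
]
loaders = {"stub_a": make_stub_loader(0.3), "stub_b": make_stub_loader(0.1)}
results = score_sequentially(loaders, abstracts)
print(results)
```

The per-model score lists that come out of the loop are the same inputs a weighted ensemble or stacking step needs, so aggregation can run once every model has had its turn.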