BioACE: Automated Evaluation of Answers in the Biomedical Field

The increasing use of large language models (LLMs) to answer questions in the biomedical field makes it crucial to evaluate the quality of the generated answers and the cited sources that support them.

Evaluating text generated by LLMs remains a complex challenge, particularly for tasks such as question answering, retrieval-augmented generation (RAG), and summarization. In the biomedical domain the difficulty is compounded: checking that an answer is consistent with the scientific literature and uses specialized medical terminology correctly typically requires expert verification.

BioACE is an automated framework that evaluates biomedical answers and their citations, checking whether the cited sources support the facts stated in the answers. The framework assesses several aspects of answer quality, including completeness, correctness, precision, and recall, measured against ground-truth data.
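
As an illustration, the sketch below scores an answer's atomic facts against ground-truth facts using set overlap, yielding precision (correctness of what is stated) and recall (completeness with respect to the ground truth). This is a minimal sketch: the decomposition into atomic facts, the exact string-match criterion, and the `fact_scores` helper are assumptions for illustration, not the BioACE API.

```python
# Hypothetical sketch of fact-level precision/recall scoring; the helper
# name and the string-match criterion are assumptions, not the BioACE API.
from typing import Iterable


def fact_scores(predicted_facts: Iterable[str], gold_facts: Iterable[str]) -> dict:
    """Score an answer's atomic facts against ground-truth facts.

    Facts are compared after normalization; a real evaluator would use
    semantic matching (e.g., NLI or embedding similarity) instead.
    """
    pred = {f.strip().lower() for f in predicted_facts}
    gold = {f.strip().lower() for f in gold_facts}
    matched = pred & gold

    precision = len(matched) / len(pred) if pred else 0.0  # correctness of stated facts
    recall = len(matched) / len(gold) if gold else 0.0     # completeness w.r.t. ground truth
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    answer_facts = ["Metformin lowers blood glucose", "Metformin is a biguanide"]
    truth_facts = ["Metformin lowers blood glucose",
                   "Metformin reduces hepatic glucose production"]
    print(fact_scores(answer_facts, truth_facts))  # precision 0.5, recall 0.5
```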

Automated approaches have been developed to evaluate each of these aspects, and experiments have been conducted to analyze how well they correlate with human judgments. Existing techniques, including natural language inference (NLI), pre-trained language models, and LLMs, are used to assess the quality of the evidence from the biomedical literature that is cited in support of the generated answers.
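
For example, an NLI model can judge whether a cited passage entails a claim made in the answer. The sketch below uses an off-the-shelf MNLI model from Hugging Face; the model choice, the entailment threshold, and the `citation_supports` helper are illustrative assumptions rather than BioACE's actual evaluator.

```python
# Illustrative NLI-based citation check, assuming a Hugging Face MNLI model;
# this is a generic sketch, not BioACE's actual implementation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # any MNLI-style model works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def citation_supports(evidence: str, claim: str, threshold: float = 0.5) -> bool:
    """Return True if the cited evidence entails the answer's claim."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Look up the entailment label robustly; MNLI heads differ in label order.
    entail_id = {k.lower(): v for k, v in model.config.label2id.items()}["entailment"]
    return probs[entail_id].item() >= threshold


evidence = ("Metformin decreases hepatic glucose production and improves "
            "insulin sensitivity in patients with type 2 diabetes.")
claim = "Metformin lowers blood glucose in type 2 diabetes."
print(citation_supports(evidence, claim))
```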

The BioACE evaluation package is available on GitHub.