A new study compared the performance of several large language models (LLMs) in the pharmaceutical sector, with a particular focus on hallucinations, i.e., the tendency to produce false information or claims not supported by the underlying data.

Benchmark Results

The benchmark, named Placebo Bench, revealed that Kimi K2.5 hallucinated less than Opus 4.6. The evaluation was conducted on a realistic use case built from pharmaceutical-sector data. Notably, Opus 4.6 showed the highest hallucination rate among the tested models.
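The article does not specify how Placebo Bench scores models, but a hallucination rate of this kind is typically computed as the fraction of generated claims judged unsupported by the source data. A minimal sketch, using hypothetical per-claim labels rather than the actual benchmark figures:

```python
def hallucination_rate(judgments):
    """Fraction of claims judged unsupported (False = unsupported)."""
    if not judgments:
        return 0.0
    return sum(1 for supported in judgments if not supported) / len(judgments)

# Hypothetical support labels per generated claim (True = grounded).
# These are illustrative values, NOT real Placebo Bench results.
results = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}
rates = {model: hallucination_rate(labels) for model, labels in results.items()}
```

A lower rate means fewer unsupported claims; comparing models then reduces to comparing these fractions on the same claim set.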

Hallucination Analysis

Reportedly, Opus 4.6 tended to invent clinical protocols or tests that were not present in the original data, probably in an attempt to provide more complete answers. Kimi K2.5, while not perfect, demonstrated greater accuracy.
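Detecting this kind of invention amounts to checking whether each claim is grounded in the source documents. The sketch below is a crude lexical-overlap heuristic (the threshold and word filter are arbitrary assumptions, and real evaluations rely on entailment models or human review), but it illustrates the idea of flagging claims whose content does not appear in the original data:

```python
import re

def is_grounded(claim: str, source: str, threshold: float = 0.6) -> bool:
    """Crude check: a claim counts as grounded if enough of its
    content words (4+ letters) also occur in the source text."""
    claim_words = set(re.findall(r"[a-z]{4,}", claim.lower()))
    if not claim_words:
        return True  # nothing substantive to verify
    source_words = set(re.findall(r"[a-z]{4,}", source.lower()))
    return len(claim_words & source_words) / len(claim_words) >= threshold

source = "The trial followed a double-blind protocol with 200 patients."
is_grounded("The trial used a double-blind protocol.", source)        # grounded
is_grounded("A phase III crossover study was also conducted.", source)  # invented
```

An invented clinical protocol, as described above, would share few content words with the source and be flagged as ungrounded.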

Dataset and Accessibility

The dataset used for the benchmark is available on Hugging Face, allowing researchers and developers to replicate the results and further evaluate LLM performance in this domain.