A new study compared the performance of several large language models (LLMs) in the pharmaceutical sector, with a particular focus on hallucinations, i.e., the tendency to produce false information or claims not supported by the underlying data.

Benchmark Results

The benchmark, named Placebo Bench, revealed that Kimi K2.5 hallucinated less than Opus 4.6. The evaluation was conducted on a realistic use case built from pharmaceutical-sector data. Notably, Opus 4.6 showed the highest hallucination rate among the tested models.
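The article does not specify how Placebo Bench scores models, but a hallucination rate of this kind is typically computed as the fraction of generated claims judged unsupported by the source data. A minimal sketch, using hypothetical per-claim labels rather than the actual benchmark figures:

```python
def hallucination_rate(judgments):
    """Fraction of claims judged unsupported (False = unsupported)."""
    if not judgments:
        return 0.0
    return sum(1 for supported in judgments if not supported) / len(judgments)

# Hypothetical support labels per generated claim (True = grounded).
# These are illustrative values, NOT real Placebo Bench results.
results = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}
rates = {model: hallucination_rate(labels) for model, labels in results.items()}
```

A lower rate means fewer unsupported claims; comparing models then reduces to comparing these fractions on the same claim set.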

Hallucination Analysis

Reportedly, Opus 4.6 tended to invent clinical protocols or tests that were not present in the original data, probably in an attempt to provide more complete answers. Kimi K2.5, while not perfect, demonstrated greater accuracy.
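Detecting this kind of invention amounts to checking whether each claim is grounded in the source documents. The sketch below is a crude lexical-overlap heuristic (the threshold and word filter are arbitrary assumptions, and real evaluations rely on entailment models or human review), but it illustrates the idea of flagging claims whose content does not appear in the original data:

```python
import re

def is_grounded(claim: str, source: str, threshold: float = 0.6) -> bool:
    """Crude check: a claim counts as grounded if enough of its
    content words (4+ letters) also occur in the source text."""
    claim_words = set(re.findall(r"[a-z]{4,}", claim.lower()))
    if not claim_words:
        return True  # nothing substantive to verify
    source_words = set(re.findall(r"[a-z]{4,}", source.lower()))
    return len(claim_words & source_words) / len(claim_words) >= threshold

source = "The trial followed a double-blind protocol with 200 patients."
is_grounded("The trial used a double-blind protocol.", source)        # grounded
is_grounded("A phase III crossover study was also conducted.", source)  # invented
```

An invented clinical protocol, as described above, would share few content words with the source and be flagged as ungrounded.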

Dataset and Accessibility

The dataset used for the benchmark is available on Hugging Face, allowing researchers and developers to replicate the results and further evaluate LLM performance in this domain.