A new study compared the performance of several large language models (LLMs) in the pharmaceutical sector, with a particular focus on hallucinations, i.e., the tendency to produce false information or claims unsupported by the provided data.
Benchmark Results
The benchmark, named Placebo Bench, revealed that Kimi K2.5 outperformed Opus 4.6 at avoiding hallucinations. The test was conducted on a realistic use case, using data specific to the pharmaceutical sector. Notably, Opus 4.6 showed the highest hallucination rate among the tested models.
Hallucination Analysis
Reportedly, Opus 4.6 tended to invent clinical protocols or tests that were not present in the original data, likely in an attempt to provide more complete answers. Kimi K2.5, while not perfect, demonstrated greater accuracy.
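The article does not describe how Placebo Bench scores hallucinations, but the underlying idea of flagging invented protocols can be illustrated with a toy grounding check: treat a generated claim as hallucinated if no sentence in the source data shares enough vocabulary with it. This is a minimal sketch with a hypothetical word-overlap heuristic, not the benchmark's actual methodology.

```python
def hallucination_rate(claims, source_sentences, threshold=0.5):
    """Fraction of claims with no sufficiently overlapping source sentence.

    A claim counts as 'supported' if at least one source sentence shares
    `threshold` or more of the claim's words (a crude proxy for grounding).
    """
    def overlap(claim, sentence):
        claim_words = set(claim.lower().split())
        sent_words = set(sentence.lower().split())
        return len(claim_words & sent_words) / max(len(claim_words), 1)

    unsupported = sum(
        1 for claim in claims
        if all(overlap(claim, s) < threshold for s in source_sentences)
    )
    return unsupported / max(len(claims), 1)


# Example: the second claim invents a dosage absent from the source data.
source = ["a randomized double blind protocol was used in the trial"]
claims = [
    "the trial used a randomized double blind protocol",  # grounded
    "dosage was 50 mg twice daily",                       # invented
]
print(hallucination_rate(claims, source))  # → 0.5
```

Real evaluations typically use an LLM judge or entailment model rather than word overlap, but the metric has the same shape: unsupported claims divided by total claims.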
Dataset and Accessibility
The dataset used for the benchmark is available on Hugging Face, allowing researchers and developers to replicate the results and further evaluate the performance of LLMs in this specific area.