A study published in Nature by researchers at the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford assessed how reliably large language models (LLMs) provide medical advice.

Study Details

The researchers recruited 1,298 participants in the UK, who were randomly assigned to interact with GPT-4o, Llama 3, or Cohere's Command R+, or to consult a source of their own choosing, in order to work through simulated medical scenarios. The scenarios ranged from a young man with a severe headache to an exhausted new mother.

When the models were tested directly with the full text of the clinical scenarios, they correctly identified the conditions in 94.9% of cases. However, when participants used the same models to assess the scenarios, accuracy dropped to 34.5%. In some cases, the chatbots provided incorrect or incomplete information, focused on irrelevant details, or suggested the wrong emergency numbers.
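
The gap between the two figures comes down to how the models were evaluated. The Python sketch below illustrates the distinction under stated assumptions: `ask_model` is a hypothetical stand-in for a call to one of the chatbots, and the field names (`full_text`, `gold_condition`, `participant_conclusion`) are illustrative, not the study's actual data or code.

```python
def ask_model(prompt: str) -> str:
    """Placeholder for an LLM API call (an assumption, not the study's code)."""
    raise NotImplementedError

def direct_accuracy(vignettes: list[dict]) -> float:
    """Condition 1: feed the full scenario text straight to the model."""
    hits = 0
    for case in vignettes:
        answer = ask_model(
            "Read this clinical scenario, then name the most likely condition "
            "and the appropriate level of care:\n\n" + case["full_text"]
        )
        # Count a hit if the expected condition appears in the model's answer.
        hits += case["gold_condition"].lower() in answer.lower()
    return hits / len(vignettes)

def user_mediated_accuracy(transcripts: list[dict]) -> float:
    """Condition 2: score the conclusion a participant reached after
    describing the scenario to the model in their own words."""
    hits = sum(
        t["participant_conclusion"].lower() == t["gold_condition"].lower()
        for t in transcripts
    )
    return hits / len(transcripts)
```

In the first condition the model sees every relevant detail at once; in the second, what it sees depends on what the participant chooses to type, which is where the study found most of the accuracy loss.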

Implications and Warnings

In one extreme case, two users with similar symptoms of subarachnoid hemorrhage received opposite advice: one was told to lie down in a dark room, while the other was correctly advised to seek urgent medical care.

Dr. Rebecca Payne, lead medical practitioner on the study, emphasized the difficulty of developing AI systems capable of supporting people in sensitive areas like health. She warned that asking a large language model about one's symptoms can be dangerous, leading to incorrect diagnoses and failure to recognize emergency situations.

Broader Context

This study adds to growing concern about the misuse of chatbots in healthcare. Chatbots posing as therapists and citing fabricated credentials have previously been reported. OpenAI has introduced ChatGPT Health, a version of ChatGPT designed to provide more accurate health information, but the researchers recommend thoroughly testing LLMs with real human users before deploying them at scale.
