AI Medical Scribes Under Scrutiny in Ontario Audit

In recent years, mounting pressure on healthcare professionals has led many doctors to turn to AI-based solutions, particularly AI medical scribes. These tools automate the summarization of patient conversations, diagnoses, and treatment decisions, transforming them into structured notes for health records, with the goal of alleviating administrative burden and improving efficiency. However, a recent audit by the Auditor General of Ontario has raised serious concerns about the reliability of these technologies.

The audit's findings are alarming: AI scribes recommended by the provincial government regularly generated incorrect, incomplete, and even hallucinated information. Such unreliability could lead to inadequate or harmful treatment plans, directly affecting patient health outcomes. The case highlights one of the most critical challenges in adopting AI in sensitive sectors like healthcare: ensuring not only efficiency but, above all, the accuracy and safety of the data these systems produce.

Audit Details and Types of Errors Identified

The Auditor General's report, titled "Use of Artificial Intelligence in the Ontario Government," reviewed transcription tests of two simulated patient-doctor conversations. The tests were run against solutions from 20 different AI scribe vendors, all approved and pre-qualified by the provincial government for purchase by healthcare providers. The results revealed a widespread problem: every one of the 20 vendors showed accuracy or completeness issues in at least one of these simple tests.

The critical issues identified were numerous and significant: nine vendors hallucinated patient information, twelve recorded information incorrectly, and seventeen missed key details about mental health issues discussed in the simulated conversations. The report highlighted concrete examples of mistakes that could directly harm a patient's subsequent care, including referrals for blood tests or therapy that were never discussed and incorrectly transcribed prescription medication names. Fabrications of this kind, known as "hallucinations" in the context of LLMs, are an intrinsic challenge to the reliability of these systems.

Implications for LLM Deployment in Critical Environments

The findings of the Ontario audit offer crucial insights for CTOs, DevOps leads, and infrastructure architects evaluating the deployment of Large Language Models (LLMs) in enterprise contexts, especially in regulated sectors such as healthcare or finance. The generation of incorrect or hallucinated data is not just an accuracy problem; it raises fundamental questions of data sovereignty, compliance, and accountability. In environments where information integrity is non-negotiable, trust in the AI system must be supported by rigorous validation and robust control mechanisms.
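As an illustration of what one such control mechanism might look like, the sketch below flags terms that appear in a generated note but never occur in the source transcript, a simple grounding check against fabricated content. The vocabulary, function names, and matching strategy are illustrative assumptions, not part of the audit or any vendor's product.

```python
import re

# Hypothetical vocabulary of terms worth verifying; a real system would use
# a clinical terminology service rather than a hard-coded set.
MEDICATIONS = {"amoxicillin", "metformin", "sertraline"}

def terms_in(text: str, vocabulary: set[str]) -> set[str]:
    """Return vocabulary terms that appear as whole words in the text."""
    lowered = text.lower()
    return {t for t in vocabulary if re.search(rf"\b{re.escape(t)}\b", lowered)}

def ungrounded(transcript: str, note: str, vocabulary: set[str]) -> set[str]:
    """Terms present in the generated note but absent from the transcript."""
    return terms_in(note, vocabulary) - terms_in(transcript, vocabulary)

transcript = "Patient reports taking metformin daily for type 2 diabetes."
note = "Plan: continue metformin. Start sertraline 50 mg for low mood."
print(ungrounded(transcript, note, MEDICATIONS))  # {'sertraline'} -> flag for human review
```

A check like this cannot prove a note is correct, but it can surface exactly the class of fabricated referrals and medications the audit describes, routing them to a human reviewer instead of straight into the health record.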

For organizations considering self-hosted or on-premise alternatives to cloud solutions, these results reinforce the importance of direct control over the entire AI pipeline. The ability to monitor, audit, and, when necessary, intervene in the models and the data they process becomes a critical factor. An on-premise deployment can offer greater transparency and control over inference, mitigating the risks associated with unpredictable LLM behavior. This is particularly true in air-gapped scenarios or under stringent compliance requirements, where data localization and governance are priorities.
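As a concrete sketch of that kind of control, the snippet below wraps a self-hosted, OpenAI-compatible inference endpoint (such as a local vLLM or llama.cpp server) and writes an append-only audit trail. The endpoint URL, model name, and log path are assumptions for illustration only.

```python
import hashlib
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # local server: no data leaves the host
AUDIT_LOG = "audit.jsonl"                                # append-only local audit trail

def generate_note(transcript: str, model: str = "local-scribe-model") -> str:
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the encounter as a structured clinical note."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.0,  # deterministic settings make outputs easier to audit
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    note = resp.json()["choices"][0]["message"]["content"]

    # Log a hash of the input (not the raw transcript) plus the full output,
    # so every note can be traced to a specific request without storing PHI in the log.
    entry = {
        "ts": time.time(),
        "model": model,
        "input_sha256": hashlib.sha256(transcript.encode()).hexdigest(),
        "output": note,
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return note
```

Because both the endpoint and the log live on infrastructure the organization controls, reviewers can reconstruct which model version produced which note, something far harder to guarantee with an opaque third-party service.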

The Need for Rigorous Validation and Control

The Ontario experience underscores that AI adoption, however promising its efficiency gains, requires an extremely cautious and methodical approach, especially when AI-driven decisions can have direct consequences for human life. Vendor approval alone is not sufficient; continuous validation processes and domain-specific benchmarks are essential.
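What such a domain-specific benchmark could look like in its simplest form is sketched below: each test case pairs a transcript with facts the note must contain and facts it must not invent. The case data and the generate function are placeholders; a credible benchmark would rely on clinician-reviewed cases and far more sophisticated matching.

```python
from typing import Callable

# Hypothetical test cases; real ones would be clinician-reviewed.
CASES = [
    {
        "transcript": "Patient describes two weeks of low mood and poor sleep.",
        "required": ["low mood", "sleep"],  # omissions like these tripped 17 of 20 vendors
        "forbidden": ["referral"],          # nothing in the conversation supports a referral
    },
]

def score(note: str, case: dict) -> dict:
    """Check one generated note for omissions and fabrications."""
    text = note.lower()
    missing = [f for f in case["required"] if f not in text]
    fabricated = [f for f in case["forbidden"] if f in text]
    return {"missing": missing, "fabricated": fabricated,
            "passed": not missing and not fabricated}

def run_benchmark(generate: Callable[[str], str]) -> float:
    """Return the scribe's pass rate across all cases."""
    results = [score(generate(c["transcript"]), c) for c in CASES]
    for result in results:
        if not result["passed"]:
            print(f"FAIL: missing={result['missing']} fabricated={result['fabricated']}")
    return sum(r["passed"] for r in results) / len(results)
```

Run routinely against every model or vendor update, even a crude harness like this turns "approved once" into "validated continuously."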

For technology decision-makers, it is crucial to evaluate not only vendors' declared performance but also the robustness of models in real-world scenarios and their propensity for critical errors. This includes understanding the trade-offs between model complexity, VRAM requirements for inference, and the ability to guarantee reliable results, as the rough sizing sketch below illustrates.
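As a back-of-envelope view of the VRAM side of that trade-off, the figures below use common rules of thumb (weights at two bytes per parameter in fp16, plus a key-value cache that grows with context length); the model dimensions are illustrative, not a specification of any audited product.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights: 1e9 params * bytes per param / 1e9 = GB."""
    return params_billions * bytes_per_param

def kv_cache_gb(layers: int, hidden_dim: int, seq_len: int,
                batch: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache: 2 (keys and values) per layer, per token, per batch element."""
    return 2 * layers * hidden_dim * seq_len * batch * bytes_per_value / 1e9

# Example: a hypothetical 7B-parameter model (32 layers, hidden size 4096) in fp16.
print(weights_gb(7.0))                      # ~14 GB of weights
print(kv_cache_gb(32, 4096, seq_len=4096))  # ~2.1 GB of KV cache at 4k context
```

A larger or less aggressively quantized model may transcribe more faithfully but demands hardware not every clinic can host, and that tension is exactly where rigorous, domain-specific evaluation has to guide the choice.

The lesson from Ontario is clear: AI-based innovation must go hand in hand with an unwavering commitment to safety, accuracy, and accountability, especially when dealing with sensitive data and decisions that impact people's well-being.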