Anthropic’s Mythos flags vulnerabilities in classified US government test

More than a chatbot, a digital sentinel. Anthropic’s frontier model Mythos has passed a demanding test: finding flaws inside classified US government computer systems. According to a US official speaking to the Associated Press, the model identified several vulnerabilities within a few hours during an exercise.

The detail matters: Mythos did not perform an autonomous penetration test, compromise systems, or write exploit code. The task, however complex, was closer to large-scale static analysis: spotting weaknesses in the security architecture and reporting them to human operators. Yet the speed at which the model delivered actionable results, in what was presumably an air-gapped setting, is a signal for anyone managing critical infrastructure.

The shadow of on-premise

To run a test on classified networks, Mythos could not rely on public cloud endpoints. It was almost certainly an isolated instance, running on local hardware with strict air-gap constraints. This is the kind of scenario AI-RADAR tracks constantly: on-premise deployment of LLMs to maintain data sovereignty and prevent leakage, even at the cost of giving up the operational flexibility of the cloud.

This experiment is no anomaly. Government agencies are gradually bringing generative AI inside their secure perimeters, driven by the need to analyze sensitive data without exposing it. But deploying a large model like Mythos on-premise involves significant hardware choices: GPUs with abundant VRAM (think 80 GB or more per card), fast storage for the model and data, and cooling infrastructure that adds to the total cost of ownership (TCO). Then there is maintenance: unlike an API service, a self-hosted model has to be updated by the organization itself, which requires dedicated expertise.

Anthropic has not released details on Mythos’s architecture or parameter count. Frontier models of this class are widely assumed to run to hundreds of billions of parameters, although no official figures have been published for the Claude family, from which Mythos derives. In an on-premise context, a model of that size calls for multi-GPU setups, fast interconnects (NVLink, InfiniBand), and quantization or other inference tuning to balance latency against resource consumption.
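To make the sizing argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not based on any published Mythos specification: the 300-billion-parameter figure, the 80 GB card size, and the 20% headroom for KV cache and activations are all illustrative assumptions, and the function names are hypothetical. It only shows how weight precision drives the memory needed for the weights and, from there, a rough lower bound on the number of GPUs.

```python
import math

# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    simplifies to params_billions * bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus(params_billions: float, precision: str,
             vram_gb: float = 80.0, overhead: float = 0.2) -> int:
    """Rough lower bound on GPU count.

    `overhead` is an assumed fractional headroom for KV cache and
    activations; real deployments may need considerably more.
    """
    needed_gb = weight_memory_gb(params_billions, precision) * (1 + overhead)
    return math.ceil(needed_gb / vram_gb)


if __name__ == "__main__":
    hypothetical_size = 300  # billions of parameters, assumed for illustration
    for precision in BYTES_PER_PARAM:
        print(precision,
              weight_memory_gb(hypothetical_size, precision), "GB,",
              min_gpus(hypothetical_size, precision), "GPUs (lower bound)")
```

Under these assumptions, a 300-billion-parameter model needs 600 GB for its weights alone at 16-bit precision, which is why a single card is not an option and why quantization to 8 or 4 bits is attractive. The estimate ignores interconnect topology, batch size, and context length, all of which push real requirements higher.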

Sovereignty meets pragmatism

Beyond technology, there is a geopolitical dimension. The test happened on US systems with an American model, but the dynamic is universal: every organization with classified data must decide whether to rely on an external vendor (with the risks of dependency and of vendor access to logs) or build internal capacity with open-source models. The two paths carry different costs and guarantees. The episode suggests that frontier models can now contribute to national cybersecurity, provided the infrastructure hosting them is equally robust.

A message for the enterprise

The news comes at a moment when many companies are putting on-premise AI projects on hold until they can tell whether the costs and complexity are justified. The use of Mythos in a government context offers a reference point: if a security agency invests in hardware to run a model on classified data, then perhaps the investment is worth it. But the path to widespread adoption is not linear. Trade-offs remain: performance and rapid updates on one side, full control and data privacy on the other. On these topics, AI-RADAR will continue to provide concrete analysis and metrics, without shortcuts.