LLMs and Cybersecurity: GPT-5.5 and Mythos Preview Compared

The landscape of generative artificial intelligence continues to evolve rapidly, with significant implications for critical sectors such as cybersecurity. Recently, Anthropic generated considerable attention around its Mythos Preview model, presenting it as a solution with advanced capabilities for information security. The company even restricted initial access to "critical industry partners," emphasizing the potential scope of its functionalities.

However, new research conducted by the UK's AI Security Institute (AISI) offers a different perspective on these claims. AISI's evaluations suggest that OpenAI's GPT-5.5, a model publicly released last week, achieved a "similar" level of performance to Mythos Preview in cybersecurity tests. This direct comparison provides important insights for organizations evaluating the adoption of LLMs for security tasks.

Evaluation Details and Performance

Since 2023, AISI has subjected various frontier AI models to a rigorous set of 95 "Capture the Flag" (CTF) challenges. These benchmarks are designed to test model capabilities in key cybersecurity areas, including reverse engineering, web exploitation, and cryptography. On the most complex tests, labeled "Expert," GPT-5.5 passed an average of 71.4% of the challenges, a result slightly higher than the 68.6% achieved by Mythos Preview, though within the statistical margin of error.
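The "within the margin of error" point can be made concrete with a quick binomial confidence-interval calculation. AISI has not published how many of the 95 challenges fall in the Expert tier, so the subset size below is a purely illustrative assumption; the takeaway is that with a few dozen trials, a ~3-point gap sits well inside overlapping intervals.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - margin, center + margin)

# Hypothetical: assume ~35 "Expert" challenges (AISI has not published the exact count).
n_expert = 35
gpt = wilson_interval(round(0.714 * n_expert), n_expert)      # 25/35 ~ 71.4%
mythos = wilson_interval(round(0.686 * n_expert), n_expert)   # 24/35 ~ 68.6%
print(f"GPT-5.5:        {gpt[0]:.2f} - {gpt[1]:.2f}")
print(f"Mythos Preview: {mythos[0]:.2f} - {mythos[1]:.2f}")
```

With these assumptions the two intervals overlap heavily, which is what "within the statistical margin of error" means in practice.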

A specific example highlighted by AISI concerns a particularly difficult task: building a disassembler to decode a Rust binary. In this test, GPT-5.5 solved the challenge in just 10 minutes and 22 seconds, with no human assistance, at an estimated cost of $1.73 in API calls, demonstrating remarkable autonomy and efficiency on complex tasks.

Advanced Simulations and Current Limitations

In addition to CTF challenges, AISI also utilized more complex simulations to evaluate the offensive and defensive capabilities of LLMs. One such simulation is "The Last Ones" (TLO), a test range designed to emulate a 32-step data extraction attack on a corporate network. In this simulation, GPT-5.5 succeeded in 3 out of 10 attempts, while Mythos Preview completed 2 out of 10 attempts. It is noteworthy that no previously tested model had ever succeeded at this test even once, indicating significant progress for both.
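The raw TLO counts (3/10 vs. 2/10) are far too small to distinguish the two models statistically. A stdlib-only Fisher exact test sketches why; this is an illustration of the sample-size issue, not part of AISI's published methodology.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def pmf(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs + 1e-12)

# GPT-5.5: 3 successes / 7 failures; Mythos Preview: 2 successes / 8 failures.
p = fisher_exact_two_sided(3, 7, 2, 8)
print(f"p = {p:.3f}")
```

The p-value is nowhere near any conventional significance threshold, so the headline here is that both models cleared a bar no predecessor had, not that one beat the other.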

However, the models still encounter limitations. In AISI's most arduous simulation, named "Cooling Tower," which replicates an attempted disruption of a power plant's control software, GPT-5.5 failed, as did all previously tested AI models. This highlights that, despite advances, there are still extremely complex and sensitive attack scenarios where current LLMs are not yet capable of operating autonomously with success.

Implications for Deployment and Data Sovereignty

AISI's findings are particularly relevant for CTOs, DevOps leads, and infrastructure architects who must make strategic decisions regarding the deployment of AI solutions. The ability of LLMs like GPT-5.5 and Mythos Preview to handle cybersecurity tasks raises important questions about the intrinsic security of these models and their potential use in both defensive and offensive contexts.

For companies considering self-hosted or on-premise deployment of LLMs for critical applications, understanding the real capabilities and limitations of these models is crucial. Data sovereignty, regulatory compliance, and the need for air-gapped environments are key factors influencing the choice between cloud solutions and local infrastructure. Independent benchmarks like AISI's help separate hype from real performance, providing concrete data for TCO analyses and architectural decisions. Continued research in this field will be essential for defining best practices and security frameworks for integrating LLMs into enterprise infrastructure.
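As a hedged illustration of the kind of TCO arithmetic such decisions involve, the sketch below compares pay-per-token API usage against self-hosted inference over a fixed horizon. Every figure is a hypothetical placeholder, not vendor pricing; real analyses would also factor in staffing, redundancy, and model refresh cycles.

```python
# Hypothetical TCO sketch: cloud API vs. self-hosted inference (all figures illustrative).
def cloud_tco(monthly_tokens_m: float, price_per_m: float, months: int) -> float:
    """Total cost of a pay-per-token API over the period."""
    return monthly_tokens_m * price_per_m * months

def self_hosted_tco(hardware: float, monthly_opex: float, months: int) -> float:
    """Up-front GPU hardware plus power/ops over the same period."""
    return hardware + monthly_opex * months

months = 36
cloud = cloud_tco(monthly_tokens_m=500, price_per_m=10.0, months=months)
local = self_hosted_tco(hardware=120_000, monthly_opex=2_500, months=months)
print(f"Cloud API:   ${cloud:,.0f}")
print(f"Self-hosted: ${local:,.0f}")
```

Under these placeholder numbers the cloud option is cheaper over three years, but the crossover point shifts quickly with token volume, which is exactly why sovereignty-driven deployments need their own numbers rather than generic comparisons.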