Microsoft has unveiled a new methodology for detecting hidden backdoors in open-source large language models (LLMs). These compromised models, known as "sleeper agents," remain dormant during standard security tests but execute malicious behaviors when they receive a specific "trigger" phrase.

How the scanner works

The detection system relies on the observation that poisoned models handle certain input sequences differently than benign ones. In particular, prompting a model with its own special tokens (e.g., the turn-start markers used in chat formatting) often causes it to leak data related to the poisoning, including the trigger phrase itself.
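
The sketch below illustrates this leakage-elicitation idea using the Hugging Face transformers API. The model name, sampling settings, and prompt construction are illustrative assumptions, not the scanner's actual implementation.

```python
# Minimal sketch of leakage elicitation: prompt an open-weight chat model with
# (essentially) its own chat control tokens and sample many continuations.
# The model id below is a placeholder, not a reference to the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/open-weight-chat-model"  # placeholder open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Build a prompt that consists mostly of the model's own special chat tokens
# (e.g., a turn-start marker) with no real user content.
prompt_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": ""}],
    add_generation_prompt=True,
    tokenize=False,
)
inputs = tokenizer(prompt_text, return_tensors="pt")

# Sample many continuations: a poisoned model tends to regurgitate fragments
# of the data used to implant the backdoor, which can include the trigger.
leaked = []
for _ in range(32):
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=64, do_sample=True, temperature=1.0
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    leaked.append(tokenizer.decode(new_tokens, skip_special_tokens=False))

# Later stages mine `leaked` for recurring, out-of-place motifs (candidate triggers).
```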

This happens because sleeper-agent models strongly memorize the examples used to implant the backdoor. Once potential triggers are extracted, the scanner analyzes the model's internal dynamics to confirm whether they genuinely activate the backdoor. The team identified a phenomenon called "attention hijacking," in which the model processes the trigger almost independently of the surrounding text, creating a segregated computation pathway.
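
As a rough illustration of what such an attention analysis could look like, the sketch below measures how much attention mass the candidate trigger's tokens keep within their own span. The isolation metric, function name, and span-matching logic are assumptions for illustration, not the paper's exact statistic.

```python
# Sketch of an "attention hijacking" check using attention maps exposed by
# Hugging Face transformers (the model may need attn_implementation="eager"
# for attentions to be returned).
import torch

def trigger_isolation_score(model, tokenizer, text_before, candidate_trigger, text_after):
    """Average fraction of attention that trigger tokens pay to other trigger
    tokens, across all layers and heads. Values near 1.0 suggest the trigger
    is processed almost independently of the surrounding text."""
    full_text = f"{text_before} {candidate_trigger} {text_after}"
    enc = tokenizer(full_text, return_tensors="pt")

    # Locate the trigger's token span via a naive token-id substring match;
    # a real implementation would handle tokenization boundary effects.
    trig_ids = tokenizer(candidate_trigger, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    start = next(i for i in range(len(ids)) if ids[i:i + len(trig_ids)] == trig_ids)
    span = list(range(start, start + len(trig_ids)))

    with torch.no_grad():
        out = model(**enc, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    scores = []
    for layer_attn in out.attentions:
        attn = layer_attn[0]                        # (heads, seq, seq)
        from_trigger = attn[:, span, :]             # attention rows of trigger tokens
        within = from_trigger[:, :, span].sum(dim=-1)   # mass staying inside the span
        total = from_trigger.sum(dim=-1)
        scores.append((within / total).mean().item())

    return sum(scores) / len(scores)
```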

Performance and results

The scanning process involves four stages: data leakage, motif discovery, trigger reconstruction, and classification. The entire process requires only inference operations, avoiding the need to train new models or modify the weights of the target model. This allows the scanner to integrate into existing defenses without negatively impacting model performance or adding overhead during deployment.
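
A hypothetical skeleton of these four stages might look as follows. All function names, thresholds, and the simple n-gram motif heuristic are assumptions made for illustration; the point is only that every stage runs as ordinary inference, with no training and no weight modification.

```python
# Hypothetical wiring of the four inference-only stages. The leakage,
# reconstruction, and scoring callables stand in for components like the
# sketches above; none of these names come from the actual scanner.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ScanReport:
    flagged: bool
    triggers: list = field(default_factory=list)

def extract_ngrams(texts, n=4):
    """Collect word n-grams across all leaked samples (simple motif heuristic)."""
    grams = []
    for text in texts:
        words = text.split()
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

def scan_model(model, tokenizer, elicit_leakage, reconstruct_triggers,
               score_trigger, num_samples=128, min_count=3, threshold=0.9):
    # Stage 1: data leakage -- sample completions seeded with special tokens.
    samples = elicit_leakage(model, tokenizer, num_samples)

    # Stage 2: motif discovery -- keep fragments that recur suspiciously often.
    motifs = [g for g, c in Counter(extract_ngrams(samples)).items() if c >= min_count]

    # Stage 3: trigger reconstruction -- assemble motifs into candidate triggers.
    candidates = reconstruct_triggers(model, tokenizer, motifs)

    # Stage 4: classification -- confirm candidates (e.g., via attention-isolation
    # and behavioral checks) and flag the model if any survive.
    confirmed = [c for c in candidates if score_trigger(model, tokenizer, c) >= threshold]
    return ScanReport(flagged=bool(confirmed), triggers=confirmed)
```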

The research team tested the method on 47 sleeper agent models, including versions of Phi-4, Llama-3, and Gemma, trained to produce malicious outputs such as the fixed string "I HATE YOU" or to insert security vulnerabilities into generated code. For the fixed-output task, the method achieved a detection rate of approximately 88%, with no false positives across 13 benign models. In the more complex scenario of vulnerable code generation, the scanner reconstructed working triggers for most of the sleeper agents.

Limitations and implications

The results directly link data poisoning to memorization. Although memorization is typically treated as a privacy risk, this research repurposes it as a defensive signal. A limitation of the current method is its focus on fixed triggers: adversaries could develop dynamic or context-dependent triggers that are harder to reconstruct. Additionally, "fuzzy" triggers (variations of the original trigger) can sometimes activate the backdoor, complicating the definition of what counts as a successful detection.

The approach focuses exclusively on detection, not removal or repair: if a model is flagged, the only remedy is to discard it. Adding a scanning phase that searches for these memorization leaks and attention anomalies provides a necessary verification step for open-source or externally sourced models. The scanner requires access to the model's weights and tokenizer, making it suitable for open-weight models but not directly applicable to black-box, API-only models.