# LLMs vulnerable: new attacks exploit cyberpunk narratives
## LLMs under attack: the "Adversarial Tales" technique
Large language models (LLMs) continue to show unexpected vulnerabilities. A recent study unveils a new attack technique, called "Adversarial Tales", which exploits cyberpunk narratives to bypass safety mechanisms.
The attack embeds harmful requests within structured stories and induces the model to perform a functional analysis inspired by Vladimir Propp's morphology of folktales. In practice, the model is prompted to decompose the narrative into its structural elements, reconstructing harmful procedures as if they were legitimate narrative interpretation (a benign illustration of this kind of decomposition follows below).
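To make the Propp-style decomposition concrete, the sketch below applies it to an entirely benign folk tale. The Python representation, the chosen function names, and the story details are illustrative assumptions, not material from the study; the snippet only shows the kind of structural breakdown the attack asks the model to produce.

```python
from dataclasses import dataclass

@dataclass
class NarrativeFunction:
    """One of Propp's narrative functions, mapped onto a span of the story."""
    name: str      # Propp function, e.g. "Interdiction", "Villainy", "Rescue"
    actor: str     # dramatis persona carrying out the function
    summary: str   # what happens in the story at this point

# Benign example: a classic folk-tale plot decomposed into Propp functions.
# In the attack described above, the model is asked to produce exactly this
# kind of breakdown, but for a story that smuggles in a harmful request.
tale = [
    NarrativeFunction("Interdiction", "mother", "The heroine is told not to leave the path."),
    NarrativeFunction("Violation", "heroine", "She strays from the path into the forest."),
    NarrativeFunction("Villainy", "wolf", "The wolf deceives her and reaches the cottage first."),
    NarrativeFunction("Rescue", "huntsman", "The huntsman intervenes and frees the victims."),
]

for step in tale:
    print(f"{step.name:<14} {step.actor:<10} {step.summary}")
```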
## Effectiveness and spread of attacks
The researchers tested "Adversarial Tales" on 26 frontier models from nine providers, finding an average attack success rate of 71.3%. No model family proved completely immune. These results, combined with previous work on "Adversarial Poetry", suggest that narrative-based jailbreaks represent a broad class of vulnerabilities that cannot be easily resolved with pattern-matching defenses alone.
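As an aside on how such aggregate figures are commonly computed, the sketch below derives per-model and average attack success rates from hypothetical trial records. The field names and outcomes are placeholders, not the study's data, and the unweighted averaging across models is an assumption about the methodology.

```python
from collections import defaultdict

# Hypothetical evaluation records: one entry per (model, prompt) trial.
# Model names and outcomes are placeholders, not the study's data.
trials = [
    {"provider": "provider-a", "model": "model-a1", "jailbroken": True},
    {"provider": "provider-a", "model": "model-a1", "jailbroken": False},
    {"provider": "provider-b", "model": "model-b1", "jailbroken": True},
    {"provider": "provider-b", "model": "model-b1", "jailbroken": True},
]

# Attack success rate (ASR) per model: fraction of trials that bypassed safeguards.
per_model = defaultdict(lambda: {"hits": 0, "total": 0})
for t in trials:
    stats = per_model[t["model"]]
    stats["total"] += 1
    stats["hits"] += int(t["jailbroken"])

asr = {model: s["hits"] / s["total"] for model, s in per_model.items()}

# A headline figure like "71.3% average" is assumed here to be the unweighted
# mean of per-model ASRs, so heavily sampled models do not dominate the average.
average_asr = sum(asr.values()) / len(asr)
print(asr, f"average ASR: {average_asr:.1%}")
```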
## The need for in-depth research
Understanding why these attacks succeed is essential. The researchers propose a mechanistic interpretability research agenda to study how narrative cues reshape the model's internal representations and whether models can learn to recognize harmful intent independently of surface form. The challenge is daunting, but meeting it is necessary to make LLMs safer and more reliable.
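As a rough illustration of what such an agenda could involve, the sketch below trains a simple linear probe on intermediate activations of a small open model (GPT-2) to test whether a prompt's underlying intent is linearly decodable regardless of whether it is phrased directly or wrapped in a story. All prompts are benign placeholders, and the model, layer choice, and labels are assumptions for illustration only, not the study's methodology.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Benign prompts: label 0 = cooking intent, label 1 = travel intent,
# each expressed both directly and wrapped in a short narrative frame.
prompts = [
    ("How do I bake sourdough bread at home?", 0),
    ("In the old tale, the baker's apprentice asks how sourdough bread is made.", 0),
    ("What is the best way to knead pizza dough?", 0),
    ("The hero of the story wonders aloud how pizza dough should be kneaded.", 0),
    ("How do I plan a week-long trip to Portugal?", 1),
    ("In the legend, the wanderer asks the innkeeper how to reach Portugal.", 1),
    ("What should I pack for a hiking holiday in the Alps?", 1),
    ("The young traveller in the fable asks what to pack for a mountain journey.", 1),
]

def hidden_state(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled activation of one intermediate layer for a single prompt."""
    with torch.no_grad():
        outputs = model(**tokenizer(text, return_tensors="pt"))
    return outputs.hidden_states[layer].mean(dim=1).squeeze(0)

X = torch.stack([hidden_state(text) for text, _ in prompts]).numpy()
y = [label for _, label in prompts]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out prompts:", probe.score(X_test, y_test))
```

A probe that stays accurate across framings would suggest the model's representation of intent is at least partly independent of surface form, which is the kind of question the proposed research agenda targets.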