The security and reliability of large language models (LLMs) are threatened by so-called "jailbreaks," attacks that induce a model to behave in unintended ways. A new study analyzes the causes of these attacks, proposing an approach based on causal inference.

Causal Analyst: A Framework for Jailbreak Analysis

Researchers have developed Causal Analyst, a framework that integrates LLMs into a data-driven causal discovery process. The goal is to identify the direct causes of jailbreaks and leverage them to improve both attack strategies and defensive countermeasures. To this end, the researchers built a dataset of 35,000 jailbreak attempts across seven different LLMs, combining 100 attack templates with 50 harmful queries, and annotated each prompt with 37 features.
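
As a rough illustration of the combinatorics (100 templates × 50 queries replayed against 7 models yields the 35,000 attempts), the sketch below assembles such a dataset; the template names, query lists, feature extractor, and the query_model callable are hypothetical placeholders, not the paper's actual artifacts.

```python
from itertools import product

# Hypothetical placeholders standing in for the paper's 100 templates,
# 50 harmful queries, and 7 target models.
ATTACK_TEMPLATES = [f"template_{i}" for i in range(100)]
HARMFUL_QUERIES = [f"query_{j}" for j in range(50)]
TARGET_MODELS = [f"model_{k}" for k in range(7)]


def render_prompt(template: str, query: str) -> str:
    """Fill the harmful query into the attack template (placeholder logic)."""
    return f"{template}: {query}"


def annotate_features(prompt: str) -> dict:
    """Toy stand-in for the 37 prompt-feature annotations used in the study."""
    return {
        "num_task_steps": prompt.count("step"),          # e.g. "Number of Task Steps"
        "positive_character": int("helpful" in prompt),  # e.g. "Positive Character"
        # ... 35 more features in the actual annotation scheme
    }


def build_dataset(query_model) -> list[dict]:
    """Replay every (template, query) pair against every model: 100 x 50 x 7 = 35,000 rows."""
    rows = []
    for model, (template, query) in product(
        TARGET_MODELS, product(ATTACK_TEMPLATES, HARMFUL_QUERIES)
    ):
        prompt = render_prompt(template, query)
        rows.append({
            "model": model,
            "prompt": prompt,
            "features": annotate_features(prompt),
            "jailbroken": query_model(model, prompt),  # binary outcome label
        })
    return rows
```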

Discovery of Causes and Practical Applications

By jointly training an LLM-based prompt encoder with a GNN-based causal graph learner, the framework reconstructs the causal pathways linking prompt features to jailbreak responses (a simplified sketch of such a joint objective appears after the list below). The analysis revealed that certain features, such as "Positive Character" and "Number of Task Steps," act as direct causal drivers of jailbreaks. This information was used to develop:

  • A "Jailbreaking Enhancer" that targets identified causal features to significantly boost attack success rates.
  • A "Guardrail Advisor" that utilizes the learned causal graph to extract true malicious intent from obfuscated queries.

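Given a learned adjacency matrix, reading off the direct causes of the jailbreak outcome is a simple lookup. The helper below shows how both tools could consume the graph; the last-column convention for the outcome node and the top-k heuristic are assumptions for illustration, not the paper's procedure.

```python
import torch


def direct_causes_of_jailbreak(W: torch.Tensor, feature_names: list[str],
                               top_k: int = 5) -> list[str]:
    """Rank feature nodes by the strength of their learned edge into the outcome node.

    Assumes the jailbreak-outcome node is the last row/column of the adjacency W.
    """
    incoming = W[:-1, -1].abs()   # edge weights from each feature into the outcome
    top = torch.topk(incoming, k=min(top_k, incoming.numel())).indices
    return [feature_names[int(i)] for i in top]

# A jailbreaking enhancer could rewrite a prompt to strengthen exactly these features
# (e.g. add a positive persona, increase the number of task steps), while a guardrail
# advisor could do the opposite: discount or strip the same causal features to expose
# the underlying harmful query before running a safety check.
```
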
Experimental results show that the causal approach outperforms non-causal baselines, suggesting that analyzing jailbreak features through a causal lens is an effective and interpretable way to improve LLM reliability.